This document shows the differences between the 1.1 and 1.2 versions of the Intel® Cilk™ Plus Language Extension Specification, with additions shown underlied in green and deletions shown struck-through in red. Some sections were reordered for clarity. The reorderings do not affect the technical content of the document and are not highlighted in any way.

Intel® Cilk™ Plus Language Extension Specification
Version 1.1 1.2 (2013-09-06)

Document number: 324396-003US

Copyright © 2011, 2013, Intel Corporation. All rights reserved.

Cilk is a trademark of Intel Corporation in the U.S. and/or other countries.

Other names and brands may be claimed as the property of others.

More information about Intel® Cilk™ Plus can be found at cilk.com

Feedback on this specification is encouraged and welcome; please send to cilkfeedback@intel.com

Contents

  1. Introduction
  2. Related documents
  3. Keywords for Tasking
    1. Keyword Aliases
    2. Grammar
    3. Semantics
      1. Tasking Execution Model
      2. Serialization rule
    4. Spawning Task blocks
    5. _Cilk_for Loops
      1. Syntactic constraints
      2. Requirements on types and operators
      3. Dynamic constraints
      4. Grainsize pragma
    6. Spawn
    7. Sync
    8. Exceptions
  4. Hyperobjects
    1. Description
    2. Reducers
    3. Hyperobjects in C++
      1. C++ hyperobject syntax
      2. C++ reducer class template
      3. C++ Monoid class requirements
      4. C++ View class requirements
      5. C++ hyperobject behavior
    4. Hyperobjects in C
      1. C hyperobject syntax
      2. C hyperobject behavior
  5. Array notation
    1. Description
    2. The Section Expression
    3. Operations on Array Sections
      1. Assignment
      2. Arithmetic Operations
      3. Function calls
      4. Reduction Operations
      5. Array Implicit Index
      6. Sections in if statements
    4. Example: Sections as Array Parameters
  6. SIMD loops (#pragma simd)
    1. Description
    2. SIMD loop definition
      1. The vectorlength clause
      2. The linear clause
  7. Elemental SIMD-enabled functions
    1. Description
    2. Semantics
    3. The vector attribute
      1. The processorarchitecture clause
      2. The vectorlength clause
      3. The uniform clause
      4. The linear clause
      5. The mask clauses
  8. Restrictions on elemental SIMD-enabled functions and SIMD loops
  9. Cilk™ Plus Trademark License
  10. Non-Trademark Use of Cilk™ Plus
  11. Disclaimer and other legal information

Introduction

This document is part of the Intel® Cilk™ Plus Language Specification. The language specification comprises one of a set of technical specifications describing the Intel® Cilk™ Plus language and the run-time support for the language. Together, these documents provide the detail needed to implement a compliant compiler. At this time the language specification contains these parts following specifications are available:

This document defines the Intel® Cilk™ Plus extension to C and C++. The language extension is supported by a run-time user-mode work-stealing task scheduler which is not directly exposed to the application programmer. However, some of the semantics of the language and some of the guarantees provided require specific behavior of the task scheduler. The programmer visible parts of the language include the following constructs:

  1. Three keywords (_Cilk_spawn, _Cilk_sync and _Cilk_for) to express tasking
  2. Hyperobjects, which provide local views to shared objects
  3. Array notations
  4. Elemental SIMD-enabled functions
  5. SIMD loops

An implementation of the language may take advantage of all parallelism resources available in the hardware. On a typical CPU, these include at least multiple cores and vector units. Some of the language constructs, e.g. _Cilk_spawn, utilize only core parallelism; some, e.g. SIMD loops, utilize only vector parallelism, and some, e.g. SIMD-enabled functions, utilize both. The defined behavior of every deterministic Cilk program is the same as the behavior of a similar C or C++ program known as the “serialization.” While execution of a C or C++ program may be considered as a linear sequence of statements, execution of a tasking program is in general a directed acyclic graph. Parallel control flow may yield a new kind of undefined behavior, a “data race,” whereby parts of the program that may execute in parallel access the same memory location in an indeterminate order, with at least one of the accesses being a write access. In addition, throwing if an exception may result in is thrown, code being may still be executed that would not have been executed in a serial execution.

The word “shall” is used in this specification to express a diagnosable constraint on a Cilk Plus program.

Related documents

  1. The Intel® Cilk™ Plus Application Binary Interface
  2. Intel Corporation Vector Function Application Binary Interface
  3. ISO/IEC 9899:2011, Information Technology – Programming languages – C
  4. ISO/IEC 14882:2011, Information Technology – Programming languages – C++
  5. OpenMP Application Program Interface, Version 4.0 - July 2013

Keywords for Tasking

Cilk Plus adds the following new keywords:

A program that uses these keywords other than as defined in the grammar extension below is ill-formed.

Keyword Aliases

The header <cilk/cilk.h> defines the following aliases for the Cilk keywords:

#define cilk_spawn _Cilk_spawn
#define cilk_sync  _Cilk_sync
#define cilk_for   _Cilk_for

Grammar

The three keywords are used in the following new productions:

jump-statement:
_Cilk_sync ;

The call production of the grammar is modified to permit the keyword _Cilk_spawn before the expression denoting the function to be called:

postfix-expression:
_Cilk_spawnopt postfix-expression ( expression-listopt )

Consecutive _Cilk_spawn tokens are not permitted. The postfix-expression following _Cilk_spawn is called a spawned function. The spawned function may be a normal function call, a member-function call, or the function-call (parentheses) operator of a function object (functor) or a call to a lambda expression. Overloaded operators other than the parentheses operator may be spawned only by using the function-call notation (e.g., operator+(arg1,arg2)). There shall be no more than one _Cilk_spawn within a full expression. A function that contains a spawn statement is called a spawning function.

Note: The spawned function call may be a normal function call, a member-function call, the function-call (parentheses) operator of a function object (functor), or a call to a lambda expression.

A program is considered ill formed if the _Cilk_spawn form of this expression appears other than in one of the following contexts:

(A _Cilk_spawn expression may be permitted in more contexts in the future.) The rank of a spawned function call shall be zero. (See The section expression.)

A statement with a _Cilk_spawn on the right hand side of an assignment or declaration is called an assignment spawn or initializer spawn, respectively and the object assigned or initialized by the spawn is called the receiver.

The iteration-statement is extended by adding another form of for loop:

grainsize-pragma:
# pragma cilk grainsize = expression new-line
iteration-statement:
grainsize-pragmaopt _Cilk_for ( expression ; expression ; expression ) statement
grainsize-pragmaopt _Cilk_for ( declaration expression ; expression ) statement

The three items inside parentheses in the grammar, separated by semicolons, are called the initialization, condition, and increment, respectively. (A semicolon is included in the grammar of declaration.)

Semantics

Tasking Execution Model

A strand is a serially-executed sequence of instructions that does not contain a spawn point or sync point (as defined below). At a spawn point, one strand (the initial strand) ends and two strands (the new strands) begin. The initial strand runs in series with is sequenced before each of the new strands but the new strands are unsequenced with respect to one another (i.e. they may run in parallel with each other). At a sync point, one or more strands (the initial strands) end and one strand (the new strand) begins. The initial strands may run in parallel with one another are unsequenced with respect to one another but each of the initial strands runs in series with is sequenced before the new strand. A single strand can be subdivided into a sequence of shorter strands in any manner that is convenient for modeling the computation. A maximal strand is one that cannot be included in a longer strand.

The strands in an execution of a program form a directed acyclic graph (DAG) in which spawn points and sync points comprise the vertices and the strands comprise the directed edges, with time defining the direction of each edge. (In an alternative DAG representation, sometimes seen in the literature, the strands comprise the vertices and the dependencies between the strands comprise the edges.)

Serialization rule

The behavior of a deterministic Intel® Cilk™ Plus program is defined in terms of its serialization, as defined in this section. If the serialization has undefined behavior, the Intel® Cilk™ Plus program also has undefined behavior.

The strands in an execution of a program are ordered according to the order of execution of the equivalent code in the program's serialization. Given two strands, the earlier strand is defined as the strand that would execute first in the serial execution of the same program with the same inputs, even though the two strands may execute in either order or concurrently in the actual parallel execution. Similarly, the terms “earliest,” “latest,” and “later” are used to designate strands according to their serial ordering. The terms “left,” “leftmost,” “right,” and “rightmost” are equivalent to “earlier,” “earliest,” “later,” and “latest,” respectively.

The serialization of a pure C or C++ program is itself.

If a C or C++ program has defined behavior and does not use the tasking keywords or library functions, it is a Cilk Plus program with the same defined behavior.

The serializations of _Cilk_spawn and _Cilk_sync are empty.

If a Cilk Plus program has defined deterministic behavior, then that behavior is the same as the behavior of the C or C++ program derived from the original by removing all instances of the keywords _Cilk_spawn, and _Cilk_sync.

The serialization of _Cilk_for is for.

If a Cilk Plus program has defined deterministic behavior, then that behavior is the same as the behavior of the C or C++ program derived from the original by replacing each instance of the _Cilk_for keyword with for.

Spawning Task blocks

A spawning task block is a region of the program subject to special rules. Task blocks may be nested. The body of a nested task block is not part of the outer task block. Task blocks never partially overlap. The body of a spawning function is a task block. A _Cilk_for statement is a task block and the body of the _Cilk_for loop is a (nested) task block.

Every spawning task block includes an implicit _Cilk_sync executed on exit from the block, including abnormal exit due to an exception. Destructors for automatic objects with scope ending at the end of the task block are invoked before the implicit _Cilk_sync. The receiver is assigned or initialized to the return value before executing the implicit _Cilk_sync at the end of a function. An implicit or explicit _Cilk_sync within a nested task block will synchronize with _Cilk_spawn statements only within that task block, and not with _Cilk_spawn statements in the surrounding task block.

The scope of a label defined in a spawning block is limited to that spawning block.

Programmer note: Therefore, goto may not be used to enter or exit a spawning block.

_Cilk_for Loops

The constraints and semantics of a _Cilk_for loop are the same as those of its serialization, unless specified otherwise.

Each iteration of a _Cilk_for loop is a separate strand; they need not be executed serially.

Within each iteration of the loop body, the control variable is considered a unique variable whose address is no longer valid when the iteration completes. the name of the control variable refers to a local object, as if the name were declared as an object within the body of the loop, with automatic storage duration and with the type of the original object. If the control variable is declared before the loop initialization, then the address of the variable at the end of the loop is the same as the address of the variable before the loop initialization and the final value of the control variable is the same as for the serialization of the program.

Syntactic constraints

To simplify the grammar, some restrictions on _Cilk_for loops are stated here in text form. The three items inside parentheses in the grammar, separated by semicolons, are the initialization, condition, and increment. Where a constraint on an expression is expressed grammatically, parentheses around a required expression or sub-expression are allowed.

A program that contains a return, break, or goto statement that would transfer control into or out of a _Cilk_for loop is ill-formed.

The initialization shall declare or initialize a single variable, called the control variable. In C only, the control variable may be previously declared, but if so shall be reinitialized, i.e., assigned, in the initialization clause. In C++, the control variable shall be declared and initialized within the initialization clause of the _Cilk_for loop. The variable shall have automatic storage duration. No storage class may be specified for the variable within the initialization clause. The variable shall have integral, pointer, or class type. The variable may not be const or volatile. The variable shall be initialized. Initialization may be explicit, using assignment or constructor syntax, or implicit via a nontrivial default constructor. Within each iteration of the loop body, the control variable is considered a unique variable whose address is no longer valid when the iteration completes. If the control variable is declared before the loop initialization, then the address of the variable at the end of the loop is the same as the address of the variable before the loop initialization and the value of the control variable is the same as for the serialization of the program.

The condition shall have one of the following two forms:

var OP shift-expression
shift-expression OP var

where var is the control variable, optionally enclosed in parentheses. The operator denoted OP shall be one of !=, <=, <, >=, or >. The shift-expression that is not the control variable is called the loop limit.

The condition shall have one of the following forms:

expression < expression
expression > expression
expression <= expression
expression >= expression
expression != expression

Exactly one of the operands of the comparison operator shall be just the name of the loop's control variable. The operand that is not the control variable is called the limit expression. Any implicit conversion applied to that operand is not considered part of the limit expression.

The loop increment shall have one of the following forms: where var is the loop control variable, optionally enclosed in parentheses, and incr is a conditional-expression with integral or enum type. The table indicates the stride corresponding to the syntactic form.

Syntax Stride
++var +1
var++ +1
--var -1
var-- -1
var += incr incr
var -= incr -(incr)

The notion of stride exists for exposition only and does not need to be computed. In particular, for the case of var -= incr, a program may be well formed even if incr is unsigned.

++ identifier
identifier ++
-- identifier
identifier --
identifier += expression
identifier -= expression

The variable modified by the increment shall be the control variable.

A program that contains a return, break, goto or switch statement that would transfer control into or out of a _Cilk_for loop is ill-formed.

Requirements on types and operators

The type of var shall be copy constructible. (For the purpose of specification, all C types are considered copy constructible.) The control variable shall have unqualified integral, pointer, or copy-constructible class type.

The initialization, condition, and increment parts of a _Cilk_for shall be defined such that the total number of iterations (loop count) can be determined before beginning the loop execution. Specifically, the parts of the _Cilk_for loop shall meet all of the semantic requirements of the corresponding serial for statement. In addition, depending on the syntactic form of the condition, a _Cilk_for adds the following requirements on the types of var the control variable, limit the limit expression, and stride the stride. (and by extension incr), and

The loop count is computed as follows, evaluated in infinite integer precision when the control variable and limit both have integral or pointer type. ( In the following table, first is the value of var immediately after initialization, var” stands for an expression with the type and value of the control variable, “limit” stands for an expression with the type and value of the limit expression, and “stride” stands for an expression with the type and value of the stride expression. The loop count is computed after the loop initialization is performed, and before the control variable is modified by the loop. The loop count expression shall be well-formed, and shall have integral type. When a stride expression is present, if the divisor of the division is not greater than zero, the behavior is undefined. )

Condition syntax Requirements Loop count
var < limit
limit > var
(limit) - (first) shall be well-formed and shall yield an integral difference_type;
stride shall be > 0
(( limit ) - ( first )) / stride
var > limit
limit < var
(first) - (limit) shall be well-formed and shall yield an integral difference_type;
stride shall be < 0
(( first ) - ( limit )) / -stride
var <= limit
limit >= var
(limit) - (first) shall be well-formed and shall yield an integral difference_type;
stride shall be > 0
(( limit ) - ( first ) + 1) / stride
var >= limit
limit <= var
(first) - (limit) shall be well-formed and shall yield an integral difference_type;
stride shall be < 0
(( first ) - ( limit ) + 1) / -stride
var != limit
limit != var
(limit) - (first) and (first) - (limit) shall be well-formed and yield the same integral difference_type;
stride shall be != 0
if stride is positive
then ((limit) - (first)) / stride
else ((first) - (limit)) / -stride
Loop count expression and value
Form of condition Form of increment
var++
++var
var--
--var
var += stride var -= stride
var < limit
limit > var
((limit)-(var))
n/a
((limit)-(var)-1)/(stride)+1
((limit)-(var)-1)/-(stride)+1
var > limit
limit < var
n/a
((var)-(limit))
((var)-(limit)-1)/-(stride)+1
((var)-(limit)-1)/(stride)+1
var <= limit
limit >= var
((limit)-(var))+1
n/a
((limit)-(var))/(stride)+1
((limit)-(var))/-(stride)+1
var >= limit
limit <= var
n/a
((var)-(limit))+1
((var)-(limit))/-(stride)+1
((var)-(limit))/(stride)+1
var != limit
limit != var
((limit)-(var))
((var)-(limit))
((stride)<0) ?
((var)-(limit)-1)/-(stride)+1 :
((limit)-(var)-1)/(stride)+1
((stride)<0) ?
((limit)-(var)-1)/-(stride)+1 :
((var)-(limit)-1)/(stride)+1

The incr expression shall have integral or enumeration type. The type of the difference between the limit expression and the control variable is the subtraction type, which shall be integral. When the condition operation is !=, (limit)-(var) and (var)-(limit) shall have the same type. The stride shall be convertible to the subtraction type.

For some expression X with the same type as the subtraction type, if the loop increment uses operator ++ or +=, the expression:

var += (difference_type)(incr) X

shall be well-formed; if the loop increment it uses operator -- or -=, the expression

var -= (difference_type)(incr) X

shall be well-formed. The loop is a use an odr-use of the required operator += or -= function.

Dynamic constraints

If the stride does not meet the requirements in the table above, the behavior is undefined. If this condition can be determined statically, the compiler is encouraged (but not required) to issue a warning. (Note that the incorrect loop might occur in an unexecuted branch, e.g., of a function template, and thus should not cause a compilation failure in all cases.)

If the control variable is modified other than as a side effect of evaluating the loop increment expression, the behavior of the program is undefined.

If X and Y are values of var the control variable that occur in consecutive evaluations of the loop condition in the serialization, then the behavior is undefined if

((limit) - X) - ((limit) - Y)

evaluated in infinite integer precision, shall does not equal the stride. If the condition expression is true on entry to the loop, then the behavior is undefined if the computed loop count shall be non-negative is not greater than zero. If the computed loop count is not representable as a value of type unsigned long long, the behavior is undefined.

Programmer note: Unsigned wraparound is not allowed.

If the body of the loop is executed, the increment and limit expressions may be evaluated fewer a different number of times than in the serialization. If different evaluations of the same expression yield different values, the behavior of the program is undefined.

The copy constructor for the control variable may be executed more times than in the serialization.

If evaluation of the increment or limit expression, or a required operator+= or operator-= throws an exception, the behavior of the program is undefined.

If the loop body throws an exception that is not caught within the same iteration of the loop, it is unspecified which other loop iterations execute, but no other iteration is terminated early. If multiple loop iterations throw exceptions that are not caught in the loop body, the _Cilk_for statement throws the exception that would have occurred first in the serialization of the program.

Grainsize pragma

A _Cilk_for iteration-statement may optionally be preceded by a grainsize-pragma. The grainsize pragma shall immediately precede a _Cilk_for loop and may not appear anywhere else in a program, except that other pragmas that appertain to the _Cilk_for loop may appear between the grainsize-pragma and the _Cilk_for loop. The expression in the grainsize pragma shall evaluate to a type convertible to long.

The presence of the pragma provides a hint to the runtime specifying the number of serial iterations desired in each chunk of the parallel loop. The grainsize expression is evaluated at runtime. The grainsize expression need not be evaluated. If it is evaluated, that evaluation is sequenced after the execution of the statement preceding the loop, is sequenced before any execution of the loop body, and is unsequenced with respect to the loop initialization and the evaluation of the limit and stride expressions. If there is no grainsize pragma, or if the grainsize evaluates to 0, then the runtime will pick a grainsize using its own internal heuristics. If the grainsize evaluates to a negative value, the behavior is unspecified. (The meaning of negative grainsizes is reserved for future extensions.) The grainsize pragma applies only to the _Cilk_for statement that immediately follows it – the grain sizes for other _Cilk_for statements are not affected.

Spawn

The _Cilk_spawn keyword suggests to the implementation that an executed statement or part of a statement may be run in parallel with following statements. A consequence of this parallelism is that the program may exhibit undefined behavior not present in the serialization. Execution of a _Cilk_spawn keyword is called a spawn. Execution of a _Cilk_sync statement is called a sync. A statement An expression statement or declaration statement that contains a spawn is called a spawning statement. In a declaration containing a _Cilk_spawn keyword, the initialization of each object declared is treated as a separate statement.

The following sync of a _Cilk_spawn refers to the next _Cilk_sync executed (dynamically, not lexically) in the same task block. Which spawn the sync follows is implied from context. The following sync may be the implicit _Cilk_sync at the end of a task block.

A spawn point is a C sequence point at which a control flow fork is considered to have taken place. Any operations within the spawning expression that are not required by the C/C++ standards to be sequenced after the spawn point shall be executed are sequenced before the spawn point. The strand that begins at the statement immediately following the spawning statement (in execution order) is called the continuation of the spawn. The sequence of operations within the spawning statement that are sequenced after the spawn point comprise the child of the spawn. The scheduler may execute the child and the continuation in parallel. Informally, the parent is the task block containing the initial strand, the spawning statements, and their continuations but excluding the children of all of the spawns. The children of the spawns within a single task block are siblings of one another.

The spawn points associated with different spawning statements are as follows:

For example, in the following two statements:

x[g()] = _Cilk_spawn f(a + b);
a++;

The call to function f is the spawn point and the statement a++; is the continuation. The expression a + b and the initialization of the temporary variable holding that value, and the evaluation of x[g()] take place before the spawn point. The execution of f, the assignment to x[g()], and the destruction of the temporary variable holding a + b take place in the child.

If a statement is followed by an implicit sync, that sync is the spawn continuation.

Programmer note: The sequencing may be more clear if

x[g()] = _Cilk_spawn f(a + b);

is considered to mean

{
	// Evaluate arguments and receiver address before spawn point
	T tmp = a + b; // T is the type of a + b
	U &r = x[g()]; // U is the type of x[0]
	_Cilk_spawn { r = f(tmp); tmp.~T(); }
}

A setjmp/longjmp call pair within the same task block has undefined behavior if a spawn or sync is executed between the setjmp and the longjmp. A setjmp/longjmp call pair that crosses a task block boundary has undefined behavior. A goto statement is not permitted to enter or exit a task block.

Sync

A sync statement indicates that all children of the current task block must finish executing before execution may continue within the task block. The new strand coming out of the _Cilk_sync is not running in parallel with any child strands, but may still be running in parallel with parent and sibling strands (other children of the calling function).

There is an implicit sync at the end of every task block. If a spawning statement appears within a try block, a sync is implicitly executed at the end of on exit from that try block, as if the body of the try were a task block. If a task block has no children at the time of a sync, then the sync has no observable effect. (The compiler may elide an explicit or implicit sync if it can statically determine that the sync will have no observable effect.)

Programmer note: Because implicit syncs follow destructors, writing _Cilk_sync at the end of a function may produce a different effect than the implicit sync. In particular, if an assignment spawn or initializer spawn is used to modify a local variable, the function will generally need an explicit _Cilk_sync to avoid a race between assignment to the local variable by the spawned function and destruction of the local variable by the parent function.

Exceptions

There is an implicit _Cilk_sync before a throw, after the exception object has been constructed. try-block.

If a spawned function terminates with an exception, the exception propagates from the point of the corresponding sync.

When several exceptions are pending and not yet caught, later exception objects (in the serial execution order of the program) are destructed in an unspecified order before the earliest exception is caught.

Hyperobjects

Description

Cilk Plus defines a category of objects called “hyperobjects”. Hyperobjects allow thread-safe access to shared objects by giving each parallel strand running in parallel a separate instance of the object.

Parallel code uses a hyperobject by performing a hyperobject lookup operation. The hyperobject lookup returns a reference to an object, called a view, that is guaranteed not to be shared with any other active strands in the program. The sequencing of a hyperobject lookup within an expression is not specified. The runtime system creates a view when needed, using callback functions provided by the hyperobject type. When strands synchronize, the hyperobject views are merged into a single view, using another callback function provided by the hyperobject type.

The view of a hyperobject visible to a program may change at any spawn or sync (including the implicit spawns and syncs within a _Cilk_for loop). The identity (address) of the view does not change within a single strand. The view of a given hyperobject visible within a given strand is said to be associated with that view. A hyperobject has the same view before the first spawn within a task block as after a sync within the same task block, even though the thread ID may not be the same (i.e., hyperobject views are not tied to threads). A hyperobject has the same view upon entering and leaving a _Cilk_for loop and within the first iteration (at least) of the _Cilk_for loop. A special view is associated with a hyperobject when the hyperobject is initially created. This special view is called the leftmost view or earliest view because it is always visible to the leftmost (earliest) descendent in the depth-first, left-to-right traversal of the program's spawn tree. The leftmost view is given an initial value when the hyperobject is created.

Programmer note: If two expressions compute the same address for a view, then they have not been scheduled in parallel. This property yields one of the simplest ways by which a program can observe the runtime behavior of the scheduler.

Implementation note: An implementation can optimize hyperobject lookups by performing them only when a view has (or might have) changed. This optimization can be facilitated by attaching implementation-specific attributes to the hyperobject creation, lookup, and/or destruction operations.

Reducers

The vast majority of hyperobjects belong to a category known as “reducers.” Each reducer type provides a reduce callback operation that merges two views in a manner specific to the reducer. For a pair of views V1 and V2, the result of calling reduce(V1, V2) is notated as V1⊗V2. Each reducer also provides an identity callback operation that initializes a new view.

The reduce callback for a “classical” reducer implements an operation ⊗ such that (a⊗b)⊗c==a⊗(b⊗c) (i.e., ⊗ is associative). The view-initialization callback for such a reducer sets the view to an identity value I such that I⊗v==v and v⊗I==v for any value v of value_type. Given an associative ⊗ and an identity I, the triplet (value_type, ⊗, I) describes a mathematical monoid. For example, (int, +, 0) is a monoid, as is (list, concatenate, empty). If each individual view, R, of a classical reducer is modified using only expressions that are equivalent to RRv (where v is of value_type), then the reducer computes the same value in the parallel program as would be computed in the serialization of the program. (In actuality, the “⊗” in the expression “RRv” can represent a set of mutually-associative operations. For example, += and -= are mutually associative.) For example, a spawned function or _Cilk_for body can append items onto the view of a list reducer with monoid (list, concatenate, empty). At the end of the parallel section of code, the reducer's view contains the same list items in the same order as would be generated in a serial execution of the same code.

Given a set of strands entering a sync, S1,S2,S3,…Sn, associated with views V1,V2,V3,…Vn, respectively such that Si is earlier in the serial ordering than Si+1, a single view, W, emerges from the sync with value W←V1⊗V2⊗V3⊗…⊗Vn, such that the left-to-right order is maintained but the grouping (associativity) of the operations is unspecified. The timing of this “reduction” is unspecified – in particular, subsequences typically will be computed asynchronously as child tasks complete. Every view except the one emerging from the sync is destroyed after the merge. If any of the strands does not have an associated view, then the invocation of the reduce callback function can be elided (i.e., the missing view is treated as an identity).

A strand is never associated with more than one view for a given reducer, but multiple strands can be associated with the same view if those strands are not scheduled in parallel (at run time). Specifically, for a given reducer, the association of a strand to a view of the reducer obeys the following rules:

  1. The strand that initializes the reducer is associated with the leftmost view.
  2. If two strands execute in series (i.e., both strands are part of a larger strand), then both are associated with the same view.
  3. The child strand of a spawn is associated with the same view as the strand that entered the spawn.
  4. If the continuation strand of a spawn is scheduled in parallel with the child, then the continuation strand is associated with a new view, initialized using identity. The implementation may create the new view at any time up until the first hyperobject lookup following the spawn. If the continuation strand does not perform a hyperobject lookup, then the implementation is not required to create a view for that strand.
  5. If the continuation strand of a spawn is not scheduled in parallel with the child strand (i.e., the child and the continuation execute in series), then the continuation strand is associated with the same view as the child strand.
  6. The strand that emerges from a sync is associated with the same view as the leftmost strand entering the sync.

Even before the final reduction, the leftmost view of a reducer will contain the same value as in the serial execution. Other views, however, will contain partial values that are different from the serial execution.

If ⊗ is not associative or if identity does not yield a true identity value then the result of a set of reductions will be non-deterministic (i.e., it will vary based on runtime scheduling). Such “non-classical” reducers are nevertheless occasionally useful. Note that, for a classical reducer, the ⊗ operator needs to be associative, but does not need to be commutative.

Hyperobjects in C++

C++ hyperobject syntax

Note: The syntax described here is the syntax used in the Intel products. Intel is considering a different syntax for future, either in addition to or instead of the syntax described below.

At present, reducers and holders are the only kind of hyperobject supported. In C++, every reducer hyperobject has a hyperobject type, which type is an instantiation of the cilk::reducer class template, which is defined in the header <cilk/reducer.h>. The cilk::reducer class template has a single template type parameter, Monoid, which shall be a class type. (See C++ Monoid class requirements, below.)

For a given monoid, M, the type cilk::reducer<M> defines a hyperobject type. The cilk::reducer class template provides constructors, a destructor, and (const and non-const versions of) value_type& operator() operator*() and value_type& view(), both of which return a an lvalue reference to the current view, and operator->(), which returns the address of the current view.

A hyperobject reducer is created by defining an instance of cilk::reducer<M>:

cilk::reducer<M> hv(args);

Where args is a list of M::valueview_type constructor arguments used to initialize the leftmost view of hv. A hyperobject lookup is performed by invoking the member function, view() or member operator*() or operator->() on the hyperobject, as in the following examples:

hv.view().append(elem);
(*hv).append(elem);
hv->append(elem); hv().append(elem);

In these examples, append is an operation to be applied to the current view of hv, and is presumably consistent with the associative operation defined in the monoid, M.

Modifying a hyperobject view in a way that is not consistent with the associative operation in the monoid can lead to subtle bugs. For example, addition is not associative with multiplication, so performing a multiplication on the view of a summing reducer will almost certainly produce incorrect results. To prevent this kind of error, it is common to wrap reducers in proxy classes that expose possible for the monoid to define a separate view_type class that wraps the value_type and exposes only the valid associative operations. (See Monoid and View class requirements, below.) All of the reducers included in the standard reducer library have such wrappers.

C++ reducer class template

Where the below table indicates that the signature of a function includes the form Args&&..., in an implementation that supports C++ variadic templates, the function shall be defined as a variadic function template. In an implementation that does not support variadic templates, the function shall be defined as a set of templates taking from 0 to N arguments of type const Arg &, where N is at least 4.

Member Purpose
typename Monoid
Template parameter
typedef
typename Monoid::value_type
	value_type;
Typedef for the type of the data being reduced.
typedef
typename Monoid::view_type
	view_type;
Typedef for the type actually returned by a hyperobject lookup. view_type can be the same as value_type (see below).
template<typename... Args>
reducer(const Args&&... args);
Default-initialize the monoid and construct the leftmost view using constructor arguments, args.
template<typename... Args>
reducer(const Monoid& m,
	const Args&&... args);
Initialize the monoid from m and construct the leftmost view using constructor arguments, args. This constructor is useful only for the rare monoid type that contains state. The monoid state is shared by all views of the reducer.
Monoid& monoid();
Monoid const& monoid() const;
Return the monoid instance for this reducer. The same monoid instance is returned for a given reducer regardless of which strand invoked this accessor. This accessor is useful only for the rare monoid type that contains state.
view_type& view();
view_type& view() const;
Return an lvalue reference to the current view (i.e., the view associated with the currently-executing strand).
void move_in(value_type& obj);
Replace the value in the current view with obj. The value of obj after this operation is unspecified. Note that using this operation in parallel with other operations on the same reducer will cause the final reducer value to be indeterminate.
void move_out(value_type& obj);
Replace the value of obj with the value of the current view. The value of the view after this operation is unspecified. Note that using this operation in parallel with other operations on the same reducer will place an indeterminate value in obj and cause the final reducer value to be indeterminate.
void set_value(const value_type& obj);
Replace the value in the current view with obj. Note that using this operation in parallel with other operations on the same reducer will cause the final reducer value to be indeterminate.
type get_value() const;
Return the value of the current view. Note that using this operation in parallel with other operations on the same reducer will return an indeterminate value. The return type is const value_type& if view_type is identical to value_type; otherwise the return value is the same as that returned by view_type::view_get_value().

C++ Monoid class requirements

To define a reducer, a program defines a monoid class with public members representing the monoid, (T, ⊗, identity) as follows:

Member name/signature Purpose
value_type
typedef for T, the type of the data being reduced
view_type
typedef for the type actually returned by a hyperobject lookup. view_type can be the same as value_type (see below).
reduce(value_type* left,
	value_type* right)
evaluate “*left = *left*right
identity(value_type* p)
construct identity object at *p
destroy(value_type* p)
call the destructor on the object *p
allocate(size_t size)
return a pointer to size bytes of raw memory; return type shall be void*
deallocate(value_type void* p)
deallocate the raw memory at *p, where p is a value returned by a previous call to allocate

If any of the above functions do not modify the state of the monoid (most monoids carry no state), then those functions may be declared static or const. The monoid type may derive from an instantiation of cilk::monoid_base<T,V>, which defines value_type and view_type as aliases for T and V, respectively (where V defaults to T), and provides default implementations for identity, destroy, allocate, and deallocate. The derived class needs to define reduce and override only those functions for which the default is incorrect.

C++ View class requirements

By default, view_type is the same as value_type. Commonly, however, it is a wrapper around value_type that presents a more limited interface in order to achieve a measure of static safety. For example, for a summing reducer, view_type might support += and ++ but not operations like *= that are inconsistent with a summing reduction. Other times, view_type holds a more complex type that allows for more efficient reduction operations.

When view_type is identical to value_type the reducer imposes no further requirements on it beyond those already required by the identity and reduce operations in the monoid.

When view_type differs from value_type, then view_type must provide the following member functions:

Signature Purpose
view_move_in(value_type& v)
Clear the existing contents of the view and replace it with the value v. After calling this function, the new value of v is unspecified (but valid).
view_move_out(value_type& v)
Move the value of the view into v. After calling this function, the new value of the view is unspecified.
view_set_value(const value_type& v)
Set the value of the view to v.
view_get_value() const
Return the value of the view, either as an rvalue or as a const lvalue.

C++ hyperobject behavior

An object of type M::value_type is constructed by the reducer constructor. This object is called the initial view or leftmost view of the hyperobject. When a hyperobject goes out of scope, the destructor is called on the leftmost view. It is unspecified whether M::allocate and M::deallocate are called to allocate and deallocate the leftmost view (they are not called in the current Intel implementation).

The implementation may create a view at any spawn that has been scheduled in parallel, or may lazily defer creation until the first access within a strand. The implementation creates a view by calling M::allocate followed by M::identity. (This is in addition to the initial view created by construction of the hyperobject.) The calls to M::allocate and M::identity are part of the strand for the purpose of establishing the absence of a data race.

At any sync or at the end of any spawned (child) function, the runtime may merge two views by calling M::reduce(left, right), where right is the earliest remaining view that is later than left. The M::reduce function is expected to store the merged result in the left view. After the merge, the runtime destroys the right view by calling M::destroy followed by M::deallocate. Every view except the leftmost view is passed exactly once as the second argument to reduce. The calls to M::reduce, M::destroy and M::deallocate happen after completion of both of the strands that formerly owned the left and right views.

If a monoid member function executes a hyperobject lookup (directly or through a function call), the behavior of the program is undefined.

For purposes of establishing the absence of a data race, a hyperobject view is considered a distinct object in each parallel strand. A hyperobject lookup is considered a read of the hyperobject.

Hyperobjects in C

C hyperobject syntax

Note: The syntax described here is the syntax used in the Intel products. Intel is considering a different syntax for future, either in addition to or instead of the syntax described below.

The C mechanism for defining and using hyperobjects depends on a small number of typedefs and preprocessor macros provided in the Cilk library header <cilk/reducer.h>. C does not have the template capabilities of C++ and thus has a less abstract hyperobject syntax. Unlike C++, each C hyperobject variable is unique – there is no named type that unites similar hyperobjects. There is, however, an implicit “hyperobject type” defined by the operations that comprise the hyperobjects' monoid. The provided macros facilitate creating reducer variables, which are the only type of hyperobject currently supported. The terms “reducer” and “hyperobject” are used interchangeably in this section.

To define a C reducer, the program defines three functions representing operations on a monoid (T, ⊗, identity):

void T_reduce(void* r, void* left, void* right);
void T_identity(void* r, void* view);
void T_destroy(void* r, void* view);

The names of these functions are for illustration purposes only and must be chosen, as usual, to avoid conflicts with other identifiers. The purposes of these functions are as follows:

Function tag Purpose
T_reduce Evaluate “*(T*)left = *(T*) left*(T*) right
T_identity Initialize a T value to identity
T_destroy Clean up (destroy) a T value

The r argument to each of these functions is a pointer to the actual reducer variable and is usually ignored. Since most C types do not require cleanup on destruction, the T_destroy function often does nothing. As a convenience, the Cilk library makes this common implementation available as a library function, __cilkrts_hyperobject_noop_destroy.

A reducer, hv, is defined and given an initial value, init, using the CILK_C_DECLARE_REDUCER and CILK_C_INIT_REDUCER macros as follows:

CILK_C_DECLARE_REDUCER(T) hv =
	CILK_C_INIT_REDUCER(T_identity, T_reduce, T_destroy,
		init);

The init expression is used to initialize the leftmost reducer view. The CILK_C_DECLARE_REDUCER macro defines a struct and can be used in a typedef or extern declaration as well:

extern CILK_C_DECLARE_REDUCER(T) hv;

The CILK_C_INIT_REDUCER macro expands to a static initializer for a hyperobject of any type. After initialization, the leftmost view of the reducer is available as hv.value.

If The behavior is undefined if a reducer is local to a function, it shall be with automatic storage duration is not registered before first use using the CILK_C_REGISTER_REDUCER macro and unregistered after its last use using the CILK_C_UNREGISTER_REDUCER macro:

CILK_C_REGISTER_REDUCER(hv);
/* use hv here */
CILK_C_UNREGISTER_REDUCER(hv);

For the purpose of registration and unregistration, first use and last use are defined with respect to the serialization of the program. If the reducer view immediately before unregistration shall be is not the same (does not have the same address) as the reducer view immediately after registration, the behavior is undefined. In practice, this means that any spawns after the registration have been synced before the unregistration and that no spawns before the registration have been synced before the unregistration. Registration and unregistration are optional for reducers declared in global scope. The value member of the reducer continues to be available after unregistration, but a hyperobject lookup on an unregistered reducer results in undefined behavior unless the reducer is registered again.

A hyperobject lookup is performed using the REDUCER_VIEW macro:

REDUCER_VIEW(hv) += expr;

As in the case of a C++ reducer, modifying a reducer other than through the correct associative operations can cause bugs. Unfortunately, C does not have sufficient abstraction mechanisms to prevent this kind of error. Nevertheless, the Cilk library provides wrapper macros to simplify the declaration and initialization, though not the safety, of library-provided reducers in C. For example, you can define and initialize a summing reducer this way:

CILK_C_DECLARE_REDUCER(long) hv =
	REDUCER_OPADD_INIT(long, 0);

A C reducer can be declared, defined, and accessed within C++ code, but a C++ reducer cannot be used within C code.

C hyperobject behavior

The macro CILK_C_DECLARE_REDUCER(T) defines a struct with a data member of type T, named value. The macro CILK_C_INIT_REDUCER(T,I,R,D,V) expands to a braced-init-list appropriate for initializing a variable, hv, of structure type declared with CILK_C_DECLARE_REDUCER(T) such that hv, can be recognized by the runtime system as a C reducer with value type T, identity function I, reduction function R, destroy function D, and initial value V.

Invoking CILK_C_REGISTER_REDUCER(hv) makes a call into the runtime system that registers hv.value as the initial, or leftmost, view of the C hyperobject hv. The macro CILK_C_UNREGISTER_REDUCER(hv) makes a call into the runtime system that removes hyperobject hv from the runtime system's internal map. Attempting to access hv after it has been unregistered will result in undefined behavior. If a hyperobject is never registered, the leftmost view will be associated with the program strand before the very first spawn in the program and will follow the leftmost branch of the execution DAG. This association is typically useful only for hyperobjects in global scope.

The implementation may create a view at any spawn the start of any strand that has been scheduled in parallel, or may lazily defer creation until the first access within a strand. The implementation creates a view by allocating it with malloc, then calling the identity function specified in the reducer initialization. (This is in addition to the initial view created by construction of the reducer.) The call to the identity function is part of the strand for the purpose of establishing the absence of a data race.

At any sync or at the end of any spawned (child) function, the runtime may merge two views by calling the reduction function (specified in the reducer initialization) on the values left and right, where right is the earliest remaining view that is later than left. The reduction function is expected to store the merged result in the left view. After the merge, the runtime destroys the right view by calling the destroy function for the hyperobject, then deallocates it using free. Every view except the leftmost view is passed exactly once as the second argument the reduction function. The calls to reduction and destroy functions happen after completion of both of the strands that formerly owned the left and right views.

If a monoid function executes a hyperobject lookup, the behavior of the program is undefined.

For purposes of establishing the absence of a data race, a hyperobject view is considered a distinct object in each parallel strand. A hyperobject lookup is considered a read of the hyperobject.

Array notation

Description

This section provides a specification for the array notation portion of the Intel® Cilk™ Plus language extension. Array notation is intended to allow users to directly express high level parallel vector array operations in their programs. This assists the compiler in performing vectorization and auto-parallelization. From the users' point of view, they will see more predictable vectorization, improved performance and better hardware resource utilization. Array notation is an extension of the standard C/C++ languages, including features that are designed for easy expression of array operations and simplified parallel function invocation.

The Section Expression

The section expression selects multiple array elements for a data-parallel operation.

The syntax of a section expression is as follows:

postfix-expression:
section-expression
section-expression:
postfix-expression [ section-triplet ]
section-triplet:
expression : expression : expression
expression : expression
:

Each of the expressions in a section triplet shall have integer type. The postfix expression in a section expression shall have type “pointer to complete object type”; the type of the section expression is “type” (i.e. the same type as the corresponding “simple” subscript expression; there is no section type).

The sequence of expressions delimited by the brackets in a section expression is termed a section triplet (even when there are fewer than three expressions). The expressions in a triplet are interpreted, respectively as: begin, length, and stride. Each section triplet represents a sequence of subscript values, starting at begin, with length elements, where each element increases by stride:

begin, begin + stride, begin + stride * 2, …, begin + stride * (length - 1)

When no stride expression is present, the value of stride is 1. When the triplet contains no expressions (i.e. consists entirely of a single colon), the value of begin is 0, and the value of length is the number of elements in the array being subscripted. If this shorthand is used, the type of the array being subscripted shall have a declared size (which can be non-constant in the case of a VLA).

For example, A[0:3:2] refers to elements 0, 2, and 4 of the array A. A[:] refers to the entire array A, assuming A is a one-dimensional array with known upper bound.

The expressions in a triplet are converted to ptrdiff_t.

If stride is negative, then begin identifies the uppermost index. If length is less than or equal to zero, the sequence of subscript values is empty.

Every expression has a rank, determined as follows.

  1. The rank of an expression with no nested sub-expression is zero. (This rule applies to identifiers and constants.)
  2. Unless otherwise specified, the rank of an expression with one sub-expression operand is the rank of its operand. (This rule applies to parenthesized expressions, most postfix expressions, most unary expressions, and cast expressions.)
  3. Unless otherwise specified, in an expression with more than one sub-expression operand, the rank is the common rank of its operands. The common rank of two expressions is (“Determination of common rank” is commutative and associative; the common rank of more than two expressions can be determined by arbitrarily pairing expressions and intermediate results.)
  4. The rank of a section expression (postfix-expression [ section-triplet ]) is one greater than the rank of its postfix expression operand. The rank of each expression in a section triplet shall be zero.
  5. The rank of a simple subscript expression (postfix-expression [ expression ]) is the sum of the ranks of its operand expressions. The rank of the subscript operand shall not be greater than one.
  6. The rank of an argument expression list (in a function-call expression) is the common rank of the argument expressions if there are more than one, or the rank of the expression if there is exactly one, or zero if there are no expressions.
  7. The rank of a non-member function-call expression is the rank of its argument expression list. The rank of the postfix expression identifying the function to call shall be zero.
  8. The rank of a member function call expression is determined as if the object expression appeared as an additional expression in the argument list.
  9. The rank of a comma expression is the rank of its second operand.
  10. The rank of a sizeof expression or lambda-expression is zero.

In an assignment expression, if the right operand has nonzero rank, the left operand shall have the same rank as the right operand.

Examples:

Expression Rank
A[3:4][0:10] 2
A[3][0:10] 1
A[3:4][0] 1
A[:][:] 2
A[3][0] 0

An array section is an lvalue postfix expression with rank greater than zero.

For example, A[0:3][0:4] refers to 12 elements in the two-dimensional array A, starting at row 0, column 0, and ending at row 2, column 3.

Examples of section expressions:

int *p;
int A[n][m];
p[:] = ... // not valid. p has no declared size.
A[:][:] = ... // The entire 2D array A.
p[1:5] = ... // p[1],p[2], ... p[5].

Every section triplet of an array section has a relative rank, defined as its ordinal number among the other triplets, from left to right, starting at 0.

Examples:

Expression Relative rank of 0:10
A[:][0:10][:] 1
A[0:10][:][:] 0
A[:][:][0:10] 2
A[i[:]][j[0:10]][k[:]] 1

The shape of an array section is defined as a vector: (length0,length1,...,lengthn-1), where n is the rank of the array section, and lengthi is the length of the triplet with relative rank i. The size of an array section is the product of the length values of its shape.

A full expression shall have rank zero, unless it appears in an expression statement or as the controlling expression of an if statement. A constant expression shall have rank zero. An initializer expression shall have rank zero. The size expression in a declarator or abstract declarator for an array shall have rank zero.

In general, a statement containing a section expression is executed once for each element of the section; some operations of these executions may be performed in parallel. However, in such a statement, a sub-expression with rank zero is evaluated only once. For example:

a[:] = f1(b[:]) + f2(c) + d;

f1 is called for each element of b. f2(c) is evaluated once; that value is used to compute the new value of each element of a. If f2 changes the value of d, the value used for d in this example is unspecified. If f1 changes the value of d, the behavior is likely to be undefined.

When two array sections are required to have common rank, if they do not have the same shape, the behavior is undefined.

When two expressions are required to have common rank, and one of them has zero rank, the expression with zero rank is evaluated only once; the resulting value is used as many times as necessary to fully execute the containing statement.

Operations on Array Sections

Assignment

An array section assignment is a parallel operation that modifies every element of the array section on the left-hand side.

When the left operand of assignment has non-zero rank, the assignment of each element is unsequenced with respect to the assignment of every other element, and the value computation for each element is unsequenced with respect to the value computation for every other element.

For example, if any section on the right-hand side of an assignment overlaps with the left-hand side, and the overlap is not complete, the behavior is undefined.

a[0:10] = a[10:10];	// no overlap; well-defined
a[0:10:2] = a[1:10:2];	// no overlap; well-defined
a[0:10] = a[0:10] + 1;	// complete overlap; well-defined
a[0:10] = a[1:10];	// incomplete overlap; undefined

Examples

// Copy elements 10->19 in A to elements 0->9 in B.
B[0:10] = A[10:10];
// Undefined behavior. Triplets 0:10 and 0:100 are not the same size.
B[0:10] = A[0:100];
// Transpose row 0, columns 0-9 of A, into column 0, rows 0-9 of B.
B[0:10][0] = A[0][0:10];
// Copy the specified array section in the 2nd and 3rd dimensions of A into
// the 1st and 4th dimensions of B.
B[0:10][0][0][0:5] = A[3][0:10][0:5][5]
// Undefined behavior. The corresponding triplets with the same relative rank
// (0:9, 0:5) and (0:5, 0:9) do not have the same number of elements.
B[0:9][0:5] = A[0:5][0:9];
// OK since the triplets on both sides have the same number of elements.
// The 5 elements in A are scattered to the even-index elements in B
// (0,2,4...8). The values of the odd-index elements (1,3,5...7) are not
// changed.
B[0:5:2] = A[0:5];

Arithmetic Operations

When applied to an array section or sections, the C/C++ arithmetic operators are applied to each element of the array section(s). Note that multiplication is performed element-wise and does not correspond to traditional vector/matrix multiplication.

Examples

// Set all elements of A to 1.0.
A[:] = 1.0;
// Set every element of 3x3 shape in A to the value of B[0].
A[0:3][0:3] = B[0];
// Error: the number of dimensions (rank) must match,
// or be equal to 0.
A[0:2][0:2] = B[0:2];
// Element-wise addition of all elements in A and B, resulting in C.
C[:] = A[:] + B[:];
// Element-wise multiplication of all elements in A and B,
// resulting in C.
C[:] = A[:] * B[:];
// Add elements 10->19 from A with elements 0->9 from B and place in
// elements 20->29 in C.
C[20:10] = A[10:10] + B[0:10];
// Element-wise addition of the first 10 elements in column 2 of A and
// column 3 of B, resulting in column 0 of C.
C[0:10][0] = A[0:10][2] + B[0:10][3];
// Matrix addition of the 2x2 matrices in A and B starting at A[3][3]
// and B[5][5], placed in C starting at C[0][0].
C[0:2][0:2] = A[3:2][3:2] + B[5:2][5:2];
// Add the array section along the 1st and 2nd dimensions of B to the
// elements in the array section along the 2nd and 3rd dimensions of A,
// placing them in an array section in C.
C[0:9][0][0:9] = A[0][0:9][0:9] + B[0:9][0:9][4];
// Element-wise addition of first 10 elements in A and B,
// resulting in A.
A[0:10] = A[0:10] + B[0:10];
// Element-wise negation of first 10 elements in A, resulting in A.
A[0:10] = -A[0:10];
// Multiply every element in A by 2.
A[:] *= 2;
// Add one to each element in A.
A[:]++;
// Element-wise equality test of B and C, resulting in an array of
// Boolean values, which are placed in A.
A[:] = B[:] == C[:];

Function calls

If a function is called with an array section argument, the function is “mapped”, or called with successive elements in the array. For example:

type fn(type arg1, type2 arg2);
type in[N], out[N];
type2 in2[N];
out[x:y:z] = fn(in[x:y:z], in2[x:y:z]);

The function fn is mapped over array sections of arrays in and in2. The function fn will be called with arguments (in[x], in2[x]), with arguments (in[x+z], in2[x+z]), etc. The results of the function calls are collected into a section of array out.

The executions of function calls mapped from a given statement are unsequenced with respect to one another.

Reduction Operations

The reduction operations accumulate a result over all the values in an array section. There are specialized reductions for specific operations, and two general reductions which perform a user-specified operation. A reduction operation resembles a generic function, which is invoked using function-call syntax; the name of the operation precedes a parenthesized list of arguments. A reduction can take a variety of argument types; in some cases, the return type of the operation matches the type of the section argument.

The section argument of a reduction operation shall have rank greater than zero. The rank of a reduction operation is zero.

Each of the specialized reduction operations resembles a function that takes a single section argument, whose (element) type shall be an arithmetic type. The specialized reduction operations are listed here:

Reduction operation Result value Result type Result value when the argument is an empty section
__sec_reduce_add The sum of the argument values The argument type Zero
__sec_reduce_mul The product of the argument values The argument type One
__sec_reduce_max The maximum argument value The argument type The minimum value representable in the result type
__sec_reduce_min The minimum argument value The argument type The maximum value representable in the result type
__sec_reduce_max_ind The lowest index of the maximum argument value intptr_t Unspecified
__sec_reduce_min_ind The lowest index of the minimum argument value intptr_t Unspecified
__sec_reduce_all_zero One if all argument values are zero; zero otherwise int One
__sec_reduce_all_nonzero One if all argument values are nonzero; zero otherwise int One
__sec_reduce_any_zero One if any argument value is zero; zero otherwise int Zero
__sec_reduce_any_nonzero One if any argument value is nonzero; zero otherwise int Zero

For __sec_reduce_max_ind and __sec_reduce_min_ind, the rank of the argument shall be exactly one.

The general reduction operations are described by these pseudo-declarations, where T represents a type:

T __sec_reduce(T initial, T section, T (*operation)(T, T));
void __sec_reduce_mutating(T &result, T section, T2 (*operation)(T *, T));

The second argument is the section argument; the result type of the reduction is the element type of the section. The result type shall be copy constructible.

The first argument is the initial value for the operation. Its type shall be convertible to the result type of the reduction. For __sec_reduce_mutating, it is also the object into which the result is computed; it shall be an lvalue, and shall have type compatible with the result type of the reduction.

The third argument identifies the abstract operation for which to perform the reduction. It shall designate a function that takes two arguments having the same type as the result of the reduction. For __sec_reduce, it shall return its result as a value of the result type of the reduction. For __sec_reduce_mutating, its first argument will refer to an object of the result type of the reduction, into which it shall is expected to compute its result.

In C, the operation argument shall have pointer-to-function type. For __sec_reduce, its first argument will also be a value of the result type of the reduction; for __sec_reduce_mutating, its first argument will be a pointer to an object of the result type of the reduction.

In C++, if the result type of the reduction is a class type, then names in the operation argument are looked up as if used in a member function of that class in the scope of the class as well as the scope containing the operation, as for operator overload resolution. If the operation argument is an id-expression referring to a set of overloaded functions, overload resolution is performed as for a binary operator; i.e. overload resolution is used to determine whether to invoke the operation as a member function (e.g. (a.f(b)) or as a non-member function (e.g. (f(a, b)). Otherwise, the operation argument shall be a callable object; if it has pointer-to-member type, then the operation is invoked as a member function call, otherwise it is invoked with two arguments.

Invocations of the function designated by the operation argument are unsequenced with respect to one another. It is unspecified how the elements of the section, the initial/result object, and any introduced temporary objects, are paired by calls to the operation function, except that if the rank of the section is one and the operation function is associative, then the result is the same as for left-to-right reduction, where the initial value is taken as leftmost, and the element with index begin + stride * (length - 1) is taken as rightmost.

Example

type fn(type in1, type in2);
type in[N], out;
out = __sec_reduce(initial, in[x:y:z], fn);

The reduction will be computed analogously to:

tmp = initial;
for each element X of in[x:y:z]
	tmp = fn(tmp, X);

The result of the reduction will be the final value of tmp.

For example, the two reduction operations given here compute the same result:

double add(double in1, double in2) { return in1+in2; }
out = __sec_reduce(0, in[x:y:z], add); // accumulate using add()
out = __sec_reduce_add(in[x:y:z]); // accumulate using built-in +

The compiler may produce more optimized code when the specialized reduction operations are used.

Array Implicit Index

In writing code that uses array sections, it is sometimes useful to explicitly reference the indices of the individual elements in a section. For example, the user may wish to fill an array with a function of the element index, rather than a single value.

Conceptually, an array section operation can be thought of as expanding into a loop with an implicit index variable for each relative rank of the section. For each relative rank, the value of the implicit index variable ranges between zero and one less than the length of the triplet with that relative rank. The __sec_implicit_index operation returns the value of the implicit index variable for a specified relative rank. It behaves as a function with the following declaration:

intptr_t __sec_implicit_index(int relative_rank);

The argument shall be an integer constant expression. For purposes of rank checking, the rank of an implicit index operation is zero, although it is reevaluated for each element, like an expression of rank one.

Examples

int A[10], B[10][10];
// A[0] = 0, A[1] = 1, A[2] = 2,...
A[:] = __sec_implicit_index(0);
// B[i][j] = i+j;
B[:][:] = __sec_implicit_index(0) + __sec_implicit_index(1);
// The length of each dimension is 2. The value of __sec_implicit_index
// is either 0 or 1, regardless of the subscripts of the affected elements.
A[i:2:s][j:2:t] = __sec_implicit_index(0) ^ __sec_implicit_index(1);

Sections in if statements

If the rank of the controlling expression of an if clause is nonzero, the rank of every full expression in every substatement of that if statement (including substatements of those substatements, recursively) shall equal the rank of the controlling expression of the if clause. If the shape of a full expression of a substatement does not match the shape of the controlling expression, the behavior is undefined.

When the controlling expression of an if statement has nonzero rank, the entire if statement, including its dependent statements, is executed for each element of the shape of the controlling expression.

Example: Sections as Array Parameters

The array notation enables a “vector kernel” style of programming, where vector code is encapsulated within a function, with a parameterized vector length. In this case, concurrency happens within a single function invocation, unlike when “mapping” function calls over an array section, where concurrency happens between function invocations.

The following example (which contains syntax specific to C99/C11) illustrates how to combine array section vectorization inside a function body with threading _Cilk_for for parallel function calls. The vector length m is array is processed in chunks of 256 (specified by m) in this example.

void saxpy_vec(int m, float a, float x[restrict m], float y[m])
{
	y[:] += a * x[:];
}

int main(void)
{
	float a[2048], b[2048];
        ... // Array a is initialized here
	_Cilk_for (int i = 0; i < 2048; i += 256)
		saxpy_vec(256, 2.0, (a + i), (b + i));
}

By writing the function explicitly with array arguments, the programmer can write code with easily customizable vector lengths and runtime model choices.

Please note that functions cannot return array section values. It may be helpful to return a pointer to the array as in standard C/C++ and take sections of the return value in the callee.

SIMD loops (#pragma simd)

Description

This section describes the SIMD loop portion of the Cilk Plus language. The SIMD pragma can be applied to a loop, to indicate that the iterations of the loop can be divided into contiguous chunks of a specific length, and that within each chunk, multiple iterations can be executed concurrently.

SIMD loop definition syntax

iteration-statement:
simd-loop-pragma iteration-statement
simd-loop-pragma:
# pragma simd simd-loop-clausesopt new-line
simd-loop-clauses:
simd-loop-clause
simd-loop-clauses simd-loop-clause
simd-loop-clause:
vectorlength-clause
simd-loop-linear-clause
simd-loop-openmp-data-clause
simd-loop-openmp-data-clause:
data-privatization-clause [OpenMP 2.14.3.3 private clause]
data-privatization-out-clause [OpenMP 2.14.3.5 lastprivate clause]
data-reduction-clause [OpenMP 2.14.3.6 reduction clause]
data-privatization-in-clause [2.14.3.4 firstprivate clause]

The iteration-statement following #pragma simd shall be a for loop. The loop's control clause and body are subject to the same restrictions as in a _Cilk_for loop. The loop control variable shall have integer or pointer type. The loop control variable is semantically linear, but may be explicitly mentioned in a private clause.

With the exceptions detailed below, the effect of #pragma simd is the same as that of #pragma omp simd. The syntax and semantics of the various simd-openmp-data-clauses are detailed in the OpenMP specification. (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, Section 2.14.3).

The private clause declares one or more variables to be private to a SIMD lane. The lastprivate clause declares one or more variables to be private to a SIMD lane, and causes the corresponding original variable to be updated after the end of the loop. The reduction clause specifies a reduction-identifier and one or more variables. For each variable, a private copy is created in each SIMD lane, and is initialized with the initializer value of the reduction-identifier. After the end of the loop, the original variable is updated with the values of the private copies using the combiner associated with the reduction-identifier.

The vectorlength clauses clause

vectorlength-clause:
vectorlengthfor ( type-name )
vectorlength ( constant-expression-list )
constant-expression-list:
constant-expression
constant-expression-list , constant-expression

A The constant-expression in a vectorlength clause shall be a valid integer constant expression with a value greater than zero.

If the vectorlengthfor clause is used, the length of the chunk (vector length, or VL) is computed as for a SIMD-enabled function, using the vectorlengthfor argument type as the characteristic type. If not, a VL is selected to try to generate the most efficient vector code. If the vectorlength clause is used, the VL is The value of the argument to the vectorlength clause is the maximum chunk size, MCS, for a SIMD loop. Using a logical numbering of iterations 0, 1,... N-1, where N is the total number iterations in the loop, iteration i is sequenced before iteration i+MSC. If the vectorlength clause is not used, then all iterations are unsequenced with respect to one another (i.e., MSC equals the number of iterations in the loop). A #pragma simd shall not contain more than one vectorlength clause.

The vectorlength clause used with #pragma simd is equivalent to the safelen clause used with #pragma omp simd.

The linear clause

simd-loop-linear-clause:
linear ( simd-loop-linear-variable-list )
simd-loop-linear-variable-list:
simd-loop-linear-variable
simd-loop-linear-variable-list , simd-loop-linear-variable
simd-loop-linear-variable:
id-expression
id-expression : simd-loop-linear-step
simd-loop-linear-step:
conditional-expression

The id-expression in a simd-linear-variable shall designate a variable with scalar type. The conditional-expression in a simd-linear-step shall either satisfy the requirements of an integer constant expression, or be a reference to a variable with integer type. In any SIMD loop, no variable shall be the subject of more than one linear clause, and no variable shall be the subject of a linear clause and also an OpenMP data clause.

If a simd-linear-variable has no associated simd-linear-step, the constant value 1 is used for the simd-linear-step. If the simd-linear-step refers to a variable, and the variable is modified during the execution of the loop, or if the value of the object designated by the id-expression in a simd-linear-variable does not increase by the value of the expression in the associated simd-linear-step in each iteration of the loop, the behavior is undefined.

The linear clause used with #pragma simd has the same effect as the linear clause used with #pragma omp simd, except that each simd-loop-linear-variable is treated as the subject of a separate OpenMP linear clause.

Elemental SIMD-enabled functions

Description

This section describes the “SIMD-enabled functions” portion of the Cilk Plus language. SIMD-enabled functions are a data parallel programming construct. The use of SIMD-enabled functions consists of the following three steps:

  1. The programmer writes a function that uses scalar values in standard C/C++ to describe the operation to be performed on a single element.
  2. The programmer adds a vector attribute with optional clauses to be described below, so that the compiler generates a vector variant of the function, which operates on a vector of elements instead of a single one.
  3. Lastly, the programmer writes the invocation of the function to operate on arrays of arguments instead of single arguments.

The function is invoked with vectors of arguments iteratively until the whole array of arguments is processed. Each such invocation is of a single vector variant of the function.

A SIMD-enabled function is a function declared with one or more vector attributes. The vector attributes enable the compiler to generate multiple variants of the function optimized for use with array sections or within a vectorizable loop (especially a SIMD loop or _Cilk_for loop). Although a call to a SIMD-enabled function from a vectorizable context may be handled specially, calls from "scalar" (non-vector) contexts are not affected by the vector attribute.

Conceptually, an invocation of a SIMD-enabled function from a loop or array-section context is matched against the declared variants. If a variant is found with an appropriate vector length and with matching uniform and linear arguments (see below), then that variant is called. Otherwise, the scalar variant is used.

Semantics

The semantics of a SIMD-enabled function are that the execution order among its invocations is unsequenced.

The execution of a SIMD-enabled function depends on the language construct used at the invocation site, as follows.

  1. If the function is called from a C/C++ for loop, then the compiler may invoke a vector variant instead of the original function, and iterate in the loop fewer times. While in this context the replacement is always correct, whether the replacement will actually be done is implementation dependent and is subject to performance heuristics.
  2. Adding a #pragma simd to the C/C++ for loop will ensure that a vector variant of the function is called. Instances of the function will be invoked iteratively in a single execution strand until all elements of the array arguments have been processed.
  3. If the array notation syntax is used, then the execution is the same as in case #2 above. (Note: Intel is considering augmenting the implementation of this syntax and allowing concurrent execution of invocations.)
  4. If the function is invoked from a _Cilk_for loop, then a vector variant is used, and the execution of the _Cilk_for loop is as described in the _Cilk_for section. This may result in multiple invocations of the function executing concurrently.

The semantics of a SIMD-enabled function differ from a normal function in sequencing and in constraints on the program, as follows:

  1. Invocations of the SIMD-enabled function in consecutive iterations of a vectorizable loop or for consecutive elements of an array section are unsequenced with respect to one another.
  2. The body of a SIMD-enabled function is required to conform to the same constraints as the body of a SIMD loop, as described below.

Implementation note: The compiler will typically generate three translations of the source code into machine code. One is for the compliant version, which operates on a single element per invocation. The second the short vector variant, that operates on a scalar variant of the function, and two vector variants per vector attribute. The first vector variant processes multiple array elements at a time through the use of vector registers and SIMD instructions. The number of elements to be used in by the short vector variant of the function is determined by the types of arguments and return value of the function, and the width of the vector registers in the target CPU. The programmer can use a vectorlength clause (described below) to override the compiler's choice. The third version is designed to receive an additional argument which is a second vector variant does the same, but additionally takes an implicit mask argument, which disables processing of some of the vector lanes. This argument is implicit and the application programmer does not explicitly provide it. The mask argument is required for the cases in which the function is called from a loop under The compiler calls the second variant and supplies the implicit mask when the function is called from a conditional statement for which the condition might differ between iterations. The application programmer can use a mask clause to suppress the generation of either the second or the 3rd version. one of the two vector variants if they are not needed. The compiler may suppress generation of any variant (including the scalar variant) if it knows that the variant will not be called. In a separate-compilation environment, such knowledge might require access to all compilation units that may use the SIMD-enabled function.

The vector attribute

The vector attribute is a general-purpose attribute, applicable to functions.

vector-attribute:
vector
vector ( simd-function-clausesopt )
simd-function-clauses:
simd-function-clause
simd-function-clauses , simd-function-clause
simd-function-clause:
processor-clause architecture-clause
vectorlength-clause
uniform-clause
simd-function-linear-clause
mask-clause

For each vector attribute, one vector variant of the vectorized function is created. The vector attribute and its associated clauses are considered part of the function signature. If the function declaration is inconsistent with the function prototype with respect to these clauses then the behavior is undefined.

A SIMD-enabled function shall not have a dynamic exception specification.

In a vector attribute, no parameter shall be the subject of more than one clause (uniform or linear).

The processorarchitecture clause

architecture-clause:
architecture ( architecture-name )
architecture-name:
identifier

The set of identifiers valid for use in the architecture clause is implementation-defined. The identifier names one of a set of SIMD-instruction-set architectures supported by the target processor. If no architecture-clause is specified, an implementation-specified instruction set is used.

The Intel compiler supports the following architecture names:

sse2
sse3
ssse3
sse4_1
sse4_2
avx
avx2
avx512
processor-clause:
processor ( processor-name )
processor-name:
pentium_4
pentium_4_sse3
core_2_duo_ssse3
core_2_duo_sse_4_1
core_i7_sse4_2

Directs the compiler to create a vector version of the function for the given target processorarchitecture. The default processorarchitecture is implementation-specified and is typically taken from the implicit or explicit processor- or architecture-specific flag in the compiler command line. The target processorarchitecture defines both the vector ISA that the compiler is allowed to use, and the width of the vectors.

The vectorlength clause

A vector attribute shall not contain more than one vectorlength clause. In the context of a vector attribute, the constant-expression-list in a vectorlength clause shall have only one constant-expression.

Every vector variant of a SIMD-enabled function has a vector length (VL). If the vectorlength clause is used, the VL is the value of the argument of that clause. Otherwise the VL is defined by the implementation's ABI. Otherwise, the function's characteristic type is found. If the vectorlengthfor clause is used, the characteristic type is the type specified as its argument; otherwise, if the return type is not void, then the characteristic type is the return type; otherwise, if the function has any non-uniform, non-linear parameters, then the characteristic type is the type of the first such parameter; otherwise, the characteristic type is int. The VL is then determined from the characteristic type and the processor's vector register. For example:

The uniform clause

uniform-clause:
uniform ( uniform-param-list )
uniform-param-list:
parameter-name
uniform-param-list , parameter-name
parameter-name:
identifier
this

The identifier in a parameter-name shall match the name of a parameter of the function to which the containing clause applies; the parameter shall have integral arithmetic or pointer type. The keyword this shall not be used as a parameter-name except with a non-static member function.

If the value of an argument corresponding to a uniform parameter differs between invocations from a single execution of a loop, the behavior is undefined. (The values of these parameters can be broadcasted to all iterations as a performance optimization.) The uniform clause declares one or more parameters to have an invariant value for all invocations of the function in the execution of a single SIMD loop or array-section expression.

The linear clause

simd-function-linear-clause:
linear ( simd-function-linear-param-list )
simd-function-linear-param-list:
simd-function-linear-param
simd-function-linear-param-list , simd-function-linear-param
simd-function-linear-param:
parameter-name
parameter-name : simd-function-linear-step
simd-function-linear-step:
parameter-name
constant-expression

If a simd-function-linear-step consists of a single identifier matching the name of a parameter, or in the declaration of a non-static member function is the keyword this, the simd-function-linear-step is interpreted as referring to the parameter; otherwise, the simd-linear-step shall satisfy the requirements of a constant-expression. A constant-expression in a simd-function-linear-step shall be an integer constant expression. A parameter referenced as a simd-function-linear-step shall be the subject of a uniform clause.

No parameter of a SIMD-enabled function shall be the subject of more than one uniform or linear clause.

For a linear parameter, if the corresponding argument values in consecutive iterations (in the serial version of the program) do not differ by the value of the simd-function-linear-step, the behavior is undefined. The linear clause declares one or more parameters to have values that increase or decrease linearly in consecutive invocations of the function in the execution of a single SIMD loop or array-section expression.

The mask clauses

mask-clause:
mask
nomask

By default, for every vector variant declared clause, two implementations are provided variants are generated: one especially suitable for conditional invocation (i.e. masked), and another especially suitable only for unconditional invocation (i.e. unmasked). If all invocations are conditional, generation of the unmasked implementation variant can be suppressed using the mask clause. Similarly, if all invocations are unconditional, generation of the masked implementation variant can be suppressed using the nomask clause. (Using both clauses together has no effect: both implementations are provided. This is also the default.) A vector attribute shall not have both a mask clause and a nomask clause.

Restrictions on elemental SIMD-enabled functions and SIMD loops

The behavior is undefined if any of the following language constructs shall not appear appears within the body of a SIMD-enabled function, nor or in a SIMD loop:

  1. a goto statement
  2. a _Cilk_spawn expression
  3. a _Cilk_for or _Cilk_sync statement
  4. an OpenMP directive or construct
  5. a try statement
  6. a call to setjmp or longjmp
  7. a declaration of a non-static variable of a type with a destructor

If execution of a function called from a SIMD-enabled function or SIMD loop terminates with an exception or a call to longjmp, the behavior is undefined.

Note: Intel is considering removal of some of these restrictions in the future.

Note: Because invocations of a SIMD-enabled function, and iterations of a SIMD loop, are implicitly allowed to be concurrent, modifying any non-atomic non-local object carries the potential for a data race, and therefore undefined behavior.


Cilk Plus Trademark License

Intel hereby grants to you a limited, nonexclusive, non-transferable, royalty-free, revocable (as described in this paragraph) worldwide license to use the trademark “Cilk Plus” (the “Licensed Trademark”) solely in connection with Compliant Products and no other goods or services. For the purpose of this license grant, Compliant Products mean your software tools, applications and technology which implement and comply with the “Cilk Plus Language Specification” or “Cilk Plus Application Binary Interface Specification” as set forth in this document. You must use the Licensed Trademark only with an applicable noun that identifies your Compliant Product, such as (but not limited to) the following examples: “the Cilk Plus compiler”, “the Cilk Plus-compliant software”, “XYZ Cilk Plus compiler” or “XYZ Cilk Plus software.” Use of the Licensed Trademark is subject to the disclaimer set forth in this document. Should Licensee desire to use the INTEL® mark, the INTEL® logo (or any other Intel marks) in connection with its advertising or promotional materials respecting Compliant Products, any such use must be approved in writing by Intel prior to any such use.

Intel may immediately terminate the license grant described above if you are in breach of any conditions and such breach is not cured within thirty (30) days of written notice from Intel. Any claim or dispute arising under or relating to this license grant shall be governed by the internal substantive laws of the State of Delaware, without regard to principles of conflict of laws.

Non-Trademark Use of Cilk Plus

Provided that it is accurate statement, you may fairly state that your software tools, application or technology complies with the “Cilk Plus Language Specification” or “Cilk Plus Application Binary Interface Specification”, such as (but not limited to) the following examples: “XYZ compiler is compliant with the Cilk Plus Language Specification”, and “XYZ software is compliant with the Cilk Plus Application Binary Interface Specification”.

Disclaimer and other legal information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO (A) SALE AND/OR USE OF INTEL PRODUCTS, (B) THE SPECIFICATION AND (C)THE LICENSED TRADEMARK, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, VALIDITY OF RIGHTS OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, TRADEMARK OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

NEITHER PARTY SHALL BE LIABLE TO THE OTHER PARTY FOR SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, EVEN IF SUCH PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Cilk is a trademark of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

Copyright © 2011, 2013, Intel Corporation. All rights reserved.