No! The following two code fragments have quite different behavior:
cilk_for(int i = 0; i < n; ++i) f(i);
for (int i = 0; i < n; ++i) cilk_spawn f(i); cilk_sync;
The second approach is generally a suboptimal way to spawn n iterations because each iteration is spawned sequentially, thereby generating a span (critical path) that is proportional in n. In contrast, using a cilk_for generates a span that is proportional to only lg n, since a cilk_for is implemented using a parallel recursive divide-and-conquer algorithm.
The Intel Cilk Plus runtime uses a work-stealing scheduler to dynamically load-balance the tasks that are created by a Cilk Plus program.
At a high level, the runtime scheduler executes a Cilk Plus program by using worker threads. More specifically, by default, the runtime scheduler queries the OS to determine how many cores are present on a system, and creates a worker thread for each core. Each worker maintains a deque (short for double-ended queue) for storing tasks that have yet to execute. Intuitively, in the runtime, each worker thread maintains its deque using the following simple algorithm:
Of course, there are other minor implementation details that are not included here. But these three simple rules capture the essence of the Cilk Plus scheduler.
In Cilk Plus, we chose to have pushes and pops occur at the opposite end of the deque from steals to reduce contention. In the common case, a worker thread is able to push and pop from the tail of the deque without being slowed down by a thief who may be trying to steal from the head of the deque. Worker threads in Cilk push and pop work from the same end (and thus operate in a LIFO order instead of a FIFO order) because this scheme tends to exhibit better locality. Because use of cilk_spawn and cilk_sync keywords naturally generates computations which exhibit fork-join parallelism, it turns out that in Cilk, if one considers all the tasks on the deque of a given worker thread, the most deeply nested task appears at the tail and the shallowest task appears at the head. Generally, the most deeply nested tasks on a worker's deque are also the ones most recently executed by that worker, and thus, by looking for work from the tail is more likely to preserve locality. Stealing from the head of a deque has benefits in terms of making progress on the critical path (span) of the computation, since the shallowest tasks in a deque are the oldest tasks. Moreover, in many recursive parallel computations, the oldest tasks are also likely to be the largest tasks, and thus, when a thief steals from the head of a deque, it is more likely to steal a task with enough work to amortize against the overheads of executing the steal.
Randomized work-stealing has proven to be a robust technique for scheduling that is efficient for scheduling all kinds of fork-join computations, even those computations with highly irregular parallelism and complex nested parallelism.
In general, when executing an arbitrary computation in a dynamic runtime environment, it is difficult to predict at a given time exactly which task is “best” to steal. The algorithm for maintaining deques in Cilk Plus helps us out, by ordering tasks on each worker’s deque to have the oldest / shallowest tasks at the head. But since there are generally not any guarantees about which deque would be the best one to steal from, picking a victim deque at random turns out to be a good choice.
The choice of randomized work-stealing is not just an effective heuristic. One can mathematically prove strong theoretical guarantees on the efficiency of a randomized work-stealing scheduler, in a fairly general theoretical fork-join DAG model of computation as shown in the paper Scheduling mulithreaded computations by work stealing.
Real programs and computer systems are, of course, more complicated than mathematical models can completely capture. But we believe these theoretical results provide a solid foundation for the runtime scheduler, and contribute greatly to the good efficiency and scalability of the runtime scheduler that we have observed across a wide variety of Cilk Plus programs and on a variety of different machines.
No, the current version of the product does not support thread affinity or make any attempt to pin worker threads to particular cores. Pinning worker threads to particular cores can hurt overall system performance in a multiprogrammed environment, where there might be other OS threads from other processes running on the same machine.
In general, by having a worker steal from a nearby core instead of a random core, one is making a tradeoff between exploiting locality and making progress on the critical path (span) of the overall computation. Intuitively, when workers repeatedly steal from nearby neighbors, they are more likely to miss stealing a large task that is on the critical path. Thus, although the cost of an individual steal from a nearby core may be less because of improved locality, the overall cost may still be greater, because more steals are needed to finish the entire computation.
Moreover, in terms of implementation, determining which cores are “nearby” is difficult since the current version of the Cilk Plus runtime does not pin worker threads to particular cores.
By default, Intel Cilk Plus will query your OS to determine how many cores are present on your system and create a “worker thread” for each. You may override that default using an API call or by setting an environment variable.
The OS scheduler is ultimately what looks at the tasks coming from Intel Cilk Plus and the other threads and sends them to the hardware. If an Intel Cilk Plus worker thread attempts to steal work and fails, it will return it’s quantum to the operating system, allowing it to schedule other applications. In the future, we hope to see additional interfaces in operating systems to coordinate threaded applications including those built with Intel Cilk Plus.
Yes, a Cilk Plus program can have multiple OS threads that each independently call or spawn Cilk functions. The runtime coordinates to guarantee that only a single copy of the runtime data structures and a single set of worker threads are created on a given machine.
The Cilk Plus runtime attempts to be as lazy as possible in performing initialization. If you are writing a library in which only a portion of the application uses Intel Cilk Plus, the runtime will be automatically initialized (i.e., the runtime will initialize data structures and create worker threads) only when it is first needed.
Once the runtime has been initialized, it may suspend worker threads once it detects that a program is no longer executing any Cilk code. If another region of Cilk code starts, the runtime will wake up the worker threads as needed. Once created, worker threads are not freed, however, unless the user makes an explicit call to shut down the runtime.
In terms of implementation, a cilk_spawn requires significantly less overhead than creating an OS thread, since it only needs to push a task onto a data structure, not make a call into the OS and involve the OS thread scheduler.
In terms of semantics, a cilk_spawn also differs from an OS thread because it exposes parallelism in a program, but does not require it. In a Cilk program, when a function f() has a cilk_spawn of both functions g() and h(), the runtime is allowed, but not guaranteed, to execute g() in parallel with h().
Said differently, the Cilk keywords are designed for programs with serial semantics, i.e., a serial execution of the program must be a valid choice for executing the program. It is not designed for programs that require concurrent semantics, e.g., a producer-consumer application where g() produces a result that is communicated to and consumed by h().
In Cilk Plus, a cilk_for loop over n iterations is executed using a parallel recursive divide-and-conquer algorithm. More specifically, the iterations of the loop are divided into two roughly equal pieces that may execute in parallel. Each piece is recursively subdivided until the base case, where the number of iterations is less than the grain size of the cilk_for loop.
In general, you don’t need to. A cilk_for will automatically choose a grain size based on the number of workers being used to execute your program. In most cases, this automatic calculation will produce an optimal grain size for your loop. Forcing the grain size to 1 can be useful for debugging, or if your loop iterations are extremely unbalanced.