Tutorial: Cilk Plus Keywords

 Up: Overview Next: Reducers

The Intel Cilk Plus standard defines three keywords: _Cilk_spawn, _Cilk_sync, and _Cilk_for.  These names were chosen to conform to C/C++ standards for language extensions, and to minimize the chance of conflicting with names used in existing programs, not to be easy to type. The header file <cilk/cilk.h> defines macros that provide names with simpler conventions; cilk_spawn, cilk_sync and cilk_for.  The remainder of this tutorial uses the names defined in cilk.h.

cilk_spawn and cilk_sync

The following function calculates a Fibonacci number. There are certainly more efficient algorithms to calculate Fibonacci numbers, but this one provides a simple recursive function and makes a good example:

```int fib(int n)
{
if (n < 2)
return n;
int x = fib(n-1);
int y = fib(n-2);
return x + y;
}
```

The key idea here is that the calculation of `fib(n-1)` can be executed in parallel with the calculation of `fib(n-2)` without interference. The parallelism can be expressed in Cilk Plus with the following modifications:

```int fib(int n)
{
if (n < 2)
return n;
int x = cilk_spawn fib(n-1);
int y = fib(n-2);
cilk_sync;
return x + y;
}
```

The `cilk_spawn` keyword specifies that the child function may execute in parallel with the caller (sometimes referred to as the parent). The code following a `cilk_spawn` statement is referred to as a continuation.  It's important to note that cilk_spawn permits parallelism. It does not command it. cilk_spawn does not create a thread.  It allows the runtime to steal the continuation to execute in another worker thread.

The `cilk_sync` statement specifies that all child functions spawned from this function must complete before execution continues past this statement.  There is always an implied cilk_sync at the end of every function that contains a cilk_spawn.  It is important to note that cilk_sync only effects the child functions spawned from this function.  Functions spawned higher on the call tree will continue to run in parallel with the function executing the cilk_sync.

You do not need to add a cilk_spawn attribute to the second recursive call to fib() because that will create an empty continuation.

strand is a sequence of instructions that starts or ends on a statement which will change the parallelism. `fib()` contains all or part of four strands:

1. From the function entry to the `cilk_spawn` statement.
2. The first, spawned, recursive call to fib().
3. Any statements after the spawned call to the `cilk_sync` statement.
4. From the `cilk_sync` statement to the end of the function.

Because of the strictly enforced fork/join model of parallelism implemented by Intel Cilk Plus, an application can be throught of as a Directed Acyclic Graph, or DAG. The DAG to the right is a representation of the fib() function shown above. Each of the arcs is a strand, and each of the nodes is a statement which will change the parallelism. In an Intel Cilk Plus program each cilk_spawn has exactly 1 input arc and 2 output arcs. By convention, the continuation is always the right arc of the DAG, or top, if you're drawing left-to-right. A cilk_sync has 2 or more inputs, and exactly 1 output.

Permitting parallelism instead of commanding it is an aspect of the serial semantics of a deterministic Intel Cilk Plus application.  It is always possible to run an Intel Cilk Plus application with a single worker, and it should give identical results to the serialization of that program.

As a general rule, a parallel application should have approximately 10P tasks, where P is the number of cores the application is executing on.  This allows for enough tasks to keep the other cores busy if one core is executing a long task.  However, properly written Intel Cilk Plus applications should not attempt to adapt to the number of cores available.  Instead they should expose sufficient parallelism using cilk_spawn and let the Intel Cilk Plus runtime decide which portions of the application to run in parallel.  A cilk_spawn costs approximately 10x as much a simple call.  While you shouldn't convert every call to a cilk_spawn, you also shouldn't attempt to tailor your program to the number of cores available on your current machine, either. History has shown that the number of cores will continue to grow.  If you expose sufficient parallelism, your application's performance should continue to improve as the number of cores increases.

What is costly is stealing work from another core.  The estimate is that stealing work from another core costs about 15,000 instructions.  A good rule-of-thumb is that you should attempt to break your program into approximately equal portions and insert cilk_spawn's to distribute the work among your cores.  Recursive algorithms are generally very good at this.  For example, the fib() implementation above breaks the work into approximately 2 halves, and spawns half.  In turn, that breaks the work into 2 more halves, etc.  This way the work is quickly distributed among the cores, minimizing stealing.

cilk_for

While cilk_spawn and cilk_sync are great for expressing the parallelism in a recursive algorithm, one of the simplest ways to parallelize a program is to identify a loop with no dependencies between the iterations and run them in parallel.

The cilk_for statement converts a simple `for` loop into a parallel for loop. That is, one where iterations of the for loop body can be executed in parallel. Consider the following loop:

```for (int i = 0; i < 8; ++i)
{
do_work(i);
}
```

An obvious way to parallelize this would be to add the `cilk_spawn` attribute to the call to `do_work()`.

```for (int i = 0; i < 8; ++i)
{
cilk_spawn do_work(i);
}
cilk_sync;
```

However, this would be a mistake:

• It introduces a lot of overhead. The amount of work spawned is small, and all of the remaining work needs to be stolen for every iteration.  So you'll have 7 steals, and steals are costly.
• It has low parallelism. There is only one producer of parallel work at a time.

A better approach is to use a `cilk_for` loop:

```cilk_for (int i = 0; i < 8; ++i)
{
do_work(i);
}
```

The Intel Cilk Plus compiler and runtime cooperate to divide the work of the loop in half, and then divide it in half again, until there are enough pieces to keep the cores busy, but at the same time minimize the overhead imposed by `cilk_spawn`.

Like the recursive implementation of fib() above, this efficiently spreads the work across the available cores and minimizes steals.

 Up: Overview Next: Reducers