FAQ

Comparing Cilk Plus

Should I expect Intel Cilk Plus to outperform TBB, OpenMP and MPI?

Intel Cilk Plus offers a competitive alternative for parallel programming that is far easier to use than MPI. It is intended to complement TBB and OpenMP, enabling programmers to parallelize applications that may be too complex or cumbersome to fit into one of these frameworks. Cilk Plus is not intended to replace these platforms, however. In general, one should not expect performance improvements from converting existing parallel code in TBB or OpenMP to Cilk Plus. TBB, OpenMP and MPI continue to be good choices for writing High Performance Computing applications.

Intel Cilk Plus is designed to be a unique, general-purpose solution which provides good scalability for multi-core programs across a variety of applications and machines, and which allows programmers to exploit both data and task parallelism in their programs in a straightforward, maintainable manner. The tradeoff is that Cilk Plus may not be as finely tuned for specific programming patterns or environments as TBB or OpenMP.

In general, each parallel programming framework has its own strengths and weaknesses. Which framework offers the best performance depends both on the kind of application and on how much the code has been tuned for a particular platform. One should not expect a naïve conversion of tuned code from one platform to another to deliver comparable performance, since each platform has its own tuning methodology and preferred programming style.

Can Intel Cilk Plus and TBB be used together in the same program?

Intel Cilk Plus and TBB can coexist. The Cilk Plus and TBB runtimes coordinate to minimize oversubscription of your cores. That way you can use the tool that best fits the problem.

Can Intel Cilk Plus and OpenMP be used together in the same program?

Yes, although when OpenMP is used in conjunction with other threading libraries, the user must be extremely careful about controlling the number of threads to avoid oversubscription, especially when OpenMP is used at the innermost level.
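One way to keep the thread count under control is to cap the OpenMP team explicitly wherever OpenMP sits at the innermost level. The sketch below is only an illustration: the function name scale_row and the parameter inner_threads are hypothetical, and it assumes the function is called from work already running in parallel under another framework. The num_threads clause is standard OpenMP; when the code is compiled without OpenMP support the pragma is ignored and the loop runs serially.

```c
/* Scale one row in place. The caller is assumed to be a worker of an outer
   parallel framework, so the inner OpenMP team is capped explicitly instead
   of defaulting to one thread per core. inner_threads must be positive. */
void scale_row(double *row, int n, double factor, int inner_threads)
{
    #pragma omp parallel for num_threads(inner_threads)
    for (int i = 0; i < n; ++i)
        row[i] *= factor;   /* each iteration touches a distinct element */
}
```

Passing 1 for inner_threads effectively serializes the inner loop, which is often a reasonable choice when the outer level already keeps every core busy.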

What if I already have code written using OpenMP… should I migrate that code to Intel Cilk Plus?

In general, the answer is no, since Intel Cilk Plus and OpenMP serve different needs, depending on the development environment, the kind of parallelism required, and the associated algorithms. The exception is when you already use OpenMP for features that overlap with Intel Cilk Plus and plan to add Intel Cilk Plus going forward; in that case it is a good idea to replace the OpenMP code with Intel Cilk Plus. The reason is that Intel Cilk Plus is designed for incremental parallelization: additional parallelism can be added without creating unnecessary threads that lead to oversubscription.

How does Intel Cilk Plus compare to OpenMP?

Nearly a decade ago we worked with others to create the OpenMP standard, which standardizes a set of constructs, similar to vector computing directives, for scientific and loop-based code. By 2007, every major compiler had implemented or announced support for OpenMP. This is quite an accomplishment, and it makes OpenMP a standard way to do parallelism in C and Fortran.

However, OpenMP is very loop oriented and does not address algorithm-level or data-structure-level parallelism. While we’ve seen it used to great advantage in financial applications, MP3 codecs, scientific programs and high-definition video editing software, OpenMP is best geared for Fortran and loop-oriented C code. Intel Cilk Plus is designed to be a unique, general-purpose solution which provides good scalability for multi-core programs across a variety of applications and machines, and which allows programmers to exploit both data and task parallelism in their programs in a straightforward, maintainable manner.

If I’m writing new code, how do I choose between Intel Cilk Plus, TBB, OpenMP, and MPI?

It’s not an exclusive choice. The vectorization and multi-threading capabilities in Cilk Plus are separable, and often the vectorization capabilities are useful in conjunction with TBB or OpenMP. Likewise, if you need support for distributed memory, MPI is the choice for the message-passing part, but you are free to use one of the other frameworks for multi-threading the code on a single node. The rest of this answer considers the “how to multi-thread” question.

First, you’ll want to look at the development environment. If the code is written in Fortran, use OpenMP. TBB works only for C++, and makes heavy use of templates. If the code is C, or you are not comfortable with templates, your choice is narrowed to Cilk Plus or OpenMP. If you need parallelism with compilers that do not support Cilk Plus, your choice is narrowed to OpenMP or TBB. If you need parallelism in C++ with compilers that support neither Cilk Plus nor OpenMP, then TBB is the way to go, because it does not require compiler support.

If the code makes heavy use of recursion, then Cilk Plus or TBB is likely a better fit than OpenMP. Cilk Plus has particularly low-cost “spawning”, so it is a good fit for highly recursive code where the base cases cannot be forced to be large chunks. If the code is C++, it’s likely that TBB is the best fit. TBB matches especially well with code that is highly object oriented and makes heavy use of C++ templates and user-defined types. If the code is written in C or Fortran, OpenMP may be the better solution because it fits better than TBB into a structured coding style and, for simple cases, it introduces less coding overhead. TBB and native threads don’t require specific compiler support; OpenMP and Cilk Plus do.
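The kind of recursive code with small base cases mentioned above can be sketched with the classic Fibonacci example. This is an illustration, not a tuned implementation. It uses the Cilk Plus keywords cilk_spawn and cilk_sync from <cilk/cilk.h>; the fallback macros are an assumption added here so that the same source also builds with compilers that lack Cilk Plus, in which case it simply runs serially. The sketch also assumes that Cilk Plus-enabled compilers define the __cilk macro.

```c
/* Recursive Fibonacci with Cilk Plus spawn/sync.
   Assumption: compilers with Cilk Plus enabled define __cilk. Elsewhere the
   keywords are stubbed out, leaving an ordinary serial program. */
#ifdef __cilk
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

long fib(int n)
{
    if (n < 2)
        return n;                    /* base case is tiny: no large chunk of work */
    long a = cilk_spawn fib(n - 1);  /* spawned call may run in parallel with the caller */
    long b = fib(n - 2);             /* the caller continues with the other branch */
    cilk_sync;                       /* wait for the spawned call before reading a */
    return a + b;
}
```

Every call spawns only one extra task and does almost no work itself, which is why the cost of a spawn matters so much for code shaped like this.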

How much are you willing to make your code stray from plain serial code? If you use Cilk Plus without array notation, then the code still looks like serial code with some markup. In contrast, conversion to TBB will require restructuring certain parts to fit the TBB templates.

Next, look at what you want to make parallel. Use OpenMP if the parallelism is primarily for bounded loops over built-in types, or if it is flat, do-loop-centric parallelism. OpenMP works especially well with large and predictable data-parallel problems. It can be very challenging to match OpenMP performance with Cilk Plus or TBB for such problems, and it is seldom worth the effort: just use OpenMP. Cilk Plus and TBB excel at less structured or less consistent parallelism.
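A bounded loop over a built-in type, the case where OpenMP is recommended above, might look like the following sketch. The function name dot is hypothetical; the parallel for directive and the reduction clause are standard OpenMP. If the compiler is run without OpenMP enabled, the pragma is ignored and the loop runs serially with the same result.

```c
/* Dot product of two arrays of doubles: a flat, bounded loop over a
   built-in type, parallelized with a single OpenMP directive. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    /* Each thread accumulates a private partial sum; OpenMP combines them. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

The loop bounds are known on entry and every iteration costs about the same, so a static division of iterations among threads is usually enough for this shape of loop.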

Consider using TBB if you need off-the-shelf patterns that go beyond loop-based and recursion-based parallelism, since TBB provides generic parallel patterns for parallel while-loops, data-flow pipeline models, parallel sort, and prefix-scan. However, many of these patterns can be coded in Cilk Plus too. See the book Structured Parallel Programming for examples.

How does Intel Cilk Plus compare with Boost threads?

With Intel Cilk Plus, we are attempting to address a problem that did not previously have an appropriate solution. We are not trying to replace Boost or to be solely a thread wrapper. With tasks, we do look to abstract the developer from low-level threading details. However, the true benefit of Intel Cilk Plus is its ability to provide a scalable solution for multi-core hardware in a way that is easy for C and C++ developers to implement, as well as its ability to specify data parallelism.