REU Site: Interdisciplinary Program in High Performance Computing

Intel Concurrent Collections as a Method for Parallel Programming

Team Members: Richard Adjogah, Randal Mckissack, and Ekene Sibeudu

Graduate Assistant: Andrew M. Raim

Faculty Mentor: Matthias K. Gobbert

Client: Loring Craymer

Team 4, from left to right: Richard Adjogah, Randal Mckissack, Ekene Sibeudu

Our team consisted of UMBC undergraduates Richard Adjogah, Randal Mckissack, and Ekene Sibeudu, led by faculty mentor Matthias K. Gobbert. During the 2011 REU Site, we were presented with a project involving the new parallel programming software Intel Concurrent Collections (CnC), proposed by Dr. Loring Craymer. We received assistance and feedback throughout the program from Dr. Gobbert, Dr. Craymer, graduate assistant Andrew Raim, and Kath Knobe, a research scientist for the CnC project at Intel.

Computer hardware has become parallel in order to run faster and more efficiently. One of the current standard parallel programming libraries is MPI (Message Passing Interface). When using MPI, the user has to explicitly send and receive messages to and from different processes with multiple function calls. These functions take numerous arguments that need to be passed in; this can be error-prone and a hassle. The Intel Corporation is developing new parallel software and a translator called CnC (Concurrent Collections) to make programming in parallel easier. CnC uses a system of collections composed of steps, items, and tags, which are organized by a graph. The graph is written in Intel's textual notation, and CnC's translator *cnc* then compiles the notation into a C++ header file. This system allows the programmer to identify dependencies among code segments in the program; the parallelization of these code segments is then handled automatically by CnC at runtime. Our research evaluates whether this new software works faster and more efficiently when creating parallel code and when converting serial code to parallel.
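As a rough sketch of what such a graph specification looks like, the following fragment is written in the style of CnC's textual notation; the collection names here are purely illustrative, and exact syntax details may vary between CnC releases:

```
// tag, item, and step collection declarations (names are illustrative)
<int iterTag>;            // tags identify instances of a step
[double result];          // items hold the data produced
(compute);                // steps carry the computation

// dependency relations that form the graph
env -> <iterTag>;         // the environment produces tags
<iterTag> :: (compute);   // each tag prescribes one compute step instance
(compute) -> [result];    // compute steps produce items
[result] -> env;          // the environment consumes the results
```

Running such a file through the *cnc* translator generates a C++ header declaring the corresponding collections, which the programmer then fills in with the actual step bodies.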

To test the difference between the two methods, we used benchmark codes with both MPI and CnC and compared the results. We started with a prime number generator provided by Intel as sample code that familiarizes programmers with CnC. Then we moved on to a pi approximation, for which we used as our starting point a classical MPI sample code that approximates pi by numerical integration. We ran it in MPI first, then stripped it of all MPI, ported it to C++, and added calls to CnC. Afterwards we ran performance studies to compare the two.
Our last two tests involved doing parameter studies.
The main test case is
an extension of the Poisson equation by a term involving
a non-negative parameter *a*.
The finite difference method is used to discretize this equation,
which results in a symmetric positive definite system of linear equations.
This is solved by the iterative conjugate gradient method.
To demonstrate how readily the setup for a parameter study can
be transferred to a different problem, we apply it to
a project to estimate entropy of DNA bindings
by another team in the REU Site.
In both cases, we took the existing serial code for each problem and were readily able to run it repeatedly with varying parameters in parallel using CnC, just by creating a couple of new files.
These last two tests showcase a clear advantage of CnC over MPI
in parallelization of parameter studies.

An excellent example of how CnC is useful as a tool for parallel parameter studies is the extension of the Poisson equation. For finer mesh resolutions, this is a computationally intensive problem, and it is vital to run studies involving many parameters in parallel using several threads. Consider the situation where the problem is to be solved for many randomly chosen values of the parameter *a*. The iterative conjugate gradient method used in the serial code then requires a different number of iterations, and thus an unpredictable amount of run time, for each randomly chosen value of *a*. Therefore CnC, which has no barriers and can execute a computation whenever its control and data are available, is ideally suited to sending work to any thread that has completed work on a previously assigned parameter value. Because CnC handles all parallelization at runtime, it does not matter that the programmer cannot predict how the parameter(s) will affect time and memory usage.
We study the performance of CnC on one compute node with 8 cores.

The table below shows the run data of our parallelized parameter
study for the extended Poisson problem for a small demonstration
example of a parameter study of only 8 randomly chosen parameter values.
The mesh resolution of the finite difference method controls
the amount of work for each run of the Poisson code and is selected
as 256-by-256 here to obtain conveniently readable times in units of seconds.
The table shows the results ordered by the value of *a*,
which confirms the decreasing iteration count for increasing *a*.
The last two columns show the run times in seconds of each
Poisson solve first for the case of using only 1 thread (serial code)
and then for the case of a run using 8 threads (the number of
computational cores available on one node).
When run originally, each line of the table would appear in random
order in the output, since different threads finish their work
in random order; to make the output comparable and to clearly
analyze the effect of *a* on the iteration count,
the output was manually ordered.
For the small amount of work in this demonstration example,
we notice that there is some slow-down involved,
since each individual run time for the study using 8 threads
is larger than for the study using 1 thread.
The final line of the table lists the total observed time for the
entire parameter study.
For the serial run, the total run time is the summation of all
individual Poisson solve times, as one would expect, since each
individual solve is run on the same computational core in turn.
By contrast, for the run using 8 threads on a compute node with 8
computational cores, the total run time is driven only by the
slowest individual runtime;
this is clear evidence of CnC performing correctly
for this small illustrative demonstration.

| *a* | #iter | 1 Thread Time (s) | 8 Threads Time (s) |
| ------ | ----- | ----------------- | ------------------ |
| 135.44 | 594 | 1.72 | 2.97 |
| 274.75 | 477 | 1.38 | 2.65 |
| 486.90 | 369 | 1.09 | 2.35 |
| 561.38 | 367 | 1.08 | 2.26 |
| 700.98 | 330 | 0.97 | 2.20 |
| 840.19 | 300 | 0.89 | 2.02 |
| 840.19 | 300 | 0.89 | 2.02 |
| 916.46 | 287 | 0.86 | 1.95 |
| Total | | 8.88 | 2.98 |

The next table shows the results of a study where we extended the above study from 8 to *M* Poisson solves for randomly chosen parameter values. All runs used 8 threads. The Worst Case column shows the predicted total runtime if the Poisson solves were distributed to the threads in order of increasing runtimes; this would make the last thread's total time the longest, estimated by *M*/8 times the maximum individual runtime. The Best Case column shows the predicted total runtime if each thread had an even amount of work, given by *M*/8 times the mean individual runtime. We can conclude that CnC is effective in its parallelization due to our Actual Times being extremely close to the Best Case times.

| *M* | Worst Case Time (s) | Best Case Time (s) | Actual Time (s) |
| ---- | ------------------- | ------------------ | --------------- |
| 8 | 2.97 | 2.30 | 2.98 |
| 64 | 31.75 | 20.33 | 21.19 |
| 256 | 137.82 | 86.84 | 88.44 |
| 1024 | 556.18 | 351.91 | 352.63 |

Through our research, we have found CnC to be a useful method for parallel computing. By making the parallelization process as abstract as possible, the amount of coding a programmer has to do is reduced, and task distribution can be done as effectively as possible at runtime. While the graph concept behind CnC is a very different way of thinking from how parallelization is done in MPI, it is easy to follow once understood. The nature of CnC's parallelization makes operations that require accessing parallel elements in order counterproductive and time-costly. However, CnC excels at parameter studies, where the multiple runs of a method may each vary in memory usage and runtime in unknown ways.

In the time frame of this REU Site, we only worked with CnC using all computational cores on one compute node. CnC has also a distributed version that can use all cores on several nodes in a cluster. Investigating this and analyzing its performance is the subject of future work.

Richard Adjogah, Randal Mckissack, Ekene Sibeudu, Andrew M. Raim, Matthias K. Gobbert, and Loring Craymer. Intel Concurrent Collections as a Method for Parallel Programming. Technical Report HPCF-2011-14, UMBC High Performance Computing Facility, University of Maryland, Baltimore County, 2011. A reprint and associated files are available in the HPCF publications list.

Poster presented at the Summer Undergraduate Research Fest (SURF)


Participants Randal Mckissack and Richard Adjogah Present at James Edward West Symposium