|Team Members:||Adam Cunningham1, Gerald Payton, Jack Slettebak, and Jordi Wolfson-Pou3|
|Graduate Research Assistants:||Jonathan Graf2, Xuan Huang, and Samuel Khuvis2|
|Faculty Mentor:||Matthias K. Gobbert2|
|Clients:||Thomas Salter and David J. Mountain4|
Our team, consisting of Adam Cunningham, Gerald Payton, Jack Slettebak, and Jordi Wolfson-Pou, participated in the Interdisciplinary Program in High Performance Computing, hosted by the Department of Mathematics and Statistics at UMBC. Our project, proposed by our clients Thomas Salter and David J. Mountain, was to test the computing capabilities of the maya cluster using industry benchmarks. Our faculty mentor, Dr. Matthias K. Gobbert, and our graduate research assistants, Jonathan Graf, Xuan Huang, and Samuel Khuvis, assisted our research with insight and supervision.
Maya is the 240-node supercomputer in the UMBC High Performance Computing Facility.
The 72 newest nodes have two eight-core Intel E5-2650v2 Ivy Bridge CPUs, with 64 GB memory (in eight 8 GB DIMMs) each, making a single node capable of running 16 processes/threads simultaneously.
The nodes are connected by a high-performance quad-data rate (QDR) InfiniBand interconnect. The new hardware requires testing and benchmarking to give insight into its full potential. We report here on the High Performance Conjugate Gradient (HPCG) Benchmark developed by Sandia National Laboratories.
The HPCG benchmark solves the Poisson equation on a three-dimensional domain. A discretization on a global grid with a 27-point stencil at each grid point generates a system of linear equations with a large, sparse, highly structured system matrix. This system is solved by a preconditioned conjugate gradient method. The unknowns in this system are distributed to a 3-D grid of parallel MPI processes.
|27-point stencil||3-D process grid.|
A problem with a sparse system matrix and an iterative solution technique is more relevant to many applications than the dense system matrix of the LINPACK benchmark.
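To make the structure of the computation concrete, the following is a minimal serial sketch of a matrix-free 27-point stencil operator and a preconditioned conjugate gradient loop. It is not the benchmark's code. The stencil coefficients (26 on the diagonal, -1 for each neighbor), the treatment of neighbors outside the grid, and the names `apply_stencil` and `pcg` are assumptions made for illustration. A simple Jacobi (diagonal) preconditioner is swapped in here; the benchmark's own preconditioner is more elaborate and is not reproduced.

```python
import math


def apply_stencil(x, n):
    """Return y = A x for a 27-point stencil on an n x n x n grid.

    Assumed coefficients: 26 on the diagonal and -1 for each of the up
    to 26 neighbors. Neighbors outside the grid are dropped (assumption).
    """
    y = [0.0] * (n ** 3)
    offsets = (-1, 0, 1)
    for iz in range(n):
        for iy in range(n):
            for ix in range(n):
                row = (iz * n + iy) * n + ix
                total = 26.0 * x[row]
                for dz in offsets:
                    for dy in offsets:
                        for dx in offsets:
                            if dx == 0 and dy == 0 and dz == 0:
                                continue  # skip the grid point itself
                            jx, jy, jz = ix + dx, iy + dy, iz + dz
                            if 0 <= jx < n and 0 <= jy < n and 0 <= jz < n:
                                total -= x[(jz * n + jy) * n + jx]
                y[row] = total
    return y


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def pcg(n, b, tol=1e-10, max_iter=200):
    """Solve A x = b with Jacobi-preconditioned CG; return (x, iterations)."""
    diag = 26.0                       # diagonal of A, used as preconditioner
    x = [0.0] * len(b)
    r = list(b)                       # residual for the initial guess x = 0
    z = [ri / diag for ri in r]       # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for it in range(1, max_iter + 1):
        Ap = apply_stencil(p, n)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = [ri / diag for ri in r]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    n = 4
    exact = [1.0] * (n ** 3)
    b = apply_stencil(exact, n)       # right-hand side with known solution
    x, iters = pcg(n, b)
    print(iters, max(abs(xi - 1.0) for xi in x))
```

In a parallel run, each MPI process would apply the stencil only to its own subgrid and exchange boundary values with neighboring processes; that communication is omitted from this sketch.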
The HPCG benchmark uses a 3-D grid of P = px x py x pz parallel MPI processes. We consider P = 1, 8, 64, 512 in our experiments. Each process hosts a local subgrid of size nx x ny x nz. Thus, Nx = nx px, Ny = ny py, and Nz = nz pz, and the total number of unknowns Nx x Ny x Nz scales with the number of processes.
For example, for P = 512 processes, the global grid ranges from millions to billions of unknowns:
|n||P = 1||P = 8||P = 64||P = 512|
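The relation between local subgrid, process grid, and global problem size can be checked with a few lines of arithmetic. The sketch below uses a hypothetical helper `global_size` and evaluates it for the smallest and largest local subgrids (n = 16 and n = 128) on an assumed cubic process grid.

```python
def global_size(local, procs):
    """Return ((Nx, Ny, Nz), total unknowns) for a local subgrid
    (nx, ny, nz) hosted on a process grid (px, py, pz)."""
    nx, ny, nz = local
    px, py, pz = procs
    Nx, Ny, Nz = nx * px, ny * py, nz * pz
    return (Nx, Ny, Nz), Nx * Ny * Nz


if __name__ == "__main__":
    # Cubic process grids with P = 1, 8, 64, 512 processes (assumed layout).
    for side in (1, 2, 4, 8):
        for n in (16, 128):
            dims, total = global_size((n, n, n), (side, side, side))
            print(f"P = {side ** 3:3d}, n = {n:3d}: {dims} -> {total:,} unknowns")
```

For P = 512 this gives 128^3 = 2,097,152 unknowns at n = 16 and 1024^3 = 1,073,741,824 unknowns at n = 128.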
We ran the HPCG Benchmark Revision 2.4 with an execution time of 60 seconds using the Intel C++ compiler and MVAPICH2. The table shows the observed GFLOP/s for several local subgrid dimensions nx x ny x nz. The table reports the results for P = 512 parallel MPI processes using N compute nodes with pN processes per node and nt OpenMP threads per MPI process. Possible combinations for P = 512 are N = 32 nodes with pN = 16 processes per node and nt = 1 thread per process, or N = 64 nodes with pN = 8 processes per node and nt = 1 or 2 threads per process.
|nx = ny = nz = 16||nt = 1||nt = 2|
|N = 32 pN = 16||45.58||N/A|
|N = 64 pN = 8||113.50||112.36|
|nx = ny = nz = 32||nt = 1||nt = 2|
|N = 32 pN = 16||170.03||N/A|
|N = 64 pN = 8||209.92||211.84|
|nx = ny = nz = 64||nt = 1||nt = 2|
|N = 32 pN = 16||223.92||N/A|
|N = 64 pN = 8||209.62||238.82|
|nx = ny = nz = 128||nt = 1||nt = 2|
|N = 32 pN = 16||233.98||N/A|
|N = 64 pN = 8||210.94||230.42|
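The combinations of N, pN, and nt in the table follow from two constraints: N x pN must equal P, and pN x nt must not exceed the 16 cores of a node. The sketch below enumerates them with a hypothetical helper `feasible_configs`. The limit of 72 nodes is an assumption taken from the number of newest nodes described earlier.

```python
def feasible_configs(total_procs, cores_per_node, max_nodes):
    """List (nodes, processes per node, threads per process) such that
    nodes * procs_per_node == total_procs and
    procs_per_node * threads <= cores_per_node."""
    configs = []
    for nodes in range(1, max_nodes + 1):
        if total_procs % nodes != 0:
            continue  # processes must divide evenly among nodes
        procs_per_node = total_procs // nodes
        if procs_per_node > cores_per_node:
            continue  # would oversubscribe the node even with one thread
        max_threads = cores_per_node // procs_per_node
        for threads in range(1, max_threads + 1):
            configs.append((nodes, procs_per_node, threads))
    return configs


if __name__ == "__main__":
    # P = 512 processes, 16 cores per node, at most 72 nodes (assumption).
    for nodes, ppn, threads in feasible_configs(512, 16, 72):
        print(f"N = {nodes}, pN = {ppn}, nt = {threads}")
```

Under these assumptions the only feasible choices for P = 512 are the three reported in the table.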
The table allows us to conclude:
Sandia HPCG Benchmark: http://software.sandia.gov/hpcg/
Adam Cunningham, Gerald Payton, Jack Slettebak, Jordi Wolfson-Pou, Jonathan Graf, Xuan Huang, Samuel Khuvis, Matthias K. Gobbert, Thomas Salter, and David J. Mountain. Pushing the Limits of the Maya Cluster. Technical Report HPCF-2014-14, UMBC High Performance Computing Facility, University of Maryland, Baltimore County, 2014. Reprint in HPCF publications list