Block Cyclic Distribution of Data in pbdR and
its Effects on Computational Efficiency
Matthew G. Bachmann1, Ashley D. Dyas2, Shelby C. Kilmer3, and Julian Sass4
Graduate Research Assistant: Andrew Raim4
Nagaraj K. Neerchal4, Kofi P. Adragani4, George Ostrouchov5, and Ian F. Thorpe6
1Department of Mathematics, Northeast Lakeview College
2Department of Computer Science, Contra Costa College
3Department of Mathematics, Bucknell University
4Department of Mathematics and Statistics, University of Maryland, Baltimore County
5Oak Ridge National Laboratory
6Department of Chemistry and Biochemistry, University of Maryland, Baltimore County
Team 1, from left to right: Julian Sass, Shelby C. Kilmer, Ashley D. Dyas, Matthew G. Bachmann
About the Team
Our team, composed of Matthew Bachmann, Ashley Dyas, Shelby Kilmer, and Julian Sass, performed an efficiency study using a package for R, a popular statistical computing language, called pbdR
(Programming with Big Data in R). This research took place at the UMBC REU Site: Interdisciplinary Program in High Performance Computing. Assisting us in our research and providing insight and
supervision were our faculty mentor, Dr. Nagaraj Neerchal, and our graduate assistant, Andrew M. Raim. Our client, Dr. George Ostrouchov, Senior Research Staff Member at Oak Ridge National
Laboratory, proposed our project. Dr. Ian Thorpe also provided the data that was used in an application of our study.
Introduction to our Project
pbdR is an R package used to implement high performance statistical computing on very large data sets. Our study measured computational efficiency while varying two main factors: the block cyclic
distribution of the data and the processor grid layout. We explored the impact of block size and grid layout on computation by implementing the statistical method PCA (Principal Component Analysis).
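The two factors above determine which process owns each matrix entry. The sketch below illustrates the standard 2D block-cyclic mapping (the scheme pbdR inherits from ScaLAPACK) in plain Python, not pbdR itself; the block sizes and grid dimensions are hypothetical parameters chosen for illustration:

```python
def owner(i, j, mb, nb, Pr, Pc):
    """Return the (row, col) coordinates of the process in a Pr x Pc grid
    that owns global matrix entry (i, j) under a 2D block-cyclic
    distribution with mb x nb blocks (0-based indices throughout)."""
    return ((i // mb) % Pr, (j // nb) % Pc)

# Example: an 8x8 matrix on a 2x2 process grid with 2x2 blocks.
# Block-rows are dealt out cyclically: block-row 0 goes to grid row 0,
# block-row 1 to grid row 1, block-row 2 back to grid row 0, and so on.
grid = [[owner(i, j, 2, 2, 2, 2) for j in range(8)] for i in range(8)]
```

A small block size spreads the matrix more evenly across the grid but incurs more bookkeeping per block, which is why the choice of block size can dominate run time.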
Methods and Results
For our study, we implemented PCA on a randomly generated data set and recorded the time it took for the code to run. Our pilot study varied n and k, the dimensions of our data matrix, and the
results showed that the relationship between the dimensions of the matrix and the run time was predictable, which allowed us to keep n and k constant throughout the rest of our study.
When varying grid layout and block size, we found that grid layout has less of an effect on the run time than the block size. We also observed that the 8x8 block
size was consistently faster than the other block sizes, regardless of n, k, or grid layout. We therefore conclude that block size has a clear effect on computational efficiency.
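The serial analogue of the timed computation can be sketched as follows: PCA of a random n-by-k matrix via SVD of the centered data, with the elapsed time recorded. This is a plain NumPy illustration with hypothetical dimensions (n = 2000, k = 50), not the distributed pbdR code, which would operate on a matrix spread across the process grid:

```python
import time
import numpy as np

def pca(X):
    """PCA of the rows of X via SVD of the column-centered matrix.
    Returns (principal component scores, explained variances)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                    # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                         # project onto the components
    variances = s**2 / (n - 1)                 # variance explained, descending
    return scores, variances

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))            # hypothetical n x k data matrix
t0 = time.perf_counter()
scores, variances = pca(X)
elapsed = time.perf_counter() - t0             # the quantity the study recorded
```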
Applications of our Study
As an application of our study, we used data containing the movement of amino acids in a protein from the lab of Dr. Ian Thorpe. The data was formatted as 3100 snapshots, each snapshot containing
the x, y, and z coordinates of atoms in different amino acids of a protein. We performed PCA on the data matrix and also created a correlation matrix from the data. Once we had a correlation
matrix, we created a level plot from the matrix and saw how atoms in different amino acids correlate with each other.
We then greyed out the correlations that are not statistically significant and created a level plot of the same data set. There is a significant drop in the number of data points, showing that few of these correlations are statistically significant.
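One way to carry out the greying step is to mask correlations whose t-statistic falls below a critical value. The sketch below uses plain NumPy (not the code from the report), and the 1.96 cutoff is an assumed large-sample 5% two-sided critical value, not a threshold taken from the study:

```python
import numpy as np

def significance_mask(X, t_crit=1.96):
    """Correlation matrix of the columns of X, plus a mask of entries whose
    Pearson correlation is not significant under the usual t-test:
    t = r * sqrt((n - 2) / (1 - r^2)).  t_crit ~ 1.96 approximates the
    5% two-sided critical value for large n (an assumed cutoff)."""
    n = X.shape[0]
    R = np.corrcoef(X, rowvar=False)
    r = np.clip(R, -0.999999, 0.999999)        # guard the diagonal (r = 1)
    t = r * np.sqrt((n - 2) / (1 - r**2))
    mask = np.abs(t) < t_crit                  # True -> grey out in the level plot
    return R, mask

# Toy data: columns 0 and 1 strongly correlated, column 2 independent noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
X = np.column_stack([x, x + 0.1 * rng.standard_normal(1000),
                     rng.standard_normal(1000)])
R, mask = significance_mask(X)
```

Entries where the mask is True would be drawn grey in the level plot, leaving only the statistically significant correlations in color.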
Matthew G. Bachmann, Ashley D. Dyas, Shelby C. Kilmer, Julian Sass, Andrew Raim, Nagaraj K. Neerchal, Kofi P. Adragani, George Ostrouchov, and Ian F. Thorpe.
Block Cyclic Distribution of Data in pbdR and its Effects on Computational Efficiency.
Technical Report HPCF-2013-11, UMBC High Performance Computing Facility, University of Maryland, Baltimore County, 2013.
Poster presented at the Summer Undergraduate Research Fest (SURF).