Team Members: Andrew Coates and Alexey Ilchenko
Faculty Mentors: Matthias K. Gobbert and Nagaraj K. Neerchal
Clients: Patrick O'Neill and Ivan Erill
In the REU Site: Interdisciplinary Program in High Performance Computing in the Department of Mathematics and Statistics at the University of Maryland, Baltimore County (UMBC), during the summer of 2011, our team of two students Andrew Coates and Alexey Ilchenko researched methods in information theory to calculate information content of enzyme binding sites under the guidance of Matthias K. Gobbert and Nagaraj K. Neerchal. This project was proposed to us by Patrick O'Neill and Ivan Erill of the Department of Biological Sciences at UMBC.
Biologists use information theory, specifically entropy and information content, to predict where proteins will bind to DNA. The expected entropy must be calculated in order to determine the information content. For the applications under consideration, the biologists are interested in entropy calculations for sample sizes ranging from 1 to about 450. For small sample sizes, the exhaustive calculation of the estimated entropy using its mathematical definition is feasible, but already for medium sample sizes this calculation becomes very time consuming and runs into memory problems. For large sample sizes, a good approximation of the entropy is available. The most important range for an improved algorithm to compute the expected entropy is therefore about 20 to 70 samples.
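To illustrate the estimated entropy that underlies these calculations, the following sketch computes the plug-in Shannon entropy, in bits, of the observed base frequencies in a sample of bases; the function name and the example sample are our own illustration, not part of the project code.

```python
import math
from collections import Counter

def plug_in_entropy(sample):
    """Shannon entropy (in bits) of the observed base frequencies in a sample."""
    n = len(sample)
    counts = Counter(sample)
    # H = -sum over bases of f * log2(f), with f the observed frequency
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A sample of 8 bases: frequencies A=4/8, T=2/8, G=1/8, C=1/8
print(plug_in_entropy("AAAATTGC"))  # 1.75 bits
```

The information content of a position is then obtained by comparing this estimated entropy against the maximal entropy of 2 bits for the four DNA bases.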
The number of unique DNA sequences of length n is tremendous and grows as O(4^n). However, much of that sequence information is redundant with respect to binding site entropy calculations, so the problem becomes one of extracting only the useful information from the O(4^n) sequences and then optimizing the calculation effectively.
Our team implemented a computational model for calculating entropy values for an enzyme binding site using an exhaustive string-based approach in Python, which generates all distinct sequences explicitly. This leads to memory problems and excessive run times even for relatively small sample sizes, so a better algorithm is necessary.
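The exhaustive string-based approach can be sketched as follows: enumerate every one of the 4^n possible sequences, compute the plug-in entropy of each, and average. This is a minimal illustration assuming a uniform background distribution over the four bases; the function name is hypothetical and the original project code is not reproduced here.

```python
import math
from collections import Counter
from itertools import product

def expected_entropy_exhaustive(n):
    """Expected plug-in entropy, averaging over all 4**n equiprobable sequences.

    Enumerates every sequence explicitly, so work grows as O(4**n);
    this is only feasible for small sample sizes n.
    """
    total = 0.0
    for seq in product("ACGT", repeat=n):
        counts = Counter(seq)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += h
    return total / 4**n
```

For n = 2, for instance, 4 of the 16 sequences repeat one base (entropy 0) and the other 12 contain two distinct bases (entropy 1 bit), giving an expected entropy of 12/16 = 0.75 bits.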
In order to reduce the number of calculations of the multinomial function, our team noticed that certain DNA sequences can be grouped together into classes given by the integer compositions of the sample size into 4 parts, where each summand is the number of occurrences of one of the four bases in that class of sequences. For example, for a sample of length 4, some possible sequences are CATT, ATTG, and TATC. In our approach, the sequences CATT and TATC are grouped into the same composition, namely 1 A, 2 T, 1 C, and 0 G. Since the entropy calculation is identical for every member of a composition class, this effectively reduces the calculation to O(n^3) evaluations of the multinomial function.
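The grouping above can be sketched as follows: iterate over the compositions of n into 4 parts (roughly n^3/6 of them, hence O(n^3)), weight each composition's entropy by the multinomial coefficient counting how many sequences share it, and average. Again this is an illustrative sketch assuming a uniform background distribution; the function name is our own.

```python
import math

def expected_entropy_compositions(n):
    """Expected plug-in entropy via integer compositions of n into 4 parts.

    Each composition (a, c, g, t) stands for every sequence with those base
    counts; the multinomial coefficient counts how many such sequences exist,
    so only O(n**3) terms are summed instead of 4**n sequences.
    """
    total = 0.0
    for a in range(n + 1):
        for c in range(n + 1 - a):
            for g in range(n + 1 - a - c):
                t = n - a - c - g
                # number of sequences sharing this composition: n!/(a! c! g! t!)
                mult = math.factorial(n) // (
                    math.factorial(a) * math.factorial(c)
                    * math.factorial(g) * math.factorial(t))
                h = -sum((k / n) * math.log2(k / n)
                         for k in (a, c, g, t) if k > 0)
                total += mult * h
    return total / 4**n
```

For small n this agrees exactly with the exhaustive enumeration (e.g., 0.75 bits for n = 2), while the number of terms grows polynomially rather than exponentially.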
Our team implemented this composition-based algorithm in Matlab and then ported it to C, which shows a further significant improvement in run time. The following table lists the run times in seconds of our code in Python, Matlab, and C for sample sizes in the range from 1 to 256.
The data bear out that the Python code is slower than the alternatives for sample sizes 8 and 13 and runs out of memory (indicated by "O.M." in the table) already for medium sample sizes. For the most important range for our work, from 20 to 70 samples, both the Matlab and the C code take less than 1 second in all cases, thus making those calculations feasible. Moreover, while the Matlab code takes increasingly long for large sample sizes, the C code continues to perform well.