Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications
  • Authors: Gianni D’Angelo (1) (2)
    Salvatore Rampone (1) (2)
  • Journal: BMC Bioinformatics
  • Publication date: May 2014
  • Volume: 15
  • Issue: Suppl 5
  • References:
    1. Kodama Y, Shumway M, Leinonen R: The sequence read archive: explosive growth of sequencing data. Nucleic Acids Research 2012, 40:D54-D56.
    2. Baxevanis AD: The Molecular Biology Database Collection: 2003 update. Nucleic Acids Research 2003, 31(1).
    3. Galperin MY, Fernández-Suárez XM: The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research 2012, 40(Database issue).
    4. Fernández-Suárez XM, Galperin MY: The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research 2013, 41(Database issue).
    5. Rubin D: Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Inc; 1987.
    6. Dick U, Haider P, Scheffer T: Learning from Incomplete Data with Infinite Imputations. In Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland; 2008:232-239.
    7. Ibrahim JG: Incomplete data in generalized linear models. Journal of the American Statistical Association 1990, 85:765-769.
    8. Wang X, Li A, Jiang Z, Feng H: Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme. BMC Bioinformatics 2006, 7:32.
    9. Williams D, Carin L: Analytical kernel matrix completion with incomplete multi-view data. Proceedings of the International Conference on Machine Learning (ICML) Workshop on Learning with Multiple Views 2005, 80-86.
    10. Graepel T: Kernel matrix completion by semidefinite programming. Proceedings of the International Conference on Artificial Neural Networks 2002, 2415:694-699.
    11. Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B (Methodological) 1977, 39(1):1-38.
    12. Tsuda K, Akaho S, Asai K, Williams C: The em algorithm for kernel matrix completion with auxiliary data. Journal of Machine Learning Research 2003, 4:67-81.
    13. Duda RO, Hart PE: Pattern Classification and Scene Analysis. New York: Wiley; 1973.
    14. Quinlan JR: C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann Publishers; 1993.
    15. Geman S, Geman D: Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 1984, 6:721-741.
    16. Ahmad F, Isa NAM, Osman MK, Hussain Z: Performance comparison of gradient descent and Genetic Algorithm based Artificial Neural Networks training. Proceedings of the 10th International Conference on Intelligent Systems Design and Applications (ISDA) 2010, 604-609.
    17. Rubin DB, Little RJA: Statistical Analysis with Missing Data. 2nd edition. New York: Wiley Interscience; 2002.
    18. Rässler S: The Impact of Multiple Imputation for DACSEIS. In Technical Report DACSEIS Research Paper Series 5. University of Erlangen-Nürnberg, Nürnberg, Germany; 2004.
    19. Schafer JL, Graham JW: Missing Data: Our View of the State of the Art. Psychological Methods 2002, 7(2):147-177.
    20. Chen J, Xue X, Tian F, Huang H: An Algorithm for Classifying Incomplete Data With Selective Bayes Classifiers. Proceedings of the IEEE International Conference on Computational Intelligence and Security Workshops 2007, 445-448.
    21. Williams D, Liao X, Xue Y, Carin L, Krishnapuram B: On Classification with Incomplete Data. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007, 29(3):427-436.
    22. Li D, Zhong C, Li J: An attribute weighted fuzzy c-means algorithm for incomplete data sets. Proceedings of the IEEE International Conference on System Science and Engineering (ICSSE) 2012, 449-453.
    23. Thangaparvathi B, Anandhavalli D, Mercy Shalinie S: A high speed decision tree classifier algorithm for huge dataset. Proceedings of the IEEE International Conference on Recent Trends in Information Technology (ICRTIT) 2011, 695-700.
    24. Menon AK: Large-Scale Support Vector Machines: Algorithms and Theory. In Research Exam. University of California, San Diego; 2009:1-7.
    25. Guosheng W: A Survey on Training Algorithms for Support Vector Machine Classifiers. Proceedings of the Fourth IEEE International Conference on Networked Computing and Advanced Information Management, NCM '08 2008, 123-128.
    26. Lu C, Li X, Pan H: Application of SVM and Fuzzy Set Theory for Classifying with Incomplete Survey Data. Proceedings of the IEEE International Conference on Service Systems and Service Management 2007, 1-.
    27. Chen J, Xue X, Fengzhan T, Huang H: An Algorithm for Classifying Incomplete Data with Selective Bayes Classifiers. Proceedings of the IEEE International Conference on Computational Intelligence and Security Workshops, CISW 2007, 445-448.
    28. Amado N, Gama J, Silva F: Parallel Implementation of Decision Tree Learning Algorithms. Progress in Artificial Intelligence, Lecture Notes in Computer Science 2001, 2258:6-13.
    29. Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armananzas R, Santafe G, Perez A, Robles V: Machine learning in bioinformatics. Briefings in Bioinformatics 2006, 7(1):86-112.
    30. Rampone S: Recognition of splice junctions on DNA sequences by BRAIN learning algorithm. Bioinformatics 1998, 14(8):676-684.
    31. Rampone S, Russo C: A fuzzified BRAIN algorithm for learning DNF from incomplete data. Electronic Journal of Applied Statistical Analysis (EJASA) 2012, 5(2):256-270.
    32. Rampone S: An Error Tolerant Software Equipment For Human DNA Characterization. IEEE Transactions on Nuclear Science 2004, 51(5):2018-2026.
    33. Aloisio A, Izzo V, Rampone S: VLSI implementation of greedy-based distributed routing schemes for ad hoc networks. Soft Computing 2007, 11(9):865-872.
    34. Green MR: Pre-mRNA splicing. Annual Review of Genetics 1986, 20:671-708.
    35. Michalski RS: A theory and methodology of inductive learning. Artificial Intelligence 1983, 20:111-161.
    36. Mitchell TM: Generalization as search. Artificial Intelligence 1982, 18:203-226.
    37. Haussler D: Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence 1988, 36:177-221.
    38. Zadeh LA: Fuzzy sets. Information and Control 1965, 8(3):338-353.
    39. Mendelson E: Introduction to Mathematical Logic. London: Chapman & Hall; 1997.
    40. Bürgisser P, Clausen M, Shokrollahi MA: Algebraic Complexity Theory. Springer; 1997.
    41. Knuth D: Big Omicron and big Omega and big Theta. SIGACT News 1976, 8(2):18-24, Apr.-June.
    42. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. 3rd edition. Boston: The MIT Press; 2009.
    43. Vitter JS: External Memory Algorithms and Data Structures: Dealing with Massive Data. ACM Computing Surveys 2001, 33(2):209-271, June.
    44. Kasim H, March V, Zhang R, See S: Survey on Parallel Programming Model. NPC: Proceedings of the IFIP International Conference on Network and Parallel Computing 2008, 266-275.
    45. Flynn MJ: Very high-speed computing systems. Proceedings of the IEEE 1966, 54(12):1901-1909.
    46. Auguin M, Larbey F: OPSILA: an advanced SIMD for numerical analysis and signal processing. Microcomputers: developments in industry, business, and education, Ninth EUROMICRO Symposium on Microprocessing and Microprogramming, Madrid, September 13-16, 1983, 311-318.
    47. Darema F: SPMD model: past, present and future. In Recent Advances in Parallel Virtual Machine and Message Passing Interface: Proceedings of the 8th European PVM/MPI Users' Group Meeting, Santorini/Thera, Greece, September 23-26, 2001. Lecture Notes in Computer Science 2131:1-.
    48. Message Passing Interface Forum. [http://www.mpi-forum.org/] Online, last accessed October 4, 2013.
    49. Jing Y, Weichang S, Gongxiao Y: Construct COW Based on MPICH in Linux Environment. Proceedings of the First International Workshop on Education Technology and Computer Science 2009, 895-898.
    50. MPJ Express. [http://mpj-express.org/] Online, last accessed October 4, 2013.
    51. Shafi A, Hussain A, Raza J: A Parallel Implementation of the Finite-Domain Time-Difference Algorithm using MPJ Express. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS); 2008:1-.
    52. Foster I: Designing and Building Parallel Programs. Addison Wesley; 1996.
    53. Nian S, Guangmin L: Dynamic Load Balancing Algorithm for MPI Parallel Computing. Proceedings of the IEEE International Conference on New Trends in Information and Service Science 2009, 95-99.
    54. Ullman JD: NP-Complete Scheduling Problems. Journal of Computer and System Sciences 1975, 10:384-393.
    55. Sinnen O, Sousa LA, Sandnes FE: Toward a Realistic Task Scheduling Model. IEEE Transactions on Parallel and Distributed Systems 2006, 17(3):263-275.
    56. Bache K, Lichman M: UCI Machine Learning Repository. [http://archive.ics.uci.edu/ml] Irvine, CA: University of California, School of Information and Computer Science; 2013.
    57. Pollastro P, Rampone S: HS3D: Homo Sapiens Splice Site Data Set. Nucleic Acids Research 2003 (Annual Database Issue).
    58. Pollastro P, Rampone S: HS3D, a Dataset of Homo Sapiens Splice Regions, and its Extraction Procedure from a Major Public Database. International Journal of Modern Physics C 2003, 13(8):1105-1117.
    59. Forbes SA: COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Research 2011, 39(suppl 1):D945-D950.
    60. Liggett WH Jr, Sidransky D: Role of the p16 tumor suppressor gene in cancer. J Clin Oncol 1998, 16(3):1197-1206, Mar.
    61. Barney B: Introduction to Parallel Computing. [https://computing.llnl.gov/tutorials/parallel_comp/] Lawrence Livermore National Laboratory. Online, last accessed October 4, 2013.
    62. Rabenseifner R, Hager G, Jost G: Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing. IEEE Press; 2009:427-436.
  • Affiliations:
    1. Department of Science and Technology (DST), University of Sannio, Benevento, Italy
    2. Futuridea Innovazione Utile e Sostenibile, Benevento, Italy
  • ISSN: 1471-2105
Abstract
Background

The huge quantity of data produced in biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears to be a magic bullet for this challenge. However, several hard-to-solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general-purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed iteratively by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; these conditions yield a set of coefficients (relevances) for each attribute (literal), forming a probability distribution that guides the selection of the term literals. Its great versatility makes U-BRAIN applicable in many fields where data are to be analyzed. However, the memory and execution time it requires are of order O(n^3) and O(n^5), respectively, so the algorithm is unaffordable for huge data sets.

Results

We find mathematical and programming solutions that lead towards an implementation of the U-BRAIN algorithm on parallel computers. First we give a Dynamic Programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are large we are forced to use mass memory, and, depending on where the data are actually stored, the access times can be quite different.
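The term-construction scheme described above can be sketched in a few lines. The following is a deliberately simplified illustration, not the published U-BRAIN implementation: it handles only complete binary instances (no missing bits), and the "relevance" of a literal is reduced here to the normalized count of negative instances it rejects. All function names are ours.

```python
# Illustrative sketch of greedy DNF term construction (NOT the actual U-BRAIN
# code): literals are scored by a relevance that forms a probability
# distribution over the candidates, and the most relevant literal is added to
# the current conjunctive term. Assumes the data are consistent (no instance
# is both positive and negative).

def literal_holds(lit, instance):
    """lit = (index, value): the literal x_i (value=1) or NOT x_i (value=0)."""
    i, v = lit
    return instance[i] == v

def build_term(positives, negatives):
    """Grow a conjunction satisfied by some positives and by no negative."""
    term, covered_pos, remaining_neg = [], list(positives), list(negatives)
    n_bits = len(positives[0])
    while remaining_neg:
        candidates = []
        for i in range(n_bits):
            for v in (0, 1):
                lit = (i, v)
                if lit in term:
                    continue
                pos_kept = [p for p in covered_pos if literal_holds(lit, p)]
                neg_rejected = [q for q in remaining_neg
                                if not literal_holds(lit, q)]
                if pos_kept and neg_rejected:
                    candidates.append((len(neg_rejected), lit))
        total = sum(score for score, _ in candidates)
        # relevances: a probability distribution over the candidate literals
        relevances = [(score / total, lit) for score, lit in candidates]
        _, best_lit = max(relevances)
        term.append(best_lit)
        covered_pos = [p for p in covered_pos if literal_holds(best_lit, p)]
        remaining_neg = [q for q in remaining_neg
                         if literal_holds(best_lit, q)]
    return term

def learn_dnf(positives, negatives):
    """Repeat term construction until every positive instance is covered."""
    formula, uncovered = [], list(positives)
    while uncovered:
        term = build_term(uncovered, negatives)
        formula.append(term)
        uncovered = [p for p in uncovered
                     if not all(literal_holds(l, p) for l in term)]
    return formula
```

On a toy data set such as positives `[(1,1,0), (1,0,1)]` and negatives `[(0,1,1), (0,0,0)]`, the learned formula satisfies every positive instance and rejects every negative one.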
According to an evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the cost of communication between the different memories (RAM, cache, mass, virtual) and to achieve efficient I/O performance, we design a mass-storage structure able to access its data with a high degree of temporal and spatial locality. Then we develop a parallel implementation of the algorithm. We model it as an SPMD system together with a message-passing programming paradigm, adopting the high-level message-passing system MPI (Message Passing Interface) in its version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration, and mapping. The decomposition of the U-BRAIN algorithm requires the design of a communication protocol among the processors involved; efficient synchronization design is also discussed.

Conclusions

In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the Intel Xeon E7xxx and E5xxx families of the CRESCO infrastructure of the Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation minimizes both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice-junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset), and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach their best values at about 90 processors; beyond that point, the parallelization advantage is offset by the greater cost of non-local communication between the processors.
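The existence of such an optimal processor count can be illustrated with a toy performance model (our illustration, not the paper's measurements): if serial work W is split over p processors with a per-processor communication overhead c, the run time is roughly T(p) = W/p + c·(p-1), so the speed-up W/T(p) peaks near p = sqrt(W/c) and then declines as non-local communication dominates.

```python
# Toy speed-up model with communication overhead (illustrative only):
# T(p) = work/p + comm_cost*(p-1), speedup(p) = work / T(p).
# The speed-up rises, peaks near sqrt(work/comm_cost), then falls off.

def speedup(work, comm_cost, p):
    """Speed-up over the serial run under the simple overhead model."""
    return work / (work / p + comm_cost * (p - 1))

def best_proc_count(work, comm_cost, max_p=512):
    """Processor count maximizing the modelled speed-up."""
    return max(range(1, max_p + 1),
               key=lambda p: speedup(work, comm_cost, p))
```

For example, with work = 8100 and unit communication cost, the modelled optimum is exactly 90 processors, echoing the behaviour reported for IPDATA.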
A similar behaviour is evident on HS3D, but at a greater number of processors, showing the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results show that the parallel version is up to 30 times faster than the serial one.
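The four-stage SPMD decomposition mentioned in the Results can be sketched as follows. This is a generic illustration of the partitioning/mapping pattern an MPI/MPJ program uses, not the paper's actual protocol: every process runs the same code and derives its own contiguous block of work from its rank.

```python
# Illustrative SPMD-style block partitioning (a sketch, not U-BRAIN's actual
# decomposition): the "partitioning" and "mapping" stages assign each rank a
# near-equal contiguous slice; remainders go to the lowest-ranked processes,
# so the load imbalance is at most one item.

def block_range(n_items, rank, n_procs):
    """Half-open interval [start, end) of items owned by this rank."""
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size

# In an MPJ program each rank would then execute something like:
#   lo, hi = block_range(n_instances, rank, size)
#   local_result = process(instances[lo:hi])
#   # "communication" stage: an Allreduce-style exchange merges local
#   # results so every rank sees the same global state.
```

For 10 items over 3 processes this yields the disjoint blocks (0,4), (4,7), (7,10), which together cover all items exactly once.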
