并行遗传算法在DNA杂交测序中的应用研究

英文题名：Research on Parallel GA for DNA Sequencing by Hybridization
作者：袁倩倩
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：并行计算 ; 遗传算法 ; 机群 ; DNA杂交测序
英文关键词：parallel computing ; genetic algorithm ; cluster ; DNA sequencing by hybridization
学位年度：2010
导师：谢红薇
学科代码：081203
学位授予单位：太原理工大学
论文提交日期：2010-05-01

摘要

由基因序列解开一切生命的奥秘是生物信息学的研究目标,而DNA测序是了解基因结构与功能的基础,是生物信息学研究的起点。人类基因组计划的启动推进了生物信息学的飞速发展。2008年7月2日,美国能源部联合基因组研究所宣称其群体测序项目将在2009年支持DNA测序计划。上述表明,生物基因组序列的快速测序一直是一种迫切的需求。
     杂交测序是目前使用最广的一种测序方法,它分为杂交实验和序列重构两个步骤。杂交实验由于技术和条件的限制往往出现两类错误,由此得到的非理想谱集就给序列重构带来了极大的困难。已经证明含有错误的序列重构问题是NP-hard问题。
     国内外已有许多关于杂交问题的研究文献。求解杂交测序问题主要有精确算法(分支定界算法,动态规划方法等)和启发式搜索方法(禁忌搜索算法,遗传算法,蚁群算法,模拟退火算法等)两类。比较之下,精确算法只在序列长度较小时适用,而启发式搜索方法因不依赖于问题空间得到了更多的应用。
     遗传算法是基于达尔文生物进化论的自然选择学说和群体遗传学原理而开发出的一种随机全局搜索与优化的自适应智能算法。遗传算法不受搜索空间的限制性假设的约束。已验证在求解含有错误的DNA测序问题上优于其他算法。由于遗传算法固有的并行性非常适用于大规模并行计算,所以并行遗传算法往往能提高运算速度,改善求解性能。
     基于以上所述,本文的意图在于改进现有的并行遗传算法,将其应用到DNA杂交测序问题中,使得重构的DNA序列更加精确,花费的时间更少。本文首先对问题建模,利用现有的求解该问题的遗传算法模型,确定进化策略;然后改进并行遗传算法,对模型进行优化求解;最后搭建机群环境,并对计算结果进行分析讨论。本文的核心在于并行遗传算法框架的选择和迁移操作的改进两方面。本文分析了四种并行模型的通信方式,结合DNA杂交测序问题,选择使用粗粒度-主从式双层并行框架;分析了迁移操作对并行框架的影响,提出一种新的迁移策略。通过实验验证,新的迁移策略有效地减少了通信开销。
     本文的分析研究为使用并行遗传算法对复杂的生物工程实际问题进行计算提供了参照依据,对并行遗传算法的推广和更有效地应用具有实际价值。
The research target of bioinformatics is to reveal the secret of life. DNA sequencing is the foundation of finding out the structure and function of genes, so it is the research starting point of bioinformatics. The startup of human genome project rapidly advanced bioinformatics. On July 2, 2008, U.S. Department of Energy Joint Genome Institute claimed that, their projects, CSP, would support DNA Sequencing plan in 2009. These show that rapid sequencing biological genome has been an urgent demand.
     Sequencing by hybridization is the most widely used method, which is divided into hybrid experiment and sequence reconstruction. Because of the technology limitation, hybrid experiment often includes two kinds of errors. Errors will bring a lot of difficulties in the sequence reconstruction. DNA sequencing problem with errors in the hybridization data has been proved to be strongly NP-hard.
     There is much research literature about SBH problem at home and abroad. Methods which solve the SBH problem are exact algorithms (branch and bound algorithm, dynamic programming method, etc.)and heuristic methods (tabu search algorithm, genetic algorithm, ant colony algorithm, simulated annealing algorithm, etc.). The exact algorithms are used only when the length of sequence is small, however, the heuristic methods got more applications because they don’t depend on the problem space.
     Genetic algorithm is an adaptive global optimization algorithm, which simulates biological evolution process in the natural environment. Genetic algorithm is not under the constraint of the restrictive assumption from search space. It has been proved that, genetic algorithm is superior to the other algorithms in solving the DNA sequencing problems with errors. Due to the intrinsic parallel property, genetic algorithm is suitable to solve large-scale parallel computation problems. Parallel genetic algorithm can often increase speed and improve solution performance.
     Based on the above mentioned, the intention of this paper is obvious, which is to improve existing parallel genetic algorithm, and apply it in DNA sequencing problem to achieve accurate solutions and spent less time. First, the SBH problem is modeled. Then, the evolutionary strategy is determined based on existing genetic algorithm model. Next, new parallel genetic algorithm framework to optimize the solution is designed and the calculation results are discussed. Experimental results show that, the parallel genetic algorithm obtains more optimal solution than genetic algorithm in solving the SBH problem. The core points of the paper are selection of the parallel genetic algorithm framework and improvements of the migration operation. The parallel genetic algorithm has four kinds of implementation framework. Combining with DNA sequencing problem, the communication mode of four models is analyzed, and the coarse grain and master-slave double-level parallel framework is chose to use. Then, the effect of the migration on the parallel operation is considered, and a new migration policy is proposed. Experimental results show that the new migration strategy effectively reduce the communication overhead.
     Parallel genetic algorithm should be applied in solving complex biological engineering problem. The research on the parallel genetic algorithm for DNA sequencing by hybridization provides theoretical basis for the application and spreads it greatly.

引文

[1] http://www.bisti.nih.gov/docs/CompuBioDef.pdf, 2000-07-17.
    [2] Bains W, Smith GC. A novel method for nucleic sequence determination [J]. Journal of theoretical biology, 1988, 135: 303–307.
    [3] Lysov YP, Florentiev VL, Khorlin AA, et al. Determination of the nucleotide sequence of DNA using hybridization with oligonucleotides [J]. Doklady academy nauk sssr, 1988, 303: 1508–1511.
    [4] Drmanac R, Labat I, Brukner I, et al. Sequencing of megabase plus DNA by hybridization [J]. Theory of the method, genomics, 1989, 4: 114–128.
    [5] Rajkumar Buyya编,郑纬民译.高性能集群计算:结构与系统(第一卷),编程与应用(第二卷),电子工业出版社,2001,12
    [6] Xie Hongwei, Yuan Qianqian. Algorithms for DNA sequencing by hybridization: a review [C]. ICBBE2009, 2009.
    [7] Pevzner PA. l-tuple DNA sequencing: computer analysis [J]. Journal of biomolecular structure and dynamics, 1989, 7: 63–73.
    [8] Blazewicz J, Formanowicz P, Kasprzak M, et al. DNA sequencing with positive and negative errors [J]. Journal of computational biology, 1999, 6: 113–123.
    [9] Zhang JH, Wu LY, Zhang XS. Reconstruction of DNA sequencing by hybridization [J]. Bioinformatics, 2003, 19: 14–21.
    [10]张继红,吴凌云,章祥荪.允许长度估计误差的SBH最优重构问题及其算法[J].应用数学学报, 2005, 28(3): 385–395.
    [11] Blazewicz J, Formanowicz P, Kasprzak M, et al. Tabu search for DNA sequencing with false negatives and false positives [J]. European journal of operational research, 2000, 125: 257–265.
    [12] Blazewicz J, Kasprzak M. Complexity of DNA sequencing by hybridization [J]. Journal of computational biology, 2003, 135: 303–307.
    [13] Blazewicz J, Kasprzak M, Kuroczycki W. Hybrid genetic algorithm for DNA sequencing with errors [J]. Journal of heuristics, 2002, 8: 495–502.
    [14] Blazewicz J, Glover F, Kasprzak M. Evolutionary approaches to DNA sequencing with errors [J]. Annals of operations research, 2005, 138: 67–78.
    [15] Brizuela CA, Gonzalez LC, Romero HJ. An improved genetic algorithm for the sequencing by hybridization problem [C]. Lecture notes in computer science. Springer, Heidelberg, 2004: 11–20.
    [16] Endo AT. Probabilistic nucleotide assembling method for sequencing by hybridization [J]. Bioinformatics, 2004, 20(14): 2181–2188.
    [17] Bui TA, Youssef WA. An enhanced genetic algorithm for DNA sequencing by hybridization with positive and negative errors [C]. Proceedings of the GECCO2004——Genetic and evolutionary computation. Conference lecture notes in computer science, Berlin, Germany: Springer, 2004, 3013: 11–20.
    [18]陈国良,并行计算的设计与分析(修订本) [M].北京:高等教育出版社, 2002.
    [19] Foster I T. Designing and building parallel programs: concepts and tools for parallel software engineering [M]. Addison-Wesley, 1995.
    [20]余祥宣,崔国华,邹海明.计算机算法基础[M].武汉:华中科技大学出版社, 2000.
    [21] Holland J. Adaptation in natural and artificial systems [M]. Ann Arbor, Michigan: The University of Michigan Press, 1975.
    [22]蔡自兴,徐光佑.人工智能及其应用[M].北京:清华大学出版社, 2004.8.
    [23]张文修,梁怡.遗传算法的数学基础[M].西安:西安交通大学出版社, 2000.
    [24] Goldberg D. Genetic algorithms in search, optimization, and machine learning [M]. MA: Addison-Wesley, 1989.
    [25] Bethke A. Comparison of genetic algorithms and gradient-based optimizers on parallel processors: efficiency of use of processing capacity [M]. Technical Report, No.197, Ann Arbor, MI: University of Michigan, 1976.
    [26] Cantú-Paz E. A survey of parallel genetic algorithms [M]. IlliGAL Technical Report, No.97003, Urbana, IL: Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign, 1997.
    [27] Grefenstette J, Gopal R. Genetic algorithms for the traveling salesman problem [C]. Proc of the 1st ICCA, 1985: 160-168.
    [28] Enrique A, Marco T. Parallelism and evolutionary algorithms [J]. IEEE Transactions on Evolutionary Computation, vol.6, no.5, 2002: 443-462.
    [29]高家全,何桂霞.并行遗传算法研究综述[M].浙江:浙江工业大学学报, 2007(35): 56-59.
    [30] Iorio A, X Li. Parameter control within a co-operative co-evolutionary genetic algorithm [C]. Proceedings of the Seventh Conference on Parallel Problem Solving From Nature, 2002: 247-256.
    [31] Matsumura T, Kamura M, Okech J, et al. A parallel and distributed genetic algorithm on loosely coupled multiprocessor system [J]. IEICE Trans Fundam Elect ron Commun Comput Sci, 1998, 81(4): 540-546.
    [32] Sal A, Glaser H, De R. Parallel implementation of a genetic programming based tool for symbolic regression [C]. Information Processing Letters, 1998, 66(6): 299-307.
    [33] GAO J Q. A parallel hybrid genetic algorithm for solving a kind of non-identical parallel machine scheduling problems [C]. IEEE Computer Society. Proc 4th HPC-ASIA2006, Beijing: IEEE Computer Society, 2005:469-472.
    [34] YANG H T, YANG P C, HUANG C L. A parallel genetic algorithm for deterministic and stochastic labor scheduling problems: implementation on the trans-computer networks [J]. IEEE Trans Power Syst, 1997, 12(2): 661-668.
    [35] Enrique A. Parallel evolutionary algorithms can achieve super-linear performance [C]. Information processing letters, 2002 (82): 7-13.
    [36] Blazewicz J, Glover F, Kasprzak M, et al. Dealing with repetitions in sequencing by hybridization [J]. Computational Biology and Chemistry, 2006(30): 313-320.
    [37] Glover, F, Laguna M. Tabu search [M]. Blackwell Scientific Publications. Oxford, 1993: 70-150.
    [38] Oussaidene M, Chopard B, Pictet O, et al. Parallel genetic programming and its application to trading model induction [J]. Parallel Computing, 1997, 23 (8): 1183-1198.
    [39] Cantú-Paz E, Goldberg D. Predicting speedups of idealized bounding cases of parallel genetic algorithms [C]. Proc 7th Int. Conf. Genetic Algorithms, T. B?ck, Ed., 1997: 113-120.
    [40]于滨,姚宝珍,于艳弘.一种基于粗粒度-主从式的混合并行遗传算法[J].微型电脑应用, 2004, 20(9): 16-19.
    [41]郭彤城,慕春棣.自适应迁移并行遗传法在无线通信网优中的应用[J].清华大学学报, 2002, 1(9).
    [42]岳嵚.粗粒度并行遗传算法的计算性能及其应用研究[D].华中科技大学, 2008: 64-66.
    [43] Francisco H, Manuel L. Gradual distributed real-coded genetic algorithms [J]. IEEE Trans. on Evolutionary Computation, 2000, 4(1): 43-63.
    [44] Needelman, S, Wunsch, C. A general method applicable to the search for similarities of the amino acid sequence of two proteins. Proc. J. Mol. Biol. 1970, 48: 443-453.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700