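Research and Application of Gene Sequence Alignment Algorithms for SNPs

Chinese title: 基因序列比对算法在SNP中的研究及应用
English title: Gene Sequence Alignment Algorithm Research and Implement in SNP
Author: Kang Xiaojun (康晓军)
Degree level: Doctoral
Discipline: Resources and Environmental Information Engineering
Keywords: Gene expression data ; Gene clustering ; Sequence alignment ; Single-nucleotide polymorphism (SNP)
Degree year: 2011
Supervisor: He Liyuan (贺立源)
Discipline code: 071010
Degree-granting institution: Huazhong Agricultural University (华中农业大学)
Thesis submission date: 2011-06-01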

Abstract

In recent years, life science research has advanced rapidly. With the Human Genome Project (HGP) essentially complete and modern biotechnology developing quickly, the large volume of biological information now available provides a solid data foundation for uncovering the mysteries of life. As life science enters the Post-Genome Era, its focus has shifted from acquiring biological information to studying genome function and the laws governing its variation, which creates an urgent need for processing massive data sets. At the same time, revolutionary advances in computer and network technology provide strong support for such processing, and bioinformatics has developed rapidly on this basis. It is expected to contribute greatly to deciphering the genetic code, understanding the genetic basis of disease, elucidating gene function, and predicting structure and function.
     A single-nucleotide polymorphism (SNP) is a difference between DNA sequences caused by nucleotide variation in the genome during the evolution of a species. The variations mainly include base deletions, insertions, transitions, and transversions. The genetic information carried at these variant sites is one of the important factors behind some hereditary diseases and tumors, so gene mutations and SNPs play an extremely important role in biology, bioinformatics, and biomedical research.
     Biological information data takes the form of gene sequence data, and comparing sequences can reveal information about function and structure. Pairwise and multiple sequence alignment are among the current research hotspots in bioinformatics, and gene sequences are also commonly analyzed with clustering or classification algorithms. This thesis studies the analysis of SNPs in gene expression data based on sequence alignment algorithms. The main work and contributions are as follows:
     1) The thesis first introduces the basic concepts of bioinformatics and their significance, and surveys the current state of research in China and abroad.
     2) Clustering algorithms commonly used for gene expression data are studied in detail and given a preliminary experimental analysis.
     3) The current state of research on gene sequence alignment algorithms is reviewed and analyzed, providing the basis for the alignment algorithm used in this thesis.
     4) Building on this study of sequence alignment algorithms, the thesis proposes an experimental scheme for finding SNPs in massive gene sequence data. An improved version of the classic BLAST algorithm is parallelized and implemented both on a PC platform and in a high-performance cluster environment, then analyzed and tested in detail on experimental data. The experiments indicate that the scheme achieves satisfactory results in both time complexity and result quality.
     5) Based on the proposed scheme and algorithm, a sequence analysis system is designed and implemented for the Windows operating system and for cluster platforms. Its main functions include import and export of gene sequence data, SNP analysis, sequence alignment, parameter settings, result output, and color-coded viewing of results.
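
The idea of locating SNP candidates by comparing sequences can be illustrated with a minimal sketch. This is not the improved parallel BLAST described in the abstract, whose details are not given here. It substitutes a plain Needleman-Wunsch global alignment, and the function names `global_align` and `classify_variants` and the scoring values (match 1, mismatch -1, gap -2) are assumptions chosen for illustration. The sketch aligns a sequence against a reference and labels each differing column as a transition, transversion, insertion, or deletion.

```python
from typing import List, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def global_align(ref: str, seq: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2) -> Tuple[str, str]:
    """Needleman-Wunsch global alignment with a linear gap penalty.

    The scoring values are illustrative assumptions, not the thesis's settings.
    """
    n, m = len(ref), len(seq)
    # score[i][j] holds the best score for aligning ref[:i] with seq[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == seq[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a: List[str] = []
    b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if ref[i - 1] == seq[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                a.append(ref[i - 1])
                b.append(seq[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(ref[i - 1])  # reference base aligned to a gap
            b.append("-")
            i -= 1
        else:
            a.append("-")  # extra base in seq aligned to a gap
            b.append(seq[j - 1])
            j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


def classify_variants(aligned_ref: str,
                      aligned_seq: str) -> List[Tuple[int, str, str, str]]:
    """Return (reference position, ref base, seq base, kind) per differing column.

    Positions are 1-based in the ungapped reference; an insertion reports the
    position of the reference base that precedes it (0 if there is none).
    """
    variants = []
    ref_pos = 0
    for r, s in zip(aligned_ref, aligned_seq):
        if r != "-":
            ref_pos += 1
        if r == s:
            continue
        if r == "-":
            kind = "insertion"
        elif s == "-":
            kind = "deletion"
        elif {r, s} <= PURINES or {r, s} <= PYRIMIDINES:
            kind = "transition"  # purine<->purine or pyrimidine<->pyrimidine
        else:
            kind = "transversion"  # purine<->pyrimidine
        variants.append((ref_pos, r, s, kind))
    return variants


aligned = global_align("ACGTACGT", "ACGTGCGT")
print(classify_variants(*aligned))  # [(5, 'A', 'G', 'transition')]
```

Full dynamic programming takes time proportional to the product of the two sequence lengths, which is one reason heuristic seed-and-extend tools such as BLAST, and parallel versions of them, are preferred for large data sets.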
