基于智能算法的DNA聚类研究及应用

英文题名：Research and Application of DNA Clustering Algorithm Based on Intelligent Algorithm
作者：张丽
论文级别：硕士
学科专业名称：管理科学与工程
中文关键词：DNA序列 ; 图形表达 ; 微粒群算法 ; 聚类优化
英文关键词：DNA sequence ; graphical representation ; Particle Swarm Algorithm ; Clustering Optimization
学位年度：2010
导师：李章泉
学科代码：1201
学位授予单位：山东师范大学
论文提交日期：2010-05-26

摘要

随着现代生物技术的不断发展特别是人类基因组计划的实施,人们不断获取大量的基因序列数据。面对如此大量的基因序列数据,只有很少一部分基因我们己经知道它们的功能,而大部分基因的功能还是未知的。数据挖掘中的聚类技术正是能够对大量基因数据进行分析的技术。通过聚类技术将这些基因序列进行聚类,得到一些聚在一起的类。由于同一类中的基因序列具有相似的功能,这样我们就可以利用同一类中己知功能的基因推测同一类中未知功能基因的功能。目前生物信息领域的研究中,聚类分析已经得到了广泛的应用。其中生物序列聚类的关键问题就是如何刻画序列间的相似性。而生物序列数据本身的线性排列表示有时难以体现序列间的相似程度,使得在某些情况下,一些相似性度量失效,从而影响了聚类结果的质量。所以如果完全从序列本身出发设计相似性度量,将不能得到符合真实生物学观测的聚类结果,为DNA序列的进化研究带来了一定的困难。伴随着DNA序列图形表达的研究的深入,Randic等人首先提出利用DNA序列的图形表达来研究序列的聚类问题的思想。本文利用这种思想,借助DNA序列的图形表达所抽取的数学特征对序列进行聚类。本文参考已有的基于碱基对称性的DNA序列的二维图形表达方法,做了相应的改进,提出一种新的图形表达的方法。使得改进后的图形表达方法更加节省空间,而且能够更加清楚的体现出DNA序列的生物学特征。利用这种方法,把每个DNA序列按照三组映射规则,转化成三条二维曲线,然后从曲线中提取特征矩阵,最后利用矩阵的不变量对DNA序列进行聚类研究,这样,一条DNA序列就被转化成一个多维数据对象。因此,对DNA序列的聚类问题就转化成对多维数据的聚类了。
现有的对多维数据进行聚类的常用聚类算法,通常需要事先给定聚类数k。但在大多数情况下,聚类数k事先无法确定,因此需要对最佳聚类数k进行优化处理。本文采用基于微粒群算法的聚类算法。为了解决微粒群聚类算法无法确定聚类数k的现象,通过k均值算法的引入,实现最佳聚类数k的求解和聚类有效性函数的构造,试验证明引入类间距离的聚类有效性检测函数对最佳聚类数判别科学,同时由于检测函数中类间距离权重的引入使该检测函数可以更好的应用于现实数据分析。
With the continuous development of modern biological technology, especially the implement of the Human Genome Project, people have gradually acquired quantities of gene sequences data. Faced with such a large number of genetic sequence data, only a small part of them we have already known their functions, but most of the gene function is unknown. The clustering technology of Data mining is the technology capable of analysising a large number of gene data. Therefore, by clustering technology, these gene sequences are clustered, and we get some classes. because the gene sequences from one class have similar functions, So that, we can speculate the functions of unknown gene sequences using the known ones. The current research in the field of bioinformatics, clustering analysis has been widely used. The key question of clustering of biological sequences is how to characterize the similarity between sequences. The linear arrangement of the biological sequence data itself is sometimes difficult to reflect the degree of similarity, so in some cases, some similarity measure failure. Thus, affecting the quality of clustering results. Therefore, if the similarity measure designed starting entirely from the sequence itself, it will not get the real clustering results up to the biological observations, It brings some difficulties to the evolution study of DNA sequences. With the deeply research of the graphical expression of DNA sequences, Randic first proposed the use of graphical expression of DNA sequences to study the clustering of gene sequences. By this idea, We can cluster the sequences by the mathematical characteristics collected by the the graphical expression of DNA sequences. referring to existing two-dimensional graphical representation based on base Symmetry, I made some improvement and give a new graphical representation method of DNA sequences. The improved method can make a more space-saving, and this method can also reflect some of the biological features of DNA sequences more clearly. So according to mapping rules, each DNA sequence is translated into three two-dimensional curves, and then extract featural matrixs from the curves, and then cluster the DNA sequences using the matrix invariant, so that, a DNA sequence is transformed into a multi-dimensional data, and the clustering of DNA sequences is transformed into the clustering of multi-dimensional data .
The existing common clustering algorithms of multi-dimensional data usually require giving the number of clusters k in advance. However, in most cases, the number of clusters k can not be determined in advance, so the best number of clusters k needs to be optimized. In this paper, I use the clustering algorithm based on particle swarm optimization. In order to solve that the clustering algorithm based on PSO can not determine the number of clusters k, by the k-means algorithm, achieve the best number of cluster k and the structuring of the cluster validity function. The testing has proved the effectiveness of cluster detection function to determine the best number of clusters, and because the introduction of the weights of classes, so that the detection function can be better applied to real data analysis.

引文

[1]陈新.生物信息学概论. http://www.chinalabnet.com/upload/03060218057633.pdf,2003.
    [2]http://www.ncgr.org/gsdb
    [3]http://www.biology.gatech.edu/bioinformatics/whatis.html
    [4]S. Muthukrishnan, Suleyman Cenk Sahinalp. (2000) Approximate Nearest Neighbors And Sequence Comparison with block operations [M].Proc of the 32nd Annual ACM Symp on Theory of Computing.
    [5]黄东,唐俊,汪卫,施伯乐.CuMen:基于最大频繁序列模式的聚类算法及其在基因拼接中的应用[J].计算机科学.2005,32(10):47-49.
    [6]B. Jerman-Blazic, I. Fabic. M. Randic , Comparison of sequences as a method for evaluation of the molecular similarity[J].J. Comput. Chem, 1986, (7): 176-188
    [7]Roskin,K.M.,Diekhans,M.and Haussler,D.(2003) Scoring two species local alignments to try to statistically separate neutrally evolving from selected dna segments[M]. In Proc. RECOMB, ACM Press, Germany, pp. 257-266.
    [8]Enright,A.J.and Ouzounis,C.A.(2000) GeneRAGE:a robust algorithm for sequence clustering and domain detection[J].Bioinformatics,16,451-457.
    [9]The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana[J]. Nature 2000, 408:796-815.
    [10]Liu,H.A.and Califano,A.CASTOR:clustering algorithm for sequence taxonomical Organization and relationships[J].Jounral of Computational Biology, 2003,10(9): 21-45.
    [11]M Randic, G Krilov. Characterization of 3-D sequences of proteins[J]. Chem Phys Lett,1997,2729(3):115-119
    [12]Feng Z P,Zhang C T.A graphic representation of protein sequence and predicting the sub celluar degeneracy[J].Chem Phys Lett,2001, 350(3): 106
    [13]Needleman S B,Wunsch C D.A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of the Protein[J].Journal of Molecular Biology,1970,48(3):443-453
    [14]Virgion M, P Argos. Determination of Reliable Regions in Protein SequenceAlignment[J].Protein Engineering,1990,3(9):565-569
    [15]Volfovsky N, Haas BJ, Salzberg SL.(2001) A clustering method for repeat analysis in DNA sequences. Genome Biology 2001, 2(8):research 0027.1-0027.11
    [16]Abascal, alencia. A Clustering of proximal sequence space for the identification of protein families[J].Bioinformatics,2002,18(8):908-921.
    [17]Ester M, Kriegel H Petal. Density-based Algorithm for Discovering Clusters in Large Spatial Databases[C]. Porland, OR: Proc. of 1996 Int. Conf on Knowledge Discovery and Data Mining, 1996.226-231.
    [18]Doulaye Dembele,FhiliPPe Kastner. Fuzzy C-menas method of clustering Microarray data[J].Bioinformatics.Vol.19 no.8 2003,P973-980.
    [19]王敞,陈增强,孙青林.基于K中心方法的氨基酸序列聚类分析[J].计算机工程, 2003,29(8):42-43.
    [20]Gusfield D.Algorithms on Strings,Trees,and Sequences. NewYork: Cambridge University Press,1997.525-532.
    [21]邓绪斌,朱扬勇.一个基于正则表达式的生物数据抽取方法[J].计算机研究与发展,2005, 42(l2):218-219.
    [22]黄东,唐俊,汪卫,施伯乐.CuMen:基于最大频繁序列模式的聚类算法及其在基因拼接中的应用[J].计算机科学.2005,32(10):47-49
    [23]Chaudhuri P, Das S. Statistical analysis of large DNA sequences using distribution of DNA words[J].Current science,2001,80(9):1161-1166.
    [24]吴君浩,骆嘉伟.利用遗传特征实现微生物基因序列聚类分析[J].计算机工程与应用。2006,20(2):31-33.
    [25]吴君浩,骆嘉伟.基于隐马尔可夫模型的二次k-均值基因序列聚类算法[J].计算机工程与科学. 2007,29(3).48-51.
    [26]吴君浩.基因序列聚类和分类研究[D].中国优秀博硕士学位论文全文数据库(硕士),湖南大学2006.
    [27]M. Randic , M. Vracko. On the similarity of DNA primary sequences[J] .J. Chem. Inf. Comput. Sci , 2000,40 :599-606
    [28]E. Hamori. Graphical representation of long DNA sequences by methods of Hcurves , current results and future aspects [J] . Bio Techniques , 1989,7:710 - 720.
    [29]Gates.M.A. (1986) A simple way to look at DNA[J]. J. Theor. Biol. 119 319-328.
    [30]Guo X F,Randic M,Basak S C.A novel 2-D graphical representation sequences of low degeneracy[J].Chem Phys Lett,2001,350:106-112
    [31]王世英,李湘露. DNA序列的一类非退化的2-D图表示.[J].中原工学院学报.2003,13(1):17-20.
    [32]Randic M,Vradko M,Zupan J.Compact 2-D graphical representation of DNA[J].Chem Phys Let,2003,373(10):558-562
    [33]M. Randic. Condensed Representation of DNA Primary Sequences[J] . J. Chem. Inf. Comput. Sci .2000,40:50-56
    [34]Liao Bo,Wang Tianming.New 2D Graphical Representation of DNA Sequence [J].Journal of Computational Chemistry,2004,25(11):1364-1368
    [35]Liao Bo.A 2D graphical representation of DNA sequence[J].Chemical Physic Letters,2005,401(12):196-199
    [36]Liao Bo,Wang T M.3-D graphical representation of DNA sequences and the numerical characterization[J].J Mol Str:THEOCHEM,2004,68(6):209-212
    [37]Zhang Y,Liao Bo,Ding K. On 3D-curves of DNA sequences[J]. Molecula Simulation,2006,32(1):29-34
    [38]Liao Bo, Zhu Wen, Liu Yang.3D graphical representation of DNA sequence without degeneracy and its application in constructing phylogenetic tree[J].Matc Commun Math Comput Chem,2006,56(8):209-216
    [39]M. Randic. On characterization of DNA primary sequences by condensed matrix[J] .Chem. Phys. Lett ,2000,317:29– 34
    [40]骆嘉伟,李仁发,张白妮.基于多维伪F统计量的基因表达动态聚类分析方法研究[J].系统仿真学报,2006,3(18):586-601
    [41]胡学钢,张圆圆.基于已发现序列模式的序列聚类研究[J].合肥工业大学学报(自然科学版).2008,31(1).70-72
    [42]黄德双,刘海燕,施蕴渝,陈国良.生物信息学中的智能计算理论与方法研究[M].合肥:中国科学技术大学出版社,2006.134-38.
    [43]潘金灯,郭腾冲,涂序彦.生物信息学中的智能模型[J].计算机工程与应用. 2003,39(28):81-84.
    [44] J Han, M Kamber.范明,孟小峰译.数据挖掘概念与技术[M].第二版.北京.机械工业出版社.2007.
    [45]张白妮,骆嘉伟,汤德佑.基于比对相似度动态矩阵聚类算法在基因序列中的应用[J].计算机应用.2004,24(8):35-37.
    [46]朱扬勇,熊贇.DNA序列数据挖掘技术[J].软件学报.2007,11(1). 20-22.
    [47]韩轶平,余杭,刘威等.序列的分类[J].数学的实践与认识.2001,31(1):38-45.
    [48]双聚类算法用于基因组序列直系同源基因聚类的研究[D].中国优秀博硕士学位论文全文数据库(博士),吉林大学.2006.
    [49]刘向东,沙秋夫,刘勇奎.基于粒子群优化算法的聚类分析[J].计算机工程, 2003, 32(6):201-202.
    [50]刘靖明,韩丽川.粒子群优化K均值的混合聚类算法研究[J].中国管理科学, 2004,12(M):96-99.
    [51]王胥鹏,胡劲松.一种改进的微粒群算法[J].计算机应用研究, 2009,13(10): 113-116.
    [52]耶刚强,孙世宇,梁彦,王睿,潘泉.基于动态粒子数的微粒群优化算法[J].信息与控制, 2008,(01) :93-99.
    [53]袁代林,程世娟,陈虬.一种新形式的微粒群算法[J].计算机工程与应用, 2008, (33): 234-239.
    [54]HU Jing-song,HU Gui-wu,WANG Jia-bing. FCMAC based on mine-sweeping strategy [C] .Proc of International Conference on Machine Learning and Cybernetics. 2005 :784-78
    [55]Ian H B. The pig as an intermediate host for influenza A viruses between birds and humans[J]. International Congress Series,2001,1219: 173-178.
    [56]Wentworth D E, Thompson B L, Xu X. An influenza A(H1N1) virus, closely related to swine influenza virus, responsible for a fatal case of human influenza[J]. Journal of Virology,1994,68: 2051-2058.
    [57]张文彤,姜庆五.全球历年人甲型流感病毒H3A1抗原的分子进化研究.中华流行病学杂志,2005,26(11):843-847.
    [58]张文彤,姜庆五,蒋露芳,等.基于基因序列聚类的甲型流感病毒H3抗原变异规律研究.中华流行病学杂志,2004,25(12):1046-1049.
    [59]Rota P A, Rocha E P, Harmon M W. Laboratory characterization of a swine influenza virus isolated from a fatal case of human influenza[J]. J Clin Microbiol, 1989, 27: 1413-1416.
    [60]Bush RM,Fitch WM,Bender CA,et al.Positive Selection on the H3 Hemagglutinin Gene of Human Influenza Virus A.Mol Biol and Evol,1999,16:1457-1465.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700