A Species Classification Method Based on k-mer Frequency Statistics
Abstract
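Species classification in biology has developed over several hundred years. Over that time it has accumulated a detailed body of classification methods and given rise to the discipline of morphological taxonomy. The number of species that remain undiscovered or unclassified is nevertheless enormous, and traditional morphological taxonomy has hit a bottleneck in the face of such laborious work.
     As sequencing technology has advanced, the cost of DNA sequencing has begun to fall, and biologists have come to recognize that the genome sequence is the carrier of an organism's most essential characteristics, so sequence content should be applied to species classification. The basic approach currently used by bioinformaticians is to select from the whole genome a fragment with suitable properties to represent the species, and to compare species using that feature for taxonomic analysis. This technique has produced satisfactory results, but it still has certain limitations and shortcomings, and because different researchers select different fragments, it is difficult to unify classification standards.
     This thesis tries a different approach to building a standard system that unifies the sequence features of organisms. The approach rests on the observation that the frequencies of short k-mer fragments in an organism's genomic sequence are fairly stable over the course of evolution. Given this stability, we attempt to describe an organism using most of its genome sequence rather than a small part. Counting k-mer frequencies over these sequences yields a feature vector that represents the species, and this feature vector is used to classify and identify species. In this way all species can be classified under one uniform standard. We tried the method on bacteria and viruses and obtained some positive results. Classification at the genus level and above was very accurate, and results at the subspecies or variant level also reached a certain degree of precision.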
Over the past two hundred years, taxonomists have established a classification system based on anatomical features, together with very detailed classification methods. Using this system, researchers have classified a million species, but more than ten million unknown species have yet to be classified accurately. Spending so much time observing the anatomical details of each species is unrealistic. Biologists need a more efficient and more convenient way to carry out classification.
     The rapid development of genome sequencing technology gives biologists hope of solving this problem. It is now recognized that all the characteristic information of living things is contained in their gene sequences, so how to parse these sequences and apply their features to classification has become a new research focus. In cryptanalysis, the frequency of individual words is often a key to deciphering a ciphertext. By analogy, we conjecture that a short fragment composed of several bases can be viewed as a word in the gene sequence, and we study whether the frequencies of such short fragments represent the characteristic properties of a species.
     The main work of this thesis is to introduce a method for classifying species based on k-mer frequency statistics, and to describe its verification. The method applies to any species that can be sequenced, classifies quickly, and places all species under a uniform standard. The main idea is to divide a large-scale genome sequence into equal-length windows and then count the frequency of each k-mer in each window. Since the frequencies are quite conserved across most windows, we use these conserved values as the frequency features for species classification. Once the statistics for all k-mers are complete, we arrange the values in a fixed order to generate a standard feature vector that represents each species. By studying the distances between these feature vectors, we can classify species.
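The windowed counting step can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the default `k` and `window_size`, and the use of the per-window median as the "conserved value" are all assumptions made for illustration, since the abstract does not state them.

```python
from collections import Counter
from itertools import product
from statistics import median


def kmer_frequencies(window, kmers):
    """Relative frequency of each listed k-mer within one window."""
    k = len(kmers[0])
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    # k-mers containing characters other than A, C, G, T are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def feature_vector(genome, k=2, window_size=1000):
    """Summarize a genome as one value per k-mer, in a fixed k-mer order.

    Assumption: the median across windows stands in for the conserved
    frequency value; the source does not say which statistic is used.
    """
    genome = genome.upper()
    # Fixed lexicographic order so vectors from different species align.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    windows = [genome[i:i + window_size]
               for i in range(0, len(genome) - window_size + 1, window_size)]
    if not windows:
        raise ValueError("genome is shorter than one window")
    per_window = [kmer_frequencies(w, kmers) for w in windows]
    return [median(column) for column in zip(*per_window)]
```

Calling `feature_vector(sequence, k=2)` returns 16 values, one per dinucleotide. Because every species is described by the same k-mers in the same order, the resulting vectors can be compared directly.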
     We used two groups of experimental data, bacteria and viruses, to verify the validity of the new method, and compared the results with existing classifications. In the bacterial group, we selected six kinds of bacteria from three groups belonging to different orders and different genera. After de-noising, pattern generation, analysis of the vector distances between species, and classification, the results obtained were very similar to the current biological classification, indicating that the method classifies very well at the level of orders. We then tested viruses, choosing 8 viral sequences belonging to 4 different kinds. Because of the characteristics of viral sequences, we omitted the de-noising step. The final classification was consistent with the known result and was verified on a larger data set. We believe that the method is quite accurate for classification at the virus species level. We then attempted classification of viruses at a lower level. Classification on the small-scale data set was accurate, but we found that the distances between species became smaller and influenced the classification results. We verified this effect on a larger data set, and the results indicate that the method's ability to classify at the sub-species level is acceptable.
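The distance analysis between feature vectors might look like the following sketch. The abstract does not name a distance measure or grouping rule, so Euclidean distance and a nearest-neighbor lookup are assumptions here, and the function names and toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise distances for a dict mapping species name -> feature vector."""
    names = sorted(vectors)
    return {(a, b): euclidean(vectors[a], vectors[b])
            for a in names for b in names}


def nearest_neighbor(name, vectors):
    """Name of the species whose vector lies closest to the given one."""
    others = [n for n in vectors if n != name]
    return min(others, key=lambda n: euclidean(vectors[name], vectors[n]))


# Toy, made-up vectors for illustration only (not data from the thesis).
toy = {"a1": [0.10, 0.90], "a2": [0.12, 0.88], "b1": [0.80, 0.20]}
print(nearest_neighbor("a1", toy))
```

When the distances between groups shrink, as the text reports at lower taxonomic levels, a nearest-neighbor assignment of this kind becomes more sensitive to small differences in the vectors.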
     Because the method of classifying species based on k-mer frequency statistics has few restrictions, is very fast, and is accurate at the genus level and above, it may in the future become a standard method for studying a new species at the start of detailed research.