Identification of the Species Diversity of DNA Fragments in Metagenomes
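Abstract
Continual advances in high-throughput sequencing are driving the rapid development of metagenomics: the number of metagenomes that can be sequenced keeps growing, and so does the volume of DNA sequence obtained. Analyzing and processing this mass of metagenomic DNA efficiently is a challenge for bioinformatics. A metagenome obtained by metagenomic techniques, however, is the sum of fragmentary DNA from many biological communities in an environmental sample, and the taxonomic origin of the vast majority of its sequences is unknown. Assigning these DNA fragments to their source organisms has therefore been a closely watched problem ever since metagenomics was established, and no mature solution exists to date. This greatly reduces the efficiency of metagenomic research and has become a bottleneck for the field.
     From a bioinformatics perspective, this thesis studies several key problems in data-processing systems for identifying the species diversity of DNA fragments in a metagenome. The specific research topics are as follows: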
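     (1) Extracting an optimized composition feature vector from DNA fragments
     Because of evolution, gene mutation and similar causes, microbial genomes often contain a certain proportion of DNA fragments from foreign species. These fragments act as noise and reduce the precision of the extracted numerical features. This thesis therefore proposes a new approach to extracting numerical features from DNA fragments: first filter out the foreign DNA fragments inserted into a species' genome, then extract the numerical features. Experiments show that features extracted after filtering represent the phylogenetic relationships among species more precisely than features extracted before filtering.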
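     (2) A dual-hypersphere SVDD inference model for identifying the species diversity of DNA fragments in a metagenome
     In the sequenced microbial genomes used as the training set, there is no clear boundary between within-class and between-class differences at the taxonomic levels of species, genus and even order. As a result, existing classification methods have low recognition rates at these levels. Building on the support vector data description (SVDD) algorithm and combining it with a phylogenetic tree, this thesis proposes a new dual-hypersphere SVDD inference scheme for assigning gene fragments in a metagenome to species and genus. The scheme effectively avoids some false and missed identifications and improves classification accuracy to some extent.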
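     (3) A robust support vector domain description (WSVDD) model for identifying the species diversity of DNA fragments in a metagenome
     Existing classification methods have low recognition rates at the genus level, and at the species level no method can yet classify fragments at all. Several factors are mainly responsible, such as the length of the DNA sequence, the reliability of the composition vector extracted from it, and the ability of the chosen classifier to describe the numerical feature vectors of the reference genomes. We observe that existing classification methods (for example, support vector machines, kernelized nearest neighbor and the naive Bayes classifier) cannot describe the reference data effectively when it contains noise. Yet it is well known that reference genome data (bacterial and archaeal genomes) usually contain a portion of lateral gene transfer (LGT) fragments, which act as noise and keep classifiers from reaching better accuracy. To solve this problem, this thesis improves the SVDD algorithm and proposes a robust support vector domain description (WSVDD) algorithm for identifying the biological community to which a DNA fragment belongs. It effectively prevents outliers (laterally transferred genes) from disturbing the training data and thus improves the classifier's ability to describe the data.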
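     We believe that research in this direction can advance metagenomics and related studies of biodiversity, population evolutionary relationships and functional activity. It also lays a sound theoretical foundation for developing related electronic products in future engineering practice.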
Advances in the throughput and cost-efficiency of sequencing technology are fueling a rapid increase in the number and size of the metagenomic DNA sets being generated. Bioinformatics faces the problem of how to handle and analyze these large DNA sets efficiently. Directly sequenced metagenomes are generally very complex: they often consist of DNA fragments from numerous genomes, possibly from different domains, and the origin of most of the DNA sequences is unknown. One of the major challenges in metagenomic data analysis is therefore to predict the taxonomic origin of the DNA fragments, a process called taxonomic classification or binning. Depending on the research need, binning can be performed at different taxonomic levels, from kingdom (the highest level) to species (the lowest level). Several classifiers have been developed to infer the source organism of DNA fragments from a metagenome, but most of them cannot achieve the classification accuracy required by current high-complexity metagenomic sets.
     In this thesis, we present methods, based on pattern recognition, for identifying the species diversity of variable-length DNA fragments within a metagenome. The work addresses three key points:
     (1) Extracting the optimized composition feature vector from DNA fragments
     A large number of genome sequences have been produced, and describing and distinguishing them accurately is becoming a key issue in taxonomy. We propose an efficient algorithm that filters out most horizontally transferred genome fragments and then extracts a new genome vector (GV). To demonstrate the power of the GV, we applied it to identify prokaryotes and their variable-size genome fragments. The results indicate that, once the abnormal, horizontally transferred fragments are filtered out, the new vector represents a genome well as a species tag.
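     The filter-then-extract idea in point (1) can be sketched as follows. This is a minimal illustration, not the thesis's GV algorithm: it assumes the composition vector is a normalized k-mer frequency vector, and it uses a hypothetical filter that drops the windows whose composition lies farthest from the whole-genome composition. The function names and the parameters `k`, `window` and `keep_fraction` are all assumptions made for the example.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def composition_vector(seq, k=2):
    """Return normalized k-mer frequencies in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def filtered_composition_vector(genome, k=2, window=1000, keep_fraction=0.9):
    """Average the window vectors after dropping the most atypical windows.

    Assumption: windows whose composition is farthest from the whole-genome
    composition stand in for horizontally transferred fragments.
    """
    whole = composition_vector(genome, k)
    starts = range(0, len(genome) - window + 1, window)
    windows = [genome[i:i + window] for i in starts]
    if not windows:  # genome shorter than one window: nothing to filter
        return whole
    vectors = [composition_vector(w, k) for w in windows]
    ranked = sorted(vectors, key=lambda v: euclidean(v, whole))
    kept = ranked[:max(1, int(len(ranked) * keep_fraction))]
    return [sum(column) / len(kept) for column in zip(*kept)]
```

     With `k=2` the vector has 16 components. A toy genome made of nine AT-rich windows and one GC-rich window loses its GC signal after filtering, which is the intended effect of removing atypical fragments before the features are computed.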
     (2) Taxonomic classification of metagenomic DNA fragments with the DS-Binning model
     Several classifiers have been developed to infer the source organism of DNA fragments from a metagenome. However, most existing classifiers suffer from low classification accuracy at the lower taxonomic levels. One reason is that they cannot discriminate accurately between data from different organisms, especially when the boundary between organisms is unclear. To obtain better classification accuracy, we designed the DS-Binning method to predict the taxonomic origin of metagenomic DNA fragments. The method is based on the support vector data description algorithm and the phylogenetic tree. The results indicate that it avoids some false and missed identifications.
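     For reference, the standard support vector data description on which point (2) builds encloses the training vectors of one class in a hypersphere of minimal radius. The formulation below is the textbook one. The symbols are chosen here for illustration: $x_i$ are the composition vectors of training fragments from one taxon, $a$ is the centre, $R$ the radius, $\xi_i$ the slack variables, $C$ the trade-off constant and $\phi$ an optional kernel feature map. How DS-Binning pairs two hyperspheres and consults the phylogenetic tree is not specified in this abstract and is not shown.

```latex
\min_{R,\,a,\,\xi}\; R^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
\lVert \phi(x_i) - a \rVert^2 \le R^2 + \xi_i,\qquad
\xi_i \ge 0,\quad i = 1,\dots,n.

% A new fragment z is accepted as belonging to the class when
\lVert \phi(z) - a \rVert^2 \le R^2 .
```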
     (3) Taxonomic classification of metagenomic data at the species and genus levels using the weighted SVDD (WSVDD) model
     Several composition-based methods exist. At the genus and species levels, however, most of them cannot achieve the classification accuracy required by current high-complexity metagenomic sets. This difficulty is strongly influenced by factors such as genome length, the reliability of the genome composition vector, and the ability of the classifier to describe the reference genomic data. We observed that existing composition-based classifiers (such as SVMs, kernelized nearest neighbor and the naive Bayes classifier) cannot describe the genomic data effectively in the presence of the noise associated with lateral gene transfer (LGT) in the reference genomic data. Yet reference genomic data (bacterial and archaeal genomes) are well known to contain a portion of genomic fragments from LGT, which keeps classifiers from achieving better accuracy.
     To overcome this difficulty, we present a novel strategy, based on the weighted support vector domain description (WSVDD) model, for obtaining better classification accuracy at the genus and species levels. The WSVDD model reduces the interference of LGT in the training genomic data, and therefore improves the accuracy of the classifier.
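     The effect of weighting training points can be illustrated with a much simpler stand-in than a full WSVDD solver. The sketch below does not solve the SVDD quadratic program. Instead it fits an iteratively reweighted centroid and takes a quantile of the distances as the radius, so that distant points (standing in for LGT fragments) pull the centre less. The weighting rule, the function names and the parameters `rounds` and `coverage` are assumptions made for the example, not the thesis's method.

```python
from math import sqrt


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def weighted_center(points, weights):
    """Weighted mean of the points, computed per dimension."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[d] for p, w in zip(points, weights)) / total
            for d in range(dim)]


def fit_weighted_sphere(points, rounds=10, coverage=0.9):
    """Fit a centre and radius while down-weighting points far from the centre."""
    weights = [1.0] * len(points)
    center = weighted_center(points, weights)
    for _ in range(rounds):
        dists = [distance(p, center) for p in points]
        scale = sorted(dists)[len(dists) // 2] or 1e-12  # median distance
        # Weights decay smoothly with distance, so outliers pull the centre less.
        weights = [1.0 / (1.0 + (d / scale) ** 2) for d in dists]
        center = weighted_center(points, weights)
    dists = sorted(distance(p, center) for p in points)
    # Radius that encloses roughly `coverage` of the training points.
    radius = dists[max(0, int(coverage * len(dists)) - 1)]
    return center, radius, weights


def inside(point, center, radius):
    """True when the point falls within the fitted sphere."""
    return distance(point, center) <= radius
```

     On a tight cluster plus one distant point, the unweighted mean is dragged toward the outlier, whereas the reweighted centre stays with the cluster and the outlier ends up outside the sphere.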
     We believe that this research will promote related studies of biodiversity, population and evolutionary relationships, functional activity, mutual collaboration relations, and so on. It will also lay an effective theoretical groundwork for the future development of electronic products for studying metagenomic problems.