Evaluating the Quantitative Capabilities of Metagenomic Analysis Software
  • Authors: Csaba Kerepesi; Vince Grolmusz
  • Keywords: Environmental microbiology; Genomics; Microbiology
  • Journal: Current Microbiology
  • Publication date: May 2016
  • Volume: 72
  • Issue: 5
  • Pages: 612-616
  • Author affiliations: Csaba Kerepesi (1)
    Vince Grolmusz (1) (2)

    1. PIT Bioinformatics Group, Eötvös University, Pázmány Péter stny. 1/C, Budapest, 1117, Hungary
    2. Uratim Ltd., Budapest, 1118, Hungary
  • Journal category: Biomedical and Life Sciences
  • Journal subjects: Life Sciences
    Microbiology
    Biotechnology
  • Publisher: Springer New York
  • ISSN: 1432-0991
Abstract
DNA sequencing technologies are now widely used to describe metagenomes, i.e., microbial communities in environmental or clinical samples, without the need for culturing them. These technologies usually return short DNA reads (100–300 base pairs long), which are processed by metagenomic analysis software that assigns phylogenetic composition information to the dataset. Here we evaluate three metagenomic analysis tools (AmphoraNet, a webserver implementation of AMPHORA2; MG-RAST; and MEGAN5) for their ability to assign quantitative phylogenetic information to the data, that is, to describe the relative frequency of the microorganisms belonging to each taxon in the sample. The difficulty of the task arises from the fact that, for the same number of genome copies, a longer genome produces more reads than a shorter one, so some software assigns higher frequencies to species with longer genomes than to those with shorter ones. This phenomenon is called the "genome length bias." Dozens of complex artificial metagenome benchmarks can be found in the literature. Because of the complexity of those benchmarks, it is usually difficult to judge how resistant a metagenomic tool is to the genome length bias. Therefore, we have made a simple benchmark for evaluating "taxon counting" in a metagenomic sample: we took the same number of copies of three full bacterial genomes of different lengths, broke them up randomly into short reads with an average length of 150 bp, and mixed the reads. Because of its simplicity, the benchmark is not intended to serve as a mock metagenome, but if a software tool fails on this simple task, it will surely fail on most real metagenomes. We applied the three tools to the benchmark. The ideal quantitative solution would assign the same proportion to each of the three bacterial taxa.
We found that AMPHORA2/AmphoraNet gave the most accurate results, while the other two tools under-performed: they assigned each short read quite reliably to its respective taxon, and so their read counts reproduced the typical genome length bias. The benchmark dataset is available at http://pitgroup.org/static/3RandomGenome-100kavg150bps.fna.
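
The genome length bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' benchmark script and does not model any of the three evaluated tools. It assumes three hypothetical genome lengths (the record does not name the genomes used), draws read lengths uniformly from 100 to 200 bp so that the mean is 150 bp, and works with lengths only rather than real sequences. It compares the share of reads per taxon with a length-normalized share, which is one simple way to recover equal proportions when genome lengths are known.

```python
import random


def count_reads(genome_length, copies, rng):
    """Cut `copies` copies of a genome into consecutive reads and count them.

    Read lengths are drawn uniformly from 100..200 bp (mean 150 bp). This
    distribution is an assumption; the abstract only states the average.
    The last read of each copy may be shorter, which does not affect the count.
    """
    reads = 0
    for _ in range(copies):
        position = 0
        while position < genome_length:
            position += rng.randint(100, 200)
            reads += 1
    return reads


def proportions(values):
    """Rescale a dict of non-negative numbers so that the values sum to 1."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


# Hypothetical genome lengths in bp; NOT the genomes used in the benchmark.
GENOME_LENGTHS = {"taxon_A": 100_000, "taxon_B": 200_000, "taxon_C": 400_000}
COPIES = 5  # the same number of copies for every genome

rng = random.Random(0)  # fixed seed so the run is repeatable
read_counts = {
    name: count_reads(length, COPIES, rng)
    for name, length in GENOME_LENGTHS.items()
}

# Naive read counting: the share of reads tracks genome length.
read_share = proportions(read_counts)

# Length-normalized counting: reads per base pair, rescaled to sum to 1.
normalized_share = proportions(
    {name: read_counts[name] / GENOME_LENGTHS[name] for name in GENOME_LENGTHS}
)

for name in GENOME_LENGTHS:
    print(
        f"{name}: read share {read_share[name]:.3f}, "
        f"length-normalized share {normalized_share[name]:.3f}"
    )
```

With equal copy numbers, the read shares come out roughly proportional to the genome lengths, while the length-normalized shares are close to one third each, which is the ideal result named in the abstract. Real samples rarely come with known genome lengths, so this normalization is only an illustration of the effect, not a description of how any of the tools work.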