An efficient and extensible approach for compressing phylogenetic trees
  • Authors: Suzanne J Matthews (1)
    Tiffani L Williams (1)
  • Journal: BMC Bioinformatics
  • Publication year: 2011
  • Publication date: December 2011
  • Volume: 12
  • Issue: 10-supp
  • Full-text size: 289KB
  • Author affiliations: Suzanne J Matthews (1)
    Tiffani L Williams (1)

    1. Department of Computer Science and Engineering, Texas A&M University, College Station, Texas, USA
  • ISSN: 1471-2105
Abstract
Background: Biologists require new algorithms to efficiently compress and store their large collections of phylogenetic trees. Our previous work showed that TreeZip is a promising approach for compressing phylogenetic trees. In this paper, we extend the TreeZip algorithm to handle trees with weighted branches. Furthermore, using the compressed TreeZip file as input, we have designed an extensible decompressor that can extract subcollections of trees, compute majority and strict consensus trees, and merge tree collections using set operations such as union, intersection, and set difference.

Results: On unweighted phylogenetic trees, TreeZip reduces the size of Newick files by more than 98%. On weighted phylogenetic trees, TreeZip reduces the size of a Newick file by at least 73%. TreeZip can be combined with 7zip with little overhead, allowing space savings in excess of 99% (unweighted) and 92% (weighted). Unlike TreeZip, 7zip is not immune to branch rotations, and it performs worse as the level of variability in the Newick string representation increases. Finally, since the TreeZip compressed text (TRZ) file contains all the semantic information in a collection of trees, we can easily filter and decompress a subset of trees of interest (such as the set of unique trees), or build the resulting consensus tree in a matter of seconds. We also show the ease with which set operations can be performed on TRZ files, at speeds faster than those achieved on Newick or 7zip-compressed Newick files, and without loss of space savings.

Conclusions: TreeZip is an efficient approach for compressing large collections of phylogenetic trees. The semantic and compact nature of the TRZ file allows it to be operated upon directly and quickly, without a need to decompress the original Newick file. We believe that TreeZip will be vital for compressing and archiving trees in the biological community.
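
The abstract's point about branch rotations can be illustrated with a short sketch. The same tree can be written as many different Newick strings, which hurts a general-purpose text compressor, whereas a representation based on bipartitions (the taxon splits induced by internal branches) is identical for all of them. The Python below is only an illustration under stated assumptions: it is not the TreeZip algorithm or the TRZ format, it handles unweighted Newick strings only, and every function name (parse_newick, bipartitions, collection, majority_splits) is hypothetical.

```python
from collections import Counter

def parse_newick(s):
    """Parse an unweighted Newick string into nested tuples of leaf names."""
    s = "".join(s.split()).rstrip(";")  # drop whitespace and the final ';'
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while s[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]               # a leaf name

    return node()

def leaves(t):
    """Return the set of leaf names below a node."""
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))

def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree, ignoring child order.

    Each split is stored as the side that does NOT contain the
    alphabetically smallest taxon, so both orientations map to one key.
    """
    taxa = leaves(tree)
    anchor = min(taxa)
    splits = set()

    def visit(t):
        if isinstance(t, str):
            return
        side = leaves(t)
        if anchor in side:
            side = taxa - side
        if 1 < len(side) < len(taxa) - 1:  # skip trivial splits
            splits.add(side)
        for child in t:
            visit(child)

    visit(tree)
    return frozenset(splits)

def collection(newick_strings):
    """Represent a tree collection as a set of unique topologies."""
    return {bipartitions(parse_newick(s)) for s in newick_strings}

def majority_splits(newick_strings):
    """Return bipartitions found in more than half of the trees."""
    trees = [bipartitions(parse_newick(s)) for s in newick_strings]
    counts = Counter(split for t in trees for split in t)
    return {split for split, n in counts.items() if n > len(trees) / 2}

# Made-up example data: the second string is a rotated copy of the first.
run1 = ["((A,B),(C,(D,E)));", "(((E,D),C),(B,A));", "((A,C),(B,(D,E)));"]
run2 = ["(((E,D),B),(C,A));", "((A,D),(B,(C,E)));"]

unique1, unique2 = collection(run1), collection(run2)
print(len(unique1))             # unique topologies in run1
print(len(unique1 | unique2))   # union of the two collections
print(len(unique1 & unique2))   # intersection
print(len(unique1 - unique2))   # set difference
print(majority_splits(run1))    # splits of the majority consensus tree
```

Because each tree is reduced to a set of splits, filtering for unique trees, taking unions and intersections of collections, and finding the splits of a majority consensus tree all become ordinary set and counting operations that never consult the original Newick text.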
