SeqWare Query Engine: storing and searching sequence data in the cloud
  • Authors: Brian D O’Connor (1)
    Barry Merriman (2)
    Stanley F Nelson (2)
  • Journal: BMC Bioinformatics
  • Publication year: 2010
  • Publication date: December 2010
  • Volume: 11
  • Issue: 12-supp
  • Full-text size: 472KB
  • Author affiliations:
    1. UNC Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, USA
    2. Department of Human Genetics, University of California, Los Angeles, CA 90095, USA
  • ISSN:1471-2105
Abstract
Background
Since the introduction of next-generation DNA sequencers, the rapid increase in sequencer throughput and the associated drop in costs have resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude to a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever-increasing demands.

Results
In this work, we present the SeqWare Query Engine, which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and an interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc.) with a rich level of annotations, including coverage and functional consequences. As a proof of concept, we loaded several whole-genome datasets, including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair both to profile performance and to provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net).

Conclusions
The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make the SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.
