TagDust2: a generic method to extract reads from sequencing data
  • Author: Timo Lassmann (1) (2)

    1. RIKEN Center for Life Science Technologies (CLST), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045, Kanagawa, Japan
    2. Telethon Kids Institute, The University of Western Australia, 100 Roberts Road, Subiaco, 6008, Western Australia, Australia
  • Keywords: Next generation sequencing; TagDust
  • Journal: BMC Bioinformatics
  • Publication year: 2015
  • Publication date: December 2015
  • Volume: 16
  • Issue: 1
  • Full text size: 791 KB
  • Journal subjects: Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
  • Publisher: BioMed Central
  • ISSN: 1471-2105
Abstract

Background: Arguably the most basic step in the analysis of next generation sequencing (NGS) data is the extraction of mappable reads from the raw reads produced by sequencing instruments. The presence of barcodes, adaptors and artifacts, all subject to sequencing errors, makes this step non-trivial.

Results: Here I present TagDust2, a generic approach that uses a library of hidden Markov models (HMMs) to accurately extract reads from a wide array of possible read architectures. TagDust2 extracts more reads of higher quality than other approaches. Multiplexed single-end and paired-end libraries, as well as libraries containing unique molecular identifiers, are fully supported. Two additional post-processing steps are included to exclude known contaminants and to filter out low-complexity sequences. Finally, TagDust2 can automatically detect the library type of sequenced data from a predefined selection.

Conclusion: Taken together, TagDust2 is a feature-rich, flexible and adaptive solution for going from raw to mappable NGS reads in a single step. The ability to recognize and record the contents of raw reads will help to automate and demystify the initial, and often poorly documented, steps in NGS data analysis pipelines. TagDust2 is freely available at: http://tagdust.sourceforge.net.
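
The idea of scoring a raw read against candidate architectures can be illustrated with a heavily simplified sketch. It is not TagDust2's algorithm: instead of a hidden Markov model it uses a substitution-only error model (no insertions or deletions), assumes a fixed "barcode + insert" layout with equal-length barcodes, and assigns a read only when the posterior probability of the best barcode is high. The function names, the error rate and the posterior threshold are all hypothetical choices made for illustration.

```python
import math


def barcode_log_likelihood(observed, barcode, error_rate=0.01):
    """Log-probability of seeing `observed` if `barcode` was sequenced.

    Assumption: errors are independent substitutions, each of the three
    wrong bases being equally likely. Indels are not modelled.
    """
    match = math.log(1.0 - error_rate)
    mismatch = math.log(error_rate / 3.0)
    return sum(match if o == b else mismatch for o, b in zip(observed, barcode))


def extract_read(raw_read, barcodes, error_rate=0.01, min_posterior=0.99):
    """Split a raw read into (barcode, insert).

    Returns (None, None) when the read is too short or when no barcode
    is a confident match under a uniform prior over barcodes.
    """
    length = len(barcodes[0])  # assumption: all barcodes share one length
    prefix = raw_read[:length]
    if len(prefix) < length:
        return None, None

    scores = {b: barcode_log_likelihood(prefix, b, error_rate) for b in barcodes}
    best = max(scores, key=scores.get)

    # Posterior of the best barcode relative to all candidates.
    total = sum(math.exp(s) for s in scores.values())
    posterior = math.exp(scores[best]) / total
    if posterior < min_posterior:
        return None, None
    return best, raw_read[length:]


if __name__ == "__main__":
    print(extract_read("ACGAGGGCCC", ["ACGT", "TGCA"]))
```

A read whose prefix is equally close to two barcodes is left unassigned rather than guessed, which mirrors the general motivation for probabilistic extraction: a confidence value decides whether a read is kept, not a fixed mismatch count.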
