Next generation sequencing reads comparison with an alignment-free distance
  • Authors: Emanuel Weitschek (10) (9)
    Daniele Santoni (10)
    Giulia Fiscon (10) (11)
    Maria Cristina De Cola (12)
    Paola Bertolazzi (10)
    Giovanni Felici (10)

    10. Institute of Systems Analysis and Computer Science "A. Ruberti", National Research Council, Via dei Taurini 19, 00185 Rome, Italy
    9. Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00146 Rome, Italy
    11. Department of Computer, Control, and Management Engineering "Antonio Ruberti", Viale Ariosto 25, 00185 Rome, Italy
    12. IRCCS Centro Neurolesi "Bonino-Pulejo", S.S.113 Via Palermo C/da Casazza, 98123 Messina, Italy
  • Keywords: Sequence analysis; Next generation sequencing; Alignment-free
  • Journal: BMC Research Notes
  • Publication year: 2014
  • Publication date: December 2014
  • Year: 2014
  • Volume: 7
  • Issue: 1
  • Full text size: 1,457 KB
  • Journal subjects: Biomedicine, general; Medicine/Public Health, general; Life Sciences, general
  • Publisher: BioMed Central
  • ISSN: 1756-0500
Abstract
Background: Next Generation Sequencing (NGS) machines extract a large number of short DNA fragments (reads) from a biological sample. These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, and mutation analysis.
Methods: We propose a method to evaluate the similarity between reads. The method does not rely on aligning the reads; it is based on the distance between the frequencies of their fixed-length substrings (k-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods: Needleman-Wunsch and Blast. The comparison rests on a simple assumption: the most accurate distance is the one obtained when the reference sequence is known in advance. Therefore, we first align the reads to the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal distance. We then verify how well the alignment-free and the alignment-based distances reproduce this ideal distance. The ability to reproduce the ideal distance correctly is evaluated over samples of read pairs from Saccharomyces cerevisiae, Escherichia coli, and Homo sapiens. The comparison is based on the correctness of threshold predictors cross-validated over different samples.
Results: We present experimental evidence that the proposed alignment-free distance is a potentially useful read-to-read distance measure and performs better than the more time-consuming distances based on alignment.
Conclusions: Alignment-free distances may be used effectively for reads comparison, and may provide a significant speed-up in several processes based on NGS sequencing (e.g., DNA assembly, reads classification).
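The two quantities described in the Methods can be illustrated with a minimal Python sketch. The abstract does not state the exact distance function or the value of k, so the Euclidean distance, the default k=4, and the function names kmer_frequencies, kmer_distance, and reference_overlap are assumptions made here for illustration only, not the authors' implementation.

```python
import math
from collections import Counter


def kmer_frequencies(read, k):
    """Relative frequencies of all overlapping k-mers in a read."""
    read = read.upper()
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts.values())
    # A read shorter than k has no k-mers and yields an empty profile.
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def kmer_distance(read_a, read_b, k=4):
    """Alignment-free distance between two reads.

    Assumption: Euclidean distance between k-mer frequency vectors;
    the abstract only says "distance between the frequencies".
    """
    fa = kmer_frequencies(read_a, k)
    fb = kmer_frequencies(read_b, k)
    kmers = set(fa) | set(fb)
    return math.sqrt(sum((fa.get(x, 0.0) - fb.get(x, 0.0)) ** 2 for x in kmers))


def reference_overlap(start_a, len_a, start_b, len_b):
    """Number of reference positions shared by two aligned reads.

    Assumption: each read is described by its 0-based start position on
    the reference and its aligned length, with no gaps.
    """
    end_a, end_b = start_a + len_a, start_b + len_b
    return max(0, min(end_a, end_b) - max(start_a, start_b))


if __name__ == "__main__":
    a = "ACGTACGTTGCA"
    b = "CGTACGTTGCAA"
    print(kmer_distance(a, b, k=3))
    print(reference_overlap(0, 12, 1, 12))
```

Under these assumptions, a small kmer_distance is expected to go with a large reference_overlap; a threshold on the distance could then be used to predict whether two reads overlap, which is the kind of threshold predictor the abstract evaluates.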
