A survey of genome sequence assembly techniques and algorithms using high-performance computing
详细信息    查看全文
  • 作者:Munib Ahmed (1)
    Ishfaq Ahmad (1)
    Mohammad Saad Ahmad (1)
  • 关键词:Parallel and distributed computing ; Whole genome sequence assembly ; Sequence alignment
  • 刊名:The Journal of Supercomputing
  • 出版年:2015
  • 出版时间:January 2015
  • 年:2015
  • 卷:71
  • 期:1
  • 页码:293-339
  • 全文大小:930 KB
  • 参考文献:1. Ahmed M, Ahmad I, Khan S (2011) A theoretical analysis of scalability of the parallel genome assembly algorithms. In: Second international conference on bioinformatics models, methods and algorithms. pp 234鈥?37
    2. Ahmed M, Ahmad I, Khan S (2011) A comparative analysis of approaches to parallel genome assembly. J Interdiscipl Sci Comput Life Sci 3:1鈥?. doi:10.1007/s12539-011-00 CrossRef
    3. Ahmed M, Ahmad M, Ahmad I (2008) A multi-pronged parallel approach to enhance speed and accuracy of sequence assembly. In: Biotechnology and bioinformatics symposium
    4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389鈥?402 CrossRef
    5. Aluru S, Futamura N, Mehrotra K (2003) Parallel biological sequence comparison using prefix computations. J Parallel Distrib Comput 63(3):264鈥?72 CrossRef
    6. Bao Z, Eddy S (2002) Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 8:1269鈥?276 CrossRef
    7. Batzoglou S, Jaffe D, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov J, Lander E (2002) Arachne: a whole-genome shotgun assembler. Genome Res 12(1):177鈥?89 CrossRef
    8. Berger M, Munson P (1991) A novel randomized iterative strategy for aligning multiple protein sequences. CABIOS 7:479鈥?84
    9. Blackshields G, Wallace I, Larkin M, Higgins D (2006) Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol 6:321鈥?39
    10. Blazewicz J, Figlerowicz M, Jackowiak P, Janny D, Jarczynski D, Kasprzak M, Nalewaj M, Nowierski B, Styszynski R, Szajkowski L, Widera P (2004) Parallel DNA sequence assembly. In: Proceedings of the fifth Mexican international conference in computer science (ENC 鈥?4). IEEE Computer Society, New York, pp 378鈥?82
    11. Brudno M, Batzoglou S (2004) ProbCons: Probabilistic consistency based multiple alignment of amino acid sequences. In: Proceedings of nineteenth national conference on artificial intelligence. pp 703鈥?08
    12. Chao K, Pearson W, Miller W (1992) Aligning two sequences within a specified diagonal band. Comput Appl Biosci 8:481鈥?87
    13. Cheetham J, Dehne F, Pitre S, Rau-Chaplin A, Taillon P (2003) Parallel CLUSTAL W for PC clusters. In: International conference on computational science and its applications. Lecture notes in computer science, vol 2668. pp 300鈥?09
    14. Darling A, Carey L, Feng W (2003) The design, implementation, and evaluation of mpiBLAST. In: Fourth international conference on Linux clusters: the HPC revolution 2003 in conjunction with The ClusterWorld Conference & Expo
    15. Deng X, Li E, Shan J, Chen W (2006) Parallel implementation and performance characterization of MUSCLE. In: Parallel and distributed processing symposium
    16. Dovichi N, Zhang J (2000) How capillary electrophoresis sequenced the human genome. Angew Chemie Int Edition 39:4463鈥?468 CrossRef
    17. Du Z, Lin F (2006) pNJTree: a parallel program for reconstruction of neighbor-joining tree and its application in ClustalW. J Parallel Comput 32:5鈥? CrossRef
    18. Ebedes J, Datta A (2004) Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics 20(7):1193鈥?195
    19. Edgar R (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792鈥?797 CrossRef
    20. Edgar R, Myers E (2005) PILER: identification and classification of genomic repeats. Bioinformatics 1 21(Supplement 1):i152鈥搃158
    21. Essoussi N, Boujenfa K, Limam M (2008) A comparison of MSA tools. Bioinformatics 2:452鈥?55
    22. Ewing B, Hillier L, Wendl M, Green P (1998) Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8(3):175鈥?85 CrossRef
    23. Felsenfeld A, Peterson J, Schloss J, Guyer M (1999) Assessing the quality of the DNA sequence from the human genome project. Genome Res 9:1鈥?
    24. Grama A, Gupta A, Kumar V (1993) Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distrib Technol 1(3):12鈥?1 CrossRef
    25. Green P (1996) http://bozeman.mbt.washington.edu/phrap.docs/phrap.html. Accessed 19 Sep 2014
    26. Gordon D, Abajian C, Green P (1998) Consed: a graphical tool for sequence finishing. Genome Res 8:195鈥?02 CrossRef
    27. Gusfield D (1997) Algorithms on strings, trees and sequences. Cambridge University Press, Cambridge, pp 9鈥?0
    28. Higgins D, Sharp P (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237鈥?4 CrossRef
    29. Higgins D (1994) CLUSTAL V: multiple alignment of dna and protein sequences. Methods Mol Biol 25:307鈥?18
    30. Hirosawa M, Totoki Y, Hoshida M, Ishikawa M (1995) Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11:13鈥?8
    31. Huang X, Wang J, Aluru S, Yang S, Hillier L (2003) PCAP: a whole-genome assembly program. Genome Res 13:2164鈥?170 CrossRef
    32. Huang X, Madan A (1999) CAP3: a DNA sequence assembly program. Genome Res 9(9):868鈥?77 CrossRef
    33. Isokawa M, Wayama M, Shimizu T (1996) Multiple sequence alignment using a genetic algorithm. Genome Inform 7:176鈥?77
    34. Jeanmougin F, Thompson J, Gouy M, Higgins D, Gibson T (1998) Multiple sequence alignment with clustal X. Trends Biochem Sci 23:403鈥?05 CrossRef
    35. Johnson D, Metaxas P (1997) Connected components in O(log3/2n) parallel time for the CREW PRAM. J Comput Syst Sci 54(2):227鈥?42
    36. Kalyanaraman A, Kothari S, Brendel V, Aluru S (2003) Efficient clustering of large EST data sets on parallel computers. Nucleic Acids Res 31(11):2963鈥?964 CrossRef
    37. Kalyanaraman A, Aluru S, Brendel V, Kothari S (2003) Space and time efficient parallel algorithms and software for EST clustering. IEEE Trans Parallel Distrib Syst 14:1209鈥?221 CrossRef
    38. Karl JA, Wiseman RW, O鈥機onnor DH (2009) Cost-effective sequence-based nonhuman primate MHC class I genotyping from RNA. Methods 49(1):11鈥?7. doi:10.1016/j.ymeth.2009.05.002
    39. Larkin M, Blackshields G, Brown N, Chenna R, McGettigan P, McWilliam H, Valentin F, Wallace A, Wilm R, Lopez R, Thompson J, Gibson T, Higgins D (2007) Clustal W and clustal X version 2.0. Bioinformatics 23:2947鈥?948 CrossRef
    40. Lee Z, Su S, Chuang C, Liu K (2008) Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment. Appl Soft Comput 8:55鈥?8 CrossRef
    41. Li K (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19(12) :1585鈥?586
    42. Li R, Zhu H, Ruan J, Qian W, Li S, Yang H, Wang J (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20(2):265鈥?72 CrossRef
    43. Lipman D, Altschul S, Kececioglu D (1989) A tool for multiple sequence alignment. Proc Natl Acad Sci 86:4412鈥?415 CrossRef
    44. Liu X, Pande P, Meyerhenke H, Bader D (2013) PASQUAL: parallel techniques for next generation genome sequence assembly. IEEE Trans Parallel Distrib Syst 24(5):977鈥?86 CrossRef
    45. Luo J, Ahmad I, Ahmed M (2005) Parallel multiple sequence alignment using dynamic scheduling. In: International conference on information technology: coding and computing, vol 1. pp 8鈥?3
    46. Mardis E (2008) Next-generation DNA sequencing methods. Ann Rev Genomics Hum Genet 9:387鈥?02 CrossRef
    47. Martins W, Cuvillo J, Francisco B, Theobald J, Gao G (2001) A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison. In: Proceedings of the Pacific symposium on biocomputing. pp 311鈥?32
    48. Miller P, Nadkarni P, Carriero N (1991) Parallel computation and FASTA: confronting the problem of parallel database search for a fast sequence comparison algorithm. Comput Appl Biosci 7(1):71鈥?8
    49. Mullikin J, Ning Z (2003) The Phusion assembler. Genome Res 1:81鈥?0 CrossRef
    50. Myers E, Sutton G, Smith H, Adams M, Venter J (2002) On the sequencing and assembly of the human genome. Proc Natl Acad Sci 99(7):4145鈥?146 CrossRef
    51. Needleman S, Wunsch C (1970) A general method applicable to the search for similarities in the amino acid sequence of two sequences. J Mol Biol 48:443鈥?53 CrossRef
    52. Notredame C, Higgins D, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205鈥?17 CrossRef
    53. Ogden T, Rosenberg M (2006) Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol 56(2):314鈥?28 CrossRef
    54. Pevzner P, Tang H, Waterman S (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98(17):9748鈥?753 CrossRef
    55. Pevzner P, Tang H, Tesler G (2004) De novo repeat classification and fragment assembly. Genome Res 14(9):1786鈥?796 CrossRef
    56. Porreca G (2010) Genome sequencing on nanoballs. Nat Biotechnol 28(1):43鈥?4 CrossRef
    57. Prism ABIABI (1996) DNA sequencing analysis software. In: User鈥檚 manual, PE Applied Biosystems, Foster City
    58. Ralston A (1982) De Bruijn sequences鈥攁 model example of the interaction of discrete mathematics and computer science. Math Magaz 55:131鈥?43 CrossRef
    59. Ronaghi M, Uhlen M, Nyren P (1998) A sequencing method based on real-time pyrophosphate. Science 281(5375):363 CrossRef
    60. Rusk N (2011) Torrents of sequence. Nat Methods 8(1):44鈥?4
    61. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol 4:406鈥?25
    62. Sanger F, Nicklen S, Coulson A (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci 74:5463鈥? CrossRef
    63. Simpson J, Wong K, Jackman S, Schein J, Jones S, Birol I (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19:1117鈥?123 CrossRef
    64. Shi W, Zhou W (2005) A parallel Euler approach for large-scale biological sequence assembly. In: Proceedings of the third international conference on information technology and applications
    65. Smit A, Hubley R, Green P (1996鈥?010) RepeatMasker Open-3.0. http://www.repeatmasker.org. Accessed 20 Sep 2014
    66. Smith T, Waterman M (1981) Identification of common molecular subsequences. J Mol Biol 147:195鈥?97 CrossRef
    67. Southern E (1975) Detection of specific sequences among DNA fragments separated by gel electrophoresis. J Mol Biol 98:503鈥?17 CrossRef
    68. Sutton G, White O, Adams M, Kerlavage A (1995) TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Technol 1(1):9鈥?9 CrossRef
    69. Thompson J, Plewniak F, Poch O (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 27(13):12682鈥?690 CrossRef
    70. Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, Malek J, Costa G, McKernan K, Sidow A, Fire A, Johnson S (2008) A high-resolution nucleosome position map of C. Elegans reveals a lack of universal sequence-dictated positioning. Genome Res 18(7):1051鈥?063 CrossRef
    71. Venter J, Adams M, Myers E (2001) The sequence of the human genome. Science 16(291):1304鈥?351 CrossRef
    72. Volfovsky N, Haas B, Salzberg S (2001) A clustering method for repeat analysis in DNA sequences. Genome Biol 2(8)
    73. Watson J, Crick F (1953) Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171:737鈥?38 CrossRef
    74. Yap T, Munson P, Frieder O, Martino R (1995) Parallel multiple sequence alignment using speculative computation. In: Proceedings of the international conference on parallel processing
    75. Yap T, Frieder O, Martino R (1998) Parallel computation in biological sequence analysis. IEEE Trans Parallel Distrib Syst 9(3) :283鈥?94
    76. Zhang C, Wong A (1997) A genetic algorithm for multiple molecular sequence alignment. Comput Appl Biosci 13(6):565鈥?81
    77. Zhao F, Li T, Bryant D (2008) A new pheromone trail-based genetic algorithm for comparative genome assembly. Nucleic Acids Res 36(10):3455鈥?462 CrossRef
    78. Zola J, Yang X, Rospondek S, Aluru S (2007) Parallel T-Coffee: a parallel multiple sequence aligner. In: Proceedings of international society for computers and their applications, parallel and distributed computing systems. pp 248鈥?53
  • 作者单位:Munib Ahmed (1)
    Ishfaq Ahmad (1)
    Mohammad Saad Ahmad (1)

    1. University of Texas at Arlington, Arlington, USA
  • 刊物类别:Computer Science
  • 刊物主题:Programming Languages, Compilers and Interpreters
    Processor Architectures
    Computer Science, general
  • 出版者:Springer Netherlands
  • ISSN:1573-0484
文摘
Genome assembly has been an area of active research since the DNA structure was discovered and has gathered more steam after the Human Genome project was launched. A large number of genomes have been assembled and many more are in the pipeline. A number of full-scale assemblers and other special-purpose modules have been reported. Since the volume of data involved in the genome assembly process is extraordinarily large and requires significantly large computational power and processing time, many assemblers have utilized parallel computing to achieve faster and more efficient reconstruction of the DNA. A genome assembler is a multi-step process including different components that may be partly or fully parallelized. Although several assemblers and individual modules that perform various tasks, such as pairwise alignment, multiple sequence alignment, and repeat finding, have been analyzed and documented before, this paper provides a holistic view of the assembly process in the realm of parallel and distributed computing, streamlining all the individual tasks related, but not limited to, the whole genome shotgun sequencing into a sequence of loosely coupled stages where one stage consumes the output of the preceding stage and passes its results to the next one. Many of these tasks are essential to the current and next-generation sequence assemblers. The paper walks through the entire streamlined process while describing, analyzing, and commenting on the algorithms and techniques that have been designed and implemented for each of the stages. Where applicable, the paper suggests improvements that may form the basis of potentially new research work.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700