SNPest: a probabilistic graphical model for estimating genotypes
  • Authors: Stinus Lindgreen (11) (12) (13)
    Anders Krogh (11) (12)
    Jakob Skou Pedersen (11) (14)

    11. Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Ole Maaloes Vej, 2200 Copenhagen, Denmark
    12. Center of Excellence for GeoGenetics, Natural History Museum of Denmark and Department of Biology, University of Copenhagen, Oester Voldgade 5-7, 1350 Copenhagen K, Denmark
    13. Biomolecular Interaction Centre, School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch 8041, New Zealand
    14. Department of Molecular Medicine, Aarhus University Hospital, Skejby, Brendstrupgaardsvej 100, DK-8200 Aarhus N, Denmark
  • Keywords: Next-generation sequencing; SNP; Genotyping; Illumina; Ancient DNA
  • Journal: BMC Research Notes
  • Publication year: 2014
  • Publication date: December 2014
  • Volume: 7
  • Issue: 1
  • Full-text size: 457 KB
  • Journal subjects: Biomedicine, general; Medicine/Public Health, general; Life Sciences, general
  • Publisher: BioMed Central
  • ISSN: 1756-0500
Abstract
Background: As next-generation sequencing technologies become more widespread, the need for robust analysis software grows with them. A key challenge when analyzing sequencing data is predicting genotypes from the reads, i.e. correctly inferring the underlying DNA sequences that gave rise to the sequenced fragments. For diploid organisms, the genotyper should be able to predict both alleles in the individual. Variation between the individual and the population can then be analyzed by looking for SNPs (single nucleotide polymorphisms) in order to investigate diseases or phenotypic features. Robust, high-confidence genotyping and SNP calling require methods that take technology-specific limitations into account and can model different sources of error. Ancient DNA, for example, poses special challenges because the data are often shallow and subject to errors induced by post-mortem damage.
Findings: We present a novel approach to the genotyping problem in which a probabilistic framework describing the process from sampling to sequencing is implemented as a graphical model. This makes it possible to model technology-specific errors and other sources of variation that can affect the result. The inferred genotype is given a posterior probability to signify the confidence in the result. SNPest has already been used for genotyping in large-scale projects such as the first ancient human genome, published in 2010.
Conclusions: We compare the performance of SNPest to a number of other widely used genotypers on both real and simulated data, covering both haploid and diploid genomes. We investigate the effects of read depth, of removing adapters before mapping and genotyping, of using different mapping tools, and of using the correct model in the genotyping process. We show that the performance of SNPest is comparable to existing methods, and we also illustrate cases where SNPest has an advantage over other methods, e.g. when dealing with simulated ancient DNA.
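
The idea of assigning a posterior probability to each candidate genotype can be illustrated with a deliberately simplified sketch. This is not the SNPest graphical model: it assumes independent reads, a uniform genotype prior, and base-calling errors derived only from Phred quality scores, and it ignores mapping errors and post-mortem damage. All function names here are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score to an error probability.

    The value is capped at 0.75 (assumption) so that a quality of 0
    makes the read uninformative instead of producing a zero likelihood.
    """
    return min(10 ** (-q / 10), 0.75)

def read_likelihood(base, q, genotype):
    """P(observed base | genotype): pick one allele uniformly at random,
    then read it correctly with probability 1 - e, or as one of the three
    other bases with probability e / 3 each."""
    e = phred_to_error(q)
    total = sum((1 - e) if base == allele else e / 3 for allele in genotype)
    return total / len(genotype)

def genotype_posteriors(pileup, ploidy=2, prior=None):
    """Return {genotype: posterior} for one site.

    pileup is a list of (base, phred_quality) pairs covering the site.
    ploidy=1 gives a haploid model, ploidy=2 a diploid model.
    """
    genotypes = list(combinations_with_replacement(BASES, ploidy))
    if prior is None:
        # Uniform prior over genotypes (assumption, not from the paper).
        prior = {g: 1 / len(genotypes) for g in genotypes}
    log_post = {}
    for g in genotypes:
        lp = math.log(prior[g])
        for base, q in pileup:
            lp += math.log(read_likelihood(base, q, g))
        log_post[g] = lp
    # Normalize in log space to avoid underflow at high read depth.
    m = max(log_post.values())
    unnorm = {g: math.exp(lp - m) for g, lp in log_post.items()}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}

if __name__ == "__main__":
    # Five reads supporting A and five supporting G, all at Phred 30.
    pileup = [("A", 30)] * 5 + [("G", 30)] * 5
    post = genotype_posteriors(pileup, ploidy=2)
    best = max(post, key=post.get)
    print(best, round(post[best], 4))
```

With a balanced pileup like the one in the usage example, the heterozygous genotype receives almost all of the posterior mass; at low depth the posterior spreads across several genotypes, which is the kind of uncertainty a posterior-based genotyper reports.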
