Comparing variant calling algorithms for target-exon sequencing in a large sample
  • Authors: Yancy Lo (1)
    Hyun M Kang (1)
    Matthew R Nelson (2)
    Mohammad I Othman (3)
    Stephanie L Chissoe (2)
    Margaret G Ehm (2)
    Gonçalo R Abecasis (1)
    Sebastian Zöllner (1) (4)

    1. Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA
    2. GlaxoSmithKline, Quantitative Sciences, Research Triangle Park, NC, USA
    3. Department of Ophthalmology, University of Michigan, Ann Arbor, MI, USA
    4. Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA
  • Keywords: Next-generation sequencing; Targeted sequencing; Variant calling
  • Journal: BMC Bioinformatics
  • Publication year: 2015
  • Publication date: December 2015
  • Year: 2015
  • Volume: 16
  • Issue: 1
  • Full-text size: 456 KB
  • Journal subjects: Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
  • Publisher: BioMed Central
  • ISSN: 1471-2105
Abstract
Background: Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benefit from sophisticated calling algorithms used in low-coverage sequencing studies. We evaluate the potential benefits of different calling strategies by performing a comparative analysis of variant calling methods on exonic data from 202 genes sequenced at 24x in 7,842 individuals. We call variants using individual-based, population-based and linkage disequilibrium (LD)-aware methods with stringent quality control. We measure genotype accuracy by the concordance with on-target GWAS genotypes and between 80 pairs of sequencing replicates. We validate selected singleton variants using capillary sequencing.

Results: Using these calling methods, we detected over 27,500 variants at the targeted exons; >57% were singletons. The singletons identified by individual-based analyses were of the highest quality. However, individual-based analyses generated more missing genotypes (4.72%) than population-based (0.47%) and LD-aware (0.17%) analyses. Moreover, individual-based genotypes were the least concordant with array-based genotypes and replicates. Population-based genotypes were less concordant than genotypes from LD-aware analyses with extended haplotypes. We reanalyzed the same dataset with a second set of callers and showed again that the individual-based caller identified more high-quality singletons than the population-based caller. We also replicated this result in a second dataset of 57 genes sequenced at 127.5x in 3,124 individuals.

Conclusions: We recommend population-based analyses for high quality variant calls with few missing genotypes. With extended haplotypes, LD-aware methods generate the most accurate and complete genotypes. In addition, individual-based analyses should complement the above methods to obtain the most singleton variants.
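
The two summary quantities the abstract reports, the missing-genotype rate and the concordance between two sets of genotype calls (sequencing versus array, or replicate versus replicate), can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact definitions, so the sketch assumes the simplest ones: genotypes are coded as 0, 1 or 2 copies of the alternate allele, `None` marks a missing call, and concordance is the fraction of identical calls among sites where both sets are non-missing. The function names `missing_rate` and `concordance` are hypothetical and do not come from the paper or any library.

```python
from typing import Optional, Sequence, Tuple

# Assumed coding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of genotype calls that are missing."""
    if not calls:
        raise ValueError("no genotype calls supplied")
    return sum(g is None for g in calls) / len(calls)


def concordance(a: Sequence[Genotype], b: Sequence[Genotype]) -> Tuple[float, int]:
    """Return (fraction of matching calls, number of calls compared).

    Only positions where both call sets are non-missing are compared.
    """
    if len(a) != len(b):
        raise ValueError("call sets must cover the same sites or samples")
    compared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not compared:
        return float("nan"), 0
    matches = sum(x == y for x, y in compared)
    return matches / len(compared), len(compared)


# Toy data, invented for illustration only.
sequencing_calls = [0, 1, None, 2, 1]
array_calls = [0, 1, 1, 2, 0]

rate, n_compared = concordance(sequencing_calls, array_calls)
print(f"missing rate: {missing_rate(sequencing_calls):.2f}")
print(f"concordance: {rate:.2f} over {n_compared} calls")
```

Excluding missing calls from the comparison means a caller can look more concordant simply by declining to call difficult sites, which is one reason to read the two quantities together, as the abstract does.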
