The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes

Authors: Danny Challis (1) (2) (3)
Lilian Antunes (1) (2) (4)
Erik Garrison (5)
Eric Banks (6)
Uday S Evani (1) (2) (7)
Donna Muzny (1) (2)
Ryan Poplin (6)
Richard A Gibbs (1) (2)
Gabor Marth (5) (8)
Fuli Yu (1) (2) (9)

1. Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
2. Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
3. Present address: Monsanto Company, Ankeny, IA 50021, USA
4. Present address: Washington University School of Medicine, Saint Louis, MO 63110, USA
5. Department of Biology, Boston College, Wellcome Trust Sanger Institute, Chestnut Hill, MA 02467, USA
6. Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
7. Present address: New York Genome Center, New York, NY 10013, USA
8. Present address: Department of Human Genetics and Utah Center for Genetic Discovery, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
9. Institute of Neurology, Tianjin Medical University General Hospital, Tianjin 300052, China
Keywords: INDEL; 1000 Genomes Project; Distribution; Mutagenesis
Journal: BMC Genomics
Publication year: 2015
Publication date: December 2015
Year: 2015
Volume: 16
Issue: 1
Full-text size: 1,175 KB
Journal subjects: Life Sciences, general; Microarrays; Proteomics; Animal Genetics and Genomics; Microbial Genetics and Genomics; Plant Genetics & Genomics
Publisher: BioMed Central
ISSN: 1471-2164

Abstract

Background: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls.

Results: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%.

Conclusions: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.
