Reference-free SNP calling: improved accuracy by preventing incorrect calls from repetitive genomic regions
  • Authors: Jinzhuang Dou (1) (2)
    Xiqiang Zhao (2)
    Xiaoteng Fu (1)
    Wenqian Jiao (1)
    Nannan Wang (2)
    Lingling Zhang (1)
    Xiaoli Hu (1)
    Shi Wang (1)
    Zhenmin Bao (1)
  • Keywords: Next-generation sequencing ; single nucleotide polymorphism ; genotyping ; maximum likelihood ; mixed Poisson/normal model
  • Journal: Biology Direct
  • Publication year: 2012
  • Publication date: December 2012
  • Volume: 7
  • Issue: 1
  • Full-text size: 536KB
  • Affiliations:

    1. Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, 5 Yushan Road, Qingdao, 266003, China
    2. College of Mathematical Sciences, Ocean University of China, 238 Songling Road, Qingdao, 266003, China
Abstract
Background: Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype large numbers of SNPs in non-model organisms with limited or no genomic resources. Most NGS-based genotyping methods require a reference genome to perform accurate SNP calling. Little effort, however, has been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome.

Results: Here we describe an improved maximum likelihood (ML) algorithm called iML, which achieves high genotyping accuracy for SNP calling in non-model organisms without a reference genome. The iML algorithm incorporates a mixed Poisson/normal model to detect composite read clusters and can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions. Through analysis of simulated and real sequencing datasets, we demonstrate that, in comparison with ML or a threshold approach, iML markedly improves the accuracy of de novo SNP genotyping and is especially powerful for reference-free genotyping in diploid genomes with high repeat content.

Conclusions: The iML algorithm can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions, and thus outperforms the original ML algorithm by achieving much higher genotyping accuracy. Our algorithm is therefore very useful for accurate de novo SNP genotyping in non-model organisms without a reference genome.

Reviewers: This article was reviewed by Dr. Richard Durbin, Dr. Liliana Florea (nominated by Dr. Steven Salzberg) and Dr. Arcady Mushegian.
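The abstract does not give the details of iML, so the following is only a rough illustration of the two ideas it names: maximum likelihood genotype calling from read counts, and screening out read clusters whose depth suggests they merge several repeat copies. Everything here is an assumption made for illustration: the binomial read model, the fixed error rate, the function names, and the plain Poisson tail test, which is a much simpler stand-in for the mixed Poisson/normal model the authors describe.

```python
import math


def genotype_log_likelihoods(n_ref, n_alt, error=0.01):
    """Binomial log-likelihood of each diploid genotype at a diallelic site.

    Assumed model (not from the paper): each read shows the alternative
    allele with probability `error`, 0.5, or 1 - error depending on genotype.
    """
    p_alt = {"ref/ref": error, "ref/alt": 0.5, "alt/alt": 1.0 - error}
    n = n_ref + n_alt
    log_comb = (math.lgamma(n + 1)
                - math.lgamma(n_ref + 1)
                - math.lgamma(n_alt + 1))
    return {
        g: log_comb + n_alt * math.log(p) + n_ref * math.log(1.0 - p)
        for g, p in p_alt.items()
    }


def call_genotype(n_ref, n_alt, error=0.01):
    """Return the genotype with the highest likelihood."""
    ll = genotype_log_likelihoods(n_ref, n_alt, error)
    return max(ll, key=ll.get)


def poisson_upper_tail(depth, mean_depth):
    """P(X >= depth) for X ~ Poisson(mean_depth), with mean_depth > 0."""
    cdf = 0.0
    for k in range(depth):
        cdf += math.exp(-mean_depth + k * math.log(mean_depth)
                        - math.lgamma(k + 1))
    return max(0.0, 1.0 - cdf)


def looks_composite(depth, mean_depth, alpha=1e-3):
    """Flag a cluster whose depth is implausibly high for a single-copy locus.

    Simplified stand-in: a one-sided Poisson test against the expected
    single-copy depth, with a hypothetical significance cutoff `alpha`.
    """
    return poisson_upper_tail(depth, mean_depth) < alpha


if __name__ == "__main__":
    # Genotype a cluster only if its depth looks single-copy.
    n_ref, n_alt, expected_depth = 11, 9, 20.0
    if looks_composite(n_ref + n_alt, expected_depth):
        print("skipped: possible composite cluster")
    else:
        print(call_genotype(n_ref, n_alt))
```

The point of the sketch is the ordering: a cluster that pools reads from several repeat copies can show a balanced allele ratio and be called heterozygous by ML alone, so screening on depth before genotyping is what removes such calls.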
