Issues in Data Processing for Next-Generation Gene Sequencing
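Abstract
With the development of next-generation sequencing (NGS) technology, laboratory equipment and workflows have matured, and a growing number of companies have launched their own sequencing platforms. Gene sequencing is no longer confined to specialized genomics laboratories, and more research groups and researchers are entering the field. As a result, NGS data processing faces increasingly demanding requirements and challenges: researchers are no longer satisfied with the basic data processing programs supplied by sequencer manufacturers and are turning to more open and flexible third-party software.
     In this paper, we re-examine the NGS data processing workflow and implement a complete set of processing algorithms, from raw image processing through base calling. In existing NGS data processing tools, the image processing stage generally relies on level set segmentation or simply applies a Laplace operator. After analyzing their results closely, we found that these approaches cannot accurately locate and identify gene clusters, so we designed a new processing algorithm (NRDPT, NGS Raw Data Processing Tool). Unlike existing methods, it locates gene clusters with an algorithm based on edges and the Hough transform, which effectively improves locating accuracy. Building on accurate cluster locations, we also designed a two-step registration strategy that greatly improves efficiency (roughly 9 times faster than the conventional algorithm). These algorithms are discussed in detail in this paper.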
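     The abstract does not spell out the two steps of the registration strategy. As one plausible reading, the sketch below registers two imaging cycles from their detected cluster centers: a coarse step votes for the most common integer offset between point pairs, and a fine step averages the residuals of the pairs that agree with it. The function names, the translation-only model, and the tolerance `tol` are all assumptions made for illustration; this is not NRDPT's actual algorithm.

```python
from collections import Counter


def coarse_shift(ref_pts, mov_pts):
    """Step 1 (assumed): vote over integer (row, col) offsets of all point pairs."""
    votes = Counter()
    for rr, rc in ref_pts:
        for mr, mc in mov_pts:
            votes[(round(mr - rr), round(mc - rc))] += 1
    # The offset shared by the true correspondences collects the most votes.
    return votes.most_common(1)[0][0]


def fine_shift(ref_pts, mov_pts, coarse, tol=1.5):
    """Step 2 (assumed): average the offsets of pairs consistent with the coarse shift."""
    dr_sum = dc_sum = 0.0
    n = 0
    for rr, rc in ref_pts:
        for mr, mc in mov_pts:
            dr, dc = mr - rr, mc - rc
            if abs(dr - coarse[0]) <= tol and abs(dc - coarse[1]) <= tol:
                dr_sum += dr
                dc_sum += dc
                n += 1
    if n == 0:
        # No consistent pairs: fall back to the coarse estimate.
        return float(coarse[0]), float(coarse[1])
    return dr_sum / n, dc_sum / n


# Hypothetical cluster centers in a reference cycle and a shifted cycle.
ref = [(5.0, 7.0), (20.0, 31.0), (42.0, 11.0), (33.0, 48.0), (12.0, 40.0)]
mov = [(r + 3.25, c - 2.0) for r, c in ref]
coarse = coarse_shift(ref, mov)
print(coarse, fine_shift(ref, mov, coarse))
```

     Working on cluster coordinates rather than on full images keeps the fine step cheap, which is one way a two-step scheme could save time. The all-pairs voting shown here is quadratic in the number of clusters and would need spatial binning on real data.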
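     For base calling, existing studies are all based on data from the Illumina sequencing platform and mainly try to correct the phasing errors that frequently occur on that instrument, which generally stem from shortcomings of the biochemical reactions it uses. In some newer sequencing methods (such as SoLiD and HYK), the sequencing workflow has changed and these problems do not arise. In this paper, we discuss the problems that occur in different sequencing methods and their effect on base calling. After carefully considering several base-calling strategies, we implemented a base-calling method for sequencing by ligation and obtained good results.
     Gene sequencing technology is developing quickly. Our research was carried out on the P-STARII gene sequencer from HYK (华因康), a company whose intellectual property is fully domestic to China. The instrument and the sequencing workflow were upgraded continually during the research; these uncertainties often made our work harder, but they also show how rapidly the field is advancing. We look forward to NGS technology becoming truly mature and eventually entering clinical practice.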
In recent years, benefiting from the significant development of next-generation sequencing (NGS) technology, more and more companies have launched their own sequencing platforms and instruments, such as the Genome Analyzer (Illumina, San Diego, USA), 454-FLX (Roche, Basel, Switzerland), and SOLiD (Applied Biosystems, California, USA). As a result, gene sequencing has moved beyond the specialized laboratory. Many research groups and researchers are entering this field, and NGS data processing faces increasing demands and challenges. Researchers are no longer satisfied with the basic pipelines provided by the machine manufacturers. Many open and flexible NGS data processing pipelines have been developed in recent years, such as BING (Kriseman, 2010) and Swift, but they are all based on Illumina data. In this paper, we carefully review the NGS data processing workflow and design the whole pipeline and its algorithms, from gene cluster locating and image registration to base calling.
     We found that the raw data processing stage in existing NGS pipelines is simplistic or even absent. These pipelines use general-purpose algorithms such as level set segmentation, or simply a Laplace operator, to locate the clusters. Careful analysis showed that these algorithms cannot accurately locate each cluster in the fluorescence image. We therefore redesigned the processing algorithm and present it here as NRDPT (NGS Raw Data Processing Tool).
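     For roughly circular clusters, one edge-based alternative to level sets or a Laplace operator is a circular Hough transform: every edge pixel votes for all centers that lie one radius away, and the accumulator peak marks the cluster center. The following is a minimal sketch under strong assumptions (a single cluster, a known radius, a central-difference edge detector); the helper names and the synthetic image are hypothetical and do not reproduce NRDPT's implementation.

```python
import math


def synthetic_cluster(size, center, radius, value=100):
    """Build a size x size test image containing one filled disc (hypothetical data)."""
    cr, cc = center
    return [[value if (r - cr) ** 2 + (c - cc) ** 2 <= radius ** 2 else 0
             for c in range(size)] for r in range(size)]


def edge_points(image, threshold):
    """Return (row, col) pixels whose central-difference gradient exceeds threshold."""
    h, w = len(image), len(image[0])
    pts = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            if math.hypot(gx, gy) > threshold:
                pts.append((r, c))
    return pts


def hough_circle_center(points, shape, radius, n_angles=72):
    """Vote for circle centers at a fixed radius and return the accumulator peak."""
    h, w = shape
    acc = [[0] * w for _ in range(h)]
    for r, c in points:
        seen = set()  # each edge pixel votes at most once per candidate center
        for k in range(n_angles):
            theta = 2.0 * math.pi * k / n_angles
            a = int(round(r - radius * math.sin(theta)))
            b = int(round(c - radius * math.cos(theta)))
            if 0 <= a < h and 0 <= b < w and (a, b) not in seen:
                seen.add((a, b))
                acc[a][b] += 1
    _, best_r, best_c = max((acc[a][b], a, b) for a in range(h) for b in range(w))
    return best_r, best_c


image = synthetic_cluster(24, (10, 12), 4)
center = hough_circle_center(edge_points(image, 50), (24, 24), 4)
print(center)  # expected to land within a pixel of (10, 12)
```

     A real fluorescence image holds many clusters of varying size, so a practical version would search over a range of radii and keep every local maximum of the accumulator rather than only the global peak.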
     Unlike existing methods, we use an edge-based Hough transform to locate the clusters, which effectively improves positioning accuracy. The two-step registration algorithm designed in this paper also greatly reduces the time cost (about a 9-fold speedup).
     For base calling, existing studies are based on data produced by the Illumina sequencing platform. These methods are mainly designed to correct phasing problems, which are caused by the biochemical process. In some of the newer sequencing methods (such as SOLiD), these problems do not exist. In this paper, we discuss these problems and carefully consider several strategies. We then describe a base-calling method based on the reactions used in PSTAR-II, which produced good results.
References
[1] Sanger F, Nicklen S, Coulson A R. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A, 1977, 74(12):5463-5467
    [2] Kircher M, Kelso J: High-throughput DNA sequencing - concepts and limitations. BioEssays 2010, 32:524-536.
    [3] Metzker, M. L. Sequencing technologies—the next generation. Nature Rev. Genet. 11, 31–46 (2010)
    [4] Datta S., et al. Statistical analyses of next generation sequence data: a partial overview. J. Proteomics Bioinformatics. 2010;3:511–515.
    [5] Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, et al. (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309: 1728–1732.
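    [6] Sheng Jiang. Bioengineering: industrial development of domestically made high-throughput gene sequencers. China Venture Capital (中国科技投资), 2011, No. 06 (in Chinese).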
    [7] http://www.genome.gov/25522229
    [8] Jeffrey Kriseman , Christopher Busick , Szabolcs Szelinger , Valentin Dinu, BING: Biomedical informatics pipeline for Next Generation Sequencing, Journal of Biomedical Informatics, v.43 n.3, p.428-434, June, 2010
    [9] Christian Ledergerber and Christophe Dessimoz. Base-calling for next-generation sequencing platforms. Brief Bioinform 2011 : bbq077v1-bbq077.
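    [10] Ye Binggang. High-throughput gene sequencing image processing and data analysis. PhD dissertation, South China University of Technology, April 2010 (in Chinese).
    [11] Zheng Hua. Research on a fluorescence signal acquisition and processing system for DNA analyzers. PhD dissertation, Zhejiang University, July 2008 (in Chinese).
    [12] Zheng Hua, Wang Liqiang, Shi Yan, Wang Jie, Lu Zukang. A new method for denoising DNA sequencing signals. Spectroscopy and Spectral Analysis, 2008, No. 05 (in Chinese).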
    [13] Erlich, Y., Mitra, P. P., delaBastide, M., McCombie, W. R. & Hannon, G. J. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nature Methods 5, 679–682 (2008).
    [14] W.C. Kao, K. Stevens, and Y.S. Song. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Research, doi:10.1101/gr.095299.109, 2009
    [15] Kao, W.C. and Song, Y.S. naiveBayesCall: An efficient model-based base-calling algorithm for high-throughput sequencing. Proc. 14th Annual Intl. Conf. on Research in Computational Molecular Biology (RECOMB 2010), Lecture Notes in Computer Science 6044, pages 233--247.
    [16] Shendure J, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005;309:1728–1732.
    [17] J Shendure, R D Mitra, C Varma, G M Church. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet (2004)
    [18] Shendure J, Ji H. Next-generation DNA sequencing. Nat. Biotechnol. 2008;26:1135-1145.
    [19] http://www.hykgene.com/Default.aspx?PN=N_Product_data&ps=&pID=86663
    [20] Sheng Sitong. Application No. CN201010155269.9, Patent No. CN101942000A. Shenzhen HYK Gene Technology Co., Ltd. (深圳华因康基因科技有限公司).
    [21] Nobuyuki Otsu (1979). "A threshold selection method from gray-level histograms". IEEE Trans. Sys., Man., Cyber. 9 (1): 62–66. doi:10.1109/TSMC.1979.4310076
    [22] Collignon A, Maes F, Vandermeulen D,et al. Automated multi-modality image registration using information theory. Proc of the Information Processing in Medical Imaging Conference,Dordrecht, June 1995: 263-274
    [23] Ramtin Shams, Parastoo Sadeghi, Rodney Kennedy, Richard Hartley. Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images. Computer Methods and Programs in Biomedicine (2010) Volume: 99, Issue: 2, Publisher: Elsevier Ireland Ltd, Pages: 133-146
    [24] Han Xiao, L S Hibbard, V Willcut. GPU-accelerated, gradient-free MI deformable registration for atlas-based MR brain image segmentation. 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2009). Publisher: IEEE, Pages: 141-148
    [25] Gonzalez and Woods. Digital Image Processing 3rd Ed. Prentice Hall, 2008.
    [26] D.H. Ballard, "Generalizing the Hough Transform to Detect Arbitrary Shapes", Pattern Recognition, Vol.13, No.2, p.111-122, 1981
    [27] Davies E R,A modified hough scheme for general circle location,Pattern Recognition Letters, 1988, 7(01)
    [28] Kircher M, Stenzel U, Kelso J (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol 10: R83.
    [29] Kao W, Stevens K, Song Y (2009) BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res 19: 1884-1895.
    [30] Digabel, H., and Lantuejoul, C. Iterative algorithms. In Actes du Second Symposium Europeen d'Analyse Quantitative des. Microstructures en Sciences des Materiaux, Biologie et Medecine, Caen,4-7 October 1977 (1978), J.-L. Chermant, Ed., Riederer Verlag, Stuttgart, pp. 85-99.
    [31] J.Sijbers, M.Verhoye, P.Scheunders, A.van der Linden. Watershed-based segmentation of 3D mr data for volume quantization. Magnetic Resonance Imaging, 1997,Vol.15, No.6: PP.679-688.
    [32] Osher,S. & Sethian, J. A.(1988),"Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations",J. Comput. Phys.,vol. 79,pp. 12–49.
    [33] Osher, Stanley J.; Fedkiw, Ronald P.. Level Set Methods and Dynamic Implicit Surfaces. Springer-Verlag. 2002. ISBN 0-387-95482-1.
    [34] J.A.Sethian. Fast marching methods. SIAM Rev., 41(1999): pp.199-235
    [35] Z. F. Knops, J. B. A. Maintz, M. A. Viergever, J. P. W. Pluim. Normalized Mutual Information Based Registration Using K-Means Clustering and Shading Correction. Medical Image Analysis. 2006, 10(3): 432~439
    [36] T. Buzug, J. Weese. Improving DSA Images with an Automatic Algorithm Based on Template Matching and an Entropy Measure. J.of Computer Assisted Radiology. 1996, 1124: 145~150
    [37] Jones, Loyd Ancile, and H. R. Condit. 1941. The Brightness Scale of Exterior Scenes and the Computation of Correct Photographic Exposure. Journal of the Optical Society of America 31:11, Nov. 1941, 651–678.
    [38] http://bowtie-bio.sourceforge.net/index.shtml
    [39] Langmead B, Hansen K, Leek J. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biology 11:R83.
    [40] Schatz M, Langmead B, Salzberg SL. Cloud computing and the DNA data race. Nature Biotechnology 2010 Jul;28(7):691-3.
    [41] Langmead B, Schatz M, Lin J, Pop M, Salzberg SL. Searching for SNPs with cloud computing. Genome Biology 10:R134.
    [42] http://aws.amazon.com/elasticmapreduce/
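    [43] Li Zuozhu. Mutual-information medical image registration based on a genetic algorithm. Computer Knowledge and Technology, 2007, No. 16 (in Chinese).
    [44] Jin Renchao, Wang Jinhua, Song Enmin. A brain MR image registration algorithm based on coarse registration and mutual information. Computer Simulation, 2007, No. 04 (in Chinese).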
