Gene Expression Data Analysis Based on Dynamic Time Warping
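Abstract
Inferring the similarity between gene expression data is an important way to infer gene function and to answer questions about complex biological processes.
     In bioinformatics, dynamic time warping was first applied to sequence alignment. Because time-series gene expression profiles exhibit time delays and local similarity, this paper applies dynamic time warping to similarity inference for time-series gene expression profiles and implements an optimized variant, the multi-segment dynamic time warping algorithm. Experiments show that the algorithm has low time complexity and high alignment accuracy.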
Science and technology developed rapidly in the 20th century across many fields, information technology in particular, and progress in almost every field is now inseparable from it. Rapid advances in biological science and technology have produced a large amount of biological data. Traditional biological experiments alone cannot process this much data quickly and comprehensively, which is bound to hold back the life sciences and related fields. Bioinformatics emerged in response. It draws on computer technology, information technology, statistics, medicine, mathematics, and other disciplines to discover regularities in the available data, and uses them to guide and interpret biological experiments and to accelerate our understanding of the essential characteristics of life.
     Inferring the similarity between gene expression data is an important way to infer gene function and to answer questions about complex biological processes. There are several ways to run similarity queries over gene expression time series, of which the most commonly used is the basic dynamic time warping algorithm. The dynamic programming on which it rests is a key technique in many important applications: it has been highly successful in speech recognition, and it is also applied in genomics, in matrix multiplication problems, and in shortest-path problems on graphs.
     In bioinformatics, dynamic time warping was first applied to sequence alignment. It is used to handle the many problems that arise when comparing time-series gene expression data, including data sparsity, high dimensionality, measurement noise, and local distortions between otherwise similar time series. Because time-series gene expression profiles exhibit time delays and local similarity, this paper applies dynamic time warping to similarity inference for time-series gene expression data and implements an optimized variant, the multi-segment dynamic time warping algorithm.
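     As a rough illustration of the basic dynamic time warping idea, the sketch below computes the textbook warping distance between two short expression profiles, one of which is a delayed copy of the other. It is only a minimal sketch of the basic algorithm, not the multi-segment variant developed in this paper. The function name `dtw_distance`, the absolute-difference cost, and the sample values are assumptions made up for illustration.

```python
from math import inf


def dtw_distance(query, target):
    """Textbook dynamic time warping distance between two numeric sequences.

    Hypothetical helper for illustration; uses absolute difference as the
    local cost, which is an assumption rather than the paper's choice.
    """
    n, m = len(query), len(target)
    # cost[i][j] holds the minimal cumulative cost of aligning
    # query[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - target[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # query advances, target point is reused
                cost[i][j - 1],      # target advances, query point is reused
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# A made-up expression profile and a copy delayed by one time point.
profile = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
delayed = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]

pointwise = sum(abs(a - b) for a, b in zip(profile, delayed))
print(pointwise)                       # point-by-point difference: 4.0
print(dtw_distance(profile, delayed))  # warped difference: 0.0
```

     The point-by-point comparison penalizes the delay, whereas the warped comparison absorbs it. The full cost table takes time proportional to the product of the two sequence lengths.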
     The multi-segment dynamic time warping method described in this paper addresses several key challenges:
     Time series in toxicology research are typically sparse, containing fewer than 10 measured time points.
     Because the series are sampled at non-uniform time intervals that vary between treatments, the time points measured in a query may not coincide with those measured in a database series.
     Queries can vary in the number of measurements and in length. Some queries may consist of a single observation, whereas others may contain many time points. Some may span only a few hours, while others include measurements taken over several days.
     A given query and its best match in the database may differ in amplitude, frequency, or duration. For example, the treatment behind a query expression profile may resemble a treatment in the database, except that the gene expression response is attenuated or delayed, or unfolds more slowly. The query can also be a shortened version of the database series, or vice versa.
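     The non-uniform sampling challenge listed above can be made concrete with a small sketch that estimates a database series at the time points of a query, so that the two can be compared at matching times. Linear interpolation is an assumption chosen for simplicity; this abstract does not state which interpolation scheme the paper uses. The names `interpolate_at` and `resample` and all sample values are hypothetical.

```python
from bisect import bisect_right


def interpolate_at(times, values, t):
    """Linearly interpolate a series sampled at strictly increasing `times`.

    Outside the sampled range, the nearest end value is returned.
    """
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    k = bisect_right(times, t)  # now times[k - 1] <= t < times[k]
    t0, t1 = times[k - 1], times[k]
    weight = (t - t0) / (t1 - t0)
    return values[k - 1] + weight * (values[k] - values[k - 1])


def resample(times, values, new_times):
    """Estimate the series at every time point in `new_times`."""
    return [interpolate_at(times, values, t) for t in new_times]


# Made-up database series measured at 0, 2, 8 and 24 hours.
db_times = [0.0, 2.0, 8.0, 24.0]
db_values = [0.0, 1.0, 4.0, 4.0]

# Made-up query measured at 1, 4 and 12 hours.
query_times = [1.0, 4.0, 12.0]

db_at_query_times = resample(db_times, db_values, query_times)
print(db_at_query_times)
```

     With so few measured points per series, any interpolation is a coarse estimate, so a sketch like this only shows how mismatched time points can be brought onto a common axis before alignment.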
     Experiments show that the multi-segment dynamic time warping algorithm described here produces more accurate alignments and classifications than several alternative methods, and that it is robust to the relative distortions between similar chemical treatments.