基于海明距离的DNA序列中相似性重复片段查找技术研究

英文题名：Study on Techniques of Searching for Approximate Repeats in DNA Sequences Based on Hamming Distance
作者：赵毅
论文级别：硕士
学科专业名称：计算机系统结构
中文关键词：DNA序列 ; 相似性串联重复片段 ; 相似性反向重复片段 ; 模式单元数组 ; 边界索引
英文关键词：DNA sequence ; approximate tandem repeats ; approximate inverted repeats ; Pattern Unit Array ; Boundary Index
学位年度：2008
导师：王国仁
学科代码：081201
学位授予单位：东北大学
论文提交日期：2007-12-24

摘要

生物信息学是随着人类基因组计划的启动、基因序列和蛋白质序列等生物数据迅猛增加而逐渐兴起的一门通过综合运用数学、计算机科学和信息科学来研究生物系统中信息现象的科学。在其广泛的研究领域中,重复片段查找是一个重要的DNA序列分析基础问题,其中的相似性重复片段查找因具有重要的生物意义以及其问题本身的复杂性,一直以来都是广大生物信息学研究人员致力研究的重要课题之一。
     本文针对DNA序列中两类重要的相似性重复片段——相似性串联重复片段和相似性反向重复片段的查找技术进行了深入研究,在分别为两类重复片段进行形式化定义之后,设计了相应的索引技术和查找算法用于两类相似性重复片段的查找和识别。
     在相似性串联重复片段查找的研究中,首先在海明距离的基础上定义了模式相似度和相邻相似度的概念用于衡量相似性串联重复片段模式间的相似程度,并提出了新的相似性串联重复片段定义Largest Neighbor-similarity-based Approximate Tandem Repeats (LNATR)。之后,通过将DNA序列划分为模式单元,设计了模式单元数组(Pattern Unit Array, PUA)的索引结构用于LNATR的查找。最后在模式单元数组上,根据后继信息进行模式连接以及模式增长,设计了一种基于模式单元数组的LNATR查找算法,并与Gad M. Landau等人提出的查找算法进行了比较。
     在相似性反向重复片段查找的研究中,首先在海明距离的基础上定义了匹配度用于衡量相似性反向重复片段模式间的匹配相似程度,并综合考虑了反向重复片段模式间可能存在间隔的特点,提出了新的相似性反向重复片段定义Largest Matching-degree-based Approximate Inverted Repeats (LMAIR)。之后设计了边界索引(Boundary Index, BI)的索引技术用于LMAIR的查找。最后在边界索引的基础上,分别设计了基本LMAIR查找算法和优化的LMAIR查找算法,并对两种算法进行了比较。
With the start of Human Genome Project and the rapid increase of biological data, bioinformatics is gradually becoming one of the most important research fields, which studies the biological systems by applying mathmatics, computer science and information science. In the broad research areas of bioinformatics, repeats searching problem is an important and basic DNA sequence analysis problem, of which approximate repeats searching is an important issue which many researchers have paid great attention to, since there is great biological significance in approximate repeats and the searching problem itself is a new and complicated one.
     This thesis focuses on the searching problem of two kinds of important approximate repeats, which are approximate tandem repeats and approximate inverted repeats. Based on the proposed definitions of the two kinds of repeats, two indexing structures and relative searching algorithms are designed respectively.
     For the problem of searching for approximate tandem repeats, firstly pattern-similarity and neighbor-similarity are proposed based on hamming distance for similarity measurement, then a new definition Largest Neighbor-similarity-based Approximate Tandem Repeats (LNATR) is presented. After that a new indexing structure named Pattern Unit Array (PUA) is designed, based on which an effective LNATR searching algorithm is proposed, and is compared with another approximate tandem repeats searching algorithm designed by Gad M. Landau.
     For the problem of searching for approximate inverted repeats, the thesis first presents matching-degree based on hamming distance to measure the similarity between the two patterns of inverted repeats, based on which a new definition Largest Matching-degree-based Approximate Inverted Repeats (LMAIR) is presented. Then Boundary Index (BI) is designed for further LMAIR searching. Finally, simple LMAIR searching algorithm and optimized LMAIR searching algorithm are proposed based on BI, and comparation is made between the two LMAIR searching algorithms.

引文

1. D. W. Mount. Bioinformatics Sequence and Genome Analysis [M], Cold Spring Harbor Laborary Press,2001.
    2. F. S. Collins, A. Patrinos, E. Jordan, et al. New goals for the US human genome project [J], Science,1998,282 (5389):682-689.
    3. The Human Genome Project (HGP) [EB/OL], http://www.nhgri.nih.gov/HGP/.
    4.钟扬,张亮,赵琼.简明生物信息学[M],北京：高等教育出版社,2003.
    5.赵国屏等.生物信息学[M],北京：科学出版社,2005.
    6.生物信息学的发展现状和展望[EB/OL], http://www.bioinfo.org.cn/biology/ biodevelope.htm.2001.
    7.陶士珩.生物信息学[M],北京：科学出版社,2007.
    8.蒋彦,王小行等.基础生物信息学及应用[M],北京：清华大学出版社,2003.
    9.P.C. Turner著,刘进元译.分子生物学[M],北京：科学出版社,2002.
    10.染色体、基因与人体基因组研究[EB/OL], http://news.xinhuanet.com/ziliao/ 2003-09/25/content 1099876.htm.
    11.陈竺.医学遗传学[M],北京：人民卫生出版社,2001.
    12. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome [J], Nature 409,2001, (15):860-921.
    13. S. Beleza, C. Alves, A. Gonzalez-Neira, et al. Extending STR markers in Y chromosome haplotypes [J], Int. J. Legal Med,2003,117(1):27-33.
    14. D. R. Young, Z. Tun, K. Honda, et al. Identifying sex chromosome abnormalities in forensic DNA testing using amelogenin and sex chromosome short tandem repeats [J], J. Forensic Sci,2001,46(2):346-348.
    15. S. Gilmore, R. Peakall, J. Robertson. Short tandem repeats (STR) DNA markers are hypervariable and informative in Cannabis sativa:implications for forensic investigations [J], Forensic Sci. Int,2003,131(1):65-74.
    16. C. J. Moore, E. M. Daly, F. Tassone, et al. The effect of pre-mutation of X chromosome CGG trinucleotide repeats on brain anatomy [J], Brain,2004,127,2672-2681.
    17. D. K. Nag, M. Suri, E. K. Stenson. Both CAG repeats and inverted DNA repeats stimulate spontaneous unequal sister-chromatid exchange in Saccharomyces cerevisiae [J], Nucleic Acids Res,2004,32 (18):5677-5684.
    18. J. J. Bissler. DNA inverted repeats and human disease [J], Front Bioscience,1998,3: 408-418.
    19. M. J. Kosovsky, A. Siddiqui. Biochemical and functional properties of a plindromic sequence motif within the hepatitis B virus Enh I [J], Virology,1999,259(1):60-66.
    20. Y. Park, M. Nesterova, S. Agrawal, et al. Dual blockade of cyclic AMP response element (CRE) and AP-1-directed transcription by CRE-transcription factor decoy oligonucleotide [J], Journal of Biological Chemistry,1999,274 (3):1573-1580.
    21. R. Jin, M. E. Fernandez-Beros, R. P. Novick. Why is the initiation nick site of an AT-rich rolling circle plasmid at the tip of a GC-rich cruciform? [J], EMBO Journal,1997,16: 4456-4466.
    22. E. L. Kim, H. Peng, F. M. Esparza, et al. Cruciform-extruding regulatory element controls cell-specific activity of the tyrosine hydroxylase gene promoter [J], Nucleic Acids Research,1998,26:1793-1800.
    23. C. Spiro, C. T. McMurray. Switching of DNA secondary structure in proenkephalin transcriptional regulation [J], Journal of Biological Chemistry,1997,272:33145-33152.
    24. E. Akgun, J. Zahn, S. Baumes, et al. Palindrome resolution and recombination in the mammalian germ line [J], The Molecular and Cellular Biology,1997,17:5559-5570.
    25. D. K. Nag, A. Kurst. A 140-bp-long palindromic sequence induces double-strand breaks during meiosis in the yeast Saccharomyces cerevisiae [J], Genetics,1997,146:835-847.
    26.抗艳红，许寅生.真核生物重复序列的研究进展[J],河北北方学院学报,2006,22(2)：23-26.
    27. W. Fitch, T. Smith, J. Breslow. Detecting internally repeated sequences and inferring the history of duplication [J], Methods in Enzymology, San Diego,1986,128(45):773-788.
    28. S. Kurtz, G. Myers. Estimating the Probability of Approximate Matches [C], Proc. Of the 8th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, Denmark,1991,264:52-647.
    29. S. Kurtz, J.V. Choudhuri, E. Ohlebusch, et al. REPuter:the manifold applications of repeats analysis on a genomic scale [J], Nucl. Acids Res,2001,29(22):4633-4642.
    30. G. M. Landau, J. P. Schmidt. An algorithm for approximate tandem repeats [C], Proc. Of the 4th Annual Symposium on Combinatorial Pattern Matching, Italy,1993,684:120-133.
    31. G. Benson., M. Waterman. A method for fast database search for all k-nucleotide repeats [J], Nucl. Acids Res,1994,22:4828-4836.
    32. G. Benson. Tandem repeats finder:a program to analyze DNA [J], Nucl. Acids Res,1998, 27(2):573-580.
    33. V. N. Babenko, P. S. Kosarev, O. V. Vishnevsky, et al. Investigating extended regulatory regions of genomic DNA sequences [J], Boiinformatics,1999,15(7/8):64-653.
    34. P. Vincens, L. Buffat, C. Andre, et al. A strategy for finding regions of similarity in complete genome sequences [J], Bioinformatics,1998,14(8):715-727.
    35. R. Kolpakov, G. Kucherov. On maximal repetitions in words [C], In Proceedings of the 12th International Symposium on Fundamentals of Computation Theory, Iasi (Romania), 1999,1684 of LNCS:374-385.
    36. Y. Wexler, Z. Yakhini, Y. Kashi, et al. Finding Approximate Tandem Repeats in Genomic Sequences [C], RECOMB 2004, California,2004,223-232.
    37. A. H. L. Porto, V. C. Barbosa. Finding approximate palindromes in strings [J], Pattern Recognition,2002,35:2581-2591.
    38. P. E. Warburton, J. Giordano, F. Cheung, etc. Inverted Repeats Structure of the Human Genome:The X-Chromosome Contains a Preponderance of Large, Highly Homologous Inverted Repeats That Contain Testes Genes [J], Genome Research,2004,14:1861-1869.
    39. G. Benson. An algorithm for finding tandem repeats of unspecified pattern size [C], RECOMB98, New York,1998,20-29.
    40.王镝,王国仁,陈白尘等.一种可用于生物序列分析的轻量级索引——后继数组[J]华中大学学报,2005,33：209-212.
    41.王镝,王国仁,吴青泉等.DNA序列中基于后继数组索引的LPR查找算法[J]计算机研究与发展,2006,43：195-199.
    42.王镝,赵毅,王国仁等.DNA序列中基于后继数组索引的SATR查找算法[J]东北大学学报,2007,28(2)：184-188.
    43. D. Wang, B. C. Chen, G. R. Wang, et al. Finding repetitions in DNA sequences based on a new index-Succeeding Unit Array [J], Journal of Computational Information Systems. 2006,2 (4):1371-1378.
    44. D. Wang, G. R. Wang, Q. Q. Wu, et al. Partition Frequency Distance based Filter Method for Finding Approximate Repetitions in DNA Sequences [C]. Bioinformatics and Bioengineering,2006,45-52.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700