Relative Character Analysis and Burrows-Wheeler Methods for Biological Sequences

Original title (Chinese): 生物序列的相对特征分析及Burrows-Wheeler方法
Author: Yang Lianping (杨连平)
Degree level: Doctoral
Discipline: Applied Mathematics
Keywords: Sequence analysis; Computational biology; Relative character; Common substring; Local distance; Burrows-Wheeler method
Degree year: 2011
Advisor: Wang Tianming (王天明)
Discipline code: 070104
Degree-granting institution: Dalian University of Technology

Abstract

With the arrival of the post-genome era, large numbers of completely sequenced genomes and a range of new problems have emerged. Low-cost sequence comparison tools are expected to analyze and predict the structure and function of biological sequences more accurately and more quickly, reducing the high cost in time and money of experimental determination and analysis. This dissertation is devoted to biological sequence analysis and proposes several comparison models with distinctive features.
     Sequence comparison models are usually divided into two classes: alignment models and alignment-free models. Viewing the models instead through the topological framework of the comparison workflow, this dissertation proposes dividing them into character analysis models and relative character analysis models. Alignment models and models based on information compression both belong to the relative character analysis class. In relative character analysis models the similarity hypothesis is a core element, and analyzing that hypothesis reveals a model's main strengths and weaknesses.
     This dissertation focuses on two kinds of relative character analysis models: a comparison model based on common substrings between sequences, and the Burrows-Wheeler method. The common-substring model is derived by examining the relationship between the longest common substrings and the shortest absent words. Its main features are as follows: the algorithm runs in linear time, which makes it suitable for very long genomes; its local distance measure captures local similarity between genomes well, even when the region under consideration contains recombined fragments; and the cumulative local distance derived from the local distance measure effectively captures the overall similarity between genomes. The validity of the model is confirmed by subtyping complete HIV-1 genomes and their segments.
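The abstract does not define the local distance itself, so the following is only a minimal sketch of the quantity such a model plausibly builds on: for each position of one sequence, the length of the longest substring starting there that also occurs in the other sequence (the shortest absent word at that position, when one exists, is one character longer). The function names `match_lengths` and `windowed_mean` are hypothetical, the windowed average is an illustrative stand-in rather than the dissertation's measure, and this naive version is not linear-time (a suffix tree or suffix array would be needed for that).

```python
def match_lengths(a: str, b: str) -> list[int]:
    """For each position i of a, return the length of the longest prefix
    of a[i:] that occurs somewhere in b (naive substring search)."""
    lengths = []
    k = 0
    for i in range(len(a)):
        # If a[i-1:i-1+k] occurs in b, then a[i:i-1+k] does too,
        # so the next match length is at least k - 1.
        k = max(k - 1, 0)
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        lengths.append(k)
    return lengths


def windowed_mean(values: list[int], window: int) -> list[float]:
    """Mean of values over consecutive non-overlapping windows
    (an illustrative local summary; larger means more local sharing)."""
    means = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


if __name__ == "__main__":
    lengths = match_lengths("ACGT", "CGTA")
    print(lengths)                    # [1, 3, 2, 1]
    print(windowed_mean(lengths, 2))  # [2.0, 1.5]
```

Regions where the windowed mean is high share long substrings with the other sequence, which is the intuition behind a local similarity signal; a recombinant region would be expected to show different profiles against different reference sequences.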
     The Burrows-Wheeler method is the other relative character analysis model studied in detail in this dissertation. It rests on the Burrows-Wheeler transform, an important invertible transform from lossless data compression. The extended Burrows-Wheeler transform built on it can effectively measure the content of common factors shared between sequences. We propose a concept called the Burrows-Wheeler similarity distribution to describe the similarity between sequences. From this distribution we extract two numerical characteristics, expectation and information entropy, and compare gene sequences, protein sequences, and protein structure sequences using strategies suited to the features of each.
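To make the invertibility of the underlying transform concrete, here is a minimal sketch of the plain Burrows-Wheeler transform and its inverse using sorted rotations. It assumes a sentinel character `$` that does not occur in the input and sorts before every other symbol; the function names `bwt` and `inverse_bwt` are chosen here for illustration. It covers neither the extended transform over several sequences nor the similarity distribution, whose definitions are not given in this abstract.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (quadratic space;
    fine for illustration, not for whole genomes)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the transformed
    column to the partial table and re-sorting it."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded)               # annb$aa
    print(inverse_bwt(encoded))  # banana
```

The transform tends to group identical symbols that share a following context, which is why it is useful both for compression and for detecting factors shared between sequences.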

