Research on Information Analysis and Applied Algorithms of Gene Sequence and Structure

Original title (Chinese): 基因序列与结构的信息分析及应用算法研究
Author: Xiang Xuyu (向旭宇)
Thesis level: Doctoral
Discipline: Computer Application Technology
Keywords: Bioinformatics; Pairwise alignment; Multiple sequence alignment; Ant colony algorithm; Phylogenetic analysis
Degree year: 2010
Supervisor: Zhang Dafang (张大方)
Discipline code: 081203
Degree-granting institution: Hunan University
Thesis submission date: 2010-06-20
Chair of the defense committee: Li Renfa (李仁发)

Abstract

As the focus of human genome research shifts toward functional genomics, the massive volume of biological data opens broad prospects for life-science research while posing a severe challenge to existing data-processing capabilities. A central goal of bioinformatics is to mine valuable biological information from this vast body of sequence data in order to understand the structure, function, and evolution of genes and proteins. Information analysis of gene sequence and structure is therefore a very important research topic in bioinformatics.
     Information about gene sequence and structure is obtained by comparing sequences and structures, and alignment is the basis of that comparison. Such information ultimately serves the study of genome function and evolutionary relationships. Proteins are the products of gene expression and the executors of life activities. Because a protein's subcellular location is closely related to its function, localization information offers useful clues for functional studies, and the key to predicting subcellular location is obtaining more complete sequence feature information. Centered on the analysis of gene sequence and structure features, this thesis studies three topics: (1) new sequence and structure alignment methods that improve the accuracy of multiple sequence alignment for highly divergent sequences; (2) whole-genome phylogenetic analysis based on graphical representation; (3) protein subcellular localization prediction based on composite features.
     The main results are summarized as follows:
     (1) In sequence alignment based on minimum edit distance, not every step of the dynamic programming process needs to be carried out. Building on this observation, the thesis proposes a more efficient non-dynamic-programming algorithm that runs in O(nL) time and O(n) space. By comparison, the fastest other algorithm, proposed by Pevzner and Waterman, requires O(l+Ln) time and O(l+Ln) space.
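For orientation, the dynamic-programming baseline that item (1) sets out to improve can be sketched as follows. This is the textbook row-by-row computation of minimum edit distance, not the thesis's non-dynamic-programming algorithm, whose details the abstract does not give. The function name `edit_distance` and the unit-cost model (insertion, deletion, and substitution each cost 1) are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum edit distance with unit costs, keeping only one table row in memory."""
    if len(a) < len(b):
        a, b = b, a  # iterate the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

The sketch fills every cell of the table, so its running time grows with the product of the two sequence lengths; the abstract's observation is that not all of this work is necessary.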
     (2) To address the high computational complexity of multiple sequence alignment (MSA), the thesis uses a planar graph representation to describe the alignment process. The representation accounts for every possible alignment and defines gap insertion, the iterative information value on each candidate path, and the scoring rules. A hybrid ant colony and genetic algorithm is then introduced to search the solution space for near-optimal solutions. Combining the strengths of the two algorithms improves the ability to find feasible solutions, avoids premature convergence, and effectively raises the identical-column score.
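The identical-column measure mentioned in item (2) can be illustrated with a short sketch. The abstract does not define the measure, so the definition used here is an assumption: a column counts when every sequence carries the same residue in it and none has a gap. The function name `identical_columns` and the gap symbol `-` are likewise hypothetical.

```python
from typing import Sequence


def identical_columns(alignment: Sequence[str], gap: str = "-") -> int:
    """Count alignment columns in which all rows hold the same non-gap residue."""
    if not alignment:
        return 0
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    # zip(*alignment) yields one tuple of residues per column
    return sum(1 for col in zip(*alignment)
               if col[0] != gap and len(set(col)) == 1)


rows = ["AC-GT",
        "ACAGT",
        "AC-GA"]
print(identical_columns(rows))
```

A search procedure such as the hybrid described above could use a score of this kind to compare candidate alignments, typically alongside other criteria.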
     (3) Existing representations of RNA secondary structure suffer from high complexity, degeneracy, and the possibility that different structures map to the same representation. To address these problems, the thesis proposes three-bit and four-bit encodings of RNA secondary structure and compares structures using the binary exclusive-OR (XOR) operation. The encoding displays structural information simply and directly, which helps visualize mutation analysis and thereby infer disease mechanisms. It also provides a good mathematical model for structure comparison: similarities and differences between structures are easy to find, which facilitates gene detection and the prediction of functional gene regions. The method distinguishes free (unpaired) bases from base pairs, together with their positions, and also distinguishes different substructure classes, including pseudoknots.
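The idea in item (3) of encoding each position in a few bits and comparing structures with XOR can be sketched as below. The actual code tables of the thesis are not given in the abstract, so the three-bit table here is hypothetical: the high bit marks whether the base is paired and the two low bits identify the base. The input format (a sequence plus a bracket string in which `.` means unpaired) and the names `encode` and `xor_compare` are also assumptions.

```python
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "U": 0b11}  # hypothetical 2-bit base codes


def encode(seq: str, structure: str) -> list:
    """Return one 3-bit code per position: (paired flag << 2) | base code."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    codes = []
    for base, mark in zip(seq.upper(), structure):
        paired = 0 if mark == "." else 1  # any bracket character counts as paired
        codes.append((paired << 2) | BASE_BITS[base])
    return codes


def xor_compare(codes_a: list, codes_b: list) -> list:
    """Position-wise XOR; a zero means the two structures agree at that position."""
    if len(codes_a) != len(codes_b):
        raise ValueError("encodings must have the same length")
    return [x ^ y for x, y in zip(codes_a, codes_b)]


folded = encode("GCAU", "(..)")
open_chain = encode("GCAU", "....")
diff = xor_compare(folded, open_chain)
print(diff, sum(1 for d in diff if d))
```

In the XOR result, a set high bit signals a change in pairing state and set low bits signal a base substitution, which is the kind of position-level difference a mutation analysis would look for.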
     (4) Phylogenetic analysis normally requires a guide tree, and generated guide trees are often poor approximations. Drawing on the idea of representing biological sequences graphically, the thesis proposes a new two-dimensional graphical representation of DNA sequences and, based on it, a new method for analyzing evolutionary relationships from whole-genome sequences. The method obtains evolutionary distances by measuring the differences between the two-dimensional curves. In a similarity/dissimilarity experiment on coronavirus DNA sequences, the phylogenetic tree built with the PHYLIP package agrees with the accepted evolutionary tree. The method replaces the evolutionary distance matrix with a whole-genome similarity matrix and needs no multiple sequence alignment. It reflects the relationships among species well and greatly reduces computational and time complexity.
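The alignment-free pipeline in item (4), which maps each sequence to a 2D curve and then measures the difference between curves, can be sketched with a deliberately simple stand-in. The thesis's own graphical representation and curve measure are not specified in the abstract. Here each nucleotide is assumed to move one unit along a fixed direction, and two curves are compared by the Euclidean distance between their centroids; the step table and the names `walk_2d`, `centroid`, and `curve_distance` are all hypothetical.

```python
import math

# Assumed direction per nucleotide; other assignments are equally possible.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def walk_2d(seq: str) -> list:
    """Return the cumulative (x, y) points traced by the sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    x = y = 0
    points = []
    for base in seq.upper():
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def centroid(points: list) -> tuple:
    """Mean x and mean y of the curve points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def curve_distance(seq_a: str, seq_b: str) -> float:
    """Distance between the centroids of the two sequence curves."""
    (ax, ay), (bx, by) = centroid(walk_2d(seq_a)), centroid(walk_2d(seq_b))
    return math.hypot(ax - bx, ay - by)


print(curve_distance("AAAA", "GGGG"))
```

Computing such a value for every pair of genomes gives a matrix that a distance-based tree-building program can take as input, with no multiple sequence alignment step in between.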
     (5) The thesis introduces a protein sequence encoding based on distance frequency. Each protein sequence is represented by a 220-dimensional composite feature vector comprising the 20 amino acid composition values and 200 distance frequencies between identical amino acids. A support vector machine is then used to predict protein subcellular location, and the experimental results demonstrate the effectiveness of the method.
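One way to arrive at a 220-dimensional vector of the kind described in item (5) is sketched below. The abstract states only the 20 + 200 split, so the details are assumptions: the 200 values are taken as 20 amino acids times 10 distance bins, a distance is the gap between consecutive occurrences of the same amino acid, and gaps longer than 10 fall into the last bin. The names `feature_vector` and `MAX_DIST` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MAX_DIST = 10  # assumed number of distance bins per amino acid (20 x 10 = 200)


def feature_vector(seq: str) -> list:
    """Return 20 composition values followed by 200 distance frequencies."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Composition: fraction of the sequence made up of each amino acid.
    # Non-standard symbols count toward the length but get no entry.
    composition = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    distance_freq = []
    for aa in AMINO_ACIDS:
        positions = [i for i, c in enumerate(seq) if c == aa]
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        counts = [0] * MAX_DIST
        for g in gaps:
            counts[min(g, MAX_DIST) - 1] += 1  # long gaps go to the last bin
        total = len(gaps)
        distance_freq.extend(c / total if total else 0.0 for c in counts)
    return composition + distance_freq


vec = feature_vector("AAKA")
print(len(vec), vec[0], vec[20:22])
```

Vectors built this way have a fixed length regardless of protein length, which is what allows them to be passed directly to a support vector machine classifier.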

