Prediction of Protein Secondary Structure and an Investigation of the Relationship Between Secondary and Tertiary Structure
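Abstract
The biological function of a protein is determined by its structure. With the successful completion of the Human Genome Project, protein sequence data have been accumulating far faster than the number of solved protein structures. The main experimental techniques for studying protein structure are X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and electron microscopy. Determining a structure experimentally, however, is costly and time-consuming, and it can run into technical difficulties that cannot currently be overcome. There is therefore strong interest in theoretical and computational methods that predict protein structure directly from sequence, which is one of the central problems in bioinformatics.
     At present, predicting the tertiary structure of a protein directly from its amino acid sequence remains very difficult, so much of the effort has focused on predicting secondary structure. Because secondary structure elements are the basic units from which a polypeptide chain folds in three-dimensional space, secondary structure prediction is usually treated as the first step in predicting a protein's spatial structure. It is an important intermediate step in tertiary structure prediction and a major challenge for the theory of protein folding.
     This thesis presents a new method built on tetrapeptide structural words, the tetrapeptide-based increment of diversity with quadratic discriminant analysis (TPIDQD) algorithm, and applies it to secondary structure prediction on two datasets of different sizes. It also studies the relationship between secondary and tertiary structure on a standard set of 325 samples.
     (1) The new prediction algorithm has three main steps. First, the frequencies with which the three defined kinds of tetrapeptide structural words (alpha, beta, coil) occur in the sequences are used as diversity sources, from which the standard sources are built. Second, the increment of diversity combined with quadratic discriminant analysis is used to predict the secondary structure of the central residue of any sequence fragment. Finally, post-processing corrections are applied: structural fluctuations in the prediction are removed, and tetrapeptide boundary words are used to correct the predicted structure boundaries.
     (2) TPIDQD is applied to secondary structure prediction on the CB513 dataset for the first time; the three-fold cross-validated accuracy Q_3 reaches 79.19%.
     (3) A new dataset of 1645 non-redundant protein chains is built, with structural resolution better than 3 Angstroms and sequence identity below 25%. Using TPIDQD to predict the structural state of the central residue of 21-residue fragments gives a ten-fold cross-validated Q_3 of 79.68%. Taking long-range sequence information into account, that is, using fragments longer than 21 residues, improves the result further. In addition, as the word library grows, a cross test on the 1645-chain dataset with CB513 as the training set also reaches an accuracy of 79%.
     (4) The relationship between secondary and tertiary structure is studied for 325 proteins. Using generalized secondary structure sequence information, we define a distance between two proteins and analyze its correlation with the distance between their tertiary structures, expressed as a similarity score. After the length dependence is removed, 300 of the correlation coefficients exceed the threshold at the significance levels α = 0.05 and α = 0.01.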
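The increment of diversity used in step (1) can be illustrated with a minimal Python sketch. It assumes the usual definitions from the measure-of-diversity literature, D(X) = N ln N - sum(n_i ln n_i) and ID(X, S) = D(X + S) - D(X) - D(S); the abstract does not state the formula, so this form, the function names, and the toy word lists are assumptions. The quadratic discriminant step and the post-processing corrections are not shown.

```python
import math
from collections import Counter


def diversity(counts):
    """Measure of diversity D(X) = N ln N - sum(n_i ln n_i); zero counts add nothing."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, s):
    """ID(X, S) = D(X + S) - D(X) - D(S) for count vectors over the same words.

    Smaller values mean the fragment X looks more like the standard source S.
    """
    merged = [a + b for a, b in zip(x, s)]
    return diversity(merged) - diversity(x) - diversity(s)


def tetrapeptide_counts(fragment, vocabulary):
    """Count overlapping 4-residue words of `fragment`, ordered as in `vocabulary`."""
    counts = Counter(fragment[i:i + 4] for i in range(len(fragment) - 3))
    return [counts[word] for word in vocabulary]


def closest_source(x, sources):
    """Return the name of the standard source with the smallest ID to x.

    `sources` maps a state name (hypothetical labels such as "alpha") to counts.
    """
    return min(sources, key=lambda name: increment_of_diversity(x, sources[name]))
```

A fragment whose word counts are proportional to a source gives an increment of zero, while a fragment that shares no words with the source gives a large positive value. In the method described above, such increments feed a quadratic discriminant function and are not used as a classifier on their own.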
Knowledge of a protein's structure is important for understanding its function. With the success of the Human Genome Project, a widening gap has appeared between the rapidly growing number of known protein sequences and the slowly growing number of known protein structures. The main experimental methods for high-resolution protein structure determination are X-ray crystallography, NMR, and electron microscopy. However, purely experimental approaches to structure determination are time-consuming and expensive, so theoretical and computational methods for predicting protein structure are becoming increasingly important.
     At present, predicting the three-dimensional (3D) structure of a protein directly from its amino acid sequence is a difficult task, and a large number of approaches have instead been developed to predict protein secondary structure. Secondary structure prediction is often regarded as the first step toward understanding and predicting tertiary structure, because secondary structure elements are the building blocks of the folding units. As an intermediate step, it therefore plays an important role in tertiary structure prediction.
     In this dissertation, we introduce a novel sequence-based method, tetra-peptide-based increment of diversity with quadratic discriminant analysis (TPIDQD for short), and apply it to protein secondary structure prediction on two different datasets. We also investigate the connection between protein secondary structure and 3D structure for 325 proteins.
     (1) The proposed TPIDQD method consists of three steps. First, the frequencies of three kinds of tetra-peptide structural words (alpha, beta, coil) occurring in a sequence fragment are used as the diversity source. Second, the increment of diversity combined with quadratic discriminant analysis (IDQD for short) is used to predict the structure of the central residue of a sequence fragment. Finally, the IDQD prediction is corrected by removing structure fluctuations and by adjusting the structure boundaries with tetra-peptide boundary words.
     (2) The TPIDQD method is applied to the CB513 dataset to predict the structure of the central residue of 21-residue fragments. The three-state overall per-residue accuracy (Q_3) reaches 79.19% in the three-fold cross-validation test.
     (3) An enlarged dataset is constructed, containing 1645 protein chains with resolution better than 3 Angstroms and sequence identity below 25%. Tested on this dataset, the TPIDQD method attains a Q_3 of 79.68% in the ten-fold cross-validation test for 21-residue fragments. The accuracy improves further when long-range sequence information is taken into account, that is, when fragments longer than 21 residues are used. Moreover, as the set of structural words grows, Q_3 reaches 79% in the independent test in which CB513 serves as the training set and the 1645-chain dataset as the test set.
     (4) We investigate the relation between protein secondary structure and 3D structure for 325 proteins. A distance between two proteins is defined from generalized secondary structure sequence information and correlated with the distance between their 3D structures, expressed as a similarity score. After the length dependence is removed, 300 of the correlation coefficients exceed the threshold at the significance levels α = 0.05 and α = 0.01.
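The Q_3 values reported in (2) and (3) are per-residue accuracies over three states. A minimal Python sketch of the measure follows; the one-letter labels H (helix), E (strand), and C (coil) are a common convention assumed here, not taken from the abstract, and the function name is hypothetical.

```python
def q3(predicted, observed):
    """Three-state per-residue accuracy, in percent.

    Both arguments are equal-length strings of state labels, one per residue.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example: 7 of 8 residues agree.
print(q3("HHHEECCC", "HHHEEECC"))  # 87.5
```

In a k-fold cross-validation, this quantity would be computed on the held-out residues of each fold; how the folds are pooled or averaged in the thesis is not stated in the abstract.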
