Predicting Protein Interactions with Biostatistical Methods
Abstract
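Proteins are the principal material carriers of life activities, and no life process can proceed without them. Predicting the function and mechanism of action of proteins has become a very active topic in the life sciences. Many proteins carry out their biological functions by interacting with other proteins, and these interactions play a key role at the level of cell biology. First, genetic interactions are often associated with interactions between the corresponding proteins. Second, signal-transduction pathways require protein interactions. Third, interactions between enzymes and their protein substrates are closely tied to biological catalysis. Finally, protein interactions are critical to the integrity of multicomponent enzymatic machines such as RNA polymerase. Studying protein interactions, and identifying the proteins that interact with a given protein, is therefore very important for understanding protein function.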
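     This thesis first downloads protein interaction data from the DIP database and selects from it the positive set needed for the experiments, then builds the negative set using the subcellular localization classification provided by the MIPS database. Working from the primary structure of the proteins, we first encode the protein sequences with the CTD encoding method from the literature, extract the statistical features contained in the sequences, and build and test models with the support vector machine (SVM) algorithm, obtaining an average accuracy above 79%. We then apply different variable-selection strategies; after the encoding is optimized, 5-fold cross-validation gives an accuracy of 82.43%, more than 5 percentage points above the cross-validation result reported in the literature (76.9%). Next, we use four further encoding methods that encode the sequences from different angles, extract variables, and combine them with SVM for prediction; all four give better results than the literature value. The best of these, amino acid pair (dipeptide) encoding, reaches a 5-fold cross-validation accuracy of 85.91%, 9 percentage points above the literature value. Notably, among these four methods, single amino acid encoding, amino acid pair encoding and pseudo-amino acid encoding had previously been used only for other biological recognition problems. Gaussian function distribution encoding is a new encoding method that we propose; it makes reasonable use of more of the effective information, and its prediction performance is close to that of amino acid pair encoding, with accuracy also above 85%. Finally, we introduce a consensus model into protein interaction prediction: several member sub-models are built with different encoding methods and then combined in a two-layer SVM fusion network. This exploits the strengths of the different encoding ideas and the complementarity between the models, and further improves prediction performance to a best accuracy of 86.80%, which is, to our knowledge, the best classification result reported internationally to date.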
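To make the amino acid pair encoding mentioned above concrete, the following is a minimal sketch in Python. The abstract does not give the exact definition used in the thesis, so this assumes the common form: the relative frequency of each of the 400 ordered pairs of adjacent standard residues. The function names `dipeptide_composition` and `encode_pair`, and the choice to represent a protein pair by concatenating the two vectors, are illustrative assumptions rather than the author's implementation.

```python
from itertools import product

# The 20 standard amino acids and the 400 ordered adjacent-residue pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the relative frequency of each of the 400 dipeptides in seq."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs that contain a non-standard residue
            counts[pair] += 1
            total += 1
    if total == 0:  # sequence too short or no valid pairs
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def encode_pair(seq_a, seq_b):
    """Assumed scheme: concatenate the two proteins' 400-dim vectors."""
    return dipeptide_composition(seq_a) + dipeptide_composition(seq_b)
```

Each protein pair then becomes an 800-dimensional vector that could be passed to an SVM implementation. Note that simple concatenation is order-dependent, so a real pipeline would need to decide how to treat the pairs (A, B) and (B, A).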
     This thesis is divided into four main parts:
Proteins are the primary components of the cellular machinery, and no organism can function without them. Today, predicting the function and mechanism of proteins is one of the most important topics in the life sciences. Many proteins mediate their biological function through protein interactions, and these interactions are crucial for many aspects of cellular biology. First, genetic interactions often correlate with physical interactions between the corresponding gene products. Second, protein interactions are required to physically tether the components of signal-transduction pathways. Third, enzyme-protein substrate interactions are important for catalysis and are often found to be more stable than presumed. Finally, protein interactions are crucial for the integrity of multicomponent enzymatic machines such as RNA polymerases and the spliceosome. Computational prediction of protein interactions has therefore been pursued on the assumption that identifying the interaction partners of proteins of unknown function can provide insight into their biological function.
    In this work, the positive dataset is downloaded from the Saccharomyces cerevisiae core subset of the DIP database. Because a dataset of noninteracting proteins is not readily available, a hypothetical one is generated from subcellular localization information retrieved from the MIPS database; it consists of protein pairs that do not colocalize. First, using only the amino acid sequence, each protein sequence is converted into a feature vector with the CTD encoding approach. A set of SVMs is trained to predict protein interactions, and the prediction accuracy averages 79% over the ensemble of statistical experiments. After the set of feature variables is optimized by different strategies, the predictive accuracy obtained through 5-fold cross-validation is 82.43%, about 5 percentage points higher than the literature. We then predict protein interactions with four other encoding approaches. All the results are better than the literature. The predictive
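The construction of the hypothetical noninteracting dataset described above can be sketched as follows. This is a minimal illustration in Python, not the thesis's actual procedure: the protein identifiers, compartment names and the function `build_negative_pairs` are hypothetical, and the sketch assumes a pair is a negative candidate when both proteins are annotated, share no compartment, and do not appear in the positive set.

```python
from itertools import combinations


def build_negative_pairs(localization, positive_pairs):
    """Return pairs of annotated proteins that share no compartment.

    localization: dict mapping protein ID -> set of compartment names
    positive_pairs: iterable of (protein, protein) tuples known to interact
    """
    known = {frozenset(p) for p in positive_pairs}  # order-independent lookup
    negatives = []
    for a, b in combinations(sorted(localization), 2):
        if not localization[a] or not localization[b]:
            continue  # unannotated protein: no evidence either way
        if localization[a] & localization[b]:
            continue  # the two proteins colocalize, so they might interact
        if frozenset((a, b)) in known:
            continue  # never label a known interaction as negative
        negatives.append((a, b))
    return negatives


# Hypothetical toy annotations, for illustration only.
toy_localization = {
    "P1": {"nucleus"},
    "P2": {"mitochondrion"},
    "P3": {"nucleus", "cytoplasm"},
    "P4": {"cytoplasm"},
}
toy_positives = [("P4", "P2")]
toy_negatives = build_negative_pairs(toy_localization, toy_positives)
```

In practice the candidate list is far larger than the positive set, so a random sample of it, matched in size to the positive set, is a common choice. Whether the thesis balanced the two sets this way is not stated in the abstract.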
