Research on Feature Extraction Algorithms and Dataset Construction Techniques for Membrane Protein Classification
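Abstract
As one of the main components of biological membranes, membrane proteins play an extremely important role in living organisms. They are the main carriers of membrane function and the material basis on which cells carry out their various functions. Research reports in recent years further indicate that changes in the structure or function of certain membrane proteins are closely linked to the onset of human diseases, and the corresponding receptor membrane proteins have become important targets for drug design. This thesis therefore takes membrane proteins as its object of study.
     The Human Genome Project (HGP), proposed in the early 1990s, has achieved a great deal through the joint efforts of scientists around the world and has greatly advanced genomics and proteomics. With the massive growth of biological data, bioinformatics methods that rely on computer technology have gone beyond earlier research approaches. Predicting the type of a membrane protein from its primary sequence, in order to obtain information about its higher-order structure and function and thereby address the associated biological problems, is an extremely important and challenging task, and it is the aim of this thesis.
     Assembling the dataset used for membrane protein classification is the foundation and prerequisite of the whole classification model. The quality of the dataset determines the accuracy of the algorithm, which makes it one of the key elements of computational membrane protein classification. Feature extraction from membrane protein sequences is the most basic problem in this line of research and is also decisive for classification quality. This thesis analyzes the commonly used dataset construction criteria, selects membrane protein sequences from the latest release of the SWISS-PROT database, and builds a new membrane protein dataset. Starting from the primary sequence, it studies the prediction of membrane protein structural and functional types, surveys the sequence feature extraction algorithms and classification algorithms already used in this field, and examines the mathematical principles behind them. On that basis it constructs a new feature extraction method for membrane proteins and, on the newly built dataset, compares the performance of the new algorithm with that of other membrane protein classification models.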
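     1) Construction of a new membrane protein sequence dataset
     The membrane proteins used for classification come from the SWISS-PROT protein database. The standard datasets in common use, CE2059 and CE2625, were built on SWISS-PROT 35.0 (1997). The database changes quickly: protein sequences are continually updated, the number of entries and the overall scale keep growing, and the classification annotations become more precise. Building an up-to-date dataset is therefore a laborious but significant task for membrane protein classification research. After analyzing the age and incomplete annotation of CE2059 and CE2625, this thesis selects membrane protein sequences from the latest release of the public SWISS-PROT database, SWISS-PROT Release 57.0 (2009), according to the accepted construction criteria for standard datasets, and assembles them into a new, more complete standard training dataset. This supplements the existing resources in the field and lays the data foundation for subsequent research.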
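     2) A feature extraction algorithm that builds autocorrelation coefficients from multiple amino acid residue indices
     The feature extraction algorithm is the other key element of membrane protein classification and is decisive for classification quality. To obtain a prediction model with better classification performance, this thesis adds sequence-order correlation information about the amino acid residues to the amino acid composition of the sequence, so as to extract more of the structural and functional information contained in membrane protein sequences. Taking into account the physicochemical properties of the residues and their long-range correlations, it proposes a feature extraction algorithm that builds autocorrelation coefficients from multiple amino acid residue indices, followed by feature dimensionality reduction to optimize the dimension and reduce the amount of computation. The model is trained on the newly built membrane protein sequence dataset. Its overall prediction accuracies under the self-consistency test, the Jackknife test, and the independent test set are 96.78%, 91.03%, and 86.93%, respectively. Compared with existing membrane protein classification models, prediction accuracy improves across the board, which provides a good basis for further research on membrane protein classification.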
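The combination of amino acid composition with index-based autocorrelation coefficients described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual algorithm: the abstract does not say which residue indices, how many lags, or which normalization were used. Here the Kyte-Doolittle hydropathy scale stands in for one index, the autocorrelation is the commonly used averaged product of standardized index values at lag k, and the function names and the `max_lag` parameter are hypothetical. The dimensionality reduction step is not shown.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale, used here only as an example index
# (assumption: the thesis does not name the indices it selected).
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def standardize(index):
    """Rescale an index to zero mean and unit variance over the 20 residues."""
    values = [index[a] for a in AMINO_ACIDS]
    mu, sigma = mean(values), pstdev(values)
    return {a: (index[a] - mu) / sigma for a in AMINO_ACIDS}


def amino_acid_composition(seq):
    """Return the 20 residue frequencies (AAC) of a sequence."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def autocorrelation(seq, index, max_lag):
    """Return r_1..r_max_lag, where r_k averages h(i) * h(i+k) along the chain."""
    h = [index[a] for a in seq]
    n = len(h)
    return [
        sum(h[i] * h[i + k] for i in range(n - k)) / (n - k)
        for k in range(1, max_lag + 1)
    ]


def extract_features(seq, indices, max_lag=5):
    """Concatenate AAC with autocorrelation coefficients for each index."""
    if len(seq) <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    if set(seq) - set(AMINO_ACIDS):
        raise ValueError("sequence contains non-standard residues")
    features = amino_acid_composition(seq)
    for index in indices:
        features.extend(autocorrelation(seq, standardize(index), max_lag))
    return features
```

With one index and `max_lag=5`, each sequence maps to a vector of 20 + 5 = 25 values; every additional index adds another `max_lag` components, which is why a reduction step becomes attractive as more indices are included.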
As one of the main components of biomembranes, membrane proteins play a vital role in organisms. They are the main carriers of biomembrane function and form the material basis on which cells carry out their various functions. Moreover, recent research reports indicate that changes in the structure or function of some membrane proteins are closely related to the onset of human diseases, and the relevant receptor membrane proteins have become important targets for drug design. That is why this thesis focuses on membrane proteins.
     The Human Genome Project (HGP), proposed in the early 1990s, has made tremendous achievements through the united efforts of scientists all over the world, and genomics and proteomics have developed greatly as a result. With the unprecedented growth of biological data, bioinformatics, an approach based on computer technology, is going beyond the traditional means of research.
     Predicting the type of a membrane protein from its primary sequence, in order to gain information about its higher-order structure and function, is crucial fundamental research in the study of membrane protein structure and function. This important and challenging work also provides clues for solving specific biological problems, which is the goal of this thesis.
     The construction of the membrane protein dataset is the foundation and premise of the whole prediction model. Its quality influences the accuracy of the algorithm, which makes it one of the dominant elements in research on membrane protein classification. Feature extraction from membrane protein sequences is another basic technique in computational protein classification and is also a key factor in classification performance. This thesis collects membrane protein sequences from the latest release of SWISS-PROT to build a newer, more comprehensive, and more balanced dataset according to the construction standards of the common datasets CE2059 and CE2625. Starting from the primary sequences of membrane proteins, it studies the classification of their structures and functions and proposes a new feature extraction algorithm evaluated on the new dataset. Further testing and analysis of the algorithm are ongoing. The main work in this thesis is summarized as follows:
     (1) Construction of the new dataset for membrane proteins. The construction of the dataset is one of the dominant elements in research on membrane protein classification. The commonly used datasets in this field, CE2059 and CE2625, are based on SWISS-PROT Release 35 from 1997. As the database develops, the number, scale, and annotations of membrane protein sequences are updated regularly, which makes a new dataset built from the latest data both significant and necessary. This thesis builds a larger and more balanced dataset from the latest SWISS-PROT Release 57.0 (2009), following the common construction criteria of the standard datasets, and thereby provides an important and necessary preparation for further study.
     (2) A new feature extraction algorithm. Feature extraction is another key process in this field. To obtain a classification model with better prediction accuracy and to mine more of the structural and functional information in membrane protein sequences, this thesis takes into account the physicochemical properties of amino acid residues and the long-range correlations between them. It constructs a novel membrane protein classification model that combines two classes of features, the amino acid composition (AAC) and several residue indices from the amino acid index database, with a support vector machine (SVM). Under three typical tests (self-consistency, Jackknife, and independent dataset), the prediction accuracy on the new membrane protein dataset described above is 96.78%, 91.03%, and 86.93%, respectively. Compared with existing models, the method performs well and shows a notable improvement.
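The difference between the self-consistency and Jackknife tests reported above can be illustrated with a small sketch. It is not the thesis's evaluation code: the thesis uses an SVM, whereas this example swaps in a simple nearest-centroid classifier so that it stays self-contained, and all function names are hypothetical. In the self-consistency test each sample is predicted by a model trained on the full set, including itself. In the Jackknife (leave-one-out) test each sample is held out in turn and predicted from the remaining samples.

```python
import math


def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier (not an SVM): pick the class with the closest mean vector."""
    sums, counts = {}, {}
    for vec, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, math.inf
    for label, acc in sums.items():
        centroid = [value / counts[label] for value in acc]
        dist = math.dist(centroid, x)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def self_consistency_accuracy(xs, ys, predict):
    """Train on all samples, then predict those same samples."""
    hits = sum(predict(xs, ys, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def jackknife_accuracy(xs, ys, predict):
    """Hold out each sample once and predict it from all the others."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            hits += 1
    return hits / len(xs)
```

Because the held-out sample never contributes to its own prediction, the Jackknife estimate is usually the lower of the two, which matches the ordering of the accuracies reported in the abstract. A class with a single member shows the effect most clearly: it is always recovered in the self-consistency test and always missed in the Jackknife test.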
