Symmetric Prediction of Protein-Protein Interactions Based on Pairwise Kernels
Abstract
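Proteins are the direct executors of life activities, and interactions between proteins are one of the main ways in which proteins carry out their functions. Constructing protein-protein interaction (PPI) networks is therefore a prerequisite for understanding molecular biological function and the principles of cellular life, and is key to studying the onset and development of disease and, in turn, to identifying drug targets. Methods for predicting protein-protein interactions have been a focus of bioinformatics in recent years, because they can effectively overcome the shortcomings of experimental detection: long cycles, high cost, and high false-positive rates. Symmetric prediction and the choice of kernel function are the two key factors in kernel-based machine-learning prediction of protein-protein interactions; they bear directly on the effectiveness and accuracy of the predictive model.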
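     Taking the symmetry of protein-protein interactions as its starting point, this thesis studies the necessity of pairwise kernels for guaranteeing symmetric prediction of protein-protein interactions, reveals the biases that traditional kernel methods and traditional negative datasets introduce into the prediction, and proposes schemes and algorithms to remove these biases. On this basis, the unbiased predictive model is applied to predicting protein-protein interactions in soybean, with good results.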
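     First, the thesis reveals that traditional kernel methods are biased by a dependence on protein order when predicting protein-protein interactions. After a thorough analysis of how existing pairwise kernel functions are constructed, it proposes a new pairwise kernel function that guarantees symmetric prediction, and uses it to build a multiple-kernel combination model that achieves higher prediction accuracy than existing methods.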
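     Protein-protein interactions are inherently symmetric: ‘protein A interacts with B’ is equivalent to ‘protein B interacts with A’. In traditional machine-learning methods, when training and test examples are formed by concatenating two proteins in order, an ordinary kernel cannot recognize that one example consists of two proteins, so it becomes sensitive to the order of the proteins, which produces prediction bias. The bias shows up as contradictory conclusions from the classifier, such as ‘protein A interacts with B’ but ‘protein B does not interact with A’.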
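     Pairwise kernels overcome the limitation of traditional kernels, which take the example as the unit of similarity measurement, by taking the protein as the unit instead, and thus effectively guarantee symmetric prediction of protein-protein interactions. The thesis emphasizes the necessity of pairwise kernels for symmetric prediction, summarizes the general properties of several existing pairwise kernel functions with respect to symmetry, positive definiteness, and balance, and analyzes and distills the general principles by which they improve prediction performance. On this basis, it proposes a new pairwise kernel function, AMPK (Arcsin Maximum Pairwise Kernel), and builds multiple-kernel combination models of AMPK based on the Cosine kernel and the Laplacian kernel; the model achieves better performance than existing kernel methods in predicting interactions of protein complexes.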
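     Second, the thesis reveals that on traditional datasets with simple sequence features (amino-acid triplets), predicting protein-protein interactions with pairwise kernel methods is severely biased. It proposes a method for constructing a rational negative dataset, so that classifier performance can be evaluated fairly and objectively.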
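     Because the positive and negative datasets used by traditional methods have the properties of a scale-free network and a random network respectively, some proteins, known as hub nodes, occur with very different frequencies in the positive and negative sets and form so-called ‘dominant examples’. Influenced by the dominant examples in the training set, pairwise kernel classifiers tend to predict test examples that contain hub nodes as positive and test examples that contain non-hub proteins as negative. This bias is especially pronounced on data based on simple sequence features (amino-acid triplets) and leads to over-optimistic estimates of classifier performance.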
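     To address this, the thesis proposes a method that constructs a rational negative dataset by ‘balanced random sampling’, designed around the scale-free structure of the positive dataset. It removes the structural difference between the positive and negative datasets by ensuring that each protein occurs roughly the same number of times in both. On the rational negative dataset, classifier performance can be evaluated fairly and objectively. Finally, the thesis demonstrates how far complex sequence features (Pfam domains) are affected by the prediction bias, and their positive contribution to predicting protein-protein interactions.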
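     Third, using the recently sequenced soybean genome for the first time, the thesis combines the traditional homologous-PPI inference method with the unbiased pairwise kernel predictive model proposed here, and infers and predicts 10 426 soybean protein-protein interactions.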
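     Constructing the soybean protein-protein interaction network is an important task following the completion of soybean genome sequencing. Using soybean genome data as the source for the first time, the thesis combines homologous-PPI (interolog) inference with domain-based pairwise kernel prediction to obtain more than ten thousand soybean protein-protein interactions. First, taking the PPIs of three source species (Arabidopsis, yeast, and human) as source data, it searches for their interologs in soybean to obtain a candidate set of soybean protein-protein interactions. Then it proposes a cross-species training/testing scheme: exploiting the conservation of domains and their interactions across species, it builds an unbiased pairwise kernel predictive model on InterPro domains from the source-species data and applies the model to the soybean PPI candidate set to filter out false positives. Cross-validation shows that the predictions are highly reliable, indicating that the approach is of considerable reference value for predicting protein-protein interactions in newly sequenced species. Finally, the thesis analyzes the resistance functions of soybean protein complexes and identifies patterns of interaction among soybean resistance genes/proteins.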
Proteins are directly involved in biological processes and often exert their functions through protein-protein interactions (PPIs). Constructing PPI networks is therefore a prerequisite for investigating molecular functions and understanding how cells work, and it is key to studying how diseases arise and develop and to identifying drug targets. In silico prediction of protein-protein interactions has recently emerged as an important area of bioinformatics because it can overcome the drawbacks of wet-lab experiments: long turnaround, high cost, and high false-positive rates. Among the machine-learning approaches for predicting interactions, kernel-based methods are popular for their robustness and high performance. However, how to keep kernel-based predictions symmetric, so that ‘A is predicted to interact with B’ is equivalent to ‘B is predicted to interact with A’, has not been well studied, and the symmetry problem directly affects the effectiveness and accuracy of these predictive models.
     This thesis therefore focuses on preserving the symmetry of protein-protein interaction predictions by using pairwise kernels, which measure the similarity between pairs of proteins symmetrically. It reveals the biases that originate from traditional kernel-based predictors and from traditional training datasets, and proposes methods for removing them. As an application, unbiased predictive models are built and used to predict a large number of protein-protein interactions in soybean for the first time.
     More specifically, the thesis makes three main contributions:
     First, the thesis reveals that traditional kernel-based methods are biased by the order in which the two proteins are presented. Pairwise kernels are introduced to fix the problem, and a new pairwise kernel is proposed that builds on properties already shown to be useful for predicting protein-protein interactions.
     Protein-protein interactions are symmetric. However, when an example is formed by simply concatenating two proteins, with one protein as the first half of the example and the other as the second half, traditional kernel functions handle it poorly. They cannot ‘split’ an example into its two proteins and are therefore sensitive to protein order, which can yield inconsistent predictions such as ‘A interacts with B’ but ‘B does not interact with A’.
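     The order sensitivity can be seen in a small sketch. It assumes each protein is a plain feature vector and uses a Gaussian (RBF) kernel as a stand-in for a traditional kernel; the toy vectors, the `gamma` value, and the function names are illustrative assumptions, not taken from the thesis.

```python
import math

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel on two equal-length feature vectors."""
    return math.exp(-gamma * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def concat_kernel(pair1, pair2):
    """Ordinary kernel applied to concatenated examples [first | second]."""
    (a, b), (c, d) = pair1, pair2
    return rbf(a + b, c + d)  # list concatenation builds the example vector

# Toy feature vectors for two proteins (hypothetical values).
A = [1.0, 0.0]
B = [0.0, 1.0]

# The same interaction written in both orders.
same_order = concat_kernel((A, B), (A, B))
swapped = concat_kernel((A, B), (B, A))

print(same_order, swapped)  # the second value is much smaller than the first
```

     Although (A, B) and (B, A) describe the same interaction, the kernel treats them as dissimilar examples, which is how a classifier can end up contradicting itself.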
     Pairwise kernels are adopted to remove the asymmetry introduced by traditional kernels. A pairwise kernel function treats proteins, rather than examples, as the minimal ‘unit’ and considers both the ‘normal’ and the ‘reverse’ order when measuring the similarity between two pairs of proteins. The thesis underlines that pairwise kernels are necessary for symmetric prediction and summarizes the principles for constructing pairwise kernel functions: symmetry, positive (semi-)definiteness, and balance between variables. Based on these principles, a novel pairwise kernel, AMPK (Arcsin Maximum Pairwise Kernel), is created, which performs on par with the current best pairwise kernel. A novel combination model of pairwise kernels, ‘AMPK based on Cosine plus AMPK based on Laplace’, is also proposed and is shown to outperform existing kernels and kernel combinations in predicting interactions of protein complexes.
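     As a rough illustration of how a pairwise kernel enforces symmetry, the sketch below implements the tensor product pairwise kernel, a well-known pairwise kernel from the literature. It is not AMPK, whose formula is not given in this abstract; the base kernel, toy vectors, and names are assumptions chosen for illustration.

```python
import math

def rbf(x, y, gamma=0.5):
    """Base kernel between two individual proteins (feature vectors)."""
    return math.exp(-gamma * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def tppk(pair1, pair2, base=rbf):
    """Tensor product pairwise kernel between two protein pairs.

    Both the 'normal' matching (a-c, b-d) and the 'reverse' matching
    (a-d, b-c) are summed, so swapping proteins within a pair
    leaves the value unchanged.
    """
    (a, b), (c, d) = pair1, pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)

# Toy feature vectors for four proteins (hypothetical values).
A, B = [1.0, 0.0], [0.0, 1.0]
C, D = [0.9, 0.1], [0.2, 0.8]

k_ab = tppk((A, B), (C, D))
k_ba = tppk((B, A), (C, D))  # same interaction, proteins swapped
print(math.isclose(k_ab, k_ba))
```

     Because the kernel value does not depend on the order within a pair, any classifier built on it gives the same decision for (A, B) and (B, A).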
     Second, the performance of pairwise kernel-based classifiers is found to be artificially inflated when simple sequence features (3mers, i.e. three neighboring residues) are used on traditional datasets, in which the negative dataset is built by ‘simple random sampling’. A novel ‘balanced random sampling’ method is proposed to overcome this bias by constructing a rational negative dataset, on which classifier performance can be evaluated objectively.
     The traditional PPI positive dataset forms a scale-free network, whereas the traditional PPI negative dataset forms a random network. As a result, hub nodes, which are highly connected to other nodes in the positive dataset, appear much less frequently in the traditional negative dataset. This difference in how often each protein appears in the positive and negative datasets biases the prediction of protein-protein interactions, and the bias is especially severe when 3mers are used as sequence features. In that case, pairwise kernels tend to label examples that involve hub proteins as ‘positives’ and examples that do not as ‘negatives’. Such predictions rest purely on how often each protein appears in the dataset; they carry no real predictive value, yet they make prediction performance appear artificially high.
     To remove these biases, ‘balanced random sampling’ is proposed. It creates a rational negative dataset that, like the positive dataset, has a scale-free structure. During balanced random sampling, each protein appears roughly as many times in the negative dataset as in the positive dataset, so the bias caused by differing occurrence counts is removed. Rational datasets provide a basis for objective evaluation of pairwise kernel-based classifiers and show that previous performance estimates based on 3mer features were over-optimistic. Complex sequence features, i.e. Pfam domains, are shown to be less sensitive to the traditional datasets than 3mer features and to contribute positively to the prediction of protein-protein interactions.
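     One way to picture balanced sampling is the sketch below. It is a minimal illustration under stated assumptions, not the thesis's algorithm: each protein receives a quota equal to its number of occurrences in the positive set, and negative pairs are drawn in proportion to the remaining quotas. The function name, the quota scheme, and the toy protein IDs are hypothetical.

```python
import random
from collections import Counter

def balanced_negative_sampling(positives, seed=0):
    """Draw negative pairs so that no protein appears more often in the
    negatives than in the positives (and usually about as often)."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in positives}
    # Quota = how many times each protein occurs in the positive set.
    quota = Counter(protein for pair in positives for protein in pair)
    negatives = set()
    while True:
        remaining = [p for p in quota if quota[p] > 0]
        if len(remaining) < 2:
            break
        # Pick the first protein in proportion to its remaining quota.
        a = rng.choices(remaining, weights=[quota[p] for p in remaining])[0]
        partners = [p for p in remaining
                    if p != a
                    and frozenset((a, p)) not in known
                    and frozenset((a, p)) not in negatives]
        if not partners:
            quota[a] = 0  # no valid partner left: give up on this protein
            continue
        b = rng.choices(partners, weights=[quota[p] for p in partners])[0]
        negatives.add(frozenset((a, b)))
        quota[a] -= 1
        quota[b] -= 1
    return sorted(tuple(sorted(pair)) for pair in negatives)

# Toy positive set with one hub protein "H" (hypothetical IDs).
positives = [("H", "p1"), ("H", "p2"), ("H", "p3"),
             ("p1", "p4"), ("p2", "p5"), ("p3", "p6")]
negatives = balanced_negative_sampling(positives, seed=1)
print(Counter(x for pair in negatives for x in pair))
```

     Unlike simple random sampling, the hub protein keeps a high occurrence count in the negative set, so a classifier cannot score well merely by recognizing hubs.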
     Third, the newly sequenced Glycine max (soybean) genome is used to infer a large number of soybean protein-protein interactions for the first time. The inference combines the conventional method of homologous protein-protein interactions (interologs) with the unbiased kernel-based predictive model described above, yielding 10 426 high-confidence soybean protein-protein interactions.
     Predicting soybean protein-protein interactions is one of the main tasks following the sequencing of the soybean genome. More than ten thousand soybean protein-protein interactions are predicted with the in silico method of this thesis. Specifically, a candidate set of soybean interactions is obtained by searching for soybean interologs of known protein-protein interactions in Arabidopsis thaliana, Saccharomyces cerevisiae, and Homo sapiens. Domain-based pairwise kernel methods then act as unbiased classifiers to filter the interologs under a cross-species strategy: training on data from the source species (Arabidopsis, Saccharomyces, or Homo sapiens) and testing on data from soybean. This transfer between species relies on domain-domain interactions that are conserved in both the ‘source’ and the ‘target’ species. This is the first time that a large number of soybean PPIs have been predicted computationally, and prediction performance is assessed by cross-validation. The combination of homologous PPIs and domain-based pairwise kernels is concluded to be effective for predicting protein-protein interactions in organisms whose genomes are newly sequenced. Finally, protein complexes in the predicted soybean protein-protein interaction network are identified, and interactions between plant resistance genes/proteins within these complexes are investigated to infer related biological functions.
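     The two-stage pipeline can be sketched as follows. This is a minimal illustration, assuming an ortholog table is already available; the protein IDs, the `score` callable (a placeholder for a trained classifier's decision value), and the threshold are hypothetical and not taken from the thesis.

```python
from itertools import product

def infer_interologs(source_ppis, orthologs):
    """Map each source-species interaction onto target-species proteins.

    source_ppis: iterable of (protein_a, protein_b) from a source species.
    orthologs:   dict from a source protein to its target-species orthologs.
    """
    candidates = set()
    for a, b in source_ppis:
        # Every combination of orthologs of a and b is a candidate interolog.
        for x, y in product(orthologs.get(a, ()), orthologs.get(b, ())):
            candidates.add(tuple(sorted((x, y))))  # store pairs unordered
    return candidates

def filter_candidates(candidates, score, threshold=0.0):
    """Keep candidates whose classifier score exceeds the threshold."""
    return {pair for pair in candidates if score(pair) > threshold}

# Hypothetical IDs: AT_* for the source species, GM_* for soybean.
source_ppis = [("AT_1", "AT_2"), ("AT_2", "AT_3")]
orthologs = {"AT_1": ["GM_a", "GM_b"], "AT_2": ["GM_c"]}

candidates = infer_interologs(source_ppis, orthologs)
# Placeholder score standing in for a model trained on source-species data.
kept = filter_candidates(candidates, lambda pair: 1.0 if "GM_a" in pair else -1.0)
print(sorted(candidates), sorted(kept))
```

     The second interaction contributes no candidates because AT_3 has no soybean ortholog in the toy table; in the cross-species setting the scoring model would be trained on source-species pairs and applied to the soybean candidates.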
引文
[1] Human genome project information. (2011-02-03) [2011-04-09]. http://genomics.energy.gov.
    [2] DeLisi C. Genomes: 15 years later a perspective by Charles DeLisi, HGP pioneer. Human Genome News, 2001, 11: 3–4[2011-04-09].
    [3] Collins F S, Morgan M, Patrinos A. The human genome project: Lessons from large-scale biology. Science, 2003, 300(5617): 286-290.
    [4]大科学计划概述:人类基因组计划. (2004-02-23) [2011-04-09]. http://www.cas.cn/.
    [5] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 2001, 409(6822): 860-921.
    [6] Venter J C, Adams M D, Myers E W, et al. The sequence of human genome. Science, 2001, 291(5507): 1304-1351.
    [7] Gregory S G, Barlow K F, McLay K E, et al. The DNA sequence and biological annotation of human chromosome?1. Nature, 2006, 441(7091): 315-321.
    [8] Roach J. How much of the human genome has been sequenced? (2006-06-12) [2011-04-09]. http://www.strategicgenomics.com/Genome/index.htm.
    [9] Biemont C, Vieira C. Genetics: Junk DNA as an evolutionary force. Nature, 2006, 443(7111): 521-524.
    [10] Rinn J. Transcriptomics: Rethinking junk DNA. Nature, 2009, 458(7235): 240-241.
    [11] Goffeau A, Barrell B G, Bussey H, et al. Life with 6000 genes. Science, 1996, 274(5287): 546-567.
    [12] C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 1998, 282(5396): 2012-2022.
    [13] Adams M D, Celniker S E, Holt R A, et al. The genome sequence of D. melanogaster. Science, 2000, 287(5461): 2185-2195.
    [14] The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 2000, 408(6814): 796-815.
    [15] Schmutz J, Cannon S B, Jackson S A, et al. Genome sequence of the palaeopolyploid soybean, Nature, 2010, 463(7278): 178-183.
    [16] Medini D, Serruto D, Parkhill J, et al. Micorbiology in the post-genomic era.Nature Reviews Microbiology, 2008, 6: 419-430.
    [17] Ashburner M, Ball C A, Blake J A, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 2000, 25(1): 25-29.
    [18]赵国屏,等.生物信息学.北京:科学出版社, 2002.
    [19] Sprinzak E, Margalit H. Correlated sequence-signatures as markers of protein-protein interaction. Journal of Molecular Biology, 2001, 311(4): 681-692.
    [20] von Mering C, Jensen L J, Snel B, et al. STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Research, 2005, 33(Suppl. 1): D433-D437.
    [21] Anderson N L, Anderson N G. Proteome and proteomics: new technologies, new concepts, and new words. Electrophoresis, 1998, 19(11): 1853-1861.
    [22]关薇,王建,贺福初.大规模蛋白质相互作用研究方法进展.生命科学, 2006, 18(5): 507-512.
    [23]孟菁菁,黄大毛,唐发清,等.蛋白质与蛋白质相互作用的研究进展.国际病理科学与临床杂志, 2008, 28(6): 471-476.
    [24] Nurse P, Hayles J. The cell in an era of systems biology. Cell, 2011, 144(6): 850-854.
    [25] Phizicky E M, Fields S. Protein-protein interactions: Methods for detection and analysis. Microbiological Reviews, 1995, 59(1): 94-128.
    [26] Li X, Keskin O, Ma B, et al. Protein–protein interactions: hot spots and structurally conserved residues often locate in complemented pockets that pre-exist in the unbound states: implications for docking. Journal of Molecular Biology, 2004, 344(3): 781-795.
    [27]高莹,来鲁华.蛋白质-蛋白质相互作用界面统计分析.物理化学学报,2004,20(7): 676-679.
    [28] Tan K, Shlomi T, Feizi H, et al. Transcriptional regulation of protein complexes within and across species. Proceedings of the National Academy of Sciences, USA, 2007, 104(4): 1283-1288.
    [29] Yu H, Braun P, Yildirim M A, et al. High-quality binary protein interaction map of the yeast interactome network. Science, 2008, 322(5898): 104-110.
    [30] Stark C, Breitkreutz B J, Chatr-Aryamontri A, et al. The BioGRID Interaction Database: 2011 update. Nucleic Acids Research, 2010, 39(Database Issue): D698-D704.
    [31]任仙文,李北平,王月兰,等.蛋白质相互作用的生物信息学研究进展.生物技术通讯, 2006, 17(6): 976-980.
    [32]朱新宇,沈百荣.预测蛋白质间相互作用的生物信息学方法.生物技术通讯, 2004, 15(1): 70-75.
    [33] Jeong H, Mason S P, Barabasi A L, et al. Lethality and centrality in protein networks. Nature, 411(6833): 41-42.
    [34] Barabasi A L. Scale-free networks: A decade and beyond. Science, 2009, 325(5939): 412-413.
    [35] Sprinzak E, Altuvia Y, Margalit H. Characterization and prediction of protein-protein interactions within and between complexes. Proceedings of the National Academy of Sciences, USA, 2006, 103(40): 14718-14723.
    [36] Ben-Hur A, Ong C S, Ratsch G, et al. Support vector machines and kernels for computational biology. PLoS Computational Biology, 2008, 4(10): e1000173.
    [37]陈永义,俞小鼎,高学浩,等.处理非线性分类和回归问题的一种新方法(I)——支持向量机方法简介.应用气象学报, 2004, 15(3): 345-354.
    [38] Vapnik V N. The nature of statistical learning theory. New York, USA: Springer-Verlag New York, Inc., 1995.
    [39]陈永义. CMSVM 1.0用户手册. ( 2003年12月) [2011-04-11]. http://stream1.cma.gov.cn/cmsvm/softdown.asp.
    [40] Bock J R, Gough D A. Predicting protein-protein interactions from primary structure. Bioinformatics, 2001, 17(5): 455-460.
    [41] Roy S, Martinez D, Platero H, et al. Exploiting amino acid composition for predicting protein-protein interactions. PLoS ONE, 2009, 4 (11): e7813.
    [42] Guo Y, Yu L, Wen Z, et al. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Research, 2008, 36(9), 3025-3030.
    [43]王兰,刘融,周艳红.基于结构域组合信息预测蛋白质相互作用.生物信息学, 2008(01): 28-30.
    [44]李哲谦,刘书朋,严壮志,等.基于改进支持向量机方法的蛋白质相互作用预测.中国生物医学工程学报, 2009, 28(5): 701-706.
    [45] Ben-Hur A, Noble W S. Kernel methods for predicting protein-protein interactions. Bioinformatics, 2005, 21(Supp1): i38-i46.
    [46] Vert J-P, Qiu J, Noble W S. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics, 2007, 8 (Suppl 10): S8.
    [47] Pavlidis P, Weston J, Grundy W N, et al. Gene functional classification fromheterogeneous data. Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, 2001:242-248.
    [48] Yamanishi Y, Vert J-P, Kanehisa M, et al. Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics, 2003, 19(Suppl 1): i323-i330.
    [49] Martin S, Faulon J. Predicting protein-protein interactions using signature products. Bioinformatics, 2005, 21(2): 218-226.
    [50] Shen J, Zhang J, Jiang H, et al. Predicting protein-protein interaction based only on sequences information. Proceedings of the National Academy of Sciences, USA, 2007, 104(11): 4337-4341.
    [51] Aranda B, Achuthan P, Hermjakob H, et al. The IntAct molecular interaction database in 2010. Nucleic Acids Research, 2009, 38 (Suppl 1): D525-D531.
    [52] Güldener U, Münsterk?tter M, Stümpflen V, et al. MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Research, 2006, 34(Database issue): D436-D441.
    [53] Pagel P, Kovac S, Frishman D, et al. The MIPS mammalian protein-protein interaction database. Bioinformatics, 2005, 21(6): 832-834.
    [54] Ruepp A, Brauner B, Mewes H W, et al. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Research, 2007, 36 (Suppl 1): D646-D650.
    [55] Salwinski L, Miller C S, Eisenberg D, et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Research, 2004, 32(Database issue): D449-D451.
    [56] Ceol A, Chatr Aryamontri A, Cesareni G, et al. MINT, the molecular interaction database: 2009 update. Nucleic Acids Research, 2010, 38(Database issue): D532-D539.
    [57] Peri S, Navarro J D, Kristiansen T Z, et al. Human protein reference database as a discovery resource for proteomics. Nucleic Acids Research, 2004, 32 (Database issue): D497-D501.
    [58] Rhee S Y, Beavis W, Berardini T Z, et al. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Research, 2003, 31(1): 224-228.
    [59] Salwinski L, Licata L, Khadake J, et al. Recurated protein interaction datasets. Nature Methods, 2009, 6(12): 860-861.
    [60] Coward E. Shufflet: shuffling sequences while conserving the k-let counts. Bioinformatics, 1999, 15(12): 1058-1059.
    [61] Lo S L, Cai C Z, Chung M C, et. al. Effect of training datasets on support vector machine prediction of protein-protein interactions. Proteomics, 2005, 5(4): 876-884.
    [62] Jansen R, Yu H, Greenbaum D, et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 2003, 302(5644): 449-453.
    [63] Rhodes D R, Tomlins S A, Varambally S, et al. Probabilistic model of the human protein-protein interaction network. Nature Biotechnology, 2005, 23(8): 951-959.
    [64] Wu X, Zhu L, Guo J, et al. Prediction of yeast protein–protein interaction network: insights from the Gene Ontology and annotations. Nucleic Acids Research, 2006, 34(7): 2137-2150.
    [65] Guo J, Wu X, Lin K, et al. Genome-wide inference of protein interaction sites: lessons from the yeast high-quality negative protein-protein interaction dataset. Nucleic Acids Research, 2008, 36(6): 2002-2011.
    [66] Ben-Hur A, Noble W S. Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinformatics, 2006, 7(Suppl 1): S2.
    [67] Doerr A. The importance of being negative. Nature Methods, 2010, 7: 10-11.
    [68] Chen X W, Liu M. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics, 2005, 21(24): 4394-4400.
    [69] Smialowski P, Pagel P, Wong P, et al. The negatome database: a reference set of non-interacting protein pairs. Nucleic Acids Research, 2010, 38 (Database issue): D540-D544.
    [70] Albert R, Jeong H, Barabasi A L. Error and attack tolerance of complex networks. Nature, 2000, 406(5794): 378-382.
    [71] Gomez S M, Lo S-H, Rzhetsky A. Probabilistic prediction of unknown metabolic and signal-transduction networks. Genetics, 2001, 159(3): 1291-1298.
    [72] Seshasayee A S. Social behavior of the yeast protein-protein interaction network. In Silico Biology, 2006, 6(1-2): 127-130.
    [73] Przulj N, Higham D J. Modeling protein–protein interaction networks via a stickiness index. Journal of the Royal Society Interface, 2006, 3(10): 711-716.
    [74] Deeds E J, Ashenberg O, Shakhnovich E I. A simple physical model for scaling in protein–protein interaction networks. Proceedings of the National Academy of Sciences, USA, 2006, 103(2): 311-316.
    [75]中国在大豆基因组研究方面取得重大突破.(2010-11-17)[2011-04-13].http://scitech.people.com.cn/.
    [76] Lam H-M, Xu X, Zhang G, et al. Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nature Genetics, 2010, 42(12): 1053-1059.
    [77] Geisler-Lee J, O′Toole N, Ammar R, et al. A predicted interactome for Arabidopsis. Plant Physiology, 2007, 145(2): 317-329.
    [78] Cui J, Li P, Li G, et al. AtPID: Arabidopsis thaliana protein interactome database–an integrative platform for plant systems biology. Nucleic Acids Research, 2008, 36(Database issue): D999-D1008.
    [79] Lin M, Shen X, Chen X, et al. PAIR: the predicted Arabidopsis interactome resource. Nucleic Acids Research, 2011, 39 (Database issue): D1134-D1140.
    [80] Lin M, Hu B, Chen X. Computational identification of potential molecular interactions in Arabidopsis. Plant Physiology, 2009, 151(1): 34-46.
    [81] De Bodt S, Proost S, Vandepoele K, et al. Predicting protein-protein interactions in Arabidopsis thaliana through integration of orthology, gene ontology and co-expression. BMC Genomics, 2009, 10: 288.
    [82] Brandao M M, Dantas L L, Silva-Filho M C. AtPIN: Arabidopsis thaliana protein interaction network. BMC Bioinformatics, 2009, 10: 454.
    [83] He F, Zhang Y, Peng Y-L, et al. The prediction of protein-protein interaction networks in rice blast fungus. BMC Genomics, 2008, 9: 519.
    [84] Wang Z, Libault M, Joshi T, et al. SoyDB: a knowledge database of soybean transcription factors. BMC Plant Biology, 2010, 10: 14.
    [85] Alkharouf N W, Matthews B F. SGMD: the soybean genomics and microrarray database. Nucleic Acids Research, 2004, 32 (Database issue): D398-D400.
    [86] Cheng K C, Stromvik M V. SoyXpress: a database for exploring the soybean transcriptome. BMC Genomics, 2008, 9: 368.
    [87] Puntervoll P, Linding R, Costantini A, et al. ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Research, 2003, 31(13): 3625-3630.
    [88] Falquet L, Pagni M, Bairoch A, et al. The PROSITE database, its status in 2002. Nucleic Acids Research, 2002, 30(1): 235-238.
    [89] Obenauer J C, Cantley L C, Yaffe M B. Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Research, 2003, 31(13): 3635-3641.
    [90] Balla S, Thapar V, Verma S, et al. Minimotif Miner: a tool for investigating protein function. Nature Methods, 2006, 3(3): 175-177.
    [91] Bateman A, Birney E, Cerruti L, et al. The Pfam Protein Families Database. Nucleic Acids Research, 2002, 30(1): 276-280.
    [92] Attwood T K, Bradley P, Flower D R, et al. PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Research, 2003, 31(1): 400-402.
    [93] Letunic I, Goodstadt L, Dickens N J, et al. Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Research, 2002, 30(1): 242-244.
    [94] Hunter S, Apweiler R, Attwood T K, et al. InterPro: the integrative protein signature database. Nucleic Acids Research, 2009, 37 (Database Issue): D211-D215.
    [95] Deng M, Mehta S, Sun F, et al. Inferring domain-domain interactions from protein-protein interactions. Genome Research, 2002, 12(10): 1540-1548.
    [96] Gomez S M, Noble W S, Rzhetsky A, et al. Learning to predict protein-protein interactions from protein sequences. Bioinformatics, 2003, 19 (15): 1875-1881.
    [97] Betel D, Breitkreuz K E, Isserlin R, et al. Structure-templated predictions of novel protein interactions from sequence information. PLoS Computational Biology, 2007, 3(9): 1783-1789.
    [98] Wang R-S, Wang Y, Zhang X S, et al. Analysis on multi-domain cooperation for predicting protein-protein interactions. BMC Bioinformatics, 2007, 8: 391.
    [99] Han D S, Kim H S, Jang W H, et al. PreSPI: a domain combination based prediction system for protein-protein interaction. Nucleic Acids Research, 2004, 32(21): 6312-6320.
    [100] Ng S K, Zhang Z, Tan S H, et al. Integrative approach for computationally inferring protein domain interactions. Bioinformatics, 2003, 19(8): 923-929.
    [101] Kim W K, Park J, Suh J K, et. al. Large scale statistical prediction of protein-protein interaction by potentially interacting domain (PID) pair. Genome Informatics, 2002, 13: 42-50.
    [102] Riley R, Lee C, Eisenberg D, et al. Inferring protein domain interactions from databases of interacting proteins. Genome Biology, 2005, 6(10): R89.
    [103] Guimaraes K S, Jothi R, Przytycka T M, et al. Predicting domain-domain interactions using a parsimony approach. Genome Biology, 2006, 7(11): R104.
    [104] Guimaraes K S, Przytycka T M. Interrogating domain-domain interactions with parsimony based approaches. BMC Bioinformatics, 2008, 9: 171.
    [105] Nye T M W, Berzuini C, Gilks W R, et al. Statistical analysis of domain ininteracting protein pairs. Bioinformatics, 2005, 21(7): 993-1001.
    [106] Zhao X-M, Chen L, Aihara K. A discriminative approach for identifying domain-domain interactions from protein-protein interactions. Proteins: Structure, Function, and Bioinformatics, 2010, 78(5): 1243-1253.
    [107] Gonzalez A J, Liao L. Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines. BMC Bioinformatics, 2010, 11: 537.
    [108] Raghavachari B, Tasneem A, Jothi R, et al. DOMINE: a database of protein domain interactions. Nucleic Acids Research, 2007, 36 (Database issue): D656-D661.
    [109] Ng S K, Zhang Z, Tan S H, et al. InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Research, 2003, 31(1): 251-254.
    [110] Shoemaker B A, Panchenko A R. Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Computational Biology, 2007, 3(4): e43.
    [111] Aytuna A S, Gursoy A, Keskin O. Predicting of protein-protein interactions by combining structure and sequence conservation in protein interfaces. Bioinformatics, 2005, 21(12): 2850-2855.
    [112] Ritchie D W, Kozakov D, Vajda S. Accelerating and focusing protein–protein docking correlations using multi-dimensional rotational FFT generating functions. Bioinformatics, 2008, 24(17): 1865-1873.
    [113] Ackermann F, Herrmann G, Sagerer G, et al. Estimation and filtering of potential protein-protein docking positions. Bioinformatics, 1998, 14(2): 196-205.
    [114] Bernauer J, Aze J, Poupon A, et al. A new protein–protein docking scoring function based on interface residue properties. Bioinformatics, 2007, 23(5): 555-562.
    [115] Kowalsman N, Eisenstein M. Inherent limitations in protein–protein docking procedures. Bioinformatics, 2007, 23(4): 421-426.
    [116] Aloy P, Russell R B. Interrogating protein interaction networks through structural biology. Proceedings of the National Academy of Sciences, USA, 2002, 99(9): 5896-5901.
    [117] Aloy P, Russell R B. InterPreTS: protein interaction prediction through tertiary structure. Bioinformatics, 2003, 19(1): 161-163.
    [118] Ogmen U, Gursoy A. PRISM: protein interactions by structural matching. Nucleic Acids Research, 2005, 33(Suppl. 2): W331-W336.
    [119] Zhou H-X, Qin S. Interaction-site prediction for protein complexes: a critical assessment. Bioinformatics, 2007, 23(17): 2203-2209.
    [120] Kufareva I, Budagyan L, Raush E, et al. PIER: protein interface recognition for structural proteomics. Proteins: Structure, Function, and Bioinformatics, 2007, 67(2): 400-417.
    [121] Burgoyne N J, Jackson R M. Predicting protein interaction sites: binding hot-spots in protein-protein and protein-ligand interfaces. Bioinformatics, 2006, 22(11): 1335-1342.
    [122] Wang B, Chen P, Huang D S, et al. Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Letters, 2006, 580(2): 380-384.
    [123] Bordner A J, Abagyan R. Statistical analysis and prediction of protein-protein interfaces. Proteins: Structures, Function, and Bioinformatics, 2005, 60(3): 353-366.
    [124] Chen H, Zhou H X. Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins: Structures, Function, and Bioinformatics, 2005, 61(1): 21-35.
    [125] Neuvirth H, Raz R, Schreiber G. ProMate: a structure based prediction program to identify the location of protein-protein binding sites. Journal of Molecular Biology, 2004, 338(1): 181-199.
    [126] Bradford J R, Needham C J, Westhead D R, et al. Insights into protein–protein interfaces using a Bayesian network prediction method. Journal of Molecular Biology, 2006, 362(2): 365-386.
    [127] Bradford J R, Westhead D R. Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics, 2005, 21(8): 1487-1494.
    [128] Friedrich T, Pils B, Dandekar T, et al. Modelling interaction sites in protein domains with interaction profile hidden Markov models. Bioinformatics, 2006, 22(23): 2851-2857.
    [129] Li M-H, Lin L, Wang X-L, et al. Protein-protein interaction site prediction based on conditional random fields. Bioinformatics, 2007, 23(5): 597-604.
    [130] Matthews L R, Vaglio P, Vidal M, et. al. Identification of potentail interaction networks using sequence-based searches for conserved protein-protein interactions or“interologs”. Genome Research, 2001, 11(12): 2120-2126.
    [131] Yu H, Luscombe N M, Gerstein M, et al. Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. GenomeResearch, 2004, 14(6): 1107-1118.
    [132] Ramani A K, Li Z, Marcotte E M, et al. A map of human protein interactions derived from co-expression of human mRNAs and their orthologs. Molecular Systems Biology, 2008, 4: 180.
    [133] Yellaboina S, Dudekula D B, Ko M Sh. Prediction of evolutionarily conserved interologs in Mus musculus. BMC Genomics, 2008, 9: 465.
    [134] Chen C C, Lin C Y, Yang J M, et al. PPISearch: a web server for searching homologous protein-protein interactions across multiple species. Nucleic Acids Research, 2009, 37 (Web Server issue): W369-W375.
    [135] Guda C, King B R, Pal L R, et al. A top-down approach to infer and compare domain-domain interactions across eight model orgamisms. PLoS ONE, 2009, 4(3): e5096.
    [136] Ashburner M, Ball C A, Blake J A, et al. Gene Ontology: tool for the unification of biology. Nature Genetics, 2000, 25(1): 25 -29.
    [137]张茜,王敬泽.基于关联基因本体论注释的蛋白质相互作用预测.生物化学与生物物理进展, 2005, 32(5): 449-455.
    [138] Lord P W, Stevens R D, Brass A, et al. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 2003, 19(10): 1275-1283.
    [139] Wu X, Zhu L, Lin K, et al. SPIDer: Saccharomyces protein-protein interaction database. BMC Bioinformatics, 2006, 7(Suppl 5): S16.
    [140] Valencia A, Pazos F. Computational methods for the prediction of protein interactions. Current Opinion in Structural Biology, 2002, 12(3): 368-373.
    [141] Craig R A, Liao L. Phylogenetic tree information aids supervised learning for predicting protein-protein interaction based on distance matrices. BMC Bioinformatics, 2007, 8: 6.
    [142] Soong T T, Wrzeszczynski K O, Rost B. Physical protein-protein interactions predicted from microarrays. Bioinformatics, 2008, 24(22): 2608-2614.
    [143] Fraser H B, Hirsh A E, Wall D P, et al. Coevolution of gene expression among interacting proteins. Proceedings of the National Academy of Sciences, USA, 2004, 101(24): 9033-9038.
    [144] Asthana S, King O D, Gibbons F D, et al. Predicting protein complex membership using probabilistic network reliability. Genome Research, 2004, 14(6): 1170-1175.
    [145] Kelley B P, Sharan R, Karp R M, et al. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proceedings of the National Academy of Sciences, USA, 2003, 100(20): 11394-11399.
    [146] Sharan R, Ideker T, Kelley B, et al. Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. Bourne P E. Proceedings of the eighth annual international conference on Research in computational molecular biology. San Diego, California: ACM Press, 2004: 282-289.
    [147] Sharan R, Suthram S, Kelley R M, et al. Conserved patterns of protein interaction in multiple species. Proceedings of the National Academy of Sciences, USA, 2005, 102(6): 1974-1979.
    [148] Qiu J, Noble W S. Predicting co-complexed protein pairs from heterogeneous data. PLoS Computational Biology, 2008, 4(4): e1000054.
    [149] Ozawa Y, Saito R, Fujimori S, et al. Protein complex prediction via verifying and reconstructing the topology of domain-domain interactions. BMC Bioinformatics, 2010, 11: 350.
    [150] Yeger-Lotem E, Sattath S, Kashtan N, et al. Network motifs in integrated cellular networks of transcription–regulation and protein–protein interaction. Proceedings of the National Academy of Sciences, USA, 2004, 101(16): 5934-5939.
    [151] Jiang R, Tu Z, Chen T, et al. Network motif identification in stochastic networks. Proceedings of the National Academy of Sciences, USA, 2006, 103(25): 9404-9409.
    [152] Schueler-Furman O, Wang C, Baker D, et al. Progress in modeling of protein structures and interactions. Science, 2005, 310(5748): 638-642.
    [153] Costanzo M, Baryshnikova A, Bellay J, et al. The genetic landscape of a cell. Science, 2010, 327(5964): 425-431.
    [154] Gavin A C, Bosche M, Krause R, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 2002, 415(6868): 141-147.
    [155] Han J D, Dupuy D, Bertin N, et al. Effect of sampling on topology predictions of protein-protein interaction networks. Nature Biotechnology, 2005, 23(7): 839-844.
    [156] Stumpf M P H, Wiuf C, May R M. Subsets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences, USA, 2005, 102(12): 4221-4224.
    [157] Yu C Y, Chou L C, Chang D T. Predicting protein-protein interactions in unbalanced data using the primary structure of proteins. BMC Bioinformatics, 2010, 11: 167.
    [158] Guo Y, Li M, Pu X, et al. PRED_PPI: a server for predicting protein-protein interactions based on sequence data with probability assignment. BMC Research Notes, 2010, 3: 145.
    [159] Wang Y, Wang J, Yang Z, et al. Sequence-based protein-protein interaction prediction via support vector machine. Journal of Systems Science and Complexity, 2010, 23: 1012-1023.
    [160] Fang J, Haasl R J, Dong Y, et al. Discover protein sequence signatures from protein-protein interaction data. BMC Bioinformatics, 2005, 6: 277.
    [161] Hulo N, Bairoch A, Bulliard V, et al. The 20 years of PROSITE. Nucleic Acids Research, 2008, 36(Database issue): D245-D249.
    [162] Lima T, Auchincloss A H, Bougueleret L, et al. HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot. Nucleic Acids Research, 2009, 37(Database issue): D471-D478.
    [163] Corpet F, Servant F, Gouzy J, et al. ProDom and ProDom-CG: Tools for protein domain analysis and whole genome comparisons. Nucleic Acids Research, 2000, 28(1): 267-269.
    [164] Haft D H, Selengut J D, White O. The TIGRFAMs database of protein families. Nucleic Acids Research, 2003, 31(1): 371-373.
    [165] Wu C H, Yeh L L, Huang H, et al. The Protein Information Resource. Nucleic Acids Research, 2003, 31(1): 345-347.
    [166] Gough J, Karplus K, Hughey R, et al. Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure. Journal of Molecular Biology, 2001, 313(4): 903-919.
    [167] Pearl F, Todd A, Thornton J, et al. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Research, 2005, 33(Database Issue): D247-D251.
    [168] Mi H, Lazareva-Ulitsky B, Loo R, et al. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 2005, 33 (Database Issue): D284-D288.
    [169] Batada N N, Reguly T, Breitkreutz A, et al. Still Stratus Not Altocumulus: Further Evidence against the Date/Party Hub Distinction. PLoS Biology, 2007, 5(6): e154.
    [170] Baldi P, Brunak S, Chauvin Y, et al. Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics, 2000, 16(5): 412-424.
    [171] Chang C C, Lin C J. LIBSVM: a library for support vector machines. 2010[2011-04-09]. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
    [172] Joachims T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Norwell, MA, USA: Kluwer Academic Publishers, 2002.
    [173] Leslie C, Eskin E, Weston J, et al. Mismatch string kernels for SVM protein classification. In Becker S, Thrun S, and Obermayer K, editors, Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2003: 1441-1448.
    [174] Chen J, Ye J. Training SVM with Indefinite Kernels. Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland: 2008.
    [175] Chen P H, Lin C J, et al. A study on SMO-type decomposition methods for support vector machines. IEEE Transactions on Neural Networks, 2006, 17(4): 893-908.
    [176] Park Y. Critical assessment of sequence-based protein-protein interaction prediction methods that do not require homologous protein sequences. BMC Bioinformatics, 2009, 10: 419.
    [177] Schlicker A, Huthmacher C, Albrecht M, et al. Functional evaluation of domain-domain interactions and human protein interaction networks. Bioinformatics, 2007, 23(7): 859-865.
    [178] Singhal M, Resat H. A domain-based approach to predict protein-protein interactions. BMC Bioinformatics, 2007, 8: 199.
    [179] Leslie C, Eskin E, Noble W S. The spectrum kernel: a string kernel for SVM protein classification. Proceedings of the Pacific Symposium on Biocomputing, 2002, 7: 566-575.
    [180] Keskin O, Tsai C J, Wolfson H, et al. A new, structurally non-redundant, diverse data set of protein-protein interfaces and its implications. Protein Science, 2004, 13(4): 1043-1055.
    [181] Ofran Y, Rost B. Predicted protein-protein interaction sites from local sequence information. FEBS Letters, 2003, 544(1): 236-239.
    [182] Keskin O, Ma B, Nussinov R. Hot regions in protein-protein interactions: the organization and contribution of structurally conserved hot spot residues. Journal of Molecular Biology, 2005, 345(5): 1281-1294.
    [183]黎明民,梁世德,等.蛋白-蛋白作用界面特征及界面预测研究进展.现代生物医学进展,2006, 6(6): 41-43.
    [184] Tyson J J, Chen K, Novak B. Network dynamics and cell physiology. Nature Reviews Molecular Cell Biology, 2001, 2(12): 908-916.
    [185] Akiva E, Itzhaki Z, Margalit H. Built-in loops allow versatility in domain-domain interactions: Lessons from self-interacting domains. Proceedings of the National Academy of Sciences, USA, 2008, 105(36): 13292-13297.
    [186] Prieto C, De Las Rivas J. Structural domain-domain interactions: Assessment and comparison with protein-protein interaction data to improve the interactome. Proteins: Structure, Function, and Bioinformatics, 2010, 78(1): 109-117.
    [187] Wojcik J, Schachter V. Protein-protein interaction map inference using interacting domain profile pairs. Bioinformatics, 2001, 17(Suppl. 1): S296-S305.
    [188] Hui S, Bader G D. Proteome scanning to predict PDZ domain interactions using support vector machines. BMC Bioinformatics, 2010, 11: 507.
    [189] Durbin R, Eddy S, Krogh A, et al. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press, 1998.
    [190] Finn R D, Mistry J, Tate J, et al. The Pfam protein families database. Nucleic Acids Research, 2010, 38 (Database issue): D211-D222.
    [191] Stark C, Breitkreutz B J, Reguly T, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Research, 2006, 34(Database issue): D535-D539.
    [192] Morrison J L, Breitling R, Higham D J, et al. A lock-and-key model for protein-protein interactions. Bioinformatics, 2006, 22(16): 2012-2019.
    [193] Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 2003, 13(11): 2498-2504.
    [194] Itzhaki Z, Akiva E, Margalit H, et al. Evolutionary conservation of domain-domain interactions. Genome Biology, 2006, 7: R125.
    [195] The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research, 2010, 38 (Database issue): D142-D148.
    [196] McDowell J M, Woffenden B J. Plant disease resistance genes: recent insights and potential applications. Trends in Biotechnology, 2003, 21(4): 178-183.
    [197] Blaszczyk L, Chelkowski J, Korzun V, et al. Verification of STS markers for leaf rust resistance genes of wheat by seven European laboratories. Cellular and Molecular Biology Letters, 2004, 9(4b): 805-817.
    [198] Chelkowski J, Koczyk G. Resistance gene analogues of Arabidopsis thaliana: recognition by structure. Journal of Applied Genetics, 2003, 44(3): 311-321.
    [199]汪旭升,吴为人,金谷雷,等.水稻全基因组R基因鉴定及候选RGA标记开发.科学通报, 2005, 50(11): 1085-1089.
    [200] Sanseverino W, Roma G, Ercolano M R, et al. PRGdb: a bioinformatics platform for plant resistance gene analysis. Nucleic Acids Research, 2009, 38(Database issue): D814-D821.
    [201] Liu J, Liu X, Dai L, et al. Recent Progress in Elucidating the Structure, Function and Evolution of Disease Resistance Genes in Plants. Journal of Genetics and Genomics, 2007, 34(9): 765-776.
    [202] von Mering C, Krause R, Snel B, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 2002, 417(6887): 399-403.
    [203] Bader G D, Hogue C W. Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnology, 2002, 20(10): 991-997.
    [204] Kumar A, Snyder M. Protein complexes take the bait. Nature, 2002, 415(6868): 123-124.
    [205] Jansen R, Gerstein M. Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Current Opinion in Microbiology, 2004, 7(5): 535-545.
    [206]詹仕林.矩阵乘积的正定性.安徽大学学报(自然科学版), 2003, 27(2): 10-12.
    [207]张学工.模式识别.北京:清华大学出版社, 2000: 108-109.
