Machine Learning-Based Prediction of Protein Subcellular Localization
Abstract
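Protein subcellular localization is an important research topic in molecular cell biology and proteomics. It is closely tied to protein function, metabolism, signal transduction and biological processes, and it plays an important role in both basic biological research and biomedical research. Computational prediction of protein subcellular localization is inexpensive, efficient and widely applicable; by analyzing large volumes of protein data it can identify informative protein features and infer the statistical regularities linking those features to subcellular localization. Although prediction research has made considerable progress in recent years, existing methods have several shortcomings. First, protein feature information is not mined deeply enough, so some important features are ignored. Second, when multiple heterogeneous data sources are integrated, the heterogeneous feature spaces are usually concatenated or combined by majority-vote ensemble learning, without accounting for the relative importance of each feature type or for data unavailability. Third, existing models perform poorly on imbalanced protein data, on sub-organelle localization and on large-scale subcellular localization.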
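     This thesis studies new methods for predicting protein subcellular localization from a machine learning perspective, aiming to improve predictive performance while giving the models practical biological meaning and a reasonable biological interpretation. The main contributions are as follows: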
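     1. Multi-instance learning is introduced to mine a protein sequence's domain composition, domain sequence information, domain boundaries and domain order. A multi-instance learning model captures the local structural information of the sequence, while multi-label learning handles proteins that occupy multiple subcellular locations, offering a new approach to subcellular localization prediction. The resulting multi-instance multi-label model uses a bag-instance representation for the whole-part relation between a protein and its domains and mines the local structural information of the sequence effectively. In experiments on Gram-positive bacterial proteins, it achieved predictive performance comparable to that of a gene ontology based k-nearest-neighbor ensemble model.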
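     2. A spectrum kernel, SpectrumKernel+, is proposed. It embeds multiple amino acid classifications into the k-mer feature representation and, on that basis, simulates multiple possible evolution patterns of sequence motifs. SpectrumKernel+ explains the biological meaning of embedding amino acid classification information in k-mers in terms of the biochemical constraints on protein sequence evolution, and it is related to the traditional spectrum kernel and the (k,l) mismatch kernel, which gives it a more reasonable biological meaning and an intuitive biological interpretation. By jointly considering multiple amino acid classifications, it measures the differences in motif evolution patterns and motif distributions between two protein sequences and thereby measures their similarity more precisely. Protein subnuclear localization prediction is more challenging than general subcellular localization prediction; experiments on two subnuclear protein datasets show that SpectrumKernel+ significantly outperforms the baseline models.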
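     3. A fused multi-instance kernel, HoMIKernel+, is proposed to embed fine-grained information from homologous protein sequences. Because homologous sequences are both conserved and divergent in evolution, they describe the target protein's subcellular localization pattern only ambiguously. This ambiguity matches the ambiguity with which positive instances describe a class label in multi-instance learning; it is the point where the biological meaning and multi-instance learning meet, and it is the starting point for HoMIKernel+. HoMIKernel+ uses the k-mer feature representations of a set of homologous sequences to describe the target protein jointly, which strengthens the target protein's motif distribution information and suppresses possible noise. Experiments on one prokaryotic dataset and three eukaryotic datasets show that HoMIKernel+ outperforms the baseline models, that embedding homologous sequences improves predictive performance, and that fusing multiple multi-instance kernels significantly improves predictive ability.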
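     4. Two prediction methods are proposed, homologous gene ontology knowledge transfer learning and statistically correlated gene ontology knowledge transfer learning, together with a simple non-parametric cross-validation method that estimates the weights of a linear combination of kernels. This enables knowledge sharing among homologous and related proteins and reduces the time and space complexity of kernel weight estimation. Links between the target protein and auxiliary proteins are established through intuitive biological meaning; the gene ontology knowledge of homologous proteins and the statistically correlated gene ontology knowledge within the gene ontology database are transferred to the target protein; and a multiple kernel learning model is built on this basis for prediction. Homologous gene ontology knowledge transfer enriches the gene ontology knowledge of the target protein and overcomes missing gene ontology knowledge for novel proteins or proteins with little experimental evidence. Statistically correlated gene ontology knowledge transfer enriches protein gene ontology knowledge, adjusts the weight distribution across the three aspects of the gene ontology, embeds gene ontology semantic distance information, adjusts the coverage of protein gene ontology annotations, lowers the miss rate of test gene ontology annotations, and avoids retraining the model at prediction time. Kernel weight estimation takes into account the Matthews correlation coefficient (MCC), a measure of performance bias, and therefore adapts well to large-scale imbalanced protein data. Experiments on eight protein datasets show that the transfer learning models significantly improve prediction performance, suppress to some extent the noise and outliers that gene ontology knowledge transfer may introduce, overcome the bias toward large classes, and handle large-scale imbalanced protein data well.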
As an important research field in molecular cell biology and proteomics, protein subcellular localization is closely related to protein function, metabolic pathways, signal transduction and biological processes, and plays an important role in basic biological and biomedical research. Computational prediction of protein subcellular localization is cheap, fast, effective and widely applicable. Through statistical analysis of large amounts of protein data, computational models can find informative protein features and infer the statistical relationship between those features and protein subcellular localization patterns. In recent years, research on protein subcellular localization prediction has made great progress. However, existing predictive models have several disadvantages: first, protein feature information is not fully mined, so some important aspects of protein information are ignored; second, data integration models generally concatenate heterogeneous feature spaces or adopt majority-vote ensemble learning, so the importance of each kind of protein feature information is not evaluated explicitly and the problem of data unavailability is not handled; finally, existing models achieve relatively poor performance on unbalanced protein data, protein sub-organelle localization and large-scale protein subcellular localization.
     This thesis investigates novel predictive methods for protein subcellular localization from the standpoint of machine learning, with the aim of improving predictive performance and endowing the models with a reasonable biological interpretation. The contributions are summarized as follows:
     1. Multi-instance learning is introduced into protein subcellular localization prediction in order to fully exploit previously ignored protein domain information: domain composition, domain boundary partition information and the order of domains along the protein sequence. On the one hand, multi-instance learning captures the local structural information of a protein sequence in terms of protein domains; on the other hand, multi-label learning handles proteins with multiple subcellular locations, thus offering a new approach to protein subcellular localization prediction. The proposed multi-instance learning method uses a bag-instance representation to describe the whole-part relation between a protein sequence and its domains, thus effectively exploiting the local structural information of the protein sequence. The experiment on Gram-positive bacterial protein data shows that the sequence-based multi-instance learning method achieves performance equivalent to that of the gene ontology based k-NN ensemble learning model.
     2. A spectrum kernel, SpectrumKernel+, is proposed to incorporate multiple amino acid classifications into the k-mer feature representation and, on that basis, to simulate multiple sequence motif evolution patterns. SpectrumKernel+ interprets the biological implication of incorporating amino acid classification information into the k-mer feature representation in terms of the physicochemical constraints on protein sequence evolution, and it connects to the classical spectrum kernel and the (k,l) mismatch kernel, endowing the model with a more reasonable biological meaning and an intuitive biological interpretation. SpectrumKernel+ incorporates multiple amino acid classifications to measure the differences between two sequences' motif evolution patterns and motif distributions, and on that basis defines the similarity between two protein sequences more accurately. Compared with general protein subcellular localization prediction, protein subnuclear localization prediction is more challenging. The experiments on two subnuclear protein datasets show that SpectrumKernel+ outperforms the baseline models.
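The idea of embedding amino acid classes in k-mers can be illustrated with a minimal Python sketch. It is not the thesis's definition of SpectrumKernel+: the four-class grouping, the helper names and the `weight` parameter are all assumptions chosen for illustration. The sketch adds a spectrum kernel computed over class labels to the classical residue-level spectrum kernel, so that two k-mers differing only by within-class substitutions still contribute to the similarity.

```python
from collections import Counter

# Hypothetical grouping of the 20 amino acids by physicochemical class.
# This particular grouping is an assumption, not taken from the thesis.
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar or small
    "+": "KRH",       # positively charged
    "-": "DE",        # negatively charged
}
AA_TO_CLASS = {aa: label for label, members in GROUPS.items() for aa in members}


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def reduce_alphabet(seq, mapping=AA_TO_CLASS):
    """Rewrite a sequence over class labels; unknown residues become 'x'."""
    return "".join(mapping.get(aa, "x") for aa in seq)


def spectrum_kernel(a, b, k):
    """Classical spectrum kernel: inner product of k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[kmer] for kmer, n in ca.items())


def class_aware_kernel(a, b, k, weight=0.5):
    """Residue-level spectrum kernel plus a weighted class-level one."""
    exact = spectrum_kernel(a, b, k)
    by_class = spectrum_kernel(reduce_alphabet(a), reduce_alphabet(b), k)
    return exact + weight * by_class
```

For example, `AVK` and `LIK` share no exact 2-mers, so the classical kernel is zero, yet both reduce to `hh+`, so the class-level term still registers their similarity.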
     3. A fused multi-instance kernel, HoMIKernel+, is proposed to incorporate the fine-grained information of full homologous sequences. Evolutionary conservation and divergence mean that homologous sequences are only a vague descriptor of the target protein's subcellular localization pattern. This vagueness is consistent with the vagueness with which positive instances describe the object label in the multi-instance learning setting; it is where the biological meaning meets the multi-instance learning method, and it is the starting point for HoMIKernel+. HoMIKernel+ uses the k-mer feature representation of the homology set to describe the target protein, so that the motif distribution of the target protein is enhanced and the noise is suppressed. The experiments on one prokaryotic dataset and three eukaryotic datasets show that HoMIKernel+ outperforms the baseline models, that incorporating homologs benefits predictive performance, and that fusing multiple multi-instance kernels significantly increases predictive accuracy.
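A rough sketch of how a set of homologous sequences can jointly describe a target protein is given below. It uses a normalized set kernel, a standard multi-instance kernel, over k-mer spectra; this is an assumed stand-in and may differ from how HoMIKernel+ is actually defined and fused. Each bag is taken to hold the target sequence together with its homologs, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def spectrum(seq, k):
    """k-mer count vector of one sequence (one instance in a bag)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def base_kernel(a, b, k=3):
    """Spectrum kernel between two individual sequences."""
    ca, cb = spectrum(a, k), spectrum(b, k)
    return sum(n * cb[kmer] for kmer, n in ca.items())


def set_kernel(bag_a, bag_b, k=3):
    """Unnormalized set kernel: base kernel summed over all instance pairs."""
    return sum(base_kernel(a, b, k) for a in bag_a for b in bag_b)


def normalized_set_kernel(bag_a, bag_b, k=3):
    """Set kernel scaled so that a bag compared with itself scores 1."""
    denom = sqrt(set_kernel(bag_a, bag_a, k) * set_kernel(bag_b, bag_b, k))
    return set_kernel(bag_a, bag_b, k) / denom if denom else 0.0
```

Because every homolog contributes its own k-mer counts, motifs shared across the bag are reinforced while k-mers that occur in only one sequence carry relatively less weight.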
     4. Two machine learning models are proposed, a homology based knowledge transfer learning model and a statistical correlation based knowledge transfer learning model, together with a simple non-parametric cross-validation method for estimating the weight distribution of a linear kernel combination. On this basis, knowledge is shared between homologous and statistically correlated proteins, and the time and space complexity of kernel weight estimation is reduced. The relatedness between the target protein and the auxiliary proteins is derived through intuitive biological meaning, and the gene ontology knowledge of homologous proteins and statistically correlated proteins is transferred to the target protein accordingly. A multiple kernel learning system is constructed on the transferred knowledge for protein subcellular localization prediction. Homology based knowledge transfer has the following advantages: it enriches the gene ontology knowledge about the target protein and overcomes the data unavailability of novel proteins and of proteins with little biological evidence. Statistical correlation based knowledge transfer has the following advantages: it enriches protein gene ontology knowledge, tunes the weight distribution among the three aspects of the gene ontology, incorporates gene ontology semantic distance, adjusts gene ontology term coverage, reduces the miss rate of test gene ontology terms, and avoids retraining the model when predicting novel proteins. The kernel weight estimation takes into account the Matthews correlation coefficient (MCC), a measure of performance bias, in order to perform better on large-scale unbalanced protein data.
The experiments on 8 benchmark datasets show that the homology based and statistical correlation based knowledge transfer learning models significantly improve the performance of protein subcellular localization prediction, reduce to a certain degree the unfavorable impact of noise and outliers that gene ontology knowledge transfer may introduce, overcome the performance bias towards large subcellular locations, and perform well on large-scale unbalanced protein data.
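The role of MCC in weighting a linear kernel combination can be sketched as follows. The abstract does not give the actual weighting rule, so the rule used here (clip each kernel's cross-validated MCC at zero, then normalize the scores to sum to one) is purely an assumption, as are the function names.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weights_from_scores(cv_mcc):
    """Turn per-kernel cross-validated MCC scores into combination weights.

    Assumed rule: negative scores are clipped to zero and the remainder
    is normalized to sum to one; uniform weights if nothing is positive.
    """
    clipped = [max(score, 0.0) for score in cv_mcc]
    total = sum(clipped)
    if total == 0:
        return [1.0 / len(cv_mcc)] * len(cv_mcc)
    return [score / total for score in clipped]


def combine_kernels(kernels, weights):
    """Weighted sum of equally sized kernel (Gram) matrices."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for K, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]
```

MCC suits unbalanced data because a classifier that assigns every protein to the majority class scores 0 (for instance `mcc(0, 90, 0, 10)`), even though its accuracy would be 90%.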
引文
Alexander Z and Cheng S (2007a). An automated combination of kernels for predicting protein subcellular localization. Advances in Neural Information Processing Systems, workshop on Machine Learning in Computational Biology.
    Alexander Z and Cheng S (2007b). Multiclass Multiple Kernel Learning. Proceedings of the 24th International Conference on Machine Learning.
    Alejandro S, Ernesto P and Segovia L (2008). Protein homology detection and fold inference through multiple alignment entropy profiles. Proteins,70:248-256.
    Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W and Lipman D (1997). Gapped BLAST and PSI-BLAST:A New Generation of Protein Database Search Programs. Nucleic Acids Research, vol.25, pp.3389-3402.
    Andrews S, Tsochantaridis I and Hofmann T (2003). Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems 15, pages 561-568. Cambridge, MA:MIT Press.
    Ashburner M et al. (2000). Gene ontology:tool for the unification of biology. The Gene Ontology Consortium. Nat Genet,25:25-29
    Atalay V and Atalay R (2005). Implicit motif distribution based hybrid computational kernel for sequence classification. Bioinformatics, vol.21 no.8, pages 1429-1436.
    Avnimelech R and Intrator N (1999). Boosted Mixture of Experts:An ensemble learning scheme. Neural Computation 11:475-490.
    Bach F, Lanckriet G and Jordan J (2004). Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. Proceedings of the 21 st International Conference on Machine Learning.
    Bhasin M and Raghava G (2004). ESLpred:SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Research,Vol.32, Web Server issue, W414-W419.
    Blum A and Mitchell T (1998). Combining labeled and unlabeled data with co-training. In COLT,1998.
    Blum T, Briesemeister S and Kohlbacher O (2009). MultiLoc2:integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction. BMC Bioinformatics,10:274.
    Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A,Gasteiger E, Martin M, Michoud K, Donovan C and Phan I, et al (2003). The SWISS-PROT protein knowledgebase and its Supplement.TrEMBL. Nucleic Acids Research, 31:365-370.
    Boland M and Murphy R (2001). A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells. Bioinformatics,17(12):1213-1223.
    Branden C and Tooze J (2006). Introduction to Protein Structure (Second Edition). Taylor& Francs Group, pages 12-28.
    Brown D (2008). Efficient functional clustering of protein sequences using the Dirichlet process. Bioinformatics, Vol.24 no.162008, pages 1765-1771.
    Brown K, Mian I, Sjolander K and Haussler D (1994). Hidden Markov models in computational biology:Applications to protein modeling. JMB,235:1501-1531.
    Bulashevska A and Eils R (2006). Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains. BMC Bioinformatics,7:298.
    Busuttil S, Abela J and Pace G (2004). Support Vector Machines with Profile-Based Kernels for Remote Protein Homology Detection. Genome Informatics 15(2): 191-200.
    Cai Y and Chou K (2003). Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudoamino acid composition. Biochem. Biophys. Res. Commun.305,407-411.
    Cai Y and Chou K (2004). Predicting subcellular localization of proteins in a hybridization space. Bioinformatics, vol.20 no.7, pages 1151-1156.
    Cameron J, Hurd T and Robinson B (2005). Computational identification of human mitochondrial proteins based on homology to yeast mitochondrially targeted proteins. Bioinformatics, vol.21 no.9, pages 1825-1830.
    Carpenter A, Jones T, Lamprecht M, Clarke C, Kang I, Friman O, Guertin D,Chang J, Lindquist R, Moffat J, Golland P and Sabatini D (2006). CellProfiler:image analysis software for identifying and quantifying cell phenotypes. Genome Biology 2006,7:R100.
    Cedano J, Aloy P, P'erez-Pons J and Querol E (1997). Relation between amino acid composition and cellular location of proteins. Journal of Molecular Biology, 266,594-600.
    Chen S, Zhao T, Gordon G and Murphy R (2007).Automated image analysis of protein localization in budding yeast. ISMB/ECCB, vol.23,pages i66-i71.
    Chen X, Velliste M, Weinstein S, Jarvik J, Murphy R (2003). Location proteomics-building subcellular location trees from high resolution 3D fluorescence microscope images of randomly-tagged proteins. Proc SPIE.4962:298-306.
    Chen X and Murphy R (2005). Objective Clustering of Proteins Based on Subcellular Location Patterns. Journal of Biomedicine and Biotechnology,2,87-95.
    Chen X.W and Liu M (2005). Prediction of protein-protein interactions using random decision forest framework. Bioinformatics, vol.21 no.24, pages 4394-4400.
    Chen Y and Wang J (2004). Image categorization by learning and reasoning with regions. Journal of Machine Learning Research,5:913-939.
    Chen Y.H, Garcia E, Gupta M, Rahimi A and Cazzanti L (2009). Similarity-based Classification:Concepts and Algorithms. Journal of Machine Learning Research 10747-776.
    Chou K (2000). Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications,278,477-483.
    Chou K and Cai Y (2002). Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location. The Journal of Biological Chemistry, vol.277, no.48, pp.45765-45769.
    Chou K and Cai Y (2004a) Prediction of protein subcellular locations by GO-FunD-PseAA predictor. Biochem Biophys Res Commun,320:1236-1239.
    Chou K and Cai Y. (2004b) Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition. J. Cell. Biochem.,91,1197-1203.
    Chou K and Cai Y (2005). Predicting protein localization in budding yeast. Bioinformatics, vol.21 no.7, pages 944-950.
    Chou K and Shen H (2006a), Large-scale predictions of gram-negative bacterial protein subcellular locations, J. Proteome Res.5.
    Chou K and Shen H (2006b). Hum-PLoc:A novel ensemble classifier for predicting human protein subcellular localization. Biochem. Biophys. Res. Commun.347, 150-157.
    Chou K and Shen H (2006c). Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers, J. Proteome Res.5,1888-1897.
    Chou K and Shen H (2007a). Recent progresses in protein subcellular location prediction. Analytical Biochemistry,370,1-16.
    Chou K and Shen H (2007b). Euk-mPLoc:a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. J Proteome Res,6:1728-1734.
    Chou K and Shen H (2008). Cell-PLoc:A package of web-servers for predicting subcellular localization of proteins in various organisms. Nature Protocols,3, 153-162.
    Collobert R (2003). Scaling Large Learning Problems with Hard Parallel Mixtures. International Journal of Pattern Recognition and Artificial Intelligence,16:54.
    Dai W, Yang Q, Xue G and Yu Y (2007). Boosting for Transfer Learning. Proceedings of the 24 th International Conference on Machine Learning.
    Dai W, Chen Y, Xue G, Yang Q and Yu Y (2008). Translated Learning:Transfer Learning across Different Feature Spaces. NIPS 2008.
    Damoulas T and Girolami M (2008). Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics, vol. 24 no.10, pages 1264-1270.
    Dellaire G, Farrall R and Bickmore W (2003). The Nuclear Protein Database (NPD): subnuclear localisation and functional annotation of the nuclear proteome. Nucl Acids Res,31:328-330.
    Dietterich T, Lathrop R, and Lozano T (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence,89(1-2):31-71.
    Dijk A, Bosch D, Braak C, Krol A, Ham R:Predicting sub-Golgi localization of type II membrane proteins. Bioinformatics 2008,24(16):1779-1786
    Drawid A and Gerstein M (2000). A Bayesian System Integrating Expression Data with Sequence Patterns for Localizing Proteins:Comprehensive Application to the Yeast Genome. J. Mol. Biol.301,1059-1075.
    Eddy S (1998). Profile hidden Markov models. Bioionformatics, vol.14 no.9, pages 755-763.
    Emanuelsson O (2000). Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol.,300,1005-1016.
    Eskin E and Snir S (2005). The Homology Kernel:A Biologically Motivated Sequence Embedding into Euclidean Space. Computational Intelligence in Bioinformatics and Computational Biology. Proceedings of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology.
    Gardy J, Laird M, Chen F, Rey S, Walsh C, Ester M and Brinkman F (2005). PSORTb v.2.0:Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics, vol.21 no.5, pages 617-623.
    Gardy J, Spencer C, Wang K, Esterl M, Tusn'dy G, Simon I, Hua S, DeFays F, Lambert C, Nakai K and Brinkman F (2003). PSORT-B:improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Research,Vol.31, No.13 3613-3617.
    G"artner T, Flach, P., Kowalczyk A and Smola A (2002). Multi-instance kernels. ICML, pp.179-186.
    Gisbert Schneider and Uli Fechner.Review Advances in the prediction of protein targeting signals. Proteomics,2004,4,1571-1580.
    Girolami M and Zhong M (2007). Data Integration for Classification Problems Employing Gaussian Process Priors. Advances in Neural Information Processing Systems,19.
    Gomez S, Noble W and Rzhetsky A (2003).Learning to predict protein-protein interactions from protein sequences. Bioinformatics, vol.19 no.15, pages 1875-1881.
    Guo J, Lin Y and Sun Z (2004). A Novel Method for Protein Subcellular Localization Based on Boosting and Probabilistic Neural Network. The 2nd Asia-Pacific Bioinformatics Conference, Vol.29.
    Guo J, Lin Y and Sun Z (2005). A novel method for protein subcellular localization: Combining residue-couple model and SVM. APBC 2005:117-129.
    Guo J and Lin Y (2006). TSSub:eukaryotic protein subcellular localization by extracting features from profiles. Bioinformatics, vol.22 no.14, pages 1784-1785.
    Hamilton N, Pantelic R, Hanson K and Teasdale R (2007).Fast automated cell phenotype image classification. BMC Bioinformatics,8:110.
    Hoglund A, Pierre Donnes, Torsten Bluml, Hans-Werner Adolph and Oliver Kohlbacher (2006). MultiLoc:prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics, vol.22 no.10, pages 1158-1165.
    Hsu C and Lin C (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks,13(2):415-425.
    Huang K and Murphy R (2004). Boosting accuracy of automated classification of fluorescence microscope images for location proteomics. BMC Bioinformatics 5:78.
    Huang W, Tunq C, Ho S, Hwang S and Ho S (2008). ProLoc-GO:utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization. BMC Bioinformatics,9:80
    Huang W, Tung C, Huang H and Ho S (2009). Predicting protein subnuclear localization using GO-amino-acid composition features. BioSystems
    Jaakkola T. et al. (2000) A discriminative framework for detecting remote protein homologies. J. Comput. Biol.,7,95-114.
    Jaakkola T, Diekhans M and Haussler D (1999). Using the Fisher kernel method to detect remote protein homologies. ISMB 1999.
    Jacobs R, Jordan M, Nowlan S and Hinton G (1991). Adaptive Mixtures of Local Experts. Neural Computation 3,79-87.
    James T and Cheung P (2007). Marginalized Multi-Instance Kernels. IJCAI,9.
    Jebara T, Kondor R and Howard A (2004). Probability Product Kernels. Journal of Machine Learning Research 5,819-844.
    Jebara T (2004). Multi-task Feature and Kernel Selection for SVMs. Proceedings of the 21st International Conference on Machine Learning.
    Jia P, Qian Z, Zeng Z, Cai Y and Li Y (2007). Prediction of subcellular protein localization based on functional domain composition. Biochemical and Biophysical Research Communications,357,366-370.
    Karplus K, Sj"olander K, Barrett C, Cline M, Haussler D, Hughey R, Holm L and Sander C (1997). Predicting protein structure using hidden Markov models. Proteins:Structure, Function, and Genetics, Suppl.1:134-139.
    Karplus K, Barrett C, and Hughey R (1998). Hidden markov models for detecting remote protein homologies. Bioinformatics,14(10):846-856.
    Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M (2008). AAindex:amino acid index database, progress report 2008. Nucleic Acids Research 36, D202-D205.
    Kowalski M, Szafranski M and Ralaivola L (2009). Multiple Indefinite Kernel Learning with Mixed Norm Regularization. Proceedings of the 26 th International Conference on Machine Learning.
    Kuang R, Ie E, Wang K, Siddiqi M, Freund Y and Leslie C (2005). Profile based string kernels for remote homology detection and motif extraction. J Bioinform Comput Biol,3:527-550.
    Kuang R, Jianying Gu, Hong Cai and Yufeng Wang (2009). Improved prediction of malaria degradomes by supervised learning with SVM and profile kernel. Genetica,136:189-209.
    Kuncheva L and Whitaker C (2003). Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Machine Learning,51, 181-207.
    Lanckriet G, DeBie T, Cristianini N, Jordan M and Noble W (2004a). A statistical framework for genomic data fusion. Bioinformatics,20(16):2626-2635.
    Lanckriet G, Cristianini N, Bartlett P, Ghaoui L and Jordan M (2004b). Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research 5,27-72.
    Lartillot N and Philippe H (2004). A Bayesian Mixture Model for Across-Site Heterogeneities in the Amino-Acid Replacement Process. Molecular Biology and Evolution vol.21 no.6.
    Lee K, Chuang H, Beyer A, Sung M, Huh W, Lee B and Ideker T (2008). Protein networks markedly improve prediction of subcellular localization in multiple eukaryotic species. Nucleic Acids Research,36(20):e136
    Lei Z and Dai Y (2004). A Novel Approach for Prediction of Protein Subcellular Localization from Sequence Using Fourier Analysis and Support Vector Machines. The 4th Workshop on Data Mining in Bioinformatics.
    Lei Z and Dai Y (2005). An SVM-based system for predicting protein subnuclear localizations. BMC Bioinformatics,6:291.
    Lei Z and Dai Y (2006). Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction. BMC Bioinformatics,7:491
    Leslie C, Eskin E, Noble W.:The spectrum kernel:a string kernel for SVM protein classification. Proc. Pac. Biocomput. Symp.2002,7:566-575
    Leslie C, Eskin E and Cohen A, Weston J, and Noble W (2004).Mismatch string kernels for discriminative protein classification. Bioinformatics, vol.20 no.4, pages 467-476.
    Lewis D, Jebara T and Noble W (2006). Support vector machine learning from heterogeneous data:an empirical analysis using protein sequence and structure. Bioinformatics, vol.22 no.22, pages 2753-2760.
    Lima C, Coelho A, Zuben F (2007). Hybridizing mixtures of experts with support vector machines:Investigation into nonlinear dynamic systems identification. Information Sciences 177 2049-2074.
    Lin C, Tsai Y, Lin Y, Chiu T, Hsiung C, Lee M, Simpson J and Hsu C (2007). Boosting multiclass learning with repeating codes and weak detectors for protein subcellular localization. Bioinformatics, vol.23 no.24, pages 3374-3381.
    Lio P (2003). Wavelets in bioinformatics and computational biology:state of art and perspectives. Bioinformatics, vol.19 no.1 pp 2-9.
    Liu Z and Chen D (2005). Classification of Chromosome Sequences with Entropy Kernel and LKPLS Algorithm. Springer, LNCS 3644, pp.543-551.
    Lu Z and Hunter L (2005). GO molecular function terms are predictive of subcellular localization. Pac Symp Biocomput 2005:151-61.
    Mak M, Guo J and Kung S (2008). PairProSVM:Protein Subcellular Localization Based on Local Pairwise Profile Alignment and SVM. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL.5, NO.3.
    Marcotte E, Xenarios I, van Der Bliek A and Eisenberg D (2000). Localizing proteins in the cell from their phylogenetic profiles. Proc. Natl Acad. Sci.,1997, 12115-12120.
    Maron O and Ratan A (1998). Multiple-instance learning for natural scene classification. In Proceedings of the 15th International Conference on Machine Learning,1998, pages 341-349, Madison, MI.
    Marsella L, Sirocco F, Trovato A, Seno F and Tosatto S (2009). REPETITA:detection and discrimination of the periodicity of protein solenoid repeats by discrete Fourier transform. Bioinformatics, vol.25 ISMB, pages i289-i295.
    Matthew B and Thomas H (2006). Conformal Multi-Instance Kernels. NIPS: Workshop on Learning to Compare Examples, 1-6.Mench S, Costa F and Frasconi P (2005).Weighted Decomposition Kernels. Proceedings of the 22 nd International Conference on Machine Learning.
    Medvedovic M and Sivaganesan S (2002). Bayesian Infinite Mixture Model Based Clustering of Gene Expression Profiles. Bioinformatics, vol.18 no.9.
    Meir R and Ratsch G (2003). An Introduction to Boosting and Leveraging. Springer-Verlag, LNAI 2600, pp.118-183.
    Min R, Bonner A, Li J, and Zhang Z (2009). Learned Random-Walk Kernels and Empirical-Map Kernels for Protein Sequence Classification. Journal of computational biology, vol.16, no.3, pp 955-972.
    Misselwitz B, Strittmatter G, Periaswamy B, Schlumberger M, Rout S, Horvath P, Kozak K and Hardt W (2010). Enhanced CellClassifier:a multi-class classification tool for microscopy images. BMC Bioinformatics,11:30
    Mott R, Schultz J, Bork P and Ponting C (2002). Predicting Protein Cellular Localization Using a Domain Projection Method. Genome Research, 12:1168-1174.
    Murphy R, Boland M and Velliste M (2000). Towards a systematics for protein Subcellular location:quantitative description of protein localization patterns and automated analysis of fluorescence microscope images. Proc Int Conf Intell Syst Mol Biol,8:251-25.
    Murphy R, Velliste M and Porreca G (2002).Robust Classification of Subcellular Location Patterns in Fluorescence Microscope Images. Proceedings of the 2002 IEEE International Workshop on Neural Networks for Signal Processing.
    Murray K, Gorse D and Thornton J (2002). Wavelet Transforms for the Characterization and Detection of Repeating Motifs. J. Mol. Biol.316,341-363.
    Nakai K and Kanehisa M (1991). Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins,11,95-110.
    Nanni L and Lumini A (2008). An ensemble of support vector machines for predicting the membrane protein type directly from the amino acid sequence. Amino Acids 35:573-580.
    Nielsen H, Engelbrecht J, Brunak S and Heijne G (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering vol.10 no.1 pp.1-6.
    Pan S and Yang Q (2008). A Survey on Transfer Learning. Report HKUST-CS08-08, Department of Computer Science and Engineering Hong Kong University of Sci-ence and Technology.
    Pellegrini M, Marcotte E, Thompson M, Eisenberg D and Yeates T (1999). Assigning protein functions by comparative genome analysis:Protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA Vol.96, pp.4285-4288.
    Peng H (2008). Bioimage informatics:a new area of engineering biology. Bioinformatics, vol.24 no.17, pages 1827-1836.
    Pfeifer N and Kohlbacher O (2008). Multiple Instance Learning Allows MHC Class Ⅱ Epitope Predictions Across Alleles. Springer, LNBI 5251,210-221.
    Pierleoni A, Martelli P, Fariselli P and Casadio R (2006). BaCelLo:a balanced subcellular localization predictor. Bioinformatics, vol.22 no.14, pages e408-e416.
    Qi Y, Seetharaman J, and Joseph Z (2005). Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple Sources. Pacific Symposium on Biocomputing 10:531-542.
    Qiu J, Hue M, Hur A, Vert J and Noble W (2007). A structural alignment kernel for protein structures. Bioinformatics, vol.23 no.9, pages 1090-1098.
    Qiu J, Luo S, Huang J, Sun X and Liang R (2009a). Predicting subcellular location of apoptosis proteins based on wavelet transform and support vector machine. Amino Acids.
    Qiu J, Luo S, Huang J and Liang R (2009b). Using support vector machines to distinguish enzymes:Approached by incorporating wavelet transform. Journal of Theoretical Biology 256,625-631.
    Rakotomamonjy A, Bach F, Canu S and Grandvalet Y (2007). More Efficiency in Multiple Kernel Learning. Proceedings of the 24 th International Conference on Machine Learning.
    Rakotomamonjy A, Bach F, Canu S and Grandvalet Y (2008).SimpleMKL. Journal of Machine Learning Research 9,2491-2521.
    Ramon D and Sara A (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics,7:3.
    Ramo P, Sacher R, Snijder B, Begemann B and Pelkmans L (2009). CellClassifier: supervised learning of cellular phenotypes. Bioinformatics, vol.25 no.22 2009, pages 3028-3030.
    Rangwala H and Karypis G (2005). Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, vol.21 no.23, pages 4239-4247.
    Rashid M, Saha S and Raghava G (2007). Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs. BMC Bioinformatics,8:337.
    Schneider G and Fechner U (2004). Review Advances in the prediction of protein targeting signals. Proteomics,4,1571-1580.
    Scott M, Calafell S, Thomas D and Hallett M (2005). Refining Protein Subcellular Localization. PLoS Computational Biology, vol.1,6.
    Scott M, Thomas D and Hallett M (2004). Predicting Subcellular Localization via Protein Motif Co-Occurrence. Genome Research,14:1957-1966.
    Shatkay H, Chen N and Blostein D (2006). Integrating image data into biomedical text categorization. Bioinformatics, vol.22 no.14, pages e446-e453.
    Shen H, Yanq J, Chou KC (2007a). Euk-PLoc:an ensemble classifier for large-scale eukaryotic protein subcellular location prediction. Amino Acids,33:57-67
    Shen H and Chou K (2005). Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition. Biochemical and Biophysical Research Communications 337,752-756.
    Shen H and Kuo-Chen Chou (2007b). Hum-mPLoc:An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple site. Biochemical and Biophysical Research Communications 355,1006-1011.
    Shen H and Chou K (2007c). Nuc-PLoc:a new web-server for predicting protein subnuclear localization by fusing PseAA composition and PsePSSM. Protein Eng Des Sel,20:561-567
    Shen H and Chou K (2007d). Virus-PLoc:A fusion classifier for predicting the subcellular localization of viral proteins within host and virusinfected cells. Biopolymers 85,233-240.
    Shen H and Chou K (2007e). Gpos-PLoc:an ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins.Protein Engineering, Design& Selection vol.20 no.1 pp.39-46.
    Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, and Jiang H (2007). Predicting protein-protein interactions based only on sequences information. PNAS, vol.104, no.11,4337-4341.
    Shi F and Chen Q (2006). Prediction of MHC Class I Binding Peptides Using Fourier Analysis and Support Vector Machine. FSKD 2006, LNAI 4223, pp.1072-1081.
    Shira M, Asaph A, Eytan R and Tomer S (2009). Network-based prediction of metabolic enzymes'subcellular localization. Bioinformatics, vol.25 ISMB 2009, pages i247-i252.
    Sj"olander K, Karplus K, Brown M, Hughey R, Krogh A, Mian S and Haussler D (1996). Dirichlet mixtures:a method for improved detection of weak but significant protein sequence homology.CABIOS, Vol.12 no.4, pages 327-345.
    Sonnenburg S, Ratsch G, Schafer C and Scholkopf B (2006). Large Scale Multiple Kernel Learning. Journal of Machine Learning Research,7.1531-1565.
    Sonnhammer E, Eddy S, Birney E, Bateman A and Durbin R (1998). Pfam:multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Research, Vol.26, No.1,320-322.
    Stuart Andrews, Ioannis Tsochantaridis and Thomas Hofmann (2003). Support Vector Machines for Multiple-Instance Learning. Advances in Neural Information Processing Systems 15.
    Swarup S and Ray S (2006). Cross-Domain Knowledge Transfer Using Structured Representations. Proceedings of the 21st national conference on Artificial intelligence
    Taylor J and Cristianini N (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.
    Trad C, Fang Q and Cosic I (2002). Protein sequence comparison based on the wavelet transform approach. Protein Engineering, vol.15 no.3 pp.193-203.
    Tung T, Lee D (2009). A method to improve protein subcellular localization prediction by integrating various biological data sources. BMC Bioinformatics 2009,10(Suppl 1):S43
    Vapnik V (1998). Statistical Learning Theory, Springer.
    Vert J (2002). Support Vector Machine Prediction Of Signal Peptide Cleavage Site Using A New Class Of Kernels For Strings. Pacific Symposium on Biocomputing 7:649-660.
    Wei N, Flaschel E, Friehs K and Nattkemper T (2008). A machine vision system for automated non-invasive assessment of cell viability via dark field microscopy, wavelet feature selection and classification. BMC Bioinformatics,9:449.
    Weston J, Leslie C, Ie E, Zhou D, Elisseeff A and Noble W (2005). Semi-supervised protein classification using cluster kernels. Bioinformatics, vol.21 no.15, pages 3241-3247.
    Xie D,Li A, Wang M, Fan Z and Feng H (2005). LOCSVMPSI:a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acids Research, Vol.33, Web Server issue W105-W110.
    Xing E, Sharan R and Jordan M (2004a). Bayesian Haplotype Inference via the Dirichlet Process. Proceedings of the 21st International Conference on Machine Learning.
    Xu X and Frank E (2004). Logistic regression and boosting for labeled bags of instances. In H. Dai, R. Srikant, and C. Zhang, editors, Lecture Notes in Artificial Intelligence 3056, pages 272-281. Springer, Berlin.
    Xu Q, Hu D, Xue H, Yu W and Yang Q (2009). Semi-supervised protein subcellular localization. BMC Bioinformatics,10(Suppl 1):S47.
    Yang J (2005). Review of Multi-Instance Learning and Its applications. http://www-2.cs.cmu.edu/~juny/MILL/MILL.zip
    Yang Q, Chen Y, Xue G, Dai W and Yu Y (2009). Heterogeneous Transfer Learning for Image Clustering via the SocialWeb. Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 1-9.
    Ying Y, Huang K and Campbell C (2009). Enhanced Protein Fold Prediction through a Novel Data Integration Approach. BMC Bioinformatics,10:267.
    Yuan Z (1999). Prediction of protein subcellular locations using Markov chain models. FEBS Letters 451,23-26.
    Yuan Z and Teasdale R (2002). Prediction of Golgi Type Ⅱ memebrane proteins based on their transmembrane domains. Bioinformatics, vol.18 no 8, pages 1109-1115.
    Zdobnov EM and Apweiler R (2001). InterProScan-an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001,17:847-848.
    Zhao T and Murphy R (2007) Automated learning of generative models for subcellular location:Building blocks for systems biology. Cytometry 71A:978-990.
    Zhou Z and Zhang M (2007). Multi-Instance Multi-Label Learning with Application to Scene Classification. Advances in Neural Information Processing Systems 19.
    Zhou Z, Sun Y and Li Y (2009). Multi-Instance Learning by Treating Instances As Non-I.I.D. Samples. Proceedings of the 26th International Conference on Machine Learning.
    Zhang Z and Henzel W (2004). Signal peptide prediction based on analysis of experimentally verified cleavage sites. Protein Sci.13:2819-2824.
