Research on Prediction of Protein Domains Based on Support Vector Machines (基于支持向量机的蛋白质结构域预测方法研究)

English title: Research on Prediction of Protein Domains Based on Support Vector Machines
Author: 邹淑雪 (Zou Shuxue)
Thesis level: Doctoral
Discipline: Computer Application Technology
Keywords: Protein Sequence ; SVM (Support Vector Machine) ; Protein Domain ; Boundary ; Imbalanced Data Learning ; Fuzzy Systems ; PSO (Particle Swarm Optimization) ; Maximum Entropy ; Undersampling ; Oversampling ; ROC (Receiver Operating Characteristic) ; AUC (Area Under ROC Curve)
Degree year: 2009
Supervisor: 周春光 (Zhou Chunguang)
Discipline code: 081203
Degree-granting institution: Jilin University (吉林大学)
Thesis submission date: 2009-06-01

Abstract

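Bioinformatics is a new interdisciplinary field that arose with the launch of the Human Genome Project. It is the science of storing, retrieving, and analyzing biological information using computers as tools. With the announced completion of the Human Genome Project, the life sciences entered the post-genomic era, and the research focus shifted mainly to genomics and proteomics. Proteomics studies the presence and modes of activity of all proteins in a cell, and the traditional approach of studying single proteins can no longer meet the demands of the post-genomic era. Bioinformatics will become increasingly important in resolving higher-order protein structure. The first step in analyzing a protein is to determine its domain composition, which is the most important step in studying the protein. Detecting protein domains is a challenging problem. In particular, domain analysis directly from sequence information alone has gradually become the main research goal in domain prediction.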
     This thesis presents a fairly in-depth study of detecting domain boundary signals from protein sequence information.
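     1. Features are first derived from sequence analysis and learned with a support vector machine. Several methods are defined to extract features from multiple sequence alignment results, the conformational entropy of the seed sequence is computed from the conformational characteristics of the protein, and information entropy theory is used to maximize the domain information. Finally, a support vector machine learning system classifies the extracted feature values.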
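     2. After investigating why the support vector machine parameters are insensitive to domain boundary signals, the thesis for the first time casts protein domain boundary detection as an imbalanced data learning problem: positions inside domains form the majority negative class, and domain boundaries form the minority positive class. A new undersampling method is proposed that samples, in the SVM feature space, the negative samples having the maximal distance-based entropy with respect to the positive samples.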
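     3. Before SVM learning, the training set is resampled with the genetic-algorithm-based method proposed in this thesis. To evaluate more effectively the classifier trained on the resampled data, AUC (Area Under the ROC Curve) is used as the classifier performance metric and as the fitness function of the genetic algorithm. Experimental results show that the proposed sampling technique is clearly better than random sampling and, in protein domain prediction, clearly outperforms using the SVM classifier alone.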
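The AUC used as the fitness value can be computed directly from classifier scores. The following is a minimal sketch, assuming binary labels (1 for boundary, 0 for domain interior) and real-valued decision scores; the function name `auc` and the pairwise formulation (the Wilcoxon-Mann-Whitney statistic) are illustrative choices, not taken from the thesis.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Wilcoxon-Mann-Whitney statistic.

    Equals the probability that a randomly chosen positive receives a
    higher score than a randomly chosen negative; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Three of the four positive/negative pairs are ranked correctly.
    print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

Unlike plain accuracy, this value does not change when the majority class is simply predicted everywhere with a shifted threshold, which is why it is a reasonable fitness measure on imbalanced data. The quadratic loop is fine for a sketch; a rank-based version would be preferable for large sets.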
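     4. Drawing on the theoretical proof of equivalence between support vector machines and fuzzy classification systems, a fuzzy classification system model based on support vector machines is proposed. First, the SVM learning algorithm is used to obtain a sparse representation of the classification system. The resulting system is then mapped to an equivalent positive definite fuzzy classification system. Finally, the closeness degree of fuzzy sets and particle swarm optimization are used to reduce and optimize the fuzzy rule base. The fuzzy classification system has better generalization ability. Its learning process is equivalent to optimizing the SVM system parameters, but it trains faster.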
Since proteins provide some of the most fundamental information about many processes in almost all organisms, the ability to predict protein structure and function has become one of the most important goals in bioinformatics research. Protein domains are one of the most useful avenues for understanding protein function and for domain family-based analysis, and they are of great importance in the study of individual proteins. Detecting the domain structure of a protein is a challenging problem: for a given protein sequence, one must determine which amino acids lie within a domain and which lie at a domain boundary. There are two versions of the problem. One is to locate the domains and boundaries in a protein whose structure is known. The other is to solve the same problem for a sequence whose structure is unknown. The latter is the more difficult.
     Support vector machines (SVMs) are a statistical learning technique that can be seen as a method for training classifiers based on polynomial functions, radial basis functions, neural networks, splines, or other functions. An SVM uses a separating hyperplane to build a classifier. For problems that cannot be linearly separated in the input space, it applies a non-linear transformation of the original input space into a high-dimensional feature space, where an optimal separating hyperplane can be found. Although SVMs have been studied extensively and have shown remarkable success in many applications, their performance drops significantly on imbalanced datasets. Moreover, this drop is difficult to avoid by modifying the algorithm alone. Sampling, applied as data preprocessing, is therefore a popular strategy for handling class imbalance, since it rebalances the dataset directly.
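For reference, the standard kernel SVM decision function can be written as below. The notation is the usual textbook one and is an assumption here, not taken from the thesis: \alpha_i are the learned multipliers, y_i \in \{-1, +1\} the labels, \phi the feature map, and \gamma the width parameter of a radial basis function kernel.

```latex
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b \right),
\qquad
K(x, z) = \langle \phi(x), \phi(z) \rangle,
\qquad
K_{\mathrm{RBF}}(x, z) = \exp\!\left( -\gamma \, \lVert x - z \rVert^{2} \right)
```

Only training points with \alpha_i > 0 (the support vectors) contribute to the sum, which is why the choice of samples near the class boundary matters so much on imbalanced data.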
     This thesis presents an intensive study of domain boundary detection using only a given protein sequence.
     A promising method for detecting the domain structure of a protein from sequence information alone is presented. Given a query sequence, the algorithm starts by searching a protein sequence database and generating a multiple alignment of all significant hits. The columns of the multiple alignment are analyzed using a variety of sources to define scores that reflect the domain information content of each column: conservation measures based on the composition and classification of amino acids in the column, consistency and correlation measures, and measures of structural flexibility. Information-theoretic principles are employed to maximize the information content. In addition, we adopt a method for predicting domain boundaries from protein sequence alone, based on the theory that a protein's unique three-dimensional structure results from the balance between the gain of attractive native interactions and the loss of conformational entropy. These scores are then combined using a support vector machine to label single columns as core-domain or boundary positions. The overall accuracy of the method on a dataset of single protein chains is about 85%.
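One simple conservation score of the kind described above is the Shannon entropy of the residues in each alignment column. The sketch below is only an illustration under stated assumptions: the alignment is a list of equal-length strings, `-` marks a gap, and the function names are hypothetical. The abstract does not specify which conservation measures the thesis actually uses.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy in bits of one alignment column, ignoring gaps.

    0.0 means fully conserved (or all gaps); higher means more variable.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(alignment):
    """Per-column entropies for a list of equal-length aligned sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_entropy(col) for col in zip(*alignment)]


if __name__ == "__main__":
    toy_alignment = ["ACD-", "ACE-", "AGF-", "ACG-"]
    print(alignment_entropies(toy_alignment))
```

A per-column score like this yields one feature value per sequence position, which is the shape of input a per-position classifier needs.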
     A novel undersampling method using distance-based maximal entropy in the feature space of SVMs is proposed. The SVM's learning mechanism makes it an interesting candidate for dealing with imbalanced datasets, since it builds its model only from the data close to the boundary, i.e. the support vectors. More importantly, because SVMs are kernel-based methods, their classification is defined in the feature space, and so is our undersampling preprocessing. Negatives that are very close to or very distant from a given positive are therefore not sampled. Negatives too close to the learned hyperplane may skew it, while those far away from it cannot become support vectors and contribute nothing to training. Negatives whose distance is close to the mean distance contribute the most. The negatives with the maximal entropy value with respect to their counterpart positives are selected by undersampling, so that the input data are no longer imbalanced. As a result, the learned hyperplane lies further from the positive class, which compensates for the skew in imbalanced datasets that pushes the hyperplane toward the positive class.
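The idea of measuring distance in the kernel feature space can be sketched as follows. This is not the thesis's entropy criterion, which the abstract does not define. As a stand-in, the sketch keeps the negatives whose feature-space distance to their nearest positive is closest to the mean of those distances, matching the remark that very near and very far negatives are discarded. The RBF kernel, `gamma`, and all function names are assumptions.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)


def feature_space_distance(x, z, gamma=0.5):
    """Distance between phi(x) and phi(z) computed with the kernel trick.

    ||phi(x) - phi(z)||^2 = K(x, x) - 2 K(x, z) + K(z, z), and K(x, x) = 1
    for the RBF kernel, so the squared distance is 2 - 2 K(x, z).
    """
    return math.sqrt(max(0.0, 2.0 - 2.0 * rbf_kernel(x, z, gamma)))


def undersample_near_mean(negatives, positives, keep, gamma=0.5):
    """Keep the `keep` negatives whose nearest-positive distance is closest
    to the mean of those distances (a simplified stand-in heuristic)."""
    dists = [min(feature_space_distance(n, p, gamma) for p in positives)
             for n in negatives]
    mean = sum(dists) / len(dists)
    ranked = sorted(range(len(negatives)), key=lambda i: abs(dists[i] - mean))
    return [negatives[i] for i in sorted(ranked[:keep])]


if __name__ == "__main__":
    positives = [(0.0, 0.0)]
    negatives = [(0.1, 0.0), (1.0, 0.0), (1.2, 0.0), (5.0, 0.0)]
    print(undersample_near_mean(negatives, positives, keep=2))
```

In the toy data the negative almost on top of the positive and the one far away are both dropped, leaving the two at intermediate distance.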
     Given a query sequence, the algorithm starts by searching a local sequence database and generating a multiple alignment of all significant hits. The columns of the multiple alignment are analyzed using a variety of sources to define scores that reflect the domain information content of each column, and information-theoretic principles are employed to maximize that content. A further feature is extracted from the conformational entropy of the protein sequence. The result is an imbalanced training set. Next, the dataset is resampled to form an initial population of N individuals for a genetic algorithm. Two sampling techniques are tested separately: oversampling the minority class and undersampling the majority class. An SVM is trained on each resampled training set and the corresponding AUC value is computed. The population is updated by the three basic genetic operators (reproduction, crossover, and mutation) according to the AUC fitness value. SVM learning and population updating are iterated until convergence or until the maximum number of iterations is reached. Finally, a fuzzy classification system model based on support vector machines is proposed.
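The genetic resampling loop can be sketched as a search over binary masks, where each bit says whether a training sample is kept. This is a minimal sketch under assumptions: in the thesis the fitness is the AUC of an SVM trained on the resampled set, whereas here `fitness` is any caller-supplied function, and the operator details (binary tournament selection, one-point crossover, bit-flip mutation, elitism) and parameter values are illustrative choices.

```python
import random


def evolve_masks(n_items, fitness, pop_size=20, generations=30,
                 crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Evolve 0/1 masks of length n_items to maximize fitness(mask)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_items)]
                  for _ in range(pop_size)]
    best, best_score = None, float("-inf")

    def select(scored):
        # Reproduction: binary tournament, the fitter of two random picks.
        a, b = rng.sample(scored, 2)
        return (a if a[0] >= b[0] else b)[1]

    for gen in range(generations + 1):
        scored = [(fitness(ind), ind) for ind in population]
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score > best_score:
            best, best_score = top[:], top_score
        if gen == generations:
            break
        children = [best[:]]  # elitism: carry the best mask forward
        while len(children) < pop_size:
            p1, p2 = select(scored), select(scored)
            if n_items > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_items - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return best, best_score


if __name__ == "__main__":
    # Toy fitness (count of kept items) standing in for an AUC evaluation.
    mask, score = evolve_masks(12, lambda m: sum(m), seed=1)
    print(mask, score)
```

To approximate the described procedure, `fitness` would train a classifier on the samples selected by the mask and return its AUC on held-out data. That step is the expensive part, so population size and generation count would need to be chosen with the training cost in mind.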
     As a powerful tool for dealing with complex uncertainty problems, fuzzy system theories (L. A. Zadeh et al.) have succeeded in many applications such as signal processing and pattern recognition. However, they often suffer from the curse of dimensionality on high-dimensional data. SVMs and fuzzy systems are complementary in such cases. Earlier work proved the equivalence between SVMs and positive definite fuzzy classifiers, which makes it possible to combine SVMs with fuzzy systems. Reduction methods are developed to minimize the complexity of the system by reducing the linguistic terms in the fuzzy rules based on the similarity of fuzzy sets, and by removing redundant and inconsistent fuzzy rules. Finally, particle swarm optimization is used to adjust the system parameters to compensate for the deviation caused by the reduction. Experimental results show that the methods are feasible and effective.
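The parameter adjustment step relies on particle swarm optimization. The sketch below shows a generic global-best PSO on a toy objective. It is an assumption-laden illustration: the objective, bounds, swarm size, and coefficient values are placeholders, and the abstract does not say which PSO variant or which system parameters the thesis tunes.

```python
import random


def pso_minimize(objective, dim, bounds=(-5.0, 5.0), n_particles=20,
                 iterations=100, inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective(position) with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position per particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position of the swarm

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to own best + pull to swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp the coordinate to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


if __name__ == "__main__":
    sphere = lambda p: sum(v * v for v in p)  # toy objective, minimum at 0
    print(pso_minimize(sphere, dim=2))
```

For the fuzzy classifier, the position vector would hold the tunable system parameters and the objective would be a classification error measured after rule reduction.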

    [103] Pollastri,G., Przybylski,D., Rost,B. and Baldi,P. ,Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles [J]. Proteins, 2002, 47, 228–235.
    [104] Pollastri,G., Baldi,P., Fariselli,P. and Casadio,R., Prediction of coordination number and relative solvent accessibility in proteins [J]. Proteins, 2002, 47, 142–153.
    [105] Sikder AR, Zomaya AY, Improving the performance of Domain-Discovery of protein domain boundary assignment using inter-domain linker index [J]. BMC Bioinformatics 2006, 7(Suppl5):S6.
    [106] Suyama M, Ohara O: DomCut: prediction of inter-domain linker regions in amino acid sequences [J]. Bioinformatics, 2003, 19(5):673-674.
    [107] Murvai, J., Vlahovicek, K., Szepesvari, C. and Pongor, S., Prediction of Protein Functional Domains from Sequences Using Artificial Neural Networks [J]. Genome Res., 2001,11, 1410-1417.
    [108] Vapnik V著,张学工译,统计学习理论的本质[M].北京,清华大学出版社, 2000年9月.
    [109] Osuna, E., Freund, R. and Girosi, F. Training support vector machines: an application to face detection [J]. In IEEE Conference on Computer Vision and Pattern Recognition, 1997, pages 130–136.
    [110] John C P. Fast training of support vector machines using sequential minimal optimization. In Scholkopf B. et al(ed.), Advances in Kernel Methods-Support Vector Learning [M], Cambridge, MA,MIT Press, 1999,185~208.
    [111] Joachims T.. Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning [M], Cambridge: MIT Press, 1999.
    [112] Raskutti B. and Kowalczyk A., Extreme rebalancing for svms: a case study [J]. SIGKDD Explorations, 2004, 6(1):60-69.
    [113] Wu, G. & Chang, E., Class-Boundary Alignment for Imbalanced Dataset Learning. In ICML 2003 Workshop on Learning from Imbalanced Data Sets II [C], Washington, DC. 2003.
    [114] Yan R., Liu Y., Jin R. and Hauptmann A., On predicting rare classes with SVM ensembles in scene classification [C]. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2003.
    [115] Chawla N. V., Hall L. O., Bowyer K. W., and Kegelmeyer W. P., SMOTE: Synthetic Minority Oversampling Technique [J]. Journal of Artificial Intelligence Research, 2002, 16:321-357.
    [116] Provost, F., & Fawcett, T., Robust classification for imprecise environments [J]. Machine Learning, 2001, 42,203-231.
    [117] Joshi M. V., Kumar V. and Agarwal R. C., Evaluating boosting algorithms to classify rare cases: comparison and improvements [J]. In First IEEE International Conference on Data Mining, 2001, 257-264.
    [118] Weiss G.M., Mining with rarity: A unifying framework [J]. ACM SIGKDD Explorations Newsletter, 2004,6(1):7-19.
    [119] Kubat M. and Matwin S., Addressing the curse of imbalanced training sets: One sided selection [C]. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179-186, Nashville, Tennesse, 1997. Morgan Kaufmann.
    [120] Zheng Z., Wu X., and Srihari R., Feature selection for text categorization on imbalanced data [J]. SIGKDD Explorations, 2004,6(1):80-89.
    [121] Drummond, C. and Holte R.C., C4.5, Class Imbalance, and Cost Sensitivity: Why Under-sampling beats Over-sampling.In ICML 2003 Workshop on Learning from Imbalanced Data Sets II [C], Washington, DC. 2003.
    [122] Barandela, R., Sánchez, J.S., García, V., Rangel, E., Strategies for learning in class imbalance problems [J], Pattern Recognition, 2003,36(3),849-851.
    [123] Veropoulos, K., Campbell, C. and Cristianini, N., Controlling the sensitivity of support vector machines [C]. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), 55–60, 1999.
    [124] Japkowicz N., Concept-learning in the presence of between-class and within-class imbalances.In Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence [C], pages 67-77, 2001.
    [125] Cohen W.W., Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning [C], pages 115-123, 1995.
    [126] Domingos P., MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining [C], pp. 155-164. ACM Press.
    [127] Estabrooks, and N. Japkowicz. A mixture-of-experts framework for learning from unbalanced data sets. In Proceedings of the 2001 Intelligent Data Analysis Conference [C], pages 34-43, 2001.
    [128] Estabrooks A., Taeho J, and Japkowicz N., A Multiple Resampling Method for Learning from Imbalanced Data Sets. Computational Intelligence [C], 20(1), 18-36,2004.
    [129] Chan P. K. and Stolfo S. J., Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining [C], pages 164-168, 2001.
    [130] Weiss G. M. and Provost F., Learning when training data are costly: the effect of class distribution on tree induction [J]. Journal of Artificial Intelligence Research, 2003, 19:315-354.
    [131] Chawla N. V., Lazarevic A., Hall L. O. and Bowyer K. W., Smoteboost: Improving prediction of the minority class in boosting. In Proceedings of the Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases [C], 107-119, Dubrovnik, Croatia, 2003.
    [132] Kotsiantis S., Pintelas P., Mixture of Expert Agents for Handling Imbalanced Data Sets [J]. Annals of Mathematics, Computing & TeleInformatics, 2003, 1:46-55.
    [133] Huang K., Yang H., King I., Michael R., Learning Classifiers from Imbalanced Data Based on Biased Minimax Probability Machine. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition [C], 2004.
    [134] Kubat M. and Matwin S., Addressing the curse of imbalanced training sets: One sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning [C],pages 179-186,Nashville, Tennesse, 1997. Morgan Kaufman.
    [135] Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms [J]. Pattern Recognition, 1997, 30(7): 1145-1159.
    [136] Estabrooks and Japkowicz N.. A mixture-of-experts framework for learning from unbalanced data sets. In Proceedings of the 2001 Intelligent Data Analysis Conference [C], pages 34-43, 2001.
    [137] Provost F. and Fawcett T., Robust classification for imprecise environments [J]. Machine Learning, 2001, 42:203-231.
    [138] Berman H. M., Westbrook J., Feng Z., Gilliland G., Bhat T. N., Weissig H., Shindyalov I. N., Bourne P. E., The Protein Data Bank [J]. Nucleic Acids Res., 2000, 28(1):235-242.
    [139] Laskowski R. A., PDBsum: summaries and analyses of PDB structures [J]. Nucleic Acids Res., 2001, 29(1):221-222.
    [140] Henikoff J. G. and Henikoff S., Using substitution probabilities to improve position-specific scoring matrices [J]. Comp. App. Biosci., 1996, 12(2):135-143.
    [141] Kosiol C., Goldman N. and Buttimore N. H., A new criterion and method for amino acid classification [J]. Journal of Theoretical Biology, 2004, 228:97-106.
    [142] Taylor W. R., The classification of amino acid conservation [J]. J. Theoret. Biol., 1986, 119:205-218.
    [143] Nagarajan N. and Yona G., Automatic prediction of protein domains from sequence information using a hybrid learning system [J]. Bioinformatics, 2004, 1, 1-27.
    [144] Veropoulos K., Campbell C. and Cristianini N., Controlling the sensitivity of support vector machines [C]. Proceedings of the International Joint Conference on Artificial Intelligence, 1999, 55-60.
    [145] Akbani R., Kwek S., Japkowicz N., Applying support vector machines to imbalanced datasets [C]. Proc. 15th European Conf. on Machine Learning (ECML), Pisa, Italy, Sep. 20-24, 2004, pp. 39-50.
    [146] Goldberg D. E., Genetic Algorithms in Search, Optimization, and Machine Learning [M], Reading, MA: Addison-Wesley, 1989.
    [147] Jaynes E. T., Information theory and statistical mechanics [J]. The Physical Review, 1957, 106:620-630.
    [148] Yan L., Dodier R., Mozer M. C., Wolniewicz R., Optimizing classifier performance via the Wilcoxon-Mann-Whitney statistics. In Proceedings of the International Conference on Machine Learning [C], 2003.
    [149] Ling C., Huang J. and Zhang H., AUC: a better measure than accuracy in comparing learning algorithms. In Proceedings of the 2003 Canadian Artificial Intelligence Conference [C], 2003.
    [150] Chen Y., Wang J.Z., Support vector learning for fuzzy rule-based classification systems [J]. IEEE Transactions on Fuzzy Systems, 2003, vol. 11, no. 6, pp. 716-728.
    [151] Goldberg D. E., Genetic Algorithms in Search, Optimization, and Machine Learning [M], Reading, MA: Addison-Wesley, 1989.
    [152] Wu S., Flach P., Scored and Weighted AUC Metrics for Classifier Evaluation and Selection [A]. In Proc. 2nd Workshop on ROC Analysis in Machine Learning (ROCML-05) [C]. Bonn, Germany: [s. n.], 2005.
    [153] Hand D. J., Till R. J., A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems [J]. Machine Learning, 2001, 45:171-186.
    [154] Chen Y., Wang J. Z., Support vector learning for fuzzy rule-based classification systems [J]. IEEE Transactions on Fuzzy Systems, 2003, vol. 11, no. 6, pp. 716-728.
    [155] Bergh F., Particle Swarm Weight Initialization in Multi-layer Perceptron Artificial Neural Networks. In Development and Practice of Artificial Intelligence Techniques [C], pages 41-45, Durban, South Africa, September 1999.
    [156] Bergh F. and Engelbrecht A. P., Cooperative Learning in Neural Networks using Particle Swarm Optimisers [J]. South African Computer Journal, 2000, (26):84-90.
    [157] Shi Y. and Eberhart R. C., Empirical Study of Particle Swarm Optimization. In Proceedings of the Congress on Evolutionary Computation [C], pages 1945-1949, Washington D.C., USA, July 1999. IEEE Service Center, Piscataway, NJ.
    [158] Shi Y. and Eberhart R. C., A Modified Particle Swarm Optimizer. In IEEE International Conference on Evolutionary Computation [C], Anchorage, Alaska, May 1998.
    [159] Eberhart R. C., Simpson P. and Dobbins R., Computational Intelligence PC Tools [M], Boston: Academic Press Professional, 1996.
    [160] Roubos H., Setnes M., Compact fuzzy models through complexity reduction and evolutionary optimization. In Proceedings of the 9th IEEE International Conference on Fuzzy Systems [C], San Antonio, USA, 2000, pp. 762-767.
    [161] Jin Y., von Seelen W., Sendhoff B., On generating flexible, complete, consistent and compact fuzzy rule systems from data using evolution strategies [J]. IEEE Transactions on Systems, Man, and Cybernetics, 1999, 29(4):829-845.
    [162] Bondugula R., Lee M. S. and Wallqvist A., FIEFDom: a transparent domain boundary recognition system using a fuzzy mean operator [J]. Nucleic Acids Res., 2009, 37(2):452-462.
