Application of Multiple Classifier Systems to Protein Function Prediction
Abstract
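As an important branch of data mining, classification has a wide range of applications. After many years of research and development, many classic classification methods have become familiar to researchers, such as k-nearest neighbors, Bayesian methods, decision trees, support vector machines, and neural networks. These traditional methods have certain limitations, which led researchers to propose multiple classifier systems; at the same time, research on multiple classifier systems itself faces some important open problems.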
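     Protein function prediction is one of the main challenges of the post-genomic era, and many machine learning algorithms have gradually been developed for it. G-protein coupled receptors (GPCRs) are a very important class of signaling-molecule receptors, named for their ability to bind G proteins and regulate their activity. Because of their structural features and their important role in signal transduction, GPCRs can serve as drug targets: 20% of current best-selling drugs are GPCR-related, and roughly one third of the small-molecule drugs on the world market are GPCR agonists or antagonists. In addition, GPCR dysfunction leads to a variety of diseases. Research on GPCR function data therefore has very important practical value.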
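     Using data mining techniques and building on earlier theoretical and practical results, this paper proposes improvements and strategies for the main research problems in implementing a multiple classifier system, implements the system in code on the Weka data mining platform, and applies it to GPCR function data. Experimental results show that the system's classification performance improves to some extent.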
With the rapid development of information technology, data mining techniques have emerged to extract important hidden information from large amounts of stored data.
     In data mining, classification plays an important role as a data analysis technique: it analyzes the input data by training on a data set with known characteristics, finds an accurate description or model, and then predicts the class of unknown samples. The classification problem has been studied extensively in artificial intelligence, machine learning, pattern recognition, and other fields, and there are a number of traditional classification algorithms. However, these algorithms, which train on a data set of known classes to obtain a single classifier, are weak in scalability and efficiency, and they have great difficulty handling classification tasks on large, complex data. The multiple classifier system was therefore proposed. It uses a combination of member classifiers, related test information, and an ensemble method to obtain a combined prediction, thereby improving classification accuracy and reliability. How to obtain more useful information from the different members of the ensemble to improve classification performance has become an important research question in data mining.
     Classification predicts the class label to which a sample belongs. In the sample set, each sample belongs to one of a set of discrete, unordered classes. A classification algorithm trains on the data set, analyzes it, and builds a classification model; in the next phase, this model is used to classify samples of unknown class. This paper describes the traditional classification techniques, including commonly used classifier models such as k-nearest neighbors, decision trees, support vector machines, Bayesian methods, and neural networks, and then introduces methods for evaluating classifier performance, such as the hold-out and cross-validation methods.
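     To make the cross-validation procedure mentioned above concrete, the following is a minimal sketch in plain Python. It is not the Weka-based implementation used in this work; the function names (`k_fold_indices`, `cross_validate`) and the toy majority-class learner are hypothetical and serve only as an illustration. It assumes that k does not exceed the number of samples.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split indices 0..n_samples-1 into k disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, labels, k, train_fn, predict_fn, seed=0):
    """Return the mean test accuracy over k train/test rounds."""
    folds = k_fold_indices(len(samples), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Hypothetical toy learner: always predicts the most common training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model
```

     The hold-out method corresponds to a single round of this loop: one fold is kept for testing and the rest is used for training.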
     For a multiple classifier system to perform well, a necessary and sufficient condition is that the base classifiers be accurate and diverse. In other words, a multiple classifier system needs to address the following issues for its base classifiers: their generation strategy, selection, fusion method, and assessment. The "overproduce and choose" strategy is adopted. Base classifiers can be generated by manipulating the data set, the classes, or the attributes, by changing the structure of the classification model, or by modifying the classification algorithm.
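     The abstract does not say which diversity measure is used. As one illustration of how "accurate and diverse" can be quantified, the sketch below computes the accuracy of each classifier and the pairwise disagreement measure, a commonly used diversity measure defined on whether each classifier is right or wrong on each sample. The function names and the toy predictions are assumptions made for this example.

```python
from itertools import combinations

def accuracy(predictions, labels):
    """Fraction of samples a classifier labels correctly."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def oracle_outputs(predictions, labels):
    """True where the classifier is correct, False where it is wrong."""
    return [p == y for p, y in zip(predictions, labels)]

def disagreement(correct_a, correct_b):
    """Fraction of samples on which exactly one of two classifiers is correct."""
    return sum(a != b for a, b in zip(correct_a, correct_b)) / len(correct_a)

def mean_pairwise_disagreement(all_predictions, labels):
    """Average the disagreement measure over all classifier pairs."""
    oracles = [oracle_outputs(p, labels) for p in all_predictions]
    pairs = list(combinations(oracles, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)

# Toy data (hypothetical): true labels and three classifiers' predictions.
labels = [0, 1, 1, 0]
preds = [
    [0, 1, 1, 0],  # correct on every sample
    [0, 1, 0, 1],  # correct on the first two samples only
    [1, 0, 1, 0],  # correct on the last two samples only
]
diversity = mean_pairwise_disagreement(preds, labels)
```

     A value near 0 means the classifiers make their errors on the same samples, so combining them adds little; larger values mean their errors fall on different samples.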
     The author studied the structure of multiple classifier systems and the integration strategies at each level, investigated diversity evaluation, and summarized combination methods. A classifier generation strategy is then proposed that trains on data sets from different sources; it manipulates the data set to extract the most representative samples. Because it considers both classification performance and the choice of a representative data set, it can generate candidate classifiers with better performance. From these candidate classifiers, an optimal subset must then be selected. With the overall performance of the system in mind, the selection method is based on both diversity and accuracy: it takes into account not only the diversity considered by conventional approaches, but also the accuracy of each classifier and the performance of the ensemble, which helps improve overall classification accuracy. In the final phase, the outputs of the member classifiers are combined using the maximum principle to determine the final output.
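     The abstract does not spell out the "maximum principle". Assuming it refers to the standard max rule for combining classifier scores, a minimal sketch looks as follows: each member classifier gives a score per class, the support for a class is the highest score any member assigns to it, and the class with the largest support is the final output. The function name, class names, and scores are hypothetical.

```python
def max_rule(score_rows):
    """Combine per-class scores from several classifiers with the max rule.

    score_rows: one dict per classifier, mapping class label -> score.
    Returns (predicted_label, support), where support maps each class to
    the highest score that any classifier assigned to it.
    """
    classes = list(score_rows[0].keys())
    support = {c: max(row[c] for row in score_rows) for c in classes}
    # On a tie, the class listed first in the first classifier's dict wins.
    predicted = max(classes, key=lambda c: support[c])
    return predicted, support

# Toy scores (hypothetical) from three member classifiers for classes A and B.
scores = [
    {"A": 0.60, "B": 0.40},
    {"A": 0.30, "B": 0.70},
    {"A": 0.55, "B": 0.45},
]
label, support = max_rule(scores)
```

     In this toy case two of the three members prefer class A, so a majority vote would output A, while the max rule outputs B because the single most confident score (0.70) is for B. The max rule therefore favors confident members over the count of votes.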
     On protein function prediction, this paper introduces the commonly used protein databases and divides protein function prediction methods into three categories from a machine learning perspective: supervised, semi-supervised, and unsupervised methods.
     In this paper, multiple classifier systems show good results in both theoretical and technical respects. However, many problems need deeper study. Examples include the topology of a multiple classifier system and its integrated decision making; the selection of an optimal subset from the candidate classifier set, which must account for independence among classifiers, diversity, locality, and other conditions; and how to fuse the outputs of the member classifiers to get better classification performance, which involves building a fusion system. The various factors that affect the classification system should therefore all be considered. When selecting member classifiers, attention should be paid to their mutual independence, in particular to whether a sounder theoretical analysis can give a better measure of the correlation among member classifiers, and to the joint consideration of problems in classifier generation and combination. In addition, the optimization of system design, a research priority, has produced some meaningful results, but it is still not possible to dynamically choose the best multiple classifier system architecture for a given classification task; this remains an unresolved issue.
     Moreover, research on multiple classifier systems has largely stayed within this conventional pattern; other approaches to improvement may be worth exploring.