Research and Application of the Fusion of the AP and SVM Algorithms

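English title: Based on AP and SVM Algorithm Fusion Research and Application
Author: Zhong Yi (钟毅)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords (translated): support vector machine ; AP clustering algorithm ; preference parameter ; AP-SVM classifier ; prediction
English keywords: Support Vector Machines ; AP clustering algorithm ; the preference ; AP-SVM classifier ; prediction
Degree year: 2012
Advisor: Zhou Chunguang (周春光)
Discipline code: 081203
Degree-granting institution: Jilin University
Thesis submission date: 2012-05-01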

Abstract

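The support vector machine (SVM) emerged in the 1990s and developed rapidly over the following decades. It is now very widely used in machine learning and data mining and has become an indispensable standard tool. However, its results depend directly on the selected training samples, so it needs a large number of high-quality labeled samples, which to some extent limits the application of SVM. To address this problem, this thesis proposes a new AP-SVM classifier that fuses the AP clustering algorithm with the SVM classifier. The PSOP-AP clustering algorithm is used to optimize the data set and obtain a high-quality, small-sample training set for the SVM classifier, which addresses the classification accuracy problem of the SVM classifiers proposed to date. Experimental results show that the AP-SVM classifier achieves higher classification accuracy than the traditional SVM classifier. In particular, the proposed AP-SVM classifier performs very well on the heart disease prediction problem. This provides a new theoretical basis for medical disease research.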
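The idea summarized above, reducing a labeled pool to a small set of representative points with AP clustering and then training an SVM on those points, can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the thesis's own implementation: the data are synthetic, plain AP with its default preference stands in for PSOP-AP, using only the AP exemplars as the "representative" points is an assumed selection rule, and `ap_select_training_set` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def ap_select_training_set(X, y, damping=0.9, random_state=0):
    """Return the AP exemplars of X and their labels as a reduced training set.

    Assumption: the exemplars alone serve as the representative samples.
    """
    ap = AffinityPropagation(damping=damping, max_iter=1000,
                             random_state=random_state).fit(X)
    idx = ap.cluster_centers_indices_      # indices of exemplar points in X
    return X[idx], y[idx]


# Synthetic two-class data (illustrative only, not the thesis data sets).
X, y = make_blobs(n_samples=300, centers=[[0, 0], [4, 4]],
                  cluster_std=1.0, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Step 1: shrink the labeled pool to its AP exemplars.
X_small, y_small = ap_select_training_set(X_pool, y_pool)

# Step 2: train an SVM on the reduced set and score it on held-out data.
clf = SVC(kernel="rbf", gamma="scale").fit(X_small, y_small)
accuracy = clf.score(X_test, y_test)

print(f"{len(X_small)} of {len(X_pool)} pool points used for training")
print(f"held-out accuracy: {accuracy:.3f}")
```

The sketch only shows the mechanics of the fusion. Whether the reduced training set matches or beats a randomly drawn one depends on the data and on how the preference is tuned.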
In 1992, the support vector machine (SVM) was introduced to the machine learning community at the Conference on Computational Learning Theory and attracted wide attention. After thorough and comprehensive development in the late 1990s, it has become a standard tool in machine learning and data mining. SVM has achieved good results in handwritten numeral recognition, face recognition, function regression, and density estimation. However, when learning from large-scale training sets, SVM suffers from slow learning speed, high storage requirements, and other issues, so learning speed has become one of the bottlenecks to its wider use.
     Cluster analysis, also known as group analysis, is the process of dividing a collection of physical or abstract objects into a number of groups (clusters), each composed of similar objects. Objects within the same cluster are highly similar to one another and dissimilar to objects in other clusters. Although cluster analysis is a branch of taxonomy, clustering and classification are not the same. In a classification problem the class attribute of the data set must be known in advance, whereas clustering has to discover the class attribute from the data itself and does not require the number of classes to be specified beforehand. Cluster analysis methods include the AP clustering algorithm, fuzzy clustering, system (hierarchical) clustering, dynamic clustering, graph-theoretic clustering, sequential sample clustering, and clustering-based forecasting, among others.
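To make the point about not fixing the number of clusters in advance concrete, the following is a minimal sketch of AP clustering using scikit-learn's `AffinityPropagation`. The synthetic data and the `damping` and `max_iter` values are assumptions chosen for illustration. No cluster count is passed in: the number of clusters follows from the data and from the preference parameter, which is left at the library default here.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (illustrative only).
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.5, random_state=0)

# No number of clusters is specified; AP selects exemplars by message passing.
ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=0).fit(X)

exemplars = ap.cluster_centers_indices_   # indices of the exemplar points
labels = ap.labels_                       # cluster index assigned to each point

print("number of clusters found:", len(exemplars))
print("exemplar coordinates:")
print(X[exemplars])
```

Each cluster is represented by an actual data point (its exemplar), which is what makes AP a natural source of representative samples.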
     Although the classification accuracy of the traditional SVM is usually high, its results depend directly on the selected training samples. It therefore requires a large number of high-quality labeled samples, which to some extent limits the application of SVM. In the traditional SVM the training set is usually selected at random, which makes it difficult to choose representative data as training samples and makes the classification results unstable. Manual selection, on the other hand, costs a great deal of manpower and time and lowers overall classification efficiency. A clustering algorithm needs no training set; it only needs a data set, in which it finds regularities and forms clusters automatically. Clustering algorithms are fast, but their accuracy is usually not high.
     In summary, the advantages of both algorithms should be combined to achieve the desired results. A clustering algorithm is first used to cluster the large data set, and representative data points are selected from each cluster as the training set of the SVM classifier, which improves the accuracy of the SVM classifier. The main work of this thesis is as follows:
     First, works from several related fields were studied and well-known lectures on biochemistry and molecular biology were attended. A large body of literature was read to understand cluster analysis techniques and SVM theory, and the code was written in MATLAB.
     Second, this thesis proposes a novel PSOP algorithm based on Particle Swarm Optimization (PSO), which uses the In-Group Proportion (IGP) index as the fitness function to search for the optimal preference of the Affinity Propagation (AP) algorithm. This approach preserves AP's characteristic of not requiring a pre-specified number of clusters, and can obtain the best clustering results.
     Then, building on the study of fusing the AP and SVM algorithms, this thesis proposes a new AP-SVM classifier. Experiments demonstrate the feasibility of the proposed method for both linearly separable and nonseparable classes. Compared with traditional SVM classification, the AP-SVM classifier achieves higher accuracy while avoiding the tedious and difficult task of manually selecting training samples, which saves a great deal of time and manpower.
     Last, the AP-SVM classifier is applied to heart disease prediction in medical research, where it achieves good results and has a certain practical significance.
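The preference search described in the second contribution can be sketched as a one-dimensional particle swarm over the AP preference, scored by an in-group proportion. The details below are assumptions, not the thesis's PSOP: IGP is taken per cluster as the fraction of points whose nearest neighbor lies in the same cluster and is averaged over clusters, single-cluster results are scored 0, the search interval runs from the minimum to the median similarity, and the PSO coefficients (`w`, `c1`, `c2`), swarm size, and function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances


def in_group_proportion(X, labels):
    """Mean over clusters of the share of points whose nearest neighbor
    belongs to the same cluster (assumed aggregation)."""
    d = pairwise_distances(X)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    same = labels == labels[d.argmin(axis=1)]
    return float(np.mean([same[labels == k].mean() for k in np.unique(labels)]))


def fitness(X, preference):
    """Cluster with AP at the given preference and score the result by IGP."""
    ap = AffinityPropagation(preference=preference, damping=0.9,
                             max_iter=500, random_state=0).fit(X)
    labels = ap.labels_
    if (labels < 0).any() or len(np.unique(labels)) < 2:
        return 0.0                              # failed or trivial clustering
    return in_group_proportion(X, labels)


def pso_preference(X, n_particles=5, n_iter=5, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search the AP preference with a basic global-best PSO."""
    rng = np.random.default_rng(seed)
    s = -(pairwise_distances(X) ** 2)           # AP's default similarity
    lo, hi = s.min(), np.median(s)              # assumed search interval
    pos = rng.uniform(lo, hi, n_particles)
    vel = np.zeros(n_particles)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(X, p) for p in pos])
    gbest = pbest[pbest_fit.argmax()]
    for _ in range(n_iter):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)        # keep particles in the interval
        fit = np.array([fitness(X, p) for p in pos])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[pbest_fit.argmax()]
    return float(gbest), float(pbest_fit.max())


# Synthetic data (illustrative only).
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.5, random_state=0)
best_pref, best_fit = pso_preference(X)
print("best preference:", best_pref, "IGP fitness:", best_fit)
```

Because the swarm searches over the preference rather than over a cluster count, the number of clusters is still decided by AP itself, which is the property the thesis says PSOP preserves.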

