Research on Data Mining Applications Based on Support Vector Machines
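Abstract
Data mining is the process of rapidly extracting novel and valid knowledge from large, complex data sets. The support vector machine (SVM) is a relatively new data mining technique that uses optimization methods to solve machine learning problems. It is a general-purpose learning machine developed on the basis of statistical learning theory, and its advantages include a globally optimal solution, a simple structure, and strong generalization ability.
     The conventional SVM is a supervised learning algorithm: it requires the class labels of the training samples to be known. In practice, however, often only a small number of labeled samples are available while most samples are unlabeled, and the conventional SVM cannot handle such problems. To address this, T. Joachims proposed a transductive learning method, the Transductive Support Vector Machine (TSVM). Chen Yisong et al. improved on TSVM and proposed the Progressive Transductive Support Vector Machine (PTSVM). This paper improves PTSVM further and proposes the Separation Degree Support Vector Machine (SDSVM), an SVM based on a separation-degree measure. The algorithm adopts the sample separation degree from the Fisher criterion as its metric and uses the Fisher criterion function as its evaluation function, aiming to find, by the end of training, a separating plane for which samples of the same class are as tightly clustered as possible and samples of different classes are as far apart as possible. This lowers the time complexity of training and improves test accuracy.
     A basic SVM handles only binary classification and cannot be applied directly to multiclass problems. Since most real-world data are multiclass, the basic SVM must be extended to handle multiclass classification. This paper reviews several multiclass SVM algorithms, including one-against-rest, one-against-one, the directed acyclic graph SVM (DAGSVM), and decision-tree-based SVM, and compares their respective strengths and weaknesses. After analyzing the shortcomings of SDSVM, the paper improves it further and combines it successfully with the multiclass SVM algorithms. Experimental results show that SDSVM performs well on semi-supervised multiclass classification problems.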
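For orientation, the optimization problem behind the standard soft-margin SVM can be written as below. This is the textbook formulation, not something stated in this abstract; the symbols $w$, $b$, $\xi_i$, $C$ and the training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ are notation assumed here for illustration.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.}\quad & y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned}
```

The objective is a convex quadratic and the constraints are linear, so any local minimum is also a global one, which is the usual sense in which an SVM solution is called globally optimal.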
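As a rough illustration of the Fisher criterion referred to above, the sketch below scores a candidate projection direction by the ratio of between-class separation to within-class scatter. It is not the SDSVM algorithm, whose details the abstract does not give; the function names (`project`, `fisher_criterion`) and the toy data are hypothetical and chosen only for this example.

```python
def project(w, x):
    """Project sample x onto direction w (dot product)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fisher_criterion(w, class_a, class_b):
    """Two-class Fisher criterion J(w) = (m_a - m_b)^2 / (s_a + s_b).

    m_a, m_b are the means of the projected samples of each class and
    s_a, s_b are the within-class scatters (sums of squared deviations)
    of the projected samples. A larger J means tighter classes that lie
    farther apart along w.
    """
    proj_a = [project(w, x) for x in class_a]
    proj_b = [project(w, x) for x in class_b]
    mean_a = sum(proj_a) / len(proj_a)
    mean_b = sum(proj_b) / len(proj_b)
    scatter_a = sum((p - mean_a) ** 2 for p in proj_a)
    scatter_b = sum((p - mean_b) ** 2 for p in proj_b)
    within = scatter_a + scatter_b
    if within == 0:
        # Every projected sample sits exactly on its class mean.
        return float("inf")
    return (mean_a - mean_b) ** 2 / within


# Hypothetical toy data: two square clusters separated along the first axis.
class_a = [(1, 1), (2, 1), (1, 2), (2, 2)]
class_b = [(5, 1), (6, 1), (5, 2), (6, 2)]

print(fisher_criterion((1, 0), class_a, class_b))  # 8.0: classes well separated along x
print(fisher_criterion((0, 1), class_a, class_b))  # 0.0: classes overlap along y
```

A direction with a high score keeps same-class samples close and different-class samples far apart, which is the property the abstract says the separating plane should have at the end of training.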
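The one-against-rest and one-against-one decompositions named above can be sketched as follows. To keep the example self-contained, the binary learner is a nearest-centroid rule used purely as a stand-in for a binary SVM; it is not an SVM and not the method of this paper. All function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def train_binary(pos_points, neg_points):
    """Stand-in for a binary SVM: returns f with f(x) > 0 for the positive class."""
    c_pos, c_neg = centroid(pos_points), centroid(neg_points)
    return lambda x: sq_dist(x, c_neg) - sq_dist(x, c_pos)


def train_one_vs_rest(data):
    """One classifier per class: that class against all remaining samples."""
    models = {}
    for label, points in data.items():
        rest = [p for other, pts in data.items() if other != label for p in pts]
        models[label] = train_binary(points, rest)
    return models


def predict_one_vs_rest(models, x):
    """Pick the class whose classifier gives the largest decision value."""
    return max(models, key=lambda label: models[label](x))


def train_one_vs_one(data):
    """One classifier per unordered pair of classes: k(k-1)/2 in total."""
    return {(a, b): train_binary(data[a], data[b])
            for a, b in combinations(sorted(data), 2)}


def predict_one_vs_one(models, x):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter()
    for (a, b), decide in models.items():
        votes[a if decide(x) > 0 else b] += 1
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three well-separated classes in the plane.
data = {
    "a": [(0, 0), (1, 0), (0, 1)],
    "b": [(5, 5), (6, 5), (5, 6)],
    "c": [(0, 5), (0, 6), (1, 5)],
}
ovr = train_one_vs_rest(data)
ovo = train_one_vs_one(data)
for point in [(0.5, 0.5), (5.5, 5.5), (0.5, 5.5)]:
    print(point, predict_one_vs_rest(ovr, point), predict_one_vs_one(ovo, point))
```

With k classes, one-against-rest trains k classifiers on the full data set, while one-against-one trains k(k-1)/2 classifiers on smaller two-class subsets. That size-versus-count trade-off is one of the differences usually weighed when comparing such schemes.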
Data mining uncovers underlying rules and extracts novel, useful knowledge from large volumes of data. The support vector machine (SVM) is a relatively new data mining technique and a tool that uses optimization methods to solve machine learning problems. It is a general-purpose learning machine based on statistical learning theory, with the advantages of global optimality, simple structure, and good generalization ability.
     The traditional SVM is a supervised learning algorithm that requires the labels of the training samples to be known. When SVM is applied to practical problems, however, only a few labeled samples are usually available and most samples are unlabeled, so the traditional SVM cannot cope. To solve this problem, T. Joachims proposed the Transductive Support Vector Machine (TSVM). Chen Yi-song et al. improved TSVM and proposed the Progressive Transductive Support Vector Machine (PTSVM). This paper improves PTSVM further and proposes SDSVM (Separation Degree Support Vector Machine), a semi-supervised classification algorithm that combines the separation degree with the support vector machine. It uses the separation degree from the Fisher criterion as its metric and the Fisher criterion function as its evaluation function. The goal is for training to end with a separating plane that keeps samples with the same label close together and samples with different labels far apart, thereby improving classification accuracy. The algorithm also reduces the number of training iterations and the time complexity.
     The traditional SVM handles only binary classification and cannot deal with multiclass problems directly. Most real-world data are multiclass, so the traditional SVM must be extended to handle multiclass problems. This paper introduces several multiclass SVM algorithms, namely one-against-rest, one-against-one, DAGSVM, and decision-tree-based SVM, and compares their performance. After analyzing the shortcomings of SDSVM, we improve it further and succeed in combining it with multiclass SVM. The results show that SDSVM outperforms PTSVM on semi-supervised classification problems.
References
1. Han J, Kamber M, et al. Data Mining: Concepts and Techniques [M]. Fan Ming, Meng Xiaofeng, et al., trans. Beijing: China Machine Press, 2001 (in Chinese)
    2. Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines [M]. Cambridge, UK: Cambridge University Press, 2000
    3. Vapnik V, Chervonenkis A Y. On the uniform convergence of relative frequencies of events to their probabilities [J]. Theory of Probability and Its Applications, 1971, 16(2): 264-280
    4. Boser B, Guyon I, Vapnik V. A Training Algorithm for Optimal Margin Classifiers [C]. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 1992: 144-152
    5. Cortes C, Vapnik V. The soft margin classifier. AT&T Bell Labs, 1993
    6. Vapnik V. The Nature of Statistical Learning Theory [M]. New York: Springer-Verlag, 1995
    7. Liu Tongming. Data Mining Technology and Its Applications [M]. Beijing: National Defense Industry Press, 2001 (in Chinese)
    8. Hu Kan, Xia Shaowei. Data mining based on large data warehouses [J]. Journal of Software, 1998, 9(1): 53-61 (in Chinese)
    9. Han J, Kamber M. Data Mining: Concepts and Techniques [M]. San Francisco: Morgan Kaufmann Publishers, 2001
    10. Chen An, Chen Ning, et al. Data Mining Technology and Applications [M]. Beijing: Science Press, 2006: 31-33, 154-158 (in Chinese)
    11. Dunham M H. Data Mining Tutorial [M]. Guo Chonghui, Tian Fengzhan, et al., trans. Beijing: Tsinghua University Press, 2005: 8-15 (in Chinese)
    12. Chen M, Han J, Yu P. Data Mining: An Overview from a Database Perspective [J]. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 866-883
    13. Breslow L A, Aha D W. Simplifying Decision Trees: A Survey [J]. Knowledge Engineering Review, 1997, 12(1): 1-40
    14. Quinlan J R. Simplifying Decision Trees [J]. International Journal of Man-Machine Studies, 1987, 27(3): 221-234
    15. Langley P, Iba W, Thompson K. An analysis of Bayesian classifiers [C]. Proceedings of the Tenth National Conference on Artificial Intelligence, 1992: 223-228
    16. Heckerman D. Bayesian Networks for Data Mining [J]. Data Mining and Knowledge Discovery, 1997, 1(1): 79-119
    17. Cai Zixing. Artificial Intelligence and Its Applications [M]. Beijing: Tsinghua University Press, 2004 (in Chinese)
    18. Su Xiaohong, Yang Bo, Wang Yadong. A genetic algorithm based on an evolutionarily stable strategy [J]. Journal of Software, 2003, 14(11): 1863-1868 (in Chinese)
    19. Tan P N, Steinbach M, Kumar S V. Introduction to Data Mining [M]. Fan Ming, Fan Hongjian, et al., trans. Beijing: Posts & Telecom Press, 2006: 114-116 (in Chinese)
    20. Efron B, Tibshirani R. Cross-validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule [R]. Technical report, Stanford University, 1995
    21. Han J, Kamber M. Data Mining: Concepts and Techniques [M]. Beijing: Higher Education Press, 2001
    22. Vapnik V. Statistical Learning Theory [M]. John Wiley and Sons, 1998
    23. Zhang Xuegong. On statistical learning theory and support vector machines [J]. Acta Automatica Sinica, 2000, 26(1): 32-42 (in Chinese)
    24. Vapnik V, Levin E, Le Cun Y. Measuring the VC-dimension of a Learning Machine [J]. Neural Computation, 1994, 6: 851-876
    25. Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods [M]. Cambridge University Press, 2000
    26. Burges C J C. A Tutorial on Support Vector Machines for Pattern Recognition [J]. Data Mining and Knowledge Discovery, 1998, 2(2): 121-167
    27. Platt J. Probabilities for Support Vector Machines [M]. In: Advances in Large Margin Classifiers. Cambridge, MA: MIT Press, 2000: 61-74
    28. Mangasarian O L. Data Mining via Support Vector Machines [R]. Technical Report, Data Mining Institute, 2001
    29. Xu Jianhua, Zhang Xuegong. A nonlinear perceptron algorithm based on kernel functions [J]. Chinese Journal of Computers, 2002, 25(7): 689-693 (in Chinese)
    30. Amari S, Wu S. Improving Support Vector Machine Classifiers by Modifying Kernel Functions [J]. Neural Networks, 1999, 12: 783-789
    31. Weston J, Watkins C. Multi-class Support Vector Machines [R]. Technical Report, Royal Holloway, University of London, 1998
    32. Kearns M J, Solla S A, Cohn D A. Support Vector Machines Applied to Face Recognition [C]. Advances in Neural Information Processing Systems. MIT Press, 1999
    33. Burges C J C, Schölkopf B. Improving the Accuracy and Speed of Support Vector Learning Machines [C]. Advances in Neural Information Processing Systems. MIT Press, 1997
    34. Osuna E, Freund R, Girosi F. Training Support Vector Machines: an Application to Face Detection [C]. Proc. Computer Vision and Pattern Recognition, San Juan, 1997: 130-136
    35. Vapnik V. Estimation of Dependences Based on Empirical Data [M]. Berlin: Springer-Verlag, 1982
    36. Platt J. Fast Training of Support Vector Machines using Sequential Minimal Optimization [M]. In: Schölkopf B, Burges C J C, Smola A J, eds. Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999: 185-208
    37. Zhang Zhao, Huang Guoxing, et al. An improved SMO algorithm [J]. Computer Science, 2003, 30(8): 128-133 (in Chinese)
    38. Luo Shiguang, Yang Xiaowei, et al. An improved sequential minimal optimization algorithm [J]. Computer Science, 2006, 33(11): 146-148 (in Chinese)
    39. Joachims T. Transductive inference for text classification using support vector machines [C]. Proceedings of the 16th International Conference on Machine Learning. San Francisco: Morgan Kaufmann Publishers, 1999: 200-209
    40. Altun Y, McAllester D, Belkin M. Maximum Margin Semi-Supervised Learning for Structured Variables [C]. Advances in Neural Information Processing Systems, 2005
    41. Chapelle O, Zien A. Semi-Supervised Classification by Low Density Separation [C]. Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005
    42. Chen Yisong, Wang Guoping, Dong Shihai. A progressive transductive classification learning algorithm based on support vector machines [J]. Journal of Software, 2003, 14(3): 451-460 (in Chinese)
    43. Bian Zhaoqi, Zhang Xuegong, et al. Pattern Recognition (2nd ed.) [M]. Tsinghua University Press, 2000: 87-90 (in Chinese)
    44. Joachims T. SVMlight [EB/OL]. http://svmlight.joachims.org/, 2004
    45. Mayoraz E, Alpaydin E. Support vector machines for multi-class classification [C]. Proceedings of International Workshop on Artificial Neural Networks, 1999, 2: 838-842
    46. Platt J C, Cristianini N, Shawe-Taylor J. Large margin DAGs for multiclass classification [C]. Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2000, 23: 547-553
    47. Hsu C W, Lin C J. A simple decomposition method for support vector machines [J]. Machine Learning, 2002, 46: 291-314
