Research on Data Mining Methods and Algorithms Based on Rough Set Theory and Support Vector Machines
Abstract
This thesis first studies several key technical issues involved in applying Rough Set (RS) theory to data mining. It is well known that large knowledge bases often contain a great deal of redundant data; such redundancy not only wastes storage space but also interferes with making correct and concise decisions. The thesis studies knowledge reduction in information systems from five perspectives: the equivalence of knowledge (attribute) systems, the degree of attribute dependency, attribute significance, the discernibility matrix, and information theory. Reduction procedures are derived for each of these perspectives, and regularities are identified, for example that the information entropy of an information system is monotonically non-decreasing as the number of attributes grows.
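To make the information-theoretic perspective concrete, the following is a minimal sketch, not the thesis's own procedure: the toy decision table, the attribute names, and the greedy stopping rule are illustrative assumptions. It computes the entropy of the partition induced by a growing set of condition attributes (showing that it does not decrease) and then selects a heuristic reduct by minimising the conditional entropy H(d | B).

```python
# Minimal sketch of information-theoretic knowledge reduction on a toy
# decision table (illustrative assumptions throughout).
from collections import defaultdict
from math import log2

# Toy decision table: condition attributes a1..a3, decision attribute d.
U = [
    {"a1": 0, "a2": 0, "a3": 1, "d": 0},
    {"a1": 0, "a2": 1, "a3": 1, "d": 0},
    {"a1": 1, "a2": 0, "a3": 0, "d": 1},
    {"a1": 1, "a2": 1, "a3": 0, "d": 1},
    {"a1": 1, "a2": 1, "a3": 1, "d": 0},
]

def blocks(attrs):
    """Equivalence classes of the indiscernibility relation IND(attrs)."""
    part = defaultdict(list)
    for x in U:
        part[tuple(x[a] for a in attrs)].append(x)
    return list(part.values())

def entropy(attrs):
    """H(attrs): entropy of the partition U/IND(attrs)."""
    n = len(U)
    return -sum(len(b) / n * log2(len(b) / n) for b in blocks(attrs))

def cond_entropy(attrs):
    """H(d | attrs), accumulated block by block."""
    n = len(U)
    h = 0.0
    for b in blocks(attrs):
        for v in {x["d"] for x in b}:
            p = sum(1 for x in b if x["d"] == v) / len(b)
            h -= len(b) / n * p * log2(p)
    return h

# The partition entropy never decreases as condition attributes are added.
print(entropy(["a1"]), entropy(["a1", "a2"]), entropy(["a1", "a2", "a3"]))

# Greedy heuristic reduct: repeatedly add the attribute that lowers H(d | B)
# the most, until the conditional entropy of the full attribute set is reached.
full = cond_entropy(["a1", "a2", "a3"])
B = []
while cond_entropy(B) > full:
    B.append(min({"a1", "a2", "a3"} - set(B), key=lambda a: cond_entropy(B + [a])))
print("heuristic reduct:", B)
```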
In fact, when classical rough set theory performs classification, the boundaries between classes are strict. This raises the precision with which the knowledge attributes recognise and classify the objects under study, but the fault tolerance of this approach is poor, which severely limits the practical applicability of the model. To overcome this defect, the thesis next investigates variable precision rough set theory and its reduction problem.
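The following minimal sketch illustrates the variable precision idea; the partition, the concept X, and the threshold values are illustrative assumptions only. With the inclusion threshold beta relaxed below 1, an equivalence class containing a single exceptional object is still admitted to the lower approximation, which is exactly the tolerance that the classical model lacks.

```python
# Minimal sketch of Ziarko-style variable precision rough approximations on a
# toy universe (illustrative data only).

def vp_approximations(partition, X, beta):
    """beta-lower / beta-upper approximations of X (0.5 < beta <= 1.0)."""
    lower, upper = set(), set()
    for block in partition:
        inclusion = len(block & X) / len(block)   # relative degree of inclusion
        if inclusion >= beta:                     # "mostly inside"  -> lower approx.
            lower |= block
        if inclusion > 1.0 - beta:                # "not mostly outside" -> upper approx.
            upper |= block
    return lower, upper

# Equivalence classes induced by some condition attributes, and a concept X.
partition = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9}]
X = {1, 2, 3, 5, 6}                               # object 4 is the lone exception

# Classical rough set behaviour (beta = 1.0): the first class is rejected.
print(vp_approximations(partition, X, 1.0))
# Variable precision (beta = 0.75): the first class is tolerated despite object 4.
print(vp_approximations(partition, X, 0.75))
```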
Next, the modelling principles, scope of application, and solution methods of SVM pattern classification and SVM regression analysis are discussed. At the same time, it is observed that the very strength of SVMs in data mining is also a latent weakness: if a small sample set contains noise or contradictory information, predictions based on that set are strongly affected. Detecting such problems and preprocessing the data before SVM prediction and classification is precisely where rough set theory excels.
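As a concrete illustration of this sensitivity, the sketch below uses scikit-learn and a hand-made six-point training set (all values are illustrative assumptions, not material from the thesis): flipping a single training label can change the SVM's predictions on nearby test points.

```python
# Minimal sketch of how one noisy label in a small sample can shift SVM output.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.5, 0.4], [1.0, 0.2],     # class 0 cluster
              [3.0, 3.0], [3.5, 2.8], [4.0, 3.2]])    # class 1 cluster
y_clean = np.array([0, 0, 0, 1, 1, 1])
y_noisy = y_clean.copy()
y_noisy[2] = 1                                         # one contradictory label

X_test = np.array([[1.5, 1.0], [2.0, 1.5], [2.5, 2.0]])

for name, y in [("clean", y_clean), ("noisy", y_noisy)]:
    clf = SVC(kernel="linear", C=10.0).fit(X, y)       # fairly hard margin
    print(name, clf.predict(X_test))                   # compare the two runs
```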
Accordingly, building on the respective strengths of rough set theory and the SVM approach, the thesis analyses how the two can be combined organically. It derives a procedure that couples rough set theory with multi-class SVM learning machines, gives a method for constructing multi-class classifiers from rough sets and SVMs, and illustrates with examples how the various types of multi-class SVM classifiers are constructed.
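A compact sketch of such a combined pipeline appears below, under illustrative assumptions throughout: a toy decision table, a simplified dependency-based reduction standing in for the thesis's reduction procedure, and scikit-learn's SVC (which builds its multi-class decision from one-vs-one binary SVMs) standing in for the multi-class SVM construction.

```python
# Minimal sketch of a rough-set reduction step followed by a one-vs-one
# multi-class SVM (illustrative data and simplified reduction only).
import numpy as np
from sklearn.svm import SVC

# Toy decision table: columns a1, a2, a3 are condition attributes, y the class.
# a3 is deliberately redundant: the class is determined by a1 and a2 alone.
X = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
              [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 1, 1, 1, 1, 2, 2])

def dependency(cols):
    """gamma_B(D): share of objects whose description on `cols` fixes the class."""
    keys = [tuple(row) for row in X[:, cols]]
    return sum(len(set(y[[i for i, k in enumerate(keys) if k == key]])) == 1
               for key in keys) / len(y)

# Rough-set style reduction: drop attributes whose removal keeps gamma unchanged.
kept = [0, 1, 2]
for a in [2, 1, 0]:
    trial = [c for c in kept if c != a]
    if trial and dependency(trial) >= dependency(kept):
        kept = trial
print("retained attributes:", kept)          # -> [0, 1] on this table

# One-vs-one multi-class SVM trained on the reduced attribute set.
clf = SVC(kernel="rbf", C=10.0, gamma=1.0,
          decision_function_shape="ovo").fit(X[:, kept], y)
print("predictions:", clf.predict(X[:, kept]))
```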
