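Research on Data Mining Based on Rough Set Theory

English title: Research on Rough Sets Theory Based Data Mining
Author: Wang Shuqing
Thesis level: Master's
Discipline: Agricultural Mechanization Engineering
Keywords (translated from Chinese): Data Mining; Knowledge Discovery; Rough Sets; Reduction; Discernibility Matrix
English keywords: Data Mining; Knowledge Discovery in Databases (KDD); Rough Sets; Reduction; Discernibility Matrix
Degree year: 2004
Advisor: Jiang Wenke
Discipline code: 082801
Degree-granting institution: Hebei Agricultural University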

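Abstract

With the rapid development of information technologies such as computing, networking, and communications, the volume of information is growing faster than exponentially. Because of this sharp growth, the retrieval and query mechanisms of traditional databases and conventional statistical analysis methods fall far short of practical needs: much data becomes outdated before it can be analyzed, and much data is so voluminous that the relationships within it are hard to analyze. How to mine deep knowledge and information from large-scale data, rather than only the information on its surface, has become a research hotspot in many fields. Against this background, a new data processing technology, knowledge discovery, emerged.
     Knowledge discovery is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data sets. Data mining is the core step of the knowledge discovery process and is currently a very active research field.
     Rough set theory, proposed by the Polish mathematician Z. Pawlak in 1982, is a powerful mathematical tool for analyzing vague and uncertain knowledge. As a new research hotspot in artificial intelligence, it can effectively handle the representation of, and reasoning about, incomplete and uncertain knowledge. This makes rough set theory well suited to data mining. Data mining methods based on rough set theory have become one of the main families of data mining methods, and research on them has great theoretical and practical significance.
     This thesis introduces the relevant theory of rough sets and data mining. After an in-depth study of several shortcomings of classical rough set theory, we propose an extended rough set model, namely a rough set model with membership degrees and weights. Within this model we define an information system with membership degrees and weights and study noise handling, the partition of the approximation space, the computation of the degree of dependency of the decision attributes on the condition attributes, attribute reduction, and the construction of the steps for mining association rules. A worked example verifies that the model is feasible. The extended model overcomes several defects of classical rough sets: overly strict classification, excessive sensitivity to noise, and the loss of some rules hidden in the boundary region. It fully inherits the properties of rough sets and retains all of their advantages. The model provides a method, common in mathematical statistics, for classifying as many objects as possible under a given error rate. It is expected to find good application in information system analysis, artificial intelligence and its applications, decision support systems, knowledge discovery, pattern recognition, classification, and fault diagnosis.
     Future work is to develop practical software systems based on this rough set model and to study the model further in theory.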
With the rapid development of information technologies such as computing, networking, and communications, the volume of information is growing faster than exponentially. As information increases sharply, the search and query mechanisms of traditional databases and the methods of statistical analysis can no longer meet practical demand. Much data is outdated before it can be analyzed, and the relationships within very large data sets are difficult to analyze. How to mine not only surface-level but also embedded knowledge and information from large amounts of data has become a research hotspot in many fields. Against this background, a new data processing technology, Knowledge Discovery in Databases (KDD), has emerged.
    Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in databases. Data mining is the core step of the knowledge discovery process and is currently a very active research field.
    Rough set theory, presented in 1982 by the Polish mathematician Z. Pawlak, is a powerful mathematical tool for analyzing uncertain and vague knowledge. As a new hotspot in the field of artificial intelligence, rough sets can effectively deal with the representation of, and reasoning about, incomplete and uncertain knowledge. This feature makes rough set theory especially suitable for data mining. Data mining based on rough sets has become one of the main data mining approaches, and its study has great theoretical and practical significance.
    This dissertation introduces the relevant theory of rough sets and data mining. After studying the deficiencies of classical rough set theory in depth, we present an extended rough set model, namely the rough set model with membership degrees and weights. In this model we define the information system with membership degrees and weights and investigate noise handling, the partition of the approximation space, the calculation of the degree of dependency of the decision attributes on the condition attributes, attribute reduction, and the construction of the steps for mining association rules. An example validates that the model is feasible. The extended model overcomes several deficiencies of classical rough sets: classification that is too strict, excessive sensitivity to noise, and the loss of some rules hidden in the boundary region. The model fully inherits the properties of rough sets and retains all of their strengths. It provides a method, commonly used in statistics, for classifying as many objects as possible under a given error rate. It is expected to be applied in areas such as information system analysis, artificial intelligence and its applications, decision support systems, knowledge discovery in databases, pattern recognition, classification, and fault diagnosis.
    Future work is to develop a practical software system based on this rough set model and to study the model further in theory.
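
The classical notions that the abstract builds on (the partition of the approximation space, lower and upper approximations, and the degree of dependency of decision attributes on condition attributes) can be sketched as follows. This is only a minimal illustration of Pawlak's classical definitions, not the extended model with membership degrees and weights, whose details the abstract does not give. The toy decision table and all function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy decision table (not from the thesis):
# condition attributes "headache" and "temp", decision attribute "flu".
TABLE = {
    "x1": {"headache": "yes", "temp": "high",   "flu": "yes"},
    "x2": {"headache": "yes", "temp": "high",   "flu": "yes"},
    "x3": {"headache": "no",  "temp": "high",   "flu": "yes"},
    "x4": {"headache": "no",  "temp": "normal", "flu": "no"},
    "x5": {"headache": "yes", "temp": "normal", "flu": "no"},
    "x6": {"headache": "no",  "temp": "high",   "flu": "no"},
}


def partition(table, attrs):
    """Group objects that are indiscernible on the given attributes."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())


def lower_approx(table, attrs, target):
    """Union of the blocks that lie entirely inside the target set."""
    return set().union(*(b for b in partition(table, attrs) if b <= target))


def upper_approx(table, attrs, target):
    """Union of the blocks that share at least one object with the target."""
    return set().union(*(b for b in partition(table, attrs) if b & target))


def dependency(table, cond, dec):
    """Classical dependency degree: size of the positive region over |U|."""
    positive = set()
    for decision_class in partition(table, dec):
        positive |= lower_approx(table, cond, decision_class)
    return Fraction(len(positive), len(table))


if __name__ == "__main__":
    flu_yes = {obj for obj, row in TABLE.items() if row["flu"] == "yes"}
    cond = ["headache", "temp"]
    print(sorted(lower_approx(TABLE, cond, flu_yes)))
    print(sorted(upper_approx(TABLE, cond, flu_yes)))
    print(dependency(TABLE, cond, ["flu"]))
    print(dependency(TABLE, ["temp"], ["flu"]))
```

On this toy table the lower approximation of the flu = yes class is {x1, x2}, the upper approximation is {x1, x2, x3, x6}, and the dependency degree is 2/3. Dropping headache lowers the dependency degree to 1/3, so headache cannot be removed in a classical reduct. Objects x3 and x6 fall in the boundary region, where the classical model yields no certain rule; this is the kind of strictness the abstract says the extended model is meant to relax.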

