Research on KDD Techniques Based on Rough Set Theory
Abstract
Although research on Knowledge Discovery in Databases (KDD) has already produced rich results, further work on KDD techniques remains of real practical significance. Many theories and tools have been successfully applied to specific problems in the KDD process, and rough set theory is one of the most promising among them. Because intelligent data analysis based on the rough set model can proceed without external parameters or prior knowledge, rough set theory holds advantages that other tools cannot match, and KDD techniques built on it promise more satisfactory solutions for the KDD process.
Data discretization is one of the most important tasks in the data-integration stage of KDD. Effective discretization markedly improves a system's ability to cluster samples and strengthens its robustness to data noise, and rough set theory has already been applied to it with success. This thesis studies the typical heuristic, rough-set-based discretization process in depth. First, a new method for computing the set of candidate cuts is proposed: while still preserving the discernibility relation of the system, it yields a candidate cut set whose cardinality is far smaller than that produced by traditional methods. Second, the thesis examines heuristics that measure the importance of candidate cuts through the "cut discernibility matrix". Measuring a candidate cut's importance must consider not only the column-wise characteristics of this matrix but also, in a suitable way, its row-wise characteristics; the two directions, however, reflect cut importance asymmetrically, the rows less accurately than the columns. On this basis, the "cut selection probability" is defined to measure cut importance: it has a clear physical meaning, fully accounts for the difference between the column-wise and row-wise characteristics of the matrix, and unifies the two directions in a reasonable way. Finally, a method for computing the resulting cut set from the cut selection probabilities is proposed. Algorithm analysis and simulation experiments show that the proposed algorithm solves the discretization problem efficiently and effectively.
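The abstract gives neither the thesis's candidate-cut computation nor the cut-selection-probability formula, so the following is only a minimal Python sketch of the general rough-set discretization scheme it builds on: candidate cuts are placed only at boundaries between adjacent attribute values that involve different decision classes (which is what keeps the candidate set small), and a plain greedy discernibility-count heuristic stands in for the thesis's cut selection probability. The function names and table encoding are illustrative assumptions, not the thesis's own.

```python
from collections import defaultdict

def candidate_cuts(values, labels):
    """Boundary cuts for one numeric attribute: midpoints between adjacent
    distinct values whose samples involve more than one decision class.
    Far fewer candidates than taking every midpoint, yet discernibility
    between the classes is preserved."""
    by_value = defaultdict(set)          # distinct value -> classes seen there
    for v, y in zip(values, labels):
        by_value[v].add(y)
    vs = sorted(by_value)
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])
            if len(by_value[a] | by_value[b]) > 1]

def greedy_select(cuts, values, labels):
    """Greedy stand-in for cut selection: repeatedly keep the cut that
    newly separates the most pairs of samples from different classes."""
    pairs = {(i, j) for i in range(len(values)) for j in range(i + 1, len(values))
             if labels[i] != labels[j]}
    chosen = []
    while pairs:
        best, gain = None, 0
        for c in cuts:
            n = sum((values[i] < c) != (values[j] < c) for i, j in pairs)
            if n > gain:
                best, gain = c, n
        if best is None:                 # remaining pairs are indiscernible here
            break
        chosen.append(best)
        pairs = {(i, j) for i, j in pairs
                 if (values[i] < best) == (values[j] < best)}
        cuts = [c for c in cuts if c != best]
    return sorted(chosen)
```

For example, candidate_cuts([1.0, 1.2, 1.4, 1.6], ["a", "a", "b", "b"]) yields a single cut near 1.3, which greedy_select then confirms is enough to discern the two classes.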
Feature subset selection is another very important task in the data-integration stage of KDD. It not only reduces the scale of a learning system but also removes redundant information from it, thereby exposing the latent relations among the data and ultimately improving the application performance and accuracy of the mined results. The thesis studies feature subset selection in depth: it proposes an efficient method for computing the attribute core, defines the notion of "system entropy", and uses an attribute's influence on the system entropy as the heuristic measure of relative attribute importance. The system entropy is simpler to compute than the "conditional entropy" and effectively overcomes the latter's shortcomings: it can measure the relative importance not only of non-redundant attributes but also of redundant ones. The thesis uncovers several algebraic properties of the system entropy and studies the intrinsic bias in how it takes its values. After this bias is effectively counteracted, the notion of "attribute significance" is defined on top of the system entropy and applied in a feature subset selection algorithm that works by backward elimination. Algorithm analysis and simulation experiments show that the proposed algorithm solves the feature subset selection problem efficiently and yields satisfactory results.
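"System entropy" is the thesis's own construction and its definition does not appear in the abstract, so the sketch below uses the standard conditional entropy H(D | C) of a decision table as a hedged stand-in inside the backward-elimination loop the abstract describes: an attribute is dropped when removing it does not raise the entropy of the decision given the remaining attributes. The decision-table encoding (a list of dicts) and all names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(rows, attrs, decision):
    """H(decision | attrs) over a decision table given as a list of dicts."""
    blocks = defaultdict(list)           # indiscernibility classes w.r.t. attrs
    for r in rows:
        blocks[tuple(r[a] for a in attrs)].append(r[decision])
    h = 0.0
    for block in blocks.values():
        p_block = len(block) / len(rows)
        for count in Counter(block).values():
            p = count / len(block)
            h -= p_block * p * math.log2(p)
    return h

def backward_eliminate(rows, attrs, decision, eps=1e-12):
    """Backward elimination: keep deleting the attribute whose removal
    hurts least, as long as H(decision | remaining) does not rise."""
    base = conditional_entropy(rows, attrs, decision)
    selected = list(attrs)
    changed = True
    while changed and len(selected) > 1:
        changed = False
        scores = {a: conditional_entropy(rows, [x for x in selected if x != a], decision)
                  for a in selected}
        a, h = min(scores.items(), key=lambda kv: kv[1])
        if h <= base + eps:              # attribute a is redundant w.r.t. the decision
            selected.remove(a)
            changed = True
    return selected
```

On a consistent table this shrinks the attribute set toward a reduct; ties are broken by removing the attribute whose deletion raises the entropy least.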
Since decision rules are in essence classification rules labelled by the decision attribute set, learning decision rules is in essence mining classification rules over the samples. Decision rules obtained by traditional rough-set-based learning algorithms mainly describe the features that discriminate between samples of different classes and fail to reflect the features that samples of the same class share. The thesis therefore proposes a new decision rule learning algorithm that produces a complete decision rule system: during learning it considers the discriminating features between classes while also extracting the common features within each class. Simulation tests show that the algorithm achieves high learning accuracy and adapts well to inconsistency in the system.
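The algorithm itself is not reproduced in the abstract, so the sketch below only illustrates the stated idea with a generic sequential-covering scheme, not the thesis's actual procedure: each rule is grown from the attribute-value descriptors that the still-uncovered samples of one class share most widely, and conditions are added until no sample of any other class matches. The table encoding and names are illustrative.

```python
from collections import Counter

def learn_rules(rows, attrs, decision):
    """Sequential covering: grow consistent conjunctive rules per class
    from the descriptors most common among uncovered same-class samples."""
    rules = []
    for cls in sorted({r[decision] for r in rows}):
        pos = [r for r in rows if r[decision] == cls]
        neg = [r for r in rows if r[decision] != cls]
        while pos:
            cond, cand_pos, cand_neg = {}, pos, neg
            while cand_neg:              # extend the rule until it is consistent
                pairs = Counter((a, r[a]) for r in cand_pos for a in attrs
                                if a not in cond)
                if not pairs:            # inconsistent data: no consistent rule
                    break
                (a, v), _ = pairs.most_common(1)[0]   # most widely shared descriptor
                cond[a] = v
                cand_pos = [r for r in cand_pos if r[a] == v]
                cand_neg = [r for r in cand_neg if r[a] == v]
            if cand_neg:                 # give up on this class's remainder
                break
            rules.append((dict(cond), cls))           # IF cond THEN cls
            pos = [r for r in pos if r not in cand_pos]
    return rules
```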
Because any intelligent processing of a system may affect its uncertainty, measuring system uncertainty is an important problem of practical significance. A quantitative description of uncertainty makes it possible to observe and track how the uncertainty evolves, to analyse the direction and extent of a given processing step's influence on the system, and even, to some degree, to assess how reasonable that processing is. The thesis first analyses the existing rough-set-based measures of system uncertainty. It then proposes that, for decision information systems, the conditional entropy can serve as the uncertainty measure, and analyses the consistency between the behaviour of the conditional entropy and the notion of system uncertainty. For decision rule systems, it divides uncertainty into randomness and conflict, characterizes their concrete forms, and gives corresponding measures. Finally, it studies how system uncertainty affects the performance of typical decision rule learning algorithms and draws several useful conclusions.
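As a small worked check of the claim that conditional entropy tracks the uncertainty of a decision information system, the snippet below reuses conditional_entropy from the feature-selection sketch above: it is zero on a consistent table and grows once conflicting decisions appear inside an indiscernible block. The toy tables are invented for illustration.

```python
# Assumes conditional_entropy() from the feature-selection sketch is in scope.
consistent = [
    {"a": 0, "d": "yes"}, {"a": 0, "d": "yes"},
    {"a": 1, "d": "no"},  {"a": 1, "d": "no"},
]
conflicting = [
    {"a": 0, "d": "yes"}, {"a": 0, "d": "no"},   # same condition, opposite decisions
    {"a": 1, "d": "no"},  {"a": 1, "d": "no"},
]
print(conditional_entropy(consistent, ["a"], "d"))    # 0.0 -> no uncertainty
print(conditional_entropy(conflicting, ["a"], "d"))   # 0.5 -> conflict raises H(D|C)
```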