粗糙集理论中的约简算法研究

英文题名：Research on Reduct Algorithm in Rough Set Theory
作者：李军
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：数据挖掘 ; 粗糙集 ; 约简 ; 区分矩阵 ; 区分能力指数 ; 划分
英文关键词：Data Mining ; Rough Set ; Reduct ; Discernibility Matrix ; Discernibility Ability Index ; Partition
学位年度：2004
导师：李雄飞
学科代码：081202
学位授予单位：吉林大学
论文提交日期：2004-05-01

摘要

近年来，数据挖掘(Data Mining)引起了信息产业界的极大关注，其主要原因是数据海洋的日益增大，我们需要新技术将海量数据转化为有用的信息和知识。
    分类是数据挖掘的主要任务之一，分类的目的是要建立区分数据对象的模型，以便能够使用模型预测类标记未知的数据对象。主要的分类方法有决策树、贝叶斯分类、支持向量机、神经网络、遗传算法、粗糙集等等。
    粗糙集(Rough Set)理论是波兰数学家Z.Pawlak于1982年提出的，是一种新的处理含糊性(Vagueness)和不确定性(Uncertainty)问题的数学工具，粗糙集理论的主要优势在于它不需要关于数据的任何预备的或额外的信息。高效的约简算法是粗糙集理论应用于数据挖掘与知识发现领域的基础，但是，已经证明求解所有约简和求解最小约简都是NP-hard问题，因此，寻求快速的约简算法仍是粗糙集理论的主要研究课题之一。
    本文在对数据挖掘和粗糙集约简算法进行综述之后，提出一组基于属性区分能力指数的信息系统约简算法和一组决策表相对约简算法。
    本文提出的算法都涉及系统的划分，系统划分的目的是为了用属性a来代替相应区域的区分元素(见图4.1)，从而减小算法搜索的空间。
    为了能够用属性a来代替相应区域的区分元素，要求划分时要将系统划分成属性对应的若干等价类。根据第四章所解释的算法思想，自然希望选择取值分布较均匀的、具有较多属性值的属性。
    本文针对信息系统和决策表分别提出区分能力指数的概念，并利用具有最大区分能力指数的属性对系统进行划分。
    当然，信息增益度量也倾向于取值较多且取值分布均匀的属性，但是信息增益不能很好地反映算法思想。正如定理4.2和定理5.2所指出的，取值分布较均匀的属性具有较大的区分能力；而定理4.3和定理5.3指出，当属性b是属性a的细化时，属性b的区分能力不小于a的区分能力。因此，区分能力指数是较好的系统划分启发信息。
    第四章首先定义了属性区分能力指数I(a)的概念，并给出I(a)的若干性质。然后提出了基于区分能力指数的信息系统划分与约简的基本算法，算法的主要思想是采用分而治之的策略，利用具有最大属性区分能力指数的属性将信息系统划分为子表，求取子表约简后，利用子表约简求解原系统的约简。本文指出该算法所得结果是信息系统的约简。通过对基本算法的分析，提出了考虑公共项的约简算法、考虑公共项和频繁项的约简算法和带有近似精度阈值的约简算法，并对相应算法的完备性进行了分析。为了进一步简化布尔函数化简的过程，当各个子表区分函数的析取范式具有公共项时，考虑公共项的约简算法只给出一个约简。这虽然会丢掉若干约简，但在分类精度容许


    的情况下，这种策略会在得到各个子表约简后，立即给出原信息系统的一个约简。类似地，当各个子表区分函数的析取范式具有频繁项时，考虑频繁项的约简算法和带有近似精度阈值的约简算法，也只给出一个约简。当然，带有近似精度阈值的约简算法是对以上算法的完善。第四章还对各个子表布尔函数的不同形式进行了较全面的讨论。
    第五章针对决策表定义了条件属性的区分能力指数DI(a)的概念，并给出DI(a)的若干性质。为论述方便，本文还提出了拟等价类的概念。然后提出基于区分能力指数的决策表划分与相对约简的基本算法，算法的主要思想与第四章算法思想类似，但需改动三点：(1) 划分决策表为各个拟等价类对应的子表；(2) 区分能力指数定义改为DI(a)；(3) 区分函数fi按照决策表区分函数方式构建。基本算法所得结果是决策表的相对约简。最后，提出了带有近似精度阈值的相对约简算法，并对相应算法的完备性进行了分析。本章还对为何选取区分能力指数作为系统划分的启发信息，进行了讨论。
    第五章的最后一节给出了若干实验结果，实验结果表明，本文提出的算法有较高的效率。
    值得进一步研究的问题如下：
    本文给出的算法不具有完备性，因此，进一步的工作应考虑从理论上讨论算法所得结果的“满意程度”。另外，算法的时间复杂度主要是由于区分函数化简引起的，故区分函数化简的高效算法也是值得进一步研究的问题。
In recent years, Data Mining has rosed a great attention of the information industry, the main reason is because data ocean has more and more larghe we need new technology to transmute the very large data into useful information or knowledge.
    Classification is a primary task of Data Mining, the aim of Classification is found a model which can discernibility data objects, in order to we can use this model to forecast the class of a data object which we do not know it’s class maker. The main classification methods include Dicision Tree, Beyes classification, Support Vector Machine, Artificial neural Net, Genetic Algorithm, Rough Set and etc.
    Rough Set theory is founded by a Polish mathematician—Z. Pawlak in 1982, it is a new kinds of mathematical tool to deal with Vagueness and Uncertainty problems. The main advantage of Rough Set theory is that it has no use for any preliminary or additional information about data. The effective reduct algorithm is the foundation to use the rough set theory in data maining and knowledge discovery in database. But it has been proved that the problem about solve all reducts or the minimum reduct is NP-hard problem, so search a fast reduct algorithm is however a main research issue of Rough Set theory.
    In this paper, we give a summarization of data mining and reduct algorithms in rough set theory, and then we propound a series of reduct algorithms about information system and a series of relative reduct algorithms about decision table, they are all base on the discernibility ability index.
    The algorithms propounded in this paper are all relate to the division of system, the aim of the division of system is replace the discernibility element in the corresponding region with attribute a(see figure 4.1), thereby, it reduce searching space of the algorithm.
    In order to replace the discernibility element in the corresponding region with attribute a, we must divide system into any equivalence classes according to attribute a. By the explanation of algorithm’s idea in chapter 4, an attribute be supposed to be selected ought to an attribute which has a relatively uniform value distribution and has relatively more attribute value.
    Discernibility ability index about information system and decision table was respectively propounded, an attribute which has the maximal discernibility ability index is used for the division of system.
    Certainly, information gain measurement is also be apt to the attribute which


    has a relatively uniform value distribution and has relatively more attribute value, but information gain can’t reflect the idea of the algorithm. By theorem 4.2 and theorem 5.2, we can know that an attribute has biggish discernibility ability when it has a relatively uniform value distribution. Moreover, by theorem 4.3 and theorem 5.3, we can know that the discernibility ability of attribute b should not less than that of attribute when attribute b is the thinning of attribute a. Therefore, discernibility ability index is a good heuristic information of division of system.
    Discernibility ability index I(a) was defined in chapter 4, some properties about I(a) were presented. And then, we propound a basic reduct algorithms about information system, it is base on the discernibility ability index, the main idea of this algorithm is divide and rule, according to the attribute that has the maximal discernibility ability index, we divide system into some subtable, when we gained the reduct of every subtable, we can use them to solve the reduct of original information system. In this paper, we proved the output of basic algorithm is reduct of information system. After analyze the basic algorithm, we propound a series of algorithms: a reduct algorithm which take into account common term; a reduct algorithm which take into account common term and frequent term; a reduct algorithm which take into account threshold of approximate precision, moreover, algorithm’s categoricalness were analyzed. For the sake of predigest the evolvement of Boolean function, the reduct algorithm which tak

引文

[1] Bazan J., Skowron A, Synak P.. Dynamic Reducts as a Tool for Extracting Laws from Decisions Tables. In: Ras Z W, Zemankiva M(Eds.). Methodologies for Intelligent Systems. 1994. Berlin: Springer-Verlag, 346-355.
    [2] Bazan J., Dynamic reducts and statistical inference, IPMU’96, 1996, 1147-1152.
    [3] Bazan J., A Comparison of Dynamic and non Dynamic Rough Set Methods for Extracting Laws from Decision Tables, 1998.
    [4] Bjorvand A T. (1998). 'Rough Enough' – A system supporting the Rough Sets Approach. http://home.sn.no/~torvill.
    [5] Chan C C,A rough set approach to attribute generalization in data mining,Journal of Information Sciences,107(1998),169-176.
    [6] 胡可云, 陆玉昌, 石纯一, 粗糙集及其应用进展, 清华大学学报, 41(1), 2001。
    [7] 胡可云，基于概念格和粗糙集的数据挖掘方法研究，清华大学博士学位论文，2001。
    [8] Katzberg J D, Ziarko W. (1993). Variable Precision Rough Sets with Asymmetric bounds. In: Ziarko W P(Ed.). Proceedings of RSKD'93. Springer-Verlag, 167-177
    [9] Kryszkiewicz M, Rybinski H. (1993). Finding Reducts in Composed Information Systems. In: Ziarko W P(Ed.). Proceedings of RSKD'93. London: Springer-Verlag, 261-273.
    [10] Kryszkiewicz M. (1998). Rough set approach to incomplete information systems. information sciences, 112:39-49.
    [11] 苗夺谦，基于信息熵的属性约简算法，中国科学院自动化研究所技术报告，1996。
    [12] 苗夺谦、胡桂荣，知识约简的一种启发式算法，计算机研究与发展，1999,36(6),681-684。
    [13] Pawlak Z., Rough sets, International Journal of Computer and Information Sciences,11(1982),341-356.
    [14] Pawlak Z. et al., Rough Sets, Communication of the ACM,38(1995),189-195.
    [15] Pawlak Z., Vagueness and uncertainty: A rough set perspective, Computational Intelligence,11(1995),227-232.
    [16] Pawlak Z., An inquiry into Vagueness and uncertainty, In: M. Dabrowski, M. Michalewicz, and Z.W. Ras(eds.), Proceedings of the Third


    International Workshop on Intelligent Information Systems, 1994, 338-343.
    [17] Robert Susmaga, Parallel Computation of Reducts, L. Polkowski and A. Skowron (Eds.): RSCTC'98, 1998, 450-458.
    [18] Skowron A, Rauszer C,The discernibility matrices and functions in information systems, Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, 1992,331-362.
    [19] Starzyk J, Nelson D E, Sturtz K., A mathematical foundation for improved reduct generation in information systems, Knowledge and Information Systems(2000)2, 2000, 131-146.
    [20] Starzyk J, Nelson D E, Sturtz K. (1999). Reduct generation in information system. Bulletin of international rough set society, 3(1/2):19-22.
    [21] Xiaohua Hu and Cercone N.,Learning in relational database:a rough sets approach,Computational Intelligence,11(1995),323-355.
    [22] Xiaohua Hu,Knowledge Discovery in Databases: An Attribute-Oriented Rough Set Approach, 1995, Ph.D dissertation, Dept. of Computer Science,University of Regina.
    [23] Xiaohua Tony Hu, T.Y.Lin,and Jiaochao Han, A new rough sets model based on database systems, G. Wang et al.(Eds):RSFDGrC 2003, LNAI 2639, 2003, 114-121.
    [24] Yao Y Y, T.Y.Lin. (1996). Generalization of Rough Sets using modal logics. intelligent Automation and soft computing, 2(2), 103-120.
    [25] Ziarko W.,Variable precision rough set model,Journal of Computer and System Sciences,46(1993),39-59.
    [26] 王珏，Rough set约简与数据浓缩，高技术通讯，1997年11月，40-45。
    [27] Wang Jue(王珏) and Miao Duoqian(苗夺谦), Analysis on Attribute Reduction Strategies of Rough Set,J. of Comput. Sci. & Technol, 1998.
    [28] Wong S.K.M, Ziarko W., Optimal decision rules in decision table, Bulletin of Polish Academy of Sciences, 1985,33(11-12), 693-696.
    [29] Wroblewski J., Ensembles of classifiers based on approximate reducts, 2000.
    [30] Fayyad U、Piatetsky-Shapiro、Smyth、Uyhurusamy，Advances in Knowledge Discovery and Data Mining，1996，MIT Press.
    [31] Jiawei Han、Micheline Kamber著，范明、孟小峰等译，《数据挖掘概念与技术》、机械工业出版社，2001年第一版。
    [32] 李雄飞编著，《数据挖掘与知识发现》，高等教育出版社，2003年第一版。
    [33] 李雄飞等，《基于粗集理论的约简算法》，吉林大学学报，2003年1月。

    [34] 李军，《粗糙集与模糊集的比较》，长春理工大学学报，2002年12月。
    [35] 陆汝钤，人工智能(上册)，科学出版社，1995年。
    [36] 陆汝钤，知识科学与计算科学，清华大学出版社，2003年。
    [37] 陆汝钤，世纪之交的知识科学，清华大学出版社，2001年。
    [38] Pawlak Z.,Rough sets: Theoretical Aspects of Reasoning about Data,Kluwer Academic Publishers,Boston,1991.
    [39] 史忠植著，《知识发现》，清华大学出版社，2002年第一版，143-169。
    [40] 臧雪柏等，《挖掘相联规则的并行算法》，小型微型计算机系统，2003年12月。
    [41] 张文修、吴伟志、梁吉业、李德玉编著，《粗糙集理论与方法》，科学出版社，2001年第一版。
    [42] 曾黄麟，粗集理论及其应用，重庆大学出版社，1998年。

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700