基于兴趣度的关联规则挖掘算法的研究

英文题名：Mining Algorithm Research for Association Rules Base on Interest Measure
作者：陈安龙
论文级别：硕士
学科专业名称：计算机应用
中文关键词：数据挖掘 ; 关联规则 ; 兴趣度 ; 极大团 ; 邻接矩阵
英文关键词：Data Mining ; Association rules ; Interest measure ; Maximum clique ; Adjacency matrix
学位年度：2003
导师：陶宏才
学科代码：081203
学位授予单位：西南交通大学
论文提交日期：2003-06-01

摘要

数据挖掘是面向海量数据的知识发现技术，研究高效的挖掘算法是数据挖掘研究的重要内容之一。关联规则是数据挖掘的重要模式之一，有着极其重要应用价值。本文主要研究了如何提高布尔关联规则的挖掘算法的有效性和伸缩性。
     Apriori算法是挖掘布尔关联规则的算法，而该算法在空间和时间的复杂性有着难以克服的局限性。因此，文中引入了一种不需要产生候选项的频繁模式增长算法，将数据库的事务的信息压缩到FP一树，然后通过后缀与前缀连接产生频繁模式，从而避免了多次扫描数据库，降低了时间开销。
     当数据库中的项目数目较大且事务数量巨大时，频繁模式增长算法内存开销很大，可能导致内存空间不足的现象。因此，本文提出了基于极大团划分的模式增长算法，将事务项目集分解成若干子集，对每个子集分别使用频繁模式增长算法找出它们的频繁模式，从而解决了内存不足的矛盾。同时，提出了一种用邻接矩阵产生频繁2项集的方法，可以减少扫描数据库的次数。
     如何从大量的关联模式中筛选出用户感兴趣且有价值的规则，是算法研究的重要内容之一。基于支持度和信任度的框架模型有一定的局限性，本文在此框架中引入了基于影响的兴趣度，用来修剪无趣的规则，从而筛选出用户真正感兴趣的规则模式。
Data mining is the knowledge discovery technique oriented to a great deal of data. Researching efficient algorithm is one of the important contents in study of data mining. Association rule is one of the important models of data mining, and has the most significant application value. The core of this dissertation is how to improve the validity and scalability of mining algorithm of Boolean association rules.
    The Apriori algorithm is the method of finding Boolean association rules, but has the disadvantage in the complexity of space and time. Therefore, this thesis introduces a new frequent-pattern (FP) growth algorithm that does not need to produce the candidate item sets. This algorithm compresses information in database to the FP-tree, then produces frequent pattern by joining suffix with prefix, consequently avoids scanning the database many times, and lowers the time expense.
    When there are a great many of items and transactions in the database, frequent-pattern growth algorithm needs more additional computer memory, which may cause the lack of memory. Therefore, this paper brings forward frequent-pattern growth algorithm based on maximum clique that resolves problem of memory insufficiency by dividing item set into several subsets, then computing frequent-pattern for each subset. In this paper, a new algorithm is given to find fraquent 2-itemset by adjacency matrix with less times scanning the database.
    How to select the interested and valuable rules from a large number of association modes is one of the important contents in study of mining algorithm. There is limitation in model based on support and confidence measure, thus interest measure model based on effect is given in this dissertation, which is used to prune the no-interest rules in order to discover the real interest rules mode.

引文

[1] 周皓峰、朱扬勇、施伯乐．一个基于兴趣度的关联规则的采掘算法．计算机研究与发展，2002．4 450-457
    [2] 徐宝文、张卫丰．数据挖掘技术在Web预取中的应用研究．计算机学报，2001．4 430-436
    [3] 程继华、郭建生、施鹏飞．挖掘所关注规则的多策略方法研究．计算机学报，2000．1 47-51
    [4] 邹晓峰、陆建江、储为民、宋自林．挖掘关注的语言值关联规则．解放军理工大学学报(自然科学版)，2002．6 9-12
    [5] 肖利、金远平、徐宏炳、王能斌．基于多维标度的快速挖掘关联规则算法，软件学报，1999．7 749-753
    [6] 周延泉、何华灿、李金荣．利用广义相关系数改进的关联规则生成算法．西北工业大学学报，2001．11 639-643
    [7] 陆建江、宋自林、钱祖平．挖掘语言值关联规则．软件学报，2001．12 607-611
    [8] 张竹润、谢康林、张忠能．一种提取关联规则的数据挖掘快速算法．上海交通大学学报，2002．4 555-558
    [9] 李雄飞、苑森淼、董立岩、全勃．多段支持度数据挖掘算法研究．计算机学报，2001．6
    [10] SymthP, GoodmanRM. An Information Theoretic Approach to Rule Induction from Database [J]. IEEE Trans. knowledge Data Eng. 1992,4(4):301-316
    [11] Brin S, Motwani R,Ullman J D et al. Dynamic itemset counting and implication rules for market basket data. In: Procthe ACM SIGMOD International Conference on Management of Data, Tucson, Arizon, 1997 . 255-264
    [12] AgrawalR, SrikantR. Fast algorithms for mining association rules in large databases In: Procof the 20thInt'l Confon Very Large DataBases. Santiago, Chile, 1994.478-499
    [13] 周欣、沙朝锋、朱扬勇、施伯乐．兴趣度——关联规则的又一个阀值．计算机研究与发展，2000．5 627-633


    [14] 铁治欣、陈奇、俞瑞钊．采掘关联规则的高效并行算法．计算机研究与发展，1999．8 947-953
    [15] PeiQi Liu、ZengZhi Li.A fast algorithm of mining association rules in network management. Machine Learning and Cybernetics, 2002.2 915-918
    [16] NaiQian Li、 QinBao Song. Mining cross-table association rules based on projections of itemsets. Machine Learning and Cybernetics, 2002.1 170-174
    [17] Cohen, E.、Datar, M.、 Fujiwara, S. Finding interesting associations without support pruning. Data Engineering, 16th International Conference on , 2000 489-500
    [18] Aly, H.H.、Amr, A.A.、Taha, Y. Fast mining of association rules in large-scale problems. Computers and Communications, 2001. Proceedings. Sixth IEEE Symposium on , 2001 107-113
    [19] Shragai, A.、Schneider, M. Discovering uantitative associations in databases IFSA World Congress and 20th NAFIPS International Conference, 2001. Joint 9th July 2001 423-428
    [20] Xiangjj Huang, Aijun An,Cercone, N.、Prombouse, G. Discovery of interesting association rules from livelink web log data. Data Mining, 2002. Proceedings. 2002 IEEE International Conference on , 2002 763-766
    [21] Jiexun Li、Guoqing Chen. A SAR-based interesting rule mining algorithm Fuzzy Information Processing Society, 2002. Proceedings. NAFIPS. 2002 Annual Meeting of the North American, 2002 178-183
    [22] Zaki、 M.J. Scalable algorithms for association mining Knowledge and Data Engineering, IEEE Transactions on, Volume: 12 Issue: 3 , May/Jun 2000 372-390
    [23] EuiHong Han、Karypis, G.、Kumar, V. Scalable parallel data mining for association rules. Knowledge and Data Engineering, IEEE Transactionson, Volume: 12 Issue: 3, Jun2000 337-352
    [24] 倪志伟、蔡庆生、方瑾．用神经网络来挖掘数据库中的关联规则．系统仿真学报，2000．11 685-687
    [25] 欧阳为民、蔡庆生．大型数据库中多层关联规则的元模式制导发现．

    软件学报，1997．12 920-927
    [26] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. Proceedings of the 21st International Conference on Very large Database, 1995
    [27] J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. Proceedings of ACM SIGMOD International Conference on Management of Data, San Jose, CA, May 1995. 175-186
    [28] H. Mannila, H. Toivonen, and A. Verkamo. Efficient algorithm for discovering association rules. AAAI Workshop on Knowledge Discovery in Databases, 1994, pp. 181-192.
    [29] H. Toivonen. Sampling large databases for association rules. Proceedings of the 22nd International Conference on Very Large Database, Bombay, India, September 1996
    [30] J. Han、J. Pei、Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD Int. Conf. Management of Data(SIGMOD'00),Dalas, TX, May 2000.
    [31] 程继华、旌鹏飞．多层次关联规则的有效挖掘算法．软件学报．1998．12 937-941
    [32] 陆丽娜、陈亚萍、魏恒义、杨麦顺．挖掘关联规则中hpriori算法的研究．小型微型计算机系统，2000．9 940-943
    [33] 罗可、吴杰．一种基于Apriori的改进算法．计算机工程与应用，2000．11 20-22
    [34] 路松峰、卢正鼎．计算支持度和置信度的上下界．小型微型计算机系统，2000．8 851-854
    [35] 朱绍文、王泉德、黄皓．关联规则挖掘技术及其发展动向．计算机工程，2000．9 4-6
    [36] 周延泉、何华灿、李金荣．利用广义相关系数改进的关联规则生成算法．西北工业大学学报，2001．11 639-643
    [37] 路松峰、胡和平．加权关联规则的开采．小型微型计算机系统，2001．3 347-350
    [38] 程继华、施飞．基于子块划分的关联规则的挖掘．计算机工程与设计，1999．9 25-29


    [39] 綦艳霞、杨炳儒．KDD中知识评价的研究综述．计算机应用研究，2001．12 1-4 20
    [40] 蔡伟杰、张晓辉、朱建秋、朱扬勇．关联规则挖掘综述．计算机工程，2001．5 31-49
    [41] 聂永红．常项集产生的算法及兴趣度量．广西师院学报(自然科学版)，2001．12 65-68
    [42] 宋爱波、董逸生、赵茂先．稠密数据库有趣规则的快速挖掘．小型微型计算机系统，2001．7 822-826
    [43] 楼晓鸿、丁宝康．一种多支持度的关联规则采集算法．计算机工程，2001．6 102-103
    [44] 俞金寿．数据挖掘技术．计算机应用，2000．6 38-42
    [45] 施润身、赵青．改进的关联规则采掘算法及其实现．同济大学学报，2002．2 222-225
    [46] 杨炳儒、唐菁．基于复杂类型数据的发现特征子空间模型(DFSSM)的研究．中国工程科学，2003．1 56-68
    [47] 程继华、魏暑生、施鹏飞．基于概念的关联规则的挖掘．郑州大学学报(自然科学版)，1998．6 27-30
    [48] 谢志鹏、刘宗田．基于概念格的关联规则发现．小型微型计算机系统，2000．10 1028-1031
    [49] Cohen, E.、 Datar, M.、Fujiwara, S. Finding interesting associations without support pruning Knowledge and Data Engineering, IEEE Transactions on , Volume: 13 Issue: 1 , Jan/Feb 2001 64-78
    [50] ChengFa Tsai、Yi-Chau Lin、Chi-Pin Chen. A new fast algorithms for mining association rules in large databases Systems, Man and Cybernetics, 2002 IEEE International Conference on, Volume: 7 , 2002 251-256
    [51] Taouil, R.、 Pasquier, N.、 Bastide, Y.、 Lakhal, L. Mining Bases for Association Rules Using Closed Sets Data Engineering, 2000. Proceedings. 16th International Conference on , 2000 307-307
    [52] Jiawei Han、Michel ine Kember著，范明、孟小峰译．数据挖掘概念与技术．机械工业出版社，2001．4
    [53] 吉根林．数据挖掘技术．中国图象图形学报，2001．8：715-721

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700