基于关联规则的数据挖掘算法研究

英文题名：Study of Data Mining Algorithm Based on Association Rules
作者：吴仁堂
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：数据挖掘 ; 关联规则 ; 修剪 ; APM算法
英文关键词：Data mining ; Association rules ; Prune ; APM
学位年度：2010
导师：周根宝
学科代码：081203
学位授予单位：内蒙古农业大学
论文提交日期：2010-05-01

摘要

关联规则是数据挖掘技术的一个最活跃的研究方向之一,其反映出项目集之间有意义的关联关系。关联规则可以广泛地应用于各个领域,既可以检验行业内长期形成的知识模式,也能够发现隐藏的新规律。有效地发现、理解和运用关联规则是完成数据挖掘任务的一个重要手段。
     关联规则挖掘需要在挖掘效率和精确性方面进行改进,也需要新的更有效的算法。本文对关联规则挖掘相关的概念和关联规则典型算法进行了详细的分析和总结,然后在介绍关联则挖掘基本算法-Apriori算法的基础上,对现有的经典算法进行了研究分析并指出了它们使用的传统搜索方法和频度计算上的不足。
     传统算法存在的另一个重要问题是:生成的关联规则之间存在着大量的冗余规则,这使得用户分析和利用这些规则变得十分困难,如何修剪冗余规则以便用户分析成了一个重要课题。减少冗余规则的方法很多,目前对冗余规则的修剪技术主要在正关联规则领域,但负关联冗余规则的修剪同等重要,本文在介绍正关联规则修剪的同时也对负关联规则挖掘技术进行了深入的研究讨论。并在现有算法的基础上提出了新的冗余规则裁剪算法,该算法运用概率论的相关性定义进一步对生成的关联规则进行裁剪。
     接着介绍了基于模式矩阵匹配的关联规则算法-APM算法,并对算法性能进行了分析。APM算法扫描一遍数据库后就不再使用数据库,并且用矩阵的编码方式用来求一个待生成的k-项集是不是频繁项集,大大提高了挖掘关联规则的效率,对数据挖掘来说有一定的实用价值。
Association rules is one of the major issues in data mining technology,it reflects the significative association between sets. Association rules can be widely applied in various fields, it can test the knowledge pattern of the industry's long-established and can find some new rules. Effectively find, understand and use data mining association rules is an important means to complete the task.
     Mining association rules need to improve the efficiency and accuracy, also needs new and more efficient algorithms. In this paper, the concept of association rule mining and the typical algorithm related to association rules is analyzed and summarized.And then introduce the basic algorithm for mining association rule -Apriori algorithm, based on existing classical algorithm analysis and pointed out their lack of use traditional search and in frequency calculation.
     However, there is the traditional method is another important issue: the association rules algorithm generated a great deal of redundancy rules, which makes users to analyze and use these rules very difficult. This requires us to prune redundant rules, and how to enable users to facilitate analysis has become an important issue. There many ways to reduce the redundant rules, the current pruning techniques redundant rules are mainly in mining positive association rules, but the negative association rules of pruning redundant rules is also important, this paper study and discussion positive and negative association rules techniques in-depth. A new non-redundant rules algorithm is presented based on former algorithm, this algorithm using the correlation knowledge of probability for further pruning redundant rules.
     Then introduced an association rule mining algorithm based on pattern matrix matching -APM association rules algorithm, and this algorithm performance is analyzed. APM algorithm scans the database once and use the encoding matrix to determine a k-item set is frequent item set or not. The experimental results demonstrate that the algorithm is correct and effective.

引文

1 R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases[C].Proc.1993 ACM SIGMOD Int’l Conf. Management of Data, Washington, D.C , May 1993, pp. 207-216
    2 Margaret H.Dunham .数据挖掘教程[M] .郭崇慧等译.清华大学出版社,2005
    3 J.Han,and M.Kamber,Data mining concepts and techniques[M].Morgan Kaufmann Publishers, San Francisco,CA,2001
    4朱建平,张润楚.数据挖掘的发展及其特点[J].统计与决策,2002(7):71-72
    5 Pat Langley and Herbert A.Simon,Applications of machine learning and rule induction[C],Communications of the ACM,38(11) ,1995,pp. 54-64
    6 Jiawei Han,Hong Cheng,Dong Xin,Xifeng Yan. Frequent pattern mining: current status and future directions[J], Data Mining and Knowledge Discovery,2007,volume(15): 55-86
    7 R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases[C]. 1994,In VLDB'94: 487-499
    8刘明吉,王秀峰,黄亚楼.数据挖掘中的数据预处理[J].计算机科学,2000,27(4):54-57
    9毛国君,段立娟,王实等.数据挖掘原理与算法[M].清华大学出版社,2005
    10 HAN J, PEI J, YIN Y. Mining frequent patterns without candidate generation[C] ,Proc of the 2000 ACM SIGMOD Internal Conference on Management of Data. Dallas, Texas: ACM Press, 2000: 1-12
    11 PARK J S, CHEN M S, YU P S. An effective hash based algorithm of association rules[C], Proc of International Conference on Management of Data. Washington DC: ACM SIGMOD, 1995: 175-186
    12 SAVASERE A,OM IECINSKIE,NAVATHE S.An efficient algorithm of association rules in large database[C],Proc of International Conference on Management of Data Washington DC: ACM SIGMOD,1995: 432-443
    13 RaymondT.Ng,Lakes V.S.Ukshmanan,Jiawei Han,and Alex Pang.ExPloratory Mining and Pruning Optimizations of Constrained Assoeiation Rules[C].In Proeeedings of ACM SIGMOD Intl.Conf.on Management of Data(SIGMOD,98):Seattle,Washington,USA,June 1998:13-24
    14 Lakes V.S.Lakshmanan,Raymond T.Ng , Jiawei Han , and Alex. Pang.Optimization ofConstrained Frequent Set Queries with 2-variable Constraints[C].In Proceedings of ACM SIGMOD Intl.Conf on Management of Data(SIGMOD’99),PhiladelPhia,PA,USA,May-June,1999:157-168
    15 Roberto J.Bayardo,Rakesh Agrawal,and Dimitrious Gunopulos.Constraints Based RuleMining in Large , Dense Database[C].In Proceedings of 15th Intl.Conf.on data engineering(ICDE’99),Sydney.Australia.march 1999:188-197
    16丁祥武.挖掘时态关联规则[J].武汉交通科技大学学报,1999,(8):365-376
    17董祥军,宋翰涛,姜合,陆玉昌.时态关联规则的挖掘[J].计算机工程,2005,(15):24-26
    18 Rakesh Agrawal, John C. Shafer. Parallel mining of association rules[C]. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 962-969
    19 Han Jiawei,FuYongjian.Mining Multiple-level Association Rules in Large Databases[J] IEEE Transon Knowledge and DataEngineering,1999,11(5):798-805
    20 M.Kamber,J.Han, and J.y. Chiang.Metarule-Guided Mining of Multi- dimensional Association Rules using Data Cubes[C].In Proc.3rd Int.Conf.Knowledge Discovery and DataMining(KDD-97),NewportBeach,California,August 1997:207-210
    21 J.Han,J.Pei,and Y.Yin. Mining Frequent Patterns without Candidate Generation[C].In proc.2000 ACM-SIGMOD Int.Conf.Management of Data(SIGMOD’00),2000(5):l-12
    22朱玉全,宋余庆.频繁闭项目集挖掘算法研究[J].计算机研究与发展, 2007, 44 (7) : 1177 -1183
    23冯玉才,冯剑琳.关联规则的增量式更新算法[J].软件学报,1998,9(4):301-306
    24 Brin, S., Motwani, R., Silverstein, and C.Beyond Market: Generalizing Association Rules to Correlations[C]. In Processing of the ACM SIGMOD Conference, 1997: 265-276
    25 Eui-Hong Han, George Karypis, Vipin Kumar. Scalable parallel data mining for association rules[C]. Proceedings of the ACM SIGMOD Conference, 1997: 277-288
    26 Wu Xindong, Zhang Chengqi, Zhang Shichao. Mining both positive and negative association rules[C]. Proceedings of the 1 9th International Conference on Machine Learning(ICML). San Francisco: Morgan Kaufmann Publishers, 2002: 658- 665
    27李学明,刘勇国,彭军等.扩展型关联规则和原关联规则及其若干性质[J].计算机研究与发展,2002,39(12):1740-1750
    28董祥军,王淑静,宋瀚涛,陆玉昌,负关联规则的研究[J].北京理工大学学报,2004,24(11):978-981
    29 Dong X., Sun F., Han, X., Study of Positive and Negative Association Rules Based on Multi-confidence and Chi-Squared Test[C]. LNAI 4093, Springer-Verlag Berlin Heidelberg, 2006: 100-109
    30 Zhang Chengqi,Zhang Shichao.Association rule mining[M].Heidelberg: Springer-Verlag, 2002: 47-84
    31 Dong X, Niu Z, Shi X. Mining Both Positive and Negative Association Rules from Frequent and Infrequent Itemsets[C]. Volume 4632, Springer Berlin/Heidelberg, 2007: 122-133
    32 Cohen J. Statistical Power Analysis for the Behavioral Sciences (2nd.)[M]. Lawrence Erlbaum, New Jersey, 1988
    33王柏盛,刘寒冰,靳书和马丽艳.基于矩阵的关联规则挖掘算法[J].微计算机信息,2007 , 23(15):143-145
    34李杰,徐勇,王云峰,王友.最简关联规则及其挖掘算法[J].计算机工程[J], 2007,33(13):46-48
    35刘江华,戴新喜,白似雪.基于模式矩阵的P_Matrix算法[J],南昌大学学报(理科版) ,2007,31(5):496-499
    36丁艳辉,王洪国,高明,谷建军.一种基于矩阵的关联规则挖掘新算法[J],计算机科学, 2006, 33(4):188-189
    37胡慧蓉,王周敬.一种基于关系矩阵的关联规则快速挖掘算法[J].计算机应用, 2005,25(7):1577-1579
    38徐前方,肖波,郭军.基于相关矩阵的关联规则挖掘及其更新算法[J],计算机工程.2008,34(1):40-42
    39张嘉赢,刘井莲,赵卫绩.一种基于半布尔矩阵的混合维关联规则算法[J].沈报,2008,20(2):19-21
    40黄学平,薛安荣.基于数据库划分的关联规则算法.计算机工程与设计[J],2008,29(12):3005-3007
    41 PASQU IER N, BASTIDE Y, TAOU L R, et al. Discovering frequent closed itemsets for association rules[C],Proc of ICDT1999. Israel:[sn],1999: 398-416
    42陈涛,张玮.一个改进的并行关联规则算法研究[J].计算机技术与发展,2007,17(1):139-141
    43 PASQU IER N,BASTIDE Y, TAOU IL. Efficient mining of association rules using closed itemset[J]. Information System , 1999, 24 (1) :25-46
    44 BERZAL F,CUBERO J-C,MAR N N. TBAR:An efficient method for association rule mining in relational databases [J]. Data & Knowledge Engineering, 2001,37: 47-64
    45 HAN Jia-wei, KAMBRM. Datamining: concepts and techniques[M].Beijing: Higher Education Press, 2001: 232-233
    46李超,余小平.基于矩阵的Apriori算法的改进方法[J].计算机工程, 2006 (23) : 68-69
    47 Hyvarinen A Fast and Robust Fixed Point Algorithm for Independent Component Analysis[J]. IEEE Transaction on Neural Network, 1999, 10(3): 626-634
    48牛小飞,石冰.基于向量和矩阵的挖掘关联规则的高效算法[J],计算机工程与应用, 2004,40(12):170-173
    49方炜炜,杨炳儒,宋威,侯伟.基于布尔矩阵的关联规则算法研究[J] .计算机应用研究,2008,25(7):1964-1966
    50宫雨.具有上界约束的关联规则算法研究[J].计算机工程,2007,33(5):29-31
    51吴绍函,余昭平.基于矩阵的关联规则挖掘算法[J].计算机工程,2008,34(23):31-33
    52吴伟平,林馥,贺贵明.一种无冗余的快速关联规则发现算法[J].计算机工程,2003,29(9): 90-92
    53董祥军,宋瀚涛,等.基于最小兴趣度的正、负关联规则挖掘[J].计算机工程与应用, 2004: 27: 24-31
    54杨越越,翟延富,等.一种改进的冗余规则修剪方法[J].郑州大学学报(理学版) ,2007,39(4): 134-136
    55 Ganter B, Wille R. Formal Concept Analysis: Mathematical Foundations[M]. Berlin, Germany: Springer Verlag, 1999

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700