数据库中关联规则及效用模式挖掘算法的研究

英文题名：The Study of Association Rule and Utility Pattern Mining Algorithm
作者：熊学栋
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：数据挖掘 ; 关联规则 ; 频繁项目集 ; 效用挖掘 ; 效用模式
英文关键词：data mining ; association rule ; frequent itemsets ; utility mining ; utility pattern
学位年度：2007
导师：肖建华
学科代码：081203
学位授予单位：湘潭大学
论文提交日期：2007-05-01

摘要

近年来随着数字化在各机关企业中越来越普及,数据库在各个企业中的角色也就越来越重要。数据库所累积大量的数据中往往隐藏了许多有用的重要信息,如何能够有效率且正确地发掘出这些信息就变成为一个重要的课题,因此数据挖掘技术随即应运而生。目前数据挖掘中应用最广的技术就是关联规则的挖掘,许多的相关技术及研究已经被提出。
     关联规则挖掘模型以平等的方式对待每个项目(item),只考虑项目是否在事务记录中出现。但是在实际的情况中,项目之间的是有明显区别的,我们可以将这种区别定量化,其中一种方法就是以效用来衡量项目之间的区别。
     本文在研究提出关联规则新算法的同时,对另一类问题,效用模式的挖掘也作了细致的研究。效用模式挖掘是一个全新的挖掘技术分支.效用模式发现问题是和关联规则,序列分析较为相似的一类问题,它们有共同的数据背景------从购物篮数据延伸开来的客户记录数据。和另外两者的挖掘类似,效用挖掘也是从这些数据中寻找潜在有用的,非平凡的支持决策的新知识。只是更加侧重满足最小效用值,可以看成是一种带有约束的项集挖掘。
     本文延续了对关联规则的研究,给出了一种基于划分和分解的算法,该算法基于划分的思想,只需扫描数据库一次,较大的减少了候选项集的数量,也缩小了检验候选项集时考虑的范围。实验表明该算法在效率上有较大的改进。针对效用挖掘的情况,本文在总结前人研究的基础上,将问题转化为一个最优化问题,提出一种基于二分划分树的启发式算法,该算法能有效的在数据中寻找效用模式。相对于基于剪枝的效用模式发现算法,该算法性能上有较大的突破。
     本研究的主要内容为有效的关联规则算法和效用挖掘新算法,通过在实验中对比算法的性能,验证了研究成果的先进性。
With the coming of the age of information, more and more companies and government agencies are being digitally equipped. And the database technology is playing more important a role than ever before. Huge amount of information remains undiscovered in these accumulated databases .it becomes a crucial challenge to efficiently and correctly extract the useful information hidden in these databases .data mining technology address this problem. As far as it goes , the most popular technology in data mining is association rules mining. Many researches have been contributed in this area.
     The mostly studied association rule mining model care about whether an item is included in a transaction or not. And thus treat all items equally. While in the real world case, there are discriminations between items. We can quantify these differences between items , one option is that we can use utility as a measure to signify the usefulness of the respective items.
     The thesis works on the research of association rule mining, along with the research on utility pattern mining, which is a emergent new topic in the data mining community .utility pattern mining is somewhat like the association rule and the sequence analysis ,for they share the same form of the targeted database .and all contrived to obtain the finding of potentially useful, non-trivial ,decision-support knowledge.
     The thesis proposes a association mining algorithm based on partition and decomposition. The algorithm was grounded on the idea of partitioning the whole database which is a way to save for RAM storage .it scans the database once and shrinks the amount of candidate item-sets. The experiment concludes that it has an edge in the efficiency comparison.
     Another contribution of the thesis is the study on the utility pattern mining. the author proposed an heuristic algorithm based the dynamic binary partition tree. The algorithm doing so by further assuming the problem in an optimization framework .the experiment also shows that it is more robust than former ones.
     This dissertation paper mainly deal with the research of association rule mining and the utility item-sets mining problem. the experiment dedicated to the verification of the proposed algorithms show that the research is novel and constructive .

引文

[1] Rakesh Agrawal Ramakrishnan S&ant*.Fast Algorithms for Mining Association Rules[C]. Proceedings of the 20th VLDB Conference Santiago:Chile, 1994:487-499.
    [2] PAUL W. PURDOM, DIRK VAN GUCHT, PERFORMANCE OF THE APRIORI ALGORITHM[J]. Society for Industrial and Applied Mathematics.2004, 33(5): 1223–1260.
    [3] HUANG Liusheng,CHEN Huaping,WANG Xun , CHEN Guoliang. A Fast Algorithm for Mining Association Rules[J]. J. Comput. Sci.&Techno,2000, 15(6):619-624.
    [4] Kritsada SPIPHAEW,Thanaruk THREERAMUNKONG.Fast Algorithms for Mining Generalized Frequent Patterns of Generalized Association Rules[J].IEICE TRANS.INF&SYST, 2004,E87-D(3):761-770.
    [5] Jian Pei, Jiawei Han, Laks V.S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints[C].in the Proceedings of 2001 IEEE International Conference on Data Engineering (ICDE’01),Heidelberg, Germany, 2001:433-332.
    [6] R. Agrawal, T. Imielienski, and A. Swami. Mining Association Rules between Sets of Items in LargeDatabases[C]. Proc. Conf. on Management of Data, ACM Press, New York, NY, USA 1993:207–216.
    [7] J. Han, H. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation[C]. In Proc Conf on theManagement of Data (SIGMOD’00. Dallas, TX). ACM Press: New York, NY, USA ,2000:1-12.
    [8] M. Zaki ,C.J. Hsiao. An efficient algorithm for closed itemset mining[C]. SIAM. SDM.Frorida:sanfromcisco, 2002:44-48.
    [9] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets[C]. IEEE. ICDM.Newyork:ACM press, 2001:45-71.
    [10]Jiawei Han,Mieheline Kamber.Data Mining:Concept and Techniques[M].USA:Morgan Kaufmann publishers,inc,2000:1-202
    [11] SRIKANT R,AGRANWALR. Mining generalized association rules[C]. Proceeding of the 21th Intennational Conference on Very Large Database. Zurich,Switzerland,1994:407-419.
    [12] Guozhu Dong, Senior Member, IEEE, Jiawei Han, Senior Member, IEEE, Joyce M.W. Lam,Jian Pei,Member, IEEE Computer Society, Ke Wang, and Wei Zou.Mining Constrained Gradients in Large Databases[J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,2004, 16( 5):1-17.
    [13] Hong Yao, Howard J. Hamilton, and Cory J. Butz .A Foundational Approach for Mining Itemset Utilities from Databases[C]. Hong Yao.In Proceedings 2004 SIAM . HKACADEMIC:April, 2004:482-486
    [14] R . Agrawat,et al .ParallcI Mining of association Rules[J]. IEEE Transactions on knowledge and data engineering, 1996,8(6):962-969.
    [15] 何小东,刘卫国.数据挖掘中关联规则挖掘算法比较研究[J].计算机工程与设计,2005,26(5):1265-1268.
    [16]朱建平,张润楚.数据挖掘的发展及其特点.统计与决策.2004,(7):71-72.
    [17]铁治兴,陈奇,俞瑞钊.关联规则采掘综述[J].计算机应用研究,2000,1:1-5.
    [18] 毕建欣 , 张岐山 . 关联规则挖掘算法综述 [J]. 中国工程科学,2005,7(4):88-94.
    [19] 吴芬兰 , 胡朝举 , 高推 , 李整 . 关联规则挖掘算法的改进 [J]. 微机发展,2005,15(8):151-152.
    [20]秦亮曦,史忠植.关联规则研究综述[J].广西大学学报,2005,30(4):309-317.
    [21] 谈恒贵 , 王文杰 , 李克双 . 频繁项集挖掘算法综述 [J]. 计算机仿真,2005,22(11):1-4.
    [22] 高俊 , 施伯乐 . 快速关联规则挖掘算法研究 [J]. 计算机科学,2005,32(3):200-202.
    [23]李华君,周海岩. 基于项目集知识库的关联规则挖掘与更新的高效算法[J]. 计算机工程与设计,2005,25(12):2198-2221.
    [24] 周涛 ,陆惠玲 . 关联规则挖掘算法研究 [J].齐齐哈尔大学学报 , 2004年,20(3):58-61.
    [25]辛志,刘少辉,史忠植.关联规则算法的实现与改进[J]. 计算机工程与应用,2002, 24:190-192.
    [26] 李小兵,吴锦林,薛永生,翁伟. 关联规则挖掘算法的改进与优化研究[J]. 厦门大学学报(自然科学版), 2005, 44(4): 468-471.
    [27]陆丽娜,陈亚萍,魏恒义,杨麦顺.挖掘关联规则中Apriori算法的研究[J].小型微型计算机系统,2000,21(9):940—943.
    [28]杜孝平,马秀莉,唐世渭.快速关联规则挖掘算法[J].计算机工程与应用,2002,11:1—4.
    [29] 徐健辉 . 生成频繁项集的逻辑 “ 与 ” 运算算法 [J]. 计算机应用,2004,24(1):88-90.
    [30]牛小飞,石冰.基于向量和矩阵的挖掘关联规则的高效算法[J].计算机工程与应用,2004:170-173.
    [31] 栾东庆, 徐素琴. PrefixSpan 算法在多维序列模式挖掘中的应用[J]. 微机发展 , 2003,(08)
    [32] 刘旭,祁之力,谭立刚. 一种基于灰关联的序列模式挖掘算法[J]. 北京邮电大学学报 , 2003,(03) .
    [33] 朱立运, 朱建秋. 带时间特征的序列模式挖掘算法 TESP[J]. 计算机工程 , 2004,(10)
    [34] 刘月波, 陆阶平, 刘同明. 基于 CTID 序列模式的一种改进算法[J]. 微机发展 , 2005,(03)
    [35] 沙金,邓成玉,张翠肖,刘伟峰,. 闭合序列模式挖掘算法[J].计算机工程与设计 , 2006,(03) .
    [36] 陈恩红, 李铜舒, 王舒. 一种基于 Max Gap 约束的高效序列模式挖掘算法[J]. 计算机工程与科学 , 2006,(10)
    [37] 李庆华, 马传香. 挖掘频繁闭序列的一种改进算法[J]. 小型微型计算机系统 , 2006,(03)
    [38]Pei J, Han J, Mortazavi-Asl B, et al. PrefixSpan: mining sequence patterns efficiently by prefix-projected pattern growth[A]. In: 17th the International Conference on Data Engineering[C]. Hidelberg, 2001.215-224.
    [39] Yen-Liang Chen, Kwei Tang, Ren-Jie Shen and Ya-Han Hu. Market basket analysis in a multiple store environment [J]. Decision Support Systems, 2005, (2):339-354
    [40] 王新宇,杜孝平,谢昆青.FP-growth算法的实现方法研究[J]. 计算机工程与应用,2004(8):174-176
    [41] 周焕银,张永,蔺鹏. 一种不产生候选项挖掘频繁项集的新算法[J].计算机工程与应用,2004,(15):182-185.
    [42] Yuh-Jiuan Tsay and Ya-Wen Chang-Chien An efficient cluster and decomposition algorithm for mining association rules [J] Information Sciences,2004 , 160(1)4-22
    [43] Jin-Mao Wei, Wei-Guo Yi and Ming-Yang Wang Novel measurement for mining effective association rules [J]Knowledge-Based Systems, Volume 19, Issue 8, December 2006,
    [44] Guoqing Chen, Hongyan Liu, Lan Yu, Qiang Wei and Xing Zhang .A new approach to classification based on association rule mining [J]. Decision Support Systems,2006, 42(2): 674-689
    [45] 盛立,高明,刘希玉. 一种高效的关联规则挖掘算法[J].滨州学院学报 ,2005,(06) .
    [46 伊卫国,卫金茂,王名扬,王兴通. 基于数据库划分的高效关联规则挖掘算法研究[J].东北师大学报(自然科学版) , 2004,(04) .
    [47] 伊卫国,卫金茂,王名扬. 关联规则挖掘方法的改进[J].东北师大学报(自然科学版) , 2006,(02) .
    [48] 黄艳,王延章. 大型数据库中多层次相联规则的提取[J].大连理工大学学报, 1999,(06):44-48.
    [49]倪志伟,蔡庆生,方瑾. 用神经网络来挖掘数据库中的关联规则[J]. 系统仿真学报,2000,(5)
    [50] 杨炳儒 , 王建新 .KDD 中双库协同机制的研究 (I)[J]. 中国工程科学,2002,4(4):41-51.
    [51]杨炳儒,王建新,孙海洪.KDD中双库协同机制的研究(Ⅱ)[J].中国工程科学.2002,4(5):34-43,78.
    [52] 杨炳儒,孙海洪.基于双库协同机制的挖掘关联规则算法Maradbcm[J].计算机研究与发展,2002,39(11):1447-1454.
    [53]陈晓云 .一种带约束条件的关联规则频繁集挖掘 [J]. 计算机工程与应用,2003:205-207.
    [54] 罗芳. 数据挖掘技术在移动通信决策支持系统中的应用. 交通与计算机. 2004, 4(22),74-76.56
    [55]罗来鹏,郝光,吴建乐. 用关联规则对超市经营进行决策分析[J]. 凉山大学学报,2003,5(4):66-68.
    [56]王华秋,曹长修,王越.一种快速并行关联规则算法研究及仿真.重庆大学学报,7:223-225.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700