关联规则基本技术研究

英文题名：Research on the Basic Technology of Association Rules
作者：郭运凯
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：数据挖掘 ; 关联规则 ; 最大频繁项目集 ; 排序频繁模式树 ; 高维聚类
英文关键词：Data mining ; Association rules ; Maximal frequent itemset ; Updating mining ; Sorted Frequent pattern tree ; High dimension cluster
学位年度：2009
导师：杨君锐
学科代码：081203
学位授予单位：西安科技大学
论文提交日期：2009-04-20

摘要

数据挖掘是指从大型数据库中发现潜在的、新颖的、有价值的、可用的及能被用户理解的模式和信息的过程。关联规则挖掘是数据挖掘的一个重要研究领域,主要是发现数据库中属性之间的关联关系。
     本文在广泛查阅国内外文献的基础上,针对关联规则算法的若干问题进行了深入地分析研究,论文的主要研究内容和成果如下:
     首先,提出了基于排序FP-Tree(Sorted FP-Tree,简称SFP-Tree)的最大频繁项目集挖掘算法SFP-Miner。在SFP-Miner算法中,通过两次扫描数据库将其中每个事务所包含的频繁项目压缩存储在SFP-Tree中。在挖掘过程中,充分利用SFP-Tree的特点,并采用合并子树和预剪枝策略在SFP-Tree上进行深度优先挖掘,而不需要扫描数据库,减少了算法在挖掘过程中使用的存储空间和计算时间。实验结果表明,该算法有较好的性能。
     其次,提出了基于完全合并SFP-Tree的最大频繁项目集更新挖掘算法UAMFI。该算法基于完全合并SFP-Tree,直接在树上进行深度优先搜索,能够快速地进行最大频繁项目集的更新挖掘。实验测试和结果分析,该算法可以高效的更新最大频繁项目集。
     最后,针对多值属性关联规则挖掘问题,提出了基于高维聚类的多值属性关联规则挖掘算法DBSMiner。该算法借鉴ARCS思想,先将高维数据集的各维进行划分,然后将密度单元进行排序,并提出一种基于网格的高维聚类算法对划分后的数据进行聚类挖掘。理论分析和试验结果表明,DBSMiner算法具有较好的执行效率和精确度,能有效的进行多值属性关联规则的挖掘。
Data mining means a process of finding nontrivial, extraction of implicit, pervious unknown and potential useful information from data in database. Association rule mining as an important field of data mining discovers interesting relationships among attributes in those data.
     By studying the literature domestic and abroad, we research some basic problems of association rules mining algorithms. The main contexts are showed as follows:
     Firstly, a maximal frequent itemset mining algorithm SFP-Miner, which based on Sorted FP-Tree was proposed. The SFP-Miner scanned Database twice and compress stored the frequent itemset in SFP-Tree. By using depth-first strategy, the algorithm pruned the searching space by pre-prune and mergence strategy and discovered all the maximal frequent itemset efficiently and didn’t need to scan the Database. The experimental result indicated that SFP-Miner is an efficient algorithm.
     Secondly, we presented a new updating algorithm, UAMFI, for mining maximal frequent itemsets from transaction database when minimum support was changed by customer. The algorithm adopted a new data structure FMSFP-Tree (Full Merged SFP-Tree) which stored all the frequent itemsets in any given minimum support and it directly mined and updated the maximal frequent itemsets in FMSFP-Tree. It can efficiently mine maximal frequent itemsets with changed minimum support. From the experimental result, we can conclude that the algorithm is highly efficient to the updating mining problems.
     Finally, we presented a new algorithm, DBSMiner (Density Based Sub-space Miner), for mining quantitative attributes association rule. This algorithm, which referenced the ARCS (Association Rule Clustering System), used a grid structure to quantize the object space into a finite number of cells; it sorted all the dense grids by descending order and used a grid based cluster algorithm to cluster the data with all attributes. At last, it clustered the association rules. Theoretical analysis and experimental results show that, DBSMiner algorithm has good performance and accuracy. It can effectively mine association rule of quantitative attributes.

引文

[1] Han J, Kanbe M. Data Mining: Concepts and Techniques [M]. Morgan Kaufman Publishers, San Francisco, CA, 2001
    [2] U. Fayyad. G. Piatetsky-Shapiro, Padhraic Smyth. Knowledge Discovery and Data mining Towards a Unifying Framework[C].Proceedings of Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. 1996
    [3] Agrawal R, Imielinshi T, Swami A. Mining association rules between sets of items in large database [C]. In: Proceedings of ACM SIGMOD Conference on Management of Data, Washton,DC. 1993:207-216
    [4] Han J, Jian P, Yiwen Y. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference Management of Data. Dallas, 2000:1-12
    [5] Savasere A, Omiecinski E, Navathe S. An efficient algorithm for mining association rules[C]. In: Proceedings of the 21st International Conference on VLDB.Zurich, 1995:432-444
    [6] Agrawal R, Srikant R. Fast algorithm for mining association rules[C]. In: Proceedings of the 20th International Conference on VLDB. Santigo, 1994:487-499
    [7] Mannila H, Toivonen H, Verkamo I. Efficient Algorithm for Discovering Association Rules[C]. In KDD-94:AAAI Workshop on Knowledge Discovery in Database, 1994:181-192
    [8] Toivonen H. Sampling large databases for association rules[C]. In: Proceedings of the 22th International Conference on Very Large Databases, Bombay, India, 1996:1-12
    [9] Bayardo R. Efficiently mining long patterns from databases[C]. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. New York: ACM Press, 1998:85-93
    [10] Lin Dao-I, Kedem Z.M. Pincer-Search: a new algorithm for discovering the maximum frequent set[C]. In: Proceedings of the 6th European Conference on Extending Database Technology. Heidel-berg: Springe-Verlag, 1998:105-119
    [11] Burdick D, Calimlim M, Gehrke J. Mafia: A maximal frequent itemset algorithm for transactional database[C]. In: International Conference on Data Engineering, 2001
    [12] Cheung D W. Maintenance of discovered association rules in large databases: an incremental updating technique[C]. In: Proceedings of the 12th International Conferenceon Data Engineering, New Orleans, Louisiana, 1996:106-114
    [13]冯玉才,冯剑琳.关联规则的增量式更新算法[J].软件学报, 1998, 9(4): 301-306
    [14] Quinlan J R. Induction of decision tree. Machine Learning, 1986, 1(1):81-106
    [15] Quinlan Ross J. C4.5:Programs for Machine Leaming[M]. SanMate, CA:Morgan Kaufmann Publishers, 1993
    [16] http://www.rulequest.com/
    [17] Wang W, Yang J, Muntz R R. STING: A Statistical Information Grid Approach to Spatial Data Mining[C].In Proceedings of the 23rd Conference on VLDB.Athens 1997:186-195
    [18] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications[C].In Proceedings of ACM SIGMOD International Conference on Management of Data.1998
    [19] G.Sheikholeslami.S. Chalterjee. and A. Zhang.WaceCluster: A multi-resolution clustering approach for very large spatial databases[C]. In Proceedings of 24th VLDB Conference. 1998:428-439
    [20] Carrier, B. and Spafford, E. Automated digital evidence target definition using outlier analysis and existing evidence[C]. In Proceedings of the 2005 Digital Forensics Research Workshop,2005
    [21] Vic Barnett and Tony Lewis. Outliers in Statistical Data. John Wiley and Sons, 2 edition, 1984
    [22] Yuan Bao-yang Li-wei, Lang Guo-qing. Method of predicting the gas-in-oil concentrations in transformers based on grey theory [J]. In Proceedings of the CSU-EPSA,2001,13(3):40-42
    [23] Terasvirta T. Van Dijk D. Medeiros M C. Linear models, smooth transition auto regressions, and neural networks for forecasting macroeconomic time series: a re-examination [J]. International Journal of Forecasting. 2005. 21(4):755-775
    [24] Antonie M L,Zaiane O R.Text Document Categorization by Term Association[C].In Proceedings of the IEEE 2002 Int’l Conf on Data Mining.2002:19-26
    [25] Beil F,Ester M,Xu X.Frequent Term-Based Text Clustering[C]. In Proceedings of the International Conference on Knowledge Discovery and Data Mining.2002:436-442
    [26] Xuan Liu, Hongmei He, and Ondrej S ykora.Parallel Genetic Algorithm and Parallel Simulated Annealing Algorithm for the Closest String Problem[C].In Proceedings of ADMA , LNAI 3584, 2005:591-597
    [27] Agrawal R, Shafer J. Parallel Mining of Association Rules [J]. IEEE Trans onKnowledge and Data Engineering. 19.8(6):962-969
    [28] Y. Guo and J. Sutiwaraphun, Knowledge probing in distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery, 1999:115-132
    [29] B.Babcock, S.Babu, M.Datar, R.Motwani, J.Widom. Models and issues in data streams[C]. In Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symp. On Principles of Database Systems. Madison: ACM Press, 2002:1-16
    [30] B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and issues in data stream systems.ACM PODS, Madison, Wisconsin, 2002:1-16
    [31] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports[C]. In Proceedings of VLDB, 2001:541–550
    [32] M Greenwald and S. Khanna. Space-efficient online computation of quantile Summaries[C]. In Procedings of ACM SIGMOD, Santa Barbara, California, United States ,2001:58-66
    [33] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments[C]. In Proceedings of the 1996 Annual ACM Symp. On Theory of Computing, 1996:20–29
    [34] G.S.Manku and R.Motwani. Approximate Frequency Counts over Data Streams, VLDB,2002:346-357
    [35] J.Chang and W. Lee. A Sliding Window Method for Finding Recently Frequent Itemsets online Data Streams. Journal of Information Science and Engineering, Vol. 20. No.4.July. 2004
    [36] P. Domingos and G Hulten. Mining High-Speed Data Streams. In Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining,2000:71-80
    [37]王伟平,李建中,张冬冬等.基于滑动窗口的数据流连续J-A查询的处理方法[J].软件学报,2006,17(4):740-749
    [38] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000
    [39] Burl,MC.etc.Mining for image content[C].In:Systemics,Cybernetics, and Informatics/Information Systems: Analysis and Synthesis,(Orlando,FL),1999:1-8
    [40] Osmar R. Za?ane, Jiawei Han, Ze-Nian Li, Jean. Hou, Mining Multimedia Data. In Proceedings of CASCON’98: Meeting of Minds, Toronto, Canada, 1998: pp: 83-96
    [41] Li Zhang,etc.An information-driven Framework for Image Mining[C].In:Proc. of 12thInt’l Conf. on Database and Expert Systems Application(DEXA) ,Munich,Germany ,2001:232-242
    [42]员巧云,程刚.近年来我国数据挖掘研究综述[J].情报学报,2005,24(2):250-256
    [43] C.Ordonez,E.Omiecinski.Discovering association rules based on image content[C].proceedings of the IEEE Advances in Digital Libraries Conference(ADL’99),1999
    [44] Petra Perner, etc. Mining Images to Find General Forms of Biological Objects[C]. In: ICDM , 2004:60 - 68.
    [45] Tom Soukup Lan Davidson.可视化数据挖掘--数据可视化和数据挖掘的技术与工具[M].朱建秋,蔡伟杰译.北京:电子工业出版社,2004:60-66
    [46] Keim, DA,Kriege,HP.Issues in Visualizing Large Database. Visual Databases sysytems,1995,(4):72-73
    [47] Agarwal, R., and Srikant, R. 2000. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, ACM Press, 439-450
    [48] Claerhout, B., and DeMoor, G.J.E. 2005. Privacy protection for clinical and genomic data: The use of privacy-enhancing techniques in medicine. International Journal of Medical Informatics, 74(2-4):257-265
    [49] Clifton, C., Marks, D. 1996. Security and privacy implications of data mining. In Workshop on Data Mining and Knowledge Discovery, 15-19, Montreal, Canada
    [50] Brin S, Motwani R, Ullman J D et al. Dynamic itemset counting and implication rules for market basket analysis. In: Proceedings of 1997 ACM-SIGMOD International Conference on Management of Data. Tucson, AZ, 1997: 255-264
    [51] Park J S,Chen M S,Yu P S. An Effective Hash-Basede Algorithm for Mining Association Rules. In: Proceedings of ACM SIGMOD International Conference Management of Data,San Jose,CA,1995: 175-186
    [52] Bayardo R.Efficiently mining long patterns from databases.In:Haas LM,ed.Proceedings of the ACM SIGMOD International Conference on Management of Data, New York: ACM Press, 1998 :85-93
    [53] Karam Gouda and Mohammed J. Zaki. Efficiently Mining Maximal Frequent Itemsets. In Proceedings of the 1st IEEE International Conference on Data Ming, San Jose, USA, 2001: 163-170
    [54] R.Agarwal,C.Aggarwal,V.Prasad.Depth first generation of long patterns [C]. In: R. Ramakrishnan,S. Stolfo,eds. Proc. of 6th ACM-SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining.Boston:ACM Press,2000:108-118
    [55] D. Burdick,M. Calimlim, J. Gehrke.Mafia: A maximal frequent itemset algorithm for transactional databases[C]. In: D. Georgakopoulos,etal. Proc. of the 17th Int’l Conf. on Data Engineering Heidelberg:IEEE Press,2001:443-452
    [56]朱玉全,孙志挥,季小俊.基于频繁模式树的关联规则增量式新新算法[J].计算机学报,2003;26(1):91-96
    [57]周海岩.关联规则的开采和更新.软件学报,1999, 10(10):1079-1084
    [58]路松峰,卢正鼎.快速开采最大频繁项目集.软件学报, 2001,12(2):293-297
    [59]宋余庆,朱玉全,孙志挥.基于FP-Tree的最大频繁项目集挖掘及更新算法.软件学报, 2003,14(9):1586-1592
    [60]杨君锐,赵群礼.一种不产生候选集的最大频繁集快速挖掘算法[J].微电子学与计算机,2004,21(11):125-128
    [61]秦亮曦,史忠植. SFPMax——基于排序FP树的最大频繁模式挖掘算法[J].计算机研究与发展, 2005,42(02) :217-223
    [62] Gouda K,Zaki M.J. Efficiently mining maximal frequent itemsets. In: Proceedings of the 1st IEEE International Conference on Data Mining,2001:163-170
    [63] Zhou Q.H,Wesley C,Lu B.J. SmartMiner:A depth first algorithm guided by tail information for mining maximal frequent itemsets. In: Proceedings of the IEEE International Conference on Data Mining (ICDM 2002),2002:570-577
    [64]吉根林,杨明,宋余庆等.最大频繁项目集的快速更新[J].计算机学报,2005,28(1) :128-135
    [65]阮幼林,李庆华,刘干.最大频繁模式的快速挖掘与更新算法[J].计算机工程与应用,2005,41(24):23-26
    [66]杨君锐,赵群礼.基于FP-Tree的最大频繁项目集更新挖掘算法[J].华中科技大学学报(自然科学版),2004,32(11):88-90
    [67] Rymon R.Search through systematic set enumeration[C].In Proceeding of Third International Conference on Principles of Knowledge Representation and Reasoning. 1992:539- 550
    [68]杨明.一种划分多值属性的关联规则挖掘算法[J].计算机工程,2002,28(6):13-14
    [69] Srikant R, Agrawal R. Mining quantitative association rules in large relational tables. In: Proceedings of the ACM SIGMOD Conference on Management of Data. Montreal, Canada, June 1996:1-12
    [70]张朝晖,陆玉昌,张钹.发掘多值属性的关联规则[J].软件学报, 1998,9(11): 801-805
    [71] Miller, R. J.,Yang, Y.Association rules over interval data.In: Peckham, J.,ed. Proceedings of the ACM SIGMOD International Conference on Management ofData.Tucson, Arizona: ACM Press, Tucson, Arizona,1997:452-461
    [72] Wang, Ke,Soon,Hock William Tay, Liu Bing.Interestingness-Based interval merger for numeric association rules. In: Agrawal, R., Stolorz. P., eds.Proceedings of the 4th International Conference on Knowledge Discovery and Data Ming.New York: AAAI Press,1998:121-128
    [73] http://archive.ics.uci.edu/ml/
    [74] W. Lian, D.W. Cheung, and S.M. Yiu.An efficient algorithm for finding dense regions for mining quantitative association rules. Computers and Mathematics with Applications 50 (3-4) (2005) :471-490

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700