最大频繁项集和频繁基项集挖掘算法研究

设为首页

收藏本站

网站地图 | English | 公务邮箱

远程访问

NSTL服务站

最大频繁项集和频繁基项集挖掘算法研究

详细信息本馆镜像全文| 推荐本文 | | 获取CNKI官网全文

英文题名：Research on Mining Algorithms of Maximal Frequent Itemsets and Opened Frequent Itemsets
作者：韩崇
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：关联规则 ; 最大频繁项集 ; 位置向量 ; 布尔矩阵 ; 极大布尔关联规则 ; 频繁基项集
英文关键词：association rules ; maximal frequent itemset ; position vector ; Boolean matrix ; maximal Boolean association rules ; opened frequent itemsets
学位年度：2010
导师：姜保庆
学科代码：081203
学位授予单位：河南大学
论文提交日期：2010-05-01

摘要

数据挖掘是计算机科学的一个领域,目的是通过分析快速增长的商业、科学和工程数据来获取知识和其他利益,这个领域正在迅猛增长和发展。关联规则的挖掘是数据挖掘课题中的一个重要分支,它被用来描述事务数据库中属性间存在的潜在关系,近年来已经成为数据挖掘中一个相当活跃的领域。
     频繁项集挖掘是关联规则挖掘中的最重要的任务。本文对频繁项集的一种紧凑表示——最大频繁项集挖掘和如何在不丢失规则信息基础上减少关联规则生成数量问题进行研究。研究的主要内容包括以下几个方面:
     1.对有限个有限链直积下集极大元挖掘算法——Boundary算法进行了深入研究,它是一个深度优先算法,可以用在最大频繁项集这类具有位置格下集性质挖掘问题中。
     2.提出了一种用位置向量精确表示项目集的深度优先算法GMPV来挖掘最大频繁项集,算法中使用布尔矩阵方法来进行事务数据库映射,在频繁项集生成过程中通过超集存在检测和基于超集支持度计算等方法提高算法效率,通过实验验证了GMPV算法的有效性。
     3.在分析频繁基项集的定义和性质的基础上提出了频繁基项集挖掘的剪枝策略,设计了频繁基项集的挖掘算法,它可以用来进行极大布尔关联规则生成,并且根据极大布尔关联规则的性质简化了基于频繁闭项集和频繁基项集的极大布尔关联规则的生成算法。
Data mining is an area in computer science that aims to analyze the rapidly increasing amounts of business, scientific, and engineering data for knowledge and other profitable uses. Association rules mining is an important branch of data mining, which is used to describe the implicit relationship of the attributes in the transactional databases and has become an quite active field in the research of data mining in recent years.
     Mining frequent itemsets is the most important task of mining association rules. In this paper, the problems of mining maximal frequent itemsets, which is a compact representation of frequent itemsets and how to reduce the quantity of the association rules without lose information of association rules are researched. The main research is as follows:
     1. The Boundary algorithm whose function is to find the maximal elements of a down-set of direct product of finite finite-chains is thoroughly studied. The Boundary is a depth-first search algorithm, and it can be used to mining the maximal frequent itemsets problem, which has the property of down-set of position lattice.
     2. A new depth-first search algorithm, called GMPV, which accurately displays the itemset based on position vector, has put forward to mine the maximal frequent itemsets. In GMPV, the transaction database were mapped to a Boolean matrix and superset checking and pruning method based on support also were used to increase the algorithm efficiency. The experiment results displayed that the algorithm is validity.
     3. Based on the analysis of the definition and property of opened frequent itemset, the opened frequent itemset mining algorithm was designed, which can use to generate the maximal Boolean association rules. According to the property of the maximal Boolean association rules, this paper predigested the generating algorithm of maximal Boolean association rules.

引文

[1] R. Agrawal, T. 1mielinski, and A. Swami. Database mining:A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5:914-925, 1993.
    [2] R. Agrawal, T. 1mielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Intl. Conf. Management of Data, pages 207-216, Washington, DC, 1993.
    [3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. of the 20th Int. Conf. on Very Large Data Bases (VLDB'94), pages 487-499. Santiago, Chile, September 1994.
    [4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307-328. MIT Press, 1996.
    [5] C. C. Aggarwal, Z. Sun and P. S. Yu. Online Generation of Profile Association rules. In Proc. of the 4th Intl. Conf. on Knowledge discovery and Data Mining, pages 129-133, New York, NY, August 1996.
    [6] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns. In Proc. of the 6th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining. pages 108-118. Boston, MA, USA, 2000.
    [7] R. Bayardo. Efficiently mining long patterns from databases. In Proc. of 1998 ACM SIGMOD Intl. Conf. on Management of Data, pages 1998. 85-93, Seattle, WA June 1998.
    [8] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: A maximal frequent itemset algorithm for transactional databases. In Proc. of the 17th Intl. Conf. on Data Engineering. pages 443-452, Heidelberg, Germany, 2001.
    [9] K. C. C. Chan, and W. H. Au. Mining fuzzy association rules. In Proc. of 6th Intl. Conf. On Information and Knowledge Management. Las vegas, pages. 209-215, Nevada, USA, 1997.
    [10] C. H. Cai, A. Fu, C. H. Cheng, and W. W. Kwong. Mining association rules with weighted items.In Proc. of the Intl. Database Engineering and Applications Symposium. pages 68-77, Cardiff, Wales, 1998.
    [11] P. Dokas, L. Ertoz, V. kumar, A. Lazarevic, K. Srivastava, and P. N. Tan. Data Mining for Network intrusion Detection. In Proc. NSF Workshop on Next Generation Data Mining,Baltimore, MD, 2002.
    [12] T. Fukuda, Y. Morimoto, S. Morishita. Data mining using two dimensional optimized association rules: Scheme, algorithms and visualization. In Proc. of the SIGMOD Intl. Conf. on Management of Data, Montreal, Canada, 1996, pp. 13-24.
    [13] K. Flip, L. Alexandros, K. Yannis and F. Christos. Ratio Rules: A New Paradigm for Fast, Quantifiable Data Mining. In Proc. of the 24th Int. Conf. on Very Large Data Bases (VLDB), New York, USA, August 1998, pp. 582-593.
    [14]方炜炜,杨炳儒,宋威,侯伟.基于布尔矩阵的关联规则算法研究[J].计算机应用研究, 2008, 25(7):1964-1966.
    [15] K. Gouda, and M. J. Zaki. Efficiently mining maximal frequent itemsets. In Proc. of the 1st IEEE Intl. Conf. on Data Mining, pages 163-170, San Jose, USA, 2001.
    [16] G. Grahne, and J. Zhu. Efficiently Using Prefix-trees in Mining Frequent Itemsets. In FIMI’03 Workshop on Frequent Itemset Mining Implementation, Melboume, Florida, USA, November, 2002.
    [17] G. Grahne, and J. Zhu. High performance mining of maximal frequent itemsets. In Proc. of the 6th SIAM Intl. Workshop on High Performance Data Mining(HPDM 2003). pages 135-143, San Francisco, USA, 2003.
    [18] J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database, In Proc. of 1999 Intl. Conf. on Data Engineering (ICDE'99), pages 106-115, Sydney, Australia, March 1999.
    [19] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation, In Proc. of 2000 ACM-SIGMOD Intl. Conf. on Management of Data (SIGMOD’00). pages 1-12. Dallas, TX, May 2000.
    [20] J. Han, and M. Kamber.数据挖掘:概念与技术(原书第2版)[M].范明,孟小峰译.机械工业出版社. 2007, 3.
    [21] T. P. Hong, C. S. Kuo and S. C. Chi. Trade-off between time complexity and number of rules for fuzzy mining from quantitative data. International Journal of Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 9, No. 5, 2001, pp. 587-604.
    [22] T. P. Hong, C. S. Kuo and S. L.Wang. A fuzzy AprioriTid mining algorithm with reduced computational time, Applied soft computing, 2004, 5:1-10.
    [23] Y. Huang, S. Shekhar, and H, Xiong. Discovery Co-Location Patterns from Spatial Databases: A General Approach. IEEE Transactions on Knowledge and Data Engineering, 16(12):1472-1485,December 2004.
    [24]吉根林,孙志挥.一种基于可信度最优的数量关联规则挖掘算法.东南大学学报(自然科学版), 2001, 31(2):31-34.
    [25]姜保庆.关于弱比例规则的挖掘与推理研究.西南交通大学博士论文. 2005.
    [26] B. Jiang, Y. Xu, Q. Wei and K. Wu. Weak Ratio Rules Between Nonnegative Real-valued Data in Transactional Database. In Proc. of the IEEE Intl. Conf. on Granular Computing, Beijing, China, July 25-27, 2005. pp. 488-491.
    [27]姜保庆,李建,徐扬.布尔关联规则集的结构[J].河南大学学报(自然科学版), 2006, 36(1):88-90.
    [28] B. Jiang, R. Li, and J. Zhao. A Strategy of Finding Maximal Cutting Plans. Journal of Computational Information Systems. 2007, 3(1):377-380.
    [29] B. Jiang, C. Han, and X. Hu. A finite ranked poset and its application in visualization of association rules. In Proc. of the IEEE Intl. Conf. on Granular Computing, Hangzhou, China, August 26-28, 2008. pp. 322-325.
    [30] B. Jiang, C. Han, and L. Li. Mining opened frequent itemsets to generate maximal Boolean association rules. In Proc. of the IEEE Intl. Conf. on Granular Computing, Nanchang, China, August 17-19, 2009. pp. 274-277.
    [31] S. Ju, and C. Chen.MMFI: an effective algorithm for mining maximal frequent itemsets[C]. In Proc. of the IEEE Intl. Symposium on Information Processing, EEE CS Press, pages 26-32, 2008.
    [32] W. Li, J, Han, and J, Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-association Rules. In Proc. of the 2001 IEEE Intl. Conf. on Data Mining, pages 369-376, San Jose, CA, August 1997.
    [33] B. Liu, W, Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. In Proc. of the 4th Intl. Conf. on Knowledge Discovery and Data Mining, pages 80-86, New York, NY, August 1998.
    [34] D. Lin, Z. M. Kedem. Pincer-Search: A new algorithm for discovering the maximum frequent set. In Proc. of the 6th European Conf. on Extending Database Technology. Heidelberg: Springer-Verlag, 1998. 105-119.
    [35] J. Li, H. Shen, and T. Rodney. Mining the Smallest Association Rules Set for Predictions[A]. In Proc. of the 2001 IEEE Intl. Conf. on Data Mining, pages 361-368, 2001.
    [36]路松峰,卢正鼎.快速开采最大频繁项目集[J].软件学报, 2001, 12(2):293-297.
    [37]刘耀和,胡宝清.模糊关联规则及其挖掘算法[J].应用科学学报, 2003, 21(4): 373-376.
    [38]马光志,崔荣晓.基于覆盖运算挖掘最小规则集[J].计算机工程与科学, 2005, 27(6): 65-69.
    [39]马莉,任学军,韩崇,刘亚雷.极大布尔关联规则的挖掘算法[J].郑州大学学报(理学版), 2008, 40(4):39-43.
    [40] A. Ozgur, P. N. Tan, and V. Kumar. RBA: An Integrated Framework for Regression based on Association Rules. In Proc of the SIAM Intl. Conf. on Data Mining, pages 210-221, Orlando, FL, April 2004.
    [41] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. of the 7th Intl. Conf. on Database Theory (ICDT'99), pages 398-416, Jerusalem, Israel, Januay 1999.
    [42] S, Parthasarathy and M, Coatney. Efficient Discovery of Common Substructure in Macromolecules. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining, pages 362-369, maebashi City, Japan, December 2002.
    [43] R. Srikant and R. Agrawa. Mining quantitative association rules in large relation tables. In Proc of the 1996 ACM-SIGMOD Int. Conf. On Management of Data (SIG-MOD’96), pages 1-12. Montreal, Canada, Jun 1996.
    [44]宋余庆,朱玉令,孙志辉等.基于FP-Tree的最大频繁项集挖掘及其更新算法[J].软件学报, 2003, 14(9):1586-1592.
    [45]孙东,黄天戌,秦丙栓等.基于模糊数据挖掘与遗传算法的异常检测方法[J].计算机应用, 2006, 26(1): 210-215.
    [46] P. N. Tan and V. Kumar. Mining Association Patterns in Web Usage Data. In Proc. of the Intl. Conf. on Advances in Infrastructure for e-Business, e-Education and e-Medicine on the Internet, L’Aquila, Italy, January 2002.
    [47] P. Tan, M. Steinbach, and V. Kumar.数据挖掘导论[M].范明,范宏建译.人民邮电出版社. 2006, 5.
    [48]王新,王湄生.关联规则挖掘中的关联推理[J].云南民族学院学报(自然科学版), 2001, 10(3):373-379.
    [49] H. Wang, Q. Li. An improved maximal frequent itemset algorithm. In: Wang GY, eds. In Proc. of the Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, the 9th Int’l Conf. (RSFDGrC 2003). LNCS 2639, Heidelberg: Springer-Verlag, 2003. 484-490.
    [50] Q. Wei, B. Jiang, K. Wu, W. Wang. A kind of Weak Ratio Rules for forecasting Upper Bound. In Proc. of the 7th Int. FLINS Conf. on Applied Artificial Intelligence (FLINS2006), August 29-31,2006, Genova, Italy, pp.185-192.
    [51] H. Xiong, X. He, C.Ding, Y. Zhang, V. Kumar, and S. R. Holbrook. Identification of Functional Modules in Protein Comlplexes via Hyperclique Pattern Discovery. In Proc. of the Pacific Symposium on Biocomputing (PSB’2005), Maui, January 2005.
    [52] H. Xiong, M. Steinbach, P. N. Tan, and V. Kumar. HICAP: Hierarchical Clustering with Pattern Preservation. In Proc. of the SIAM Intl. Conf. on Data Mining. pages 279-290, Orlando, FL, April 2004.
    [53]谢翠华,沈洁,李云.一种基于FP-tree的最小预测集新算法[J].计算机工程, 2006, 32(6):82-85.
    [54] X, Yan and J, Han. gSpan: graph-based Substructure pattern Mining. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining, pages 721-724, maebashi City, Japan, December 2002.
    [55]颜跃进,李舟军,陈火旺.基于FP-Tree有效挖掘最大频繁项集[J].软件学报, 2005, 16(2): 215-222.
    [56] M. J. Zaki, and M. Orihara. Theoretical foundation of association rules. In Proc. of the 1998 ACM SIGMOD Workshop on Research Issue in Data Mining and Knowledge Discovery, Seattle, WA, June 1998.
    [57] M. J. Zaki. Generating Non-Redundant Association Rules. In Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 34-43, Boston, MA, August 2000.
    [58] Q. Zou, W. Chu, B. Lu. SmartMiner: A depth first algorithm guided by tail information for mining maximal frequent itemsets. In Proc. of the IEEE Intl. Conf. on Data Mining, (ICDM’02). pages 570-577, 2002.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700