基于FP-树的最大频繁模式挖掘算法研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
从大型数据库中挖掘关联规则是数据挖掘领域中非常重要的研究课题。其中,最大频繁模式挖掘问题在关联规则挖掘任务中扮演着重要的角色,具有广泛的应用前景。
     FP-树是算法FP-growth中提出的新的数据结构。借助于FP-树结构,算法FP-growth采用不同于Apriori系列算法的候选产生测试方法而采取模式增长方法挖掘频繁模式,取得了很好效果。
     本文主要在以下几个方面对基于FP-树的最大频繁模式挖掘问题进行研究:第一是提出了基于FP-树的最大频繁模式挖掘算法FP-Max。在该算法中,我们首先介绍了FP-树的定义和构造过程,并分析了基于FP-树进行挖掘的可行性和完整性;然后我们提出基于FP-树的最大频繁模式挖掘算法FP-Max,试验表明算法FP-Max在挖掘密集型、频繁模式较长的大数据集时是有效的。第二是提出FP-树驻留磁盘的最大频繁模式挖掘算法FP-Max-Disk。算法FP-Max运行的前提是构造的FP-树能够驻留内存,但是当事务数据库TDB很大或者设置的最小支持度阀值min_sup很小时,那么构造驻留内存的FP-树将是不现实的。为此,我们首先将原事务数据库TDB划分为一系列投影数据库,然后将每个投影数据库构造为能够装入内存的条件FP-树,最后基于这些条件FP-树挖掘最大频繁模式。第三是研究探讨了
    
    基于FP一树的最大频繁模式并行挖掘问题。借助于多局部频繁模式树
    和并行投影技术,本文提出了两种基于共享内存计算模型的最大频繁
    模式并行挖掘算法。根据理论分析,这两种并行算法在采用了新的数
    据结构和简单的动态负载平衡技术后,可以实现各处理器独立异步运
    行、较小的1/O开销以及良好的负载平衡。
引文
[1] J.Han and M.Kambe. Data mining: Concepts and techniques. Morgaan Kaufmann Publishers,San Francisco, CA, 2001
    [2] R.Agrawal, et al. Mining association rules between sets of items in large databases. Proc. ACM SIGMOD int'l conf. Management of data, Washington, DC, May 1993, 207~216
    [3] R.Agrawal, R. Srikant. Fast algorithms for mining association roles. Proc. 20th int'l conf. very large databases, Santiago, Chile, Sept. 1994, 497~499
    [4] J.S. Park, et al. Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on knowledge and data engineering, 1997, 9(5),813~825
    [5] A.Savasere, E.Omiecinski, and S.Navathe. An efficient algorithm for mining association rules in large databases. Proceedings of the 21st International Conference on very large Databases, Zurich, Switzerland, Sept. 1995,432~444
    [6] H.Toivonen. Sampling large databases for association rules. Proceedings of the 22nd International Conference on Very Large Database, Bombay, India, Sep. 1996,134~145
    [7] M.J.Zaki et al. New algorithms for fast discovery of association rules. Proc. 3rd Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1997, 283~86
    [8] J.Han, J.Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc.2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), Dalas, TX, May 2000,1~12
    [9] A.K.Jain, M.N.Murty, and P.J. Flynn. Data clustering: A survey. ACM Comput. Sum., 31,1999,264-323
    [10] D.Burdick, M.Calimlim, and J.Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In int'l Conf. on Data Engineering, Apr. 2001
    [11] Srinivasan, Parthasarathy, and S.Dwarkadas. Shared state for distributed interactive data mining applications. In the International Journal on Distributed and Parallel Databases, March2002
    [12] R.Agrawal and J.Shafer. Parallel mining of association rules. IEEE Trans. Knowledge and Data Eng, Vol. 8, No. 6, Dec. 1996, 962~969
    [13] J.S.Park, M.Chen, and P.S.Yu. Efficient parallel data mining for association rules. Proc. ACM Int'l Conf. Information and Knowledge Management, ACM Press, New York, 1995, 31~36
    [14] D.W. Cheung, et al. Efficient mining of association rules in distributed databases. IEEE Transactions on knowledge and data engineering, 1996, 8(6), 911~922
    [15] Osmar R. Zaane, Mohammad EI-Hajj, and Paul Lu. Fast parallel association rule mining
    
    without candidacy generation. In Proc. of the IEEE 2001 International Conference on Data Mining(ICDM'2001), San Jose, CA, USA, November 29-December 2, 2001
    [16] David Wai-Lok Cheung, et al. A general incremental technique for maintaining discovered association rules. In proc. of DASFAA'97, Apr. 1997,185~194
    [17] N.F. Ayan, A.U.Tansel, and Erol Arkun. An efficient algorithm to update large itemsets with early pruning. ACM SIGKDD Intl. Conf. on Knowledge Discovery in Data and Data Mining (SIGKDD'99), San Diego, California, August 1999
    [18] D.-I.Lin and Z.M.Kedem. Pincer-Search: A new algorithm for discovering the maximum frequent set. In Proc. of the Sixth European Conf. on Extending Database Technology, 1998
    [19] C.I.Ezeife and Yue Su. Mining incremental association rules with generalized FP-tree. Proceedings of the fifteenth Canadian Conference on Artificial Intelligence, AI 2002, May, 2002, Calgary
    [20] C.H.Cai et al. Mining association rules with weighted items. Proceedings of IEEE International Database Engineering and Applications Symposium (1998), 68~77
    [21] M.J.Zaki et al. Parallel algorithms for fast discovery of associaiton rules. Data Mining and KnowledgeDiscovery. An int'l J. Vol. 1, No.4, Dec. 1997, 343~373
    [22] R.J.Bayardo. Efficiently mining long patterns from databases. In Proc. of the 1998 ACM-SIGMOD Int'l Conf. on Management of Data, 85-93
    [23] J.Han, Y. Fu. Discovery of multiple-level association rules from large databases. Proc. of the 21th int'l conf. on very large databases, Zurich, Switzerland, Sept. 1995,407~419
    [24] J.Pei, J.Han, and L.V.S.Lakshmanan. Mining frequent itemsets with convertible constraints. In Proc. 2001 int'l conf. Data Engineering (ICDE'01), Heidelberg, Germany, April 2001
    [25] J. Han, J. Chiang et al. DBMiner: A system for data mining in relational databases and data warehouses. In Proc. of CASCON'97: Meeting of Minds. Toronto, Canada, 1997, 249~260
    [26] S.Brin, R.Motwani, and C.Silverstein. Beyond market basket: Generalizing association rules to correlation. Proc. 1997 ACM-SIGMOD int. conf. Management of Data, Tucson, Arizona, May 1997, 265~276
    [27] M. J. Zaki. Parallel and distributed association mining: A survey. 1999, IEEE Concurrency, 7(4), 14~25
    [28] Jian Pci. Pattern-growth methods for frequent pattern mining [doctor thesis]. Simon Fraser University, June 13, 2002
    [29] 肖利,金远平,徐宏炳等.基于多维标度的快速挖掘关联规则算法.软件学报,1999,10(7),749~753
    [30] 铁治欣,陈奇,俞瑞钊.采掘关联规则的高效并行算法.计算机研究与发展,1999,36(8),
    
    948~953
    [31] 周欣,沙朝锋,朱扬勇等.兴趣度——关联规则的又一个阀值.计算机研究与发展,2000,37(5),627~633
    [32] 程继华,施鹏飞.多层次关联规则的有效挖掘算法.软件学报,1998,9(12),937~940
    [33] 杨明,孙志辉,赵传申.交易数据库的加权关联规则增量更新算法.计算机工程与应用,2002,38(1),71~73
    [34] 游湘涛,叶施仁,史忠植.多策略通用数据采掘工具MSMiner.计算机研究与发展,2001,38(5),581~586
    [35] 江卓军,谢康林,张文杰.一种新的关联规则挖掘思想.计算机工程,2002,28(4),88~90
    [36] 周斌,吴泉源.序列模式挖掘的一种渐进算法.计算机学报,1999,22(8),882~887
    [37] 陈国良.并行算法的设计与分析,第一版,北京:高等教育出版社,1994,5
    [38] S.M.Weiss and C.A.Kulikowski. Computer systems that learn: Classification and prediction methods from statics, nueral nets, machine learning, and expert systems. San Mateo, CA: Morgan Kaufmann, 1991
    [39] Roberto J. Bayardo Jr. and Rakesh Agrawal. Mining the most interesting rules. In Proc. of the Fifth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 145-154, 1999
    [40] R.Agareal, C.Aggarwal, and V.V.V. Prasad. A tree projection algorithm for generation in large databases. In Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000
    [41] S.Orlando, P. Palmerini, and R.Perego. Enhancing the apriori algorithm for frequent set counting. Proceedings of 3rd International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2001), September 5-7, 2001, Munich, Germany, Lecture Notes In Computer Science, Springer, 2001
    [42] John D.Holt and Soon M.Chung. Mining association rules using inverted hashing and pruning. Information Processing Letters, 2002, 83(4), 2114220
    [43] D.Dunopulos, H.Mannila, and S.Saluja. Discovering all the most specific sentences by randomized algorithms. In intl. Conf. on Database Theory, Jan, 1997
    [44] 李立羽,施鹏飞.OLAP关联规则挖掘.计算机工程与应用,2002,38(3),128~130
    [45] Szymon Jaroszewicz and Dan A.Simovici. A general measure of rule interestingness. In Proceedings of the 5th European Conference on Principles and Practice of Knowledge Discovery and Data Mining, 1999, 253~265
    [46] Jian Pei, Jiawei Han, and Runying Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
    
    Discovery, 2000, 21~30
    [47] J. Pei et al. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proc. 2001 Int. Conf. Data Engineering (ICDE'01), Heidelberg, Germany, April, 2001, 215-224
    [48] J. Pei, A.K.H.Tung, and J. Han. Fault-tolerant frequent pattern mining: Problems and challenges. In Proc. 2001 ACM-SIGMOD Int. Workshop Data Mining and Knowledge Discovery (DMKD'01), Santa Barbara, CA, May 2001
    [49] S. Ananthanarayana, D.K.Subramanian, and M.N.Murty. Scalable, distributed and dynamic mining of association rules. HIPC 2000, 559~566
    [50] Li Shen, Hong Shen, Ling Cheng. New algorithms for efficient mining of association rules. Information Sciences 118(1-4), 1999, 251~268
    [51] David Skillicorn. Strategies for parallel data minng. IEEE Concurrency, vol.7, No.4 1999
    [52] Ruoming Jin and Gagan Agrawal. Shared memory parallelization of data mining algorithms: Technques, programming interface, and performanc. In Proc. of the second SIAM Conference on Data Mining, April 2002
    [53] M.J.Zaki and Yi Pan. Introduction: Recent developments in parallel and distributed data mining. Distributed and Parallel Databases, 11, 2002, 123~127
    [54] Eui-Hong Han, George Karypis, and Vipin Kumar. Scalable parallel data mining for association rules. IEEE Transeations on Knowledge and Data Engineering, Vol.12, No.3, May/June, 2000
    [55] David W. Cheung, Sau D.Lee, and Yongqiao Xiao. Effect of data skewness and workload balance in parallel data mining. IEEE Transcations on Knowledge and Data Engineering, Vol. 14, No.3, May/June, 2002
    [56] J.R.Quinlan. C4.5: Programs for machine learning. San Mateo, CA:Morgan Kaufmann, 1993
    [57] P.E.Utgoff. An incremental ID3. In proc. 5th Int. Conf. Machine learning, San Mateo, CA, 1988, 107~120
    [58] E. Knorr and R.Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases(VLDB'98),New York, Aug. 1998, 392~403
    [59] S.Lawrence, C.L.Giles, and A.C.Tsoi. Symbolic conversion, grammatical inference and rule extraction for foreign exchange rate prediction. In Y. Abu-Mostafa, A.S.Weigend, and P.N Refenes, editors, Neural networks in the Captial Markets, Singapore: World Scientifc, 1997
    [60] M.W. Craven and J.W. Shavlik. Using neural networks in data mining. Future Generation Computer Systems, Vol.13, 1997, 211~229
    [61] J.H.Friedman. Multivariate adaptive regression. Annals of Statistics, Vol. 19, 1991, 1~141
    
    
    [62] Wesley Romao, Alex A.Freitas, and Roberto C.S.Pacheco. A genetic algorithm for discovering interesting fuzzy prediction roles: Applications to science and technology data.
    [63] 严小卫,蒋运承.模糊数据挖掘.小型微型计算机系统,2001,22(4),504~506
    [64] 李永敏,朱善君,陈湘晖等.基于粗糙集理论的数据挖掘模型.清华大学学报(自然科学版),1999,39(1),110~117
    [65] Steven Noel, Vijay V. Raghavan, and Chee-Hung Henry Chu. Visualizing association mining results through hierarchical clusters, ICDM 2001, 425~432
    [66] http://www-4.ibm.com/sofiware/data/miner.html
    [67] http://www.sas.com/sofiware/components/miner.html
    [68] http://www.sgi.com/software/mineset
    [69] http://www.isl.co.uk/clem.html
    [70] http://www.dbminer.com
    [71] http://www-aig.jpl.nasa.gov/public/mls/skicat/skicat home.html
    [72] M.J.Zaki. SPADE: An efficient algorithm for mining frequent sequences. In Machine Learning Journal, special issue on Unsupervised Learning (Doug Fisher, ed.), Vol. 42 Nos. 1/2, Jan/Feb 2001, 31~60
    [73] M.J.Zaki. Parallel sequence mining on shared-memory machines. In Journal of Parallel and Distributed Computing, special issue on High Performance Data Mining (Vipin Kumar, Sanjay Ranka, Vineet Singh, eds.), Volume 61, No. 3, March, 2001, 401~426

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700