关联规则并行算法的研究与分析

作者：宋伟
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：数据挖掘 ; 关联规则 ; 频繁项集 ; 并行算法 ; 集群系统
英文关键词：Data Mining ; Association Rule ; Frequent Itemset ; Parallel Algorithm ; Cluster System
学位年度：2006
导师：范明
学科代码：081202
学位授予单位：郑州大学
论文提交日期：2006-11-10

摘要

随着信息技术的迅猛发展，要从日益庞大和复杂的数据中发现有价值的信息和知识，达到为决策服务的目的，已成为非常艰巨的任务。数据挖掘技术在此背景下应运而生。关联规则挖掘是数据挖掘中的一个重要分支，也是目前应用最广泛的一种数据挖掘类型。目前传统的关联规则挖掘技术大多采用串行算法，随着数据库规模的增大以及分布式数据库的发展，研究并行算法以更好地适应实际需求逐渐成为人们所关注的目标。
     本文在探讨数据挖掘的基本知识的基础上，对各种传统的串行算法进行对比分析，总结它们的优缺点，说明进行并行挖掘关联规则的必要性；结合集群系统特点，介绍了并行体系结构，探讨了并行编程模式及方法。并行关联规则的代表算法各有特点，论文对算法的基本思想进行了介绍，并对比分析了不同算法的性能特点。并行算法对大型数据库的处理明显优于串行算法，但是，现在的并行算法仍然有许多不完善的地方，存在一些需要解决的问题。
     并行算法对并行机的依赖性很强，在一台并行机上有效的算法在别的不同结构的并行机上可能效果并不好，现有的算法并不完全适合集群系统。在集群环境下，设计并行算法时，为尽可能减少通信量，应采用数据并行的思想。论文结合集群特点，提出了在集群环境下采取基于主从(Master／Slave)模式的数据并行策略来并行挖掘关联规则，并对性能进行了分析。
With the rapid development of information technology, it becomes more difficult and urgent to mine useful information and knowledge automatically in more and more larger amounts of data to support the strategies. The technology of data mining emerges under this background. Association Rule Mining is an important branch of data mining, and becomes one of the widest applied data mining styles. With the development of distributed database and the increase of data amount, research for parallel algorithm becoming the focus.
     The paper discusses and compares the traditional serial algorithms, analyses their virtues and disadvantages, then explained the necessary of parallel algorithm. This paper also introduces the cluster structure and discusses the model and methods of parallel programming. After analyzing the characteristic of some typical Parallel association rule algorithms, point out that there are also some problems to be solved.
     Parallel algorithms rely on the structure of parallel machine strongly, so the algorithm with good performance perhaps can not work well at another machine that has different structure. At the cluster system environment, we should use data parallel method to avoid masses communications. The paper proposes to use data parallel strategy based on Master/Slave model and analyses the performance.

引文

[1] Han Jiawei，Kamber M著．范明，孟小峰等译．数据挖掘：概念与技术．北京：机械工业出版社，2001．
    [2] Fayyad U. M., Piatetsky Shapiro G., Smyth P., Uthurusamy R.U. Advances in Knowledge Discovery and Data Mining[c],AAAI/MIT Press, 1996:83-115
    [3] Chen M.,Han J. and Y u P S. Data mining: An overview from database perspective.IEEE Trans. Knowledge and Data-Engineering,1996:833-866.
    [4] 刘振凯，贵忠华，蔡青．基于神经网络结构学习的知识求精方法[J]．计算机研究与发展，1999，36(10)：1069-1073．
    [5] 刘小虎，李生．决策树的优化算法[J]．软件学报，1998，9(10)：797-800．
    [6] 苗夺谦，胡桂荣．知识约简的一种启发式算法[J]．计算机研究与发展，1 999，36(6)：681-684．
    [7] 于金龙，李晓红，孙立新．连续属性的整体离散化[J]．哈尔滨工业大学学报，2000，32(3)：48-53．
    [8] 陈文伟．智能决策技术[M]．北京：电子工业出版社，1998．
    [9] 糜元根．数据挖掘方法的评述[J]．南京化工大学学报，2001，5(23)：105-110
    [10] K.Koperski, J.Han. Discovery of Spatial Association Rules in Geographic Information Database. In Advances in Spatial Databases, Proceedings of 4th Symposium,S SD' 95.(Sug.6-9,Portland,Maine).Springer. Verlag, Berlin, 1995:47-66.
    [11] Akihiro Inokuchi, Takashi Washio, et al. Basket Analysis for Graph Structure Data. In Proc. Third Pacific-Asia Conference.PAKDD-99,Pages 420-41, Beijing,April 1999.
    [12] R.Agrawal, T.Imielinski, and A.Swami. Mining Association Rules Between Sets of Items in Large Database. ACM-SIGMOD 93,Washington,DC,1993.
    [13] R.Agrawal and R.Srikant.Fast algorithms for mining association rules.In proceedings of the 20th International Conference on Very Large Databases(VLDB'94), Santiago,Chile, 1994:487-499.
    [14] R. Agrawal, T.Imielinski, and A.Swami.Mining association rules between sets of items in large databases. In ACM SIGMOD Intl. Conf. Management of Data, 1993.
    [15] Park J., Chen M., and Yu P. S. An effective hash based algorithm for mining association rules. In ACM SIGMOD Intl. Conf. Management of Data, 1995.
    [16] Hosheimer M., Kersten M., Mannila H., and Toivonen H. a perspective on databases and data mining. In 1st Intl. Conf. Knowledge Discovery and datamining, 1995.
    [17] Houstma M., Swami A. Set-oriented mining of association rules in relational databases. In 11th Intl. Conf. Data Engineering, 1994.
    [18] Zaki M. J., Parthasarathy S., Ogihara M., Li W. New algorithms for fast discovery of association rules. In Proc. Of the 3rd Intl. Conf. on Knowledge Discovery and Data Mining, 1997.
    [19] Brin S., Motwani R., Ullman J.D., Tsur S., Dynamic itemset counting and implication rules for market basket data. In Proc of 1997 ACM-SIGMOD Intl. Conf.On Management of Data, Tucson, Arizona, 1997:255-264.
    [20] Agarwal R. C., Agarwal C., Prasad V.V.V., A tree projection algorithm for generation of frequent itemsets.Journal of parallel and distributed computing (Special Issue on High Performance Data Mining), 2000.
    [21] Savasere A., Omiecinski E., and Navathe S. An efficient algorithm for mining association rules in large databases.In Proc. Of the 21st VLDB Conference, Zurich, Switzerland, 1995: 432-443.
    [22] Toivonen H. Sampling large databases for association rules. In Proc. Of the 22nd VLDB Conference, 1996.
    [23] Han J., Pei J., Yin Y. Mining frequent patterns without candidate generation. In ACM-SIGMOD, Dallas, 2000.
    [24] 范明,王秉政．一种直接在Trans-树中挖掘频繁模式的新算法[J]．计算机科学．2003,No．8：117-123．
    [25] Warschko T M, Blum J M, Tichy W F. ParaStation: Efficient Parallel Computing by Clustering Workstations: Design and EValuation. Journal of Systems Architecture, 1998, 44:241-260
    [26] Flynn, Michael J., and Kevin W.Rudd. Parallel architectures. ACM Computing Surveys 28(1):67-70, March 1996.
    [27] McGraw, James R., and Timothy S. Axelrod. "Exploiting multiprocessors: Issues and options. " In Robert G.Babb Ⅱ, (ed.), Programming Parallel Processors, Pages 7-25.Reading, MA:Addison-Wesley, 1988.
    [28] PVM news group. PVM: Parallel Virtual Machine. July 19, 2006. http://www.csm.ornl.gov/pvm/
    [29] Snir M, Otto S, Huss-Lederman S, Walker D, Dongarra J. MPI: The Complete Reference. London: MIT Press, 1996.
    [30] Matthew I, Agarwal A. LoPC: modeling contention in parallel algorithms[C]. ACM. Portland, Oregon, United States, 1997:56-69
    [31] Mohammed J Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara.Parallel Algorithms for Discovery of Association Rules[M].Kluwer Academic Publishers, Bostom. Manufactured in the Netherlands, 1997.: 1-32.
    [32] J. Han and Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. Proceedings of the 21st Intl. Conf. on VLDB, 1995.:420-431.
    [33] Takahiko Shintani, et al. Parallel Generalized Association Rule Mining on Large Scale PC Cluster[C].Workshop on Large Scale Parallel KDD System Aug 15th, 1999.San Diego, CA US. 100-110.
    [34] D. Keim, H.Kriegel, and T.Seidl. Supporting data mining of large databases by visual feedback querier. In Proc. 10th of Intl.Conf. on Data Engineering, Houston, TX, 1994.:302-313.
    [35] W. Lu, J.Han, and B.C.Ooi. Knowledge discovery in large spatial datbases. In Far East Workshop on Geographic Information Systems, Singapore, 1993.:275-289.
    [36] David W.Cheung and Yongqiao Xiao. Effect of Data Skewness in Parallel Mining of Association Rules. In PAKDD'98.Spinger-Verlat(1998).48-60.
    [37] McClean, S., Scotney, B., and Greek, K. Clustering heterogeneous distributed databases. In Workshop on distributed and parallel knowledge discovery.2000. Boston, MA, USA.
    [38] 蔡伟杰,张晓辉,朱建秋等．关联规则挖掘综述．计算机工程,2001,27(5)：31-33．
    [39] Agrawal R., and Shafer J. Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Eng., 1996, 8(6):962-969.
    [40] Park, J., Chen, M. and Yu, P.. An effective hash-based algorithm for mining association rules. In Proc. Of 1995 ACM-SIGMOD Int. Conf. on Management of Data. (1995)
    [41] Park, J., Chen, M. and Yu, P.. Efficient parallel data mining for association rules. In Proc. Of the 4th Intl. Conf. On Information and Knowledge Management.(1995)
    [42] Zaki M. J., Parthasarathy S., Ogihara M., et al. Parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery, 1997 1(4):343-373.
    [43] Mueller A. Fast sequential and parallel algorithms for association rule mining. A Comparison.Technical Report CS-TR-3515, University of Maryland, College Park, 1995.
    [44] Han E H., Karypis G., and Kumar V. Scalable Parallel Data Mining for Association Rules. In ACM-SIGMOD Conf. Management of Data, 1997:277-288.
    [45] Park J., Chen M., and Yu p. An effective Hash_based algorithm for mining association rules. In Proc. of ACM-SIGMOD Intl. Conf. on Management of Data, 1995.
    [46] Kumar V., Grama A., and Karypis G. Introduction to Parallel Computing: Algorithm Design and Analysis. Benjiamin Cummings/Addison Wesley, Redwood City, 1994.
    [47] Papadimitriou C.H., Steiglitz K. Combinatorial Optimization:Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, NJ.
    [48] AMDAHL'S LAW FOR PARALLEL SPEEDUP. http://www.math.buffalo.edu/～pitman/courses/cor501/HPCI/node11.html

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700