Research and Implementation of Data Mining Algorithms Based on SSAS

Original title: 基于SSAS的数据挖掘算法研究与实现
Author: 郭醒 (Guo Xing)
Degree level: Master's
Discipline: Software Engineering
Keywords: Data Mining ; Association Rule Mining ; Data Cube ; Decision Tree
Degree year: 2008
Supervisor: 刘大有 (Liu Dayou)
Discipline code: 081203
Degree-granting institution: 吉林大学 (Jilin University)
Submission date: 2008-04-01

Abstract

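Against the background of research on data mining and association rule mining, this thesis focuses on methods for discovering the maximal frequent itemsets of association rules under SSAS (SQL Server 2005 Analysis Services), and on an optimized decision tree algorithm.
     The thesis first discusses the background from which data mining emerged, the basic process of data mining, and its main tasks. It then introduces the basic concepts of association rule mining and studies algorithms for discovering maximal frequent itemsets. In the Microsoft SSAS environment, it improves and implements FM_CUBE, an algorithm that describes itemsets with a set-enumeration tree and uses a data cube to discover maximal frequent itemsets quickly, which significantly improves discovery efficiency. FM_CUBE gives data mining applications that need maximal frequent itemsets an effective and fast algorithm. Finally, through a study of decision tree algorithms, a new pruning optimization algorithm is designed on the basis of minimum error pruning. Experimental results show that the proposed algorithm performs better in running time than the Microsoft algorithm.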
With the continuing development of database technology and the widespread use of database management systems, the volume of stored data is growing rapidly. Much important information is hidden in this data and could support better decision-making. Current database systems, however, only store and retrieve data, so what users obtain is only a small part of what the data holds. The more important information is a description of the overall characteristics of the data and a forecast of its trends, which has significant reference value in decision-making. Data-processing technology is therefore expected to work at a deeper level in order to obtain these overall characteristics and trend forecasts.
     Data mining is the discovery of interesting knowledge from large data sets, which may be incomplete, noisy, uncertain, and stored in various forms. This knowledge is implicit, previously unknown, and potentially valuable for decision-making. The extracted knowledge can be expressed as concepts, rules, regularities, and patterns. As a new field of study, data mining draws on machine learning, pattern recognition, statistics, databases, artificial intelligence, mathematics, and visualization, and is an emerging research field with broad applications.
     This thesis discusses the basic process and the main tasks of data mining. It examines the entire data mining process: data integration, data cleaning, data selection, data transformation, data mining, pattern evaluation, and the presentation and application of the discovered knowledge. It then studies association rule mining and decision tree construction in depth.
     The second chapter studies algorithms for discovering maximal frequent itemsets. Association rules are introduced first, and the classical Apriori algorithm is explained. Identifying frequent itemsets is the key technique and the most computationally intensive step in association rule mining, and every frequent itemset is a subset of some maximal frequent itemset. The FM_CUBE algorithm for discovering maximal frequent itemsets is then proposed: in the Microsoft SSAS environment, we improve and implement FM_CUBE, which describes itemsets with a set-enumeration tree and works on a data cube. FM_CUBE significantly improves discovery efficiency and gives data mining applications that need maximal frequent itemsets an effective and fast method. Unlike the entity-relationship model of relational databases, the data model of a data warehouse is multidimensional and organizes data as a data cube whose cells hold aggregated values. In a data cube, an itemset is a combination of members of the cube's dimensions, and its support is the corresponding measure value. Algorithms that discover frequent itemsets on a data cube therefore generally obtain support directly from the cube. Some researchers have already proposed frequent itemset algorithms that combine the data cube with Apriori.
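To make the notion of a maximal frequent itemset concrete, the following is a minimal Python sketch. It is not FM_CUBE, whose details the abstract does not give: it simply walks a set-enumeration tree depth-first and then keeps the frequent itemsets that have no frequent proper superset. The toy transactions, the function names, and the support threshold are illustrative assumptions, and `support` scans the transactions where the thesis reads the value from a data cube measure.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Enumerate frequent itemsets depth-first over a set-enumeration tree."""
    items = sorted(set().union(*transactions))
    found = []

    def expand(prefix, tail):
        # Each child of `prefix` appends one item that comes later in the order,
        # so every itemset is generated exactly once.
        for i, item in enumerate(tail):
            candidate = prefix | {item}
            # Infrequent candidates are not expanded: all their supersets
            # are infrequent as well.
            if support(candidate, transactions) >= min_support:
                found.append(frozenset(candidate))
                expand(candidate, tail[i + 1:])

    expand(frozenset(), items)
    return found


def maximal_frequent_itemsets(transactions, min_support):
    """Keep only the frequent itemsets with no frequent proper superset."""
    frequent = frequent_itemsets(transactions, min_support)
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical toy data, for illustration only.
transactions = [
    frozenset("abc"), frozenset("abc"), frozenset("ab"),
    frozenset("bd"), frozenset("bd"), frozenset("ac"),
]
result = maximal_frequent_itemsets(transactions, min_support=2)
print(sorted(sorted(s) for s in result))  # [['a', 'b', 'c'], ['b', 'd']]
```

Every other frequent itemset in this example, such as {a, b} or {d}, is a subset of one of the two maximal itemsets, which is why reporting only the maximal ones loses no frequent itemset.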
     Both the Max-Miner and FM_CUBE algorithms are implemented in C#. SQL Server 2005 Analysis Services is used to generate the data cube, which is accessed through ADOMD.NET and MDX.
     In the third chapter, a decision tree algorithm is designed on the basis of minimum error pruning. The experimental results show that the proposed algorithm runs faster than the Microsoft algorithm. The chapter first analyzes the classic decision tree algorithms ID3 and C4.5. The input is a set of records that share the same structure, each consisting of a number of attribute-value pairs, and one of these attributes represents the category of the record. The task is to construct a decision tree that correctly predicts the value of the category attribute from the values of the non-category attributes. The main advantages and disadvantages of the two algorithms are then analyzed. Next, the whole process of generating a decision tree is described in more detail. Construction has two main parts. The first is tree generation: at the start all the data are in the root node, and the data are then partitioned recursively. The second is tree pruning, which removes branches that are likely to reflect noise or outliers. A node stops being split under either of two conditions: all the data at the node belong to the same category, or no attribute remains that can be used for splitting.
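The tree-generation phase and its two stopping conditions can be sketched as follows. This is a minimal ID3-style illustration in Python that selects splits by information gain; it is not the thesis implementation (which is written in C#), and the record layout, attribute names, and toy data are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(records, target):
    """Shannon entropy of the category attribute over `records`."""
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(records, attribute, target):
    """Entropy reduction obtained by splitting `records` on `attribute`."""
    total = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r for r in records if r[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(records, target) - remainder


def build_tree(records, attributes, target):
    """Return a category label (leaf) or {attribute: {value: subtree}}."""
    labels = [r[target] for r in records]
    # Stop 1: all records at this node belong to the same category.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop 2: no attribute is left to split on; use the majority category.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(records, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}


# Hypothetical toy records: "play" is the category attribute.
records = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
tree = build_tree(records, ["outlook", "windy"], "play")
```

On this data `outlook` has the larger information gain, so it becomes the root; the `sunny` branch stops immediately because all of its records share one category, while the `rainy` branch is split once more on `windy`.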
     The main pruning methods are then studied and discussed in detail. Four major pruning algorithms are examined, together with their advantages and disadvantages: pessimistic error pruning (PEP), minimum error pruning (MEP), cost-complexity pruning (CCP), and error-based pruning (EBP).
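As an illustration of one of these methods, the sketch below shows classic minimum error pruning in its Laplace-based form: a node with n records, n_c of them in its majority category, and k categories overall has expected error (n - n_c + k - 1) / (n + k), and a subtree is collapsed when this value does not exceed the weighted expected error of its children. This is the textbook method, not the optimized algorithm proposed in the thesis, and the node layout (`counts`, `children`) is a hypothetical one chosen for the example.

```python
def expected_error(counts, num_classes):
    """Laplace-based expected error of labelling a node with its majority category."""
    n = sum(counts.values())
    n_c = max(counts.values())
    return (n - n_c + num_classes - 1) / (n + num_classes)


def prune(node, num_classes):
    """Prune bottom-up in place; return the expected error of the resulting node."""
    static = expected_error(node["counts"], num_classes)
    children = node.get("children", [])
    if not children:
        return static
    n = sum(node["counts"].values())
    # Backed-up error: children's errors weighted by the share of records they hold.
    backed_up = sum(
        sum(child["counts"].values()) / n * prune(child, num_classes)
        for child in children
    )
    if static <= backed_up:
        node["children"] = []  # collapse the subtree into a leaf
        return static
    return backed_up


# Hypothetical subtree whose split separates the categories cleanly: it is kept.
clean = {"counts": {"yes": 6, "no": 4}, "children": [
    {"counts": {"yes": 6, "no": 0}},
    {"counts": {"yes": 0, "no": 4}},
]}
# Hypothetical subtree whose split barely helps: it is collapsed.
weak = {"counts": {"yes": 5, "no": 1}, "children": [
    {"counts": {"yes": 3, "no": 1}},
    {"counts": {"yes": 2, "no": 0}},
]}
prune(clean, num_classes=2)
prune(weak, num_classes=2)
```

After the two calls, `clean` still has both children, whereas `weak["children"]` is empty, because labelling that node directly has a lower expected error than keeping the split.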
     Finally, the Microsoft Decision Trees algorithm is described, and decision tree mining is carried out in SSAS on the CollegePlans and MovieClick example databases.
     In the fourth part, building on the study of pruning algorithms and following the principle of minimum error pruning, an optimized ID3 algorithm is proposed. Its main advantage is that, according to the characteristics and attributes of the data, it can choose the node-generation scheme with the lowest error rate, which both improves running efficiency and reduces the error rate. Based on the pruning methods above, the optimized decision tree algorithm is implemented in C#, using the TreeView control of Microsoft Visual Studio 2005 to build the tree by adding nodes. Finally, the proposed decision tree algorithm is shown to be more efficient than the Microsoft Decision Trees algorithm.
     The thesis closes with a summary and a discussion of the future of data mining.
     The results of this study, especially those on maximal frequent itemsets and decision trees, are of both theoretical and practical benefit to further research.
