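Application and Research of Association Rule Mining in Tax Systems (关联规则挖掘在税务系统中的应用与研究)

Author: Su Liming (苏立明)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: association rule mining; frequent itemsets; tax system; FG; A+
Degree year: 2010
Advisor: Zhou Lianzhe (周连喆)
Discipline code: 081203
Degree-granting institution: Changchun University of Technology (长春工业大学)
Thesis submission date: 2010-03-01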

Abstract

In the early 20th century, and even by the middle of the century, no company's accounts, order records, and filing cabinets together held more than a few tens of megabytes of data. Today, the largest corporate databases are measured in terabytes. The accumulation of massive data opens up new room for applying data mining, and the data mining market is growing steadily: more and more large and medium-sized enterprises use data mining to analyze company data in support of decision making, and it is becoming a key asset for staying competitive in the market.
     This thesis analyzes the Apriori algorithm in detail, summarizes its strengths and weaknesses, and, with Apriori's performance bottlenecks in mind, reviews various optimized algorithms and their characteristics. The A+ algorithm is a fast algorithm for mining frequent itemsets. It applies mathematical techniques to frequent-itemset mining: using vector inner products, it progressively reduces a Boolean matrix until the frequent itemsets are obtained. The algorithm is computationally simple and searches efficiently; it generates no candidate itemsets and scans the database only once. The thesis applies A+ to association rule mining in a tax system and compares it with Apriori on tax data. The experimental results show that A+ runs in less time than Apriori.
     The FG algorithm is an efficient association rule mining algorithm based on a transaction rule tree. It targets two steps of the FP-growth algorithm, the construction of conditional FP-trees and superset checking, and addresses them by building a transaction rule model. Transaction rule trees are merged into a rule chain, and FG is run on the rule chain to construct the transaction rule tree model; the rules formed once construction is complete have already had most redundant rules removed. FG then applies a filtering technique to remove the remaining redundant rules, yielding all non-redundant association rules. FG does not need to find frequent items; it derives association rules directly, which makes the method fairly flexible. The algorithm is applied to tax auditing, and FG and FP-growth are analyzed on the same dataset. The experimental results show that FG runs in less time than FP-growth.
     The thesis first introduces the basic concepts, classification, and mining steps of association rules and reviews the state of association rule research and applications. It then presents the classic association rule mining algorithm, Apriori, and its existing improvements, followed by the improved algorithms A+ and FG, which are analyzed and compared experimentally. Finally, it describes the current state of data mining in tax systems, the characteristics of tax data, and the specific application of A+ and FG in a tax system.
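
The Boolean-matrix idea that the abstract attributes to A+ can be illustrated with a small sketch. The abstract does not give the algorithm's steps, so the code below is not the thesis's A+ procedure: it only shows how an itemset's support can be counted as an inner product of 0/1 column vectors, and how the matrix can be reduced between levels. It also enumerates column combinations, which A+ is said to avoid. The function names, the item labels A, B, C, and the `min_support` threshold are all hypothetical.

```python
from itertools import combinations


def build_boolean_matrix(transactions):
    """Return (items, matrix), where matrix[t][i] is 1 if transaction t
    contains items[i] and 0 otherwise."""
    items = sorted({item for t in transactions for item in t})
    index = {item: i for i, item in enumerate(items)}
    matrix = []
    for t in transactions:
        row = [0] * len(items)
        for item in t:
            row[index[item]] = 1
        matrix.append(row)
    return items, matrix


def support_count(matrix, cols):
    """Support of the itemset given by column indices `cols`: the
    (generalized) inner product of the selected 0/1 columns."""
    total = 0
    for row in matrix:
        product = 1
        for c in cols:
            product *= row[c]
        total += product
    return total


def frequent_itemsets(transactions, min_support):
    """Illustrative level-wise search over a shrinking Boolean matrix.
    Assumption: this is a teaching sketch, not the A+ algorithm itself."""
    items, matrix = build_boolean_matrix(transactions)
    cols = list(range(len(items)))
    result = {}
    k = 1
    while cols and matrix:
        level = {}
        for combo in combinations(cols, k):
            count = support_count(matrix, combo)
            if count >= min_support:
                level[combo] = count
        if not level:
            break
        for combo, count in level.items():
            result[frozenset(items[c] for c in combo)] = count
        # Column reduction: an item outside every frequent k-itemset
        # cannot belong to a frequent (k+1)-itemset.
        cols = sorted({c for combo in level for c in combo})
        k += 1
        # Row reduction: a transaction with fewer than k surviving items
        # cannot contain any k-itemset.
        matrix = [row for row in matrix if sum(row[c] for c in cols) >= k]
    return result


if __name__ == "__main__":
    # Placeholder transactions; the labels carry no tax meaning.
    demo = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
    for itemset, count in frequent_itemsets(demo, min_support=3).items():
        print(sorted(itemset), count)
```

The two reduction steps rely only on the standard downward-closure property of support, so dropping rows and columns never changes the counts of the itemsets that remain.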

