Research on Outlier Data Mining Algorithms and Their Application in Taxation

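Author: Li Dan (李丹)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords (Chinese): data mining ; outlier detection ; cluster analysis ; taxation applications
Keywords (English): data mining ; outlier detection ; clustering ; revenue application
Degree year: 2005
Advisor: Hong Xiaoguang (洪晓光)
Discipline code: 081202
Degree-granting institution: Shandong University (山东大学)
Thesis submission date: 2005-04-05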

Abstract

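This thesis discusses the application of outlier mining and cluster analysis in the taxation sector.
     As database technology has become widespread in tax administration, tax authorities have accumulated large volumes of raw data, yet they cannot use these resources effectively. How to obtain useful knowledge from such data is precisely the problem that data mining addresses. Data mining is a technology that has developed since the 1980s. Its main purpose is to extract implicit, previously unknown, yet potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random real-world application data.
     Outlier mining is one of the important research areas within data mining. Its role is to discover the "small patterns" in data, that is, objects in a data set that differ markedly from the rest. This is a highly effective form of data mining for taxation. Unusual production and business models, exceptionally large taxpaying enterprises (what the tax sector calls key tax sources), and even various tax-related crimes all produce anomalous data, and these data are precisely what tax authorities focus on. Finding such data quickly and effectively is of great importance to the tax sector. This thesis explores outlier data mining for the tax sector.
     The thesis first presents the basic concepts and methods of data mining and introduces its typical objects of study and applications. It then examines clustering and outlier mining techniques in detail, explains the general criteria for evaluating clustering and outlier mining algorithms, and introduces some typical algorithms of both kinds. It reviews the development and current state of outlier mining research, introduces the main outlier detection algorithms based on distance, density, and deviation as well as those for high-dimensional data, analyzes the main content of each, and on that basis summarizes and compares their strengths, weaknesses, and ranges of applicability.
     The focus of the thesis is the use of a density-based method to perform cluster analysis on the revenue data of tax authorities, discovering meaningful models and anomalous data within it. Given the characteristics of the tax sector, outlier mining has very broad application prospects there. Building on a study of existing cluster analysis and outlier mining algorithms, starting from the practical needs of the tax sector and the characteristics of its data, the thesis, with regard to the outlier-factor-based ...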
In this thesis we apply clustering and outlier detection methods to tax data.
     As database technology has come into wide use in tax administration, tax authorities have accumulated a large amount of raw data, which sits in databases to little avail. How to extract knowledge from these data is the key task of data mining. Data mining is a technique that has developed since the 1980s. It aims to extract implicit, previously unknown, and potentially useful knowledge from voluminous, incomplete, fuzzy, stochastic data.
     Outlier analysis is an important part of data mining research. Its purpose is to find the "small patterns" in a data set. An outlier is an object that is considerably dissimilar to or inconsistent with the remainder of the data. This is very useful in tax administration. Outliers in a tax database can be generated by an unusual mode of production, a large-scale taxpayer, or even criminal activity. All of these are under special supervision by the tax authorities, so finding them quickly and accurately is important. This thesis discusses outlier detection technology adapted to taxation.
     First, we describe the basic concepts and methods of data mining, then introduce its common objects of study and representative applications. We study clustering and outlier detection technology, describe the general criteria for evaluating such algorithms, and introduce some clustering and outlier detection algorithms. The research history and current state of outlier detection are reviewed. Outlier detection algorithms based on distance, density, and deviation, and those for high-dimensional data, are introduced, their content is analyzed, and their advantages and disadvantages are compared.
     The emphasis of this thesis is on using ODACDS (Outlier Detection Algorithm on Continued Data Sets), a density-based clustering method, to analyze tax data. The algorithm can discover clusters of arbitrary shape and can distinguish noise. Owing to the characteristics of taxation, outlier detection can be used widely in the field. To meet the needs of tax administration, we studied a range of outlier detection and clustering algorithms. Building on a study of the clustering algorithm based on the outlier factor, we put forward an outlier detection algorithm on continued data sets. The new algorithm can be used to find deviations in the data. We first introduce the concept of the outlier factor, then explain the idea and process of the algorithm, and discuss its details and exceptional cases.
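
To make the notion of a density-based outlier factor concrete, the following is a minimal sketch of the well-known local outlier factor (LOF) computation in plain Python. It is not the ODACDS algorithm described above, whose details this abstract does not give. The function names, the neighbourhood size k, and the toy data are all assumptions chosen for illustration.

```python
import math


def k_distance_neighbors(points, i, k):
    """Return (k-distance, neighbour indices) of points[i]; ties at the k-distance are kept."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    k_dist = dists[k - 1][0]
    neighbors = [j for d, j in dists if d <= k_dist]
    return k_dist, neighbors


def local_outlier_factors(points, k):
    """Compute an LOF score per point (illustrative helper, not from the thesis).

    Assumes at least k + 1 points and no group of more than k exact
    duplicates, so that every reachability sum is strictly positive.
    """
    n = len(points)
    info = [k_distance_neighbors(points, i, k) for i in range(n)]

    def reach_dist(i, j):
        # Reachability distance of point i with respect to point j.
        return max(info[j][0], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        neighbors = info[i][1]
        total = sum(reach_dist(i, j) for j in neighbors)
        lrd.append(len(neighbors) / total)

    # LOF: average ratio of the neighbours' density to the point's own density.
    lof = []
    for i in range(n):
        neighbors = info[i][1]
        lof.append(sum(lrd[j] for j in neighbors) / (len(neighbors) * lrd[i]))
    return lof


# Toy 2-D records with made-up values (for example, two scaled tax indicators).
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = local_outlier_factors(sample, k=2)
```

Scores close to 1 indicate a point whose local density matches that of its neighbours, while markedly larger scores indicate a candidate outlier. In the toy data the isolated record at (10, 10) receives by far the highest score. The cutoff above which a record is flagged is a tuning choice that would depend on the data at hand.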

