Research on Density-Based Incremental Mining Techniques for Massive Data

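Author: Zhou Yongfeng
Degree level: Master's
Discipline: Management Science and Engineering
Chinese keywords: 数据挖掘; 聚类分析; 孤立点因子; 增量更新
English keywords: data mining; clustering; outlier factor; incremental updating
Degree year: 2002
Advisor: Deng Su
Discipline code: 1201
Degree-granting institution: National University of Defense Technology, Chinese People's Liberation Army
Thesis submission date: 2002-11-01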

Abstract

Incremental mining means that, when new data are added to a large data set (a database, a data warehouse, and so on), the existing mining results are updated incrementally instead of re-mining the entire data set after every update. For many kinds of large databases and data warehouses, incremental data mining is an attractive goal. This thesis studies incremental mining techniques based on the outlier factor.
    The thesis first presents the basic concepts and methods of data mining and introduces its usual objects of study and typical applications. It then examines clustering techniques, states the general criteria for evaluating a clustering, and briefly surveys typical existing incremental mining methods. This groundwork informs the later chapters and clarifies the requirements.
    Most existing clustering methods are strongly affected by their parameters, which the user often has to specify, so choosing parameter values is a major difficulty in practice. Building on a study of density-based clustering algorithms, the thesis proposes a clustering algorithm based on the outlier factor that effectively addresses this problem, and on top of it an incremental algorithm that updates the clustering result as data are added. The thesis defines the concepts underlying outlier-factor clustering, gives the corresponding algorithm descriptions, and explains the idea of the algorithm and the clustering process in detail.
    Finally, experiments analyze the effectiveness of the outlier-factor-based clustering algorithm and compare its performance with related algorithms. The experiments show that the algorithm is robust to its parameters, and they also briefly analyze the effectiveness and efficiency of the incremental algorithm.
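
The general idea of incremental updating, as opposed to re-mining the whole data set, can be illustrated with a deliberately simplified sketch. It is not the outlier-factor algorithm of the thesis, whose definitions are not given in this record. It assumes a much cruder notion of cluster: two points belong together if they are linked by a chain of neighbours at distance at most eps. The names IncrementalClusters, batch_clusters and the radius eps are hypothetical and chosen only for this illustration.

```python
import math


class IncrementalClusters:
    """Clusters = connected components of the eps-neighbourhood graph,
    maintained one insertion at a time (illustrative simplification)."""

    def __init__(self, eps):
        self.eps = eps      # hypothetical neighbourhood radius
        self.points = []    # inserted points, in arrival order
        self.parent = []    # union-find parent link for each point

    def _find(self, i):
        # Walk to the cluster representative, halving the path on the way.
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def insert(self, p):
        """Add one point; only clusters with a neighbour of p are merged."""
        new = len(self.points)
        self.points.append(p)
        self.parent.append(new)
        # Linear scan for neighbours; a spatial index could narrow this search.
        for j in range(new):
            if math.dist(p, self.points[j]) <= self.eps:
                self.parent[self._find(j)] = self._find(new)

    def clusters(self):
        """Current partition as a set of frozensets of point indices."""
        groups = {}
        for i in range(len(self.points)):
            groups.setdefault(self._find(i), set()).add(i)
        return {frozenset(g) for g in groups.values()}


def batch_clusters(points, eps):
    """Reference result: recluster all points from scratch by flood fill."""
    unseen = set(range(len(points)))
    result = set()
    while unseen:
        seed = unseen.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unseen
                    if math.dist(points[i], points[j]) <= eps}
            unseen -= near
            component |= near
            frontier.extend(near)
        result.add(frozenset(component))
    return result


model = IncrementalClusters(eps=1.0)
for p in [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]:
    model.insert(p)
model.insert((0.9, 0.0))  # joins the cluster of the first two points
# The incrementally maintained partition matches a full recomputation.
assert model.clusters() == batch_clusters(model.points, 1.0)
```

Under this simplified cluster definition, an insertion can only merge clusters and never split one, so the incremental result always agrees with a full recomputation. A density threshold or an outlier factor makes the update rule more involved, because a new point can also change the status of its neighbours.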

