Research and Application of Classification Algorithms in Data Mining

Author: Liu Zhenyan (刘振岩)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Data Mining; Classification; Eager Classification; Lazy Classification; Decision Tree; Perceptron; Algebraic Hypersurface; Neural Network
Degree year: 2003
Advisor: Wang Wansen (王万森)
Discipline code: 081203
Degree-granting institution: Capital Normal University (首都师范大学)
Thesis submission date: 2003-04-01

Abstract

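With the mature application of database technology and the rapid growth of the Internet, the amount of data accumulated by humanity is growing exponentially. People are no longer satisfied with traditional querying and statistical analysis of these data; they need to uncover deeper regularities that give more effective support to decision-making and scientific research. To meet this need by extracting the useful information hidden in large volumes of data, data mining, which applies machine learning to large databases, has advanced considerably.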
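     Data mining (DM), also called knowledge discovery in databases (KDD), is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data. The discovered knowledge can be used for information management, query optimization, decision support, and process control, as well as for maintaining the data itself. Data mining is therefore a new area of database research with considerable practical value, and it is also a broad interdisciplinary field that combines theory and techniques from databases, artificial intelligence, machine learning, statistics, and other areas.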
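     Classification is a very important task in data mining and is currently the one most widely applied in business. Its goal is to learn a classification function or model that maps each data item in a database to one of a set of given classes. Researchers in machine learning, expert systems, statistics, and neurobiology have proposed many classification methods. This thesis focuses on classification algorithms in data mining and divides them into eager classification and lazy classification; essentially all of the research is organized around this division.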
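     The main contents of this thesis are:
     1. The basic techniques of classification in data mining are discussed, including the data classification process, the data preprocessing that classification requires, and the criteria for comparing and evaluating classification methods. Several typical classification algorithms are compared, including decision trees, k-nearest-neighbor classification, and neural networks. The focus of the thesis is then introduced: classification algorithms are divided into eager classification and lazy classification, and the study of data mining classification algorithms proceeds from this division.
     2. Building on a study of decision-tree methods, a "lazy decision tree algorithm" based on the idea of "lazy model-based classification" is studied and implemented. The study of decision trees covers their basic concepts, their advantages and disadvantages, the state of their application, and the directions for further research on decision-tree algorithms. To better meet the needs of applications in a network environment, a "lazy decision tree algorithm" for a browser/server (B/S) architecture is implemented by combining traditional decision-tree methods with the idea of "lazy model-based classification". In practice, using this algorithm in a web application gave good results.
     3. Neural network classification is chosen as the representative eager classification algorithm and studied in depth. Within neural networks, the analysis concentrates on the basic perceptron model: its construction and learning algorithm, its geometric meaning, and its limitations. Because the perceptron learning algorithm can classify only linearly separable data, which is an inherent limitation of the model, extensions of the perceptron model are studied.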

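     4. One class of extended perceptron model, the algebraic hypersurface neural network, is studied in detail. This part first introduces the construction of the model and its geometric meaning, then describes the implementation of its learning algorithm together with the experimental results and the novel aspects of the algorithm, and finally proposes goals for further research. The algebraic hypersurface neural network has considerable potential for nonlinear problems and is particularly well suited to classifying high-dimensional nonlinear data. The novel contribution of this work is an adaptive degree-raising computation: the study shows that adaptive modeling greatly improves the modeling success rate. For high-dimensional data, however, classification is limited by available memory, and further research is needed.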
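
The perceptron limitation and the degree-raising idea in items 3 and 4 can be made concrete with a small sketch. The abstract does not give the thesis's actual algorithm, so everything below is an assumption for illustration: the function names (`poly_features`, `train_perceptron`, `fit_adaptive`, `predict`) are hypothetical, the surface is represented by all monomials up to a given degree, and "adaptive" is read as "retry with the next higher degree whenever the perceptron rule fails to converge". XOR serves as the standard example of data that is not linearly separable.

```python
from itertools import combinations_with_replacement


def poly_features(x, degree):
    """Return all monomials of the inputs up to `degree`, plus a constant 1."""
    feats = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            term = 1.0
            for i in idx:
                term *= x[i]
            feats.append(term)
    return feats


def train_perceptron(samples, labels, degree, max_epochs=1000):
    """Perceptron rule on polynomial features; labels are -1 or +1.

    Returns the weight vector, or None if no error-free pass over the
    data occurs within `max_epochs`.
    """
    data = [poly_features(x, degree) for x in samples]
    w = [0.0] * len(data[0])
    for _ in range(max_epochs):
        errors = 0
        for f, y in zip(data, labels):
            s = sum(wi * fi for wi, fi in zip(w, f))
            if y * s <= 0:  # misclassified (or on the surface): update
                w = [wi + y * fi for wi, fi in zip(w, f)]
                errors += 1
        if errors == 0:
            return w
    return None


def fit_adaptive(samples, labels, max_degree=4):
    """Assumed reading of adaptive degree raising: try degree 1, 2, ... in turn."""
    for degree in range(1, max_degree + 1):
        w = train_perceptron(samples, labels, degree)
        if w is not None:
            return degree, w
    raise ValueError("no separating surface found up to max_degree")


def predict(w, x, degree):
    s = sum(wi * fi for wi, fi in zip(w, poly_features(x, degree)))
    return 1 if s > 0 else -1


# XOR: no straight line separates the two classes.
xor_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
xor_labels = [-1, 1, 1, -1]
degree, weights = fit_adaptive(xor_points, xor_labels)
print("lowest separating degree:", degree)
```

On XOR the degree-1 pass never reaches an error-free epoch, while degree 2 does, because the product term x1*x2 makes the classes separable. In this sketch the number of monomials grows combinatorially with input dimension and degree, which is at least consistent with the memory limit the abstract reports for high-dimensional data.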
With the widespread application of database technology and the rapid development of the Internet, the volume of accumulated data is growing exponentially. People are no longer satisfied with traditional query and statistical methods for these data; they want to discover deeper regularities that provide more effective support for decision-making and scientific research. To meet this need, data mining technology, which applies machine learning to large databases to extract useful information from large amounts of data, has developed considerably.
    Data mining (DM), also called knowledge discovery in databases (KDD), is the process of extracting implicit, previously unknown, and potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data. The discovered knowledge can be used for information management, query optimization, decision support, process control, database maintenance, and so on. Data mining is therefore a valuable new area of database research, and it is an interdisciplinary field that draws on the theory and techniques of databases, artificial intelligence, machine learning, statistics, and related areas.
    Classification is a very important task in data mining and is currently widely applied in commerce. The goal of classification is to learn a classification function or model that maps a data item to one of a set of predefined classes. Researchers in machine learning, expert systems, and neurobiology have proposed many classification methods. This thesis studies classification algorithms in data mining. It divides classification algorithms into eager and lazy ones, and all of the research is organized around this division.
    The main work of the thesis:
    1. The basic techniques of classification in data mining are introduced, including the classification process, the preprocessing of data for classification, and the criteria for comparing and evaluating classification methods. Several typical classification algorithms are compared: decision trees, k-nearest-neighbor classification, and neural networks. The focus of the thesis is then introduced: classification algorithms are divided into eager and lazy ones, and the study of classification algorithms in data mining proceeds from this division.
    2. Building on a study of traditional decision trees, a lazy decision-tree algorithm derived from the idea of lazy model-based classification is studied. For traditional decision trees, the basic concepts, advantages, and disadvantages are presented, and the state of their application and research is analyzed. For the web environment, a web application that uses the lazy decision-tree algorithm is developed, and practical use shows that the method performs well.
    3. Neural networks are studied in depth as a representative of eager classification, with the perceptron selected as the focus. First, the construction of the basic perceptron model and its learning algorithm are introduced. Then, based on the principle and geometric interpretation of the model, its limitation is studied: the perceptron learning algorithm can be used only when the data are linearly separable. To address this limitation, extended perceptron models are studied.
    4. The algebraic hypersurface neural network, a kind of extended perceptron model, is a focus of this thesis. First, the construction of the model and its geometric interpretation are introduced. Then its learning algorithm is implemented, and the experimental results and the innovation of the implementation are presented. Finally, goals for further work are proposed on the basis of the experimental conclusions. The model has potential for solving nonlinearly separable problems and is especially suited to classifying high-dimensional data. The innovation of this research is an adaptive degree-raising computation; the experiments show that the modeling success rate rises after the adaptive method is adopted. However, memory becomes a limitation for high-dimensional data, so further research is needed.
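
The eager/lazy division that organizes the thesis can also be shown in a few lines. The sketch below is an illustration under stated assumptions, not code from the thesis: `LazyKNN` and `EagerCentroid` are hypothetical names, k-nearest-neighbor stands in for lazy classification (it is one of the typical algorithms the abstract lists), and a nearest-centroid model is swapped in as a deliberately simple eager learner in place of the decision trees and neural networks the thesis studies.

```python
from collections import Counter
from math import dist


class LazyKNN:
    """Lazy: training only stores the data; all work happens per query."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, points, labels):
        self.points, self.labels = list(points), list(labels)
        return self

    def predict(self, query):
        # Rank every stored sample by distance to the query, then vote.
        ranked = sorted(zip(self.points, self.labels),
                        key=lambda pair: dist(pair[0], query))
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]


class EagerCentroid:
    """Eager: training builds a compact model; the data can then be dropped."""

    def fit(self, points, labels):
        sums, counts = {}, Counter()
        for point, label in zip(points, labels):
            acc = sums.setdefault(label, [0.0] * len(point))
            for i, value in enumerate(point):
                acc[i] += value
            counts[label] += 1
        # The model is one mean vector per class.
        self.centroids = {label: [v / counts[label] for v in acc]
                          for label, acc in sums.items()}
        return self

    def predict(self, query):
        return min(self.centroids,
                   key=lambda label: dist(self.centroids[label], query))


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]
lazy = LazyKNN(k=3).fit(train_points, train_labels)
eager = EagerCentroid().fit(train_points, train_labels)
print(lazy.predict((0.5, 0.5)), eager.predict((5.5, 5.5)))
```

The trade-off is where the cost falls: `EagerCentroid.fit` does the generalization up front and answers queries from the stored centroids alone, while `LazyKNN.fit` is nearly free and each call to `predict` scans the full training set.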

