Research on Clustering Analysis Algorithms in Information Value-Added

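English title: Research of Clustering Algorithm in Information Value-added
Author: Zhao Hui (赵辉)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 信息增值 ; 聚类分析 ; 蚂蚁算法 ; 度限制树 ; 高考成绩
English keywords: information value-added ; clustering algorithm ; ant system ; degree-constraint tree ; scores of the entrance-test-to-college
Degree year: 2003
Advisor: Gu Junhua (顾军华)
Discipline code: 081203
Degree-granting institution: Hebei University of Technology (河北工业大学)
Submission date: 2003-01-01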

Abstract

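As information-acquisition technology has improved, every field of production and daily life has accumulated massive amounts of information. The great contribution of information to social and economic development lies chiefly in its value-added effect. Data mining is currently the most widely used approach to adding value to information. By analyzing data mining techniques and knowledge representation methods, this thesis argues that clustering analysis is the most effective method for dynamic information value-added.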
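     Existing clustering algorithms are generally sensitive to their initial values. Building on small-sample theory, this thesis proposes an algorithm that derives the initial values from small sample sets, which reduces sensitivity to initialization and also improves clustering quality.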
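
The idea of deriving initial values from small samples can be illustrated with a minimal Python sketch. It is not the thesis's algorithm, which the abstract does not spell out. It loosely follows the subsample-refinement scheme associated with Bradley and Fayyad instead: run k-means on several small random subsamples, pool the resulting centers, and keep the candidate set that best fits the pool. All function names and parameters (`subsample_init`, `n_subsamples`, `sample_size`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, iters=20):
    """Plain Lloyd iterations from the given starting centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def distortion(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def subsample_init(points, k, n_subsamples=5, sample_size=30, seed=0):
    """Pick initial centers from small subsamples (assumes sample_size >= k)."""
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    candidates = []
    for _ in range(n_subsamples):
        sample = rng.sample(points, size)
        # Cluster each small subsample from a random start.
        candidates.append(kmeans(sample, rng.sample(sample, k)))
    # Pool all subsample centers, refine each candidate on the pool,
    # and keep the candidate with the lowest distortion over the pool.
    pooled = [c for centers in candidates for c in centers]
    refined = [kmeans(pooled, start) for start in candidates]
    return min(refined, key=lambda centers: distortion(pooled, centers))


# Usage: seed a full k-means run with the subsample-derived centers.
demo_rng = random.Random(1)
data = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(100)]
data += [(demo_rng.gauss(10, 1), demo_rng.gauss(10, 1)) for _ in range(100)]
initial_centers = subsample_init(data, k=2)
final_centers = kmeans(data, initial_centers)
```

Because the expensive clustering runs only touch small subsamples and the pooled centers, the full data set is clustered once, from a starting point that no longer depends on a single random draw.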
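     For dynamic information value-added, this thesis analyzes the shortcomings of existing clustering algorithms and discusses why combining clustering analysis with the ant algorithm, a knowledge-evolution algorithm, is both feasible and necessary. It then proposes using the ant algorithm to build a degree-constrained tree that represents the distribution of the data, which greatly improves the efficiency of traditional clustering methods on dynamic information value-added. Based on an analysis of experimental data, the tree-construction method is further refined, yielding an additional improvement in clustering quality.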
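
The tree-based clustering idea can be sketched as follows. The abstract does not give the construction details, so this is only an illustration under stated assumptions: the ant-system search is replaced by a plain greedy Prim-style heuristic, and clusters are read off by cutting the longest tree edges, which is one plausible tree feature and not necessarily the one the thesis uses. The names `degree_constrained_tree`, `clusters_from_tree`, and `max_degree` are hypothetical.

```python
import math


def degree_constrained_tree(points, max_degree=3):
    """Greedy Prim-style spanning tree where no node exceeds max_degree.

    Stand-in for the ant-system construction; requires max_degree >= 2
    so that some tree node can always accept another edge.
    """
    n = len(points)
    in_tree = {0}
    degree = [0] * n
    edges = []
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            if degree[u] >= max_degree:
                continue  # u is saturated and cannot take another edge
            for v in range(n):
                if v in in_tree:
                    continue
                d = math.dist(points[u], points[v])
                if best is None or d < best[0]:
                    best = (d, u, v)
        d, u, v = best
        edges.append((d, u, v))
        degree[u] += 1
        degree[v] += 1
        in_tree.add(v)
    return edges


def clusters_from_tree(n, edges, k):
    """Drop the k-1 longest edges and return the connected components."""
    kept = sorted(edges)[: max(0, len(edges) - (k - 1))]
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, u, v in kept:
        parent[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Usage: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
tree = degree_constrained_tree(points, max_degree=3)
clusters = clusters_from_tree(len(points), tree, k=2)
```

One appeal of a tree representation for dynamic data is that a new point can be attached by adding a single edge instead of reclustering everything, although this sketch does not implement incremental updates.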
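     Finally, this thesis explains why adding value to college entrance examination (gaokao) scores matters, applies the clustering method in an information value-added system for those scores, and obtains a series of meaningful conclusions.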
With advances in information-acquisition technology, every field now holds more and more information. Once value has been added to it, information can play a more important role in society, so it is important to study how to make information more valuable. Data mining is currently the most widely used way to do this. In this thesis, we argue that clustering is the most efficient way to add value to information.
    However, existing clustering algorithms have two shortcomings. First, they depend heavily on the initial input. We propose an algorithm that finds the initial input based on sub-sample theory; on the test data, the new algorithm both reduces this sensitivity and produces better-quality clusters. Second, they do not handle dynamic data well. We develop another new algorithm to address this: using the ant system, the distribution of the original data is captured in a degree-constrained tree, and clusters are then generated from the features of the tree. Our analysis of the results on the test data shows that the new algorithm is an efficient way to cluster dynamic data.
    In the last part of the thesis, we explain the importance of adding value to college entrance examination scores. After applying the proposed algorithm to the scores, we present some useful conclusions about them.
