Research on Clustering Algorithms and Their Application in a Teaching Quality Evaluation System

Original title: 聚类算法及在教学测评系统中的应用研究
Author: 李新良 (Li Xinliang)
Degree level: Master's
Discipline: Computer Applications
Keywords: Data Mining; Association Rules; Analytic Hierarchy Process; Clustering Analysis; Teaching Quality Evaluation
Degree year: 2008
Supervisor: 陈湘涛 (Chen Xiangtao)
Discipline code: 081203
Degree-granting institution: 湖南大学 (Hunan University)
Thesis submission date: 2008-04-01
Chair of the defense committee: 文双春 (Wen Shuangchun)

Abstract

For large-scale, high-dimensional data, building effective and scalable clustering algorithms is an active research topic in data mining. Addressing this problem, this thesis studies clustering algorithms from the following perspectives.
     First, the strengths and weaknesses of three existing clustering initialization methods are analyzed, and a new distance-based initialization method is proposed. The method requires no threshold, is unaffected by the order of the data set, suppresses the influence of outliers and noise, and is suited to initializing the clustering of fairly large data sets.
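
The abstract does not describe the proposed initialization in detail, so the following is only a minimal sketch of what distance-based initialization can look like. It uses the well-known maximin (farthest-first) seeding as a stand-in, not the thesis's method. The function name `maximin_seeds` and the sample data are hypothetical. Note that plain maximin seeding needs no threshold and does not depend on input order, but unlike the method described above it is sensitive to outliers.

```python
import math

def maximin_seeds(points, k):
    """Pick k initial centers by farthest-first traversal (illustrative stand-in)."""
    dim = len(points[0])
    # Start from the point nearest the overall mean, so the result does not
    # depend on the order in which the points are listed (ties aside).
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    seeds = [min(points, key=lambda p: math.dist(p, mean))]
    while len(seeds) < k:
        # Next seed: the point whose nearest already-chosen seed is farthest away.
        nxt = max(points, key=lambda p: min(math.dist(p, s) for s in seeds))
        seeds.append(nxt)
    return seeds

# Hypothetical 2-D data with three visible groups.
data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.2)]
seeds = maximin_seeds(data, 3)
print(seeds)
```

On this sample, each of the three groups contributes one seed, and reversing the input order yields the same set of seeds.
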
     Second, to address the shortcomings of grid-based and density-based clustering, a combined clustering method (CUBN) is proposed that integrates grid-based, density-based, and distance-based techniques. It can identify clusters of arbitrary shape, size, and density, filters noise effectively, has simple parameters, and does not need the number of clusters to be given in advance. Its time complexity is approximately linear, which makes it suitable for clustering large data sets.
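
As a rough illustration of the grid-and-density idea only (the abstract gives no details of CUBN, and its distance-based component is not reproduced here), the sketch below bins points into cells, keeps cells holding at least `min_pts` points, and joins adjacent dense cells into clusters. The function name, `cell_size`, `min_pts`, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict, deque
from itertools import product

def grid_density_clusters(points, cell_size, min_pts):
    """Label points by connected groups of dense grid cells; -1 marks noise."""
    # One pass over the data: assign every point to its grid cell.
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple(int(c // cell_size) for c in p)].append(i)
    dense = {key for key, idx in cells.items() if len(idx) >= min_pts}

    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first search over neighbouring dense cells.
        seen.add(start)
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for i in cells[cell]:
                labels[i] = cluster_id
            for offset in product((-1, 0, 1), repeat=len(cell)):
                nb = tuple(c + o for c, o in zip(cell, offset))
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        cluster_id += 1
    return labels

# Hypothetical data: two groups and one isolated point.
points = [(0.1, 0.1), (0.4, 0.3), (0.8, 0.6), (1.2, 0.4), (1.6, 0.9),
          (1.9, 0.2), (5.1, 5.2), (5.5, 5.4), (5.8, 5.9), (9.5, 0.5)]
labels = grid_density_clusters(points, cell_size=1.0, min_pts=3)
print(labels)
```

The number of clusters is never supplied: it falls out of how the dense cells connect. Binning takes one pass over the data, which is where grid methods of this kind get their near-linear running time in low dimensions.
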
     Third, after examining the shortcomings of existing hierarchical clustering methods, a new hierarchical clustering method (CMM) is proposed. It first uses a partitioning method to divide the data into atom clusters and then performs bottom-up hierarchical clustering on those atom clusters to obtain the final result. The method is insensitive to its input parameters, filters noise effectively, is robust to outliers and to wide variation in cluster size, and executes efficiently.
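
The two-phase structure described above (partition into atom clusters, then merge bottom-up) can be sketched as follows. This is not the CMM algorithm itself, whose partitioning step and merge criterion the abstract does not specify. Here a simple leader-style pass with a hypothetical `radius` forms the atoms, and centroid distance drives the merging, both chosen only for illustration.

```python
import math

def centroid(members):
    dim = len(members[0])
    return tuple(sum(p[d] for p in members) / len(members) for d in range(dim))

def atom_clusters(points, radius):
    """Phase 1: partition points into small atom clusters (leader-style pass)."""
    atoms = []
    for p in points:
        for members in atoms:
            # members[0] is the leader of this atom.
            if math.dist(p, members[0]) <= radius:
                members.append(p)
                break
        else:
            atoms.append([p])
    return atoms

def merge_atoms(atoms, n_clusters):
    """Phase 2: repeatedly merge the two clusters with the closest centroids."""
    clusters = [list(a) for a in atoms]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

# Hypothetical data: two well-separated groups.
points = [(0.0, 0.0), (0.3, 0.1), (1.0, 0.2), (1.2, 0.0),
          (8.0, 8.0), (8.2, 8.1), (9.0, 8.3)]
atoms = atom_clusters(points, radius=0.5)
clusters = merge_atoms(atoms, n_clusters=2)
print(len(atoms), [len(c) for c in clusters])
```

The quadratic pair search in phase 2 runs over atom clusters rather than individual points, which is the usual reason a two-phase design of this kind is cheaper than agglomerating the raw data.
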
     Finally, to address problems in university teaching evaluation systems, a prototype of a data-mining-based teaching evaluation system is developed on the basis of the clustering research. The Analytic Hierarchy Process (AHP), a technique from scientific decision-making, is introduced to establish the evaluation indicators and assign their weights. Building on a study of association rule algorithms, cluster analysis is applied to the college's evaluation data, which effectively reduces the amount of data to be analyzed and overcomes the unreasonableness of classifying purely by score.
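
To make the weight-assignment step concrete, here is a minimal sketch of the standard AHP calculation: weights are taken from the principal eigenvector of a pairwise comparison matrix, and a consistency ratio checks whether the judgments are coherent. The three criteria and the comparison values are invented for illustration and are not taken from the thesis. The random-index values are Saaty's classic ones, included here only for 3 to 5 criteria.

```python
def ahp_weights(matrix, iterations=100):
    """Return (weights, consistency_ratio) for a pairwise comparison matrix."""
    n = len(matrix)
    w = [1.0 / n] * n
    # Power iteration converges to the principal eigenvector.
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)           # consistency index
    ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]       # Saaty's random index (n = 3..5 only)
    return w, ci / ri

# Hypothetical criteria: teaching attitude, teaching content, teaching method.
# comparisons[i][j] states how much more important criterion i is than j.
comparisons = [
    [1.0,     3.0,     5.0],
    [1 / 3.0, 1.0,     2.0],
    [1 / 5.0, 1 / 2.0, 1.0],
]
weights, cr = ahp_weights(comparisons)
print([round(x, 3) for x in weights], round(cr, 4))
```

A consistency ratio below 0.1 is the conventional threshold for accepting the pairwise judgments; above it, the comparisons are usually revised before the weights are used.
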
