基于k-means的中文文本聚类算法的研究与实现

作者：张睿
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：文本聚类 ; k-means ; bisecting ; k-means ; 共享最近邻
英文关键词：Document Clustering ; k-means ; bisecting k-means ; Shared Nearest Neighbor
学位年度：2009
导师：刘晓霞
学科代码：081202
学位授予单位：西北大学
论文提交日期：2009-06-01

摘要

在机器学习、数据挖掘等领域得到普遍应用的k-means算法由于具有时间复杂度低的优点,在文本聚类领域也得到了广泛的应用。论文对文本聚类的相关技术与算法进行研究,针对文本数据高维性和稀疏性的缺点,改进了文本聚类中的特征选择方法,以及与k-means相关的算法,并在此基础上设计并实现了一个中文文本聚类原型系统。主要工作有:
     1)聚类领域进行特征选择时由于缺乏类信息而难以选择出最具类区分能力的特征词。在文档频率,单词贡献度两种特征选择方法的基础上,利用贪心算法对特征进行增量选择。实验表明改进的算法可以在保证聚类质量的前提下过滤更多的特征词。
     2)文本数据高维性和稀疏性的特点使得文本对象间的相似度不易度量,根据文本间的相似度为k-means算法选择的始聚类中心时可能不能很好的代表整个文本集。针对该缺点,对k-means算法中的初始化问题,提出一个改进的初始聚类中心选择方法。实验表明改进的方法选择到初始聚类中心比较分散且代表性好。
     3)为了提高聚类中簇的质量,通过引入共享最近邻相似度中邻居的概念,对bisecting k-means算法进行改进,实验结果表明该算法的聚类质量较原算法有一定的提高。
     在以上研究工作的基础上,实现了基于k-means的中文文本聚类原型系统。通过实验对系统中的各个算法进行了评测和比较。
As a widely used algorithm in machine learning and data-mining, k-means is also used in document clustering for its low time complexity .This paper mainly focus on the how to improve the performance of document clustering algorithm. Based on existing research, improved k-means algorithms and new feature selection method are proposed. Design and implement a Chinese document clustering System on the basis of the proposed algorithms. Works achieved in this paper are as follow:
     1) It is hard to select features for unsupervised feature selection methods used in clustering due to the lack of class label information. Based on document frequency and term contribution, greedy algorithm is introduced to select features incrementally .Experiments show that the proposed method can remove more features than traditional methods without degrading the clustering quality.
     2) In order to improve the clustering quality of k-means, well separated initial centroids should be selected. Initial centroids are aurally hard to select due to the high dimensionality and sparseness of document data. A new method for selecting initial centroids is proposed. Experiment show that the centroids selected by the proposed method are well separated and with high representative.
     3) In order to improve clusters quality of the bisecting k-means, neighbor used in shared nearest neighbor is introduced. Experiments show that the improved algorithm performs better than the original one.
     Design and implement a document clustering system using the algorithm mentioned above. Each algorithm in the system is contrasted and evaluated through experiments.

引文

[1]V.Hatzivassiloglou,J.L.Klavans,M.L.Holcombe,R.Barzilay,M.-Y.Kan,and K.R.McKeown.Simfinder:A Flexible Clustering Tool for Summarization[A].Proc.of the Workshop on Summarization in NAACL-01[C].Pittsburg,Pennsylvania,USA:2001
    [2]Oren Zamir,Oren Etzioni,Omid Madani,Richard M.Karp,Fast and Intuitive Clustering of Web Documents,KDD'97,1997:287-290.
    [3]林鸿飞,马雅彬.基于聚类的文本过滤模型[J].大连理工大学学报.2003,42(2):249-252.
    [4]Rauber,M.Fr(u|¨)hwirth.Automatically Analyzing and Organizing Music Archives[A].Proceedings of the 5th.European Conference on Research and Advanced Technology for Digital Libraries(ECDL 2001)[C].Darmstadt.Germany,2001:402-414
    [5]Cutting,D.,Karger,D.,and etc.Scatter/Gather:A Cluster-based Approach to Browsing Large Document Collections[A].SIGIR'92,1992[C].318 -329.
    [6]James Allan,Topic Detection and Tracking:Event-based Information Organization[M],Kluwer Academic Publishers,2002:1-16
    [7]Daphe Koller and Mehran Sahami.Hierarchically classifying documents using veryfew words,Proceedings of the 14th International Conference on Machine Learning(ML),Nashville,Tennessee,July 1997:170-178.
    [8]Zheng Chen,WeiYing Ma,Jinwen Ma.Learning to Cluster Web Search Results[A]Proceedings of the 27th Annual International ACM SIGIR Conference[C].Sheffield,South Yorkshire,UK,July 2004:210 -217
    [9]苏芳仲.中文Web文本挖掘的若干关键技术研究及其实现[D].福州大学硕士学位论文,2006.
    [10]MacQueen J.Some methods for classification and analysis of multivariate observations[A],proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability.[C].1967:281-297
    [11]Huang Z.Extensions to the k-means algorithm for clustering large data sets with categorical values[J].Data Mining and Knowledge Diseovery,1998,2(3):283-304
    [12]T Zhang,R Ramakrishnan,Mogihara.Birch:An effieient data clustering method for very large databases.In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data Montreal.Canada:June,1996:103-114
    [13]Viswanath P,Pinkesh R.I-DBSCAN:a fast hybrid density based clustering method.18th International Conference on Pattern Recognition,Hong Kong,2006:912-915
    [14]Kouae L.Using self-organizing maps for information visualization and knowledge discovery in complex geospatial database.21th International Cartographic Conference.Durban.2003:1674-1701
    [15]Bezdek J C.Pattern recognition with fuzzy objective function algorithms[M].New York:Plenum Press 1981
    [16]Steinbach M,KaryPis G,Kumar V.A comparison of document clustering techniques[A].Proceedings of KDD 2000 Workshop on Text Mining[C],2000:1-20
    [17]Ying Zhao.KaryPis G Hierarchical Clustering Algorithms for Document Datasets.Proceedings of Data Mining and Knowledge Discovery[C],2005 10(2):141-168.
    [18]Dhillon I S,Modha D S.Concept decompositions for large sparse text data usingclustering[J].Machine Learning,2001,42(1):143-175
    [19]Banerjee A,Dhillon I S,Ghosh J,S.Sra.Clustering on the Unit Hypersphere using Von Mises-Fisher Distributions Journal of Machine Learning Research(JMLR) 2005(6):1345-1382.
    [20]Beil F,Ester M,Xu X.Frequent term-based text clustering.In:Proc.8th Int.Conf.on Knowledge Discovery and Data Mining(KDD)' 2002[C],Edmonton.Alberta,Canada,2002:436-442
    [21]Fung B C M,Wang K,Ester M.Hierarchical Document Clustering Using Frequent Itemsets.In:Proc.of the 2003 SIAM Intl.Conf.on Data Mining(SIAM'03)
    [22]Hassan H.Malik and john R.Kender.High Quality,Efficient Hierarchical Document Clustering using Closed Interesting Itemsets Proceedings of the Sixth International Conference on Data Mining(ICDM'06) 2006
    [23]龙吴,冯剑琳,李曲.R-means:以关联规则为簇中心的文本聚类[J],计算机科学,2005(9):156-159
    [24]Zhuang L,Dai Honghua.A Maximal Frequent Itemset Approach for Web Document Clustering[A]Proc of the 4th Int'l Conf on Computer and Information Technology[C],2004:970-977.
    [25]顾益军,樊孝忠,王建华等.中文停用词表的自动选取[J].北京理工大学学报,2005,25(4):337-340.
    [26]庞剑锋,卜东波,白硕.基于向量空间模型的文本自动分类系统的研究与实[J].计算机应用研究,2001,18(9):23-26.
    [27]Gerald Salton,Wong A.,and Yang,C.S.Avector space model for automaticindexing[J].Comm.ACM,1975,18(11):613-620.
    [28]M.Easter,H.P.Kriegel,J.Snade,X.Xu.A density-based algorithm for diseovering Clustering large spatial databases[A].In proc.1996 Int.Conf.Knowledge Dlseovery And Data Mining(KDD'96)[C],226-231,Portland,OR,Aug.1996
    [29]张猛,王大玲,于戈.一种基于自动阈值发现的文本聚类算法,计算机研究与发展,2004,41(10):1745-1753.
    [30]周昭涛.文本聚类分析效果评价及文本表示研究[D],中国科学院研究生院硕士学位论文,2005
    [31]K.Van Rijsbergen.Information Retrieval[M].Butterworths,London.1979.
    [32]F.Beil,M.Este,X.Xu.Frequent term-based text clustenng[J].In Pore.2002 Int.Conf knowledge Diseovery and DataMining(KDD'02).NewYork,2002:436-442
    [33]Iwayama Makoto and Tokunaga Takenobu.Hierarchical Bayesian clustering for automatic text classification,Department of Computer Science Tokyo Institute of Technology,Tech Rep:TR95-0015,1995.
    [34]Y.Yang,O.Pedersen.A comparative study on feature selection in text categorization The ICML97[C],Nashville,1997
    [35]M.Rogati,Y.Yang.High performance feature selection for text categorization.The CIKM-02.Mclean.2002
    [36]L.Tao,L.Shengping,C.Zheng.et al.An evaluation on feature selection for text clustering.Proceedings of the twentieth international conference on machine learning The ICML-03,Washington DC,2003
    [37]姜宁,宫秀军,史忠植.高维特征空间中文本聚类研究[J]计算机工程与应用2002,10
    [38]Wilbur,I.W.,&Sirotkin,K.The automatic identification of stop words[J].Journal of Information Science,1992,18(1):45-55.
    [39]Dash,M.,& Liu,H.Feature Selection for Clustering[A].Proc.of PAKDD[C],2000:110-121
    [40]L.Liu,et al.A Comparative Study on Unsupervised Feature Selection Methods for Text Clustering[A].Proceeding of Natural Language Processing and Knowledge Engineering'05[C],October,November 2005:597-601
    [41]徐燕,李锦涛,王斌,孙春明.基于区分类别能力的高性能特征选择方法[J].软件学报,2008,1(1):82-89.
    [42]Nirmalie Wiratunga,Rob Lothian,and Stewart Massie.Unsupervised Feature Selection for Text Data[C].In Proceedings of the 8th European Coference on Case-Based Reasoning,340-354
    [43]R.A.Jarvis and E.A.Patrick.Clustering Using a Similarity Measure Based on Shared Nearest Neighbors[J].IEEE Transactions on Computers.1973,C-22(11):1025-1034
    [44]Sudipto Guha,Rajeev Rastogi,and Kyuseok Shim.ROCK:A Robust Clustering Algorithm for Categorical Attributes.In Proc.of the 15th Int.conf.on Data Engineering[J].IEEE Computer Society,March 1999:512-521
    [45]Brandley P.S.,Fayyad U.M.Refining initial points for K-Means Clustering Proc.of 15th International Conferene on Maehine Learning,SanFraneiseo,1998:91-99.
    [46]周涓,熊忠阳,张玉芳等.基于最大最小距离法的多中心聚类算法.计算机应用,2006,26(6):1425-1427
    [47]任江涛,施潇潇,孙婧昊,黄焕宇,印鉴.一种改进的基于特征赋权的k均值聚类算法[J].计算机科学,2006,33(7):186-187
    [48]张猛.文本聚类中参数自动设置技术的研究与实现[D].东北大学学位论文,2005
    [49]刘远超,王晓龙,刘秉权.一种改进的k-means文档聚类初值选择算法[J].高技术通讯,2006(1):11-15
    [50]李金宗.模式识别导论.北京.高等教育出版社.1994:294.356.
    [51]Vesanto J,Alhoniemi E.Clustering of the self-organizing map.IEEE Transactions on Neural Networks,2000,11(3):586-600

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700