基于相似度的文本聚类算法研究及应用

英文题名：Study on Similarity-based Text Clustering Algorithm and It's Application
作者：曾路平
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：文本聚类 ; 单词相似度 ; K-Means ; 向量空间模型 ; 公安情报
英文关键词：Text Clustering ; Word Similarity ; K-Means ; Vector Space Model ; Public Security Information
学位年度：2009
导师：李星毅
学科代码：081203
学位授予单位：江苏大学
论文提交日期：2009-04-01

摘要

文本聚类是文本挖掘的一个重要分支,因其独特的知识发现功能而得到较为深入的研究。文本聚类算法已经在文档自动整理、检索结果的组织和数字图书馆服务等方面得到了广泛的应用。但是在应用中随着文本集的不断扩大,传统的文本聚类算法遇到了一些难以克服的困难,算法忽略了文本中单词之间的语义相关性,算法聚类结果不稳定等。论文主要针对以上问题对文本聚类进行研究。
     论文首先详细介绍了传统的文本聚类算法,并对其进行比较和分析。其次,为了解决向量空间模型忽略单词之间的语义相关性的问题,提出了一种基于单词相似度的文本聚类算法(TCWS);针对传统K-Means算法聚类结果不稳定的缺点,提出了一种基于文本平均相似度的K-Means算法(KAAST)。最后,将研究成果应用到公安情报系统中。本文的主要研究内容概括如下:
     (1)介绍了常用文本聚类算法,并从伸缩性、多维性、处理高维数据的能力等方面对常用文本聚类算法进行分析和比较。
     (2)提出一种基于单词相似度的文本聚类算法(TCWS)。该算法首先利用单词相似度对单词进行聚类获得单词之间的语义相关性,然后利用产生的单词类作为向量空间模型的项表示文本,降低了向量空间的维度,最后采用基于划分聚类算法对文本聚类。实验表明TCWS算法提高了聚类结果的正确性。
     (3)提出一种基于文本平均相似度的K-Means算法(KAAST)。该算法首先构造文本平均相似度集合,其次从集合中选取当前平均相似度最大的文本作为初始聚类中心,同时删除集合中与其簇相关的文本,这样选取出的中心点不但具有代表性且分散,最后利用选取的中心作为K-Means算法的初始聚类中心对文本聚类。实验表明KAAST算法的稳定性有较大的提高。
     (4)在理论研究的基础上,将本文提出的算法应用到公安情报系统中,并设计和实现了文本聚类子系统,提高了情报处理的效率和正确性。
Text Clustering is an important branch of Text Mining,which has get more depth research because of its unique knowledge discovery functions.Today,there are lots of efficient text clustering algorithms which have been widely used in the automatic document finishing,the organization of search results and digital library services.However,with expansion of document sets,traditional text clustering algorithm encountered a number of insurmountable difficulties.For instance, algorithm ignores the semantic correlation between words,the instability of result. These papers mainly for the above problems do some research on text clustering.
     First,we introduce the traditional text clustering algorithms.We compare and analyze the traditional text clustering algorithms.Secondly,to solve the vector space model ignoring the semantic correlation between words,we propose a text clustering algorithms based on word similarity(TCWS).Due to the traditional K-Means algorithms have an shortcoming of clustering results instability,we propose a K-Means algorithms based on average similarity of text(KAAST).Finally,research results be applied to public security information system.The works in this article as follows:
     (1) Introduced to the traditional text clustering algorithm,and they were compared and analyzed from the scalability,multi-dimensional,dealing with high dimensional data and so on.
     (2) We propose a text clustering base on words similarity algorithm.First of all, TCWS algorithm use of word similarity classification of words,access to word semantic relevance between words,and then make use of the word classification as a vector space model category of items with text that reduced dimension of vector space model,finally,used partitioning clustering algorithm.Experiments showed that TCWS algorithm improve the accuracy of clustering results.
     (3) We propose a K-Means base on average similarity of text algorithm.First of all,structural average similarity of text collection,Secondly,selected from collection of the greatest average similarity of the text as the initial cluster center,at the same time,needs to delete the text which cluster associated with the initial cluster center. Selected initial cluster center not only on behalf of and scattered.Finally,used to the selected center as the initial cluster centers of K-Means algorithm.Experiments showed that KAAST algorithm improve d stability.
     (4) According to above theory research,the algorithms presented in this article are used to the public security information system,and Design and Implementation of a text clustering system,which can improve efficiency and correctly.

引文

[1]M.Fayyad,Gregory Piatetsky Shapiro,Padhraic Smyth.Advances in Knowledge Discovery and Data Mining[M].California:AAA/MIT Press,1996.
    [2]Mitra S,Pal,S.K,Mitra P.Data Mining in soft computing Framework:a suvery[C].In Proc of IEEE Transactions on Neural Networks,2002,13(1):3-14.
    [3]Ming-syan Chen,Jiawei Hun,Philip S.YU.Data mining:An overview from a database perspective[J].IEEE Transactions on Knowledge Data Engineering,1996,8(6):866-883.
    [4]梁循.数据算法与应用[M].北京:北京大学出版社,2006.
    [5]Pang-Ning Tan,Michael Steinbach,Vipin Kumar.Introduction to Data mining[M].US:Addison-Wesley Press,2006.
    [6]Zeng H J,He Q C,Chen Z.Learning to cluster web search results.In Proc of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,2004:210-217.
    [7]Florian Beil,Martin Ester,Xiaowei Xu.Frequent Term-Based Text Clustering[C].In Proc of the 8th ACM SIGKDD international conference on Knowledge discovery and data mining,2002:436-442.
    [8]Chunling Chen,Frank S.C.Tseng,Tyne Liang.Hierarchical Document Clustering Using Fuzzy Association Rule Mining[C].In proc of the 3th International Conference on Innovative Computing Information and Control,2008:326-329.
    [9]Huifeng Shi,Qiang Hua,Po Zhang.The Formal Concept Analysis of the documents clusters[C].In Proc of the 6th International Conference on Machine Learning and Cybernetics,Hong Kong,2007:3381-3385.
    [10]Huang Zhe Xue.Extensions to the k-means Algorithm for clustering large data sets with categorical values[J].Data Mining and Knowledge Discovery,1998,2(1):283-304.
    [11]Huang Zhe Xue.Clustering Large Data Sets with Mixed Numeric and Categorical Values[C].In Proc of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining:Singapore,World Scientific,1997:21-35.
    [12]Chen Ning,Chen An,Zhou Long Xiang.Fuzzy K-Prototypes Algorithm for Clustering Mixed Numericand Categorical Valued Data[J].Journal of software,2001,12(8):1107-1119.
    [13]梅馨,刑桂芬.文本挖掘技术综述[J].江苏大学学报,2003,24(5):72-76.
    [14]Kogan J.Hybrid Clustering of Large TextData[C].Advanced Information Networking and Applications Workshops,2007,21(1):367-372.
    [15]Sudipto Guha,Rajeev Rastogi,Kyuseok Shim.CURE:An Efficient Clustering Algorithm for Large Databases[C].In Proc of the ACM SIGMOD conference on Management of Data,Seattle,1998:73-84.
    [16]Sergio M Savaresi,Daniel L Boley,Sergio Bittanti.Cluster selection in divisive Clustering algorithms[C].In Proc of the 2nd SIAMICDM,Arlington,2002:299-314.
    [17]Ester M,Kdefel H P,Sander J.A Density-based Algorithm for Discovering Clustering in Large Spatial Databases[C].In Proc of the ACM SIGMOD international Conference Management of Data Montreal.Canada,1996:226-231.
    [18]W Wang,J Yang,R Muntz.STING:A statistical information grid approach to Spatial Data Mining[C].In Proc of the 23th ACM-SIGMOD international Conference on Very Large Database,Greece,1997:186-195.
    [19]Huidong Jin,Man-Leung Wong,KwongSak Leung.Scalable model-based clustering by working on data summaries[C].In Proc of the 3th IEEE International Conference on Data Ming,2003:91-98.
    [20]Enrique H.Ruspini.A New Approach to Clustering[C].Information and Control,1969,15(1):22-32.
    [21]Miin Shen Yang,KuoLung Wu,Jian Yu.A Novel Fuzzy Clustering Algorithm[C].IEEE International Sympostium on Computational Intelligence in Robotics and Automation,Japan,2003:647-652.
    [22]陈桂林,王永成,韩客松.一种改进的快速分词算法[J].计算机研究与发展,2000,37(4)418-424.
    [23]顾益军,樊孝忠等.中文停用词表的自动选取[J].北京理工大学学报,2005,25(4):337-339.
    [24]Imola K.Fodor.A survey of dimension reduction techniques[R].LLNL technical report,2002.
    [25]秦春秀,赵捧末,刘怀亮.词语相似度计算研究[J].情报理论与实践,2007, 30(1):105-108.
    [26]苟恩东,颜伟.基于语义网计算英语词语相似度[J].情报学报,2006,25(1):43-48.
    [27]李昕,郑宇,江芳泽.用改进的RPCL算法提取聚类的最佳数目[J].上海大学学报,1999,5(5):409-413.
    [28]Lei Xu,Krzyzaka,Ojae.Rival penalized competitive learning for clustering analysis,RBF net and curve detection[C].IEEE Transactions on Neural Networks,1993,4(4):636-649.
    [29]U.Fayyad,Cory Reina,P.S.Bradley.Initialization of iterative renement clustering algorithms[C].In Proceedings of the Conference on Knowledge Discovery in Databases,1998:194-198.
    [30]赖玉霞,刘建平,杨国兴.基于遗传算法K-均值聚类分析[J].计算机工程,2008,34(20):200-202.
    [31]K.Krishna,M.Narasimha Murty.Genetic K-Means Algorithm[C].IEEE Transactions on System,MAN and CYBERNETICS-PART B,1999,29(3):433-439.
    [32]Ujjwal Maulik,Sanghamitra B.Genetic algorithm-based clustering technique[C].Pattern Recognition,2000,33(9):1455-1465.
    [33]范明,孟小峰等译.数据挖掘概念与技术(第一版)[M].北京:机械工业出版社,2006.
    [34]万小军,杨建武,陈晓鸥.文档聚类中k-means算法的一种改进算法[J].计算机工程,2003,29(2):102-103.
    [35]Mu-Chun Su,Chien-Hsing Chou.A modified version of the K-means algorithm with a distance based on Cluster symmetry[C].IEEE Transactions on Pattern Analysis and Machine Intelligence,2001,23(6):674-680.
    [36]林嘉宜,徐剑峰,彭宏.一种新的中心对称聚类算法[J].计算机科学,2003,30(6):136-138.
    [37]Chaudhuri D,Chaudhuri B.B.A Novel Multiseed Nonhierarchical Data Clustering Technique[C].IEEE Transactions on System,Cybernetics PartB,1997,27(5):871-876.
    [38]Dan Pelleg,Andrew Moore.Accelerating exact k-means with geometric reasoning[C].In Proc of the 5th ACM SIGKDD international conference on Knowledge discovery and data mining,1999:277-281.
    [39]Vance Faber.Clustering and the Continuous K-Means Algorithm[J].Los Alamos Science,1994,22:138-144.
    [40]秦钰,荆继武,向继.基于优化初始中心点的k-means改进算法[J].中国科学院研究生院学报,2007,24(6):771-777.
    [41]P.S,Bradley,Usama M.Fayyad.Ref'ming Initial Points for K-Means Clustering[C].In Proc of the 5th International Conference on Machine Learning,San Francisco,1998:91-99.
    [42]刘远超,王晓龙,刘秉权.一种改进的k-means文档聚类初值选择算法[J].高技术通讯,2006,1(16)11-15.
    [43]李飞,薛杉,黄亚楼.初始中心优化的k-means聚类算法[J].计算机科学,2002,29(7):94-96.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700