基于相似度的文本聚类算法研究及应用

设为首页

收藏本站

网站地图 | English | 公务邮箱

远程访问

NSTL服务站

基于相似度的文本聚类算法研究及应用

详细信息本馆镜像全文| 推荐本文 | | 获取CNKI官网全文

英文题名：Study on Similarity-based Text Clustering Algorithm and Its Application
作者：马素琴
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：文本挖掘 ; 文本聚类 ; 文本表示 ; 语义列表 ; 相似度计算 ; 聚簇表示 ; DBSCAN算法 ; TDCAOTV算法 ; 分位数
英文关键词：text mining ; text clustering ; text representation ; semantic list ; similarity calculation ; cluster representation ; DBSCAN algorithm ; TDCAOTV algorithm ; quantile
学位年度：2010
导师：施化吉
学科代码：081203
学位授予单位：江苏大学
论文提交日期：2010-05-01

摘要

文本聚类是文本挖掘的一项重要技术,可广泛应用于文本挖掘与信息检索等方面,在大规模文本集的组织与浏览、文本集层次归类的自动生成等方面都具有重要的应用价值。但是,传统的文本聚类算法忽略了文本中单词之间的语义相关性,存在聚类结果不稳定等问题。论文主要钳对以上问题对文本聚类进行研究。
     论文先论述了文本挖掘的相关知识,分析了文本聚类的必要性及国内外研究现状,并介绍了传统的文本聚类算法,并对其进行比较和分析。重点对文本表示方法及DBSCAN算法做了深入研究,对相关算法进行改进,并在此基础上设计一个文本聚类系统。本文主要工作如下:
     (1)介绍常用文本聚类算法,并从伸缩性、多维性、处理高维数据的能力等方面对常用文本聚类算法进行分析和比较。
     (2)提出一种基于语义列表的文本聚类算法,该算法利用语义相似度计算文本的相似度,获得文本的语义相关性,采用语义列表中的同义词近义词指针降低单词的冗余度,降低了文本数据的维度,最后采用基于划分聚类算法对文本聚类。实验表明此算法提高了聚类结果的正确性。
     (3)对聚类算法DBSCAN进行改进,提出一种阈值优化的文本密度聚类算法。该算法首先使用k近邻距离对对象进行排序,并通过分位数区分密度不同的各序列,找到与其对应的优化,根据优化阈值使用密度聚类方法对对象进行聚类。改进后的聚类算法克服了阈值选取对聚类结果的影响,提高了聚类精确度和时间效率。文章采用树形结构存储聚簇,增加了聚簇的可读性。实验结果证明了该算法的有效性。
     (4)在理论研究的基础上,将本文提出的文本聚类算法应用于文本数据集中,设计一种文本聚类系统,该系统提供了预处理模块、语义列表模块、聚类算法模块、结果评估模块,分析系统各个模块的主要功能及其应用,结果表明该系统具有良好的可扩展性、灵活性。
Text Clustering is an important branch of Text Mining, which has get more depth research because of its unique knowledge discovery functions. Today, there are lots of efficient text clustering algorithms which have been widely used in the automatic document finishing, the organization of search results and digital library services. However, with expansion of document sets, traditional text clustering algorithm encountered a number of insurmountable difficulties. For instance, algorithm ignores the semantic correlation between words, the instability of result. These papers mainly for the above problems do some research on text clustering.
     In the first place, this paper discusses some knowledge of text mining, and analyzes the necessity of text clustering and the research actuality of text clustering at home and abroad. Then the traditional text clustering algorithms are introduced, and which are compared and analyzed. It puts more emphases on the deep study of document representation and DBSCAN algorithm and makes the improvement towards related algorithms, meanwhile designs a text clustering system based on the previous theories. The works in this paper is as follows:
     (1) Introduced to the traditional text clustering algorithms, and they were compared and analyzed from the scalability, multi-dimensional, dealing with high dimensional data and so on.
     (2) In order to represent documents, this paper presents the Chinese text clustering algorithm using semantic list. First of all, the algorithm use of semantic similarity to compute text similarity, access to text semantic relevance between texts, and then make use of synonym or near-synonym of the semantic list to reduce redundancy of the words that reduced dimension of texts. Finally, used partitioning clustering algorithm. Experiments showed that CTCAUSL algorithm improve the accuracy of clustering results.
     (3) A text density clustering algorithm with the optimized threshold values is proposed to solve the problem of reduced clustering performance of the DBSCAN algorithm because of global threshold values. The proposed algorithm sorts objects with k-neighbor distance, and discerns arrays with different densities by quantile, and finds the corresponding optimization, then carries out clustering of objects using density clustering algorithm based on optimized threshold values. The advanced clustering algorithm has overcome the problem of reduced clustering performance caused by threshold values selection, and has improved clustering accuracy and efficiency. The paper stores clusters with tree structure, and has made clusters more legible. The experimental results show the effectiveness of the algorithm.
     (4) On the basis of studying theory, the algorithms presented in this article are used in the text collection, and Design of a text clustering system, which provide pretreatment module, semantic list module, text clustering module and result evaluation module. From the analysis of the main functions of each module of system architecture and its application, it shows that the system has good extensibility and flexibility.

引文

[1].S.Das,A.Konar,Automatic image pixel clustering with an improved differential evolution,Applied Soft Computing 9(2009) 226-236.
    [2].S.Das,A.Abraham,A.Konar,Automatic clustering using an improved differential evolution algorithm,IEEE Transaction on Systems,Man.and Cybernetics - Part A:Systems and Humans 38(2008) 218-237.
    [3].S.Das,A.Abraham,A.Konar.Automatic clustering with a multi-elitist particle swarm optimization algorithm,Pattern Recognition Letters 29(2008)688-599
    [4].Dhillon I.S,Co-clustering documents and words using bipartite spectral graph partitioning,Proceedings of the 7th ACM Conference on Knowledge Discovery and Data Mining.New York,ACM Press,2001:269-274.
    [5].Dhillon,Kogan:Refining Clustering in High Dimensional Text Data,http://www.cs.utexas.edu/users/inderjit/public papers/refine.pdf,2002.
    [6].Yang Y.,Pedersen J.O.,A comparative study on feature selection in text categorization,Proceedings of 14th International Conference on Machine Learning.Nashville,Morgan Kaufmann Publishers,1997:412-420.
    [7].Aggarwal C.C.,Yu P.S.,Finding generalized projected clusters in high dimensional spaces,Proceedings of the 2000 ACM SIGMOD international conference on Management of data,New York,ACM Press,2000:70-81.
    [8].Deerwester S.C.,Dumais S.T.,Landauer T.K.,et al.:Indexing by Latent Semantic Analysis,Journal of the American Society of Information Science,vol.41,1990:391-407.
    [9].Furnas G.W.,Gomeal M.,Gornezl M.,Statistical semantics:Analysis of the potential performance of key-word information systems,Bell System Technical Journal,vol.62(6),1983:1753-1806.
    [10].C.F.Xie,X.Li,A Sequence-based Automatic Text Classification Algorithm,Journal of software,vol.13(4),2002:783-788.
    [11].Zelikovitz S.,Hirsh H.,Using LSI for text classification in the presence of background text,Proc.of CIKM01,New York,ACM Press,2001:113-118.
    [12].张春霞,天用.汉语自动分词的研究现状与困难.系统仿真学报,2005,17(1):138-143.
    [13].金翔宇,孙正兴,张福炎.一种中文文档的非受限无词典抽词方法.中文信息学报.2001, 15(6):33-39.
    [14].韩客松,王永成,陈桂林.无词典高频字串快速提取和统计算法研究.中文信息学报,2001,15(2):23-30.
    [15].吴应良,韦岗,李海洲.一种基于N-Gram模型和机器学习的汉语分词算法.电子与信息学报,2001,23(11):1148-1153.
    [16].郭祥昊,钟义信,杨丽.基于两字词簇的汉语快速自动分词算法.情报学报,1998,17(5):352-357.
    [17].WANG Y,HUANG S.Apriori and N-gram Based Chinese Text.Feature Extraction Method.Journal of Shanghai Jiaotong University(Science),2004,9(4):11-14.
    [18].Imola K.Fodor.A survey of dimension reduction techniques[R].LLNL technical report,2002.
    [19].顾益军,樊孝忠等.中文停用词表的自动选取[J].北京理工大学学报,2005,25(4):337-339.
    [20].Vijay V.Raghavan,S.K.M.Wong.A Critical Analysis of Vector Space Model for Information Retrieval.Journal of the American Society for Information Science,1986,37(2):279-287.
    [21].宫秀军,史忠植.基于Bayes潜在语义模型的半监督Web挖掘.软件学报,2002,013(008):1508-1514.
    [22].姜宁,史忠植.文本聚类中的贝叶斯后验模型选择方法.计算机研究与发展,2002.39(5):580-587.
    [23].李涓子,黄昌宁.语言模型中一种改进的最大熵方法及其应用.软件学报,1999,10(3):257-263.
    [24].赵军,黄昌宁.汉语基本名词短语结构分析模型.计算机学报,1999,22(2):141-146.
    [25].屈刚,陆汝占.基于特征的汉语词性标注模型.计算机研究与发展,2003,40(4):556-561.
    [26].S.Das,A.Konar,Automatic image pixel clustering with an improved differential evolution,Applied Soft Computing 9(2009) 226-236.
    [27].S.Das,A.Abraham,A.Konar,Automatic clustering using an improved differential evolution algorithm,IEEE Transaction on Systems,Man,and Cybernetics - Part A:Systems and Humans 38(2008) 218-237.
    [28].S.Das,A.Abraham,A.Konar,Automatic clustering with a multi-elitist particle swarm optimization algorithm,Pattern Recognition Letters 29(2008)688-699.
    [29].Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 1998, 2(2):283-304.
    [30].L. Kaufman, P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis.New York: John Wiley & Sons. 1990.
    [31].M. Steinbach, Ge. Karypis, and V. Kumara. A Comparison of Document Clustering Techniques. KDD-2000 Workshop on Text Mining. August 20-23, 2000, Boston MA USA.109-110.
    [32].Kaufman L., Rouseeuw P. J., Finding group in data: an introduction to cluster analysis, New York, John& Sons. 1990.
    [33].Zhang T, Ramakrishnan R. Livny M. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996, pp. 103-114.
    [34].Guha S., Rastogi R., Shim K. CURE: an efficient clustering algorithm for large database.Information Systems. 26(1). 2001, pp.35-58.
    [35].S. Guha, R. Rastogi, K. Shim. ROCK: A robust clustering algorithm for categorical attributes, Information Systems, 25(5), 2000, pp.345-366.
    [36].Karypis G, Han E. H., Kumar V. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Computer, 32(8), 1999, pp.68-75.
    [37]. M Ester H, P Ktiege. A density-based algorithm for discovery clusters in large spatial databases. In Proc 1996 Int Conf Knowledge Discovery and Data Mining. Portland: Aug,1996:226-231.
    [38].J. W. Shavlik, T. G. Dietterich, Reading in Machine Learning, San Mateo, CA: Morgan Kaufmann, 1990.
    [39].Ruspini H E. A new approach to clustering. Froc 2nd Int Conf Knowledge Discovery and Data Mining, Portland: Dec, 1969:126-131.
    [40].Bezdek J C. Pattern recognition with fuzzy objective function algorithms [M]. New York:Plenum Press, 1981.
    [41].索红光,王玉伟.一种用丁文本聚类的改进k-means算法[J].山东人学学报,2008,43(1):60-64.

    [42]. Wei Song, Song Cheol Park. A novel document clustering model based on latent semantic analysis[C].Third International Conference on Semantics,Knowledge and Grid,IEEE,2007.
    [43].冯少荣,肖文俊.基于语义距离的高效文本聚类算法[J].华南理工大学学报,2008,36(5):30-37.
    [44].彭京,杨冬青,唐世渭,付艳,蒋汉奎.一种基于语义内积空间模型的文本聚类算法[J].计算机学报,2007,30(8):1354-1363.
    [45].PANDYA A,BHAT TACHARYA P.Text similarity measurement using concept representation of texts[C]//Proceedings of First International Conference on Pattern Recognition and Machine Intelligence.Berlin.Germany:Springer.2005:678-689.
    [46].SUB SHUANG,ZHANG YONG.Clustering method based on semantic similarity[J].Journal of Nanjing University of Aeronautics and Astronautics.2006,38(6):712-716.
    [47].Baeza-Yates,R.and Ribeiro-Neto,B.Modern Information Retrieval,1~(st) ed.Addison-Wesley-Longman,Reading,MA,1999.
    [48].验室语料库[DB/OL].[2009-03-12]:http://www.sogou.com/labs/resources.html
    [49].LEEHAO-BUPT,ICTCLAS中文分词工具[EB/OL].[2008-12-05]http://download.csdn.net/source/377228.
    [50].MINQIANG LI,LIANG ZHANG.Multinomial mixture model with feature selection for text clustering.Knowledge Based Systems 21(2008)704-708.www.elsevier.com/locate/knosys.
    [51].Morteza Haghir Chehreghani,Hassan Abolhassani,Mostafa Haghir Chehreghani.Improving density based methods for hierarchical clustering of web pages[J].Data and Knowledge Engineering,2008,67(1):30-50.
    [52].Ester,Martin,Hans Peter Kriegel et al.A density-based algorithm for discovering clusters in large spatial databases with noise[C]//Proc of the 1996 2nd Int'l Conf on Knowledge Discovery and Data Mining Portland:AAAI Press,1996:226-231.
    [53].Wei Song,Song.Cheol Park.Genetic Algorithm-based Text Clustering Technique:Automatic Evolution of Clusters with High Efficiency[C].7th International Conference on Web-Age Information Management Workshops,WAIM 06,Hong Kong,China,2006:17-17.
    [54].Zeng H J,He Q C,Chen Z.Learning to cluster web search results.In Proc of the 27th Annual International ACM SIGIR Conference on Research and Development in information Retrieval. 2004:210-217.

    [55]. Gong Yue-Jiao .Xu Rui-Tian.Zhang Jun.Liu, Ou. A clustering-based adaptive parameter control method for continuous ant colony optimization. 2009 IEEE International Conference on Systems, Man and Cybernetics, SMC (2009) 1827-1832.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700