基于遗传算法与模糊聚类的文本分类研究

英文题名：Research on Text Categorization Based on Genetic Algorithm and Fuzzy Clustering
作者：于水英
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：文本分类 ; 聚类分析 ; 遗传算法 ; 模糊C-均值聚类
英文关键词：text categorization ; clustering analysis ; genetic algorithm ; fuzzy C-means clustering
学位年度：2009
导师：丁华福
学科代码：081203
学位授予单位：哈尔滨理工大学
论文提交日期：2009-03-01

摘要

随着数据的爆炸式增长,信息处理已经成为人们获取有用信息不可缺少的工具,文本分类也已成为重要研究方向。作为非监督学习方法的模糊聚类分析已成为文本分类研究的热点,对基于模糊聚类的文本分类研究具有重大的理论和现实意义。然而,模糊聚类算法存在初始值敏感问题。因此,本文提出了一种遗传算法优化模糊聚类的文本分类算法。
     本文对模糊C-均值(FCM)聚类算法的一种改进算法-特征加权的FCM(WFCM)聚类算法,与FCM算法进行了测试比较。结果表明,WFCM聚类算法提高聚类的正确率。遗传算法是一种高效率的随机全局优化搜索算法,本文将遗传算法与FCM结合产生基于遗传算法的特征加权的FCM(WFCM)聚类算法(GWFCM),充分发挥FCM的局部搜索和遗传算法的全局搜索能力。本文在研究现有聚类类别数目自动学习的基础上,对聚类的有效性判断加以改进,在算法中动态改变聚类类别数目,以提高聚类的有效性和精确性。
     针对编码特征的问题,本文引入一个基因平均差异度的概念,算法的执行过程中,交叉和变异算子,动态地计算基因平均差异度值,使用该值以限制适应度差的个体产生,从而优化了遗传算法的执行性能。这种聚类方法在性能上比经典的聚类算法有较大的改进,它通过非线性映射能够较好地分辨、提取并放大有用的特征。
     由于在遗传算法的应用中,采用了比例选择算子,会产生进化早期的早熟收敛和进化后期的搜索效率下降等问题。为此,本文提出一种非线性排序选择机制。在群体进化过程中,本文实施精英基因引入策略确保了遗传进化的稳定性,避免无效解的扩散,从而保证了算法的收敛性,确保了遗传进化的稳定性,提高了对聚类中心的搜索效率。
     为了验证本文所提算法的高效性和可行性,我们将GWFCM与FCM、WFCM进行,抽取大量文本进行实验。通过实验可以看出GWFCM较WFCM的查准率、查全率和F1值分别提高了0.030、0.022、0.026,GWFCM算法相对于其它方法在文本分类和聚类中具有很好的表现。
Along with the data’s explosive growing, information processing has become a indispensable tool for people to acquire useful message, so that text categorization is the important research direction. Fuzzy clustering analysis, as a kind of unsupervised learning methods, is a research hotspot concerning about text categorization. Therefore the research of text categorization based on fuzzy clustering is hence of great theoretical and practical significance. However, fuzzy clustering algorithm exist initial value sensitivity problem. Therefore, In this paper, a fuzzy clustering algorithm based on genetic algorithm is proposed.
     This paper test and comparison of fuzzy C-means clustering(FCM) and weighted FCM(WFCM) clustering algorithm, which is a improvement of FCM. the results show that WFCM clustering algorithm improved the fuzzy clustering’s accuracy rate. Genetic algorithms are a high efficient global optimization stochastic search algorithm, this paper combines genetic algorithm with WFCM, the characteristics of weighted FCM clustering algorithms based on genetic algorithms (GWFCM) is put forward, which making full use of FCM local search virtue and global search ability of genetic algorithm. In this paper, at the basis of study clustering class number automatically learning, improve the effective judgment of clustering, dynamic changes clustering class number in algorithm, the validity and precision of clustering is advanced.
     Aiming at coding characteristics problems, in this article a concept of degree of genetic variation is introduced. In the algorithm implementation process, crossover and mutation operator, the dynamically calculated value of genetic variation, the value to limit the bad fitness individual production, So as to the optimize execution performance of genetic algorithm. This clustering method is greater improvement than classical clustering algorithms in performance. Through non-linear mapping, it can better distinguish extract and amplify useful features.
     Due to using proportional selection operator in the application of genetic algorithm, there are some questions, which are premature convergence in early evolution and search efficiency decline late evolution. For these reason, in this paper a kind of non-linear selection mechanism is proposed. In the group improvement process, this paper propose elite gene introduce policy, So as to ensure the stability of genetic evolution, on cluster centers improve search efficiency.
     In order to confirm the efficiency and feasibility of our algorithm, We compare GWFCM with FCM and WFCM. extract a lot of texts experimentize, The experiment results shown that the precision、recall and F1 improved 0.030、0.022and 0.026 separately. GWFCM has better performance than other methods in text categorization and clustering.

引文

[1] CHEN M-S,HAN J,YU P S.Data mining:an overview from a database perspective[J].IEEE Transactions on Knowledge and Data Engineering,1996,8(6):866-883.
    [2]阳琳赞,王文渊.聚类融合方法综述[J].计算机应用研究.2005,12(3):8-12.
    [3] FAYYAD U,PIATETSKY-SHAPIRO G,SMYTH P,UTHURUSAMY R.Advances in knowledge discovery and data mining [M].AAAI/MIT Press,1996:73-89.
    [4]胡红霞,王振兴.搜索引擎技术的现状及发展趋势[J].信息工程大学学报,2005,12(4):66-69.
    [5] FABRIZIO SEBASTIANI.Machine learning in automated text categorization [J].ACM Computing Surveys,2002,1(34):l-47.
    [6]薛德军,张钱,孙茂松.中文文本自动分类中的关键问题研究[D].北京:清华大学(博士),2004:35-48.
    [7]卢忠良,王家云.一种基于模糊聚类的汉语文本自动分类[J].计算机应用,2002:49-50.
    [8] I A SARAFIS,P W TRINDER,A S ZALZALA.NOCEA:A rule-based evolutionary algorithm for efficient and effective clustering of massive high-dimensional databases[J]. Applied Soft Computing , 2007 , 7(3) : 668-710.
    [9]朱巧明,李培峰,吴娴,等.中文信息处理教程[M].北京:清华大学出版社,2005:203-286.
    [10]戴文华.基于混合并行遗传算法的文本分类研究[D].武汉:华中师范大学(硕士),2007:4-7.
    [11] PIERPAOLO D'URSO.Fuzzy Clustering for Data Time Arrays With Inlier and Outlier Time Trajectories [J].IEEE Transactions on Fuzzy Systems,2005,13(5):583-640.
    [12] SALTON G,BUCKLEY B.Term-Weighting approaches in automatic text retrieval [J].Information Processing and Management,1988,24(5):513-523.
    [13] MECALLUM A,NIGAM K.A comparison of event models for naive Bayes text classification [C] . the AAAI-98 Workshop on Leaning for Text Categorization,Menlo Park,1998:41-48.
    [14]刘建舟,何婷婷,姬东鸿.基于开放式语料的汉语术语的自动抽取[C].第二十届东方语言计算机处理国际学术会议,沈阳,2003,43-49.
    [15] ZELIKOVITZ S,HIRSH H.Using LSI for Text Classification in the presence of Back ground Text [C].Proc of CIKM01,Georgia,2005:113-118.
    [16] PATRICK PANTEL , DEKANG LIN . Document Clustering with Committees [J].Department of Computing Science,2002,2:199-206.
    [17]张莹.基于自主学习的中文文本分类算法研究[D].哈尔滨:哈尔滨工业大学(硕士),2006:6-12.
    [18]代六玲,黄河燕,陈肇雄.中文文本分类中特征方法的比较研究[J].中文信息学报,2006,18(1):26-32.
    [19] HAN J,KAMBER M.Data Mining:Concepts and Techniques [M].San Francisco:Morgan Kaufmann Publishers,2001:211-219.
    [20]李雄飞,李军.数据挖掘与知识发现[M].北京:高等教育出版社,2003:56-63.
    [21]谢立宏.面向高维数据的聚类算法研究[D].武汉:武汉大学(博士),2005:62-71.
    [22]于剑.论模糊C均值算法的模糊指标[J],计算机学报,2003,26(8):968-973.
    [23]孙即祥.现代模式识别[M].长沙:国防科技大学出版社,2002:39-46.
    [24] M ESTER,H KRIEGEL,J SANDERETAL.A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise [C].In Second International Conference on Knowledge Discovery and Data Mining,Portland Oregon,1996(8):226-231.
    [25]孟海东,张玉英.基于密度和对象方向聚类算法的改进[J].计算机工程与应用,2006,20:154-156.
    [26] D G ROUSSINOV,H CHEN.A Scable Self-organizing Map Algorithm for Textual Classification:A Neural Net-work Approach to Thesaurus Generation [J].Communication and Cognition-Artificial Intelligence,1998,15(1):81-111.
    [27] CHEN NING,CHEN AN,ZHOU LONG XIANG.An Incremental Grid Density-Based Clustering Algorithm [J] . Journal of Software , 2007 ,13(1):1-7.
    [28] JI HE+,MAN LAN,CHEW-LIM TAN,et al.Initialization of cluster refinement algorithms:a review and comparative study [C].Proceeding of International Joint Conference on Neural Networks,Budapest,2004:297-302.
    [29] MOTH'D BELAL AI-DAOUD . A New Algorithm for Cluster Initialization . Transactions on Engineering [J] . Computing and Technology,2005,4:74-76.
    [30] MICHAEL STEINBACH , GEORGE KARYPIS , VIPIN KUMAR . A Comparison of Doeument Clustering Techniques [J].In KDD Workshop on Text Mining,2000:120-143.
    [31] YING ZHAO,GEORGE KARYPIS.Criterion Functions for Document Clustering-Experiments and Analysis[C] . Technical RePort TR#01-40 ,Department of Computer Science,University of Minnesota,Minneapolis,USA,2002:58-67.
    [32]刘菁.基于模糊理论与人工神经网络的暂态稳定评估方法[D].上海:上海交通大学(硕士),2007:16-21.
    [33]杜长海,吉根林.模糊聚类在中文文本分类中的应用研究[J].计算机工程与应用,2006,8:240-251.
    [34] DUNN J C.A fuzzy relative of the ISODATA process and its use in detecting compact well separated cluster [J].Cybernetics and Systems,1973,3(3):32-57.
    [35] BEZDEK J C.Pattern Recognition with Fuzzy Objective Funcion Algorithms [M].New York:Plenum Press,1981:62-78.
    [36]陆伟峰.基于改进的模糊C-均值聚类建立T-S模糊系统[J].无锡南洋学院学,2008,7(3):2-4.
    [37] NIKHIL R PAL,DEBRUP CHAKRABORTY.Mountain and subtractive clustering method : Improvements and generalization [J] . International Journal of Intelligent Systems,2006,15(4):329-341.
    [38] KONONEKO J.Estimating attributes Analysis and extensions of Relief[C].Rroceedings of the 7th European Conference on Machin Learning,Berlin Springer,1994:171-182.
    [39] KIRA K , RENDELL L A . A practical approach to feature selection [C].Derek Sleeman,Peter Edwards.Proceeding of the 9th International Workshop on Machine Learning,Cotland,UK,1992:249-256.
    [40]宫改云,高新波,伍忠东.FCM聚类算法中模糊加权指数m的优选方法[J].模糊系统与数学,2005,19(1):143-148.
    [41]陈明.基于进化遗传算法的优化计算[J].软件学报,1998,l9(ll):876-879.
    [42]周明,孙树栋.遗传算法原理及应用[M].北京:国防工业出版社,1999:20-45.
    [43]李茂军,樊韶胜,童调生.单亲遗传算法在模式聚类中的应用[J].计算机工程与应用,1998,10(1):15-18.
    [44]王涛,沈谦,朱明星,张良震.遗传与K-Means混合算法用于聚类分析[J].模式识别与人工智能,1999,12(1):24-27.
    [45]周明,孙树栋.遗传算法原理及应用[M].北京:国防工业出版社,2002:5-7.
    [46]邓乃扬,田英杰.数据挖掘中的新方法:支持向量机[M].科学出版社,2005:49-125.
    [47]吴科,石冰,卢军,等.基于文本集密度的特征选择与权重计算方案[J].中文信息学报,2004,1(18):42-47.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700