Research and Realization of Clustering Guided Web Chinese Text Classification Based on SVM

Chinese title: 基于SVM有聚类指导的Web中文文本分类器的研究及其实现
Author: Zhang Junyan (张俊艳)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Web text mining; text clustering; text classification; classifier; support vector machine
Degree year: 2004
Supervisor: Lin Shiping (林世平)
Discipline code: 081203
Degree-granting institution: Fuzhou University (福州大学)
Submission date: 2003-12-01

Abstract

With the rapid growth of the Internet, the volume of online information keeps expanding. To provide efficient and accurate information services, the large and disorderly body of information on the network has to be organized and classified sensibly. Taking Web text processing as its setting, this thesis studies text clustering and text classification methods at both the theoretical and the applied level.
    The thesis first describes the overall model of a text classifier, which covers five aspects: information preprocessing, feature representation, feature extraction, extraction of classification patterns with text mining techniques (involving text clustering and classification), and quality evaluation of the extracted patterns. It then introduces the theory and key techniques of word segmentation, feature extraction, text clustering, and text classification, in particular the extraction of clustering-guided classification patterns based on the support vector machine (SVM). Finally, a Chinese text classifier is constructed and implemented in code, and its performance is tested on real data.
    The focus of the thesis is the extraction of classification patterns under the guidance of text clustering. Unlike a traditional classifier, our work starts from data that lack class labels and class information. Clustering takes the place of manual classification by domain experts and supplies the class information needed to build the classifier, with good results.
    In the clustering part, we modify k-means to overcome its bias, so that its results are more evenly distributed and better reflect the regularities of each cluster, which improves classification accuracy. Because the experimental data are high-dimensional and sparse, we propose two clustering algorithms, HSMBK and HSSCA. (1) HSMBK uses the bisecting partition principle; adopts a new similarity measure, the Boolean feature sparse dissimilarity; applies the idea of selecting the best elements to the computation of cluster centers, giving a new center calculation that reduces the influence of outliers; and introduces the heuristic JW criterion as a basis for choosing the value of K. (2) HSSCA works in two stages. The first stage groups the data into small clusters without requiring the number of clusters to be specified. The second stage merges the small clusters with an agglomerative clustering method until the required number of clusters is reached. HSSCA uses another new similarity measure, the set-based Boolean feature sparse dissimilarity.
    The three clustering algorithms were validated experimentally, and HSMBK, which gave the best clustering results, was chosen to guide the extraction of classification patterns.
    In the classification part, the thesis analyzes in theory the advantages of applying SVM to text classification, studies two specific SVM algorithms, C-SVC and ν-SVC, and compares their performance on real data. Finally, it describes in detail the design and implementation of the SVM-based Web Chinese text classifier.
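The clustering-guided idea in the abstract (cluster unlabeled documents, then train a classifier on the cluster labels) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis's implementation: plain 2-means stands in for HSMBK, a linear soft-margin SVM trained by subgradient descent (without a bias term) stands in for the C-SVC and ν-SVC solvers, and the toy Boolean document vectors, vocabulary, function names, and hyperparameters are all hypothetical.

```python
from typing import List, Sequence

def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def two_means(points: List[Sequence[float]], max_iters: int = 20) -> List[int]:
    """Plain 2-means (a stand-in for HSMBK), seeded deterministically with
    the first point and the point farthest from it."""
    c0 = list(points[0])
    c1 = list(max(points, key=lambda p: sq_dist(p, c0)))
    labels: List[int] = []
    for _ in range(max_iters):
        new_labels = [0 if sq_dist(p, c0) <= sq_dist(p, c1) else 1 for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        group0 = [p for p, l in zip(points, labels) if l == 0]
        group1 = [p for p, l in zip(points, labels) if l == 1]
        if group0:
            c0 = centroid(group0)
        if group1:
            c1 = centroid(group1)
    return labels

def train_linear_svm(points: List[Sequence[float]], labels: List[int],
                     lam: float = 0.01, eta: float = 0.1,
                     epochs: int = 100) -> List[float]:
    """Subgradient descent on (lam/2)*||w||^2 + hinge loss, no bias term.
    Cluster label 1 is mapped to +1, cluster label 0 to -1."""
    w = [0.0] * len(points[0])
    for _ in range(epochs):
        for x, label in zip(points, labels):
            y = 1.0 if label == 1 else -1.0
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1.0 - eta * lam) * wi for wi in w]  # regularization step
            if margin < 1.0:                          # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w: Sequence[float], x: Sequence[float]) -> int:
    """Return cluster label 1 if the decision value is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Hypothetical vocabulary: ["football", "match", "stock", "market"].
# Each row is a Boolean term-presence vector for one unlabeled document.
docs = [
    [1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],  # sports-like documents
    [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1],  # finance-like documents
]

pseudo_labels = two_means(docs)            # step 1: clustering supplies class labels
w = train_linear_svm(docs, pseudo_labels)  # step 2: train the classifier on them

print(pseudo_labels)
print(predict(w, [1, 1, 1, 0]), predict(w, [0, 1, 1, 1]))
```

The two unseen documents at the end mix terms from both groups; each is assigned to the cluster whose vocabulary dominates it. A real system would replace both stand-ins with the algorithms named in the abstract and would work on much higher-dimensional, sparse vectors.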

