基于偏斜数据集的中文文本分类问题的改进特征权重算法研究

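English title: Improved Term-weighting Approach in Chinese Text Classification over Skewed Data Sets
Author: Zhang Yujie (张玉杰)
Thesis level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 文本分类 ; 特征权重 ; 偏斜数据集 ; TFIDF ; TFIDF-λDI
English keywords: Text classification ; Term-weighting ; Skewed data ; TFIDF ; TFIDF-λDI
Degree year: 2010
Advisor: Sun Tieli (孙铁利)
Discipline code: 081202
Degree-granting institution: Northeast Normal University (东北师范大学)
Submission date: 2010-05-01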

Abstract

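With the rapid development of Internet technology, diverse and massive information resources are growing by orders of magnitude every day, and most of this information still exists as text. How to manage and organize such a large and ever-growing body of text, and how to mine from it the information people need, has become a valuable research topic that has drawn wide attention from researchers in China and abroad in recent years. Automatic text classification emerged in response to this need. As the technique has matured, it has become an effective solution for search engines, information retrieval, information filtering and related problems, and a key technology with broad application prospects and practical value. With more and more researchers working on it, it has become a hot topic in academia at home and abroad.
     Automatic text classification involves several key techniques: word segmentation, feature selection, the vector space model, construction of the classification model, and classification evaluation metrics. Most machine-learning-based text classification is built on the vector space model, in which text is represented in a form that a computer can process. A term-weighting method computes weights for the feature terms that are important in a text and represent its category well, and terms that contribute little or nothing to classification are ignored. This serves two purposes: it reduces the dimensionality of the text vector space, which improves classification efficiency, and it makes the selected feature terms represent the text better, which improves classification accuracy. Term weighting is therefore a foundation and prerequisite of text classification. Based on this analysis, this thesis focuses on improving the term-weighting method. The main work is as follows: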
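     (1) Introduces the research background and theory of text classification, and reviews the development of text classification technology and notable classification systems in China and abroad.
     (2) Describes the key techniques of text classification, mainly text preprocessing, feature dimensionality reduction, text representation, classification algorithms, and classification evaluation metrics.
     (3) Analyzes the classic term-weighting algorithm TFIDF in detail and points out its shortcomings. For three situations, namely the distribution of feature terms across classes, their distribution within a class, and data sets with skewed class distributions, it analyzes how the feature terms extracted by the traditional algorithm affect classification performance and identifies the resulting problems. It also introduces and compares existing improved term-weighting algorithms based on traditional TFIDF, with a focus on the TFIDF-DI algorithm, which handles the problems above well.
     (4) Describes the concept of skewed data sets and the new theories and methods built on it in recent years, runs comparative experiments with traditional TFIDF and TFIDF-DI, points out the shortcomings of both methods on skewed data sets, and analyzes the causes.
     (5) Based on this detailed comparison, proposes a new improved algorithm, TFIDF-λDI, built on TFIDF-DI, which introduces a factor λ to correct for skewed data sets in text classification. Experiments compare traditional TFIDF, the improved TFIDF-DI, and the proposed TFIDF-λDI; the results show that TFIDF-λDI performs well for text classification on skewed data sets.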
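
As a point of reference for the term-weighting discussion above, the following is a minimal sketch of classical TF-IDF in one common form, weight = tf × log(N / df). It is only an illustration under stated assumptions: the function name `tfidf_weights` and the toy corpus are hypothetical, the documents are assumed to be already segmented into terms, and the TFIDF-DI and TFIDF-λDI formulas are not given in this abstract and are therefore not reproduced.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy, already-segmented corpus (hypothetical data, not from the thesis).
docs = [
    ["stock", "market", "stock"],
    ["market", "football"],
    ["football", "match", "goal"],
]
w = tfidf_weights(docs)
```

In this toy corpus, "stock" appears twice in the first document and nowhere else, so it receives the largest weight there, while a term that appears in every document would receive a weight of zero. Note that this weighting never looks at class labels, so on its own it cannot reflect how a term is distributed between or within classes.
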
With the rapid development of the Internet, many kinds of information are growing exponentially every day, and most of these abundant information resources still exist as text. How to manage and organize such a huge and growing body of text, and how to mine from it the information people need, has become a problem of high research value that has been drawing more and more attention all over the world. Against this background, text classification based on machine learning has emerged. It has become an important basis and prerequisite for information retrieval, information filtering, search engines, text databases, data mining and other fields, and it has broad application prospects.
     Text classification involves many key technologies: Chinese word segmentation, feature selection, the vector space model, the classification model, classification evaluation metrics and so on. Most automatic text categorization based on machine learning is built on the vector space model (VSM), in which text is expressed in a form that computers can process. Using a feature-weighting algorithm, we choose the features that play an important role in the text and represent it well, and we ignore the features that contribute nothing to categorization. This has two purposes: it reduces the dimension of the VSM and so improves the efficiency of text categorization, and it selects features that better express the text and so improves precision. The feature-weighting algorithm is therefore the basis and premise of text categorization. Following this analysis, this dissertation focuses on improving the term-weighting approach. Its contributions are as follows:
     (1) Briefly introduces the basic concepts of text classification and its development at home and abroad.
     (2) Introduces the key technologies of text classification, including text preprocessing, feature dimension reduction, text representation, classification algorithms and evaluation metrics.
     (3) Introduces the classic term-weighting approach TFIDF and analyzes its weaknesses, then surveys several improved approaches based on TFIDF and analyzes TFIDF-DI, the better-performing one among them.
     (4) Introduces the concept of the skewed data set, runs controlled experiments with TFIDF and TFIDF-DI, analyzes the results and points out the shortcomings of these two approaches on skewed data sets.
     (5) Proposes an improved method, TFIDF-λDI, based on TFIDF-DI, and uses the KNN algorithm to compare the new approach with TFIDF and TFIDF-DI; the results show that the improved method gives a certain improvement in classification performance.
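
To make the KNN comparison mentioned above more concrete, here is a minimal sketch of a k-nearest-neighbour text classifier that uses cosine similarity over sparse term-weight vectors. Everything in it is an assumption for illustration: the names `cosine` and `knn_predict`, the majority-vote rule, and the toy training set are hypothetical and are not taken from the thesis, whose experimental setup is not described in this abstract.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def knn_predict(query, train, k=3):
    """Majority vote among the k training vectors most similar to the query.

    train is a list of (vector, label) pairs.
    """
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy skewed training set (hypothetical): three "finance" documents, one "sports".
train = [
    ({"stock": 1.0, "market": 1.0}, "finance"),
    ({"stock": 1.0}, "finance"),
    ({"market": 1.0, "bank": 1.0}, "finance"),
    ({"goal": 1.0, "match": 1.0}, "sports"),
]
query = {"goal": 1.0, "market": 0.5}

nearest_label = knn_predict(query, train, k=1)
voted_label = knn_predict(query, train, k=3)
```

In this toy setup the single nearest neighbour of the query is the "sports" document, but with k=3 the two weaker "finance" neighbours outvote it. This is only a small illustration of how a skewed class distribution can pull a majority-vote classifier toward the larger class; it does not reproduce the thesis's experiments.
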

