Research and Improvement of Feature Word Selection Methods for Text Classification (面向文本分类的特征词选取方法研究与改进)

English title: Feature Word Selection for Document Classification
Authors: LI Guohe (李国和); YUE Xiang (岳翔); WU Weijiang (吴卫江); HONG Yunfeng (洪云峰); LIU Zhiyuan (刘智渊); CHENG Yuan (程远)
Keywords: Text document; Feature word; Feature selection; Text classification
Journal: Journal of Chinese Information Processing (中文信息学报)
Journal code: MESS
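Institutions: College of Geophysics and Information Engineering, China University of Petroleum (Beijing); Beijing Key Lab of Data Mining for Petroleum Data, China University of Petroleum (Beijing); PanPass Institute of Digital Identification Management and Internet of Things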
Publication date: 2015-07-15
Publisher: Journal of Chinese Information Processing (中文信息学报)
Year: 2015
Volume: 29
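Funding: National High Technology Research and Development Program of China (2009AA062802); National Natural Science Foundation of China (60473125); CNPC Petroleum Science and Technology Innovation Fund for Young and Middle-aged Researchers (05E7013); National Major Special Project, sub-project (G5800-08-ZS-WX)
Language: Chinese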
Record ID: MESS201504018
Page count: 6
Issue: 04
CN: 11-2325/N
Pages: 124-129

Abstract

Feature word selection is one of the preprocessing steps for Chinese text and has a significant effect on document classification. After Chinese word segmentation, representing documents with a vector model built from feature words yields sparse, high-dimensional feature vectors, which degrades both the performance and the accuracy of document classification (or document retrieval). Based on an analysis and summary of several classical text feature selection methods, the paper presents a feature word selection method (DC) that takes document frequency as its main criterion and corrects it with the frequency of each feature word in the document collection and its distribution. With macro-F and micro-F values as evaluation metrics, comparative experiments show that DC selects features more effectively than the classical text feature selection methods.
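
The abstract names macro-F and micro-F as the evaluation metrics but does not define them. The following is a minimal sketch of how the two values are commonly computed for single-label classification, assuming the balanced F1 form (equal weight on precision and recall); the function name `macro_micro_f1` and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions.

    Assumption: F is the balanced F1 score; the record does not say
    which F-measure variant the paper uses.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-class true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is empty.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores (every class counts equally).
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


if __name__ == "__main__":
    truth = ["sports", "sports", "sports", "finance"]
    guess = ["sports", "sports", "finance", "finance"]
    print(macro_micro_f1(truth, guess))
```

Macro-F weights every category equally, so it is sensitive to performance on small categories, while micro-F is dominated by the large ones; reporting both gives a fuller picture of a feature selection method on an unbalanced corpus.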
