Research on Feature Selection Algorithm and Classification Algorithm in Chinese Text Categorization

Original title: 中文文本分类中特征选择算法及分类算法的研究
Author: Chi Lin (迟麟)
Thesis level: Master's
Discipline: Computer Software and Theory
Keywords: Text classification; Segmentation algorithm; Feature selection; Categorical proportional difference; K-nearest neighbors; Precision; Recall
Degree year: 2010
Advisor: Liu Wenyuan (刘文远)
Discipline code: 081202
Degree-granting institution: Yanshan University (燕山大学)
Submission date: 2009-12-01

Abstract

In recent years, with the rapid development of information technology and especially the spread of the Internet, the volume of electronic text on the web has grown sharply. Organizing and managing this mass of information effectively, and retrieving what users need quickly and accurately, is a major challenge for information resource management technology. Automatic text classification organizes and manages electronic text by category, which meets the demand for convenient, fast information processing and helps users locate the resources they need.
     This thesis studies text classification in depth from three aspects: word segmentation algorithms, feature selection algorithms, and text classification algorithms.
     First, after analyzing the characteristics of Chinese text categorization in the preprocessing stage, the vector space model representation of Chinese text, and two mechanical word segmentation methods, we improve the segmentation method in four respects: the dictionary structure, the matching scheme, the strategy for handling ambiguous words, and the strategy for recognizing out-of-vocabulary words. The improvements are verified experimentally.
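To make "mechanical" (dictionary-based) segmentation concrete, the sketch below shows forward maximum matching. The abstract does not name the two mechanical methods it analyzes, so choosing forward maximum matching is an assumption, and the function name, toy dictionary, and `max_len` parameter are illustrative only. It is the unimproved baseline idea, not the thesis's improved method.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right dictionary segmentation (forward maximum matching)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"文本", "分类", "文本分类", "算法", "特征", "选择"}
tokens = forward_max_match("文本分类算法", toy_dictionary)
print(tokens)
```

A word missing from the dictionary falls apart into single characters here, which is why a strategy for out-of-vocabulary words matters; the greedy longest match is also where segmentation ambiguity arises.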
     Second, building on text preprocessing, and in order to further improve how well feature terms discriminate between categories (and thereby raise classification accuracy and reduce the computational cost of the classifier), we analyze the feature selection algorithm based on Categorical Proportional Difference (CPD). We improve it with respect to both the frequency and the redundancy of feature terms, propose an improved CPD feature selection algorithm, and validate it through comparative experiments.
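As a point of reference, the baseline CPD score is usually defined per term and category as (A - B) / (A + B), where A is the number of documents in the category that contain the term and B is the number of documents in all other categories that contain it; the term's score is the maximum over categories. The sketch below computes that baseline under this assumed definition. The frequency and redundancy improvements proposed in the thesis are not described in the abstract and are not reproduced here; the function name and toy data are hypothetical.

```python
from collections import defaultdict


def cpd_scores(docs):
    """docs: list of (category, list_of_terms). Returns {term: max CPD over categories}."""
    doc_freq = defaultdict(lambda: defaultdict(int))  # term -> category -> document count
    for category, terms in docs:
        for term in set(terms):  # count each term at most once per document
            doc_freq[term][category] += 1
    scores = {}
    for term, per_category in doc_freq.items():
        total = sum(per_category.values())  # A + B, the same for every category
        # (A - B) / (A + B) with B = total - A simplifies to (2A - total) / total.
        scores[term] = max((2 * a - total) / total for a in per_category.values())
    return scores


# Hypothetical toy corpus of already-segmented documents.
toy_docs = [
    ("sports", ["足球", "比赛", "新闻"]),
    ("sports", ["足球", "球员"]),
    ("finance", ["股票", "新闻"]),
    ("finance", ["股票", "市场", "比赛"]),
]
scores = cpd_scores(toy_docs)
print(scores["足球"], scores["新闻"])
```

A term that occurs in only one category scores 1, while a term spread evenly over two categories scores 0. Because the baseline counts only document presence, a term seen in a single document also scores 1, which is one plausible reason to bring term frequency into the score.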
     Finally, we analyze two shortcomings of the traditional K-nearest neighbor (KNN) classification algorithm: its heavy computational cost, and the drop in accuracy when categories have much in common, that is, when the features of training samples from different categories overlap heavily. We propose an improved KNN text classification algorithm and compare it experimentally with traditional KNN on two data sets: the Chinese text classification corpus TanCorpV1.0 and a Sohu web page corpus.
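The two shortcomings are easier to see against a minimal baseline. The sketch below is a plain KNN text classifier over raw term-frequency vectors with cosine similarity and a similarity-weighted vote. It is an illustration of traditional KNN under those assumptions, not the improved algorithm of the thesis, which the abstract does not detail; all names and the toy training set are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_classify(query_terms, training, k=3):
    """training: list of (category, list_of_terms). Similarity-weighted vote of the k nearest."""
    query = Counter(query_terms)
    # The query is compared against every training document: this full scan
    # is the computational cost mentioned in the abstract.
    scored = sorted(
        ((cosine(query, Counter(terms)), category) for category, terms in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, category in scored[:k]:
        votes[category] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical toy training set of already-segmented documents.
training = [
    ("sports", ["足球", "比赛", "球员"]),
    ("sports", ["足球", "联赛"]),
    ("finance", ["股票", "市场"]),
    ("finance", ["股票", "基金", "市场"]),
]
label = knn_classify(["足球", "球员", "市场"], training, k=3)
print(label)
```

The query shares the term 市场 with the finance documents, so a finance document enters the neighbor set. With heavier overlap between categories, such cross-category neighbors can outvote the correct class, which is the second shortcoming named above.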

