Study on a Method of Feature Classification Selection Based on χ² Statistics
  • English title: Study on a Method of Feature Classification Selection Based on χ² Statistics
  • Authors: 谭章禄; 王兆刚; 胡翰
  • Authors (English): Tan Zhanglu; Wang Zhaogang; Hu Han; School of Management, China University of Mining and Technology
  • Keywords: χ² statistics; feature selection; text categorization; stability
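  • Chinese journal title: XDTQ
  • English journal title: Data Analysis and Knowledge Discovery
  • Institution: School of Management, China University of Mining and Technology (Beijing)
  • Publication date: 2019-02-25
  • Publisher: Data Analysis and Knowledge Discovery (数据分析与知识发现)
  • Year: 2019
  • Issue: v.3; No.26
  • Funding: One of the research outcomes of the National Natural Science Foundation of China project "Research on a Data-Mining-Based Visual Management Model and Graphic Element System for Coal Mine Safety" (Grant No. 61471362)
  • Language: Chinese
  • Page: XDTQ201902009
  • Page count: 7
  • CN: 02
  • ISSN: 10-1478/G2
  • Classification number: 76-82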
Abstract
[Objective] Traditional χ² statistics cannot guarantee that information is balanced across categories, which degrades classification performance. This paper improves χ² statistics to make it more effective in practice. [Methods] After analyzing the feature selection process of traditional χ² statistics and its limitations, we propose a feature classification selection method based on χ² statistics that selects feature words class by class, according to the degree of association between each feature word and each class. [Results] With SVM as the classification model, experiments compared the effect of the original and improved methods on text classification. The results show that the feature classification selection method based on χ² statistics significantly improves accuracy, average classification accuracy, lowest classification accuracy, stability, and system running time. [Limitations] When only a small number of feature words is selected, the difference between the original and improved methods is not obvious. [Conclusions] The feature classification selection method based on χ² statistics effectively improves the stability and generalization performance of the classification model, reduces the fluctuation of classification accuracy, and significantly improves the efficiency of the classification process.
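The contrast the abstract draws, one global χ² ranking versus selecting feature words separately for each class, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the standard 2×2 contingency form of the χ² statistic computed on document-level term presence, and the function names (`chi2_scores`, `select_global`, `select_per_class`), the toy corpus, and the alphabetical tie-breaking rule are all hypothetical. The abstract does not say how the paper sets the number of words per class or handles negatively associated terms.

```python
from collections import defaultdict


def chi2_scores(docs, labels):
    """Return {class: {term: chi2}} from document-level term presence."""
    n = len(docs)
    classes = sorted(set(labels))
    class_size = {c: labels.count(c) for c in classes}
    doc_freq = defaultdict(int)           # term -> documents containing it
    doc_freq_in_class = defaultdict(int)  # (term, class) -> documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            doc_freq_in_class[(term, label)] += 1

    scores = {c: {} for c in classes}
    for term, total in doc_freq.items():
        for c in classes:
            a = doc_freq_in_class.get((term, c), 0)  # in class, has term
            b = total - a                            # other classes, has term
            c_miss = class_size[c] - a               # in class, lacks term
            d = n - a - b - c_miss                   # other classes, lacks term
            denom = (a + c_miss) * (b + d) * (a + b) * (c_miss + d)
            scores[c][term] = (
                n * (a * d - b * c_miss) ** 2 / denom if denom else 0.0
            )
    return scores


def top_terms(term_scores, k):
    """Top-k terms by score; ties broken alphabetically (an assumption)."""
    ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def select_global(scores, k):
    """Baseline: one ranking by each term's maximum chi2 over all classes."""
    best = defaultdict(float)
    for per_class in scores.values():
        for term, s in per_class.items():
            best[term] = max(best[term], s)
    return top_terms(best, k)


def select_per_class(scores, k_per_class):
    """Per-class selection: every class contributes its own top terms."""
    return {c: top_terms(s, k_per_class) for c, s in scores.items()}


# Hypothetical toy corpus with three classes of two documents each.
docs = [
    ["coal", "mine", "safety"], ["coal", "mine", "gas"],
    ["poem", "tang", "verse"], ["poem", "tang", "rhyme"],
    ["stock", "market", "price"], ["stock", "market", "trade"],
]
labels = ["mining", "mining", "poetry", "poetry", "finance", "finance"]

scores = chi2_scores(docs, labels)
per_class = select_per_class(scores, 2)
vocabulary = sorted(set().union(*per_class.values()))
```

In a global ranking, nothing prevents the top-k list from being dominated by a few classes. In the per-class variant, every class is guaranteed `k_per_class` entries in the final vocabulary, which is one plausible reading of the "balance of information between categories" the abstract refers to.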
