Text categorization framework based on improved TF-IDF and k-nearest neighbor

Original title (Chinese): 基于k最近邻和改进TF-IDF的文本分类框架
Authors: GONG Jing (龚静); HUANG Xin-yang (黄欣阳)
Keywords: text categorization; k-NN; classifier; weight matrix; optimization
Journal: Computer Engineering and Design (journal code: SJSJ)
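Affiliations: Department of Public Basic Courses, Hunan Polytechnic of Environment and Biology; College of Computer Science, University of South China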
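Publication date: 2018-05-16
Publisher: 计算机工程与设计 (Computer Engineering and Design)
Year: 2018
Volume/issue: v.39; No.377
Funding: National Natural Science Foundation of China (61300234); Hunan Provincial Department of Education Fund (12C1056)
Language: Chinese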
Article ID: SJSJ201805024
Page count: 6
Issue number: 05
CN: 11-1775/TP
Pages: 148-152+157

Abstract

To obtain more accurate and stable text categorization results, a text categorization method based on k-nearest neighbor (k-NN) and an improved term frequency-inverse document frequency (TF-IDF) weighting is proposed. The framework consists of five modules: a document module, a graphical user interface (GUI) module, a pre-processing module, a k-NN & TF-IDF module, and a similarity measurement module. For weight acquisition, feature words are assigned different coefficients according to their position, and a weight matrix is constructed to reflect the importance and distribution of the feature words. On the programming side, query efficiency is optimized by executing modified Language Integrated Query (LINQ) queries. Experimental results show that, compared with other classification methods, the proposed method has certain advantages in classification accuracy, recall, and F1 measure. The influence of the classifier on the whole text categorization framework is also discussed; the experiments indicate that the k-NN classifier is better suited to text categorization than the SVM classifier.
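
The pipeline summarized in the abstract (position-dependent coefficients, a TF-IDF weight matrix, similarity measurement, and k-NN voting) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no coefficient values or formulas, so the `POSITION_COEFF` values, the smoothed IDF formula, the use of cosine similarity, majority voting, and all function names and sample documents below are assumptions chosen for illustration. Python is used only for readability; the abstract mentions LINQ for the original system.

```python
import math
from collections import Counter

# Hypothetical position coefficients: the abstract says terms in different
# positions receive different coefficients but does not give the values.
POSITION_COEFF = {"title": 3.0, "body": 1.0}


def weighted_tf(doc):
    """Position-weighted term counts; `doc` maps a position name to tokens."""
    tf = Counter()
    for position, tokens in doc.items():
        for tok in tokens:
            tf[tok] += POSITION_COEFF[position]
    return tf


def build_weight_matrix(docs):
    """Return (vocab, idf, rows); rows[i][j] is the weight of vocab[j] in docs[i]."""
    tfs = [weighted_tf(d) for d in docs]
    vocab = sorted({t for tf in tfs for t in tf})
    n = len(docs)
    df = Counter(t for tf in tfs for t in tf)  # number of documents containing t
    # Smoothed IDF (an assumed variant, not taken from the paper).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = [[tf[t] * idf[t] for t in vocab] for tf in tfs]
    return vocab, idf, rows


def vectorize(doc, vocab, idf):
    """Project a new document onto the training vocabulary."""
    tf = weighted_tf(doc)
    return [tf[t] * idf[t] for t in vocab]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_classify(query_vec, rows, labels, k=3):
    """Majority vote among the k training rows most similar to the query."""
    ranked = sorted(zip(rows, labels),
                    key=lambda pair: cosine(query_vec, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up training data for illustration only.
train_docs = [
    {"title": ["football", "match"], "body": ["team", "goal", "score"]},
    {"title": ["basketball", "final"], "body": ["team", "score", "player"]},
    {"title": ["stock", "market"], "body": ["shares", "price", "trade"]},
    {"title": ["bank", "rates"], "body": ["price", "loan", "market"]},
]
labels = ["sports", "sports", "finance", "finance"]

vocab, idf, rows = build_weight_matrix(train_docs)
query = {"title": ["market", "update"], "body": ["price", "shares"]}
predicted = knn_classify(vectorize(query, vocab, idf), rows, labels, k=3)
print(predicted)
```

In this sketch a term in the title contributes three times as much to its row of the weight matrix as the same term in the body, which is one simple way to let position influence the weights. Terms in the query that never occur in the training set are ignored.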

