Chinese FastText Short Text Classification Method Integrating TF-IDF and LDA

Authors: FENG Yong; QU Bohao; XU Hongyan; WANG Rongbing; ZHANG Yonggang
Keywords: Chinese short text classification; FastText; term frequency-inverse document frequency (TF-IDF); word vector; latent Dirichlet allocation (LDA)
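Journal code: YYKX
Journal: Journal of Applied Sciences
Institutions: College of Information, Liaoning University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
Publication date: 2019-05-30
Publisher: Journal of Applied Sciences (应用科学学报)
Year: 2019
Issue: v.37
Funding: National Natural Science Foundation of China (No. 71771110); China Postdoctoral Science Foundation (No. 2018M631814); Liaoning Provincial Social Science Planning Fund (No. L18AGL007); Project Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (No. 93K172018K01)
Language: Chinese
Page: YYKX201903008
Page count: 11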
CN：03
ISSN：31-1404/N
Classification code: 82-92

Abstract

The FastText text classification model is fast and efficient, but applying it directly to Chinese short text classification yields low precision. To address this, a Chinese FastText short text classification method integrating term frequency-inverse document frequency (TF-IDF) and latent Dirichlet allocation (LDA) is proposed. In the input stage of the FastText model, the dictionary produced by n-gram processing is filtered by TF-IDF, and the LDA model is used to analyze the topics of the corpus; the feature dictionary is then supplemented according to the results. As a result, the mean of the input word sequence vectors is biased toward highly discriminative terms, which makes the model better suited to Chinese short text classification. Comparative experiments show that the proposed method achieves higher precision in Chinese short text classification.
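
The input-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so scoring each entry by its maximum TF-IDF over the corpus, the `top_k` cutoff, and every function name here are assumptions. The LDA step is not implemented; `topic_words` is a hypothetical list standing in for the words an LDA topic analysis would supply.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return the unigrams of one tokenized document plus its joined n-grams."""
    grams = list(tokens)
    grams += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf_filter(docs, top_k, n=2):
    """Keep the top_k dictionary entries ranked by maximum TF-IDF (assumed rule)."""
    feats = [ngrams(d, n) for d in docs]
    # Document frequency: number of documents containing each entry.
    df = Counter(term for f in feats for term in set(f))
    scores = {}
    for f in feats:
        tf = Counter(f)
        for term, count in tf.items():
            score = (count / len(f)) * math.log(len(docs) / df[term])
            scores[term] = max(scores.get(term, 0.0), score)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:top_k])


def build_dictionary(docs, topic_words, top_k, n=2):
    """Supplement the filtered dictionary with topic words (assumed LDA output)."""
    return tfidf_filter(docs, top_k, n) | set(topic_words)


def doc_features(tokens, dictionary, n=2):
    """Keep only the entries of one document that survive in the dictionary."""
    return [g for g in ngrams(tokens, n) if g in dictionary]
```

Under these assumptions, an entry that occurs in every document has an inverse document frequency of zero and is ranked last, so it is dropped unless the topic-word supplement restores it. The surviving features are what a FastText-style model would then average into the input vector.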
