Chinese short text classification method based on word2vec embedding

Original title (Chinese): 基于word2vec词模型的中文短文本分类方法
Authors: GAO Mingxia; LI Jingwei
Affiliation: Faculty of Information Technology, Beijing University of Technology
Keywords: short texts; Chinese text classification; Wikipedia; word2vec; embedding
Journal code: SDGY
Journal: Journal of Shandong University (Engineering Science)
Publication date: 2018-11-02 13:42
Publisher: Journal of Shandong University (Engineering Science)
Year: 2019
Volume/issue: v.49; No.234
Funding: Beijing Key Laboratory of MRI and Brain Informatics Fund (20160201); State Key Laboratory of Digital Publishing Fund (Q5007013201501); College of Computer Science college-level research project (2018JSJKY008)
Language: Chinese
File name: SDGY201902005
Page count: 8
Issue: 02
CN: 37-1391/T
Pages: 38-45

Abstract

Short texts are limited in length, so their features carry little information, and this weak feature expression is the main factor limiting short text classification. To address this problem, a Chinese short text classification method based on embeddings trained by word2vec from Wikipedia (CSTC-EWW) is proposed, and experiments are carried out on a short text collection covering 4 topics from Sina iAsk (iask.com). The method first trains a word2vec embedding model on the Wikipedia corpus, then builds short text features from this model, and finally classifies the short texts with classic classifiers such as SVM and naive Bayes. The experimental results show that CSTC-EWW classifies short texts effectively, with an F-measure of up to 81.8% in the best case. Compared with a short text classification method that represents features with the bag-of-words (BOW) model weighted by term frequency-inverse document frequency (TF-IDF), and with a method that likewise extends features using an external Wikipedia corpus, CSTC-EWW classifies better; in the best case, on the car topic, its F-measure improves by 45.2%.
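
The feature-building and evaluation steps named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how word vectors are combined into a short text feature, so averaging is an assumption here, and the tiny `EMBEDDING` table, the tokens, and the function names `text_vector` and `f_measure` are all hypothetical. The F-measure is assumed to be the balanced F1 form, and the input is assumed to be already word-segmented.

```python
from typing import Dict, List, Sequence

# Toy 3-dimensional table standing in for embeddings trained with word2vec
# on a Wikipedia corpus (hypothetical values, for illustration only).
EMBEDDING: Dict[str, List[float]] = {
    "发动机": [1.0, 0.0, 0.0],  # "engine"
    "刹车": [0.0, 1.0, 0.0],    # "brake"
    "发烧": [0.0, 0.0, 1.0],    # "fever"
}


def text_vector(tokens: Sequence[str],
                embedding: Dict[str, List[float]],
                dim: int) -> List[float]:
    """Average the vectors of in-vocabulary tokens (assumed composition rule)."""
    known = [embedding[t] for t in tokens if t in embedding]
    if not known:
        # No token is in the vocabulary: fall back to a zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def f_measure(gold: Sequence[str], predicted: Sequence[str], label: str) -> float:
    """Balanced F-measure (F1) for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A segmented short text; the out-of-vocabulary token is skipped.
feature = text_vector(["发动机", "刹车", "异响"], EMBEDDING, dim=3)

# Per-class score on a toy prediction run.
score = f_measure(["car", "car", "health", "car"],
                  ["car", "health", "car", "car"],
                  label="car")
```

The resulting fixed-length vectors are the kind of input a classifier such as SVM or naive Bayes would then be trained on; that training step is left out of this sketch.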

