Research on Feature Extraction and Algorithms for Short Text Classification

English title: Research on different feature extraction and algorithms for ultra-short text classification
Authors: 刘晓鹏 (Liu Xiaopeng); 杨嘉佳 (Yang Jiajia); 卢凯 (Lu Kai); 田昌海 (Tian Changhai); 唐球 (Tang Qiu)
English authors: Liu Xiaopeng; Yang Jiajia; Lu Kai; Tian Changhai; Tang Qiu (National Computer System Engineering Research Institute of China; Information Research Center of Military Science, PLA Academy of Military Science)
Keywords: natural language processing; text classification; ultra-short text
English keywords: natural language processing; text classification; ultra-short text
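Chinese journal name: WXJY
English journal name: Information Technology and Network Security
Institutions: National Computer System Engineering Research Institute of China; Information Research Center of Military Science, PLA Academy of Military Science
Publication date: 2019-05-10
Publisher: 信息技术与网络安全 (Information Technology and Network Security)
Year: 2019
Issue: v.38; No.505
Language: Chinese
Pages: WXJY201905010
Page count: 5
CN: 05
ISSN: 10-1543/TP
Classification number: 52-56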

Abstract

In recent years, artificial intelligence technology centered on big data has developed rapidly, and natural language processing has become one of the most prominent frontier research areas of the artificial intelligence era. In short text classification, however, results differ markedly depending on which feature extraction method is paired with which machine learning algorithm. To address the low accuracy of short text classification, this paper combines different feature extraction methods with different machine learning algorithms and, using preset evaluation metrics, examines how each combination performs on ultra-short text classification in order to find the combination model that gives the best classification results. Experimental results show that, among the four best-performing combinations selected, the model that uses term frequency-inverse document frequency (TF-IDF) for feature extraction and logistic regression as the classifier achieves the best results on a public dataset, with an accuracy of 92.13% and a recall of 90.12%, making it well suited to ultra-short text classification scenarios.
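The best-performing combination named in the abstract, TF-IDF features fed to a logistic regression classifier and scored by accuracy and recall, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the record gives no implementation details, so the smoothed IDF formula, the L2 normalization, the gradient-descent settings (`lr`, `epochs`), the binary (two-class) setup, and the toy headlines are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_fit(docs):
    """Build a vocabulary and smoothed IDF weights from tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    return vocab, idf

def tfidf_transform(doc, vocab, idf):
    """Map one tokenized document to an L2-normalized TF-IDF vector."""
    counts = Counter(doc)  # tokens outside the vocabulary are ignored
    vec = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit binary logistic regression by batch gradient descent (assumed settings)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

def accuracy(y_true, y_pred):
    """Fraction of all samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0

# Toy headlines invented for illustration; they are not from the paper's dataset.
train_docs = [
    "stocks rally as markets rise".split(),
    "bank shares fall on rate fears".split(),
    "markets slide as bank profits fall".split(),
    "new phone launch draws crowds".split(),
    "chip maker unveils faster phone processor".split(),
    "software update fixes phone bug".split(),
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = business, 0 = technology (hypothetical classes)

vocab, idf = tfidf_fit(train_docs)
X_train = [tfidf_transform(d, vocab, idf) for d in train_docs]
w, b = train_logreg(X_train, train_labels)
train_pred = [predict(x, w, b) for x in X_train]
```

A new headline is classified by passing it through `tfidf_transform` with the fitted `vocab` and `idf` and then calling `predict`. A multi-class setting would need a one-vs-rest or softmax extension, which this sketch leaves out.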

