Short Text Classification Based on the Sentence-LDA Topic Model
  • English title: Chinese Short Text Classification Based on Sentence-LDA Topic Model
  • Authors: ZHANG Hao (张浩); ZHONG Min (钟敏)
  • Keywords: short text classification; Sentence-LDA; topic model; feature extension; SVM
  • Chinese journal name: JYXH
  • English journal name: Computer and Modernization
  • Institutions: Wuhan Research Institute of Posts and Telecommunications (武汉邮电科学研究院); Nanjing Fiberhome World Communication Technology Co. Ltd. (南京烽火天地通信科技有限公司)
  • Publication date: 2019-03-15
  • Publisher: 计算机与现代化 (Computer and Modernization)
  • Year: 2019
  • Issue: No.283
  • Language: Chinese
  • Page: JYXH201903021
  • Page count: 5
  • CN: 03
  • ISSN: 36-1137/TP
  • Classification number: 106-110
Abstract
Short texts have sparse features and depend strongly on context, so traditional long-text classification techniques cannot be applied to them directly and effectively. To address the feature sparsity of short texts, this paper proposes a short text classification method that performs feature expansion with the Sentence-LDA topic model. This topic model is an extension of the Latent Dirichlet Allocation (LDA) model and assumes that a sentence produces only one topic distribution. The trained Sentence-LDA model is used to predict the topic distribution of each original short text, and the resulting topic words are added to the original short text features, which completes the feature expansion. A Support Vector Machine (SVM) then classifies the expanded short texts. Experiments show that, compared with the traditional method of representing short texts directly with the Vector Space Model (VSM), the proposed method effectively improves the accuracy of short text classification.
