Text Labeling Algorithm Based on Conceptual-Semantic Relatedness and LDA
  • Authors: ZHOU Chun (周春); JIANG Yuncheng (蒋运承)
  • Keywords: conceptual-semantic relatedness; similarity computation; text labeling; topic model; text classification
  • Journal code: HNSF
  • Journal: Journal of South China Normal University (Natural Science Edition)
  • Affiliation: School of Computer, South China Normal University
  • Publication date: 2018-08-22
  • Publisher: Journal of South China Normal University (Natural Science Edition)
  • Year: 2018
  • Volume: v.50
  • Funding: National Natural Science Foundation of China (61772210); Guangzhou Science and Technology Program (201807010043)
  • Language: Chinese
  • Record ID: HNSF201804023
  • Page count: 8
  • Issue: 04
  • CN: 44-1138/N
  • Pages: 125-132
Abstract
To improve the efficiency of text labeling and classification, this paper proposes TML (Text Mark Label), an automatic text labeling algorithm based on conceptual-semantic relatedness and LDA (Latent Dirichlet Allocation), intended to replace manually assigned classification labels. Building on the computation of semantic relatedness between concepts, the algorithm uses LDA to extract a topic representation of each text, then labels the text automatically by computing the expectation that its topics belong to each category. To verify the effectiveness of TML, supervised text classification experiments were run with text classifiers on standard text classification datasets. To compare the influence of dataset and classifier on the results, three classifiers (Rocchio, KNN, SVM) were each applied to three datasets (WebKB, Reuters-21578, and 20 Newsgroups). The experimental results show that TML effectively improves the efficiency of both text classification and text labeling.
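The labeling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formula, so everything here is an assumption: the expectation is taken as a topic-weighted sum of topic-to-category relatedness scores, the function names `expected_category_scores` and `label_document` are hypothetical, and the topic weights and relatedness values are made-up illustrative numbers standing in for the output of LDA and of a concept relatedness measure.

```python
from typing import Dict, List


def expected_category_scores(
    doc_topics: List[float],
    topic_category_relatedness: List[Dict[str, float]],
) -> Dict[str, float]:
    """Expected relatedness of one document to each category.

    doc_topics[k] is the document's weight on topic k (assumed to come
    from an LDA model). topic_category_relatedness[k][c] is an assumed
    relatedness score between topic k and category c.
    """
    if len(doc_topics) != len(topic_category_relatedness):
        raise ValueError("one relatedness row is required per topic")
    scores: Dict[str, float] = {}
    for weight, row in zip(doc_topics, topic_category_relatedness):
        for category, relatedness in row.items():
            # Accumulate the topic-weighted relatedness for this category.
            scores[category] = scores.get(category, 0.0) + weight * relatedness
    return scores


def label_document(
    doc_topics: List[float],
    topic_category_relatedness: List[Dict[str, float]],
) -> str:
    """Return the category with the highest expected relatedness."""
    scores = expected_category_scores(doc_topics, topic_category_relatedness)
    if not scores:
        raise ValueError("no categories to choose from")
    return max(scores, key=lambda category: scores[category])


# Illustrative values only: three topics, two WebKB-style categories.
doc_topics = [0.7, 0.2, 0.1]
relatedness = [
    {"course": 0.9, "faculty": 0.2},
    {"course": 0.1, "faculty": 0.8},
    {"course": 0.3, "faculty": 0.3},
]
scores = expected_category_scores(doc_topics, relatedness)
label = label_document(doc_topics, relatedness)
print(scores, label)
```

With these numbers the document is assigned to "course", because its dominant topic is the one most related to that category. Labels produced this way could then serve as training labels for a supervised classifier, which is the role the abstract assigns to TML.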
