Abstract
To improve the efficiency of text labeling and classification, this paper proposes Text Mark Label (TML), an automatic text labeling algorithm based on conceptual semantic relatedness and LDA (Latent Dirichlet Allocation), intended to replace the manual assignment of category labels to texts. Building on the computation of semantic relatedness between concepts, the algorithm uses LDA to extract a topic representation of each text and then labels the text automatically by computing the expectation that its topics belong to each category. To verify the effectiveness of TML, supervised text classification experiments were run with text classifiers on standard text classification datasets. To compare how the dataset and the classifier affect classification performance, three classifiers (Rocchio, KNN, SVM) were each applied to three datasets (WebKB, Reuters-21578, 20 Newsgroups). The experimental results show that TML effectively improves the efficiency of both text classification and text labeling.
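The labeling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes one plausible reading: each document has a topic distribution (as LDA would produce), each topic has a relatedness score to each category, and the label is the category with the highest expected relatedness. The function names, the shape of the relatedness table, and all numeric values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def expected_category_scores(
    doc_topics: List[float],
    relatedness: List[Dict[str, float]],
) -> Dict[str, float]:
    """Expected relatedness of one document to each category.

    doc_topics[k]     : weight of topic k in the document (assumed LDA output)
    relatedness[k][c] : hypothetical relatedness of topic k to category c
    """
    scores: Dict[str, float] = {}
    for weight, row in zip(doc_topics, relatedness):
        for category, rel in row.items():
            # Accumulate the topic-weighted relatedness for this category.
            scores[category] = scores.get(category, 0.0) + weight * rel
    return scores


def auto_label(doc_topics: List[float], relatedness: List[Dict[str, float]]) -> str:
    """Return the category with the highest expected relatedness."""
    scores = expected_category_scores(doc_topics, relatedness)
    return max(scores, key=scores.get)


# Toy values, made up purely for illustration.
doc_topics = [0.7, 0.2, 0.1]
relatedness = [
    {"course": 0.9, "faculty": 0.1},
    {"course": 0.2, "faculty": 0.8},
    {"course": 0.5, "faculty": 0.5},
]
scores = expected_category_scores(doc_topics, relatedness)
label = auto_label(doc_topics, relatedness)
```

In a fuller pipeline, the topic weights would come from a fitted LDA model and the relatedness table from a concept-level semantic relatedness measure. Both are replaced here by hand-written numbers so that the sketch stays self-contained.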