Method of Short Text Classification Based on Frequent Item Feature Extension

Original title: 基于频繁项特征扩展的短文本分类方法
Authors: JIN Yi-fan (靳一凡); FU Ying-xun (傅颖勋); MA Li (马礼)
Affiliation: College of Information, North China University of Technology (北方工业大学信息学院)
Keywords: short text classification; feature extension; frequent item mining; feature weight; support vector machine
Journal: Computer Science (计算机科学), journal code JSJA
Publication date: 2019-06-15
Publisher: Computer Science (计算机科学)
Year: 2019
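Volume: 46
Issue: S1
Pages: 488-491
Page count: 4
Funding: National Natural Science Foundation of China (61702013); Beijing Excellent Talents Training Funding Project (2016000020124G016); Science and Technology Program of the Beijing Municipal Education Commission (KM201710009008); North China University of Technology Research Start-up Project
Language: Chinese
CN: 50-1075/TP
Record ID: JSJA2019S1103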

Abstract

Short texts have high-dimensional, sparse feature representations, so traditional classification methods perform poorly on them. To address this problem, the paper proposes a short text classification method based on frequent item feature extension (Short Text Classification Based on Frequent Item Feature Extension, STCFIFE). First, the FP-growth algorithm mines frequent itemsets from a background corpus, and the weights of the extended features are computed by combining the associated features from the context. The new features are then added to the feature space of the original short text, and a support vector machine (SVM) classifier is trained on this extended space and used for classification. Experimental results show that, compared with the traditional SVM algorithm and the LDA+KNN algorithm, STCFIFE effectively alleviates the feature insufficiency and high-dimensional sparsity of short texts, raising the F1 value by 2% to 10% and improving short text classification performance.
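
The feature extension step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the rule used here (weight of the trigger word multiplied by the confidence of the association trigger -> new word) is an assumption, and brute-force pair counting stands in for FP-growth to keep the example dependency-free. The function names `mine_frequent_pairs` and `extend_features`, the thresholds, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(background_docs, min_support):
    """Find word pairs that co-occur in at least min_support of the documents.

    Brute-force stand-in for FP-growth, restricted to 2-itemsets.
    Returns (pairs, word_support), both mapping to document-frequency ratios.
    """
    n_docs = len(background_docs)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in background_docs:
        words = sorted(set(doc))  # each word counts once per document
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    pairs = {
        frozenset(pair): count / n_docs
        for pair, count in pair_counts.items()
        if count / n_docs >= min_support
    }
    word_support = {w: c / n_docs for w, c in word_counts.items()}
    return pairs, word_support


def extend_features(short_text, pairs, word_support, min_confidence=0.5):
    """Add associated words to a short text's bag-of-words features.

    Assumed weighting (not from the paper):
    weight(new) = weight(trigger) * confidence(trigger -> new).
    """
    features = dict(Counter(short_text))  # original term-frequency weights
    extended = {}
    for pair, support in pairs.items():
        for trigger in pair:
            (candidate,) = pair - {trigger}
            if trigger in features and candidate not in features:
                confidence = support / word_support[trigger]
                if confidence >= min_confidence:
                    weight = features[trigger] * confidence
                    extended[candidate] = max(extended.get(candidate, 0.0), weight)
    features.update(extended)
    return features


# Toy background corpus (already tokenized); purely illustrative data.
background = [
    ["apple", "iphone", "phone"],
    ["apple", "iphone", "launch"],
    ["apple", "fruit", "juice"],
    ["iphone", "phone", "screen"],
]
pairs, word_support = mine_frequent_pairs(background, min_support=0.5)
extended = extend_features(["iphone", "price"], pairs, word_support)
print(extended)
```

In this toy run the short text ["iphone", "price"] gains the features "apple" and "phone", each with a weight of about 0.67. In the pipeline the abstract describes, such extended vectors would then be used to train the SVM classifier; that step is omitted here.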

References

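[1] ZHANG Z F,MIAO D Q,GAO C.Short text classification method based on LDA topic model[J].Journal of Computer Applications,2013,33(6):1587-1590.(in Chinese)
[2] WANG W,ZHAO K K,LI C P,et al.Research on short text feature extension and classification on the Spark platform[J].Journal of Frontiers of Computer Science and Technology,2017,34(5):1-9.(in Chinese)
[3] WANG Z Z,HE M,DU Y P.Text similarity calculation based on LDA topic model[J].Computer Science,2013,40(12):229-232.(in Chinese)
[4] SHI J,LI W L.Topic analysis based on LDA model[J].Acta Automatica Sinica,2009,35(12):1586-1593.(in Chinese)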
[5] YANG Y,ZHANG J,KISIEL B.A scalability analysis of classifiers in text categorization[C]//Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-03).Toronto:ACM Press,2003:96-103.
[6] JOACHIMS T.Text Categorization with Support Vector Machines:Learning with Many Relevant Features[J].Machine Learning,1998,1398(23):137-142.
[7] CALMA A,REITMAIER T,SICK B.Semi-Supervised Active Learning for Support Vector Machines:A Novel Approach that Exploits Structure Information in Data[J].Information Sciences,2018,456:13-22.
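[8] XU G M,LIU H Z,ZHANG J Z.Multi-relational naive Bayes classification model based on feature weighting[J].Computer Science,2014,41(2):283-285.(in Chinese)
[9] HU Y,SHI B.Research on a fast KNN text classification algorithm based on region partitioning[J].Computer Science,2012,39(10):182-186.(in Chinese)
[10] JI Y M,ZHANG Y P,LANG X B,et al.Parallelization of a decision tree classification algorithm for streaming data[J].Journal of Computer Research and Development,2017,54(9):1945-1957.(in Chinese)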
[11] SHIRAKAWA M,NAKAYAMA K,HARA T,et al.Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts Using Extended Naive Bayes[J].IEEE Transactions on Emerging Topics in Computing,2015,3(2):1.
[12] LIU W S,CAO Z W,WANG J,et al.Short text classification based on Wikipedia and Word2vec[C]//2nd IEEE International Conference on Computer and Communications (ICCC).2016.
[13] HE H,CHEN B,XU W,et al.Short Text Feature Extraction and Clustering for Web Topic Mining[C]//Proceedings of the Third International Conference on Semantics,Knowledge and Grid.IEEE Computer Society,2007:382-385.
[14] LIU J L,YAN Y Y.SMS Text Classification Method Based on Context[J].Computer Engineering,2011,37(10):41-43.
[15] CHEN Q U,YAO L X,YANG J.Short text classification based on LDA topic model[C]//International Conference on Audio,Language and Image Processing (ICALIP).2016.
[16] WANG X L,WANG J,YANG Y.Labeled LDA-Kernel SVM:A Short Chinese Text Supervised Classification Based on Sina Weibo[C]//4th International Conference on Information Science and Control Engineering (ICISCE).2017.
[17] YUAN M.Feature Extension for Short Text Categorization Using Frequent Term Sets[J].Elsevier Procedia Computer Science,2014,31:663-670.
[18] FENG G,LI S,SUN T,et al.A Probabilistic Model Derived Term Weighting Scheme for Text Classification[J].Pattern Recognition Letters,2018,110:23-29.
[19] MIROŃCZUK M M,PROTASIEWICZ J.A Recent Overview of the State-of-the-Art Elements of Text Classification[J].Expert Systems with Applications,2018,106:36-54.
[20] LI H,WANG Y,ZHANG D,et al.PFP:Parallel FP-Growth for Query Recommendation[C]//Proceedings of the 2008 ACM Conference on Recommender Systems.ACM,2008:107-114.
[21] SOGOULABS.SogouCS,version:2012[OL].http://www.sogou.com/labs/resource/cs.php.
