基于加权核非负矩阵分解的短文本聚类算法
详细信息    查看全文 | 推荐本文 |
  • 英文篇名:Short text clustering algorithm based on weighted kernel nonnegative matrix factorization
  • 作者:曹大为 ; 贺超波 ; 陈启买 ; 刘海
  • 英文作者:CAO Dawei;HE Chaobo;CHEN Qimai;LIU Hai;School of Computer,South China Normal University;School of Information Science and Technology,Zhongkai University of Agriculture and Engineering;
  • 关键词:核方法 ; 短文本聚类 ; 非负矩阵分解 ; 核技巧 ; 迭代优化求解
  • 英文关键词:kernel method;;short text clustering;;Nonnegative Matrix Factorization(NMF);;kernel trick;;iterative optimization
  • 中文刊名:JSJY
  • 英文刊名:Journal of Computer Applications
  • 机构:华南师范大学计算机学院;仲恺农业工程学院信息科学技术学院;
  • 出版日期:2018-04-09 13:27
  • 出版单位:计算机应用
  • 年:2018
  • 期:v.38;No.336
  • 基金:广东省科技计划项目(2017A040405057,2017A030303074,2016A030303058,2015A020209178);; 广州市科技计划项目(201604016035,201807010043)~~
  • 语种:中文;
  • 页:JSJY201808008
  • 页数:6
  • CN:08
  • ISSN:51-1307/TP
  • 分类号:46-50+57
摘要
对互联网产生的大量短文本进行聚类分析具有重要的应用价值,但由于短文本存在特征稀疏和特征难以提取的问题,导致传统的文本聚类算法难以有效处理该问题。为了解决该问题,利用非负矩阵分解(NMF)模型提出基于加权核非负矩阵分解(WKNMF)的短文本聚类算法。该算法通过核方法的映射关系将稀疏特征空间映射到高维隐性空间,从而可以充分利用短文本中的隐性语义特征进行聚类;另外,利用核技巧简化高维数据的复杂运算,并通过迭代更新规则不断地动态调整短文本的权重向量,从而可以区分不同短文本对聚类的重要性。在真实的微博数据集上进行了相关实验,结果表明WKNMF算法比K均值、隐含狄利克雷分布(LDA)、NMF和自组织神经网络(SOM)具有更好的聚类质量,准确度和归一化互信息分别达到了66.38%和66.91%。
        Clustering analysis of a large number of short texts generated by the Internet is of great application value.Because the characteristics of short texts such as sparse features and difficulty of extracting features,the traditional text clustering algorithm faces many challenges in short text clustering. To solve the problem,a short text clustering algorithm based on Weighted Kernel Nonnegative Matrix Factorization( WKNMF) was proposed by using Nonnegative Matrix Factorization( NMF) model. To make full use of hidden semantic features in short texts for clustering,sparse feature space was mapped to high-dimensional implicit vectors by using kernel method. In addition,kernel trick was used to simplify the complex operation of high-dimensional data,and the weight vectors of short texts were dynamically adjusted through iterative optimization updating rules,so that the importance of different short texts to clustering can be distinguished. Experiments were conducted on real Micro-blog data sets and WKNMF algorithm was compared with K-means,Latent Dirichlet Allocation( LDA),Nonnegative Matrix Factorization( NMF) and Self-Organization Map( SOM). The experimental results show that the proposed WKNMF algorithm has a better clustering performance than the contrast algorithms,its accuracy and Normalized Mutual Information( NMI) reach 66. 38% and 66. 91% respectively.
引文
[1]WANG J,LI Q,CHEN Y P,et al.Recommendation in internet forums and blogs[C]//ACL 2010:Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.Stroudsburg,PA:Association for Computational Linguistics,2010:257-265.
    [2]LI J,RITTER A,HOVY E.Weakly supervised user profile extraction from twitter[C]//ACL 2014:Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.Stroudsburg,PA:Association for Computational Linguistics,2014:165-174.
    [3]史庆伟,刘雨诗,张丰田.基于微博文本的词对主题演化模型[J].计算机应用,2017,37(5):1407-1412.(SHI Q W,LIU Y S,ZHANG F T.Biterm topic evolution model of microblog[J].Journal of Computer Applications,2017,37(5):1407-1412.)
    [4]BANERJEE S,RAMANATHAN K,GUPTA A.Clustering short texts using Wikipedia[C]//SIGIR 2007:Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.New York:ACM,2007:787-788.
    [5]FODEH S,PUNCH B,TAN P-N.On ontology-driven document clustering using core semantic features[J].Knowledge and Information Systems,2011,28(2):395-421.
    [6]EFRON M,ORGANISCIAK P,EENLON K.Improving retrieval of short texts through document expansion[C]//SIGIR 2012:Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.New York:ACM,2012:911-920.
    [7]TANG J,WANG X,GAO H,et al.Enriching short text representation in microblog for clustering[J].Frontiers of Computer Science,2012,6(1):88-101.
    [8]HU X,SUN N,ZHANG C,et al.Exploiting internal and external semantics for the clustering of short texts using world knowledge[C]//CIKM 2009:Proceedings of the 18th ACM Conference on Information and Knowledge Management.New York:ACM,2009:919-928.
    [9]KIM Y.Convolutional neural networks for sentence classification[C]//EMNLP 2014:Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.Stroudsburg,PA:Association for Computational Linguistics,2014:1746-1751.
    [10]周飞燕,金林鹏,董军.卷积神经网络研究综述[J].计算机学报,2017,40(6):1229-1251.(ZHOU F Y,JIN L P,DONG J.Review of convolutional neural network[J].Chinese Journal of Computers,2017,40(6):1229-1251.)
    [11]LEE D D.Learning the parts of objects by non-negative matrix factorization[J].Nature,1999,401(10):788-791.
    [12]VAGHMARE A K,RAO C V R.Unsupervised noise removal technique based on constrained NMF[J].IET Signal Processing,2017,11(7):788-795.
    [13]涂鼎,陈岭,陈根才,等.基于在线层次化非负矩阵分解的文本流主题检测[J].浙江大学学报(工学版),2016,50(8):1618-1626.(TU D,CHEN L,CHEN G C,et al.Hierarchical online NMF for detecting and tracking topic[J].Journal of Zhejiang University(Engineering Science),2016,50(8):1618-1626.)
    [14]徐海勇,郁梅,骆挺,等.基于非负矩阵分解的彩色图像质量评价方法[J].电子与信息学报,2016,38(3):578-585.(XU H Y,YU M,LUO T,et al.A color image quality assessment method based on non-negative matrix factorization[J].Journal of Electronics and Information Technology,2016,38(3):578-585.)
    [15]陈增强,谢征,张青.基于非负矩阵分解的复杂网络重构[J].复杂系统与复杂性科学,2016,13(3):26-32.(CHEN Z Q,XIE Z,ZHANG Q.Complex network reconstruction based on nonnegative matrix factorization[J].Complex Systems and Complexity Science,2016,13(3):26-32.)
    [16]XU W,LIU X,GONG Y.Document clustering based on non-negative matrix factorization[C]//SIGIR 2003:Proceedings of the26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.New York:ACM,2003:267-273.
    [17]袁伟,朱山风.基于距离学习的生物医学文本聚类算法研究[J].计算机应用与软件,2010,27(11):4-5.(YUAN W,ZHU S F.Study on biomedical text clustering algorithm based on metric learning[J].Computer Applications and Software,2010,27(11):4-5.)
    [18]LIN X,ZHANG Q,WEI G.The clustering algorithm for Chinese texts based on Lingo[C]//FSKD 2011:Eighth International Conference on Fuzzy Systems and Knowledge Discovery.Washington,D.C.:IEEE Computer Society,2011,2:1187-1190.
    [19]LIU Y,PI D,CHENG Q.Ensemble kernel method:SVM classification based on game theory[J].Journal of Systems Engineering and Electronics,2016,27(1):251-259.
    [20]ZHANG L,HU X.Locally adaptive multiple kernel clustering[J].Neurocomputing,2014,137:192-197.
    [21]ZHANG D,ZHOU Z-H,CHEN S.Non-negative matrix factorization on kernels[C]//PRICAI 2006:Proceedings of the 9th Pacific Rim International Conference on Artificial Intelligence.Berlin:Springer,2006:404-412.
    [22]AN S,YUN J-M,CHOI S.Multiple kernel nonnegative matrix factorization[C]//ICASSP 2011:Proceedings of the 2011 IEEE International Conference on Acoustics,Speech and Signal Processing.Washington,DC:IEEE Computer Society,2011:1976-1979.
    [23]ZHU F,HONEINE P,KALLAS M.Kernel nonnegative matrix factorization without the pre-image problem[C]//MLSP 2014:Proceedings of the 2014 IEEE International Workshop on Machine Learning for Signal Processing.Piscataway,NJ:IEEE,2014:1-6.
    [24]LIN C-J.Projected gradient methods for nonnegative matrix factorization[J].Neural Computation,2007,19(10):2756-2779.
    [25]LLOYD S.Least squares quantization in PCM[J].IEEE Transactions on Information Theory,1982,28(2):129-137.
    [26]RAO C R.The utilization of multiple measurements in problems of biological classification[J].Journal of the Royal Statistical Society,Series B(Methodological),1948,10(2):159-203.
    [27]张立文,徐家宁,李进,等.基于免疫网络和SOM的文本聚类算法研究[J].计算机应用与软件,2010,27(5):118-120.(ZHANG L W,XU J N,LI J,et al.Research on text clustering algorithm based on artificial immune network and SOM[J].Computer Applications and Software,2010,27(5):118-120.)
    [28]MUNKRES J.Algorithms for the assignment and transportation problems[J].Journal of the Society for Industrial and Applied Mathematics,1957,5(1):32-38.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700