Abstract
Clustering a blogger's highly related microblog posts gives a quick view of that blogger's professional interests and experience. Existing short-text clustering methods do not sufficiently account for semantics or for sentence-level relatedness. We propose a HowNet-based method that clusters an individual's microblog posts by semantic relatedness. The method has four steps: (1) train a Skip-gram model on a large corpus of microblog text to generate word vectors; (2) disambiguate the words within each sentence according to their sememes; (3) compute word-level and sentence-level similarity between posts and combine the two into a post relatedness score; (4) cluster the individual's posts according to this relatedness score. Experiments show that the proposed method is clearly more accurate than hierarchical clustering and density-based clustering.
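Steps (3) and (4) can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: the toy vectors in `VECTORS` stand in for Skip-gram output, the word-level score is an averaged best-match cosine, the sentence-level score is the cosine of mean vectors, the weight `alpha` and the `threshold` are arbitrary, and the greedy single-pass grouping is a placeholder for the paper's clustering procedure. Sememe-based disambiguation (step 2) is omitted because it requires the HowNet resource.

```python
import math

# Toy 3-dimensional vectors standing in for Skip-gram output (hypothetical values).
VECTORS = {
    "python": (0.9, 0.1, 0.0),
    "code":   (0.8, 0.2, 0.1),
    "bug":    (0.7, 0.3, 0.0),
    "travel": (0.0, 0.9, 0.2),
    "beach":  (0.1, 0.8, 0.3),
    "hotel":  (0.0, 0.7, 0.4),
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def word_level_sim(p, q):
    """Assumed word-level score: each word's best match in the other post, averaged both ways."""
    def directed(a, b):
        return sum(max(cosine(VECTORS[w], VECTORS[x]) for x in b) for w in a) / len(a)
    return 0.5 * (directed(p, q) + directed(q, p))

def sentence_level_sim(p, q):
    """Assumed sentence-level score: cosine of the posts' mean word vectors."""
    def mean(words):
        vecs = [VECTORS[w] for w in words]
        return tuple(sum(c) / len(vecs) for c in zip(*vecs))
    return cosine(mean(p), mean(q))

def relatedness(p, q, alpha=0.5):
    """Weighted combination of the two scores; alpha is an illustrative assumption."""
    return alpha * word_level_sim(p, q) + (1 - alpha) * sentence_level_sim(p, q)

def cluster(posts, threshold=0.8):
    """Greedy single pass: a post joins the first cluster whose first member it is related to."""
    clusters = []
    for i, post in enumerate(posts):
        for c in clusters:
            if relatedness(posts[c[0]], post) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

posts = [["python", "code"], ["travel", "beach"], ["code", "bug"], ["beach", "hotel"]]
print(cluster(posts))  # → [[0, 2], [1, 3]]
```

With these toy vectors the two programming posts and the two travel posts fall into separate clusters. Real use would replace `VECTORS` with trained embeddings and tune `alpha` and `threshold` on labeled data.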