Microblog topic detection based on SOM clustering

Original title: 基于SOM聚类的微博话题发现
Authors: Song Lina (宋莉娜); Feng Xupeng (冯旭鹏); Liu Lijun (刘利军); Huang Qingsong (黄青松)
Affiliations: Faculty of Information Engineering & Automation, Kunming University of Science & Technology (昆明理工大学信息工程与自动化学院); Educational Technology & Network Center, Kunming University of Science & Technology (昆明理工大学教育技术与网络中心); Yunnan Provincial Key Laboratory of Computer Technology Applications, Kunming University of Science & Technology (昆明理工大学云南省计算机技术应用重点实验室)
Keywords: topic detection; word vector model; text similarity; short text; SOM clustering
Journal code: JSYJ
Journal: Application Research of Computers (计算机应用研究)
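Publication date: 2017-03-21 09:47
Publisher: 计算机应用研究 (Application Research of Computers)
Year: 2018
Volume and serial issue: v.35; No.317
Funding: National Natural Science Foundation of China (81360230, 81560296)
Language: Chinese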
Record ID: JSYJ201803008
Page count: 5
Issue: 03
CN: 51-1196/TP
Pages: 37-40+45

Abstract

As the number of microblog users grows, information on microblog platforms is updated ever more frequently. To address the data sparseness, abundance of new words, and non-standard language of microblog text, this paper proposes a microblog topic detection method based on SOM clustering. The texts in the raw corpus are first preprocessed, and features of the short texts are then extracted with a word vector model, which reduces the heavy computation caused by excessively high vector dimensionality. An improved SOM is then used to cluster the topics; the algorithm remedies shortcomings of traditional text clustering and can thus detect topics effectively. Experiments show that the algorithm achieves a clearly higher composite F-measure than traditional text clustering algorithms.
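
The pipeline outlined in the abstract (word-vector features for short texts, followed by SOM clustering) can be illustrated with a minimal sketch. This is a plain textbook SOM, not the improved variant the paper proposes, whose details the abstract does not give. The function names, the toy 2-D word vectors, the mean-pooling of word vectors into a text vector, and the deterministic weight initialisation are all assumptions made for illustration.

```python
import numpy as np


def text_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens (assumed pooling); zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=0.5, seed=0):
    """Train a plain rectangular SOM; returns weights of shape (rows * cols, dim)."""
    rng = np.random.default_rng(seed)
    n_units = rows * cols
    # Assumption: initialise units from evenly spaced training vectors (reproducible).
    idx = np.linspace(0, len(data) - 1, n_units).astype(int)
    weights = data[idx].astype(float)
    # Grid coordinates of every unit, used for neighbourhood distances.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3   # shrinking neighbourhood radius
        for x in data[rng.permutation(len(data))]:
            # Best-matching unit: the unit whose weight vector is nearest to x.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbours toward x.
            weights += lr * h[:, None] * (x - weights)
    return weights


def assign_clusters(data, weights):
    """Map each text vector to the index of its best-matching unit."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


# Hypothetical 2-D word vectors; a real system would load trained embeddings.
word_vectors = {
    "rain": np.array([1.0, 0.0]),
    "storm": np.array([0.95, 0.05]),
    "flood": np.array([0.9, 0.1]),
    "goal": np.array([0.0, 1.0]),
    "match": np.array([0.05, 0.95]),
    "team": np.array([0.1, 0.9]),
}
# Already-tokenised toy posts; "oov" stands for an out-of-vocabulary token.
texts = [
    ["rain", "storm"],
    ["storm", "flood"],
    ["flood", "rain", "oov"],
    ["goal", "match"],
    ["match", "team"],
    ["team", "goal"],
]
data = np.array([text_vector(t, word_vectors, dim=2) for t in texts])
weights = train_som(data, rows=1, cols=2)
labels = assign_clusters(data, weights)
print(labels.tolist())
```

Each map unit acts as one candidate topic, so the posts that share a best-matching unit form a cluster. On real microblog data the map would be larger, and the vectors would come from a trained word vector model rather than a hand-written dictionary.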
