Abstract
To help investors identify investment hot spots promptly over short horizons, this paper proposes a topic detection model tailored to the characteristics of financial news. The model segments the news stream with a time window built from financial news, extracts text features from the topic events, feature words, news semantics, and financial named entities in each article, and applies the Nearest Neighbor-Hierarchical Agglomerative Clustering (NNHAC) algorithm to obtain topic clusters. Experimental results show that, compared with traditional multi-feature topic detection models, the proposed model reduces the running time of the clustering algorithm, improves topic detection accuracy, and to some extent assists investors in making decisions.
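The abstract does not spell out how the time window or NNHAC works, so the following is only a minimal sketch of one plausible reading: news items are grouped into fixed-length windows, a single nearest-neighbor pass forms small clusters, and those clusters are then merged agglomeratively by centroid similarity. The function names, the two thresholds, and the plain bag-of-words features are all assumptions made for illustration; the paper's actual features combine topic events, feature words, news semantics, and financial named entities.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def split_by_window(news, window):
    """Group (timestamp, text) pairs into consecutive windows of length `window`."""
    windows = []  # each entry: (window start timestamp, list of texts)
    for ts, text in sorted(news, key=lambda item: item[0]):
        if windows and ts - windows[-1][0] < window:
            windows[-1][1].append(text)
        else:
            windows.append((ts, [text]))
    return [texts for _, texts in windows]


def vectorize(text):
    """Bag-of-words stand-in for the paper's multi-feature representation (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_neighbor_pass(vectors, threshold):
    """Single pass: attach each document to its most similar cluster, or open a new one."""
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # centroid kept as a summed count vector
            best["members"].append(i)
    return clusters


def agglomerate(clusters, threshold):
    """Repeatedly merge the most similar pair of clusters until none reaches `threshold`."""
    clusters = list(clusters)
    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i]["centroid"], clusters[j]["centroid"])
                if sim >= best_sim:
                    best_pair, best_sim = (i, j), sim
        if best_pair is None:
            break
        i, j = best_pair
        merged = {
            "centroid": clusters[i]["centroid"] + clusters[j]["centroid"],
            "members": clusters[i]["members"] + clusters[j]["members"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


def detect_topics(texts, nn_threshold=0.5, merge_threshold=0.3):
    """Return topic clusters as sorted lists of document indices (illustrative thresholds)."""
    vectors = [vectorize(t) for t in texts]
    clusters = agglomerate(nearest_neighbor_pass(vectors, nn_threshold), merge_threshold)
    return sorted(sorted(c["members"]) for c in clusters)
```

In this sketch the nearest-neighbor pass shrinks the number of items the quadratic agglomerative step has to compare, which is one way a combined scheme could reduce clustering time; whether this matches the paper's design is not confirmed by the abstract.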