Abstract
To discover topics in stock bar forums more effectively, this paper proposes a short text clustering framework that combines frequent itemsets with latent semantics (STC_FL). Knowledge acquisition based on HowNet first yields a concept vector space, from which the important frequent itemsets are mined and filtered. These itemsets are then clustered adaptively by a method that combines statistics with latent semantics. Finally, the TSC-SN (text soft classifying based on similarity threshold and non-overlapping) algorithm is proposed, in which a parameter-tuning strategy selects and controls the soft clustering of texts. An empirical analysis of real stock bar forum data shows that the STC_FL framework and the TSC-SN algorithm can fully exploit the latent semantic information of the texts and effectively reduce the dimensionality of the feature space, which enables deeper information mining and topic classification of short texts.
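To make the two central ideas concrete, the following is a minimal sketch in Python. It is not the STC_FL framework or the TSC-SN algorithm themselves, whose details the abstract does not give. It assumes brute-force counting of frequent itemsets up to size two and a plain Jaccard similarity threshold for soft assignment; the function names, the toy English tokens, and the parameter values are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Count word sets of size 1..max_size and keep those that
    occur in at least min_support documents (brute force)."""
    counts = Counter()
    for tokens in docs:
        unique = sorted(set(tokens))
        for k in range(1, max_size + 1):
            for combo in combinations(unique, k):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def soft_assign(docs, topics, threshold):
    """Soft clustering: a document joins every topic whose
    similarity reaches the threshold, so it may join several."""
    result = []
    for tokens in docs:
        words = set(tokens)
        result.append(
            [i for i, t in enumerate(topics) if jaccard(words, t) >= threshold]
        )
    return result


# Hypothetical toy corpus of already-tokenized short texts.
docs = [
    ["bank", "rate", "cut"],
    ["bank", "rate", "rise"],
    ["oil", "price", "rise"],
    ["oil", "price", "drop"],
    ["bank", "rate", "oil", "price"],
]

freq = frequent_itemsets(docs, min_support=2)
# Assumption: use the frequent pairs as topic descriptors.
topics = sorted((s for s in freq if len(s) == 2), key=sorted)
assignment = soft_assign(docs, topics, threshold=0.5)
print(assignment)
```

In this toy setting the last document matches both frequent pairs and is therefore placed in two clusters, which is the sense of "soft" used here. A lower threshold admits more overlapping memberships and a higher one fewer, which is presumably why the abstract pairs the threshold with a parameter-tuning strategy.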