Long Tail Text Clustering Algorithm Based on Frequent Patterns

Chinese title: 基于频繁模式的长尾文本聚类算法
Authors (Chinese): 宋中山; 张广凯; 尹帆; 帖军
Authors: SONG Zhong-Shan; ZHANG Guang-Kai; YIN Fan; TIE Jun
Affiliation: School of Computer Science, South-Central University for Nationalities
Keywords: text clustering; long tail phenomenon; frequent patterns; K-medoids algorithm
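Journal code: XTYY
Journal: Computer Systems & Applications
Institution: 中南民族大学计算机科学学院
Publication date: 2019-04-15
Publisher: 计算机系统应用 (Computer Systems & Applications)
Year: 2019
Volume: 28
Funding: Sub-project of the National Science and Technology Support Program (2015BAD29B01); Soft Science Research Project of the Ministry of Agriculture (D201721); Fundamental Research Funds for the Central Universities (CZY18016)
Language: Chinese
Record ID: XTYY201904021
Page count: 6
Issue: 04
CN: 11-2854/TP
Pages: 143-148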

Abstract

Short text clustering is a long-standing topic in information extraction. Large-scale short text data exhibit a "long tail phenomenon": when traditional algorithms cluster such data, they face high feature dimensionality and lose the information carried by small classes. To address these problems, this paper proposes a Frequent Itemsets collaborative Pruning iteration Clustering framework (FIPC). FIPC combines an iterative clustering framework with the K-medoids algorithm and applies a collaborative pruning strategy to cluster small-class texts. Experimental results show that FIPC effectively improves the precision of clustering small-class short texts and avoids the problem of overlapping clusters.
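
The abstract names the K-medoids algorithm as one building block but does not describe FIPC's iterative framework or its pruning strategy. As an illustration only, the sketch below shows plain K-medoids over short texts represented as token sets. The Jaccard distance, the function names, and the `init_medoids` parameter are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def k_medoids(docs: List[Set[str]], init_medoids: List[int],
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Plain K-medoids: alternate assignment and medoid update.

    docs: each short text as a set of tokens (assumed pre-tokenized).
    init_medoids: indices of the starting medoids (hypothetical choice).
    Returns (medoid indices, cluster label per document).
    """
    medoids = list(init_medoids)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: attach every document to its nearest medoid.
        labels = [
            min(range(len(medoids)),
                key=lambda c: jaccard_distance(doc, docs[medoids[c]]))
            for doc in docs
        ]
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members.
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Empty cluster (e.g. duplicate medoids): keep the old medoid.
                new_medoids.append(medoids[c])
                continue
            best = min(members,
                       key=lambda i: sum(jaccard_distance(docs[i], docs[j])
                                         for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:  # no medoid changed: converged
            break
        medoids = new_medoids
    return medoids, labels
```

Because every medoid is an actual document, each cluster has a readable representative, and only pairwise distances are needed rather than a dense high-dimensional mean vector.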

