A field feature mining method via buzzword semantic clustering

English title: A field feature mining method via buzzword semantic clustering
Authors: ZHUANG Jianchang (庄建昌); WU Jiao (武娇); GU Xingquang (顾兴全); HONG Caifeng (洪彩凤)
Affiliations: College of Sciences, China Jiliang University; College of Standardization, China Jiliang University
Keywords: measurement; keyword extraction; word vectors (word2vec); clustering
Journal code: ZGJL
Journal: Journal of China University of Metrology (中国计量大学学报)
Publication date: 2019-06-15
Publisher: Journal of China University of Metrology
Year: 2019
Issue: v.30; No.94
Funding: National Natural Science Foundation of China (No. 61302190)
Language: Chinese
Page: ZGJL201902014
Page count: 9
CN: 02
ISSN: 33-1401/TB
Classification number: 89-97

Abstract

Aims: To help users mine and analyze the features of a field through its keywords, and to address two problems facing domain keyword extraction: redundant information in the domain corpus and its unbalanced distribution. Methods: A two-step keyword extraction strategy is proposed and combined with a word vector model and a clustering algorithm to build a local buzzword model of the field. Results: The method yields the buzzwords of the field, their frequency distribution, a partition of the field's features, and the corresponding cluster distribution graph. Conclusions: Results on tourist review mining show that the method extracts field features effectively, supports visualization of those features, and reduces the negative impact of an unbalanced corpus.
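The pipeline outlined in the abstract (two-step keyword extraction, word vectors, clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the two extraction steps, so the sketch assumes that step one picks the top TF-IDF terms of each review and step two ranks the pooled keywords by how many reviews selected them. The toy reviews, the 2-D vectors standing in for word2vec embeddings, and the function names `doc_keywords`, `buzzwords`, and `kmeans` are all hypothetical.

```python
import math
from collections import Counter


def doc_keywords(docs, per_doc):
    """Step 1 (assumed): top `per_doc` TF-IDF terms of each tokenized document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        score = {
            t: (c / len(doc)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(score, key=lambda t: (-score[t], t))
        result.append(ranked[:per_doc])
    return result


def buzzwords(keyword_lists, top_n):
    """Step 2 (assumed): rank pooled keywords by how many documents chose them."""
    freq = Counter(t for kws in keyword_lists for t in kws)
    ranked = sorted(freq, key=lambda t: (-freq[t], t))
    return [(t, freq[t]) for t in ranked[:top_n]]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means; centroids start at the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, already tokenized tourist reviews.
reviews = [
    ["hotel", "staff", "clean", "good", "hotel"],
    ["ticket", "price", "queue", "good", "ticket"],
    ["hotel", "staff", "friendly", "good"],
    ["ticket", "price", "expensive", "good"],
]

# Hypothetical 2-D vectors standing in for trained word2vec embeddings.
toy_vectors = {
    "hotel": (1.0, 0.9),
    "staff": (0.9, 1.0),
    "ticket": (0.2, 0.1),
    "price": (0.1, 0.2),
}

keyword_lists = doc_keywords(reviews, per_doc=3)
hot = buzzwords(keyword_lists, top_n=4)
labels = kmeans([toy_vectors[word] for word, _ in hot], k=2)

for (word, count), label in zip(hot, labels):
    print(f"{word}: selected in {count} reviews, cluster {label}")
```

In a real setting the vectors would come from a trained embedding model and the number of clusters would be chosen with a validity index rather than fixed at 2; both choices here are simplifications for illustration.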
