Automatic Selection of Domain-Specific Stopwords in Topic Model of Patent Text
  • Original title (Chinese): 专利文本主题建模中领域停用词自动选取研究
  • Authors: Yu Yan (俞琰); Zhao Naixuan (赵乃瑄)
  • Keywords: patent text; topic model; domain-specific stopword; automatic selection
  • Journal: Library and Information Service (图书情报工作; journal code TSQB)
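  • Institutions: Information Service Department, Nanjing Tech University; School of Electronics and Computer Science, Southeast University Chengxian College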
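  • Publication date: 2018-06-05
  • Publisher: Library and Information Service (图书情报工作)
  • Year: 2018
  • Volume/issue: v.62; No.600
  • Funding: One of the research outcomes of the Ministry of Education Humanities and Social Sciences Planning Project "Research on Constructing Skill Knowledge Graphs in the Big Data Era" (Grant No. 16YJAZH073) and the National Social Science Fund of China General Project "Research on Multi-Dimensional, Multi-Level Patent Text Mining to Support Innovative Design in the Big Data Era" (Grant No. 17BTQ059)
  • Language: Chinese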
  • Record ID: TSQB201811019
  • Page count: 7
  • Issue: 11
  • CN: 11-1541/G2
  • Pages: 121-127
Abstract
[Purpose/significance] Automatic selection of domain-specific stopwords for topic modeling of patent text has not yet been studied sufficiently. This paper proposes a new method for selecting such stopwords automatically, for use in topic model analysis of patent text, in order to improve the discriminability and modeling quality of patent topic models. [Method/process] Domain-specific stopwords are, in essence, words that carry little information and discriminate poorly between patent texts of different categories. The method therefore introduces an auxiliary multi-category patent text collection, uses category entropy to measure how each word is distributed across the categories, ranks the words by category entropy, and selects a number of words with the highest category entropy as domain-specific stopwords. [Result/conclusion] Experiments on patent text data verify the feasibility and effectiveness of the method, which effectively improves the discriminability of patent topic models.
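The selection step described in the abstract can be sketched as follows. The abstract does not state the exact formula, so this sketch assumes the common definition H(w) = -Σ_c P(c|w) log2 P(c|w), with P(c|w) estimated from raw occurrence counts of word w in category c. The function names, the toy corpus, and the `top_k` parameter are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def category_entropy(docs):
    """Return {word: entropy of the word's distribution over categories}.

    docs is an iterable of (category, tokens) pairs. Assumption: P(c | w) is
    estimated from the raw occurrence counts of w in each category.
    """
    counts = defaultdict(Counter)  # word -> Counter mapping category -> occurrences
    for category, tokens in docs:
        for token in tokens:
            counts[token][category] += 1

    entropy = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        h = 0.0
        for n in per_category.values():
            p = n / total
            h -= p * math.log2(p)
        entropy[word] = h
    return entropy


def select_stopwords(docs, top_k):
    """Rank words by descending category entropy and keep the first top_k."""
    entropy = category_entropy(docs)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(entropy, key=lambda w: (-entropy[w], w))
    return ranked[:top_k]


# Toy auxiliary corpus (hypothetical): (category label, tokenized patent text).
corpus = [
    ("A", ["method", "device", "engine", "piston"]),
    ("B", ["method", "device", "polymer", "resin"]),
    ("C", ["method", "device", "antenna", "signal"]),
]
print(select_stopwords(corpus, top_k=2))  # ['device', 'method']
```

In this toy corpus, "method" and "device" occur evenly in all three categories and therefore reach the maximum entropy log2(3), while category-specific words such as "piston" have entropy 0. Raw counts are not normalized by category size here; with categories of very different sizes, a per-category relative frequency may be a more suitable estimate.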
