An Improved Distributed Representation Method Based on Word Embeddings of Patent Abstracts

English title (as published): Improvement of Distributed Representation Method Based on Patent Summary Words Embedding
Authors: Liu Gang; Cao Yuhong; Pei Yingying; Li Yu
Keywords: patent abstract; word embedding; language model; clustering; natural language processing
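Journal code: HBYD
Journal: Information & Communications
Affiliations: School of Electronics and Control Engineering, North China Institute of Aerospace Engineering; School of Computer and Remote Sensing Information Technology, North China Institute of Aerospace Engineering
Publication date: 2019-04-15
Publisher: 信息通信 (Information & Communications)
Year: 2019
Issue: No. 196
Funding: Langfang Science and Technology Bureau fund project (2018011051)
Language: Chinese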
Record ID: HBYD201904012
Page count: 3
Issue number in year: 04
CN: 42-1739/TN
Pages: 34-36

Abstract

Patent documents are the primary record of patents, and a patent abstract is a further condensation of the patent document. Working with Chinese patent abstracts, the experiment uses the third-party Python library jieba for word segmentation and part-of-speech tagging and gensim for mapping words to vectors. It first examines the problems of segmenting and tagging Chinese patent abstracts, and then the differences between bag-of-words and distributed models of word embedding. To address issues with existing distributed representation methods, such as the continuous, dense nature of their word vectors, the paper proposes clustering the words of a related corpus, then training on the corpus words with the CBOW and Skip-Gram models combined to obtain a weight matrix, and finally applying this weight matrix to test data to predict the center word and obtain its word vector. The study indicates that word vectors produced by the improved distributed representation are better suited to research on recurrent neural networks.
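
The abstract names the CBOW and Skip-Gram models but does not show how their training examples differ. As a minimal sketch, not the authors' implementation, the pure-Python snippet below builds both kinds of example from an already segmented sentence. The token list, the function names, and the window size are illustrative assumptions; in the workflow the abstract describes, the tokens would come from a segmenter such as jieba, and training itself would be handled by a library such as gensim.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return up to `window` tokens on each side of position i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: the surrounding context words are the input, the center word is the target."""
    examples = []
    for i, center in enumerate(tokens):
        context = context_window(tokens, i, window)
        if context:  # skip positions that have no neighbours at all
            examples.append((context, center))
    return examples


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-Gram: the center word is the input, each context word is a separate target."""
    return [
        (center, ctx)
        for i, center in enumerate(tokens)
        for ctx in context_window(tokens, i, window)
    ]


# Hypothetical token list standing in for the output of a word segmenter.
tokens = ["invention", "discloses", "word", "vector", "training", "method"]

cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[2])       # context list and center word for the third token
print(skipgram[:3])  # first three (center, context) pairs
```

With the same text and window, CBOW yields one example per token position, while Skip-Gram yields one example per (center, context) pair, so Skip-Gram produces more training examples from the same corpus.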

