Abstract
Patent documents are the primary record of patents, and a patent abstract is a further condensation of the patent document. Working with the abstracts of Chinese patents, this experiment uses the third-party Python library jieba for word segmentation and part-of-speech tagging, and gensim for word-vector mapping. It examines the problems that arise when segmenting and tagging Chinese patent abstracts, and then compares word embeddings based on the bag-of-words model with those based on distributed representations. To address issues in existing distributed representation methods, such as word vectors being continuous and dense, a method is proposed that first clusters the words of a related corpus and then trains on the corpus words with a combination of the CBOW and Skip-Gram models to obtain a weight matrix. This weight matrix is then applied to test data to predict center words and obtain their word vectors. The results indicate that the distributed word vectors produced by the improved method are better suited to research on recurrent neural networks.
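As a rough illustration of the two training objectives named above, the following sketch builds the training pairs that CBOW and Skip-Gram would each see for one already-segmented sentence, and contrasts them with a bag-of-words count. The token list, the window size, and the helper names `context_window`, `cbow_pairs`, and `skipgram_pairs` are assumptions made for illustration. The experiment itself relies on jieba and gensim, which are not used here, and the clustering step is not shown.

```python
from collections import Counter


def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_pairs(tokens, window=2):
    """CBOW: each example is (context words, center word to predict)."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_pairs(tokens, window=2):
    """Skip-Gram: each example is (center word, one context word to predict)."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]


# Hypothetical, already-segmented sentence; in the experiment a segmenter
# such as jieba would produce this list from raw abstract text.
tokens = ["专利", "摘要", "是", "专利", "文献", "的", "浓缩"]

# Bag-of-words: word order is discarded and only counts remain.
bag_of_words = Counter(tokens)

# Distributed models: training examples are built from local context.
cbow = cbow_pairs(tokens, window=1)
skipgram = skipgram_pairs(tokens, window=1)

print(bag_of_words["专利"])  # count of one token, independent of position
print(cbow[1])               # context around the second token, plus that token
print(skipgram[:3])          # first few (center, context) pairs
```

The bag-of-words count treats both occurrences of the same token identically, whereas the CBOW and Skip-Gram pairs differ for each occurrence because they depend on the neighbouring words. This is the contrast the abstract draws between the two families of representations.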