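Terminology Extraction for the New Energy Vehicle Domain Based on the BLSTM_attention_CRF Model

English title: Terminology extraction for new energy vehicle based on BLSTM_attention_CRF model
Authors: Ma Jianhong (马建红); Zhang Yamei (张亚梅); Yao Shuang (姚爽); Zhang Bingfei (张炳斐); Guo Changhong (郭昌宏)
English authors: Ma Jianhong; Zhang Yamei; Yao Shuang; Zhang Bingfei; Guo Changhong; School of Computer Science & Software, Hebei University of Technology
Keywords: domain term extraction; attention mechanism; bidirectional long short-term memory network; conditional random field; dictionary; rules
English keywords: domain term extraction; attention mechanism; bidirectional long short-term memory; conditional random fields; dictionary; rules
Journal title (Chinese): JSYJ
Journal title (English): Application Research of Computers
Institution: School of Computer Science and Software, Hebei University of Technology
Publication date: 2018-03-14 17:30
Publisher: Application Research of Computers (计算机应用研究)
Year: 2019
Issue: v.36; No.331
Language: Chinese
Pages: JSYJ201905024
Page count: 6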
Issue number: 05
CN: 51-1196/TP
Classification number: 111-115+121

Abstract

To improve the accuracy and recall of terminology extraction in the new energy vehicle domain, this paper presents a domain terminology extraction model for new energy vehicle patent text. Traditional domain terminology extraction methods rely too heavily on hand-crafted features and domain knowledge and cannot automatically mine implicit features, so their recognition performance depends largely on the quality of the selected features. Approaching the problem from the perspective of deep learning, the paper proposes a model that combines an attention-based bidirectional long short-term memory network (BLSTM) with conditional random fields (CRF), referred to as the BLSTM_attention_CRF model, and then corrects the extraction results with a method that combines a dictionary and rules. Experimental results show that the accuracy exceeds 86%, indicating that the approach is effective for term extraction in the new energy vehicle domain.
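The abstract does not describe how the CRF layer turns the network's per-token scores into a tag sequence. The following is a minimal sketch of Viterbi decoding, the standard inference step for a linear-chain CRF, and not the authors' implementation. The BIO tag set, the function name `viterbi_decode`, and all score values are assumptions made up for illustration; in a real model the emission scores would come from the attention-based BLSTM and the transition scores would be learned.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   : score of tag j at token t (hypothetically from a BLSTM).
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters).
    """
    if not emissions:
        return []
    num_tags = len(transitions)
    # best[j] holds the best score of any path that ends in tag j so far.
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best = []
        pointers = []
        for j in range(num_tags):
            # Choose the previous tag that maximises the path score into tag j.
            prev = max(range(num_tags),
                       key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
        best = new_best
        backpointers.append(pointers)
    # Start from the best final tag and follow the backpointers to the start.
    last = max(range(num_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


# Assumed BIO tag set: O = outside a term, B = term start, I = inside a term.
TAGS = ["O", "B", "I"]
FORBIDDEN = -1e4  # large penalty standing in for an impossible transition

# Made-up transition scores; O -> I is forbidden because a term must open with B.
TRANSITIONS = [
    [0.0, 0.0, FORBIDDEN],  # from O
    [0.0, 0.0, 0.5],        # from B
    [0.0, 0.0, 0.5],        # from I
]

# Made-up emission scores for a four-token sentence.
EMISSIONS = [
    [2.0, 0.1, 0.3],
    [0.2, 1.0, 1.5],  # "I" scores highest locally, but cannot follow "O"
    [0.1, 0.4, 2.0],
    [1.8, 0.2, 0.3],
]

decoded = [TAGS[i] for i in viterbi_decode(EMISSIONS, TRANSITIONS)]
print(decoded)
```

Picking the best tag for each token independently would give O, I, I, O here, which is not a valid BIO sequence. Scoring whole sequences lets the transition scores rule that out, which is the usual motivation for placing a CRF on top of a BLSTM.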

