A CRFs-Based Method for Extracting Domain Terms from Patent Documents
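  • English title: Method of extracting patent domain terms based on conditional random fields
  • Authors: 王健; 殷旭; 吕学强; 徐丽萍
  • English authors: WANG Jian; YIN Xu; LYU Xue-qiang; XU Li-ping; Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University; Beijing Research Center of Urban System Engineering
  • Keywords: Chinese patent terms; term extraction; conditional random fields; sequence labeling; new energy vehicle domain
  • English keywords: Chinese patent terminology; term extraction; CRFs; sequence labeling; new energy vehicles
  • Chinese journal name: SJSJ
  • English journal name: Computer Engineering and Design
  • Institutions: Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University; Beijing Research Center of Urban System Engineering
  • Publication date: 2019-01-16
  • Publisher: 计算机工程与设计 (Computer Engineering and Design)
  • Year: 2019
  • Issue: v.40; No.385
  • Funding: National Natural Science Foundation of China (61671070); Beijing Advanced Innovation Center for Imaging Technology fund (BAICIT-2016003); National Social Science Foundation of China major project (14@ZH036); State Language Commission key project (ZDI135-53); State Language Commission major project (ZDA125-26)
  • Language: Chinese
  • Page: SJSJ201901046
  • Number of pages: 6
  • CN: 01
  • ISSN: 11-1775/TP
  • Classification number: 287-292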
Abstract
Based on an analysis of the characteristics of terms in Chinese patent documents from the new energy vehicle domain, a term extraction method is proposed that uses a conditional random fields (CRFs) model with character-level sequence labeling under three-position, four-position, and six-position tag sets. Taking the single character as the segmentation granularity avoids term recognition errors caused by word segmentation, and the effect of the different position tag sets on extraction performance is examined. Experimental results show that the CRFs model with six-position character tagging performs best, and that its precision, recall, and F-score exceed those of the comparison methods that use word, part-of-speech, word-length, and similar information as features, which verifies the effectiveness of the proposed method.
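The character-level labeling step described in the abstract can be illustrated with a minimal sketch. The abstract does not list the actual tags in the three-, four-, and six-position sets, so the inventories below are assumptions borrowed from common Chinese sequence-labeling practice: B/I for three-position, B/M/E/S for four-position, and B/B2/B3/M/E/S for six-position, each with an extra O tag for characters outside any term. The function name `tag_characters` and the example sentence are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def tag_characters(sentence: str,
                   terms: List[Tuple[int, int]],
                   scheme: str = "six") -> List[str]:
    """Assign one tag per character.

    `terms` holds (start, end) character spans, end exclusive.
    Tag inventories are assumed, not taken from the paper.
    """
    tags = ["O"] * len(sentence)  # O = character outside any term
    for start, end in terms:
        length = end - start
        if scheme == "three":
            # Assumed B/I tagging: first character, then continuation
            span = ["B"] + ["I"] * (length - 1)
        elif scheme == "four":
            # Assumed B/M/E/S tagging: begin, middle, end, single
            if length == 1:
                span = ["S"]
            else:
                span = ["B"] + ["M"] * (length - 2) + ["E"]
        elif scheme == "six":
            # Assumed B/B2/B3/M/E/S tagging: the second and third
            # characters of a term get their own tags
            if length == 1:
                span = ["S"]
            else:
                head = ["B", "B2", "B3"][: length - 1]
                middle = ["M"] * (length - 1 - len(head))
                span = head + middle + ["E"]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        tags[start:end] = span
    return tags


# Made-up example: two terms marked by character offsets
sentence = "一种电动汽车的电池管理系统"
terms = [(2, 6), (7, 13)]  # 电动汽车, 电池管理系统
for scheme in ("three", "four", "six"):
    print(scheme, list(zip(sentence, tag_characters(sentence, terms, scheme))))
```

Under these assumptions, a finer tag set encodes more of a character's position inside a long term. That may be one reason the six-position set works well for the multi-character terms typical of patent text, although the abstract reports only the result and not the explanation. The tagged sequences would then serve as training data for a CRFs toolkit.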
References
[1] Kaushik N, Chatterjee N. A practical approach for term and relationship extraction for automatic ontology creation from agricultural text[C]//International Conference on Information Technology. Piscataway, NJ: IEEE, 2016: 241-247.
[2] Gim J, Kim DJ, Hwang M, et al. Extracting protein terminologies in literatures[C]//Green Computing and Communications (GreenCom), IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing. Piscataway, NJ: IEEE, 2013: 2136-2140.
[3] Guo R, Qiu J, Zhang G. Web-based Chinese term extraction in the field of study[C]//11th International Conference on Semantics, Knowledge and Grids. Piscataway, NJ: IEEE, 2015: 133-139.
[4] Du L, Li X, Lin D. Chinese term extraction from web pages based on expected point-wise mutual information[C]//12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery. Piscataway, NJ: IEEE, 2016: 1647-1651.
[5] Mijangos V. Extraction of definitional contexts through machine learning[C]//26th International Workshop on Database and Expert Systems Applications. Piscataway, NJ: IEEE, 2015: 217-221.
[6] Judea A, Schütze H, Brügmann S. Unsupervised training set generation for automatic acquisition of technical terminology in patents[C]//COLING, 2014: 290-300.
[7] Nassirudin M, Purwarianti A. Indonesian-Japanese term extraction from bilingual corpora using machine learning[C]//International Conference on Advanced Computer Science and Information Systems. Piscataway, NJ: IEEE, 2015: 111-116.
[8] HE Yu, LYU Xueqiang, XU Liping. A Chinese term extraction system in new energy vehicles domain[J]. New Technology of Library and Information Service, 2015, 31(10): 88-94 (in Chinese). [何宇, 吕学强, 徐丽萍. 新能源汽车领域中文术语抽取方法[J]. 现代图书情报技术, 2015, 31(10): 88-94.]
[9] Pan HS, Zhao JY. Combining syntactic information with HMM for term extraction[C]//2nd International Conference on Information Science and Control Engineering. Piscataway, NJ: IEEE, 2015: 170-173.
[10] Zhan Q, Wang C. A hybrid strategy for Chinese domain-specific terminology extraction[C]//11th International Conference on Semantics, Knowledge and Grids. Piscataway, NJ: IEEE, 2015: 217-221.
[11] Guan A, Wang Y, Yang L. Automatic term extraction for Chinese opera domain ontology[C]//12th International Conference on Fuzzy Systems and Knowledge Discovery. Piscataway, NJ: IEEE, 2015: 1372-1376.
[12] LI Lishuang, DANG Yanzhong, ZHANG Jing, et al. Automotive term extraction based on conditional random fields[J]. Journal of Dalian University of Technology, 2013, 53(2): 267-272 (in Chinese). [李丽双, 党延忠, 张婧, 等. 基于条件随机场的汽车领域术语抽取[J]. 大连理工大学学报, 2013, 53(2): 267-272.]
[13] LIU Hui, LIU Yao. Patent term extraction based on conditional random fields[J]. Digital Library Forum, 2014(12): 46-49 (in Chinese). [刘辉, 刘耀. 基于条件随机场的专利术语抽取[J]. 数字图书馆论坛, 2014(12): 46-49.]
[14] WANG Miping, WANG Hao, DENG Sanhong, et al. Extracting Chinese metallurgy patent terms with conditional random fields[J]. New Technology of Library and Information Service, 2016, 32(6): 28-36 (in Chinese). [王密平, 王昊, 邓三鸿, 等. 基于CRFs的冶金领域中文专利术语抽取研究[J]. 现代图书情报技术, 2016, 32(6): 28-36.]
