Abstract
An analysis of the characteristics of terms in Chinese patent documents from the new energy vehicle domain motivates a term extraction method that applies a conditional random field (CRF) model to character-sequence labeling, using three-, four- and six-position tag sets in turn. Taking the single character as the segmentation granularity avoids term recognition errors caused by word segmentation, and the effect of the different position tag sets on extraction performance is examined. Experimental results show that the CRF model with six-position character tagging performs best: its precision, recall and F-measure exceed those of the baseline methods that use word, part-of-speech, word-length and similar information as features, which verifies the effectiveness of the proposed method.
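The character-level position tagging described above can be illustrated with a minimal sketch. The abstract does not list the labels in each tag set, so the three tag sets below (B/I/O, B/M/E/O and B/B2/B3/M/E/O), the helper names `term_tags` and `label_chars`, and the example sentence are all assumptions made for illustration, not the authors' actual scheme. The sketch only produces the character/tag pairs that a CRF toolkit would take as training data; it does not train a model.

```python
from typing import List, Tuple

# Hypothetical tag sets (assumption: the abstract does not name the labels).
TAG_SETS = {
    3: ("B", "I", "O"),
    4: ("B", "M", "E", "O"),
    6: ("B", "B2", "B3", "M", "E", "O"),
}


def term_tags(length: int, scheme: int) -> List[str]:
    """Return the tags for the characters of one term of the given length.

    Simplification: a single-character term is tagged "B" in every scheme.
    """
    if scheme not in TAG_SETS:
        raise ValueError(f"unsupported scheme: {scheme}")
    if length < 1:
        raise ValueError("term length must be at least 1")
    if length == 1:
        return ["B"]
    if scheme == 3:
        return ["B"] + ["I"] * (length - 1)
    if scheme == 4:
        return ["B"] + ["M"] * (length - 2) + ["E"]
    # scheme == 6: distinguish the first three characters, then M..., then E
    head = ["B", "B2", "B3"][: length - 1]
    middle = ["M"] * (length - 1 - len(head))
    return head + middle + ["E"]


def label_chars(
    sentence: str, spans: List[Tuple[int, int]], scheme: int
) -> List[Tuple[str, str]]:
    """Pair each character with its tag; spans are (start, end) with end exclusive."""
    tags = ["O"] * len(sentence)  # characters outside any term
    for start, end in spans:
        tags[start:end] = term_tags(end - start, scheme)
    return list(zip(sentence, tags))


if __name__ == "__main__":
    # Illustrative sentence with two annotated terms: 混合动力汽车 and 电池.
    sentence = "混合动力汽车的电池"
    spans = [(0, 6), (7, 9)]
    for scheme in (3, 4, 6):
        print(scheme, label_chars(sentence, spans, scheme))
```

Because every character receives a label directly, no word segmenter is involved, which is the point the abstract makes about avoiding segmentation errors. A larger tag set encodes more positional detail about where a character sits inside a term, at the cost of more label classes for the model to learn.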