Research on Biomedical Named Entity Recognition Algorithm in PU Scene (PU场景下的生物医学命名实体识别算法研究)
  • English title: Research on Biomedical Named Entity Recognition Algorithm in PU Scene
  • Authors: 高冰涛 (GAO Bing-tao); 翟振刚 (ZHAI Zhen-gang); 刘斌 (LIU Bin)
  • English authors: GAO Bing-tao; ZHAI Zhen-gang; LIU Bin; No.36 Research Institute of CETC; College of Information Engineering, Northwest A&F University
  • Keywords: positive-unlabeled learning (正例未标注学习); hidden Markov model (隐马尔科夫模型); named entity recognition (命名实体识别); text mining (文本挖掘)
  • English keywords: positive unlabeled learning; hidden Markov model; named entity recognition; text mining
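  • Chinese journal title: TLAA
  • English journal title: Technology of IoT & AI
  • Affiliations: No.36 Research Institute of CETC (中国电子科技集团公司第三十六研究所); College of Information Engineering, Northwest A&F University (西北农林科技大学信息工程学院)
  • Publication date: 2019-01-18
  • Publisher: 智能物联技术 (Technology of IoT & AI)
  • Year: 2019
  • Issue: v.51; No.345
  • Funding: Natural Science Foundation of Shaanxi Province (2017JM6059); Fundamental Research Funds for the Central Universities (2452016081); China Postdoctoral Science Foundation (2017M613216); Shaanxi Postdoctoral Science Foundation (2016BSHEDZZ121)
  • Language: Chinese
  • Page: TLAA201901004
  • Page count: 8
  • CN: 01
  • ISSN: 33-1411/TP
  • Classification number: 26-32+51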
Abstract
Traditional biomedical named entity recognition (NER) methods require large amounts of labeled data, but annotation is expensive in practice. To reduce the need for labeled samples, this paper recasts biomedical NER as an NER problem in the positive and unlabeled (PU) learning setting, using the two-step approach from PU learning. In the first step, the 1-DNF, Spy, naive Bayes (NB), and Rocchio algorithms are each used to extract strong negative examples from the unlabeled data. A hidden Markov model (HMM) is then built from the existing positive examples and the extracted strong negatives, and finally used to recognize named entities in the data to be classified. Experiments on the GENIA corpus show that, when little labeled data is available, the classifier built with the two-step PU learning method performs significantly better than a classifier built directly from the labeled data alone, while also reducing the cost of manual annotation.
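As a rough illustration of the first step described in the abstract, the following Python sketch shows how the 1-DNF technique is commonly formulated for extracting strong negative examples. It is a minimal sketch, not the authors' implementation: the function name one_dnf_reliable_negatives, the representation of each example as a list of tokens, and the toy data are all assumptions made here for illustration, and the features actually used in the paper are not stated in this record.

```python
from collections import Counter


def one_dnf_reliable_negatives(positive_docs, unlabeled_docs):
    """Hypothetical helper: 1-DNF style extraction of strong negatives.

    Each document is assumed to be a list of tokens (an illustrative choice).
    Returns (positive_features, reliable_negatives).
    """
    if not positive_docs or not unlabeled_docs:
        raise ValueError("both document sets must be non-empty")

    # Document frequency: in how many documents each token occurs.
    pos_df = Counter(tok for doc in positive_docs for tok in set(doc))
    unl_df = Counter(tok for doc in unlabeled_docs for tok in set(doc))
    n_pos, n_unl = len(positive_docs), len(unlabeled_docs)

    # A token is a "positive feature" if it is relatively more frequent
    # in the positive set than in the unlabeled set.
    positive_features = {
        tok for tok, count in pos_df.items()
        if count / n_pos > unl_df[tok] / n_unl
    }

    # An unlabeled document containing no positive feature at all is
    # treated as a strong (reliable) negative example.
    reliable_negatives = [
        doc for doc in unlabeled_docs
        if positive_features.isdisjoint(doc)
    ]
    return positive_features, reliable_negatives


# Toy data, invented for illustration only.
positive = [["p53", "protein"], ["kinase", "protein"]]
unlabeled = [
    ["the", "protein", "binds"],
    ["we", "observed", "the", "result"],
    ["kinase", "activity"],
    ["in", "the", "study"],
]
features, negatives = one_dnf_reliable_negatives(positive, unlabeled)
print(sorted(features))
print(negatives)
```

In a two-step pipeline of the kind the abstract describes, the extracted negatives would then be combined with the labeled positives to train the second-step classifier (an HMM in this paper). The Spy, NB, and Rocchio variants differ only in how this first-step selection is made.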
