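Named Entity Recognition in Chinese Electronic Medical Records Based on Multi-Feature Integration (基于多特征融合的中文电子病历命名实体识别)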
  • English title: Named entity recognition in Chinese electronic medical records based on multi-feature integration
  • Authors: YU Nan (于楠); WANG Pu (王普); WENG Zhuang (翁壮); FANG Liying (方丽英)
  • Affiliations: Faculty of Information Technology, Beijing University of Technology; Beijing Laboratory for Urban Mass Transit; Engineering Research Center of Digital Community, Ministry of Education; Beijing Key Laboratory of Computational Intelligence and Intelligent System
  • Keywords: electronic medical records (EMR); multi-feature integration; conditional random field model; named entity recognition
  • Journal code: BJSC
  • Journal: Beijing Biomedical Engineering
  • Publication date: 2018-06-12 15:26
  • Publisher: Beijing Biomedical Engineering
  • Year: 2018
  • Issue: v.37
  • Funding: Supported by the General Program of the Beijing Municipal Education Commission Scientific Research Plan (KM201410005004)
  • Language: Chinese
  • Page: BJSC201803011
  • Number of pages: 7
  • CN: 03
  • ISSN: 11-2261/R
  • Classification number: 63-68+108
Abstract
Objective: For the unstructured parts (diagnosis and patient condition) of the electronic medical records (EMR) of a tertiary Grade A hospital, to build a conditional random field (CRF) model with multi-feature integration that automatically identifies the diseases and symptoms described in natural language, so that EMR information can be stored in structured form and used for information mining and statistical analysis. Methods: The manually annotated corpus was divided into a training set and a test set; the text was segmented with the NLPIR tool, and experiments were run with the CRF++ tool. Given the characteristics of Chinese EMR data, basic features and their corresponding feature templates were selected first, and the context window size was determined by comparing experiments with different windows. Guide-word features and word-formation-structure features were then added separately, and the effects of these two advanced features on the results were compared. Results: With basic features only, recognition was best at a context window of 7. After the advanced features were added, the final F-measure reached 92.80% for disease entities and 94.17% for symptom entities. Conclusions: A CRF model that integrates multiple effective features can recognize disease and symptom entities in EMR well. The study is of significance for named entity recognition in EMR.
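As a rough illustration of the context-window experiments mentioned in the abstract, the sketch below generates unigram feature-template lines in the CRF++ template syntax (`U<id>:%x[row,col]`) for a symmetric window. It is not the authors' template: the function name `crfpp_unigram_template` is hypothetical, and it assumes that "window of 7" means the current token plus three tokens on each side, with the token in column 0 of the training file.

```python
from typing import List


def crfpp_unigram_template(window_size: int, column: int = 0) -> List[str]:
    """Build CRF++ unigram template lines for a symmetric context window.

    Assumption: window_size counts the current token plus an equal number
    of neighbours on each side, so it must be a positive odd number.
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd number")
    half = window_size // 2
    lines = []
    # One template line per relative offset, e.g. U00:%x[-3,0] ... U06:%x[3,0]
    for i, offset in enumerate(range(-half, half + 1)):
        lines.append(f"U{i:02d}:%x[{offset},{column}]")
    return lines


if __name__ == "__main__":
    template = crfpp_unigram_template(7)
    template.append("B")  # CRF++ bigram macro over adjacent output labels
    print("\n".join(template))
```

Changing the argument to 3 or 5 would produce the narrower windows that such a comparison would be run against; the advanced features described in the abstract would occupy additional columns, whose layout is not given in this record.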
