Chinese electronic medical record named entity recognition based on a sentence-level Lattice-long short-term memory neural network

Authors: PAN Cui-ran (潘璀然); WANG Qing-hua (王青华); TANG Bu-zhou (汤步洲); JIANG Lei (姜磊); HUANG Xun (黄勋); WANG Li (王理)
Affiliations: Department of Medical Informatics, School of Medicine, Nantong University; College of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; Department of Rheumatology and Immunology, Changzheng Hospital, Naval Medical University (Second Military Medical University); Department of Communication Engineering, School of Information Science and Technology, Nantong University
Keywords: computerized medical record systems; Chinese electronic medical record; entity recognition; conditional random field; bi-directional long short-term memory neural network; lattice-long short-term memory neural network
Chinese journal name: DEJD
English journal name: Academic Journal of Second Military Medical University
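Publication date: 2019-05-20
Publisher: Academic Journal of Second Military Medical University (第二军医大学学报)
Year: 2019
Issue: v.40; No.357
Funding: National Key Research and Development Program of China (2018YFC0116902); National Natural Science Foundation of China (81873915); Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX17-1932)
Language: Chinese
Page: DEJD201905006
Page count: 10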
CN：05
ISSN：31-1001/R
Classification number: 38-47

Abstract

Objective: To propose a conditional random field (CRF) model based on Re-entity, a new word segmentation method, and to compare it with the bi-directional long short-term memory neural network (BiLSTM)-CRF and Lattice-long short-term memory neural network (Lattice-LSTM) models.
Methods: After comparing existing entity recognition methods and models, we proposed Re-entity-based CRF, BiLSTM-CRF and Lattice-LSTM methods for task one of the 2018 China Conference on Knowledge Graph and Semantic Computing (CCKS2018), "Chinese clinical named entity recognition", and trained character vector sets at different parameter levels on different corpora. Each method was introduced into the neural network models for comparative experiments on model performance. Finally, sentence-level and document-level inputs were compared.
Results: With the best feature engineering already in place, introducing the Re-entity method further improved the performance of the CRF model. The sentence-level Lattice-LSTM model achieved a strict F1-measure of 89.75% on this task, higher than the best result (89.25%) for task one of CCKS2018.
Conclusion: The Re-entity-based CRF model can use a normalized Chinese clinical drug knowledge base to effectively improve the recognition rate of drugs in electronic medical records. The Re-entity method can reduce the error accumulation caused by word segmentation during data preprocessing. The Lattice structure can better combine the latent semantic information of character and word sequences, and sentence-level input can effectively improve the recognition accuracy of neural network models.
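
The abstract relies on three ideas that a small example may make more concrete: sentence-level input, the word candidates a Lattice structure adds on top of the character sequence, and the strict F1-measure. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. The function names, the punctuation set used for splitting, the tiny lexicon, and the entity labels `drug` and `symptom` are all hypothetical. "Strict" is taken here to mean that a predicted entity counts only if its start, end, and type all match the gold annotation exactly.

```python
import re


def split_sentences(text):
    """Split a record into sentences, keeping each terminator with its sentence.

    Assumption: sentences end at one of the punctuation marks below.
    """
    return re.findall(r"[^。！？；]+[。！？；]?", text)


def lattice_words(sentence, lexicon, max_len=4):
    """Return (start, end, word) for every multi-character lexicon word found.

    These spans are the extra word paths a lattice adds alongside the
    character sequence; `end` is exclusive.
    """
    spans = []
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 2, last_end + 1):
            word = sentence[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


def strict_f1(gold, predicted):
    """Entity-level F1 where only exact (start, end, type) matches count."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical record and lexicon, for illustration only.
record = "患者发热。给予阿司匹林。"
sentences = split_sentences(record)
lexicon = {"给予", "阿司匹林", "匹林"}
spans = lattice_words(sentences[1], lexicon)

# One exact match and one boundary error out of two entities each.
gold = [(2, 6, "drug"), (10, 12, "symptom")]
pred = [(2, 6, "drug"), (10, 13, "symptom")]
score = strict_f1(gold, pred)
```

In this sketch the boundary error on the second entity earns no partial credit, which is what separates a strict score from a relaxed one. Note also that the overlapping candidates `阿司匹林` and `匹林` are both kept: a lattice leaves the choice among them to the model rather than committing to a single segmentation up front.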

