Word Segmentation and Entity Recognition in Chinese Electronic Medical Records

English title: Healthcare Data Mining: Word Segmentation and Named Entity Recognition in Chinese Electronic Medical Record
Authors: 王若佳; 赵常煜; 王继民
Authors (English): Wang Ruojia; Cho Sang Wouk; Wang Jimin; Department of Information Management, Peking University; Institute of Ocean Research, Peking University
Keywords: electronic medical record; Chinese word segmentation; entity recognition; healthcare big data; AC automaton; conditional random field
English keywords: healthcare data mining; electronic medical record; Chinese word segmentation; named entity recognition; AC automaton; conditional random field
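Journal (Chinese): TSQB
Journal (English): Library and Information Service
Affiliations: Department of Information Management, Peking University; Institute of Ocean Research, Peking University
Publication date: 2019-01-20
Publisher: 图书情报工作 (Library and Information Service)
Year: 2019
Issue: v.63; No.615
Language: Chinese
Page: TSQB201902007
Page count: 9
CN: 02
ISSN: 11-1541/G2
Classification number: 35-43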

Abstract

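[Purpose/significance] Healthcare big data is an important basic strategic resource for China. Through discussion and empirical study of word segmentation and entity recognition in Chinese electronic medical records, this study accomplishes the information extraction task for medical data well, which is significant for the future application and development of healthcare big data at the semantic level. [Method/process] The study first builds a medical vocabulary on the order of 100,000 terms by merging authoritative vocabularies, official standards, health website data, and other supplementary medical lexicons. It then segments the fields of electronic medical records and compares the segmentation performance of four models: the jieba tool, jieba with an imported dictionary, unsupervised learning, and the AC automaton. Finally, using the automatic segmentation and manual annotation results as the corpus, it performs entity recognition on electronic medical records with conditional random fields, compares recognition performance across entity categories and text features, and selects the best template. [Result/conclusion] The segmentation results show that the AC automaton performs best, with an F-measure of up to 82%. The entity recognition results show that the "test" and "disease" entities are recognized best, while recognition of the "symptom" entity is less satisfactory.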
[Purpose/significance] Healthcare big data is an important basic strategic resource in China. Word segmentation and entity recognition for Chinese electronic medical records (EMRs) help extract important information from large volumes of unstructured text. [Method/process] This study first builds a Chinese medical thesaurus from authoritative medical subject headings, official standards, and health website data; it then compares the performance of four segmentation methods on a corpus of manually segmented and annotated text; finally, a CRF model is used to identify five entity types: disease, symptom, test, drug, and treatment. [Result/conclusion] The results show that (i) the AC automaton model achieves the best F-measure in EMR word segmentation, at 82%; (ii) medical entities are harder to identify in traditional Chinese medicine records than in Western medical records. In addition, the "Test" and "Disease" entities achieve better F-measures, while the F-measure for the "Symptom" entity is less satisfactory.
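
To make the AC automaton approach named in the abstract more concrete, the following is a minimal sketch of dictionary-driven segmentation with an Aho-Corasick automaton. This record does not show the authors' implementation, so everything here is an assumption for illustration: the class name `ACAutomaton`, the function `segment`, the greedy longest-match rule for resolving overlapping hits, the fallback of emitting unmatched characters one by one, and the tiny three-word lexicon.

```python
from collections import deque


class ACAutomaton:
    """Aho-Corasick automaton over a word list (illustrative, not the paper's code)."""

    def __init__(self, words):
        self.children = [{}]   # children[node] maps a character to a child node id
        self.fail = [0]        # failure link of each node (root is node 0)
        self.out = [[]]        # lengths of dictionary words ending at each node
        for word in words:
            if word:           # skip empty strings
                self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.out.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.out[node].append(len(word))

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit the matches reachable through the failure link.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return (start, end) spans of every dictionary word found in text."""
        node, spans = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for length in self.out[node]:
                spans.append((i - length + 1, i + 1))
        return spans


def segment(text, automaton):
    """Greedy left-to-right segmentation: take the longest dictionary word
    starting at each position, otherwise emit a single character."""
    longest = {}
    for start, end in automaton.find_all(text):
        if end > longest.get(start, start):
            longest[start] = end
    tokens, i = [], 0
    while i < len(text):
        end = longest.get(i, i + 1)
        tokens.append(text[i:end])
        i = end
    return tokens


# Hypothetical mini-lexicon; the study's vocabulary has roughly 100,000 terms.
lexicon = ["高血压", "血压", "头痛"]
tokens = segment("患者高血压伴头痛", ACAutomaton(lexicon))
```

The automaton finds all dictionary hits in a single pass over the text, which is what makes this family of methods practical with a large vocabulary. How overlapping hits are resolved is a separate design choice, and the longest-match rule used here is only one option.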

