Deep Network Model for Text Named Entity Recognition

Original title: 面向文本命名实体识别的深层网络模型
Authors: 李慧林; 柴玉梅; 孙穆祯
Authors (English): LI Hui-lin; CHAI Yu-mei; SUN Mu-zhen (School of Information Engineering, Zhengzhou University; School of Public Administration, Huazhong University of Science and Technology)
Keywords: named entity recognition; neural network; conditional random field; data mining
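Journal (Chinese): XXWX
Journal (English): Journal of Chinese Computer Systems
Institutions: School of Information Engineering, Zhengzhou University; School of Public Administration, Huazhong University of Science and Technology
Publication date: 2019-01-15
Publisher: Journal of Chinese Computer Systems (小型微型计算机系统)
Year: 2019
Volume: 40
Funding: Supported by the National Natural Science Foundation of China (U1636111)
Language: Chinese
Page: XXWX201901011
Page count: 8
CN: 01
ISSN: 21-1106/TP
Classification number: 52-59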

Abstract

Named entity recognition (NER) in text is a basic and key task in information extraction and prediction. This paper proposes an NER method based on deep network models and builds several learning models. First, the text is cleaned and normalized to produce a basic structure and representation; a deep conditional random field model is then built with boundary features, and the optimal feature set is selected for training. Next, the text is represented as word vectors, which serve as the input for training deep neural networks. On this basis, two block-representation-based deep network models for NER are proposed: BR-BiRNN and BR-BiLSTM-CRF. In experiments on the I2B2 2006 and 2014 evaluation datasets and on real obstetrics and gynecology medical text, the results all show higher F values than the traditional SVM, HMM, and CRF methods.
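
The F value used in the comparison above is conventionally computed at the entity level from precision and recall. The Python sketch below shows one common way to do this for BIO-tagged sequences. It is an illustration under stated assumptions, not the authors' evaluation script: the function names (`bio_to_spans`, `f_score`) and the example labels (`DOCTOR`, `DATE`) are hypothetical, and exact-match scoring of entity spans is assumed.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    Assumption: a stray I- tag with no matching B- tag starts a new entity.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((start, i, label))  # close the open entity
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]  # open a new entity
            else:
                start, label = None, None
    return spans


def f_score(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) with exact span and type matching."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(gold_spans & pred_spans)  # entities predicted exactly right
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["B-DOCTOR", "I-DOCTOR", "O", "B-DATE", "O"]
    pred = ["B-DOCTOR", "I-DOCTOR", "O", "O", "B-DATE"]
    print(f_score(gold, pred))
```

In this toy example one of the two predicted entities matches a gold entity exactly, so precision, recall, and F1 are all 0.5. Scoring whole spans rather than individual tokens penalizes boundary errors, which is why boundary features matter for this task.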

References

[1] Aronson A R,Lang F. An overview of MetaMap:historical perspective and recent advances[J]. Journal of the American Medical Informatics Association,2010,17(3):229-236.
[2] Chen T,Cullen R,Godwin M. Hidden Markov model using Dirichlet Process for de-identification[J]. Journal of Biomedical Informatics,2015,58S:S60-S66.
[3] Zuccon G,Kotzur D,Nguyen A,et al. De-identification of health records using Anonym:effectiveness and robustness across datasets[J]. Artificial Intelligence in Medicine,2014,61(3):145-151.
[4] Liu Z,Chen Y,Tang B,et al. Automatic de-identification of electronic medical records using token-level and character-level conditional random fields[J]. Journal of Biomedical Informatics,2015,58:47-52.
[5] Lample G,Ballesteros M,Subramanian S,et al. Neural architectures for named entity recognition[C]. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies,2016:260-270.
[6] Collobert R,Weston J,Bottou L,et al. Natural language processing (almost) from scratch[J]. Journal of Machine Learning Research,2011,12(1):2493-2537.
[7] Habibi M,Weber L,Neves M,et al. Deep learning with word embeddings improves biomedical named entity recognition[J]. Bioinformatics,2017,33(14):i37-i48.
[8] Sweeney L. Replacing personally-identifying information in medical records,the Scrub system[C]. Proceedings:A conference of the American Medical Informatics Association,AMIA Fall Symposium,American Medical Informatics Association,1996:333-337.
[9] Yang H. Automatic extraction of medication information from medical discharge summaries[J]. Journal of the American Medical Informatics Association,2010,17(5):545-548.
[10] Guo Y,Gaizauskas R,Roberts I,et al. Identifying personal health information using support vector machines[C]. I2B2 Workshop on Challenges in Natural Language Processing for Clinical Data,2006:10-11.
[11] Szarvas G,Farkas R,Busa-Fekete R. State-of-the-art anonymization of medical records using an iterative machine learning framework[J]. Journal of the American Medical Informatics Association,2007,14(5):574-580.
[12] Lafferty J D,McCallum A,Pereira F C N. Conditional random fields:probabilistic models for segmenting and labeling sequence data[C]. Eighteenth International Conference on Machine Learning,Morgan Kaufmann Publishers Inc,2001:282-289.
[13] Wellner B,Huyck M,Mardis S,et al. Rapidly retargetable approaches to de-identification in medical records[J]. Journal of the American Medical Informatics Association,2007,14(5):564-573.
[14] Yang Jin-feng,Guan Yi,He Bin,et al. Corpus construction for named entities and entity relations on Chinese electronic medical records[J]. Journal of Software,2016,27(11):2725-2746.
[15] Wang Hong-bin,Shen Qiang,Xian Yan-tuan. Research on Chinese named entity recognition fusing transfer learning[J]. Journal of Chinese Computer Systems,2017,38(2):346-351.
[16] Li Li-shuang,He Hong-lei,Liu Shan-shan,et al. Research of word representations on biomedical named entity recognition[J]. Journal of Chinese Computer Systems,2016,37(2):302-307.
[17] Chiu J P C,Nichols E. Named entity recognition with bidirectional LSTM-CNNs[J]. Transactions of the Association for Computational Linguistics,2016,4:357-370.
[18] Ma X,Hovy E. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF[C]. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics,2016:1064-1074.
[19] Dernoncourt F,Lee J Y,Szolovits P. NeuroNER:an easy-to-use program for named-entity recognition based on neural networks[C]. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing:System Demonstrations,2017:97-102.
[20] Peters M,Ammar W,Bhagavatula C,et al. Semi-supervised sequence tagging with bidirectional language models[C]. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics,2017:1756-1765.
[21] Rei M,Crichton G,Pyysalo S. Attending to characters in neural sequence labeling models[C]. Proceedings of COLING 2016,the 26th International Conference on Computational Linguistics:Technical Papers,2016:309-318.
