Approach of Structured Information Extraction for Medical Text Data

Original title (Chinese): 一种面向医学文本数据的结构化信息抽取方法
Authors: YANG Bing (杨兵); NIE Tie-zheng (聂铁铮); SHEN De-rong (申德荣); KOU Yue (寇月); YU Ge (于戈)
Affiliation: School of Computer Science and Engineering, Northeastern University (东北大学计算机科学与工程学院)
Keywords: structured information extraction; text clustering; keyword extraction; semantic dependency
Chinese journal title: XXWX
English journal title: Journal of Chinese Computer Systems
Publication date: 2019-07-15
Publisher: Journal of Chinese Computer Systems (小型微型计算机系统)
Year: 2019
Volume: 40
Funding: National Key Research and Development Program of China (2018YFB1003404); National Natural Science Foundation of China (61672142, 61402213, U1435216); Fundamental Research Funds for the Central Universities (N150408001-3, N150404013)
Language: Chinese
File name: XXWX201907024
Page count: 7
Issue: 07
CN: 21-1106/TP
Pages: 121-127

Abstract

Medical texts are an important information carrier in the medical field and provide important data support for clinical diagnosis and pathological research. However, text written in natural language is usually unstructured, which makes it hard for machines to understand and process automatically. Chinese medical texts pose a particular challenge for structured information extraction: they are highly specialized, require extensive domain knowledge, and are written largely in short sentences. This paper therefore proposes a method for extracting structured information from medical text data. The method first uses text clustering and keyword extraction to obtain the terms commonly used in medical descriptions, and then uses the resulting medical term lexicon to assist Chinese word segmentation, improving segmentation quality on Chinese medical texts. Next, it analyzes the semantic dependencies between words and builds a dependency syntax tree. Finally, it identifies and extracts the key indicators and their corresponding values from the tree, producing structured key-value data. Experiments on real medical imaging report texts show that the method effectively improves the segmentation quality of Chinese medical texts, with accuracy of up to 98.24%, and performs well on structured information extraction, reaching up to 83.76% accuracy and 88.09% recall. The method covers a variety of dependency grammar patterns and is broadly applicable.
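The final step described in the abstract, reading indicator/value pairs off a dependency tree, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Token` class, the `extract_pairs` function, the single extraction rule, the label names (`SBV`, `ATT`, `HED`), and the toy clause with its hand-written parse are all assumptions made for illustration. A real pipeline would obtain the parse from a dependency parser after lexicon-assisted segmentation.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int   # 1-based position in the clause
    word: str  # surface form produced by word segmentation
    head: int  # index of the governing token, 0 for the root
    rel: str   # dependency label (illustrative label set, an assumption)


def extract_pairs(tokens):
    """Return (indicator, value) pairs from one parsed clause.

    Assumed rule: a token attached as a subject (SBV) names an indicator,
    and the token that governs it is the indicator's value. Attributive
    modifiers (ATT) of the subject are prepended to form the full key.
    """
    by_idx = {t.idx: t for t in tokens}
    pairs = []
    for t in tokens:
        if t.rel == "SBV" and t.head in by_idx:
            # Collect the subject's attributive modifiers in clause order.
            mods = [m.word for m in tokens if m.head == t.idx and m.rel == "ATT"]
            key = "".join(mods + [t.word])
            pairs.append((key, by_idx[t.head].word))
    return pairs


# Hypothetical short clause: 肝脏 (liver) 大小 (size) 正常 (normal).
# The heads and labels below are written by hand, not parser output.
clause = [
    Token(1, "肝脏", 2, "ATT"),
    Token(2, "大小", 3, "SBV"),
    Token(3, "正常", 0, "HED"),
]

print(extract_pairs(clause))
```

A single rule keeps the sketch short. Covering the variety of dependency patterns mentioned in the abstract would need further rules, for example for values attached as objects or for coordinated indicators.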

