Discussion on the extraction and standardization of TCM symptom information based on the maximum probability method
  • English title: Discussion on the extraction and standardization of TCM symptom based on maximum probability method
  • Authors: LIANG Li-keng (梁礼铿); LI Jing-bo (黎敬波)
  • Affiliation: Guangzhou University of Chinese Medicine (广州中医药大学)
  • Keywords: Symptom; Text mining; Text data structuring; Chinese word segmentation; Maximum probability method; Standardization
  • Journal: China Journal of Traditional Chinese Medicine and Pharmacy (中华中医药杂志)
  • Journal code: BXYY
  • Publication date: 2017-05-01
  • Publisher: China Journal of Traditional Chinese Medicine and Pharmacy (中华中医药杂志)
  • Year: 2017
  • Volume: 32
  • Issue: 05
  • Funding: Doctoral Program Fund of the Ministry of Education (No. 20114425110009)
  • Language: Chinese
  • Record ID: BXYY201705060
  • Pages: 275-278
  • Page count: 4
  • CN: 11-5334/R
Abstract
Objective: To explore the extraction and standardization of traditional Chinese medicine (TCM) symptom information by comparing two symptom extraction programs based on the maximum probability method. Methods: All data were analyzed and processed in R 3.3.2. Diagnostics, Diagnostics of Traditional Chinese Medicine, and 1,000 annotated inpatient pneumonia records were used to build a symptom standardization database, a symptom description lexicon, and a keyword-adjective lexicon. Based on the maximum probability method, three programs were designed: a Chinese word segmentation program (CSP), a direct extraction program (DEP), and a combination extraction program (CEP). The three programs were used to extract and standardize symptom information from 2,311 pneumonia medical records and were compared on the number of dimensions generated, the amount of manual processing required, and symptom extraction performance. Results: Compared with CSP, both DEP and CEP effectively reduced the number of dimensions. CEP required a smaller percentage of manual processing and gave better symptom extraction performance. Conclusion: CEP based on the maximum probability method can effectively extract TCM symptom information.
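The abstract does not describe how the maximum probability method was implemented, so the following is only a minimal sketch of the general technique, assuming a unigram model: among all ways of splitting a string into lexicon words, pick the split whose word probabilities have the largest product. The study itself was carried out in R; Python is used here purely for illustration. The lexicon, its probabilities, the unknown-character penalty, and the function name `segment` are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy lexicon: word -> relative frequency (illustrative values only,
# not the symptom lexicons built in the study).
LEXICON = {
    "咳嗽": 0.30,
    "咳": 0.05,
    "嗽": 0.01,
    "咳痰": 0.20,
    "痰": 0.10,
    "黄": 0.08,
    "痰黄": 0.15,
    "发热": 0.25,
}
# Assumed penalty for a single character that is not in the lexicon.
UNKNOWN_LOG_PROB = math.log(1e-8)
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def segment(text):
    """Return the split of `text` that maximizes the product of word probabilities."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in LEXICON:
                score = math.log(LEXICON[word])
            elif end - start == 1:
                score = UNKNOWN_LOG_PROB  # fall back to a single unknown character
            else:
                continue  # multi-character strings outside the lexicon are not words
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


print(segment("咳痰黄"))
print(segment("咳嗽咳痰痰黄发热"))
```

With these toy values, "咳痰黄" can be read as 咳痰 + 黄 (0.20 × 0.08) or as 咳 + 痰黄 (0.05 × 0.15), and the sketch keeps the first because its product is larger. Working in log-probabilities avoids numeric underflow on long clinical notes. A real system would estimate the probabilities from an annotated corpus rather than assign them by hand.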
