An Automatic Disease-Diagnosis System Based on Chinese Text Classification
Abstract
This thesis studies an automatic disease-diagnosis system based on text-classification technology. Existing medical records document the relationship between disease symptoms and disease categories; a classifier built with machine-learning methods can learn the regularities linking the two. Given a new case record, the learned knowledge can be applied to analyze the symptoms and predict the category of the patient's disease, thereby automating the diagnostic process.
    The massive volume of text accumulated by medical institutions provides a valuable data resource for information-processing research in the medical domain. Applying natural language processing (NLP) to medical information is becoming an emerging focus of both research and application. Analyzing existing electronic medical records makes it possible to predict the population distribution, common characteristics, and development trends of various diseases, which helps improve the quality and efficiency of medical care. Research on medical records based on NLP therefore has both theoretical and practical value. For Chinese records, building an automatic diagnosis system requires solving several main problems: organizing the electronic records, word segmentation, and constructing the classifier. This thesis is organized around these problems.
    First, the electronic medical records are organized; this is the text-collection stage. The system uses discharged patients' records as raw data. Because an electronic record already contains the symptoms, diagnosis, and treatment of a disease, it serves as manually classified training data, i.e., the learning corpus. Since the quality of this corpus directly determines whether the system can be realized, the records must be preprocessed and stored in a form convenient for computer processing. To this end, a record-generation and management subsystem was built to ensure accurate and efficient data collection; it serves as the support system for the diagnosis system.
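As a sketch of what the collected learning corpus might look like, the following pairs each record's symptom text with its discharge diagnosis as the class label. The field names, the `load_records` helper, and the tab-separated format are illustrative assumptions, not the thesis's actual storage scheme.

```python
# A minimal sketch of organizing discharged-patient records as labeled
# training data. The fields and tab-separated format are assumptions.
from dataclasses import dataclass

@dataclass
class CaseRecord:
    symptoms: str   # free-text symptom description from the record
    diagnosis: str  # discharge diagnosis, used as the class label

def load_records(lines):
    """Parse 'symptoms<TAB>diagnosis' lines into labeled records,
    skipping malformed ones."""
    records = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and all(parts):
            records.append(CaseRecord(*parts))
    return records

recs = load_records(["头痛发热三天\t感冒", "bad line"])
print([(r.symptoms, r.diagnosis) for r in recs])  # → [('头痛发热三天', '感冒')]
```

Keeping the label separate from the free text at this stage means the later segmentation and classification steps never have to re-parse the raw record.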
    Next, the electronic records are processed, starting from automatic Chinese word segmentation. In natural language understanding, the word is the smallest meaningful processing unit. The task of Chinese segmentation is to convert a character string without delimiters, i.e., without word boundaries, into a word sequence that matches actual language use; in other words, to establish word boundaries in written Chinese. Automatic segmentation is the first step of any Chinese NLP system, and its role is crucial: only after crossing this obstacle can a Chinese-processing system be said to carry an initial mark of "intelligence". This thesis surveys several segmentation techniques in current use, including maximum matching, improved maximum matching, and full segmentation. The system preprocesses the records with an integrated word-segmentation and part-of-speech tagging method; experiments show that its accuracy is higher than that of segmentation alone.
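Of the surveyed techniques, forward maximum matching is the simplest to illustrate: at each position, greedily take the longest dictionary word. The toy dictionary below is an assumption for demonstration; a real system would use a lexicon covering medical vocabulary.

```python
# A minimal sketch of forward maximum matching (MM) segmentation.
# The dictionary is a toy example, not the thesis's actual lexicon.
DICT = {"病人", "头痛", "发热", "咳嗽", "三天"}
MAX_LEN = max(len(w) for w in DICT)

def mm_segment(text: str) -> list[str]:
    """Greedily match the longest dictionary word at each position;
    unmatched characters fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in DICT:
                tokens.append(cand)
                i += n
                break
    return tokens

print(mm_segment("病人头痛发热三天"))  # → ['病人', '头痛', '发热', '三天']
```

MM's greedy choice is exactly what produces the crossing-ambiguity errors that the improved variants and the integrated segmentation-and-tagging method aim to reduce.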
    Finally, a Bayesian algorithm is used to learn from the training texts and build a text classifier for the medical domain, achieving the automatic-diagnosis goal of this thesis. The Bayesian classification algorithm is a widely used method: it classifies well, is simple and efficient, improves in quality with a larger training corpus, and can be further refined, for example by correcting its results with methods based on the vector space model. The algorithm makes an assumption about the probability distribution: all attribute values of a text are mutually independent given the class. A concrete model embodying this assumption is trained on a large amount of labeled text to estimate the model parameters; a test text is then assigned to the class most likely to have generated it. Only by building a knowledge base with the medical records as the knowledge source, and training it according to the disease categories, can the corresponding classifier be constructed to classify new records and generate diagnoses.
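The training and decision steps just described can be sketched as a multinomial naive Bayes classifier. The class names, tokens, and Laplace smoothing below are illustrative assumptions; the thesis does not specify these implementation details.

```python
# A minimal multinomial naive Bayes sketch with Laplace smoothing,
# assuming records have already been segmented into token lists.
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.prior = Counter(labels)          # class counts for P(c)
        self.word = defaultdict(Counter)      # per-class word counts
        for toks, c in zip(docs, labels):
            self.word[c].update(toks)
        self.vocab = {w for cnt in self.word.values() for w in cnt}
        return self

    def predict(self, toks):
        n = sum(self.prior.values())
        best, best_lp = None, -math.inf
        for c in self.prior:
            total = sum(self.word[c].values())
            lp = math.log(self.prior[c] / n)
            for w in toks:  # independence assumption: P(d|c) = prod P(w|c)
                lp += math.log((self.word[c][w] + 1) /
                               (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

docs = [["头痛", "发热"], ["咳嗽", "咽痛"], ["发热", "咳嗽"]]
labels = ["感冒", "咽炎", "感冒"]
clf = NaiveBayes().fit(docs, labels)
print(clf.predict(["发热", "头痛"]))  # → 感冒
```

The add-one smoothing keeps unseen words from zeroing out a class's probability, which matters for short symptom descriptions with rare terms.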
    The project uses a statistics-based information-extraction approach, which avoids the knowledge-acquisition bottleneck of knowledge-based expert systems; moreover, because the knowledge comes from real medical records, it is objective and consistent. Experiments show that the system has practical value and can assist diagnosis. The system is also highly portable and can be extended to other application domains: by extending the knowledge source, retraining the classifier, and applying suitable preprocessing, the approach can process texts in other fields. The results indicate that an automatic diagnosis system based on text classification is broadly applicable and has potential for further development.
This thesis realizes an automatic diagnosis system based on text classification. Medical records annotate the relationship between disease phenomena and disease categories. A classifier encoding this relationship can be trained from the annotated texts with machine-learning techniques; it then predicts the category of a disease by analyzing its phenomena, thereby realizing automatic diagnosis.
    Large-scale text collections are valuable for information processing in the medical domain. Using natural language processing to handle electronic patient records has recently become a hot topic in both research and application. It helps forecast the distribution of patients and the development trends of various diseases, and it is an efficient way to improve the quality and efficiency of treatment; using NLP to process medical information therefore has both theoretical meaning and practical worth. To build the automatic diagnosis system, the key problems to solve are organizing electronic patient records, word segmentation, and building the classifier.
    First, we organize patients' records for the system; in effect, this is the process of gathering training data. The system takes discharged patients' records as raw data. An electronic patient record includes the symptoms, diagnosis, and treatment of a disease. Since the quality of the training data determines whether the automatic diagnosis system can be realized, the records are preprocessed and stored in a form that is easy to process. This thesis builds a subsystem to store and manage patient records accurately and efficiently.
    Second, we realize automatic word segmentation of Chinese text. The word is the smallest processing unit in natural language understanding, and segmentation is the first step of any Chinese NLP system, so its role is very important. Only after overcoming this obstacle can a processing system be said to show initial "intelligence". This thesis introduces several segmentation techniques, such as maximum matching, improved maximum matching, and full segmentation. The system uses an integrated method for segmentation and part-of-speech tagging; experiments show that its performance is better than segmentation alone.
    Finally, we build a Bayes classifier that learns knowledge from the records and realizes the automatic diagnosis. The method is general and effective, and its performance improves as the training data grows; it can also be refined, for example with the vector space model (VSM). The Bayes classifier is the simplest of these models: its parameters are trained on the gathered data under the independence assumption, and the most likely category of a new example is selected by the Bayes decision rule.
    This thesis builds an automatic diagnosis system on statistical methods; compared with traditional rule-based expert systems in the medical field, it has many merits. It solves the hard problem of knowledge acquisition, and because the knowledge is learned from real medical records, it is objective and consistent. Preliminary experiments indicate that doctors can get useful information from the automatically generated diagnosis candidates, which helps improve their efficiency. The system is also highly portable and can be extended to other domains; exploring it further is worthwhile future work.
