Research on Data Mining Techniques for Traditional Chinese Medicine Medical Records
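Abstract
The medical records of eminent veteran TCM (Traditional Chinese Medicine) practitioners are a crystallization of their expertise, and data mining can help uncover the large body of clinical experience and prescribing patterns hidden in these experts' records. However, TCM medical records exist as free text, so text mining must first be used to extract information from the free text and build structured records before data mining can be applied effectively to acquire knowledge.
     This thesis first studies two text mining techniques, text classification and information extraction, and applies them to structuring the medical records of eminent veteran TCM practitioners. Data mining methods are then applied to the structured records to extract some of the clinical experience they contain. The work covers the following:
     1. Chinese text classification based on character features. Information gain (IG) is used for feature selection, cosine similarity measures the similarity between documents, and a KNN classifier assigns the labels. In experiments on the Fudan University news corpus, classification accuracy reached 86.92% and macro-averaged classification performance was close to 87%. The results indicate that character features are an effective approach to feature modeling for Chinese text classification.
     2. Chinese text information extraction. For the medical records of eminent veteran TCM practitioners, the Meta-Bootstrapping algorithm is used to extract terms, and the pattern structure needed for term extraction is designed. The method requires no shallow natural language processing and no annotated corpus; given only a small number of seed words, it completes the term extraction task after a number of iterations. In term extraction experiments on 206 medical records of one eminent physician, the F1 scores for formula names, syndrome differentiation information, and treatment principles were 64.29%, 56.21%, and 76.64%, respectively. Building on the extracted terms, the medical record structuring experiment was completed.
     3. Building on records processed by text classification and information extraction, the data preprocessing module of a system for mining the clinical experience of eminent veteran TCM practitioners is studied in depth, providing clean, structured source data for the subsequent data mining work.
     4. Based on the preprocessed symptom information, the syndrome differentiation process for chronic gastritis is modeled. A factor-analysis-based method is used to improve the existing latent structure model, improving both the model's accuracy and its training speed.
     5. Based on the preprocessed prescription information, the dose-effect relationships of herbal medicines are studied. A hierarchical clustering algorithm based on weighted Euclidean distance is designed and implemented. Using one eminent physician's asthma medical records as an example, patterns of herb use are mined and given a reasonable interpretation.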
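
The classification pipeline described in item 1 can be illustrated with a minimal sketch. It assumes single-character counts as features, cosine similarity between count vectors, and a similarity-weighted KNN vote. The information gain feature selection step is omitted for brevity, and the toy corpus, the labels, the value of `k`, and all function names are hypothetical; this is not the thesis's implementation.

```python
import math
from collections import Counter


def char_features(text):
    """Represent a document as a bag of single characters (no word segmentation)."""
    return Counter(ch for ch in text if not ch.isspace())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse character-count vectors."""
    dot = sum(count * b[ch] for ch, count in a.items() if ch in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(text, training_set, k=3):
    """Label `text` by a similarity-weighted vote among its k nearest training documents."""
    query = char_features(text)
    scored = sorted(
        ((cosine_similarity(query, char_features(doc)), label)
         for doc, label in training_set),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus; the thesis used the Fudan University news corpus.
TRAINING_SET = [
    ("球队赢得比赛冠军", "sports"),
    ("球员比赛进球", "sports"),
    ("运动员参加比赛", "sports"),
    ("股票市场价格上涨", "finance"),
    ("银行调整利率", "finance"),
    ("股市投资基金", "finance"),
]

print(knn_classify("球队比赛", TRAINING_SET))  # sports
print(knn_classify("股票价格", TRAINING_SET))  # finance
```

Because the features are individual characters, no word segmenter is needed, which is the property the abstract highlights for Chinese text.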
The medical records of TCM (Traditional Chinese Medicine) experts are the crystallization of these famous doctors' experience. Data mining (DM) can help us recover their clinical experience and prescribing patterns. However, the medical records usually take the form of unstructured free text. To mine such data, text mining must first be used to extract information from the text and structure the records, which is the foundation for mining.
     In this thesis, text mining is studied first, with a focus on text classification and information extraction. These techniques are then applied to structure the medical records of famous TCM doctors. Data mining methods are applied to the structured records to mine clinical experience. The research work is as follows:
     1. A study of Chinese text classification based on character features. Information gain is applied to select features, cosine similarity measures the similarity between documents, and KNN serves as the classifier. Systematic comparative experiments on the news corpus from Fudan University achieve 86.92% accuracy and a macro-F score close to 87%. The experimental results indicate that character-based features are an effective modeling method for Chinese text classification.
     2. A study of information extraction for extracting terms from clinical medical records. For the medical records of famous TCM doctors, the Meta-Bootstrapping algorithm is adopted to extract terms, and a pattern structure is designed for this purpose. The algorithm starts from a few manually provided seed words and accomplishes term extraction after several iterations, with no need for shallow Chinese NLP techniques or a labeled training corpus. Experiments are carried out on 206 clinical medical records of one famous doctor: prescription names, syndrome differentiation information, and treatment principles are extracted, with F1 scores of 64.29%, 56.21%, and 76.64%, respectively. On the basis of term extraction, the unstructured medical records are converted into structured records.
     3. Based on medical records processed by text classification and information extraction, the data preprocessing module of a data mining system for Traditional Chinese Medicine is studied, which provides clean, structured data for the subsequent mining work.
     4. Based on the structured symptom information in the medical records, the syndrome differentiation process for chronic gastritis is modeled. The existing latent structure model is improved using a method based on factor analysis, which improves both the accuracy of the model and its training speed.
     5. Based on the structured prescriptions, the dose-effect relationships of Chinese medicines are mined. An agglomerative clustering algorithm based on weighted Euclidean distance is designed and implemented. An experiment on the asthma medical records of a famous TCM doctor reveals patterns in his use of medicines, and the results are well supported by the theory of Traditional Chinese Medicine.
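
The clustering approach named in item 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not specify the linkage criterion or how the weights are chosen, so average linkage, the weight values, and the toy dose vectors (grams of three hypothetical herbs per prescription) are all assumptions, as are the function names.

```python
import math


def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def agglomerative(points, weights, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters.

    Average linkage is an assumption; the source does not name the linkage.
    Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Mean pairwise weighted distance between members of the two clusters.
        total = sum(
            weighted_euclidean(points[i], points[j], weights)
            for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical doses (grams) of three herbs in six prescriptions.
DOSES = [(6, 10, 3), (6, 12, 3), (5, 10, 3), (12, 10, 9), (12, 15, 9), (10, 12, 9)]
# Hypothetical weights: the second herb is treated as less informative.
WEIGHTS = (1.0, 0.2, 1.0)

print(agglomerative(DOSES, WEIGHTS, 2))  # [[0, 1, 2], [3, 4, 5]]
```

The weights let some herbs contribute more than others to the distance between prescriptions, which is what distinguishes this from clustering on plain Euclidean distance.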
