Application and Research of Data Mining in the Classification of Hypothyroidism
Abstract
With the growth of medical informatization and the surge in diagnostic data, data mining techniques are needed to analyze the data in depth and extract potentially meaningful knowledge. Classification mining studies of hypothyroidism remain scarce. Existing work approaches the problem purely from the perspective of medical analysis, of statistical theory, or of a single data mining model. It neither combines statistical methods with data mining techniques nor compares multiple data mining models comprehensively in order to judge which hypothyroidism classification model is better.
     To address these shortcomings, this thesis mines different measurement data for hypothyroidism and studies several classification models in some depth, from the standpoint of both statistical principles and practical application. It compares the strengths and weaknesses of three models across several factors: variable requirements, data robustness, running time, interpretability of results, classification accuracy, and scalability. The results offer reference and guidance for the clinical diagnosis and classification of hypothyroidism. The main work is as follows:
     1) The thesis introduces the concepts and main application areas of data mining, and analyzes in some depth each implementation phase of the CRISP-DM data mining process and the results each phase produces. Drawing on both research and application, it develops a thorough business understanding of hypothyroidism classification. During data understanding, the hypothyroidism attributes are explored in depth so that the training and test sets are chosen to be generally representative. During data preparation, missing values, outliers, and useless or redundant attributes in the relevant fields are handled through fairly comprehensive data analysis and preprocessing.
     2) Based on the statistical principles underlying the data models, the thesis examines the similarities, differences, and relationship between statistical methods and data mining, focusing on the mathematical principles and application of discriminant analysis, Logistic regression, and the CHAID decision tree. Building the corresponding data mining models yields the main discriminating indicators for hypothyroidism classification. By combining statistical methods with multiple data mining models, the thesis carries out a fairly comprehensive statistical analysis and study of the mining algorithms, identifies the better-performing model, and then compares the three models across different measurement factors.
     3) In the Clementine 12.0 environment, the CRISP-DM standard process is used for systematic research and development of hypothyroidism mining. The six phases of the process are handled both as a whole and in detail, so that the mining work follows a structured, systematic, standardized, and visual workflow. The entire data mining process is developed in the Script scripting language, which reduces manual, repetitive, and time-consuming tasks and supports process automation and batch processing of objects in the user interface.
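The CHAID decision tree named in item 2 chooses its splits with chi-square tests of independence between each categorical predictor and the class label. The following is a minimal sketch of that criterion, not the thesis's implementation. The records and the field names `tsh_level`, `sex`, and `diagnosis` are hypothetical and do not come from the thesis data. The sketch ranks predictors by the raw Pearson statistic only, which is comparable here because both predictors are binary; full CHAID also merges categories and compares adjusted p-values.

```python
from collections import Counter


def chi_square(predictor, target):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(predictor)
    row_totals = Counter(predictor)
    col_totals = Counter(target)
    observed = Counter(zip(predictor, target))
    stat = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n  # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def best_split(records, predictors, target_key):
    """Return the predictor with the largest chi-square statistic, plus all scores."""
    target = [rec[target_key] for rec in records]
    scores = {
        p: chi_square([rec[p] for rec in records], target) for p in predictors
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy records, for illustration only.
records = [
    {"tsh_level": "high", "sex": "F", "diagnosis": "hypothyroid"},
    {"tsh_level": "high", "sex": "M", "diagnosis": "hypothyroid"},
    {"tsh_level": "high", "sex": "F", "diagnosis": "hypothyroid"},
    {"tsh_level": "high", "sex": "M", "diagnosis": "negative"},
    {"tsh_level": "normal", "sex": "F", "diagnosis": "negative"},
    {"tsh_level": "normal", "sex": "M", "diagnosis": "negative"},
    {"tsh_level": "normal", "sex": "F", "diagnosis": "negative"},
    {"tsh_level": "normal", "sex": "M", "diagnosis": "hypothyroid"},
]

best, scores = best_split(records, ["tsh_level", "sex"], "diagnosis")
print(best, scores)
```

On these toy records the TSH field is associated with the diagnosis and sex is not, so the TSH field is selected as the first split.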
