Research on Data Mining Methods for Aided Diagnosis of Early Colorectal Cancer
Abstract
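With the development of medical diagnostic technology, medical departments have accumulated large volumes of diagnostic information, such as patients' medical images, physiological and biochemical indicators, bioinformatics indicators, and background data. Hidden in these data is much important information that could serve as a basis for aided clinical diagnosis, so it is necessary to analyze and process it with appropriate techniques.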
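     Data mining is one of the techniques widely used to analyze medical diagnostic data. By processing the large amount of historical data in patient databases, it can uncover valuable diagnostic rules, which then support judgments based on a patient's age, sex, lifestyle, auxiliary examination results, biochemical indicators, and so on. This excludes interference from human factors, is highly objective, and yields diagnostic rules that generalize well.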
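     Building on data mining techniques and taking laser-induced autofluorescence diagnostic data for early colorectal cancer as its subject, this dissertation analyzes the characteristics of the diagnostic data and studies them in depth from three aspects: data preprocessing, formation of the training data set, and classification and prediction methods. The result is a laser-induced aided diagnosis system for early colorectal cancer that gives clinicians a means of aided diagnosis.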
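     The dissertation first analyzes the mechanism, characteristics, and research significance of diagnosing early colorectal cancer by laser-induced autofluorescence. Based on the characteristics of medical diagnostic data, it proposes a processing workflow for the laser-induced autofluorescence aided diagnosis data and analyzes each part, focusing on the composition of the spectral data acquisition system and the method for acquiring spectral data. It also describes the denoising applied to the spectra: filtering out high-frequency electronic noise, removing the spectral baseline, cropping the signal to the effective bandwidth, and normalizing the fluorescence spectra.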
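     The four denoising steps named above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the moving-average filter, the constant minimum-value baseline, the default band limits, and every function name are assumptions chosen only to make the order of operations concrete.

```python
from typing import List, Sequence, Tuple


def moving_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average (a simple low-pass filter).

    The window is expected to be odd; it shrinks at the two edges.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def preprocess_spectrum(
    wavelengths: Sequence[float],
    intensities: Sequence[float],
    band: Tuple[float, float] = (450.0, 700.0),  # hypothetical band limits in nm
    window: int = 3,
) -> List[Tuple[float, float]]:
    """Smooth, subtract a baseline, crop to the band, and peak-normalize."""
    if len(wavelengths) != len(intensities) or not intensities:
        raise ValueError("wavelengths and intensities must be non-empty and equal in length")

    # Step 1: suppress high-frequency noise (assumed filter: moving average).
    smoothed = moving_average(intensities, window)

    # Step 2: remove the baseline (assumed constant and equal to the minimum).
    baseline = min(smoothed)
    corrected = [v - baseline for v in smoothed]

    # Step 3: keep only the points inside the effective band.
    kept = [(w, v) for w, v in zip(wavelengths, corrected) if band[0] <= w <= band[1]]
    if not kept:
        raise ValueError("no samples fall inside the requested band")

    # Step 4: normalize so that the highest remaining point equals 1.
    peak = max(v for _, v in kept)
    if peak == 0:
        return [(w, 0.0) for w, _ in kept]
    return [(w, v / peak) for w, v in kept]


if __name__ == "__main__":
    wl = [400, 450, 500, 550, 600, 650, 700, 750]
    raw = [5, 5, 7, 11, 7, 5, 5, 5]
    for wavelength, value in preprocess_spectrum(wl, raw):
        print(wavelength, round(value, 3))
```

     Peak normalization is only one possible choice; area normalization would fit the same four-step structure.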
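     For incomplete early colorectal cancer fluorescence data, after analyzing and comparing feature extraction methods, the dissertation proposes an information-entropy rough set principal component analysis algorithm based on the tolerance relation. Compared with the classical rough set, the tolerance-relation rough set can accommodate the incompleteness of diagnostic data. An information entropy that decreases monotonically as the amount of information decreases is introduced, and on this basis an attribute reduction method is proposed and applied to the spectral data; principal component analysis is then used for further feature extraction. The algorithm extracts the features that influence the diagnosis of early colorectal cancer, lowers the data dimensionality, and reduces the complexity of subsequent processing.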
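     As an illustration of why a tolerance relation suits incomplete data, the sketch below uses the usual textbook formulation, in which two records are tolerant when every attribute either matches or is missing in at least one of them. The attribute names, the toy table, and the function names are hypothetical, and the dissertation's entropy measure and reduction algorithm are not reproduced here.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Record = Dict[str, Optional[str]]  # None marks a missing value


def tolerant(x: Record, y: Record, attrs: Sequence[str]) -> bool:
    """True if x and y agree on every attribute where both values are known."""
    return all(x[a] is None or y[a] is None or x[a] == y[a] for a in attrs)


def tolerance_classes(table: Sequence[Record], attrs: Sequence[str]) -> List[Set[int]]:
    """For each record index, the set of record indices tolerant with it."""
    return [
        {j for j, y in enumerate(table) if tolerant(x, y, attrs)}
        for x in table
    ]


def approximations(
    table: Sequence[Record], attrs: Sequence[str], target: Set[int]
) -> Tuple[Set[int], Set[int]]:
    """Lower and upper approximations of a target set of record indices."""
    classes = tolerance_classes(table, attrs)
    lower = {i for i, c in enumerate(classes) if c <= target}  # class inside target
    upper = {i for i, c in enumerate(classes) if c & target}   # class overlaps target
    return lower, upper


if __name__ == "__main__":
    # Hypothetical discretized spectral attributes with missing entries.
    table = [
        {"peak": "high", "ratio": "low"},
        {"peak": "high", "ratio": None},
        {"peak": "low", "ratio": "low"},
        {"peak": None, "ratio": "high"},
    ]
    attrs = ["peak", "ratio"]
    print(tolerance_classes(table, attrs))
    print(approximations(table, attrs, target={0, 1}))
```

     Unlike the equivalence classes of the classical rough set, these tolerance classes may overlap, which is what allows records with missing values to be kept rather than discarded.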
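     Because medical diagnostic data are mostly mixed data, and after analyzing existing mixed-data clustering algorithms, the dissertation proposes a lattice-based clustering algorithm for mixed data. Lattices are used to represent the data distribution so as to eliminate the difference between numeric and symbolic attributes, and the number of lattice elements that cover the data points is used in the clustering computation, so the algorithm no longer needs data conversion when handling mixed data. The algorithm's parameters, namely the initial number of clusters and the choice of center points, are analyzed and optimized: the initial number of clusters is optimized with a genetic algorithm to obtain its range of values, the optimization of center-point selection is described, and the algorithm's performance is analyzed. Based on the resulting clustered data set, the mean-variance method and the fluorescence intensity ratio discrimination method are used to extract data features, yielding classification features for normal and cancerous tissue that provide the basis for classification.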
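     To address the real-time requirement of medical diagnosis, the performance of the adopted classification algorithm is analyzed. The algorithm is found to perform a large amount of repeated computation, so its computational complexity and space complexity are relatively high. To solve this, the dissertation proposes a processing method based on an index tree structure: nodes involved in most of the repeated computation are placed in the upper levels of the tree, and nodes without repetition in the lower levels. This reduces repeated computation and effectively lowers both the computational and the space complexity, meeting the real-time requirement of diagnosis.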
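     To address the imbalance of medical diagnostic data, after analyzing the distribution characteristics of imbalanced data and current methods for handling it, the dissertation uses sampling techniques to propose three methods: global-density imbalanced data classification, μ-density imbalanced data classification, and boundary-sample local-density imbalanced data classification. The global-density method takes a combined average based on the samples of each class; it favors the classification of sparse data but reduces classification effectiveness on dense data. The μ-density method uses a cost-sensitive approach, analyzing the cost of classification correctness to obtain a suitable value of μ for selecting sample data and thereby improve classification effectiveness on imbalanced data. The boundary-sample local-density method focuses on the boundary samples in an imbalanced data set, classifies the boundary data by several methods, and analyzes the relevant parameters of the algorithm. All three algorithms reduce data imbalance by selecting sample data so as to increase the number of minority-class samples.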
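     The shared idea of raising the minority-class sample count can be shown with a deliberately simple baseline. The sketch below is plain random oversampling, a swapped-in generic technique and not any of the three density-based methods proposed in the dissertation; the function name, labels, and toy data are hypothetical.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def random_oversample(
    samples: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    seed: int = 0,
) -> Tuple[List[Sequence[float]], List[Hashable]]:
    """Duplicate randomly chosen samples until every class matches the largest class."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    counts = Counter(labels)
    target = max(counts.values())

    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Pool of original samples that belong to this class.
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


if __name__ == "__main__":
    X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
    y = ["normal", "normal", "normal", "normal", "cancer"]
    X_bal, y_bal = random_oversample(X, y)
    print(Counter(y_bal))
```

     A density-based variant would replace the uniform `rng.choice` with a selection rule that depends on how densely each region of the feature space is populated, which is the part this sketch leaves out.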
     The dissertation concludes by summarizing its innovations and proposing directions for future research.
With the development of bioinformatics and biomedical engineering, a large amount of medical information, including medical imaging resources, physiological indicators, bioinformatics data, and patient background data, is available in many hospitals and research groups. This information needs to be analyzed, because useful information that could serve as aided diagnosis rules is sometimes concealed by general processing methods.
     Data mining technology is advancing quickly in biomedical fields. It can be used to process massive stores of historical medical data and derive useful diagnosis rules from patient information, including age, gender, habits, and examination results, so the rules are generally applicable, free of human interference, and grounded in large-scale data processing.
     This dissertation studies the processing of autofluorescence spectrograms for colorectal carcinoma with data mining techniques, in three steps: preprocessing, forming the training samples, and building the classification model. The research results are used to build aided diagnosis methods based on autofluorescence spectrograms for colorectal carcinoma, with the aim of giving doctors a means of diagnosis.
     The dissertation first analyzes the theory and characteristics of autofluorescence spectrograms for colorectal carcinoma, and presents the modules of the autofluorescence spectrogram aided diagnosis system for colorectal carcinoma together with the details of each part. Methods for removing noise from the spectrogram are also provided.
     To handle data incompleteness, the dissertation presents an algorithm, called RPCA, that performs attribute reduction by combining a rough set based on the tolerance relation with PCA. A novel definition of entropy is introduced under which knowledge decreases as the granularity of information becomes smaller. A new reduction algorithm for the tolerance rough set is then presented, which extracts the data features together with PCA. With this algorithm, data features can be extracted, the number of attributes can be reduced, and the complexity of later processing can be reduced as well.
     As most biomedical data are hybrid data, the dissertation presents a lattice-based clustering algorithm for hybrid data. The algorithm uses lattices to eliminate the difference between ordinal and nominal attributes without the data conversion that would affect its accuracy. The parameters of the algorithm are optimized as well: a genetic algorithm is used to optimize the initial number of clusters, and the selection of center points is also optimized. From the clustered samples, several methods are used to obtain rules that distinguish normal from pathological tissue.
     To meet the time constraint, a novel index algorithm for classification is designed and applied. The algorithm uses an index tree to reduce repeated calculation and achieves higher efficiency in both computation and storage, especially in applications with large-scale repeated data.
     To deal with data imbalance, the dissertation presents three algorithms: overall-density imbalanced classification, μ-density imbalanced classification, and margin-density imbalanced classification. All of them are based on sampling, increasing the number of sparse (minority) samples to obtain higher performance, especially on imbalanced data. Several parameters of these algorithms are analyzed: a cost-sensitive method is presented to optimize μ according to the cost of correct and incorrect classification, and the other two parameters of the margin-density algorithm are analyzed as well.
     Finally, the innovations of the dissertation are summarized and subjects for future research are presented.
