Machine Learning Methods and Their Applications in Bioinformatics (机器学习方法及其在生物信息学领域中的应用)


English title: Machine Learning Methods and Their Applications in Bioinformatics
Author: Wang Shuqin (王淑琴)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords (Chinese): 生物信息学 ; 机器学习 ; 操纵子预测 ; 癌症分类 ; 决策树 ; 遗传算法 ; 变精度粗糙集 ; 变精度明确区 ; 变精度非明确区 ; 基因间距离 ; COG基因功能 ; 基因表达谱 ; 新陈代谢通路 ; 熵 ; 关键基因 ; k-TSP
Keywords (English): bioinformatics ; machine learning ; operon prediction ; classifying cancers ; decision tree ; genetic algorithm ; variable precision rough set ; variable precision explicit region ; variable precision implicit region ; intergenic distance ; COG gene functions ; gene expression profile ; metabolic pathway ; entropy ; key gene ; k-TSP
Degree year: 2009
Advisor: Liang Yanchun (梁艳春)
Discipline code: 081203
Degree-granting institution: Jilin University
Submission date: 2009-04-01
Chair of the defense committee: Xing Zhongbao (邢忠宝)

Abstract

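Bioinformatics is an emerging interdisciplinary field that arose in the late 1980s with the launch of the Human Genome Project. It is one of the major frontiers of today's life sciences and natural sciences, formed at the intersection of biology, computer science, and applied mathematics. Bioinformatics methods can process large-scale data and extract the information needed, helping us better understand life and uncover the mysteries of the living world. As genome projects are completed one after another, the volume of data awaiting analysis and interpretation grows exponentially. The sheer volume of data, the depth of the research, and the complexity of genomic data itself create an urgent demand for advances in theory, algorithms, and software. Machine learning methods such as genetic algorithms and decision trees are well suited to domains like this one, where data are abundant and noisy and a unified theory is lacking.
     This thesis studies machine learning methods and their applications in bioinformatics. The main work covers the following four aspects:
     1. A decision tree construction method based on variable precision rough sets is proposed. The concepts of the variable precision explicit region and the variable precision implicit region are introduced, and a basic algorithm is given for selecting decision tree branching attributes based on variable precision rough set theory. The method is tested on 19 data sets from the UCI repository, and the results are compared with those of the popular decision tree induction algorithm C4.5.
     2. An operon prediction method based on a multi-approach guided genetic algorithm is proposed. Different methods are applied to evaluate different kinds of genomic data so as to make full use of the biological characteristics of each. A local entropy minimization method is proposed for evaluating intergenic distance. Experimental results show that prediction based on multiple attributes outperforms prediction based on a single attribute, and that the intergenic distance interval scores obtained by local entropy minimization on E. coli can be used for operon prediction in other genomes.
     3. An operon prediction method based on decision trees constructed with variable precision rough sets is proposed. Six kinds of genomic data are used for operon prediction: intergenic distance, COG function, metabolic pathway, microarray expression data, phylogenetic profiles, and conserved gene pairs. The method is tested on three genomes, E. coli, B. subtilis, and P. aeruginosa, and compared with C4.5. The experimental results show that it is an effective operon prediction method.
     4. An improved k-TSP cancer classification method based on information entropy is proposed. Information entropy is first used to select feature genes, and the k-TSP method is then used for cancer classification. Publicly available two-class gene expression data sets serve as the experimental data, and leave-one-out cross-validation is used to compute prediction accuracy. Compared with seven other machine learning methods, the method achieves good results.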
Bioinformatics is an interdisciplinary field that emerged with the launch of the Human Genome Project at the end of the 1980s. It is one of the major frontiers of the life sciences and natural sciences, and it is expected to be one of the core fields of the natural sciences in the 21st century. It draws on several disciplines, including biology, computer science, and applied mathematics. Research in bioinformatics covers biological data collection and management, database search and sequence alignment, genome sequence analysis, gene expression data analysis and processing, protein structure prediction, and the construction of metabolic pathways, signaling pathways, and gene regulatory networks.
     Bioinformatics methods can handle large-scale data and extract the information needed, helping us better understand living systems and reveal their mysteries. As genome sequencing projects are completed, the amount of data to be analyzed and interpreted is growing exponentially. The volume of data, the depth of the studies, and the complexity of genome data itself all create an urgent need for advances in theory, algorithms, and software. Machine learning methods such as neural networks, genetic algorithms, decision trees, and support vector machines are well suited to fields like this one, where data are abundant and noisy and a unified theory is lacking.
     This thesis studies machine learning methods and their applications in bioinformatics. The main work covers the following four aspects:
     1. We present a new approach for inducing decision trees based on the Variable Precision Rough Set Model (VPRSM). Decision tree classification is popular in machine learning. Current methods for constructing decision trees rely on purity measures such as information entropy and the Gini index. From the point of view of rough set theory, what these methods have in common is that they consider only the information in the implicit region and ignore the information in the explicit region. Rough set based approaches for inducing decision trees, by contrast, do consider the information in the explicit region: the more certain the information, the better the results. In real applications, however, data always contain noise. Because rough set based methods partition the samples exactly, they cannot prevent noise from affecting the construction of the decision tree. To reduce the classifier's sensitivity to noisy data and improve its generalization ability, we introduce variable precision rough set theory into decision tree construction. We propose two main concepts, the variable precision explicit region and the variable precision implicit region, and give an algorithm for inducing decision trees based on the variable precision rough set model. We also report a comparison between the presented approach and C4.5 on data sets from the UCI Machine Learning Repository. Experimental results show that the proposed approach is superior to the classical decision tree algorithm C4.5, especially before pruning.
     2. We present a novel multi-approach guided genetic algorithm for operon prediction. The fuzzy rules used in Jacob's approach rely on intuition, which makes them difficult for non-specialists to create. Moreover, that approach assesses every kind of genome data with the same method, so it cannot exploit the biological characteristics of each kind. We therefore preprocess different genome features with different methods to bring out their individual characteristics, and we use intergenic distance, participation in the same metabolic pathway, COG gene functions, and microarray expression data to predict operons. A novel local-entropy-minimization (LEM) method is proposed to evaluate intergenic distance: it divides the intergenic distances into several intervals and assigns a score to each interval. A COG function log-likelihood is computed for each adjacent gene pair, and the correlation coefficient of the microarray expression values is calculated. Finally, a genetic algorithm fuses the four genome features and predicts operons. The method is evaluated on the Escherichia coli K12, Bacillus subtilis, and Pseudomonas aeruginosa PAO1 genomes, with prediction accuracies of 85.9987%, 88.296%, and 81.2384%, respectively. Experimental results demonstrate that prediction using multiple features outperforms prediction using a single feature. They also show that the intergenic distance intervals obtained by LEM on Escherichia coli can be used for operon prediction in other prokaryotic genomes.
     3. We present an operon prediction method that uses a decision tree classifier based on the variable precision rough set model. In addition to the intergenic distance, COG gene functions, metabolic pathway, and microarray expression data used in Chapter 4, we add two genome features: phylogenetic profiles and conserved gene pairs. We describe how to extract these two features. First, we use 360 genomes and the BLAST program to compute the phylogenetic profile of each gene and the conserved gene pairs for each gene pair. We then compute the Hamming distance between the phylogenetic profiles of each adjacent gene pair, and give the frequency distribution and log-likelihoods for the different distances. Finally, the six genome features serve as input to the proposed method. The method is evaluated on Escherichia coli K12, Bacillus subtilis, and Pseudomonas aeruginosa PAO1 and compared with C4.5. Experimental results show that it is an effective method for operon prediction.
     4. We propose an entropy-based improved k-TSP method (Ik-TSP) for classifying cancers. The method proposed by Aik Choon Tan uses the top k highest-scoring gene pairs as the decision rule instead of only the single highest-scoring pair, so it must calculate the score of every gene pair and determine the decision rules from all of those scores. Because each cancer data set is large (the data sets used in this thesis contain at least 2,000 genes), the algorithm has relatively high time and space complexity. Ik-TSP therefore first uses information entropy to select key genes and then applies the k-TSP method to predict cancer classes. To evaluate the classification performance of Ik-TSP, we use as experimental data the 9 binary gene expression data sets used by Aik Choon Tan. Leave-one-out cross-validation (LOOCV) is employed to estimate prediction accuracy. In a comparison with seven other existing machine learning methods, Ik-TSP achieves an average accuracy of 95.44%, an improvement of 3% over the k-TSP method.
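     The pair-scoring step that item 4 describes as costly can be sketched as follows. This is a minimal illustration, assuming the commonly used top-scoring-pair score |P(x_i < x_j | class A) - P(x_i < x_j | class B)|. The function names `tsp_scores` and `top_k_pairs` and the toy data are hypothetical, and the entropy-based gene pre-selection of Ik-TSP is not shown because the abstract does not specify it.

```python
from itertools import combinations


def tsp_scores(samples, labels):
    """Score every gene pair (i, j) on a two-class data set.

    samples: list of expression vectors, one per sample (hypothetical layout).
    labels:  list of class labels, exactly two distinct values.
    The score is |P(x_i < x_j | class A) - P(x_i < x_j | class B)|.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are required")
    # Group the samples by class once, outside the pair loop.
    groups = [[s for s, y in zip(samples, labels) if y == c] for c in classes]
    n_genes = len(samples[0])
    scores = {}
    for i, j in combinations(range(n_genes), 2):
        # Fraction of samples in each class where gene i is below gene j.
        p = [sum(s[i] < s[j] for s in g) / len(g) for g in groups]
        scores[(i, j)] = abs(p[0] - p[1])
    return scores


def top_k_pairs(scores, k):
    """Return the k highest-scoring pairs (ties keep insertion order)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (assumption): 4 samples, 3 genes, two classes.
toy_samples = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [3, 1, 2]]
toy_labels = [0, 0, 1, 1]
toy_scores = tsp_scores(toy_samples, toy_labels)
print(top_k_pairs(toy_scores, 1))
```

     Because every pair is scored, the loop visits n(n-1)/2 pairs for n genes, about two million pairs for 2,000 genes, which is why shrinking the gene set first lowers the cost. The full k-TSP procedure also requires the selected pairs to be disjoint and breaks ties with a secondary rank-based score; the sketch omits both.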
     We have obtained some results on operon prediction and cancer prediction. This work enriches the study of applications of machine learning theory and provides a theoretical basis for applying operon prediction and cancer prediction. Operon prediction provides valuable information for the reconstruction of regulatory networks and for drug design. Cancer prediction provides a new method for finding gene markers, which can promote the early diagnosis and treatment of cancer.
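     The attribute selection described in item 1 can be illustrated with a short sketch. It assumes, as a simplification that the abstract does not confirm, that the variable precision explicit region of an attribute is the set of samples falling in attribute-value groups whose majority class share is at least a threshold beta, and that the branching attribute is the one with the largest such region. The names `explicit_region_size` and `pick_branch_attribute`, the value of `beta`, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def explicit_region_size(rows, attr, decision, beta=0.8):
    """Count samples in groups that are 'certain enough' under threshold beta.

    rows:     list of dicts mapping attribute names to values (assumed layout).
    attr:     candidate branching attribute.
    decision: name of the class attribute.
    A group (all rows sharing one value of attr) counts toward the region
    when its majority class covers at least a fraction beta of the group.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[decision])
    size = 0
    for class_labels in groups.values():
        majority_count = Counter(class_labels).most_common(1)[0][1]
        if majority_count / len(class_labels) >= beta:
            size += len(class_labels)
    return size


def pick_branch_attribute(rows, attrs, decision, beta=0.8):
    """Choose the attribute with the largest region (first one wins ties)."""
    return max(attrs, key=lambda a: explicit_region_size(rows, a, decision, beta))


# Toy data (assumption): one noisy 'no' among the sunny samples.
toy_rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
print(pick_branch_attribute(toy_rows, ["outlook", "windy"], "play"))
```

     With beta = 1.0 the criterion in this sketch reduces to the exact case, where a single mislabeled sample removes its whole group from the region; lowering beta tolerates such noise, which is the motivation item 1 gives for the variable precision model.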

引文

[1]张春霆.生物信息学的现状与展望[J].院士论坛, 2007, 22(6): 17-20.
    [2]李巍.生物信息学导论[M].郑州大学出版社, 2004.
    [3] Kanehisa M著.后基因组信息学[M].清华大学出版社, 2002.
    [4]赵静,俞鸿,骆建华,曹志伟,李亦学.应用复杂网络理论研究代谢网络的进展[J].科学通报, 2006, 51(11): 1241-1248.
    [5] Alizadeh A A et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling [J]. Nature, 2000, 403: 503-511.
    [6] Alon U et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays [J]. Proc. Natl Acad. Sci. USA, 1998, 96: 6745-6750.
    [7] Amit Y and Geman D. Shape quantization and recognition with randomized trees [J]. IEEE Trans. Pattern Anal. Machine Intell., 1997, 19: 1300-1305.
    [8] Armstrong S et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia [J]. Nat. Genet., 2002, 30: 41-47.
    [9] Beer D G et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma [J]. Nat. Med., 2002, 8: 816-824.
    [10] Bernstein I et al. Differences in the frequency of normal and clonal precursors of colony-forming cells in chronic myelogenous leukemia and acute myelogenous leukemia [J]. Blood, 1992, 79: 1811-1816.
    [11] Bhattacharjee A et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses [J]. Proc. Natl Acad. Sci. USA, 2001, 98: 13790-13795.
    [12]翁时锋.基于机器学习的几种医学数据处理方法研究[D].北京:清华大学自动化系, 2005.
    [13] Mitchell T M. Machine learning [M]. McGraw-Hill, 1997.
    [14] Mjolsness E, DeCoste D. Machine Learning for science: state of the art and future prospects [J]. Science, 2001, 293(14): 2051-2055.
    [15] Kohavi R and Provost F. Glossary of Temrs,Special Issue on applications of Machine Learning and the knowledge discovery process [J]. Machine Learning.1998, 30:271-274.
    [16] Shavlik J and Dieterrich T. Readings in Machine Learning [M]. San Mateo, CA: Morgan Kaufmann, 1990.
    [17] Wilson B. The Machine Learning Dictionary for COMP9414 [EB/OL]. http://www.cse.unsw.edu.au/~billw/mldict.html, 2008-04-22.
    [18] Baldi p, Nrunak S著,张东晖等译.生物信息学—机器学习方法[M].中信出版社, 2003.
    [19] Stormo G D, Schneider T D, Gold L, Ehrenfeucht A. Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli [J]. Nucleic Acids Res., 1982, 10: 2997-3011.
    [20] Qian N, Sejnowski T J. Predicting the secondary structure of globular proteins using neural network models [J]. J. Mol. Biol., 1988, 202: 865-884.
    [21] Borodovsky M, Mcininch J. GeneMark: parallel gene recognition for both DNA strands [J]. Comput. Chem., 1993, 17: 123-133.
    [22] Cheng Y, Church G M. Biclustering of expression data [C]. In Proceedings of the 2000 Conference on Intelligent Systems for Molecular Biology (ISMB00), LaJolla,CA,AAAI Press,Menlo Park,CA. 2000, 93-103.
    [23] Long A D, Mangalam H J, Chan B Y, Tolleri L, Hatfield G W, Baldi P. Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression in Escherichia coli K12 [J]. J. Biol. Chem., 2001, 276(3): 19937-19944.
    [24] Jacob F, Monod J. Genetic regulatory mechanisms in the synthesis of proteins [J]. Journal of Molecular Biology, 1961, 3: 318-356.
    [25] Zheng Y, Szustakowski J D, Fortnow L, Rberts R J, and Kasif S. Computational identification of operons in microbial genomes [J]. Genome Research, 2002, 12(8): 1221-1230.
    [26] Chen X, Su Z, Dam P, Palenik B, Xu Y, and Jiang T. Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome [J]. Nucleic Acids Research, 2004, 32: 2147-2157.
    [27] Xu Y. Computational genome annotation, Chapter 3 of Microbial Fuctional Genomics (J.Zhou, D. Thompson, Y. Xu, and J. Tidge) [M], John Wiley and Sons, 2004.
    [28] Yeh P, Tschumi A I, Kishony R. Functional classification of drugs by properties of their pairwise interactions [J]. Nature Genetics 2006, 38: 489-494.
    [29] Aloy P, Russell R B. Structural systems biology: modelling protein interactions [J]. Nature Reviews Molecular Cell Biology 2006, 7: 188-197.
    [30] Gon S, Camara J E, Klungsoyr H K, Crooke E, Skarstad K, and Beckwith J. A novel regulatory mechanism couples deoxyribonucleotide synthesis and DNA replication in Escherichia coli [J]. The EMBO Journal, 2006, 25: 1137-1147.
    [31] Yada T, Nakao M, Totoki Y, and Nakai K. Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models [J]. Bioinformatics, 1999, 15: 987-993.
    [32] Overbeek R, Fonstein M, D'Souza M, Pusch G D, and Maltsev N. The use of gene clusters to infer functional coupling [J]. Proc. Natl. Acad. Sci., 1999, 96: 2896-2901.
    [33] Salgado H, Moreno-Hagelsieb G, Smith T, and Collado-Vides J. Operons in Escherichia coli: genomic analyses and predictions [J]. Proc Natl Acad. Sci., 2000, 97: 6652-6657.
    [34] Craven M, Page D, Shavlik J, Bockhorst J, and Glasner J. A probabilistic learning approach to whole-genome operon prediction [C]. Proc. 8th International Conference on Intelligent Systems for Mol. Biol., 2000, 116-127.
    [35] Moreno-Hagelsieb G, Collado-Vides J. A powerful non-homology method for the prediction of operons in prokaryotes [J]. Bioinformatics, 2002, 18: 329-336.
    [36] Sabatti C, Rohlin L, Oh M K, and Liao J C. Co-expression pattern from DNA microarray experiments as a tool for operon prediction [J]. Nucleic Acids Res., 2002, 30: 2886-2893.
    [37] Bockhorst J, Craven M, Page D, Shavlik J, and Glasner J. A Bayesian network approach to operon prediction [J]. Bioinformatics, 2003, 19: 1227-1235.
    [38] Westover B P, Buhler J D, Sonnenburg J L, and Gordon J I. Operon prediction without a training set [J]. Bioinformatics, 2005, 21: 880-888.
    [39] Edwards M T, Rison S C G, Stoker N G and Wernisch L. A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context [J]. Nucleic Acids Res., 2005, 33: 3253-3262.
    [40] Dam P, Olman V, Xu Y. Improving Operon Prediction in E. coli [C]. 2005 IEEE Computational Systems Bioinformatics Conference - Workshops (CSBW), 2005, p. 69-70.
    [41] Jacob E, Sasikumar R, Nair K N R. A fuzzy guided genetic algorithm for operon prediction [J]. Bioinformatics, 2005, 21: 1403-1407.
    [42] Tran T T, Dam P, Wu H, Su Z, Poole F, Adams M, Zhou G T, and Xu Y. Operon prediction in Pyrococcus furiosus [J]. Nucleic Acids Research, 2007, 35: 69-78.
    [43] Roback P, Beard J, Baumann D, Gille C, Henry K, Krohn S, Wiste H, Voskuil M I, Rainville C, and Rutherford R. A predicted operon map for Mycobacterium tuberculosis [J]. Nucleic Acids Res., 2007, 35(15): 5085-5095.
    [44] Price M N, Huang K H, Alm E J, and Arkin A P. A novel method for accurate operon predictions in all sequenced prokaryotes [J]. Nucleic Acids Res., 2005, 33: 880-892.
    [45] Wang L S, Trawick J D, Yamamoto R, and Zamudio C: Genome-wide operon prediction in Staphylococcus aureus [J]. Nucleic Acids Res., 2004, 32: 3689-3702.
    [46] Wu H, Su Z, Mao F, Olman V, and Xu Y. Prediction of functionalmodules based on comparative genome analysis and Gene Ontology application [J]. Nucleic Acids Res., 2005, 33: 2822-2837.
    [47] Dam P, Olman V, Harris K, Su Z, and Xu Y. Operon prediction using both genome-specific and general genome information [J]. Nucleic Acids Research, 2007, 35: 288-298.
    [48] Joseph B, Yu Q, Jeremy G, Liu M Z, Frederick B, and Mark C. Predicting bacterial transcription units using sequence and expression data [J]. Bioinformatics, 2003, 19: 34-43.
    [49] Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, and Lipman D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J]. Nucleic Acids Res., 1997, 25(17): 3389-3402.
    [50] Gupta A, Maranas C D, Albert R. Elucidation of Directionality for Co-Expressed Genes:Predicting Intra-Operon Termination Sites [J]. Bioinformatics, 2006, 22: 209-214.
    [51] Marcotte E M, Pellegrini M, Thompson M J, Yeast T O, and Eisenberg D. A combined algorithm for genome-wide prediction of protein function [J]. Nature, 1999, 402: 83-86.
    [52] Marcotte E M. Computational genetics: Finding protein function by nonhomology methods [J]. Curr. Opin. Struct. Biol., 2000, 10: 359-365.
    [53] Huynen M, Snel B, Lathe W, and Bork P. Predicting protein function by genomic context: Quantitative evaluation and qualitative inferences [J]. Genome Res., 2000, 10: 1024-1210.
    [54] Liberles D, Thoren A, Heijne G, and Elofsson A. The use of phylogenetic profiles of gene predictions [J]. Current Genomies, 2002, 3: 131-137.
    [55] Pellegrini M, Marcotte E, Thomopson M, Eisenberg D, and Yeates T. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles [J]. Proc. Natl. Acad. Sci., 1999, 96: 4285-4288.
    [56] Tatusov R, Natale D, Garkavtsev I, Tatusova T, Schankavaram U, Rao B, Kiryutin B, Galperin M, Fedorova N, and Koonin E. The COG databases: new development in phylogenetic classification of proteins from complete genomes [J]. Nucleic Acids Res., 2001, 29(1): 22-28.
    [57] Ermolaeva M. Khalak H, White O, Smith H, and Salzberg S. Prediction of transcription terminators in bacterial genomes [J]. The Institute for Genomic Research, 2000, 301(1): 27-33.
    [58] Prestridge D. SIGNAL SCAN: A computer program that scans DNA sequences for eukaryotic transcriptional elements [J].CABIOS, 1991, 7: 203-206.
    [59] Angellotti M C, Shafquat B B, Chen G R, and Wan X F: CodonO. Codon usage bias analysis within and across genomes [J]. Nucleic Acids Res., 2007, 35, Web Server issue: 132-136.
    [60] Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-Solano F, Santos-Zavaleta A, Martinez-Flores I, Jimenez-Jacinto V, Bonavides-Martinez C, Segura-Salazar J, Mart?′nez-Antonio A, and Collado-Vides J. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions [J]. Nucleic Acids Res., 2006, 34: 394– 397.
    [61] Okuda S, Katayama T, Kawashima S, Goto S and Kanehisa M. ODB: a database of operons accumulating known operons across multiple genomes [J]. Nucleic Acids Res., 2006, 34: D358-D362.
    [62] Kanehisa M,Goto S, Hattori M,Aoki-Kinoshita K F,Itoh M, Kawashima S, Katayama T, Araki M, and Hirakawa M. From genomics to chemical genomics: new developments in KEGG [J]. Nucleic Acids Res., 2006, 34: 354-357.
    [63] Keseler I M, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen I T, Peralta-Gil M, Karp P D. EcoCyc: a comprehensive database resource for Escherichia coli [J]. Nucleic Acids Res., 2005, 33: 334-337.
    [64] Barrett T, Suzek T O, Troup D B, Wilhite S E, Ngau W C, Ledoux P, Rudnev D, Lash A E, Fujibuchi W, and Edgar R. NCBI GEO: mining millions of expression profiles-database and tools [J]. Nucleic Acids Res., 2005, 33: 562-566.
    [65] Du W, Wang Y, Wang S Q, Wang X M, Sun F X, Zhang C, Zhou C G, Hu C Q, and Liang Y C. Operon Prediction by GRNN Based on Log-Likelihoods and Wavelet Transform [J]. Dynamics of Continuous, Discrete and Impulsive Systems, A Supplement, Advances in Neural Networks, 2007, 14(S1): 323-327.
    [66] Du W, Wang Y, Wang S Q, Wang X M, Sun F X, Zhang C, Zhou C G, Hu C Q, and Liang Y C. Operon Prediction Using Neural Network Based on Multiple Information of Log-likelihoods [C]. The 4th International Symposium on Natural Networks (ISNN'07), 2007, Nanjing, China, Lecture Notes in Computer Science, 2007, 4491: 656-661.
    [67] Wang S Q, Wang Y, Du W, Sun F X, Wang X M, Zhou C G, and Liang Y C. A multi-approaches-guided genetic algorithm with application to operon prediction [J]. Artificial Intelligence in Medicine, 2007, 41(2): 151-159.
    [68] Wang X M, Du W, Wang Y, Zhang C, Zhou C G, Wang S Q, Liang Y C. The Application of Support Vector Machine to Operon Prediction [C]. The 2nd International Conference on Future Generation Communication and Networking (FGCN 2008). 2008, 3:59-62.
    [69] Cary M P, Bader G D, Sander C. pathway information for systems biology [J]. FEBS Letters, 2005, 579:1815-1820.
    [70] MetaCyc Pathway: glycolysis III [EB/OL]. http://biocyc.org/META/NEW-IMAGE?type= PATHWAY&object=ANAGLYCOLYSIS-PWY&detail-level=2&detail-level=3&detail-level=2, 2009.
    [71] McShan D C, Rao S, Shah I. PathMiner-predicting metabolic pathways by heuristic search [J]. Bioinformatics, 2003, 19: 1692-1698.
    [72] Goesmann A, Haubrock M, Meyer F, Kalinowski J, Giegerich R. PathFinder-reconstruction and dynamic visualization of metabolic pathways [J]. Bioinformatics, 2002, 18:124-129.
    [73] Mao X Z, Olyarchuk J G, Wei L P. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary [J]. Bioinformatics, 2005, 21: 3787-3793.
    [74] Goeman J J, Oosting J, Cleton-Jansen A, Anninga J K, van Houwelingen H C. Testing association of a pathway with survival using gene expression data [J]. Bioinformatics, 2005, 21: 1950-1957.
    [75] Pandey J, Koyuturk M, Kim Y, Szpankowski W, Subramaniam S, Grama A. Functional annotation of regulatory pathways [J]. Bioinformatics, 2007, 23: 1377-1386.
    [76] Verhoff F H, Spradlin J E. Mass and energy balance analysis of metabolic pathways applied to citric acid production by Aspergillus niger [J]. Biotechnology and Bioengineering, 1976, 18: 425-432.
    [77]山根恒夫.生物反应工程[M].上海:科学技术出版, 1989, 156-161.
    [78] Seressiotis A, Bailey J E. MPS: an Artificaially Intelligent Software System for the Analyss and Sysnthesis of Metabolic Pathways [J]. Biotechnology and Bioengineering, 1988, 31: 587-602.
    [79] Mavrovouniotis M L, Stephanopoulos G, Stephanopoulos G. Computer-Aided Synthesis of Biochemical Pathways [J]. Biotechnology and Bioengineering, 1990, 36: 1119-1132.
    [80] Mavrovouniotis M L. Identification of qualitatively feasible metabolic pathways [M]. In Artificial Intelligence and Molecular Biology, AAAI Press, Menlo Park, USA, 1993.
    [81] K¨uffner R, Zimmer R, Lengauer T. Pathway analysis in metabolic databases via differential metabolic display (DMD) [J]. Bioinformatics, 2000, 16: 825-836.
    [82] Schilling C H, Letcher D, Palsson B O. Theory for the sytemic definition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective [J]. J. Theor. Biol., 2000, 203: 229-248.
    [83] Schuster S, Fell D A, Dandekar T. A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks [J]. Nat. Biotechnol., 2000, 18: 326-332.
    [84] Boyer F, Viari A. Ab inito reconstruction of metabolic pathways [J]. Biosinformatics, 2003, Suppl. 19(2): ii26-ii34.
    [85] Ideker T, Ozier O, Schwikowski B, Siegel A F. Discovering regulatory and signaling circuits in molecular interaction networks [J]. Bioinformatics, 2002, Suppl. 18(1):S233-S240.
    [86] Rajagopalan D, Agarwal P. Inferring pathways from gene lists using a literature-derived network of biological relationships [J]. Bioinformatics, 2005, 21: 788-793.
    [87] Li Z, Chan C. Inferring pathways and networks with a Bayesian framework [J]. Faseb J., 2004, 18(6): 746-748.
    [88] Liu Y, Zhao H. A computational approach for ordering signal transduction pathway components from genomics and proteomics Data [J]. BMC Bioinformatics, 2004, 5: 158.
    [89] Novak BA, Jain AN. Pathway Recognition and Augmentation by Computational Analysis of Microarray Expression Data [J]. Bioinformatics, 2006, 22(2): 233-241.
    [90] Markowetz F, Bloch J, Spang R. Non-transcriptional pathway features reconstructed from secondary effects of RNA interference [J]. Bioinformatics, 2005, 21: 4026-4032.
    [91] Froehlich H, Fellmann M, Sueltmann H, Poustka A, Beissbarth T. Large scale statistical inference of signaling pathways from RNAi and microarray data [J]. BMC Bioinformatics, 2007, 8(1): 386.
    [92] Darvish A, Najarian K. Prediction of regulatory pathways using mRNA expression and protein interaction data: application to identification of galactose regulatory pathway [J]. Biosystems. Feb-Mar; 2006, 83(2-3): 125-35.
    [93] Li Z, Srivastava S, Mittal S, Yang X, Sheng L, Chan C. A Three Stage Integrative Pathway Search (TIPS) framework to identify toxicity relevant genes and pathways [J]. BMC Bioinformatics, 2007, 8: 202.
    [94] Ourfali O, Shlomi T, Ideker T, Ruppin E, Sharan R. SPINE: a framework for signaling-regulatory pathway inference from cause-effect experiments [J]. Bioinformatics, 2007, 23(13): i359-66.
    [95] Mukhopadhyay N D, Chatterjee S. Causality and pathway search in microarray time series experiment [J]. Bioinformatics, 2007, 23: 442-449.
    [96] Cai Y D, Muldoon M. Metabolic Pathway Modeling by Using the Nearest Neighbor Algorithm [R]. MIMS EPrint: 110, 2007.
    [97] Bono H, Ogata H, Goto S, and Kanehisa M. Reconstruction ofAmino Acid Biosynthesis Pathways from the Complete Genome Sequence [J]. Genome Res., 1998, 8: 203-210.
    [98] Green M L, Karp P D. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases [J]. BMC Bioinformatics, 2004, 5: 76.
    [99] Yamanishi Y. et al. Prediction of missing enzyme genes in a bacterial metabolic network [J]. FEBS J., 2007, 274: 2262-2273.
    [100] Pireddu L, Szafron D, Lu P, Greiner R. The Path-A metabolic pathway prediction web server [J]. Nucleic Acids Res., 2006, 34: 714-719.
    [101] Cakmak A, Ozsoyoglu G. Mining biological networks for unknown pathways [J]. Bioinformatics, 2007, 23: 2775-2783.
    [102] Wu J M, Mao X Z, Cai T, Luo J C, Wei L P. KOBAS server: a web-based Platform for automated annotation and pathway identification [J]. Nucleic Acids Res., 2006, 34: 720-724.
    [103] Chang W C, Li C W, Chen B S. Quantitative inference of dynamic regulatory pathways via microarray data [J]. BMC Bioinformatics, 2005, 6:1-19.
    [104] Zhang Y, Deng Z D. Identifying biological pathways via phase decomposition and profile extraction [C]. Proceedings of Computational Systems Bioinformatics (CSB2006), Stanford CA, 2006: 269-280.
    [105] Mao F L, Su Z C, Olman V, Dam P, Liu Z J, Xu Y. Mapping of orthologous genes in the context of biological pathways: An application of integer programming [J]. PNAS, 2006, 103(1): 129-134.
    [106]王琳芳,方福德.疾病基因组学研究———后基因组时代的主旋律[J].中国医学科学院学报, 2002, 24(3): 217-218.
    [107] Alon U, Barkai N, Notterman D A, et al. Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays [J]. PNAS, 1998, 96(12): 6745-6750.
    [108] Perou C M, Sorlie T, Eisen M B, et al. Molecular Portraits of Human Breast Tumors [J]. Nature, 2000, 406(6797): 747-752.
    [109] Alizadeh A A, Eisen M B, Davis R E, et al. Distinct Types of Diffuse Large B-cell Lymphoma Identified by Gene Expression Profiling [J]. Nature, 2000 , 403: 503-511.
    [110] Golub T R, Slonim D K, Tamayo P, Huard C, Gaasenbeek M, Mesirov J P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring [J]. Science, 1999, 286: 531-537.
    [111] Khan J, Wei J S, Ringner M, Saal L H, Ladanyi M, Westermann F, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks [J]. Nature Medicine, 2001, 7: 673-679.
    [112] Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, et al. Gene expression profiles in hereditary breast cancer [J]. The New England Journal of Medicine, 2001, 344: 539-548.
    [113] Tan A C, Naiman D Q, Xu L, Winslow R L, Geman D. simple decision rules for classifying human cancers from gene expression profiles [J]. Bioinformatics, 2005, 21, 3896-3904.
    [114] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines [J]. Machine Learning, 2002, 46: 389-422.
    [115] Zhou X B, Liu K Y, Wong S T C. Cancer classification and prediction using logistic regression with Bayesian gene selection [J]. Journal of Biomedical Informatics, 2004 37: 249-259.
    [116] Shevade S K, Keerthi S S. A simple and efficient algorithm for gene selection using sparse logistic regression [J]. Bioinformatics, 2003, 19: 2246-2253.
    [117] Li Y, Campbell C, Tipping M. Bayesian automatic relevance determination algorithms for classifying gene expression data [J]. Bioinformatics, 2002, 18: 1332-1339.
    [118] Chu W, Ghahramani Z, Falciani F, Wild D L. Biomarker discovery in microarray gene expression data with Gaussian processes [J]. Bioinformatics, 2005, 21: 3385-3393.
    [119]李颖新,阮晓钢.基于基因表达谱的肿瘤亚型识别与分类特征基因选取研究[J].电子学报, 2005, 33(4): 651-655.
    [120]李泽,包雷,黄英武,孙之荣.基于基因表达谱的肿瘤分型和特征基因选取[J].生物物理学报, 2002, 33(4): 413-417.
    [121] Liu C C, Chen W S E, Lin C C, Liu H C, Chen H Y, Yang P C, Chang P C, Chen J J W. Topology-based cancer classification and related pathway mining using microarray data [J]. Nucleic Acids Research, 2006, 34: 4069-4080.
    [122] Helman P, Veroff R, Atlas S R, and Willman C. A Bayesian network classification methodology for gene expression data [J]. J. Comput. Biol., 2004, 11: 581–615.
    [123] Geman D, d'Avignon C, Naiman D Q, Winslow R L. Classifying gene expression profiles from pairwise mRNA comparisons [J]. Stat. Appl. Genet. Mol. Biol., 2004, 3 (1): Article19.
    [124] Holland J H. Adaptation in Natural and Artificial Systems [M]. Ann Arbor: University of Michigan Press, 1975.
    [125] Goldberg D E. Genetic Algorithms in Search, Optimization&Machine Learning, Second Edition [M]. Addison~Wesley Publishing Company, 1989: 185-200.
    [126]王小平.遗传算法·遗传算法:理论、应用及软件实现[M].西安交通大学出版社, 2002.
    [127] Hunt E B, Marin J, Stone P J. Experiments in Induction [M]. New York: Academic Press. 1966.
    [128] Selbig J, Mevissen T, and Lengauer T. Decision tree-based formation of consensus protein secondary structure prediction [J]. Bioinformatics, 1999, 15(12): 1039-1046.
    [129] Comley J W, Allison L, and Fitzgibbon L J. Flexible Decision Trees in a General Data-Mining Environment [C]. Fourth International Conference on Intelligent DataEngineering and Automated Learning (IDEAL-2003), Hong Kong, 2003, 761-767.
    [130] Barlow T, Neville P. Case Study: Visualization for Decision Tree Analysis in Data Mining [C]. IEEE Symposium on Information Visualization (INFOVIS'01), 2001, 149-152.
    [131] WANG J, CUI J, ZHAO K. Investigation on AQ11, ID3 and the Principle of Discernibility Matrix [J]. J. Comput. Sci. & Technol. 2001, 16(1): 1-12.
    [132] Theodoridis S and Koutroumbas K. Pattern Recognition, second edition [M]. Elsevier, USA. 2003.
     [133] Quinlan J R. Induction of Decision Trees [J]. Machine Learning, 1986, 1: 81-106.
     [134] Quinlan J R. C4.5: Programs for Machine Learning [M]. San Mateo, CA: Morgan Kaufmann, 1993.
    [135] Clark P and Niblett T. The CN2 induction algorithm [J]. Machine Learning. 1989, 3: 261-284.
    [136] Kononenko I, Bratko I, and Roskar E. Experiments in automatic learning of medical diagnostic rules (Technical report) [R]. Jozef Stefan Institute, Ljubljana, Yugoslavia. 1984.
     [137] Quinlan J R. Data Mining Tools See5 and C5.0 [EB/OL]. http://www.rulequest.com/see5-info.html, 2000.
    [138] Breiman L, Friedman J H, Olshen R A, and Stone C J. Classification and Regression Trees [M]. Wadsworth, Belmont, 1984.
     [139] DB2 Intelligent Miner for Data [EB/OL]. http://www-4.ibm.com/software/data/iminer/fordata/about.html.
     [140] Fayyad U M and Irani K B. Multi-interval discretization of continuous-valued attributes for classification learning [C]. In R. Bajcsy (Ed.), Proc. the 13th International Joint Conference on Artificial Intelligence. Morgan Kaufmann, 1993, 1022-1027.
     [141] Hussain F, Liu H, Tan C L, and Dash M. Discretization: An Enabling Technique [J]. Data Mining and Knowledge Discovery, 2002, 6: 393-423.
     [142] Pyle D. Data preparation for data mining [M]. San Francisco: Morgan Kaufmann, 1999.
     [143] Nevill-Manning C G, Holmes G, and Witten I H. The Development of Holte’s 1R Classifier [C]. Proceedings of First New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, Dunedin, New Zealand, November: IEEE Computer Society Press, 1995, 239-242.
     [144] Fayyad U and Irani K. Discretizing continuous attributes while learning Bayesian networks [C]. Proc. of the Thirteenth International Conference on Machine Learning, Morgan Kaufmann, 1996, 156-165.
     [145] Rabaséda S, Rakotomalala R, Sebban M. A Comparison of Some Contextual Discretization Methods [J]. Information Sciences, 1996, 92: 137-157.
    [146]苗夺谦. Rough Set理论中连续属性的离散化方法[J].自动化学报. 2001, 27(3): 296-302.
    [147]蒋嵘,李德毅,范建华.数值型数据的泛概念树的自动生成方法[J].计算机学报, 2000, 23(5): 1-7.
    [148]叶东毅,陈昭炯.粗糙集属性量化的一个算法[J].小型微型计算机系统. 2002, 23 (10): 1239-1240.
    [149]权光日,刘文远,叶风,等.连续属性空间上的规则学习算法[J].软件学报, 1999, 10(11): 1225-1232.
    [150]谢宏,程浩忠,牛东晓.基于信息熵的粗糙集连续属性离散化算法[J].计算机学报, 2005, 28(9): 1570-1574.
     [151] Mingers J. An empirical comparison of selection measures for decision tree induction [J]. Machine Learning, 1989, 3(4): 319-342.
    [152] Nunez M. The use of background knowledge in decision tree induction [J]. Machine Learning. 1991, 6(3): 231-250.
     [153] Pawlak Z. Rough Sets [J]. International Journal of Computer and Information Sciences, 1982, 11(5): 341-356.
     [154]苗夺谦,李道国.粗糙集理论、算法与应用[M].清华大学出版社, 2008.
    [155]刘清.算子Rough逻辑及其归结原理[J].计算机学报, 1998, 21(5):476-480.
    [156]王志海,胡可云,胡学钢等.基于粗糙集合理论的知识发现综述[J].模式识别与人工智能, 1998, 11(2): 176-183.
    [157]曾黄麟.基于粗集理论的机器学习与推理[J].控制与决策, 1997, 12(6): 708-711.
    [158] Pawlak Z. AI and Intelligent Industrial Applications: The Rough Set Perspective [J]. Cybernetics and Systems, 2000, 31(3): 227-252.
    [159] Wei J M, Huang D. New methods for knowledge discovery based on rough sets [C]. In Proceedings of AMSMA’2000. Guangzhou, China, 2000, 1008-1011.
    [160] Guan J W, Bell D A. Rough computational methods for information systems [J]. Artificial Intelligence. 1998, 105(1/2): 77-103.
    [161] Chen S C, Shyu M L, Chen M, Zhang C C. A Decision Tree-based Multimodal Data Mining Framework for Soccer Goal Detection [C]. IEEE International Conference on Multimedia and Expo (ICME 2004), June 27 - June 30, 2004, Taipei, Taiwan, R.O.C. 2004, 265-268.
    [162] Orsenigo C, Vercellis C. Discrete support vector decision trees via tabu search [J]. Computational Statistics & Data Analysis, 2004, 47: 311-322.
     [163] Ziarko W. Variable precision rough set model [J]. Journal of Computer and System Sciences, 1993, 46(1): 39-59.
     [164] Wang S Q, Wei J M, et al. A VPRSM Based Approach for Inducing Decision Trees [C]. The First International Conference on Rough Sets and Knowledge Technology (RSKT2006), Chongqing, China. LNCS, 2006, 421-429.
    [165] Stover K C, Pham X Q, Erwin A L, Mizoguchi S D, Warrener P, Hickey M J, Brinkman F S L, Hufnagle W O, Kowalik D J, Lagrou M, Garber R L, Goltry L, Tolentino E, Westbrock-Wadman S, Yuan Y, Brody L L, Coulter S N, Folger K R, Kas A, Larbig K, Lim R, Smith K, Spencer D, Wong GK-S, Wu Z, Paulsen I, Reizer J, Saier M H, Hancock R E W, Lory S, and Olson M V. Complete genome sequence of Pseudomonas aeruginosa PAO1: an opportunistic pathogen [J]. Nature, 2000, 406: 959-964.
    [166] Shannon C E. A Mathematical Theory of Communication [J]. Bell System Technical Journal, 1948, 27: 379-423.
     [167] Tan A C, Naiman D Q, Xu L, Winslow R L, and Geman D. Online supplementary materials for simple decision rules for classifying human cancers from gene expression profiles [EB/OL]. https://jshare.johnshopkins.edu/atan6/public_html/KTSP/, 2005.
    [168] Wei J M, Wang S Q, Wang M Y. Novel approach to decision-tree construction [J]. Journal of Advanced Computational Intelligence and Intelligent Informatics. 2004, 8(3): 332-335.
    [169] Shannon C E and Weaver W. The Mathematical Theory of Communication [M]. The University of Illinois Press, Urbana, 1949.
    [170] Wang S Q, Zhou C B, Wu Y S, Wang J X, Zhou C G, Liang Y C. A Novel Approach for Classifying Human Cancers [C]. The 9th International Conference for Young Computer Scientists (ICYCS2008), 2008, 976-981.
     [171] Dietterich T G. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization [J]. Mach. Learn., 2000, 40: 139–157.
     [172] Tan A C, Gilbert D. Ensemble machine learning on gene expression data for cancer classification [J]. Appl. Bioinformatics, 2003, 2: S75–S83.
    [173] Harris M A, Clark J, Ireland A, Lomax J et al. The gene ontology (GO) database and informatics resource [J]. Nucleic Acids Res., 2004, 32: D258–D261.
    [174] Mutis T, Verdijk R, Schrama E, Esendam B, Brand A, and Goulmy E. Feasibility of Immunotherapy of Relapsed Leukemia With Ex Vivo-Generated Cytotoxic T Lymphocytes Specific for Hematopoietic System-Restricted Minor Histocompatibility Antigens [J]. Blood, 1999, 93: 2336–2341.
    [175] Dudoit S and Fridlyand J. Classification in microarray experiments [C]. In Speed, T.P. (ed.), Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall/CRC, 2003, 93–158.
     [176] Yang S, Shin J, Park K H, Jeung H C, Rha S Y, Noh S H, Yang W I, Chung H C. Molecular basis of the differences between normal and tumor tissues of gastric cancer [J]. Biochim Biophys Acta, 2007, 1772: 1033-1040.
    [177] Cho Y G, Song J H, Kim C J, Nam S W, Yoo N J, Lee J Y, and Park W S. Genetic and epigenetic analysis of the KLF4 gene in gastric cancer [J]. APMIS, 2007, 115: 802-808.
    [178] Wei D, Gong W, Kanai M, Schlunk C, Wang L, Yao JC, Wu TT, Huang S, Xie K. Drastic down-regulation of Kruppel-like factor 4 expression is critical in human gastric cancer development and progression [J]. Cancer Res., 2005, 65: 2746-2754.
