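Recognition of Biological Gene Splice Sites Based on Machine Learning

English title: Recognition of the Splice Sites and Analysis of Gene Expression Based on Machine Learning Theory
Author: Su Hongquan (苏洪全)
Degree level: Doctoral
Discipline: Communication and Information Systems
Keywords (Chinese): 机器学习 ; 生物信息学 ; 基因表达 ; 剪切位点
Keywords (English): Machine Learning ; Bioinformatics ; Gene Expression ; Splice Sites
Degree year: 2011
Advisor: Zhu Yisheng (朱义胜)
Discipline code: 081001
Degree-granting institution: Dalian Maritime University (大连海事大学)
Thesis submission date: 2011-01-01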

Abstract

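Bioinformatics is an interdisciplinary science combining molecular biology and computer science. Its work is to analyze biological information from different fields, including nucleotide and amino acid sequences, proteins and their structures, and gene expression data. Because biological data are high-dimensional and voluminous, they place strict demands on the theory, algorithms, and software used to store, retrieve, process, analyze, and visualize them. Computational algorithms have become an indispensable part of bioinformatics research.
     Because evolution proceeds in complex ways and a theory of biological organization at the molecular level is lacking, biological systems have inherent complexity. Machine learning methods such as neural networks, hidden Markov models, support vector machines, and belief networks are well suited to analyzing bioinformatics data that are high-dimensional, noisy, and short of supporting theory. This thesis studies machine learning algorithms and their application in bioinformatics, improving the relevant learning algorithms according to the characteristics of bioinformatics data in order to raise their accuracy and efficiency. It has four main parts:
     (1) Improvement of the self-organizing neural network
     A self-organizing neural network transforms a high-dimensional input signal into a low-dimensional (usually one- or two-dimensional) discrete signal while preserving its topological structure, and it is widely used in pattern recognition, data analysis, and other fields. Under the Kohonen learning rule, the weight adjustment during learning is governed mainly by the learning-rate function and the neighborhood-width function; no mathematical method exists for choosing these two functions, so they are usually selected from experience. This thesis proposes applying the unscented Kalman filter and the Kalman filter to the adaptation of the learning-rate function and the neighborhood-width function, respectively.
     (2) Application and improvement of kernel methods
     The main idea of kernel methods is to map nonlinear data into a feature space and to apply linear learning and classification algorithms there. Kernel methods are a foundation of machine learning and have been applied successfully in many areas, such as clustering, classification, and dimensionality reduction. In practice, the choice of kernel function and its parameters is critical. Based on the statistical properties of serial analysis of gene expression (SAGE) data, this thesis proposes a Poisson-model-based kernel (PMK).
     (3) Recognition of splice sites
     In eukaryotic cells, most genes are interrupted by introns of varying length, giving them a mosaic, split structure. The introns are transcribed together with the exons and are then removed, and the exons are joined to produce mature mRNA. Accurate recognition of these splice sites is therefore important for genome analysis. This thesis applies the improved self-organizing neural network to the recognition of human splice sites.
     (4) Analysis of SAGE data
     Kernel methods include support vector machines and kernel principal component analysis. Support vector machines are built on the structural risk minimization principle of statistical learning theory and can be used to classify data. Kernel principal component analysis generalizes principal component analysis within the kernel framework; it handles nonlinear data effectively and captures their features. This thesis applies a support vector machine and kernel principal component analysis, both based on the Poisson kernel, to the processing of SAGE data.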
Bioinformatics is an interdisciplinary field at the intersection of molecular biology and computer science. It involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, protein structures, and gene expression data. The large volume and high dimensionality of bioinformatics data create a critical need for theoretical, algorithmic, and software advances in storing, retrieving, processing, analyzing, and visualizing biological information. Computational algorithms have become an essential component of the research process.
     Biological systems are inherently complex, owing to evolutionary tinkering and to our lack of a comprehensive theory of life's organization at the molecular level. Machine-learning approaches (e.g. neural networks, hidden Markov models, support vector machines, belief networks) are ideally suited to domains characterized by large amounts of data, "noisy" patterns, and the absence of general theories. The aim of this thesis is to improve the accuracy and efficiency of existing machine learning algorithms by modifying them according to the statistical properties of bioinformatics data. The thesis consists of four parts:
     (1) Improvement of self-organizing feature maps (SOFM).
     By transforming a high-dimensional input space onto a lower-dimensional (usually one- or two-dimensional) discrete map while preserving the original similarity relations, SOFMs have become a valuable tool in pattern discovery, data analysis, and related areas. When the neuron weights are updated, the Kohonen learning algorithm is controlled by two learning parameters, the learning coefficient and the width of the neighborhood function, which have to be chosen empirically because there is neither a rule nor a method for calculating them. To avoid this manual parameter selection, a novel method is proposed that adapts the learning coefficient and the width of the neighborhood function with an unscented Kalman filter (UKF) and a Kalman filter (KF), respectively.
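     The role of the two learning parameters can be illustrated with a minimal sketch of the standard Kohonen update on a one-dimensional map. It does not reproduce the UKF/KF adaptation proposed in the thesis: the exponential decay schedules, the function name `train_som`, and all default values are assumptions chosen only for illustration.

```python
import math
import random


def train_som(data, n_units=10, n_iter=500, eta0=0.5, sigma0=3.0, seed=0):
    """Train a 1-D self-organizing map with hand-picked decay schedules."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialize each unit's weight vector with random values in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for t in range(n_iter):
        # Assumed exponential schedules for the learning coefficient (eta)
        # and the neighborhood width (sigma); these are the two quantities
        # that normally have to be chosen empirically.
        eta = eta0 * math.exp(-t / n_iter)
        sigma = sigma0 * math.exp(-t / n_iter)
        x = data[rng.randrange(len(data))]
        # Best-matching unit: the unit whose weights are closest to x.
        bmu = min(
            range(n_units),
            key=lambda j: sum((w - xi) ** 2 for w, xi in zip(weights[j], x)),
        )
        for j in range(n_units):
            # Gaussian neighborhood on the map grid, centered at the BMU.
            h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
            # Kohonen update: move the unit toward x by a factor eta * h.
            weights[j] = [w + eta * h * (xi - w) for w, xi in zip(weights[j], x)]
    return weights
```

     Calling `train_som(samples)` on a list of equal-length numeric vectors returns one weight vector per map unit. Changing `eta0` or `sigma0` changes how fast and how smoothly the map organizes, which is the sensitivity the proposed adaptive scheme is meant to remove.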
     (2) Application of kernel methods.
     Kernel methods, which generalize linear learning methods to non-linear ones, have become a cornerstone of much recent work in machine learning and have been used successfully for many core machine learning tasks such as clustering, classification, and regression. In practice, kernel methods depend on an appropriate kernel function and its parameters, which should be chosen according to the statistical properties of the data. For serial analysis of gene expression (SAGE) data, which obey a Poisson distribution, a Poisson-model-based kernel (PMK) is proposed.
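     How a kernel turns a linear method into a non-linear one can be shown with a small sketch. The abstract does not give the form of the PMK, so a standard degree-2 polynomial kernel is used here as a stand-in; the function names are hypothetical. The point is only that evaluating the kernel on the raw inputs gives the same value as an inner product in an explicit feature space.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs with phi(x) . phi(y) = (x . y)^2."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def gram_matrix(points, kernel):
    """Pairwise kernel values; this matrix is all a kernel method needs."""
    return [[kernel(p, q) for q in points] for p in points]
```

     A linear algorithm run on the Gram matrix behaves as if it had been run on the mapped features, without ever computing them. Replacing `poly2_kernel` with a kernel matched to the data distribution is the step the thesis takes for SAGE data.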
     (3) Recognition of splice sites.
     In eukaryotic cells, most genes are interrupted by introns that must be removed before the genetic information can be decoded. RNA polymerase does not discriminate these introns from coding regions (exons), since they are normally transcribed together as a common precursor mRNA (pre-mRNA). Splicing, the process that removes introns from a pre-mRNA, is probably the most important post-transcriptional step in determining the protein output of a gene. Intron-exon boundaries therefore have to be defined precisely. The SOFM, whose parameters are adapted by the UKF and KF, is applied to the Homo Sapiens Splice Sites Dataset (HS3D).
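     One common way to present candidate splice sites to a neural network is sketched below. The abstract does not state how the sequences were encoded, so the one-hot encoding, the window size, and the function names are assumptions. The sketch relies only on the well-known fact that most introns begin with the dinucleotide GT (the donor site).

```python
# One-hot code for each nucleotide: four values per base.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode(window):
    """Flatten a DNA window into a one-hot vector (4 values per base)."""
    vec = []
    for base in window.upper():
        vec.extend(ONE_HOT[base])
    return vec


def candidate_donor_windows(seq, flank=3):
    """Yield (position, window) for each GT with `flank` bases on both sides."""
    seq = seq.upper()
    # Stop early enough that the right-hand flank stays inside the sequence.
    for i in range(flank, len(seq) - flank - 1):
        if seq[i:i + 2] == "GT":
            yield i, seq[i - flank:i + 2 + flank]
```

     Each encoded window can then be used as an input vector for a classifier or a map such as the SOFM; whether a given GT is a true donor site is what the learning algorithm has to decide.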
     (4) Analysis of SAGE data.
     Support Vector Machines (SVM) and Kernel Principal Component Analysis (KPCA) are two kernel-method algorithms. SVM is built upon the structural risk minimization principle from statistical learning theory, which states that the generalization error of a learning machine is bounded by the sum of the empirical risk and a confidence interval. KPCA is an efficient generalization of traditional Principal Component Analysis (PCA) that allows the detection and characterization of low-dimensional nonlinear structure in multivariate data sets. An SVM based on the PMK and KPCA based on the PMK are used for SAGE data analysis.
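     The KPCA step can be sketched in a few lines under stated assumptions. A Gaussian (RBF) kernel stands in for the PMK, whose form is not given in the abstract, and power iteration is used to extract only the first component; all function names and defaults are hypothetical.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian kernel, used here only as a stand-in for the PMK."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def centered_gram(points, kernel):
    """Gram matrix after centering the data in feature space."""
    n = len(points)
    k = [[kernel(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in k]  # row means (equal to column means)
    total = sum(row) / n           # grand mean
    return [[k[i][j] - row[i] - row[j] + total for j in range(n)]
            for i in range(n)]


def top_component(kc, n_iter=200):
    """Power iteration for the leading eigenpair of the centered matrix."""
    n = len(kc)
    v = [float(i + 1) for i in range(n)]
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        lam = norm  # converges to the largest eigenvalue
    return lam, v


def kpca_scores(points, kernel=rbf):
    """Projection of each training point onto the first kernel component."""
    lam, v = top_component(centered_gram(points, kernel))
    # With unit eigenvector v and eigenvalue lam, the score is sqrt(lam) * v[i].
    return [math.sqrt(lam) * x for x in v]
```

     For data that form two well-separated groups, the scores returned by `kpca_scores` take opposite signs for the two groups, which is the kind of nonlinear structure KPCA is meant to expose.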
