The Research of Feature Selection Algorithm Based on Swarm Intelligence in SELDI Mass Spectral Data Analysis

Original title: 基于群体智能的特征选择算法在SELDI质谱数据分析中的研究
Author: 张蓉 (Zhang Rong)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS); feature selection; biomarker; ant colony optimization (ACO); particle swarm optimization (PSO); support vector machine (SVM)
Degree year: 2009
Advisor: 冯斌 (Feng Bin)
Discipline code: 081203
Degree-granting institution: 江南大学 (Jiangnan University)
Thesis submission date: 2009-08-01

Abstract

Feature selection (FS) has become a prerequisite for model building in bioinformatics. Many modeling tasks in the field, from sequence analysis and microarray analysis to mass spectral analysis and literature mining, involve high-dimensional data, small sample sizes, and sparsely populated feature spaces. Small samples carry an inherent risk of imprecision and overfitting, which makes data analysis challenging. In response to these characteristics, new feature selection algorithms with good stability and robustness continue to be proposed.
     Mass spectrometry (MS) measures the proteins and peptides in biological samples such as tissue and cell extracts, blood serum, and urine, and yields the molecular weights of the target proteins. Such measurements can be used to identify disease-related patterns, which provide important and direct clues for discovering disease biomarkers and specific therapeutic target molecules, and which hold potential for drug development, early diagnosis, prognosis, monitoring of disease progression, and assessing response to treatment.
     This thesis systematically studies the analysis of surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS) data and applies swarm intelligence optimization algorithms, combined with support vector machines (SVM), to biomarker feature selection in mass spectral data. The main work is as follows:
     (1) The fundamental principles of SELDI-TOF-MS, a technology at the current international research frontier, are studied. Methods for the two main phases of SELDI-TOF-MS data analysis, pre-processing and biomarker selection, are summarized and compared, and the open problems and development directions of the technology are discussed.
     (2) The fundamental principles of swarm intelligence algorithms, in particular ant colony optimization (ACO), particle swarm optimization (PSO), and their improved variants, are studied, providing a theoretical basis for later application.
     (3) A new method is proposed that uses feature weighting factors as prior information in the ACO search process. Combined with SVM, it is applied to select relevant serum proteomic biomarkers. In the experiments, the resulting cancer diagnosis model achieved good classification performance in distinguishing cancer patients from healthy individuals.
     (4) Quantum-behaved particle swarm optimization (QPSO), ACO, and PSO are each combined with SVM, and the resulting diagnostic models are used for biomarker selection. The experiments show that the QPSO-based model achieves good prediction accuracy and is also substantially faster, so the proposed QPSO-SVM method has both theoretical significance and practical value.
     Finally, the main contributions of the thesis are summarized and directions for further research are suggested.
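The weight-guided ACO search described in item (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify the weighting factor, the pheromone update, or any parameter values, so everything below is an assumption. The prior weight is taken to be a simple t-like score, a leave-one-out nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library, and the names `feature_weights`, `loo_accuracy`, `aco_select`, and `make_demo_data` are hypothetical.

```python
import random


def feature_weights(X, y):
    """Assumed prior: |difference of class means| / sqrt(sum of class variances)."""
    weights = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((v - ma) ** 2 for v in a) / len(a)
        vb = sum((v - mb) ** 2 for v in b) / len(b)
        weights.append(abs(ma - mb) / ((va + vb) ** 0.5 + 1e-9))
    return weights


def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for SVM)."""
    correct = 0
    for i in range(len(X)):
        best_label, best_dist = None, float("inf")
        for label in (0, 1):
            # Class centroid computed without the held-out sample i.
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == label]
            dist = 0.0
            for j in subset:
                centroid = sum(row[j] for row in rows) / len(rows)
                dist += (X[i][j] - centroid) ** 2
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == y[i]
    return correct / len(X)


def aco_select(X, y, k, n_ants=10, n_iter=20, alpha=1.0, beta=1.0, rho=0.2, seed=0):
    """Pick k features; selection probability is proportional to tau^alpha * eta^beta."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    eta = feature_weights(X, y)   # prior (heuristic) information per feature
    tau = [1.0] * n_feat          # pheromone per feature
    best_subset, best_fit = None, -1.0
    for _iteration in range(n_iter):
        for _ant in range(n_ants):
            candidates = list(range(n_feat))
            subset = []
            for _step in range(k):
                scores = [(tau[j] ** alpha) * (eta[j] ** beta + 1e-12) for j in candidates]
                pick = rng.choices(candidates, weights=scores, k=1)[0]
                subset.append(pick)
                candidates.remove(pick)  # sample without replacement
            fit = loo_accuracy(X, y, subset)
            if fit > best_fit:
                best_subset, best_fit = sorted(subset), fit
        # Evaporate all trails, then reinforce the best subset found so far.
        tau = [(1.0 - rho) * t for t in tau]
        for j in best_subset:
            tau[j] += best_fit
    return best_subset, best_fit


def make_demo_data(seed=1, n_per_class=10, n_feat=10):
    """Synthetic two-class data: features 0 and 1 are informative, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_feat)]
            row[0] = rng.gauss(5.0 * label, 0.1)
            row[1] = rng.gauss(5.0 * label, 0.1)
            X.append(row)
            y.append(label)
    return X, y
```

Calling `aco_select(*make_demo_data(), k=2)` returns a sorted list of two feature indices and the leave-one-out accuracy of that subset. In a setting closer to the thesis, the fitness function would be the cross-validated accuracy of an SVM, and the rows of `X` would be pre-processed peak intensities from SELDI-TOF spectra.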

