Bayesian Methods for Mass Spectrometry Data Analysis
Abstract
Genomics, which developed alongside the Human Genome Project, has played an epoch-making role in humanity's exploration of the principles of life. As the field advanced, however, it gradually became clear that probing the nature of life at the gene level alone is not enough; life phenomena must be studied at a more fundamental level, and proteomics emerged in response. Mass spectrometry, as an effective tool, has been of great help to scientists studying proteins.
     This thesis first introduces the mainstream mass-spectrometry-based protein analysis pipelines and technologies, together with commonly used identification algorithms, including those in SEQUEST, MASCOT, and X!Tandem. It summarizes the two strategies for quantitative protein analysis, isotope labeling and label-free quantification, analyzes their differences and respective advantages, and reviews common algorithms for discovering and identifying post-translational modifications from mass spectra.
     Existing mass-spectrometry-based protein identification algorithms each have their own strengths. We attempt to integrate them using machine learning combined with naive Bayes theory. The machine learning methods include SVM, LDA, logistic regression, KNN, Bayesian belief networks, and artificial neural networks; the classification features are the various parameters provided by SEQUEST. The training data come from 18 mass spectrometry data sets of known protein mixtures. Machine learning yields each classifier's decision boundary, and we compute the conditional distributions of positive and negative samples under each classifier's decision function. Given these conditional distributions and a new sample's feature scores under the classifiers, the posterior probability of a protein identification can be computed by naive Bayes under a uniform prior. Cross-validation shows that our algorithm achieves 80%-90% accuracy while guaranteeing 40%-50% recall, giving it good practical value.
     Identifying post-translational modifications has long been an important area of proteome research. The usual mass-spectrometry-based approaches are machine learning and direct comparison against known databases. Database comparison has high time complexity, and because many comparisons are made its false positive rate is also high. We first cluster the spectra with a projection-distance-based clustering algorithm and then identify post-translational modifications within the clusters, which both lowers the time complexity and improves the accuracy. The projection direction is computed from known samples via LDA and SVM so that, along it, within-class distances are as small as possible and between-class distances as large as possible. Given the direction, pairwise projection distances between unknown samples yield a distance matrix, to which standard clustering algorithms are applied directly. Each resulting cluster may consist of instances of the same peptide carrying different post-translational modifications, so comparing the results within a cluster quickly and efficiently reveals candidate modifications. Under cross-validation on known data, both the accuracy and the recall of the algorithm are around 70%.
     Since Google put forward the concept of cloud computing, cloud-based applications have proliferated. Mass spectrometry data analysis is high-throughput and parallelizable, and can therefore be conveniently deployed on cloud computing platforms. We propose two deployment strategies and compare their advantages and drawbacks.
Genomics, which developed with the Human Genome Project (HGP), plays an important role in exploring the elements of life. In the course of this work, people have realized that it is inadequate to understand life at the genetic level alone, and proteomics has grown up in response. Mass spectrometry is an effective tool for scientists studying proteins.
     The main pipelines and technologies for analyzing proteins by mass spectrometry are introduced in this thesis, along with common identification algorithms such as SEQUEST, MASCOT, and X!Tandem. We also summarize the distinctions and merits of two strategies for quantitative protein analysis, isotopic labeling and label-free quantification. Moreover, common methods for discovering and identifying post-translational modifications of proteins are described.
     Because different protein identification algorithms have their own strengths, we attempt to integrate the existing methods using machine learning combined with naive Bayes theory. The machine learning approaches include SVM, LDA, logistic regression, KNN, Bayesian networks, and artificial neural networks. We use the parameters reported by SEQUEST as the classification features. The training data sets come from mass spectra of known protein mixtures divided into 18 groups. Machine learning gives the decision boundary of each classifier, and we estimate the conditional distributions of the positive and negative samples under each classifying function. From these conditional distributions and the feature scores of a new sample, the posterior probability of an identification can be computed by naive Bayes under a uniform prior. Cross-validation shows that the accuracy of our method reaches 80%-90% with recall of 40%-50%, which demonstrates the practical value of the method.
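The naive-Bayes combination step above can be sketched as follows. This is an illustrative reconstruction, not the thesis code: it assumes each classifier's decision score on positive and negative identifications is summarized by a Gaussian (mean, std), estimated beforehand from training data; all numeric values below are hypothetical.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def naive_bayes_posterior(scores, pos_params, neg_params):
    """Posterior P(correct | scores) under a uniform prior.

    scores     -- one decision value per classifier (SVM, LDA, ...)
    pos_params -- (mean, std) of each classifier's score on positive samples
    neg_params -- (mean, std) of each classifier's score on negative samples
    Naive Bayes treats the classifier scores as conditionally independent.
    """
    like_pos = like_neg = 1.0
    for s, (mp, sp), (mn, sn) in zip(scores, pos_params, neg_params):
        like_pos *= gaussian_pdf(s, mp, sp)
        like_neg *= gaussian_pdf(s, mn, sn)
    # The uniform prior P(+) = P(-) = 0.5 cancels in the ratio.
    return like_pos / (like_pos + like_neg)
```

A score vector lying near the positive-class means yields a posterior close to 1, while a score vector equidistant from both classes yields exactly 0.5.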
     How to identify post-translational modifications of proteins is a key problem in proteomics. The traditional algorithm identifies unknown spectra by searching a protein database, which is time-consuming and raises the false positive rate. We instead cluster the mass spectra using a projection distance and then look for post-translational modifications within each cluster, which not only decreases the time complexity but also improves the accuracy. The projection direction is calculated by LDA and SVM so as to make within-cluster distances smaller and between-cluster distances larger. We build the distance matrix from the pairwise projections and cluster it with standard clustering algorithms. Peptides in the same cluster may derive from the same peptide sequence with different post-translational modifications, so comparing the members of a cluster reveals latent modifications quickly and efficiently. Under cross-validation, both accuracy and recall reach about 70%.
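A minimal sketch of the projection-distance construction, using only the Fisher/LDA direction (the thesis also uses SVM, which is omitted here) on hypothetical 2-D feature vectors:

```python
def fisher_direction(pos, neg):
    """Fisher/LDA direction w = Sw^{-1} (mu_pos - mu_neg) for 2-D features."""
    def mean(xs):
        n = len(xs)
        return [sum(x[0] for x in xs) / n, sum(x[1] for x in xs) / n]
    def scatter(xs, mu):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x in xs:
            d = [x[0] - mu[0], x[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s
    mp, mn = mean(pos), mean(neg)
    sp, sn = scatter(pos, mp), scatter(neg, mn)
    sw = [[sp[i][j] + sn[i][j] for j in range(2)] for i in range(2)]
    # Invert the 2x2 within-class scatter matrix (assumed nonsingular).
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mp[0] - mn[0], mp[1] - mn[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def projection_distance_matrix(samples, w):
    """Pairwise distances |w . (x_i - x_j)| fed to the clustering step."""
    proj = [w[0] * x[0] + w[1] * x[1] for x in samples]
    return [[abs(a - b) for b in proj] for a in proj]
```

The resulting matrix can be passed directly to any distance-based clustering algorithm (hierarchical, k-medoids, and so on), as the abstract describes.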
     Since the concept of cloud computing was put forward by Google, many kinds of cloud-based applications have emerged. Owing to its high throughput and parallelism, proteomics computation can be assigned to cloud computing platforms. We therefore propose two deployment strategies and discuss the advantages and drawbacks of each.
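The abstract does not spell out the two deployment strategies; a common pair in distributed spectrum search, sketched below purely as an assumption, is (1) replicate the database and partition the spectra across nodes, and (2) shard the database and merge each node's partial best hits. The toy `score` function and data are hypothetical.

```python
def search_split_spectra(spectra, database, score, n_nodes):
    """Strategy 1: every node holds the full database and a slice of spectra."""
    hits = {}
    for node in range(n_nodes):
        for s in spectra[node::n_nodes]:          # this node's share of spectra
            hits[s] = max(database, key=lambda p: score(s, p))
    return hits

def search_split_database(spectra, database, score, n_nodes):
    """Strategy 2: each node holds a database shard; partial hits are merged.

    Assumes n_nodes <= len(database) so every shard is non-empty.
    """
    partials = []
    for node in range(n_nodes):
        shard = database[node::n_nodes]           # this node's database shard
        partials.append({s: max(shard, key=lambda p: score(s, p)) for s in spectra})
    # Reduce step: keep the globally best-scoring hit for each spectrum.
    return {s: max((part[s] for part in partials), key=lambda p: score(s, p))
            for s in spectra}
```

Both strategies return the same best hits; they differ in memory per node (strategy 1 replicates the database) versus communication cost (strategy 2 needs a merge step).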
