A Bayesian nonparametric method for prediction in EST analysis
  • Authors: Antonio Lijoi (1)
    Ramsés H Mena (2)
    Igor Prünster (3)
  • Journal: BMC Bioinformatics
  • Year of publication: 2007
  • Publication date: December 2007
  • Volume: 8
  • Issue: 1
  • Full-text size: 280 KB
  • Author affiliations:

    1. Department of Economics and Quantitative Methods, University of Pavia, 27100 Pavia and Institute for Applied Mathematics and Information Technology, National Research Council, 20133, Milan, Italy
    2. Research Institute for Applied Mathematics and Systems, National Autonomous University of Mexico, Mexico City, A.P., 20-726, Mexico
    3. Department of Statistics and Applied Mathematics and ICER, University of Turin, 10122 Turin and Carlo Alberto College, 10024, Moncalieri, Italy
  • ISSN:1471-2105
Abstract
Background: Expressed sequence tag (EST) analyses are a fundamental tool for gene identification in organisms. Given a preliminary EST sample from a certain library, several statistical prediction problems arise. In particular, it is of interest to estimate how many new genes can be detected in a future EST sample of a given size and to determine the gene discovery rate. These estimates form the basis for deciding whether to continue sequencing the library and, if so, for selecting the size of the new sample. Such information is also useful for establishing sequencing efficiency in experimental design and for measuring the degree of redundancy of an EST library.

Results: In this work we propose a Bayesian nonparametric approach to statistical problems related to EST surveys. In particular, we provide estimates for: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt carries the available information into prediction in a statistically rigorous way. Our proposal has advantages over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. EST libraries previously studied with frequentist methods are analyzed in detail.

Conclusion: The Bayesian nonparametric approach we undertake yields valuable tools for gene capture and prediction in EST libraries. The estimators we obtain do not suffer from the drawbacks associated with frequentist estimators and are reliable for any size of the additional sample.
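
To make quantities b) and c) concrete, here is a minimal Python sketch. It assumes a two-parameter Poisson-Dirichlet (Pitman-Yor) prior with parameters sigma and theta, which the abstract itself does not specify; the function names and all numeric values are hypothetical and chosen only for illustration. Under that assumed prior, the chance that the next read reveals a new gene, given n reads showing j distinct genes, is (theta + j*sigma) / (theta + n), and iterating this rule gives the expected number of new genes in m further reads.

```python
import math


def expected_new_genes(n, j, m, sigma, theta):
    """Expected number of new distinct genes among m further reads.

    Assumes a Pitman-Yor prior with 0 < sigma < 1 and theta > -sigma
    (an assumption of this sketch), after n reads with j distinct genes.
    """
    # log of the ratio of rising factorials (theta+n+sigma)_m / (theta+n)_m
    log_ratio = (
        math.lgamma(theta + n + sigma + m)
        - math.lgamma(theta + n + sigma)
        - math.lgamma(theta + n + m)
        + math.lgamma(theta + n)
    )
    return (j + theta / sigma) * math.expm1(log_ratio)


def discovery_rate(n, j, m, sigma, theta):
    """Expected probability that read number n+m+1 reveals a new gene."""
    expected_distinct = j + expected_new_genes(n, j, m, sigma, theta)
    return (theta + sigma * expected_distinct) / (theta + n + m)


# Hypothetical basic sample: 1000 reads, 600 distinct genes.
# sigma and theta are placeholder values, not estimates from any real library.
n_reads, n_genes, sigma, theta = 1000, 600, 0.6, 50.0
for m in (0, 500, 2000):
    print(
        m,
        expected_new_genes(n_reads, n_genes, m, sigma, theta),
        discovery_rate(n_reads, n_genes, m, sigma, theta),
    )
```

In practice sigma and theta would have to be chosen or estimated from the observed sample; the sketch takes them as given. Working with log-gamma differences avoids overflow when the additional sample size m is large.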
