Haplotype-Based Association Analysis Methods
Abstract
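The completion of the Human Genome Project has greatly enriched the data resources of human genetics, in both quantity and quality, but it also makes it easy to get lost in this vast sea of information. Statistics, as a powerful tool for data analysis, is receiving growing attention and plays an irreplaceable role in genetic epidemiology research.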
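     Association analysis seeks to find and localize disease genes by studying the statistical correlation between genetic markers and observable traits, and it has played an important role in improving our understanding of the genetic basis of disease. Haplotypes, a common data type, are regarded as carrying more linkage disequilibrium (LD) information, and compared with other approaches, haplotype-based association analysis has greater power to identify disease associations, especially for rare diseases in case-control studies. However, when these haplotypes are modeled, rare haplotypes cause many statistical problems: the large number of parameters reduces power and lowers efficiency. Haplotype clustering is a good way to overcome these problems. This dissertation focuses on how to use information at individual loci and between loci effectively to improve the power of tests in haplotype-based association analysis, and presents one parametric method and one nonparametric method.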
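The parameter-count problem can be made concrete with a short worked note. The symbols L (number of biallelic loci), H (number of distinct haplotypes observed), and C (number of haplotype clusters) are notation introduced here for illustration and are not taken from the dissertation.

```latex
\[
H \le 2^{L}, \qquad
\mathrm{df}_{\text{haplotypes}} = H - 1, \qquad
\mathrm{df}_{\text{clusters}} = C - 1 \quad (C \le H).
\]
% Example: L = 10 biallelic loci allow up to 2^{10} = 1024 distinct haplotypes.
% A global test comparing haplotype frequencies between cases and controls
% (a 2 x H contingency table) has H - 1 degrees of freedom, and many of the
% H categories may be rare. Grouping the haplotypes into C clusters reduces
% the test to C - 1 degrees of freedom.
```

Because of LD, far fewer than 2^L haplotypes are usually observed, but rare haplotypes can still account for much of H.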
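     This dissertation first introduces a method for association analysis based on haplotype clustering, called APEG, which clusters haplotypes effectively and sensibly by applying the affinity propagation (AP) algorithm with the EG distance. The EG distance, a newly proposed similarity measure for the special data type of haplotypes, can exploit structural information at individual loci and between loci. Studies on simulated and real data show that APEG has greater power than other existing methods in detecting whether haplotypes are associated with disease, and that it also gives fairly accurate estimates in gene mapping. We then introduce U-EGS, a nonparametric method based on U-statistics, whose advantages are asymptotic normality and the absence of any assumption about the population distribution. The new kernel function EGS introduced in U-EGS is a generalization of the EG distance and can likewise use locus information. Subsequent simulation studies confirm that, under different parameters and for different disease models, the U-statistic with the EGS kernel, which incorporates locus information, has greater statistical power than the U-statistic that does not use locus information.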
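The clustering step can be sketched roughly as follows. This is a minimal illustration and not the dissertation's implementation: the EG distance is not defined in this abstract, so the negative Hamming distance stands in for it, and the function names (`hamming_similarity`, `build_similarity`, `affinity_propagation`), the toy haplotypes, the damping factor, and the preference value are all assumptions made for the example. The message-passing updates follow the standard affinity propagation rules.

```python
def hamming_similarity(h1, h2):
    """Stand-in similarity (NOT the EG distance): minus the number of mismatching loci."""
    return -sum(a != b for a, b in zip(h1, h2))


def build_similarity(haplotypes, preference):
    """Pairwise similarity matrix; the diagonal holds the exemplar preference."""
    n = len(haplotypes)
    S = [[float(hamming_similarity(haplotypes[i], haplotypes[j])) for j in range(n)]
         for i in range(n)]
    for k in range(n):
        S[k][k] = float(preference)
    return S


def affinity_propagation(S, damping=0.5, iterations=200):
    """Plain affinity propagation on a similarity matrix S (needs at least 2 points)."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(v for j, v in enumerate(vals) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k) over i' not in {i, k})
        # a(k,k) = sum of positive r(i',k) over i' != k
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    score = [R[k][k] + A[k][k] for k in range(n)]
    exemplars = [k for k in range(n) if score[k] > 0]
    if not exemplars:  # fallback so that every point still receives a label
        exemplars = [max(range(n), key=lambda k: score[k])]
    # Exemplars label themselves; other points join their most similar exemplar.
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


# Toy haplotypes over five biallelic loci (hypothetical data).
haplotypes = ["00000", "00001", "00010", "11111", "11110", "11101"]
S = build_similarity(haplotypes, preference=-3)
exemplars, labels = affinity_propagation(S)
for hap, lab in zip(haplotypes, labels):
    print(hap, "-> cluster represented by", haplotypes[lab])
```

In a clustering-based analysis the association test would then be run on the cluster labels rather than on the individual haplotypes. Ties in a discrete similarity such as the Hamming distance can make the updates oscillate; adding a small amount of noise to the similarities is a common remedy.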
The completion of the Human Genome Project has enriched the data resources of human genetics in both quantity and quality, but it also makes it easy to get lost in the ocean of information. Statistics, as a powerful data analysis tool, has attracted the attention of more and more researchers and plays an irreplaceable role in genetic epidemiology.
     Association analysis investigates genetic variation in order to detect associations between genetic markers and observable traits, and it plays an increasingly important part in understanding the genetic basis of disease. Haplotypes, a common data type, are generally considered to carry more linkage disequilibrium (LD) information, and haplotype-based association studies are believed to provide higher resolution and potentially greater power for identifying genetic disease associations than other approaches, especially for rare diseases in case-control studies. However, modeling these haplotypes runs into statistical problems caused by rare haplotypes: the large number of parameters limits power and decreases efficiency. Haplotype clustering offers an appealing solution. This dissertation proposes new statistical methods that incorporate the structural information of the loci in order to improve power in haplotype-based association studies.
     In this dissertation, we first present APEG, a haplotype clustering method for haplotype-based association studies that applies the affinity propagation clustering algorithm with the EG distance. The EG distance is a new similarity measure designed specifically for haplotypes; it incorporates haplotype structure information, which is believed to enhance power and provide high resolution for identifying associations between genetic variants and disease. Our simulation studies show that the proposed approach has advantages over other methods in detecting disease-marker associations. We also apply the method to a real data set, where it gives quite accurate estimates in fine mapping. We then develop U-EGS, a nonparametric method based on U-statistics, which is asymptotically normal and makes no assumption about the distribution of the samples. Further simulations show that, under different parameters and different disease models, the U-statistic with EGS, which incorporates locus information, has greater power than the U-statistic without locus information.
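The general shape of a U-statistic-based similarity test can be sketched as below. This is only an illustration under stated assumptions: the EGS kernel is not defined in this abstract, so a (optionally position-weighted) matching kernel stands in for it; the contrast between cases and controls and the permutation p-value are swapped-in choices for the example, whereas the dissertation relies on asymptotic normality. All function names and the toy haplotypes are hypothetical.

```python
import random
from itertools import combinations


def matching_kernel(h1, h2, weights=None):
    """Stand-in kernel (NOT EGS): weighted share of loci at which two haplotypes agree."""
    if weights is None:
        weights = [1.0] * len(h1)
    agree = sum(w for a, b, w in zip(h1, h2, weights) if a == b)
    return agree / sum(weights)


def u_statistic(haplotypes, kernel):
    """Order-2 U-statistic: the kernel averaged over all unordered pairs."""
    pairs = list(combinations(haplotypes, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def similarity_difference(cases, controls, kernel):
    """Excess within-group similarity among cases relative to controls."""
    return u_statistic(cases, kernel) - u_statistic(controls, kernel)


def permutation_p_value(cases, controls, kernel, n_perm=999, seed=0):
    """Two-sided permutation p-value (a stand-in for the asymptotic normal test)."""
    rng = random.Random(seed)
    observed = abs(similarity_difference(cases, controls, kernel))
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign case/control status at random
        perm_cases, perm_controls = pooled[:len(cases)], pooled[len(cases):]
        if abs(similarity_difference(perm_cases, perm_controls, kernel)) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


# Toy data (hypothetical): the cases share a common background, the controls do not.
cases = ["0110", "0110", "0111", "0010"]
controls = ["0000", "1011", "1100", "0101"]

d = similarity_difference(cases, controls, matching_kernel)
p = permutation_p_value(cases, controls, matching_kernel)
print("difference in mean pairwise similarity:", d)  # 0.375 for this toy data
print("permutation p-value:", p)
```

Passing `weights` to the kernel (for example through `functools.partial`) is one simple way to let locus-specific information influence the statistic, which is the role the abstract attributes to EGS.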
