Genome-wide association analysis: Current status and challenges to data science
  • English title: Genome-wide association analysis: Current status and challenges to data science
  • Author: 张军英 (ZHANG Jun-ying)
  • Affiliation: School of Computer Science and Technology, Xidian University
  • Keywords: genome-wide; complex diseases/phenotypic traits; association analysis; statistical analysis; machine learning
  • Journal code: GUDZ
  • Journal: Journal of Guangzhou University (Natural Science Edition)
  • Publication date: 2019-02-15
  • Publisher: Journal of Guangzhou University (Natural Science Edition)
  • Year: 2019
  • Issue: v.18; No.103
  • Language: Chinese
  • Page: GUDZ201901001
  • Number of pages: 9
  • CN: 01
  • ISSN: 44-1546/N
  • Classification number: 5-13
Abstract
Genome-wide association study (GWAS) is a computational approach that examines single nucleotide polymorphisms (SNPs) of DNA variation across the whole genome in order to identify the SNPs that affect complex diseases and other phenotypic traits (such as disease, cancer, and height). It aims to provide a scientific basis for molecular biological discovery, analysis of the biological mechanisms of diseases/phenotypes, molecular targeted drug research, early disease risk prediction, and personalized treatment. Current methods are mainly based on statistics, machine learning and deep learning, intelligent optimization, and combinations of these, and they have achieved promising results. However, many reported associations still cannot be replicated, in line with the claim made by Ioannidis in "Why most published research findings are false", a paper published in PLoS Medicine in 2005 and cited more than 6,600 times to date. The article argues that this is because the core problems of GWAS remain unresolved, in particular what exactly should be mined from the data, under what conditions statistical significance implies scientific significance, and whether scientific significance can itself be defined scientifically. These are serious challenges that GWAS poses to data science.
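The gap between statistical and scientific significance raised in the abstract can be illustrated with a small simulation. The sketch below is not taken from the article: it assumes a simple allelic chi-square test per SNP and a Bonferroni correction, both standard choices swapped in here for illustration, and the function names `allelic_chi2_pvalue` and `simulate_null_scan` are hypothetical. Every simulated SNP has the same allele frequency in cases and controls, so any "significant" result is a false positive by construction.

```python
import math
import random


def allelic_chi2_pvalue(case_alt, case_total, ctrl_alt, ctrl_total):
    """P-value of a 1-df chi-square test on a 2x2 allele-count table."""
    table = [
        [case_alt, case_total - case_alt],
        [ctrl_alt, ctrl_total - ctrl_alt],
    ]
    n = case_total + ctrl_total
    row_sums = [case_total, ctrl_total]
    col_sums = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            if expected == 0:
                return 1.0  # monomorphic SNP: no evidence either way
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def simulate_null_scan(n_snps=1000, n_cases=300, n_controls=300, seed=0):
    """Scan SNPs that have NO true effect and return their p-values."""
    rng = random.Random(seed)
    pvalues = []
    for _ in range(n_snps):
        freq = rng.uniform(0.1, 0.5)  # identical allele frequency in both groups
        # Each person carries two alleles, hence 2 * n draws per group.
        case_alt = sum(rng.random() < freq for _ in range(2 * n_cases))
        ctrl_alt = sum(rng.random() < freq for _ in range(2 * n_controls))
        pvalues.append(
            allelic_chi2_pvalue(case_alt, 2 * n_cases, ctrl_alt, 2 * n_controls)
        )
    return pvalues


pvalues = simulate_null_scan()
nominal_hits = sum(p < 0.05 for p in pvalues)
bonferroni_hits = sum(p < 0.05 / len(pvalues) for p in pvalues)
print("SNPs with nominal p < 0.05:", nominal_hits)
print("SNPs passing Bonferroni correction:", bonferroni_hits)
```

Because about 5% of null SNPs are expected to pass the nominal threshold by chance, a scan of this kind reports dozens of associations that cannot replicate. Correcting for the number of tests removes most of them, but it addresses only the statistical side of the question and does not by itself establish that a surviving SNP is biologically meaningful.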
References
[1] Chang C Q,Yesupriya A,Rowell J L,et al.A systematic review of cancer GWAS and candidate gene meta-analyses reveals limited overlap but similar effect sizes[J].European Journal of Human Genetics,2014,22(3):402-408.
    [2] Klein R J,Zeiss C,Chew E Y,et al.Complement factor H polymorphism in age-related macular degeneration[J].Science,2005,308:385-389.
    [3] Nelson M R,Kardia S L,Ferrell R E,et al.A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation[J].Genome Research,2001,11(3):458-470.
    [4] Ritchie M D,Hahn L W,Roodi N,et al.Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer[J].American Journal of Human Genetics,2001,69(1):138-147.
    [5] Culverhouse R,Klein T,Shannon W.Detecting epistatic interactions contributing to quantitative traits[J].Genetic Epidemiology,2004,27(2):141-152.
    [6] Moore J H,Gilbert J C,Tsai C T,et al.A flexible computational framework for detecting,characterizing,and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility[J].Journal of Theoretical Biology,2006,241(2):252-261.
    [7] Zheng T,Wang H,Lo S H.Backward genotype-trait association (BGTA)-based dissection of complex traits in case-control designs[J].Human Heredity,2006,62(4):196-212.
    [8] Tang W,Wu X,Jiang R,et al.Epistatic module detection for case-control studies:A Bayesian model with a Gibbs sampling strategy[J].PLoS Genetics,2009,5(5):e1000464.
    [9] Wang Y,Liu X,Robbins K,et al.AntEpiSeeker:Detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm[J].BMC Research Notes,2010,3(1):117.
    [10] Wan X,Yang C,Yang Q,et al.Predictive rule inference for epistatic interaction detection in genome-wide association studies[J].Bioinformatics,2010,26(1):30-37.
    [11] Wan X,Yang C,Yang Q,et al.BOOST:A fast approach to detecting gene-gene interactions in genome-wide case-control studies[J].American Journal of Human Genetics,2010,87(3):325-340.
    [12] Yang G,Jiang W,Yang Q,et al.PBOOST:A GPU based tool for parallel permutation tests in genome-wide association studies[J].Bioinformatics,2015,31(9):1460-1462.
    [13] Yosef N,Yakhini Z,Tsalenko A,et al.A supervised approach for identifying discriminating genotype patterns and its application to breast cancer data[J].Bioinformatics,2007,23:91-98.
    [14] Zhang X,Huang S,Zou F,et al.TEAM:Efficient two-locus epistasis tests in human genome-wide association study[J].Bioinformatics,2010,26(12):217-227.
    [15] Fergus P,Montanez C C,Abdulaimma B,et al.Utilising deep learning and genome wide association studies for epistatic-driven preterm birth classification in African-American women[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2018:1.
    [16] Montañez C,Adays C,Montañez A C,et al.Deep learning classification of polygenic obesity using genome wide association study SNPs[C]//2018 International Joint Conference on Neural Networks (IJCNN),2018.
    [17] Bellot P,de los Campos G,Pérez-Enciso M.Can deep learning improve genomic prediction of complex human traits?[J].Genetics,2018,210(3):809-819.
    [18] Devlin B,Roeder K.Genomic control for association studies[J].Biometrics,1999,55(4):997-1004.
    [19] Pritchard J K,Stephens M,Donnelly P.Inference of population structure using multilocus genotype data[J].Genetics,2000,155(2):945-959.
    [20] Patterson N,Price A L,Reich D.Population structure and eigenanalysis[J].PLoS Genetics,2006,2(12):e190.
    [21] Yu J,Pressoir G,Briggs W H,et al.A unified mixed-model method for association mapping that accounts for multiple levels of relatedness[J].Nature Genetics,2006,38(2):203-208.
    [22] Cai Q,Zhang B,Sung H,et al.Genome-wide association analysis in east Asians identifies breast cancer susceptibility loci at 1q32.1,5q14.3 and 15q26.1[J].Nature Genetics,2014,46(8):886-890.
    [23] Haiman C A,Chen G K,Vachon C M,et al.A common variant at the TERT-CLPTM1L locus is associated with estrogen receptor-negative breast cancer[J].Nature Genetics,2011,43(12):1210-1214.
    [24] Michailidou K,Lindström S,Dennis J,et al.Association analysis identifies 65 new breast cancer risk loci[J].Nature,2017,551:92-94.
    [25] Milne R L,Kuchenbaecker K B,Michailidou K,et al.Identification of ten variants associated with risk of estrogen-receptor-negative breast cancer[J].Nature Genetics,2017,49:1767-1778.
    [26] Maguire L H,Handelman S K,Du X M,et al.Genome-wide association analyses identify 39 new susceptibility loci for diverticular disease[J].Nature Genetics,2018,50(3):1359-1365.
    [27] Zhu Z,Zheng Z,Zhang F,et al.Causal associations between risk factors and common diseases inferred from GWAS summary data[J].Nature communications,2018,9(1):1-12.
    [28] Elliott L T,Sharp K,Alfaro-Almagro F,et al.Genome-wide association studies of brain imaging phenotypes in UK Biobank[J].Nature,2018,562(1):210-216.
    [29] Huyghe J R,Bien S A,Harrison T A,et al.Discovery of common and rare genetic risk variants for colorectal cancer[J].Nature genetics,2019,51(1):76-87.
    [30] Chimusa E R,Mbiyavanga M,Mazandu G K,et al.AncGWAS:A post genome-wide association study method for interaction,pathway,and ancestry analysis in homogeneous and admixed populations[J].Bioinformatics,2015,32:549-556.
    [31] Ioannidis J P A.Why most published research findings are false[J].PLoS Medicine,2005,2(8):124.
    [32] Park J H,Geum D,Eisenhut M,et al.Bayesian statistical methods in genetic association studies:Empirical examination of statistically non-significant Genome Wide Association Study (GWAS) meta-analyses in cancers:A systematic review[J].Gene,2019,685:170-178.
    [33] Jiang R,Tang W W,Wu X B,et al.A random forest approach to the detection of epistatic interactions in case-control studies[J].BMC Bioinformatics,2009,10(S1):65.
    [34] Tuo S,Zhang J,Yuan X,et al.Niche harmony search algorithm for detecting complex disease associated high-order SNP combinations[J].Scientific Reports,2017,7(1):1-18.
    [35] Shang J,Sun Y,Liu J X,et al.CINOEDV:A co-information based method for detecting and visualizing n-order epistatic interactions[J].BMC Bioinformatics,2016,17(1):214.
    [36] Xu E L,Qian X,Yu Q,et al.Feature selection with interactions in logistic regression models using multivariate synergies for a GWAS application[J].BMC Genomics,2018,19(4):17-25.
    [37] Wei Z,Wang W,Bradfield J,et al.Large sample size,wide variant spectrum,and advanced machine-learning technique boost risk prediction for inflammatory bowel disease[J].American Journal of Human Genetics,2013,92(6):1008-1012.
    [38] Poggio T,Rifkin R,Mukherjee S,et al.General conditions for predictivity in learning theory[J].Nature,2004,428(6981):419-422.
    [39] Cordell H J.Epistasis:What it means,what it doesn't mean,and statistical methods to detect it in humans[J].Human Molecular Genetics,2002,11(20):2463-2468.
    [40] Marcus E.Credibility and reproducibility[J].Chemistry and Biology,2015,22(1):3-4.
    [41] Li F,Hu J,Xie K,et al.Authentication of experimental materials:A remedy for the reproducibility crisis?[J].Genes and Diseases,2015,2(4):283-283.
    [42] Liu Y J,Papasian C J,Liu J F,et al.Is replication the gold standard for validating genome-wide association findings?[J].PLoS One,2008,3(12):4037.
