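Statistical Methods for Interaction Analysis in High-Dimensional Data and Their Application to a Genome-Wide Association Study of Lung Cancer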
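Abstract
Genome-wide association studies (GWAS) first showed their promise in 2005 and have flourished ever since, yielding a wealth of findings. However, the loci with main effects identified by GWAS so far explain only a small fraction of genetic variation. Complex diseases arise from the interplay of external environmental exposures and internal genetic factors. Neglecting gene-environment and gene-gene interactions in genomic studies is one of the major reasons for missing heritability.
     GWAS involve up to hundreds of thousands of variables. Limited by algorithmic complexity and software speed, traditional interaction analysis methods cannot detect interactions at the genome-wide level. Since 2007, a large number of methods for gene-gene interaction analysis in high-dimensional genomic data have emerged. Each has strengths and weaknesses, and there is no dedicated method for rapidly detecting high-order interactions. This thesis first systematically evaluates several interaction analysis methods; second, it improves on existing methods and proposes a new method for high-order interaction analysis; third, it explores a dimension-reduction strategy for high-order interaction analysis in high-dimensional data; finally, it applies the resulting strategy to mine interactions in real GWAS data. The thesis is organized as follows: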
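     Part I: Systematic evaluation of interaction analysis methods. Based on a literature review, we systematically evaluated 10 methods (from 7 software packages) with strong performance and representative algorithms: BOOST, BiForce, iLOCi, SIXPAC_D, SIXPAC_R, SIXPAC_lod, SNPRuler, AntEpiSeeker_pruned, AntEpiSeeker_raw, and TEAM. Simulations 1 and 2 examined each method's ability to detect one pair and multiple pairs of interacting loci, respectively. BOOST and BiForce controlled the type I error and had acceptable power when detecting interactions; their performance was identical, suggesting that "screen first, then test" is a reasonable approach to dimension reduction. SIXPAC_lod, which codes each locus as binary, showed a type I error inflated to about 15% only when detecting multiple interaction pairs, yet its power was always higher than that of BOOST and BiForce. This suggests that when the sample size is small, loci can be coded as binary for initial screening and tested afterwards. BOOST and BiForce allow more flexible locus coding than SIXPAC_lod, so in practice we recommend applying these two packages flexibly as conditions dictate. AntEpiSeeker_raw and TEAM controlled the type I error for loci with no effect at all; whenever a locus had a main effect or an interaction effect, both methods had high power, which makes them suitable for filtering out noise loci. Simulation 3 showed that BOOST and BiForce are computationally fast and can complete detection in a short time.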
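     Part II: Improving entropy-based interaction analysis. Building on information theory, we propose the iterative entropy epistasis (IEE) method for detecting high-order interactions; it accommodates different linkage disequilibrium (LD) structures among loci. From both a methodological perspective (Simulation 4) and a practical perspective (Simulation 5), whether detecting first-order or high-order interactions, IEE controlled the type I error about as well as the log-linear model but had higher power. IEE was also faster than the log-linear model. Simulation 6 showed that relaxing the convergence precision of the IEE iterations increases speed further. For first-order interactions and for interactions of second order or higher, IEE maintained its original type I error and power with 25% and 50% of the original number of iterations, respectively, which corresponds to running about 4 times and 2 times as fast.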
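     Part III: A dimension-reduction strategy for high-order interaction analysis. We propose the KIL strategy: initial screening with KSA, secondary screening with IEE, then testing with logistic regression. Simulation 7 showed that, under all conditions examined, the KSA statistic was never lower than the IEE statistic and KSA was the fastest to compute, which meets the requirements of rapid initial screening; IEE was faster than logistic regression and is suited to screening high-dimensional data. Simulation 8 showed that, compared with logistic regression alone, the KIL strategy controlled the type I error, largely preserved power (on average more than 92% of the power of logistic regression), and reduced the computational burden (to only 30%-40% of the original).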
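     Part IV: Data mining in a lung cancer GWAS. We applied the resulting strategy to detect interactions at the genome-wide level in real GWAS data on lung cancer in a Chinese population.
     (1) Gene-gene interaction analysis. We used a three-stage case-control design. The first stage was the GWAS discovery stage; the second and third stages were independent replication stages. The total sample size was 13,392 (6,377 cases and 7,015 controls), covering 591,370 loci. In the GWAS discovery stage, the KIL strategy yielded 4 pairs of potentially interacting loci. The pair rs2562796-rs16832404 was successfully replicated. In the GWAS discovery stage, the interaction OR=2.58, 95% CI=2.24-2.97, P=1.37×10⁻³⁹; in the first replication, the interaction OR=1.17, 95% CI=0.99-1.38, P=6.37×10⁻²; in the second replication, the interaction OR=1.21, 95% CI=1.06-1.38, P=4.61×10⁻³. In the combined sample, the interaction OR=1.33, 95% CI=1.23-1.43, P=1.03×10⁻¹³. In analyses stratified by age, sex, smoking, and other factors, the interaction remained statistically significant in the different subpopulations. Genotype imputation analysis showed a cluster of interaction signals near the region containing the loci.
     (2) Gene-environment interaction analysis. We used a two-stage case-control design. The samples were those of the first and second stages in item (1), 8,440 subjects in total (3,865 cases and 4,575 controls). The GWAS discovery stage yielded 6 loci with a potential interaction with smoking, of which rs1316298 and rs4589502 were successfully replicated. In the GWAS discovery stage, the P values for the interaction of rs1316298 and rs4589502 with smoking were 4.15×10⁻⁵ and 2.61×10⁻⁵, respectively. In the replication stage, the interaction P values were 8.87×10⁻⁴ and 4.40×10⁻², respectively. rs1316298 has an antagonistic interaction with smoking, whereas rs4589502 has a synergistic interaction with smoking; in the combined sample the P values were 6.73×10⁻⁶ and 3.84×10⁻⁶, respectively. Genotype imputation analysis showed clusters of interaction signals in the regions near both loci.
     (3) Biological pathway gene enrichment analysis. We carried out dimension-reduced interaction analysis using the biological pathway as the functional unit, with a two-stage case-control design. The first stage was the Nanjing GWAS substudy, used to screen pathways; the second stage was the Beijing GWAS substudy, used to validate them. There were 5,408 subjects in total (2,331 cases and 3,077 controls). From 368 pathways in the KEGG (Kyoto Encyclopedia of Genes and Genomes) and BioCarta pathway databases, screening and validation yielded 4 biological pathways. Their results in the combined sample were: achPathway (P=0.012), At1rPathway (P=0.022), metPathway (P=0.010), and rac1Pathway (P=0.005). Sensitivity analysis showed that the association results for the 4 pathways were fairly stable. We retained the genes enriched in these pathways together with their representative loci, and then tested for gene-gene and gene-smoking interactions within each of the 4 pathways, obtaining 1 pair of interacting loci (rs17057065, rs17194885). The interaction P values in the Nanjing substudy, the Beijing substudy, and the combined sample were 4.98×10⁻², 4.42×10⁻², and 4.69×10⁻³, respectively.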
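     The simulations and the real-data validation together indicate that KIL is an effective dimension-reduction strategy for interaction analysis. Genes and environment influence each other and jointly contribute to lung cancer risk.
     The main innovations of this thesis are:
     (1) Systematic evaluation of methods. We systematically evaluated the type I error and power of 10 interaction analysis methods under a range of conditions, and explored the strengths, weaknesses, and conditions of use of each, providing a reference for choosing methods in real data analysis.
     (2) A new screening method. We proposed a new method for high-order interaction analysis (IEE), and evaluated its statistical properties under a range of conditions as well as the effect of iteration precision on those properties. IEE can serve as a tool for rapid large-scale screening.
     (3) A dimension-reduction strategy. We proposed the KIL strategy for dimension reduction in high-order interaction analysis and evaluated its soundness and effectiveness.
     (4) Theory guiding application. In real GWAS data on lung cancer in a Chinese population, we carried out, for the first time, genome-wide gene-gene and gene-environment interaction analysis together with dimension-reduced interaction analysis using the biological pathway as the functional unit, providing statistical evidence for subsequent research on the mechanisms of lung cancer.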
Despite the great success of genome-wide association studies (GWAS) since 2005, the identified single nucleotide polymorphisms (SNPs) with main effects account for only a small proportion of the genetic variation in complex diseases. Both external factors (environmental exposure) and internal factors (genetic mutation) contribute to complex diseases. Neglecting gene-environment and/or gene-gene interaction is one of the most important reasons for missing heritability in GWAS.
     Hundreds of thousands of SNPs are available in GWAS nowadays. Owing to the complexity of statistical algorithms and/or the limited computation speed of software, traditional methods for interaction analysis are not suitable for high-dimensional data. Many novel methods for GWAS interaction analysis have been proposed since 2007, each with its own advantages and disadvantages. Meanwhile, there is no dedicated method for high-order interaction analysis. Thus, firstly, we carried out a systematic comparative analysis of ten representative methods. Secondly, we proposed a novel method for high-order interaction analysis. Thirdly, we proposed a three-step strategy that reduces the dimensionality of the data when detecting high-order interactions. Finally, we applied the proposed strategy to a real GWAS dataset for genome-wide epistasis analysis. The thesis is organized as follows.
     In Section 1, based on a literature review, we carried out a systematic comparative analysis of ten methods in seven software packages: BOOST, BiForce, iLOCi, SIXPAC_D, SIXPAC_R, SIXPAC_lod, SNPRuler, AntEpiSeeker_pruned, AntEpiSeeker_raw and TEAM. Simulation 1 and Simulation 2 were designed to detect a single epistatic pair and more than one epistatic pair, respectively. Both simulations indicate that two methods (BOOST and BiForce) can be recommended for interaction analysis, since they control the type I error and have acceptable power. BOOST performs identically to BiForce, indicating that "screening before testing" is a reasonable approach to dimension reduction. SIXPAC_lod only supports datasets in which SNPs are coded under a dominant or recessive genetic model. Its type I error was inflated up to 15% when detecting more than one epistatic pair. However, it has higher power than BOOST or BiForce in all scenarios, indicating that SNPs should be coded under a dominant or recessive genetic model when the sample size is limited. BOOST and BiForce are flexible in SNP coding (additive, dominant or recessive). Thus, we recommend these two as the best methods for GWAS interaction analysis. Both simulations show that two other methods (AntEpiSeeker_raw and TEAM) perform best in filtering out noise SNPs. They control the type I error for noise SNPs, and have high power to detect SNPs whenever a main effect or an interaction effect exists. In Simulation 3, BOOST and BiForce are the fastest tools. They can finish an exhaustive search for epistasis on a genome-wide scale in a few days.
     In Section 2, we proposed a new method, iterative entropy epistasis (IEE), within an information-theoretic framework. IEE is suited to detecting high-order interactions whatever linkage disequilibrium (LD) structure exists among the SNPs. Simulation 4 and Simulation 5 were designed to evaluate the performance of IEE from the methodological and the practical perspective, respectively. Intensive simulations indicate that IEE controls the type I error at the nominal level and exhibits higher power than the log-linear model and other entropy-based methods. Additionally, IEE with fewer iterations executes faster than the log-linear model. The lower the iteration accuracy required of IEE, the faster it runs. In Simulation 6, we found that IEE maintained its original performance at 25% and 50% of the original iteration accuracy when detecting first-order and high-order interactions, respectively. Thus, the calculation speed was improved 4-fold and 2-fold, respectively.
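To make the information-theoretic framing concrete, the following is a minimal sketch of the classical interaction information for two SNPs and a disease status, computed from plug-in entropies. It is not the IEE method itself, whose iterative estimation the abstract does not specify; the function names `entropy` and `interaction_information` and the toy genotype lists are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Plug-in Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def interaction_information(a, b, y):
    """I(A,B;Y) - I(A;Y) - I(B;Y), written in terms of joint entropies.

    Positive values mean the pair (A, B) carries information about Y
    beyond the sum of the two single-locus contributions.
    """
    ab, ay, by = list(zip(a, b)), list(zip(a, y)), list(zip(b, y))
    aby = list(zip(a, b, y))
    return (entropy(ab) + entropy(ay) + entropy(by)
            - entropy(a) - entropy(b) - entropy(y) - entropy(aby))


# Toy data (hypothetical): two binary-coded loci.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
pure_epistasis = [0, 1, 1, 0]   # status depends only on the combination
main_effect_only = [0, 0, 1, 1]  # status depends on snp_a alone

ii_epistasis = interaction_information(snp_a, snp_b, pure_epistasis)
ii_main = interaction_information(snp_a, snp_b, main_effect_only)
```

In the toy data the purely epistatic status yields one bit of interaction information, while the main-effect-only status yields none, which is the contrast an entropy-based screen is meant to pick up.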
     In Section 3, we proposed a three-step strategy for high-order interaction analysis in GWAS. The first step is fast screening using the Kirkwood superposition approximation (KSA), which filters out a large proportion of noise SNPs. The second step is testing using IEE, which removes further false-positive results. The final step is confirmation using a logistic regression model, which provides the statistical significance of the interactions. The strategy is referred to as KIL. Simulation 7 indicates that the KSA statistics are no less than those of IEE, and that KSA is the fastest compared with IEE and the logistic regression model. Thus, KSA qualifies for fast screening without missing potential positive interactions. IEE is faster than the logistic regression model, and is suited to screening for epistasis in high-dimensional data. In Simulation 8, KIL reduced the computational burden to as low as 30%-40% of the original. Meanwhile, it retained on average more than 92% of the power of the logistic regression model. Compared with KSA and the logistic regression model alone, the integrated strategy is able to control the type I error and largely preserves power.
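The three-step funnel described above can be sketched as follows. This is only an illustration of the control flow under stated assumptions: the scoring callables, the thresholds, and the name `kil_funnel` are hypothetical placeholders, and the toy scores stand in for the KSA statistic, the IEE statistic, and the logistic regression P value, none of which are specified in the abstract.

```python
from itertools import combinations


def kil_funnel(snps, screen, refine, confirm, screen_cut, refine_cut, alpha):
    """Cheap screen -> finer score -> formal test, counting calls per stage.

    screen, refine: callables returning a score (larger = stronger signal).
    confirm: callable returning a P value from the final, expensive test.
    """
    counts = {"screen": 0, "refine": 0, "confirm": 0}
    hits = []
    for pair in combinations(snps, 2):
        counts["screen"] += 1
        if screen(pair) < screen_cut:
            continue  # dropped by the fast first-stage filter
        counts["refine"] += 1
        if refine(pair) < refine_cut:
            continue  # dropped by the second-stage filter
        counts["confirm"] += 1
        p_value = confirm(pair)
        if p_value < alpha:
            hits.append((pair, p_value))
    return hits, counts


# Toy scores (hypothetical): one true interacting pair among six loci.
signal = (1, 4)


def toy_refine(pair):
    return 1.0 if pair == signal else 0.05 * (pair[0] + pair[1])


def toy_screen(pair):
    # Never below the second-stage score, mirroring the reported property
    # that the KSA statistic is no less than the IEE statistic.
    return toy_refine(pair) + 0.1


def toy_confirm(pair):
    return 1e-6 if pair == signal else 0.5


hits, counts = kil_funnel(range(6), toy_screen, toy_refine, toy_confirm,
                          screen_cut=0.42, refine_cut=0.42, alpha=0.05)
```

Because the first-stage score never falls below the second-stage score, using the same cut-off in both stages means the fast screen cannot discard a pair that the second stage would have kept; the `counts` dictionary shows how few pairs reach the expensive final test.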
     In Section 4, we performed the first exhaustive search for gene-gene and gene-smoking interactions, as well as biological pathways, in a GWAS of lung cancer in Han Chinese populations.
     (1) Gene-gene interaction analysis. We adopted a three-stage case-control design. The first stage is the GWAS discovery stage. The second and third are replication stages. In total, 13,392 subjects (6,377 cases and 7,015 controls) were collected, with 591,370 genotyped SNPs. Four pairs of epistatic loci were screened out using the KIL strategy. Among them, only rs2562796-rs16832404 was successfully validated in the two independent replication stages. In the discovery stage, the interaction OR=2.58, 95% CI=2.24-2.97, P=1.37×10⁻³⁹. In replication 1, the interaction OR=1.17, 95% CI=0.99-1.38, P=6.37×10⁻². In replication 2, the interaction OR=1.21, 95% CI=1.06-1.38, P=4.61×10⁻³. In the combined dataset of the three stages, the interaction OR=1.33, 95% CI=1.23-1.43, P=1.03×10⁻¹³. We also carried out stratified analysis by age, gender, smoking, etc. The identified epistatic pair remains significant in the sub-populations. Additionally, we observed a cluster of interaction signals in genotype imputation analysis.
     (2) Gene-environment interaction analysis. We adopted a two-stage case-control design. The populations are the same as those of the first two stages mentioned above. In total, we used 8,440 subjects (3,865 cases and 4,575 controls). Six SNPs showed potential interaction with smoking in the GWAS discovery stage. Only two SNPs (rs1316298 and rs4589502) were successfully validated in the replication stage. In the discovery stage, the interaction P values for rs1316298 and rs4589502 are 4.15×10⁻⁵ and 2.61×10⁻⁵, respectively. In the replication stage, the interaction P values are 8.87×10⁻⁴ and 4.40×10⁻², respectively. SNP rs1316298 has an antagonistic interaction with smoking, whereas rs4589502 has a synergistic interaction with smoking. The interaction P values are 6.73×10⁻⁶ and 3.84×10⁻⁶, respectively, in the combined dataset of the two stages. In genotype imputation analysis, we also observed a cluster of SNPs in high or low LD with these identified SNPs, which interact with smoking in contributing to lung cancer risk.
     (3) Biological pathway analysis. We carried out epistasis analysis based on biological pathway information. The GWAS of lung cancer is composed of two independent studies: the Nanjing study and the Beijing study. We performed an exhaustive search for pathways based on the KEGG (Kyoto Encyclopedia of Genes and Genomes) and BioCarta databases in the Nanjing study. The significant pathways were then replicated in the Beijing study. As a result, four pathways (achPathway, At1rPathway, metPathway and rac1Pathway) were successfully validated, with P values of 0.012, 0.022, 0.010 and 0.005, respectively, in the combined data of the two studies. Sensitivity analysis was performed using a different SNP-to-gene mapping strategy or removing the genes that overlap among the four pathways. The results indicated that our findings were robust. Then, we performed an exhaustive search for interactions among the representative SNPs in each pathway. We identified only one epistatic pair (rs17057065-rs17194885). The interaction P values were 4.98×10⁻², 4.42×10⁻² and 4.69×10⁻³ in the Nanjing study, the Beijing study and the combined GWAS, respectively.
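As a rough consistency check on the figures reported in item (1), an odds ratio, its 95% confidence interval, and its P value can be related through a Wald test on the log odds ratio. The sketch below assumes a symmetric normal-approximation interval on the log scale and a two-sided test; the abstract does not state that the P values were obtained this way, and the helper name `wald_from_ci` is illustrative.

```python
import math


def wald_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover the standard error, z statistic and two-sided P value
    from an odds ratio and its confidence interval.

    Assumes the interval is exp(log(OR) +/- z_crit * SE).
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail
    return se, z, p


# Reported discovery-stage and combined estimates for rs2562796-rs16832404.
se_disc, z_disc, p_disc = wald_from_ci(2.58, 2.24, 2.97)
se_comb, z_comb, p_comb = wald_from_ci(1.33, 1.23, 1.43)
```

Under these assumptions the recovered P values fall in the same order of magnitude as the reported ones; exact agreement is not expected because the odds ratios and interval limits are rounded to two decimals.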
     Simulation experiments and real data analysis provide evidence that KIL is an effective and efficient way to detect epistasis in GWAS. Both environmental exposure and genetic mutation contribute interactively to lung cancer risk.
     This study is highlighted by the four innovations below:
     (1) We carried out a systematic comparative analysis of ten methods to evaluate their statistical performance. Our findings provide evidence for the selection of appropriate methods in GWAS interaction analysis.
     (2) We proposed a novel high-order interaction analysis method, IEE. It is robust even at 50% of the original iteration accuracy, and qualifies as a fast-screening method for interaction analysis of high-dimensional data.
     (3) We proposed a three-stage KIL strategy for high-order interaction analysis. It is effective in both statistical performance and computation speed for dimension reduction of high-dimensional data.
     (4) We performed the first exhaustive search for gene-gene and gene-environment interactions, as well as biological pathways, in a GWAS of lung cancer in the Han Chinese population. Our findings may provide novel insight into the multifactorial etiology of lung cancer.
    123. Egleton RD, Brown KC, Dasgupta P (2008) Nicotinic acetylcholine receptors in cancer: multiple roles inproliferation and inhibition of apoptosis. Trends Pharmacol Sci29:151-158.
    124. Dasgupta P, Rastogi S, Pillai S, Ordonez-Ercan D, Morris M, et al.(2006) Nicotine induces cell proliferationby beta-arrestin-mediated activation of Src and Rb-Raf-1pathways. J Clin Invest116:2208-2217.
    125. Miki D, Kubo M, Takahashi A, Yoon KA, Kim J, et al.(2010) Variation in TP63is associated with lungadenocarcinoma susceptibility in Japanese and Korean populations. Nat Genet42:893-896.
    126. Hsiung CA, Lan Q, Hong YC, Chen CJ, Hosgood HD, et al.(2010) The5p15.33locus is associated with riskof lung adenocarcinoma in never-smoking females in Asia. PLoS Genet6.
    127. Landi MT, Chatterjee N, Yu K, Goldin LR, Goldstein AM, et al.(2009) A genome-wide association study oflung cancer identifies a region of chromosome5p15associated with risk for adenocarcinoma. Am J HumGenet85:679-691.
    128. Imai N, Hashimoto T, Kihara M, Yoshida S, Kawana I, et al.(2007) Roles for host and tumor angiotensin IItype1receptor in tumor growth and tumor-associated angiogenesis. Lab Invest87:189-198.
    129. Pickel L, Matsuzuka T, Doi C, Ayuzawa R, Maurya DK, et al.(2010) Overexpression of angiotensin II type2receptor gene induces cell death in lung adenocarcinoma cells. Cancer Biol Ther9.
    130. Cui J, Zhang M, Zhang YQ, Xu ZH (2007) JNK pathway: diseases and therapeutic potential. Acta PharmacolSin28:601-608.
    131. Wagner EF, Nebreda AR (2009) Signal integration by JNK and p38MAPK pathways in cancer development.Nat Rev Cancer9:537-549.
    132. To CT, Tsao MS (1998) The roles of hepatocyte growth factor/scatter factor and met receptor in humancancers (Review). Oncol Rep5:1013-1024.
    133. Jeffers M, Schmidt L, Nakaigawa N, Webb CP, Weirich G, et al.(1997) Activating mutations for the mettyrosine kinase receptor in human cancer. Proc Natl Acad Sci U S A94:11445-11450.
    134. Maulik G, Shrikhande A, Kijima T, Ma PC, Morrison PT, et al.(2002) Role of the hepatocyte growth factorreceptor, c-Met, in oncogenesis and potential for therapeutic inhibition. Cytokine Growth Factor Rev13:41-59.
    135. Puri N, Salgia R (2008) Synergism of EGFR and c-Met pathways, cross-talk and inhibition, in non-small celllung cancer. J Carcinog7:9.
    136. Bar-Sagi D, Hall A (2000) Ras and Rho GTPases: a family reunion. Cell103:227-238.
    137. Vivanco I, Sawyers CL (2002) The phosphatidylinositol3-Kinase AKT pathway in human cancer. Nat RevCancer2:489-501.
    138. Funato Y, Terabayashi T, Suenaga N, Seiki M, Takenawa T, et al.(2004) IRSp53/Eps8complex is importantfor positive regulation of Rac and cancer cell motility/invasiveness. Cancer Res64:5237-5244.
    139. Yu J, Huang NF, Wilson KD, Velotta JB, Huang M, et al.(2009) nAChRs mediate human embryonic stemcell-derived endothelial cells: proliferation, apoptosis, and angiogenesis. PLoS One4: e7040.
    140. Paliwal A, Vaissiere T, Krais A, Cuenin C, Cros MP, et al.(2010) Aberrant DNA methylation links cancersusceptibility locus15q25.1to apoptotic regulation and lung cancer. Cancer Res70:2779-2788.
    141. Cucina A, Fuso A, Coluccia P, Cavallaro A (2008) Nicotine inhibits apoptosis and stimulates proliferation inaortic smooth muscle cells through a functional nicotinic acetylcholine receptor. J Surg Res150:227-235.
    142. Giaccone G, Zucali PA (2008) Src as a potential therapeutic target in non-small-cell lung cancer. Ann Oncol19:1219-1223.
    143. Rothschild SI, Gautschi O, Haura EB, Johnson FM (2010) Src inhibitors in lung cancer: current status andfuture directions. Clin Lung Cancer11:238-242.
    144. Berger. A (1997) The improved iterative scaling algorithm: A gentle introduction.http://www.cs.cmu.edu/~aberger/pdf/scaling.pdf.
    145. Nocedal. J (1980) Updating Quasi-Newton Matrices with Limited Storage Mathematics of Computation35:773-782.
    146. Malouf. R, Groningen. R (2002) A comparison of algorithms for maximum entropy parameter estimation. InProceedings of the Sixth Conference on Natural Language Learning:49-55.
    147. Yung LS, Yang C, Wan X, Yu W (2011) GBOOST: a GPU-based tool for detecting gene-gene interactions ingenome-wide case control studies. Bioinformatics27:1309-1310.
    148. Murcray CE, Lewinger JP, Gauderman WJ (2009) Gene-environment interaction in genome-wide associationstudies. Am J Epidemiol169:219-226.
    149. Li S, Cui Y (2012) Gene-centric gene-gene interaction: A model-based kernel machine method. Ann ApplStat6:1134-1161.
    150. He J, Wang K, Edmondson AC, Rader DJ, Li C, et al.(2011) Gene-based interaction analysis byincorporating external linkage disequilibrium information. Eur J Hum Genet19:164-172.
    151. Peng Q, Zhao J, Xue F (2010) A gene-based method for detecting gene-gene co-association in a case-controlassociation study. Eur J Hum Genet18:582-587.
    152. Yu K, Wacholder S, Wheeler W, Wang Z, Caporaso N, et al.(2012) A flexible Bayesian model for studyinggene-environment interaction. PLoS Genet8: e1002482.
    153. Wang K, Dickson SP, Stolle CA, Krantz ID, Goldstein DB, et al.(2010) Interpretation of association signalsand identification of causal variants from genome-wide association studies. Am J Hum Genet86:730-742.
    154. Gibson G (2011) Rare and common variants: twenty arguments. Nat Rev Genet13:135-145.
    155. Ma S, Dai Y (2011) Principal component analysis based methods in bioinformatics studies. Brief Bioinform12:714-722.
    156. Chen X, Wang L, Smith JD, Zhang B (2008) Supervised principal component analysis for gene setenrichment of microarray data with continuous or survival outcomes. Bioinformatics24:2474-2481.
    157. Liu Z, Chen D, Bensmail H (2005) Gene expression data classification with Kernel principal componentanalysis. J Biomed Biotechnol2005:155-159.
    158. Liu D, Ghosh D, Lin X (2008) Estimation and testing for the effect of a genetic pathway on a diseaseoutcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics9:292.
    159. Wang K, Li M, Hakonarson H (2010) Analysing biological pathways in genome-wide association studies. NatRev Genet11:843-854.
    160. Nelder J, R. W (1972) Generalized linear models. Journal of the Royal Statistical Society Series A (General):370-384.
    161. Cox RD (1970) Analysis of Binary Data. London: Methuen and Co., Ltd.
    162. Birch WM (1963) Maximum likelihood in three-way contingency tables. Journal of the Royal StatisticalSociety Series B (Methodological):220-233.
    163. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, et al.(2001) Multifactor-dimensionality reductionreveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet69:138-147.
    164. Nelson MR, Kardia SL, Ferrell RE, Sing CF (2001) A combinatorial partitioning method to identifymultilocus genotypic partitions that predict quantitative trait variation. Genome Res11:458-470.
    165. Culverhouse R (2007) The use of the restricted partition method with case-control data. Hum Hered63:93-100.
    166. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, et al.(2006) A flexible computational framework fordetecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human diseasesusceptibility. J Theor Biol241:252-261.
    167. Zheng T, Wang H, Lo SH (2006) Backward genotype-trait association (BGTA)-based dissection of complextraits in case-control designs. Hum Hered62:196-212.
    168. Zhang X, Zou F, Wang W (2008) FastANOVA: an Efficient Algorithm for Genome-Wide Association Study.KDD:821-829.
    169. Zhang X, Zou F, Wang W (2009) FastChi: an efficient algorithm for analyzing gene-gene interactions. PacSymp Biocomput:528-539.
    170. Zhang X, Pan F, Xie Y, Zou F, Wang W (2010) COE: a general approach for efficient genome-wide two-locusepistasis test in disease association study. J Comput Biol17:401-415.
    171. Millstein J, Conti DV, Gilliland FD, Gauderman WJ (2006) A testing framework for identifying susceptibilitygenes in the presence of epistasis. Am J Hum Genet78:15-27.
    172. Jiang X, Barmada MM, Visweswaran S (2010) Identifying genetic interactions in genome-wide data usingBayesian networks. Genet Epidemiol34:575-581.
    173. Marchini J, Donnelly P, Cardon LR (2005) Genome-wide strategies for detecting multiple loci that influencecomplex diseases. Nat Genet37:413-417.
    174. Motsinger AA, Reif DM, Dudek SM, Ritchie MD (2006) Understanding the Evolutionary Process ofGrammatical Evolution Neural Networks for Feature Selection in Genetic Epidemiology. Proc IEEE SympComput Intell Bioinforma Comput Biol2006:1-8.
    175. Cook NR, Zee RY, Ridker PM (2004) Tree and spline based association analysis of gene-gene interactionmodels for ischemic stroke. Stat Med23:1439-1453.
    176. Kooperberg C, Ruczinski I (2005) Identifying interacting SNPs using Monte Carlo logic regression. GenetEpidemiol28:157-170.
    177. Wolf BJ, Hill EG, Slate EH (2010) Logic Forest: an ensemble classifier for discovering logical combinationsof binary markers. Bioinformatics26:2183-2189.
    178. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Belmont: CA:Wadsworth.
    179. Chen X, Liu CT, Zhang M, Zhang H (2007) A forest-based approach to identifying gene and gene geneinteractions. Proc Natl Acad Sci U S A104:19199-19203.
    180. Park MY, Hastie T (2008) Penalized logistic regression for detecting gene interactions. Biostatistics9:30-50.
    181. Long Q, Zhang Q, Ott J (2009) Detecting disease-associated genotype patterns. BMC Bioinformatics10Suppl1: S75.
    182. Moore JH, White BC (2007) Tuning ReliefF for genome-wide genetic analysis. Proceedings of the5thEuropean conference on Evolutionary computation, machine learning and data mining in bioinformatics.Valencia, Spain: Springer-Verlag. pp.166-175.
    183. Hahn LW, Ritchie MD, Moore JH (2003) Multifactor dimensionality reduction software for detectinggene-gene and gene-environment interactions. Bioinformatics19:376-382.
    184. Hosmer DW, Lemeshow S (2000) Applied logistic regression. New York: John Wiley&Sons.
    185. Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, et al.(2007) A balanced accuracy function forepistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol31:306-315.
    186. Pedro D. MetaCost: A General Method for Making Classifiers Cost-Sensitive;1999. pp.155-164.
    187. Greene CS, Sinnott-Armstrong NA, Himmelstein DS, Park PJ, Moore JH, et al.(2010) Multifactordimensionality reduction for graphics processing units enables genome-wide testing of epistasis in sporadicALS. Bioinformatics26:694-695.
    188. Oh S, Lee J, Kwon MS, Weir B, Ha K, et al.(2012) A novel method to identify high order gene-geneinteractions in genome-wide association studies: gene-based MDR. BMC Bioinformatics13Suppl9: S5.
    189. Herold C, Steffens M, Brockschmidt FF, Baur MP, Becker T (2009) INTERSNP: genome-wide interactionanalysis guided by a priori information. Bioinformatics25:3275-3281.
    190. Wan X, Yang C, Yang Q, Xue H, Tang NL, et al.(2010) Detecting two-locus associations allowing forinteractions in genome-wide association studies. Bioinformatics26:2517-2525.
    191. Hu X, Liu Q, Zhang Z, Li Z, Wang S, et al.(2010) SHEsisEpi, a GPU-enhanced genome-wide SNP-SNPinteraction scanning algorithm, efficiently reveals the risk genetic epistasis in bipolar disorder. Cell Res20:854-857.
    192. Schupbach T, Xenarios I, Bergmann S, Kapur K (2010) FastEpistasis: a high performance computing solutionfor quantitative trait epistasis. Bioinformatics26:1468-1469.
    193. Kam-Thong T, Putz B, Karbalai N, Muller-Myhsok B, Borgwardt K (2011) Epistasis detection onquantitative phenotypes by exhaustive enumeration using GPUs. Bioinformatics27: i214-221.
    194. Wellek S, Ziegler A (2009) A genotype-based approach to assessing the association between single nucleotidepolymorphisms. Hum Hered67:128-139.
    195. Breiman L (2001) Random forests. Mach Learn45:5-32.
    196. Sun YV (2010) Multigenic modeling of complex disease by random forests. Adv Genet72:73-99.
    197. Jiang R, Tang W, Wu X, Fu W (2009) A random forest approach to the detection of epistatic interactions incase-control studies. BMC Bioinformatics10Suppl1: S65.
    198. Sun YV, Cai Z, Desai K, Lawrance R, Leff R, et al.(2007) Classification of rheumatoid arthritis status withcandidate gene and genome-wide single-nucleotide polymorphisms using random forests. BMC Proc1Suppl1: S62.
    199. Nicodemus KK, Malley JD, Strobl C, Ziegler A (2010) The behaviour of random forest permutation-basedvariable importance measures under predictor correlation. BMC Bioinformatics11:110.
    200. Schwarz DF, Konig IR, Ziegler A (2010) On safari to Random Jungle: a fast implementation of RandomForests for high-dimensional data. Bioinformatics26:1752-1758.
    201. Lin HY, Chen YA, Tsai YY, Qu X, Tseng TS, et al.(2012) TRM: a powerful two-stage machine learningapproach for identifying SNP-SNP interactions. Ann Hum Genet76:53-62.
    202. Yoshida M, Koike A (2011) SNPInterForest: a new method for detecting epistatic interactions. BMCBioinformatics12:469.
    203. Zhao Y, Chen F, Zhai R, Lin X, Wang Z, et al.(2012) Correction for population stratification in random forestanalysis. Int J Epidemiol.
    204. Zhang Y, Liu JS (2007) Bayesian inference of epistatic interactions in case-control studies. Nat Genet39:1167-1173.
    205. Peng T, Du P, Li Y (2009) PBEAM: a parallel implementation of BEAM for genome-wide inference ofepistatic interactions. Bioinformation3:349-351.
    206. Jun SL (2001) Monte Carlo Strategies in Scientific Computing. New York.: Springer.
    207. Tang W, Wu X, Jiang R, Li Y (2009) Epistatic module detection for case-control studies: a Bayesian modelwith a Gibbs sampling strategy. PLoS Genet5: e1000464.
    208. Zhang Y, Zhang J, Liu JS (2011) BLOCK-BASED BAYESIAN EPISTASIS ASSOCIATION MAPPINGWITH APPLICATION TO WTCCC TYPE1DIABETES DATA. Ann Appl Stat5:2052-2077.
    209. Zhang Y (2012) A novel bayesian graphical model for genome-wide multi-SNP association mapping. GenetEpidemiol36:36-47.
    210. Wan X, Yang C, Yang Q, Xue H, Tang NL, et al.(2009) MegaSNPHunter: a learning approach to detectdisease predisposition SNPs and high level interactions in genome wide association study. BMCBioinformatics10:13.
    211. Schapire RE (1999) Theoretical views of boosting. Computational Learning Theory: Fourth EuropeanConference, EuroCOLT. pp.1-10.
    212. Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Annalsof Statistics2000. pp.337-374.
    213. Friedman JH, Popescu BE (2008) Predictive learning via rule ensembles. Ann Appl Stat2:916-954.
    214. Miller DJ, Zhang Y, Yu G, Liu Y, Chen L, et al.(2009) An algorithm for learning maximum entropyprobability models of disease risk that efficiently searches and sparingly encodes multilocus genomicinteractions. Bioinformatics25:2478-2485.
    215. McKinney BA, Crowe JE, Guo J, Tian D (2009) Capturing the spectrum of interaction effects in geneticassociation studies by simulated evaporative cooling network analysis. PLoS Genet5: e1000432.
    216. Armitage P (1955) Tests for linear trends in proportions and frequencies. Biometrics11:375-386.
    217. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, et al.(2006) From genomics to chemicalgenomics: new developments in KEGG. Nucleic Acids Res34: D354-357.
    218. Dorigo M, Gambardella LM (1997) Ant colonies for the travelling salesman problem. Biosystems43:73-81.
    219. Wang Y, Liu G, Feng M, Wong L (2011) An empirical comparison of several recent epistatic interactiondetection methods. Bioinformatics27:2936-2943.
    220. Wang H, Lo SH, Zheng T, Hu I (2012) Interaction-based feature selection and classification forhigh-dimensional biological data. Bioinformatics28:2834-2842.
    221. Freund Y, Schapire R (1997) A decision-theoretic generalization of online learning and an application toboosting. J Comput Sys Sci55:119-139.
    222. Wang Z, Wang Y, Tan KL, Wong L, Agrawal D (2011) eCEO: an efficient Cloud Epistasis cOmputing modelin genome-wide association study. Bioinformatics27:1045-1051.
    223. Quinlan JR (1987) Simplifying decision trees. International Journal of Man-Machine Studies27:221-234.
    224. Gold K, Petrosino A (2010) Using information gain to build meaningful decision forests for multilabelclassification. Development and Learning (ICDL),2010IEEE9th International Conference on:58-63.
    225. Hu T, Sinnott-Armstrong NA, Kiralis JW, Andrew AS, Karagas MR, et al.(2011) Characterizing geneticinteractions in human disease association studies using statistical epistasis networks. BMC Bioinformatics12:364.
    226. Fan R, Zhong M, Wang S, Zhang Y, Andrew A, et al.(2011) Entropy-based information gain approaches todetect and to characterize gene-gene and gene-environment interactions/correlations of complex diseases.Genet Epidemiol35:706-721.
    227. Wilks SS (1962) Mathematical Statistics. New York, Wiley: pp.418-418.
    228. Efron B, Tibshirani R (1993) An introduction to the bootstrap. London: Chapman&Hall.
    229. Li W, Reich J (2000) A complete enumeration and classification of two-locus disease models. Hum Hered50:334-349.
    230. Motsinger-Reif AA, Reif DM, Fanelli TJ, Ritchie MD (2008) A comparison of analytical methods for geneticassociation studies. Genet Epidemiol32:767-778.
    231. He H, Oetting WS, Brott MJ, Basu S (2009) Power of multifactor dimensionality reduction and penalizedlogistic regression for detecting gene-gene interaction in a case-control study. BMC Med Genet10:127.
    232. Sucheston L, Chanda P, Zhang A, Tritchler D, Ramanathan M (2010) Comparison of information-theoretic tostatistical methods for gene-gene interactions in the presence of genetic heterogeneity. BMC Genomics11:487.
    233. Chanda P, Sucheston L, Liu S, Zhang A, Ramanathan M (2009) Information-theoretic gene-gene andgene-environment interaction analysis of quantitative traits. BMC Genomics10:509.
    234. Wu J, Devlin B, Ringquist S, Trucco M, Roeder K (2010) Screen and clean: a tool for identifying interactionsin genome-wide association studies. Genet Epidemiol34:275-285.
    235. Chen L, Yu G, Langefeld CD, Miller DJ, Guy RT, et al.(2011) Comparative analysis of methods for detectinginteracting loci. BMC Genomics12:344.
    236. Shang J, Zhang J, Sun Y, Liu D, Ye D, et al.(2011) Performance analysis of novel methods for detectingepistasis. BMC Bioinformatics12:475.
    237. Kam-Thong T, Czamara D, Tsuda K, Borgwardt K, Lewis CM, et al.(2011) EPIBLASTER-fast exhaustivetwo-locus epistasis detection strategy using graphical processing units. Eur J Hum Genet19:465-471.
    238. Ma L, Runesha HB, Dvorkin D, Garbe JR, Da Y (2008) Parallel and serial computing tools for testingsingle-locus and epistatic SNP effects of quantitative traits in genome-wide association studies. BMCBioinformatics9:315.
    239. Hemani G, Theocharidis A, Wei W, Haley C (2011) EpiGPU: exhaustive pairwise epistasis scans parallelizedon consumer level graphics cards. Bioinformatics27:1462-1465.
    240. Li S, Cui Y (2012) Gene-centric gene–gene interaction: A model-based kernel machine method. Ann ApplStat6:1134-1161.