Data Mining Methods for Single Nucleotide Polymorphisms Analysis in Computational Biology.
Details
  • Author:Liu, Yang
  • Degree:Doctor
  • Year:2011
  • Advisor:Ng, Michael K.
  • Institution:Hong Kong Baptist University
  • ISBN:9781267358363
  • CBH:3510948
  • Country:China
  • Language:English
  • FileSize:1388573
  • Pages:168
Abstract
Single nucleotide polymorphisms (SNPs) are among the most common genetic variations because of their widespread distribution. They are considered abundant and invaluable markers in the human genome and are a potentially powerful tool for both genetic research and practical applications. A well-chosen set of SNPs can represent millions of common genetic variants throughout the genome and give researchers a better understanding of disease association information. Hence, the real challenge in association studies lies in carefully selecting reliable marker alleles that are most likely responsible for disease and, furthermore, in representing them in an intuitive manner. This thesis addresses these problems by first presenting two data mining algorithms for the detection, selection, comparison and analysis of SNPs associated with genetic disorders in genome-wide data. The first is a clustering method, which employs a subspace categorical clustering algorithm to compute a weight for each SNP in the group of case samples and the group of control samples, and uses the weights to identify the subsets of relevant SNPs that categorize these two groups. The second is a classification method, the shrunken centroid method, which succinctly characterizes each class (case and control) by shrinking each class centroid toward the overall centroid by a certain threshold in a categorical manner, and detects association between a disease and multiple marker genotypes based on the set of relevant SNPs selected. We also investigate the use of SNP networks for the interpretation of significant SNPs based on the detection of interacting and associated SNPs. In one aspect, SNP networks are constructed from the SNPs selected by our proposed shrunken centroid method.
The statistical software PLINK is employed to compute the pairwise SNP-SNP interactions, and pairs whose P-values fall below a defined threshold are chosen to form an undirected, unweighted SNP network. Genes involved in this SNP network are then extracted. A gene-gene similarity value is computed using GOSemSim, and gene pairs whose similarity values exceed a threshold are selected to construct gene networks. Biological relationships between these two forms of networks can then be analyzed. In the other aspect, we present a novel method to mine, model and evaluate SNP sub-networks from a complete genome-wide network, which is constructed from the SNP-SNP interactions of a logistic regression model implemented using PLINK. Then, using gene information, selected SNP seeds are employed to detect SNP sub-networks with maximal modularity. Finally, to identify the functional role of each SNP sub-network, its gene association network is constructed and the functional similarity values are calculated to show the biological relevance. We also perform a classification analysis to demonstrate how the detected SNP sub-networks can be used for disease association studies. Last, we incorporate DNA copy number variation (CNV) data derived from SNP arrays into a computational shrunken dissimilarity measure model and formalize the detection of copy number variations as a case-control classification problem. Through shrinkage, the number of CNVs relevant to disease can be determined. The corresponding SNPs and genes are then investigated, and SNP and gene networks can be constructed afterwards.
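
As an illustration of the network-construction step described in the abstract, the following is a minimal sketch, not the thesis's implementation. It assumes the pairwise interaction P-values have already been computed (for example, exported from PLINK) and are held in a dictionary. The function name `build_snp_network`, the SNP labels and the P-values are all hypothetical.

```python
from collections import defaultdict


def build_snp_network(pair_pvalues, threshold):
    """Build an undirected, unweighted network as adjacency sets.

    pair_pvalues maps a (snp_a, snp_b) tuple to an interaction P-value.
    A pair becomes an edge only if its P-value is below the threshold.
    """
    adjacency = defaultdict(set)
    for (snp_a, snp_b), p_value in pair_pvalues.items():
        if p_value < threshold:
            # Undirected: record the edge in both directions.
            adjacency[snp_a].add(snp_b)
            adjacency[snp_b].add(snp_a)
    return dict(adjacency)


# Hypothetical SNP labels and P-values, for illustration only.
pair_pvalues = {
    ("snpA", "snpB"): 1e-6,
    ("snpA", "snpC"): 0.2,
    ("snpB", "snpC"): 5e-5,
}
network = build_snp_network(pair_pvalues, threshold=1e-4)
```

Here the pair with P-value 0.2 is dropped, so `snpA` and `snpC` are connected only through `snpB`. The same thresholding pattern would apply to the gene network, with similarity values kept when they exceed a threshold rather than fall below one.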
