Data Mining Framework for Metagenome Analysis.
Details
  • Author: Rasheed, Zeehasham
  • Degree: Ph.D.
  • Year: 2013
  • Advisors: Rangwala, Huzefa (advisor); Barbara, Daniel (committee member); Kosecka, Jana (committee member); Gillevet, Patrick (committee member)
  • Institution: George Mason University
  • Department: Computer Science
  • ISBN: 9781303276316
  • CBH: 3589152
  • Country: USA
  • Language: English
  • FileSize: 2030638
  • Pages: 145
Abstract
Advances in biotechnology have dramatically changed how large populations of microbial communities, which are ubiquitous across many environments, are characterized. "Metagenomics" involves sequencing the genetic material of organisms that co-exist within ecosystems ranging from the ocean and soil to the human body. Researchers are trying to determine the collective microbial community, or population of microbes, that co-exists across different environmental and clinical samples. Several researchers and clinicians have embarked on studying the pathogenic role played by the microbiome (i.e., the collection of microbial organisms within the human body) with respect to human health and disease conditions. There is a critical need to develop new methods that can analyze metagenomes and correlate heterogeneous microbiome data to clinical metadata. The lack of such methods is an impediment to identifying the function and presence of microbial organisms within different samples, reducing our ability to elucidate microbial-host interactions and discover novel therapeutics. From another perspective, comparing metagenomes across different ecological samples allows for the characterization of biodiversity across the planet. The goal of this dissertation is to present novel data mining algorithms that allow for the accurate and efficient analysis of metagenome data obtained from different environments. Specific contributions include the development of a suite of clustering algorithms for handling large-scale targeted and whole metagenome sequences. We developed a novel locality sensitive hashing (LSH) based method for clustering metagenome sequence reads. Our method achieves efficiency by approximating the pairwise sequence comparison operations using a randomized hashing technique. We incorporate this clustering approach within a computational pipeline (LSH-Div) to estimate the species diversity within an ecological sample.
We also developed an algorithm called MC-MinH that uses the min-wise hashing approach along with a greedy clustering algorithm to group 16S and whole metagenome sequences. We represent unequal-length sequences using contiguous subsequences, or k-mers, and then approximate the computation of pairwise similarity using independent min-wise hashing. Further, MC-MinH is extended as a distributed algorithm implemented within the Map-Reduce based Hadoop platform. The distributed clustering algorithm can perform a greedy iterative clustering as well as an agglomerative hierarchical clustering, and can handle large volumes of input sequences. We also developed a novel sequence composition-based taxonomic classifier using extreme learning machines, referred to as TAC-ELM. This algorithm uses the framework of extreme learning machines to quickly and accurately learn the weights for a neural network model. TAC-ELM, when combined with BLAST (Basic Local Alignment Search Tool), has shown improved taxonomic classification results. To make these computational tools accessible to a broad group of researchers, we developed a web portal for scientific analysis. The portal implements a LIMS database using the Drupal content management system to store and retrieve the multi-modal microbiome data. For analysis and development of workflows, we use the Galaxy platform, which provides a web-based interface for integrating the computational tools to create user-customized pipelines, along with a batch-based job submission system. To summarize, this dissertation contributes methods for metagenome sequence clustering and classification that can be easily integrated within computational workflows for species diversity estimation and large-scale microbiome analysis.
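
Illustrative sketch (not taken from the dissertation): the general idea of representing sequences as k-mer sets, approximating their pairwise similarity with min-wise hashing, and grouping them greedily might look roughly like the Python below. Every name and parameter here (kmers, signature, greedy_cluster, k=4, 64 hash functions, a 0.5 threshold, seeded BLAKE2b as the hash family) is an assumption chosen for illustration; the actual MC-MinH implementation may differ in all of these respects.

```python
import hashlib


def kmers(seq, k=4):
    """Return the set of contiguous length-k subsequences (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(seq, k=4, num_hashes=64):
    """Min-wise hash signature: the minimum hash value of the k-mer set per hash function."""
    grams = kmers(seq, k)
    if not grams:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def greedy_cluster(seqs, threshold=0.5, k=4, num_hashes=64):
    """Assign each sequence to the first cluster whose representative is similar enough.

    A sequence that matches no existing representative starts a new cluster
    and becomes that cluster's representative.
    """
    representatives = []  # signature of each cluster's first member
    labels = []
    for seq in seqs:
        sig = signature(seq, k, num_hashes)
        for cluster_id, rep in enumerate(representatives):
            if estimated_similarity(sig, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            representatives.append(sig)
            labels.append(len(representatives) - 1)
    return labels
```

Because signatures have a fixed length regardless of sequence length, comparing two reads costs the same whether they are 100 or 1,000 bases long, which is the property that makes this style of approximation attractive for unequal-length sequences.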
