Selected applications in data intensive computing

Details

Author: Gao, Wenxuan
Degree: Doctor
Year: 2014
Keywords: Biological sciences; Applied sciences; Data intensiv
Advisor: Yu, Philip S.
Institution: University of Illinois
Department: Computer Science
Major: Computer Engineering; Bioinformatics; Computer science
ISBN: 9781321241389
CBH: 3639581
Country: USA
Language: English
FileSize: 2468613
Pages: 140

Abstract

As science and technology advance, large amounts of data are generated every day at an exponentially growing rate through scientific instruments, computer simulations, and many other means. Mining valuable nuggets of knowledge from such large amounts of data efficiently enough to support informed decisions is challenging. However, the development of distributed computing techniques and high-speed networks provides good opportunities to solve big data problems. In this thesis, I focus on developing data-intensive computing algorithms and applying data mining methods to analyze massive biological and medical data in cloud computing environments. There are many approaches to parallelizing an existing data mining algorithm in a cloud computing environment, and achieving better performance by manipulating data intelligently has attracted a lot of attention. In this thesis, I propose two different approaches to parallelize the existing random decision tree algorithm, which has been implemented in the Sector/Sphere cloud environment. Comparisons of cost and accuracy between these two implementations are also presented. Recently, with the development of ChIP-chip and ChIP-seq technology, huge amounts of genome-wide protein-DNA binding site data have become available for many transcription factors and chromatin regulators in many species. Previous studies have shown that the distribution of their localizations and modifications can offer novel insight into the mechanisms of regulation. Because it is strongly believed that multiple chromatin factors can work together to regulate a common target, I formally define this problem and propose a novel graph-based algorithm called Patterns of Marks (PoM) to efficiently identify these types of geometric patterns in massive genomic data.
In addition, as the amount of data grows, it becomes impossible to integrate data manually; therefore, I propose two algorithms to automatically integrate big tabular data. I also conduct an experimental study by developing a customizable lightweight web crawler to collect various data from the Internet.
