Scalable and robust clustering and visualization for large-scale bioinformatics data.

Details

Author: Ruan, Yang
Degree: Doctorate
Year: 2014
Institution: Indiana University
Department: Computer Sciences
ISBN: 9781321299816
CBH: 3642841
Country: USA
Language: English
FileSize: 4043635
Pages: 180

Abstract

During the past few decades, advances in next-generation sequencing (NGS) techniques have enabled rapid analysis of the whole genetic information within a microbial community, bypassing the culturing of individual microbial species in the lab. These techniques have led to a proliferation of raw genomic data, which creates an unprecedented opportunity for data mining. To analyze this voluminous bioinformatics data, a pipeline called DACIDR has been proposed. DACIDR adopts a taxonomy-independent approach to grouping these sequences into operational taxonomic units (OTUs), referred to as data clustering, and it enables visualization of the clustering result by applying parallelization and multidimensional scaling (MDS) techniques on large-scale computational resources. First, in order to observe the proximity of the sequences in a lower dimension, sequence alignment techniques are applied to each pair of sequences to generate similarity scores in a high dimension. These scores need to be assigned weights in order to achieve an accurate result in MDS. Therefore, a robust and scalable MDS algorithm called WDA-SMACOF is proposed to address the issues of either missing distances or a non-trivial weight function. Second, a dataset with millions of sequences is usually divided into two parts: the first is processed with MDS, which has quadratic space and time complexity, while the second is mapped approximately onto the result of the first in linear time, a step referred to as interpolation. In order to achieve real-time processing speed, a novel hierarchical approach has been proposed to further reduce the time complexity of interpolation to sub-linear. Third, a phylogenetic tree is commonly used to demonstrate the phylogeny and evolutionary path of various organisms. The traditional way of visualizing a phylogenetic tree preserves only the correlations between ancestors and their descendants. By utilizing MDS and interpolation, an algorithm called interpolative joining has been proposed to display the tree on top of the clustering, where their correlations can be intuitively observed in a 3D tree diagram called a Spherical Phylogram. The optimizations in these three steps greatly reduce the time complexity of visualizing sequence clustering while increasing its accuracy.
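
To illustrate the role of weights and missing distances mentioned in the abstract, the following is a minimal sketch of the weighted stress value that weighted MDS methods generally seek to minimize. It is not the WDA-SMACOF algorithm itself, and the function name `weighted_stress`, its parameters, and the sample data are assumptions made for illustration only. A weight of 0 marks a pairwise distance as missing, so that pair contributes nothing to the objective.

```python
import math


def weighted_stress(points, dist, weight):
    """Weighted MDS stress: sum over i < j of w_ij * (d_ij(X) - delta_ij)^2.

    points: low-dimensional coordinates, one tuple per sequence (hypothetical input)
    dist:   target dissimilarity matrix (delta); entries with weight 0 are never read
    weight: symmetric weight matrix; 0 marks a missing distance
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = weight[i][j]
            if w == 0:
                continue  # missing distance: skipped entirely
            d = math.dist(points[i], points[j])  # Euclidean distance in the embedding
            total += w * (d - dist[i][j]) ** 2
    return total


# Illustrative data only: three points in 2D, with the (0, 2) distance missing.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
delta = [[0, 5, None], [5, 0, 4], [None, 4, 0]]
weight = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

stress = weighted_stress(points, delta, weight)
```

An iterative method would repeatedly update `points` to lower this value; the sketch only evaluates the objective for a given layout.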
