All-pair comparison of billion-base genome sequences.
详细信息   
  • 作者:Li ; Haiqiong.
  • 学历:Master
  • 年:2013
  • 毕业院校:Rensselaer Polytechnic Institute
  • Department:Computer Science.
  • ISBN:9781303736568
  • CBH:1552747
  • Country:USA
  • 语种:English
  • FileSize:3127233
  • Pages:28
文摘
The LSH-ALL-PAIRS algorithm is an important method for comparison of genomic DNA sequences in order to find conserved genome features (i.e.,subsequences of genomic sequence with d bases,called d-mers) across different species (e.g.,Escherichia coli and Mycobacterium tuberculosis ). The algorithm is based on a randomized search technique called locality-sensitive hashing (LSH),which is first applied to all the d-mers. After the hashing step,d-mers with the same hash value are grouped together into a class and pair-wise comparison is performed to find similar d-mers up to a certain number of mismatched bases. Instead of performing pair-wise comparison on potentially very large number of the input d-mers,LSH-ALL-PAIRS algorithm divides the d-mers into classes and performs pair-wise comparison on each individual group. However,since the computational complexity for pair-wise comparison within each class is still O(N2),where N is the number of d-mers in each class,the sequential LSH-ALL-PAIRS algorithm cannot process very long genomic sequence with billions of bases. In this thesis,a parallel implementation of the LSH-ALL-PAIRS algorithm is proposed. The PARALLEL-LSH-ALL-PAIRS algorithm is based on the message passing interface (MPI) and runs on high performance servers,computing clusters,and supercomputers. The PARALLEL-LSH-ALL-PAIRS algorithm first uses a pool of processes to evaluate the hashing function on billions of genomic subsequences in parallel. Then,d-mers with the same hash value are grouped together and redistributed among all the processes using MPI communication. Finally,each process performs pair-wise comparisons of the assigned subsequences and outputs groups of similar pairs. Experiments show that the PARALLEL-LSH-ALL-PAIRS algorithm achieves good scalability with an increasing number of cores and increasing sizes of the input data on the RPI's IBM Blue Gene/Q supercomputer.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700