文摘
The LSH-ALL-PAIRS algorithm is an important method for comparison of genomic DNA sequences in order to find conserved genome features (i.e.,subsequences of genomic sequence with d bases,called d-mers) across different species (e.g.,Escherichia coli and Mycobacterium tuberculosis ). The algorithm is based on a randomized search technique called locality-sensitive hashing (LSH),which is first applied to all the d-mers. After the hashing step,d-mers with the same hash value are grouped together into a class and pair-wise comparison is performed to find similar d-mers up to a certain number of mismatched bases. Instead of performing pair-wise comparison on potentially very large number of the input d-mers,LSH-ALL-PAIRS algorithm divides the d-mers into classes and performs pair-wise comparison on each individual group. However,since the computational complexity for pair-wise comparison within each class is still O(N2),where N is the number of d-mers in each class,the sequential LSH-ALL-PAIRS algorithm cannot process very long genomic sequence with billions of bases. In this thesis,a parallel implementation of the LSH-ALL-PAIRS algorithm is proposed. The PARALLEL-LSH-ALL-PAIRS algorithm is based on the message passing interface (MPI) and runs on high performance servers,computing clusters,and supercomputers. The PARALLEL-LSH-ALL-PAIRS algorithm first uses a pool of processes to evaluate the hashing function on billions of genomic subsequences in parallel. Then,d-mers with the same hash value are grouped together and redistributed among all the processes using MPI communication. Finally,each process performs pair-wise comparisons of the assigned subsequences and outputs groups of similar pairs. Experiments show that the PARALLEL-LSH-ALL-PAIRS algorithm achieves good scalability with an increasing number of cores and increasing sizes of the input data on the RPI's IBM Blue Gene/Q supercomputer.