Computational Science Laboratory
Brigham Young University Computer Science Department

Accelerated Multiple Sequence Alignment

Biologists and other researchers use multiple sequence alignment (MSA) as a fundamental analysis method to find similarities among nucleotide (DNA/RNA) or amino acid (protein) sequences. The compute time for an optimal MSA grows exponentially with respect to the number of sequences. Consequently, producing timely results on large problems requires more efficient algorithms and the use of parallel computing resources. Reconfigurable computing hardware, such as Field-Programmable Gate Arrays (FPGAs), provides one approach to the acceleration of biological sequence alignment. Other acceleration methods typically encounter scaling problems that arise from the overhead of inter-process communication and from the lack of parallelism. Reconfigurable computing allows a greater scale of parallelism using many fine-grained custom processing elements that have a low-overhead interconnect.

In this work, a new method for accelerating the third stage of progressive alignment is used that reduces subgroups of aligned sequences into discrete profiles before they are pairwise aligned on the accelerator. Our pairwise alignment algorithm produces the required traceback information and does not limit the sequence length by the number of processing elements (PEs) or by the amount of block RAM on the FPGA accelerator. Other hardware acceleration methods are inadequate for use in the third stage because the sequence length is severely limited or only similarity scores are computed. Using an FPGA accelerator, an overall speedup of 150 has been demonstrated on a large data set when compared to a 2.4 GHz Core2 processor [BMC Bioinformatics]. A version of MUSCLE with discrete profile alignment is shown to have comparable quality to other popular MSA programs. Our parallel algorithm and architecture accelerates large-scale MSA with reconfigurable computing and allows researchers to solve the larger problems that confront biologists today.

More information and source code is available at the following links. The bibliography includes parallel methods that use multiprocessor, vector, Cell BE, GPU, and FPGA systems.

Documents and Bibliography

Accelerated Large-Scale Multiple Sequence Alignment

Parallel Multiple Sequence Alignment: An Overview

Bibliography on Parallel Multiple Sequence Alignment

Source Code and Data Sets

MSA Programs - download v3.8.31b

MSA Utilities - download v1.1

Large Virus Data Sets - download (16MB)

Data Set Sequences Avg. Length
Influenza HA 12,104 1,717
HIV 2,144 9,019
Corona 400 29,531
Herpes 142 167,043

Contacts

G. Scott Lloyd

Quinn O. Snell

About Us | Contact | ©2005-2009 Computational Science Laboratory