Blast Analysis Software

This list of sequence alignment software is a compilation of software tools and web portals used in pairwise sequence alignment and multiple sequence alignment. See structural alignment software for structural alignment of proteins.

Blast Load Calculator
Blast Analysis Software 2017
Blast Analysis Software Pdf
Blast Fragmentation Analysis Software
System Analysis And Design Tools

Database search only[edit]

Name	Description	Sequence type*	Link	Authors	Year
BLAST	Local search with fast k-tuple heuristic (Basic Local Alignment Search Tool)	Both	NCBIEMBL-EBIDDBJDDBJ (psi-blast)GenomeNetPIR (protein only)	Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ^[1]	1990
CS-BLAST	Sequence-context specific BLAST, more sensitive than BLAST, FASTA, and SSEARCH. Position-specific iterative version CSI-BLAST more sensitive than PSI-BLAST	Protein	CS-BLAST serverdownload^{[permanent dead link]}	Angermueller C, Biegert A, Soeding J^[2]	2013
CUDASW++	GPU accelerated Smith Waterman algorithm for multiple shared-host GPUs	Protein	Liu Y, Maskell DL and Schmidt B	2009/2010
DIAMOND	BLASTX and BLASTP aligner based on double indexing	Protein	Buchfink B, Xie, C and Huson DH^[3]	2015
FASTA	Local search with fast k-tuple heuristic, faster but less sensitive than BLAST	Both	EMBL-EBIDDBJGenomeNetPIR (protein only)
GGSEARCH, GLSEARCH	Global:Global (GG), Global:Local (GL) alignment with statistics	Protein	FASTA server
Genoogle	Genoogle uses indexing and parallel processing techniques for searching DNA and Proteins sequences. It is developed in Java and open source.	Both	Albrecht F	2015
HMMER	Local and global search with profile Hidden Markov models, more sensitive than PSI-BLAST	Both	download	Durbin R, Eddy SR, Krogh A, Mitchison G^[4]	1998
HH-suite	Pairwise comparison of profile Hidden Markov models; very sensitive	Protein	Söding J^[5]^[6]	2005/2012
IDF	Inverse Document Frequency	Both	download
Infernal	Profile SCFG search	RNA	download	Eddy S
KLAST	High-performance general purpose sequence similarity search tool	Both	2009/2014
LAMBDA	High performance local aligner compatible to BLAST, but much faster; supports SAM/BAM	Protein	Hannes Hauswedell, Jochen Singer, Knut Reinert^[7]	2014
MMseqs2	Software suite to search and cluster huge sequence sets. Similar sensitivity to BLAST and PSI-BLAST but orders of magnitude faster	Protein	homepage	Steinegger M, Mirdita M, Galiez C, Söding J^[8]	2017
USEARCH	Ultra-fast sequence analysis tool	Both	homepage	Edgar, RC (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19), 2460-2461. doi: 10.1093/bioinformatics/btq461 publication	2010
OSWALD	OpenCL Smith-Waterman on Altera's FPGA for Large Protein Databases	Protein	homepage	Rucci E, García C, Botella G, De Giusti A, Naiouf M, Prieto-Matías M^[9]	2016
parasail	Fast Smith-Waterman search using SIMD parallelization	Both	homepage	Daily J	2015
PSI-BLAST	Position-specific iterative BLAST, local search with position-specific scoring matrices, much more sensitive than BLAST	Protein	NCBI PSI-BLAST	Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ^[10]	1997
PSI-Search	Combining the Smith-Waterman search algorithm with the PSI-BLAST profile construction strategy to find distantly related protein sequences, and preventing homologous over-extension errors.	Protein	EMBL-EBI PSI-Search	Li W, McWilliam H, Goujon M, Cowley A, Lopez R, Pearson WR^[11]	2012
ScalaBLAST	Highly parallel Scalable BLAST	Both	ScalaBLAST	Oehmen et al.^[12]	2011
Sequilab	Linking and profiling sequence alignment data from NCBI-BLAST results with major sequence analysis servers/services	Nucleotide, peptide	server	2010
SAM	Local and global search with profile Hidden Markov models, more sensitive than PSI-BLAST	Both	SAM	Karplus K, Krogh A^[13]	1999
SSEARCH	Smith-Waterman search, slower but more sensitive than FASTA	Both
SWAPHI	First parallelized algorithm employing the emerging Intel Xeon Phis to accelerate Smith-Waterman protein database search	Protein	homepage	Liu Y and Schmidt B	2014
SWAPHI-LS	First parallel Smith-Waterman algorithm exploiting Intel Xeon Phi clusters to accelerate the alignment of long DNA sequences	DNA	homepage	Liu Y, Tran TT, Lauenroth F, Schmidt B	2014
SWIMM	Smith-Waterman implementation for Intel Multicore and Manycore architectures	Protein	homepage	Rucci E, García C, Botella G, De Giusti A, Naiouf M and Prieto-Matías M^[14]	2015
SWIPE	Fast Smith-Waterman search using SIMD parallelization	Both	homepage	Rognes T	2011

*Sequence type: protein or nucleotide

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. PSI-BLAST allows the user to build a PSSM (position-specific scoring matrix) using the results of the first BlastP run. PHI-BLAST performs the search but limits alignments to those that match a pattern in the query. DELTA-BLAST constructs a PSSM using the results of a Conserved Domain Database search and searches a sequence database.

Pairwise alignment[edit]

Name	Description	Sequence type*	Alignment type**	Link	Author	Year
ACANA	Fast heuristic anchor based pairwise alignment	Both	Both	download	Huang, Umbach, Li	2005
AlignMe	Alignments for membrane protein sequences	Protein	Both	download, server	M. Stamm, K. Khafizov, R. Staritzbichler, L.R. Forrest	2013
ALLALIGN	For DNA, RNA and proteins with summed length n, generates all local alignments in O(n) time using approximate suffix tree matching or mapped density dynamic alignment	Both	Local	allalign	E. Wachtel	2017
Bioconductor Biostrings::pairwiseAlignment	Dynamic programming	Both	Both + Ends-free	site	P. Aboyoun	2008
BioPerl dpAlign	Dynamic programming	Both	Both + Ends-free	site	Y. M. Chan	2003
BLASTZ, LASTZ	Seeded pattern-matching	Nucleotide	Local	download, download	Schwartz et al.^[15]^[16]	2004,2009
CUDAlign	DNA sequence alignment of unrestricted size in single or multiple GPUs	Nucleotide	Local, SemiGlobal, Global	download	E. Sandes^[17]^[18]^[19]	2011-2015
DNADot	Web-based dot-plot tool	Nucleotide	Global	server	R. Bowen	1998
DNASTAR Lasergene Molecular Biology Suite	Software to align DNA, RNA, protein, or DNA + protein sequences via pairwise and multiple sequence alignment algorithms including MUSCLE, Mauve, MAFFT, Clustal Omega, Jotun Hein, Wilbur-Lipman, Martinez Needleman-Wunsch, Lipman-Pearson and Dotplot analysis.	Both	Both	DNASTAR site	DNASTAR	1993-2016
DOTLET	Java-based dot-plot tool	Both	Global	applet	M. Pagni and T. Junier	1998
FEAST	Posterior based local extension with descriptive evolution model	Nucleotide	Local	site	A. K. Hudek and D. G. Brown	2010
Genome Compiler Genome Compiler	Align chromatogram files (.ab1, .scf) against a template sequence, locate errors, and correct them instantly. Learn more	Nucleotide	Local	Free online & download	Genome Compiler Corporation	2014
G-PAS	GPU-based dynamic programming with backtracking	Both	Local, SemiGlobal, Global	site+download	W. Frohmberg, M. Kierzynka et al.	2011
GapMis	Does pairwise sequence alignment with one gap	Both	SemiGlobal	site	K. Frousios, T. Flouri, C. S. Iliopoulos, K. Park, S. P. Pissis, G. Tischler	2012
GGSEARCH, GLSEARCH	Global:Global (GG), Global:Local (GL) alignment with statistics	Protein	Global in query	FASTA server	W. Pearson	2007
JAligner	Javaopen-source implementation of Smith-Waterman	Both	Local	JWS	A. Moustafa	2005
K*Sync	Protein sequence to structure alignment that includes secondary structure, structural conservation, structure-derived sequence profiles, and consensus alignment scores	Protein	Both	Robetta server	D. Chivian & D. Baker^[20]	2003
LALIGN	Multiple, non-overlapping, local similarity (same algorithm as SIM)	Both	Local non-overlapping	W. Pearson	1991 (algorithm)
NW-align	Standard Needleman-Wunsch dynamic programming algorithm	Protein	Global	server and download	Y Zhang	2012
mAlign	modelling alignment; models the information content of the sequences	Nucleotide	Both	doccode^{[permanent dead link]}	D. Powell, L. Allison and T. I. Dix	2004
matcher	Waterman-Eggert local alignment (based on LALIGN)	Both	Local	Pasteur	I. Longden (modified from W. Pearson)	1999
MCALIGN2	explicit models of indel evolution	DNA	Global	server	J. Wang et al.	2006
MUMmer	suffix tree based	Nucleotide	Global	download	S. Kurtz et al.	2004
needle	Needleman-Wunsch dynamic programming	Both	SemiGlobal	A. Bleasby	1999
Ngila	logarithmic and affine gap costs and explicit models of indel evolution	Both	Global	download	R. Cartwright	2007
NW	Needleman-Wunsch dynamic programming	Both	Global	download	A.C.R. Martin	1990-2015
parasail	C/C++/Python/Java SIMD dynamic programming library for SSE, AVX2	Both	Global, Ends-free, Local	site	J. Daily	2015
Path	Smith-Waterman on protein back-translationgraph (detects frameshifts at protein level)	Protein	Local	M. Gîrdea et al.^[21]	2009
PatternHunter	Seeded pattern-matching	Nucleotide	Local	download	B. Ma et al.^[22]^[23]	2002–2004
ProbA (also propA)	Stochastic partition function sampling via dynamic programming	Both	Global	download	U. Mückstein	2002
PyMOL	'align' command aligns sequence & applies it to structure	Protein	Global (by selection)	site	W. L. DeLano	2007
REPuter	suffix tree based	Nucleotide	Local	download	S. Kurtz et al.	2001
SABERTOOTH	Alignment using predicted Connectivity Profiles	Protein	Global	download on request^{[permanent dead link]}	F. Teichert, J. Minning, U. Bastolla, and M. Porto	2009
Satsuma	Parallel whole-genome synteny alignments	DNA	Local	download	M.G. Grabherr et al.	2010
SEQALN	Various dynamic programming	Both	Local or global	server	M.S. Waterman and P. Hardy	1996
SIM, GAP, NAP, LAP	Local similarity with varying gap treatments	Both	Local or global	server	X. Huang and W. Miller	1990-6
SIM	Local similarity	Both	Local	servers	X. Huang and W. Miller	1991
SPA: Super pairwise alignment	Fast pairwise global alignment	Nucleotide	Global	available upon request	Shen, Yang, Yao, Hwang	2002
SSEARCH	Local (Smith-Waterman) alignment with statistics	Protein	Local	W. Pearson	1981 (Algorithm)
Sequences Studio	Java applet demonstrating various algorithms from^[24]	Generic sequence	Local and global	A.Meskauskas	1997 (reference book)
SWIFT suit	Fast Local Alignment Searching	DNA	Local	site	K. Rasmussen,^[25] W. Gerlach	2005,2008
stretcher	Memory-optimized Needleman-Wunsch dynamic programming	Both	Global	Pasteur	I. Longden (modified from G. Myers and W. Miller)	1999
tranalign	Aligns nucleic acid sequences given a protein alignment	Nucleotide	NA	Pasteur	G. Williams (modified from B. Pearson)	2002
UGENE	Opensource Smith-Waterman for SSE/CUDA, Suffix array based repeats finder & dotplot	Both	Both	UGENE site	UniPro	2010
water	Smith-Waterman dynamic programming	Both	Local	A. Bleasby	1999
wordmatch	k-tuple pairwise match	Both	NA	Pasteur	I. Longden	1998
YASS	Seeded pattern-matching	Nucleotide	Local	L. Noe and G. Kucherov^[26]	2004

*Sequence type: protein or nucleotide **Alignment type: local or global

Multiple sequence alignment[edit]

Name	Description	Sequence type*	Alignment type**	Link	Author	Year	License
ABA	A-Bruijn alignment	Protein	Global	download	B.Raphael et al.	2004	Proprietary, freeware for education, research, nonprofit
ALE	manual alignment ; some software assistance	Nucleotides	Local	download	J. Blandy and K. Fogel	1994 (latest version 2007)	Free, GPL2
ALLALIGN	For DNA, RNA and proteins with summed length n, generates all local alignments in O(n) time using approximate suffix tree matching or mapped density dynamic alignment	Both	Local	allalign	E. Wachtel	2017	Free
AMAP	Sequence annealing	Both	Global	server	A. Schwartz and L. Pachter	2006
anon.	fast, optimal alignment of three sequences using linear gap costs	Nucleotides	Global	papersoftware^{[permanent dead link]}	D. Powell, L. Allison and T. I. Dix	2000
BAli-Phy	Tree+multi-alignment; probabilistic-Bayesian; joint estimation	Both + Codons	Global	WWW+download	BD Redelings and MA Suchard	2005 (latest version 2018)	Free, GPL
Base-By-Base	Java-based multiple sequence alignment editor with integrated analysis tools	Both	Local or global	download	R. Brodie et al.	2004	Proprietary, freeware, must register
CHAOS, DIALIGN	Iterative alignment	Both	Local (preferred)	server	M. Brudno and B. Morgenstern	2003
ClustalW	Progressive alignment	Both	Local or global	Thompson et al.	1994	Free, LGPL
CodonCode Aligner	Multi-alignment; ClustalW & Phrap support	Nucleotides	Local or global	download	P. Richterich et al.	2003 (latest version 2009)
Compass	COmparison of Multiple Protein sequence Alignments with assessment of Statistical Significance	Protein	Global	download and server	R.I. Sadreyev, et al.	2009
DECIPHER	Progressive-iterative alignment	Both	Global	download	Erik S. Wright	2014	Free, GPL
DIALIGN-TX and DIALIGN-T	Segment-based method	Both	Local (preferred) or Global	download and server	A.R.Subramanian	2005 (latest version 2008)
DNA Alignment	Segment-based method for intraspecific alignments	Both	Local (preferred) or Global	server	A.Roehl	2005 (latest version 2008)
DNA Baser Sequence Assembler	Multi-alignment; Full automatic sequence alignment; Automatic ambiguity correction; Internal base caller; Command line seq alignment	Nucleotides	Local or global	www.DnaBaser.com	Heracle BioSoft SRL	2006 (latest version 2018)	Commercial (some modules are freeware)
DNADynamo	linked DNA to Protein multiple alignment with MUSCLE, Clustal and Smith-Waterman	Both	Local or global	download	DNADynamo	2004 (newest version 2017)
DNASTAR Lasergene Molecular Biology Suite	Software to align DNA, RNA, protein, or DNA + protein sequences via pairwise and multiple sequence alignment algorithms including MUSCLE, Mauve, MAFFT, Clustal Omega, Jotun Hein, Wilbur-Lipman, Martinez Needleman-Wunsch, Lipman-Pearson and Dotplot analysis.	Both	Local or global	DNASTAR site	DNASTAR	1993-2016
EDNA	Energy Based Multiple Sequence Alignment for DNA Binding Sites	Nucleotides	Local or global	sourceforge.net/projects/msa-edna/	Salama, RA. et al.	2013
FAMSA	Progressive alignment for extremely large protein families (hundreds of thousands of members)	Protein	Global	download	Deorowicz et al.	2016
FSA	Sequence annealing	Both	Global	download and server	R. K. Bradley et al.	2008
Geneious	Progressive-Iterative alignment; ClustalW plugin	Both	Local or global	download	A.J. Drummond et al.	2005 (latest version 2017)
Kalign	Progressive alignment	Both	Global	T. Lassmann	2005
MAFFT	Progressive-iterative alignment	Both	Local or global	K. Katoh et al.	2005	Free, BSD
MARNA	Multi-alignment of RNAs	RNA	Local	S. Siebert et al.	2005
MAVID	Progressive alignment	Both	Global	server	N. Bray and L. Pachter	2004
MSA	Dynamic programming	Both	Local or global	download	D.J. Lipman et al.	1989 (modified 1995)
MSAProbs	Dynamic programming	Protein	Global	download	Y. Liu, B. Schmidt, D. Maskell	2010
MULTALIN	Dynamic programming-clustering	Both	Local or global	F. Corpet	1988
Multi-LAGAN	Progressive dynamic programming alignment	Both	Global	server	M. Brudno et al.	2003
MUSCLE	Progressive-iterative alignment	Both	Local or global	server	R. Edgar	2004
Opal	Progressive-iterative alignment	Both	Local or global	download	T. Wheeler and J. Kececioglu	2007 (latest stable 2013, latest beta 2016)
Pecan	Probabilistic-consistency	DNA	Global	download	B. Paten et al.	2008
Phylo	A human computing framework for comparative genomics to solve multiple alignment	Nucleotides	Local or global	site	McGill Bioinformatics	2010
PMFastR	Progressive structure aware alignment	RNA	Global	site	D. DeBlasio, J Braund, S Zhang	2009
Praline	Progressive-iterative-consistency-homology-extended alignment with preprofiling and secondary structure prediction	Protein	Global	server	J. Heringa	1999 (latest version 2009)
PicXAA	Nonprogressive, maximum expected accuracy alignment	Both	Global	download and server	S.M.E. Sahraeian and B.J. Yoon	2010
POA	Partial order/hidden Markov model	Protein	Local or global	download	C. Lee	2002
Probalign	Probabilistic/consistency with partition function probabilities	Protein	Global	server	Roshan and Livesay	2006	Free, public domain
ProbCons	Probabilistic/consistency	Protein	Local or global	server	C. Do et al.	2005	Free, public domain
PROMALS3D	Progressive alignment/hidden Markov model/Secondary structure/3D structure	Protein	Global	server	J. Pei et al.	2008
PRRN/PRRP	Iterative alignment (especially refinement)	Protein	Local or global	Y. Totoki (based on O. Gotoh)	1991 and later
PSAlign	Alignment preserving non-heuristic	Both	Local or global	download	S.H. Sze, Y. Lu, Q. Yang.	2006
RevTrans	Combines DNA and Protein alignment, by back translating the protein alignment to DNA.	DNA/Protein (special)	Local or global	server	Wernersson and Pedersen	2003 (newest version 2005)
SAGA	Sequence alignment by genetic algorithm	Protein	Local or global	download	C. Notredame et al.	1996 (new version 1998)
SAM	Hidden Markov model	Protein	Local or global	server	A. Krogh et al.	1994 (most recent version 2002)
Se-Al	Manual alignment	Both	Local	download	A. Rambaut	2002
StatAlign	Bayesian co-estimation of alignment and phylogeny (MCMC)	Both	Global	download	A. Novak et al.	2008
Stemloc	Multiple alignment and secondary structure prediction	RNA	Local or global	download	I. Holmes	2005	Free, GPL 3 (parte de DART)
T-Coffee	More sensitive progressive alignment	Both	Local or global	C. Notredame et al.	2000 (newest version 2008)	Free, GPL 2
UGENE	Supports multiple alignment with MUSCLE, KAlign, Clustal and MAFFT plugins	Both	Local or global	download	UGENE team	2010 (newest version 2012)	Free, GPL 2
VectorFriends	VectorFriends Aligner, MUSCLE plugin, and ClustalW plugin	Both	Local or global	download	BioFriends team	2013	Proprietary, freeware for academic use
GLProbs	Adaptive pair-Hidden Markov Model based approach	Protein	Global	download	Y. Ye et al.	2013

*Sequence type: protein or nucleotide. **Alignment type: local or global

Genomics analysis[edit]

Name	Description	Sequence type*	Link
ACT (Artemis Comparison Tool)	Synteny and comparative genomics	Nucleotide	server
AVID	Pairwise global alignment with whole genomes	Nucleotide	server
BLAT	Alignment of cDNA sequences to a genome.	Nucleotide	^[27]
DECIPHER	Alignment of rearranged genomes using 6 frame translation	Nucleotide	download
FLAK	Fuzzy whole genome alignment and analysis	Nucleotide	server
GMAP	Alignment of cDNA sequences to a genome. Identifies splice site junctions with high accuracy.	Nucleotide	http://research-pub.gene.com/gmap
Splign	Alignment of cDNA sequences to a genome. Identifies splice site junctions with high accuracy. Able to recognize and separate gene duplications.	Nucleotide	https://www.ncbi.nlm.nih.gov/sutils/splign
Mauve	Multiple alignment of rearranged genomes	Nucleotide	download
MGA	Multiple Genome Aligner	Nucleotide	download
Mulan	Local multiple alignments of genome-length sequences	Nucleotide	server
Multiz	Multiple alignment of genomes	Nucleotide	download
PLAST-ncRNA	Search for ncRNAs in genomes by partition function local alignment	Nucleotide	server
Sequerome	Profiling sequence alignment data with major servers/services	Nucleotide, peptide	server
Sequilab	Profiling sequence alignment data from NCBI-BLAST results with major servers-services	Nucleotide, peptide	server
Shuffle-LAGAN	Pairwise glocal alignment of completed genome regions	Nucleotide	server
SIBsim4, Sim4	A program designed to align an expressed DNA sequence with a genomic sequence, allowing for introns	Nucleotide	download
SLAM	Gene finding, alignment, annotation (human-mouse homology identification)	Nucleotide	server

*Sequence type: protein or nucleotide

Motif finding[edit]

Name	Description	Sequence type*	Link
PMS	Motif search and discovery	Both
FMM	Motif search and discovery (can get also positive & negative sequences as input for enriched motif search)	Nucleotide	server
BLOCKS	Ungapped motif identification from BLOCKS database	Both	server
eMOTIF	Extraction and identification of shorter motifs	Both	servers
Gibbs motif sampler	Stochastic motif extraction by statistical likelihood	Both
HMMTOP	Prediction of transmembrane helices and topology of proteins	Protein	homepage & download
I-sites	Local structure motif library	Protein	server
JCoils	Prediction of Coiled coil and Leucine Zipper	Protein	homepage & download
MEME/MAST	Motif discovery and search	Both	server
CUDA-MEME	GPU accelerated MEME (v4.4.0) algorithm for GPU clusters	Both	homepage
MERCI	Discriminative motif discovery and search	Both	homepage & download
PHI-Blast	Motif search and alignment tool	Both	Pasteur
Phyloscan	Motif search tool	Nucleotide	server
PRATT	Pattern generation for use with ScanProsite	Protein	server
ScanProsite	Motif database search tool	Protein	server
TEIRESIAS	Motif extraction and database search	Both	server
BASALT	Multiple motif and regular expression search	Both	homepage

*Sequence type: protein or nucleotide

Benchmarking[edit]

Name	Link	Authors
PFAM 30.0 (2016)
SMART (2015)	website	Letunic, Copley, Schmidt, Ciccarelli, Doerks, Schultz, Ponting, Bork
BAliBASE 3 (2015)	website	Thompson, Plewniak, Poch
Oxbench (2011)	download	Raghava, Searle, Audley, Barber, Barton
Benchmark collection (2009)	website	Edgar
HOMSTRAD (2005)	website	Mizuguchi
PREFAB 4.0 (2005)	website	Edgar
SABmark (2004)	download	Van Walle, Lasters, Wyns

Alignment viewers, editors[edit]

Please see List of alignment visualization software.

Short-read sequence alignment[edit]

Name	Description	paired-end option	Use FASTQ quality	Gapped	Multi-threaded	License	Link	Reference	Year
Arioc	Computes Smith-Waterman gapped alignments and mapping qualities on one or more GPUs. Supports BS-seq alignments. Processes 100,000 to 500,000 reads per second (varies with data, hardware, and configured sensitivity).	Yes	No	Yes	Yes	Free, BSD	github	^[28]	2015
BarraCUDA	A GPGPU accelerated Burrows-Wheeler transform (FM-index) short read alignment program based on BWA, supports alignment of indels with gap openings and extensions.	Yes	No	Yes	Yes, POSIX Threads and CUDA	Free, GPL	link
BBMap	Uses a short kmers to rapidly index genome; no size or scaffold count limit. Higher sensitivity and specificity than Burrows-Wheeler aligners, with similar or greater speed. Performs affine-transform-optimized global alignment, which is slower but more accurate than Smith-Waterman. Handles Illumina, 454, PacBio, Sanger, and Ion Torrent data. Splice-aware; capable of processing long indels and RNA-seq. Pure Java; runs on any platform. Used by the Joint Genome Institute.	Yes	Yes	Yes	Yes	Free, BSD	link	2010
BFAST	Explicit time and accuracy tradeoff with a prior accuracy estimation, supported by indexing the reference sequences. Optimally compresses indexes. Can handle billions of short reads. Can handle insertions, deletions, SNPs, and color errors (can map ABI SOLiD color space reads). Performs a full Smith Waterman alignment.	Yes, POSIX Threads	Free, GPL	link^{[permanent dead link]}	^[29]	2009
BigBWA	Runs the Burrows-Wheeler Aligner-BWA on a Hadoop cluster. It supports the algorithms BWA-MEM, BWA-ALN, and BWA-SW, working with paired and single reads. It implies an important reduction in the computational time when running in a Hadoop cluster, adding scalability and fault-tolerance.	Yes	Low quality bases trimming	Yes	Yes	Free, GPL 3	link	^[30]	2015
BLASTN	BLAST's nucleotide alignment program, slow and not accurate for short reads, and uses a sequence database (EST, Sanger sequence) rather than a reference genome.	link
BLAT	Made by Jim Kent. Can handle one mismatch in initial alignment step.	Yes, client-server	Proprietary, freeware for academic and noncommercial use	link	^[31]	2002
Bowtie	Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome; 1.3 GB memory footprint for human genome. Aligns more than 25 million Illumina reads in 1 CPU hour. Supports Maq-like and SOAP-like alignment policies	Yes	Yes	No	Yes, POSIX Threads	Free, Artistic	link	^[32]	2009
BWA	Uses a Burrows-Wheeler transform to create an index of the genome. It's a bit slower than Bowtie but allows indels in alignment.	Yes	Low quality bases trimming	Yes	Yes	Free, GPL	link	^[33]	2009
BWA-PSSM	A probabilistic short read aligner based on the use of position specific scoring matrices (PSSM). The aligner is adaptable in the sense that it can take into account the quality scores of the reads and models of data specific biases, such as those observed in Ancient DNA, PAR-CLIP data or genomes with biased nucleotide compositions.^[34]	Yes	Yes	Yes	Yes	Free, GPL	link	^[34]	2014
CASHX	Quantify and manage large quantities of short-read sequence data. CASHX pipeline contains a set of tools that can be used together, or separately as modules. This algorithm is very accurate for perfect hits to a reference genome.	No	Proprietary, freeware for academic and noncommercial use	link
Cloudburst	Short-read mapping using Hadoop MapReduce	Yes, HadoopMapReduce	Free, Artistic	link
CUDA-EC	Short-read alignment error correction using GPUs.	Yes, GPU enabled	link
CUSHAW	A CUDA compatible short read aligner to large genomes based on Burrows-Wheeler transform	Yes	Yes	No	Yes (GPU enabled)	Free, GPL	link	^[35]	2012
CUSHAW2	Gapped short-read and long-read alignment based on maximal exact match seeds. This aligner supports both base-space (e.g. from Illumina, 454, Ion Torrent and PacBio sequencers) and ABI SOLiD color-space read alignments.	Yes	No	Yes	Yes	Free, GPL	link	2014
CUSHAW2-GPU	GPU-accelerated CUSHAW2 short-read aligner.	Yes	No	Yes	Yes	Free, GPL	link
CUSHAW3	Sensitive and accurate base-space and color-space short-read alignment with hybrid seeding	Yes	No	Yes	Yes	Free, GPL	link	^[36]	2012
drFAST	Read mapping alignment software that implements cache obliviousness to minimize main/cache memory transfers like mrFAST and mrsFAST, however designed for the SOLiD sequencing platform (color space reads). It also returns all possible map locations for improved structural variation discovery.	Yes	Yes, for structural variation	Yes	No	Free, BSD	link
ELAND	Implemented by Illumina. Includes ungapped alignment with a finite read length.
ERNE	Extended Randomized Numerical alignEr for accurate alignment of NGS reads. It can map bisulfite-treated reads.	Yes	Low quality bases trimming	Yes	Multithreading and MPI-enabled	Free, GPL 3	link
GASSST	Finds global alignments of short DNA sequences against large DNA banks	Multithreading	CeCILL version 2 License.	link	^[37]	2011
GEM	High-quality alignment engine (exhaustive mapping with substitutions and indels). More accurate and several times faster than BWA or Bowtie 1/2. Many standalone biological applications (mapper, split mapper, mappability, and other) provided.	Yes	Yes	Yes	Yes	Dual, freeware for noncommercial use; GEM source is currently unavailable	link	^[38]	2012
Genalice MAP	Ultra fast and comprehensive NGS read aligner with high precision and small storage footprint.	Yes	Low quality bases trimming	Yes	Yes	Proprietary, commercial	link
Geneious Assembler	Fast, accurate overlap assembler with the ability to handle any combination of sequencing technology, read length, any pairing orientations, with any spacer size for the pairing, with or without a reference genome.	Yes	Proprietary, commercial	link
GensearchNGS	Complete framework with user-friendly GUI to analyse NGS data. It integrates a proprietary high quality alignment algorithm and plug-in ability to integrate various public aligner into a framework allowing to import short reads, align them, detect variants, and generate reports. It is made for resequencing projects, namely in a diagnostic setting.	Yes	No	Yes	Yes	Proprietary, commercial	link
GMAP and GSNAP	Robust, fast short-read alignment. GMAP: longer reads, with multiple indels and splices (see entry above under Genomics analysis); GSNAP: shorter reads, with one indel or up to two splices per read. Useful for digital gene expression, SNP and indel genotyping. Developed by Thomas Wu at Genentech. Used by the National Center for Genome Resources (NCGR) in Alpheus.	Yes	Yes	Yes	Yes	Proprietary, freeware for academic and noncommercial use	link
GNUMAP	Accurately performs gapped alignment of sequence data obtained from next-generation sequencing machines (specifically of Solexa-Illumina) back to a genome of any size. Includes adaptor trimming, SNP calling and Bisulfite sequence analysis.	Yes, also supports Illumina _int.txt and _prb.txt files with all 4 quality scores for each base	Multithreading and MPI-enabled	link	^[39]	2009
HIVE-hexagon	Uses a hash table and bloom matrix to create and filter potential positions on the genome. For higher efficiency uses cross-similarity between short reads and avoids realigning non unique redundant sequences. It is faster than Bowtie and BWA and allows indels and divergent sensitive alignments on viruses, bacteria, and more conservative eukaryotic alignments.	Yes	Yes	Yes	Yes	Proprietary, freeware for academic and noncommercial users registered to HIVE deployment instance	link	^[40]	2014
IMOS	Improved Meta-aligner and Minimap2 On Spark. A long read distributed aligner on Apache Spark platform with linear scalability w.r.t. single node execution.	Yes	Yes	Yes	Free	github
Isaac	Fully uses all the computing power available on one server node; thus, it scales well over a broad range of hardware architectures, and alignment performance improves with hardware abilities	Yes	Yes	Yes	Yes	Free, GPL	github
LAST	Uses adaptative seeds and copes more efficiently with repeat-rich sequences (e.g. genomes). For example: it can align reads to genomes without repeat-masking, without becoming overwhelmed by repetitive hits.	Yes	Yes	Yes	No	Free, GPL	link	^[41]	2011
MAQ	Ungapped alignment that takes into account quality scores for each base.	Free, GPL	link
mrFAST, mrsFAST	Gapped (mrFAST) and ungapped (mrsFAST) alignment software that implements cache obliviousness to minimize main/cache memory transfers. They are designed for the Illumina sequencing platform and they can return all possible map locations for improved structural variation discovery.	Yes	Yes, for structural variation	Yes	No	Free, BSD
MOM	MOM or maximum oligonucleotide mapping is a query matching tool that captures a maximal length match within the short read.	Yes	link
MOSAIK	Fast gapped aligner and reference-guided assembler. Aligns reads using a banded Smith-Waterman algorithm seeded by results from a k-mer hashing scheme. Supports reads ranging in size from very short to very long.	Yes	link
MPscan	Fast aligner based on a filtration strategy (no indexing, use q-grams and Backward Nondeterministic DAWG Matching)	link	^[42]	2009
Novoalign & NovoalignCS	Gapped alignment of single end and paired end Illumina GA I & II, ABI Colour space & ION Torrent reads. High sensitivity and specificity, using base qualities at all steps in the alignment. Includes adapter trimming, base quality calibration, Bi-Seq alignment, and options for reporting multiple alignments per read. Use of ambiguous IUPAC codes in reference for common SNPs can improve SNP recall and remove allelic bias.	Yes	Yes	Yes	Multi-threading and MPI versions available with paid license	Proprietary, freeware single threaded version for academic and noncommercial use	Novocraft
NextGENe	Developed for use by biologists performing analysis of next generation sequencing data from Roche Genome Sequencer FLX, Illumina GA/HiSeq, Life Technologies Applied BioSystems’ SOLiD System, PacBio and Ion Torrent platforms.	Yes	Yes	Yes	Yes	Proprietary, commercial	Softgenetics
NextGenMap	Flexible and fast read mapping program (twice as fast as BWA), achieves a mapping sensitivity comparable to Stampy. Internally uses a memory efficient index structure (hash table) to store positions of all 13-mers present in the reference genome. Mapping regions where pairwise alignments are required are dynamically determined for each read. Uses fast SIMD instructions (SSE) to accelerate alignment calculations on CPU. If available, alignments are computed on GPU (using OpenCL/CUDA) further reducing runtime 20-50%.	Yes	No	Yes	Yes, POSIX Threads, OpenCL/CUDA, SSE	Free	Official GitHub Page	^[43]	2013
Omixon Variant Toolkit	Includes highly sensitive and highly accurate tools for detecting SNPs and indels. It offers a solution to map NGS short reads with a moderate distance (up to 30% sequence divergence) from reference genomes. It poses no restrictions on the size of the reference, which, combined with its high sensitivity, makes the Variant Toolkit well-suited for targeted sequencing projects and diagnostics.	Yes	Yes	Yes	Yes	Proprietary, commercial	www.omixon.com
PALMapper	Efficiently computes both spliced and unspliced alignments at high accuracy. Relying on a machine learning strategy combined with a fast mapping based on a banded Smith-Waterman-like algorithm, it aligns around 7 million reads per hour on one CPU. It refines the originally proposed QPALMA approach.	Yes	Free, GPL	link
Partek Flow	For use by biologists and bioinformaticians. It supports ungapped, gapped and splice-junction alignment from single and paired-end reads from Illumina, Life technologies Solid TM, Roche 454 and Ion Torrent raw data (with or without quality information). It integrates powerful quality control on FASTQ/Qual level and on aligned data. Additional functionality include trimming and filtering of raw reads, SNP and InDel detection, mRNA and microRNA quantification and fusion gene detection.	Yes	Yes	Yes	Multiprocessor-core, client-server installation possible	Proprietary, commercial, free trial version	[1]
PASS	Indexes the genome, then extends seeds using pre-computed alignments of words. Works with base space, color space (SOLID), and can align genomic and spliced RNA-seq reads.	Yes	Yes	Yes	Yes	Proprietary, freeware for academic and noncommercial use	PASS_HOME
PerM	Indexes the genome with periodic seeds to quickly find alignments with full sensitivity up to four mismatches. It can map Illumina and SOLiD reads. Unlike most mapping programs, speed increases for longer read lengths.	Yes	Free, GPL	link	^[44]
PRIMEX	Indexes the genome with a k-mer lookup table with full sensitivity up to an adjustable number of mismatches. It is best for mapping 15-60 bp sequences to a genome.	No	No	Yes	No, multiple processes per search	link	[2]	2003
QPalma	Can use quality scores, intron lengths, and computation splice site predictions to perform and performs an unbiased alignment. Can be trained to the specifics of a RNA-seq experiment and genome. Useful for splice site/intron discovery and for gene model building. (See PALMapper for a faster version).	Yes, client-server	Free, GPL 2	link
RazerS	No read length limit. Hamming or edit distance mapping with configurable error rates. Configurable and predictable sensitivity (runtime/sensitivity tradeoff). Supports paired-end read mapping.	Free, LGPL	link
REAL, cREAL	REAL is an efficient, accurate, and sensitive tool for aligning short reads obtained from next-generation sequencing. The programme can handle an enormous amount of single-end reads generated by the next-generation Illumina/Solexa Genome Analyzer. cREAL is a simple extension of REAL for aligning short reads obtained from next-generation sequencing to a genome with circular structure.	Yes	Yes	Free, GPL	link
RMAP	Can map reads with or without error probability information (quality scores) and supports paired-end reads or bisulfite-treated read mapping. There are no limitations on read length or number of mismatches.	Yes	Yes	Yes	Free, GPL 3	link
rNA	A randomized Numerical Aligner for Accurate alignment of NGS reads	Yes	Low quality bases trimming	Yes	Multithreading and MPI-enabled	Free, GPL 3	link
RTG Investigator	Extremely fast, tolerant to high indel and substitution counts. Includes full read alignment. Product includes comprehensive pipelines for variant detection and metagenomic analysis with any combination of Illumina, Complete Genomics and Roche 454 data.	Yes	Yes, for variant calling	Yes	Yes	Proprietary, freeware for individual investigator use	link
Segemehl	Can handle insertions, deletions, mismatches; uses enhanced suffix arrays	Yes	No	Yes	Yes	Proprietary, freeware for noncommercial use	link	^[45]	2009
SeqMap	Up to 5 mixed substitutions and insertions-deletions; various tuning options and input-output formats	Proprietary, freeware for academic and noncommercial use	link
Shrec	Short read error correction with a suffix tree data structure	Yes, Java	link
SHRiMP	Indexes the reference genome as of version 2. Uses masks to generate possible keys. Can map ABI SOLiD color space reads.	Yes	Yes	Yes	Yes, OpenMP	Free, [[BSD licenses\|\|Free, BSD]] derivative	link	^[46]^[47]	2009-2011
SLIDER	Slider is an application for the Illumina Sequence Analyzer output that uses the 'probability' files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.	Yes	Yes	No	No	link	^[48]^[49]	2009-2010
SOAP, SOAP2, SOAP3, SOAP3-dp	SOAP: robust with a small (1-3) number of gaps and mismatches. Speed improvement over BLAT, uses a 12 letter hash table. SOAP2: using bidirectional BWT to build the index of reference, and it is much faster than the first version. SOAP3: GPU-accelerated version that could find all 4-mismatch alignments in tens of seconds per one million reads. SOAP3-dp, also GPU accelerated, supports arbitrary number of mismatches and gaps according to affine gap penalty scores.	Yes	No	Yes, SOAP3-dp	Yes, POSIX Threads; SOAP3, SOAP3-dp need GPU with CUDA support	Free, GPL	link	^[50]^[51]
SOCS	For ABI SOLiD technologies. Significant increase in time to map reads with mismatches (or color errors). Uses an iterative version of the Rabin-Karp string search algorithm.	Yes	Free, GPL	link
SparkBWA	Integrates the Burrows-Wheeler Aligner—BWA on an Apache Spark framework running atop Hadoop. Version 0.2 of October 2016, supports the algorithms BWA-MEM, BWA-backtrack, and BWA-ALN. All of them work with single-reads and paired-end reads.	Yes	Low quality bases trimming	Yes	Yes	Free, GPL 3	link	^[52]	2016
SSAHA, SSAHA2	Fast for a small number of variants	Proprietary, freeware for academic and noncommercial use	link
Stampy	For Illumina reads. High specificity, and sensitive for reads with indels, structural variants, or many SNPs. Slow, but speed increased dramatically by using BWA for first alignment pass.	Yes	Yes	Yes	No	Proprietary, freeware for academic and noncommercial use	link	^[53]	2010
SToRM	For Illumina or ABI SOLiD reads, with SAM native output. Highly sensitive for reads with many errors, indels (full from 0 to 15, extended support otherwise). Uses spaced seeds (single hit) and a very fast SSE-SSE2-AVX2-AVX-512 banded alignment filter. For fixed-length reads only, authors recommend SHRiMP2 otherwise.	No	Yes	Yes	Yes, OpenMP	Free	link	^[54]	2010
Subread, Subjunc	Superfast and accurate read aligners. Subread can be used to map both gDNA-seq and RNA-seq reads. Subjunc detects exon-exon junctions and maps RNA-seq reads. They employ a novel mapping paradigm named seed-and-vote.	Yes	Yes	Yes	Yes	Free, GPL 3
Taipan	De-novo assembler for Illumina reads	Proprietary, freeware for academic and noncommercial use	link
UGENE	Visual interface both for Bowtie and BWA, and an embedded aligner	Yes	Yes	Yes	Yes	Free, GPL	link
VelociMapper	FPGA-accelerated reference sequence alignment mapping tool from TimeLogic. Faster than Burrows-Wheeler transform-based algorithms like BWA and Bowtie. Supports up to 7 mismatches and/or indels with no performance penalty. Produces sensitive Smith-Waterman gapped alignments.	Yes	Yes	Yes	Yes	Proprietary, commercial	TimeLogic
XpressAlign	FPGA based sliding window short read aligner which exploits the embarrassingly parallel property of short read alignment. Performance scales linearly with number of transistors on a chip (i.e. performance guaranteed to double with each iteration of Moore's Law without modification to algorithm). Low power consumption is useful for datacentre equipment. Predictable runtime. Better price/performance than software sliding window aligners on current hardware, but not better than software BWT-based aligners currently. Can manage large numbers (>2) of mismatches. Will find all hit positions for all seeds. Single-FPGA experimental version, needs work to develop it into a multi-FPGA production version.	Proprietary, freeware for academic and noncommercial use	link
ZOOM	100% sensitivity for a reads between 15-240 bp with practical mismatches. Very fast. Support insertions and deletions. Works with Illumina & SOLiD instruments, not 454.	Yes (GUI), no (CLI)	Proprietary, commercial	link	^[55]

References[edit]

^Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ; Gish; Miller; Myers; Lipman (October 1990). 'Basic local alignment search tool'. Journal of Molecular Biology. 215 (3): 403–10. doi:10.1016/S0022-2836(05)80360-2. PMID2231712.CS1 maint: multiple names: authors list (link)
^Angermüller, C.; Biegert, A.; Söding, J. (Dec 2012). 'Discriminative modelling of context-specific amino acid substitution probabilities'. Bioinformatics. 28 (24): 3240–7. doi:10.1093/bioinformatics/bts622. PMID23080114.
^Buchfink, Xie and Huson (2015). 'Fast and sensitive protein alignment using DIAMOND'. Nature Methods. 12 (1): 59–60. doi:10.1038/nmeth.3176. PMID25402007.
^Durbin, Richard; Eddy, Sean R.; Krogh, Anders; Mitchison, Graeme, eds. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge, UK: Cambridge University Press. ISBN978-0-521-62971-3.^{[page needed]}
^Söding J (April 2005). 'Protein homology detection by HMM-HMM comparison'. Bioinformatics. 21 (7): 951–60. doi:10.1093/bioinformatics/bti125. PMID15531603.
^Remmert, Michael; Biegert, Andreas; Hauser, Andreas; Söding, Johannes (2011-12-25). 'HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment'. Nature Methods. 9 (2): 173–175. doi:10.1038/nmeth.1818. hdl:11858/00-001M-0000-0015-8D56-A. ISSN1548-7105. PMID22198341.
^Hauswedell H, Singer J, Reinert K (2014-09-01). 'Lambda: the local aligner for massive biological data'. Bioinformatics. 30 (17): 349–355. doi:10.1093/bioinformatics/btu439. PMC4147892. PMID25161219.
^Steinegger, Martin; Soeding, Johannes (2017-10-16). 'MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets'. Nature Biotechnology. 35 (11): 1026–1028. doi:10.1038/nbt.3988. hdl:11858/00-001M-0000-002E-1967-3. PMID29035372.
^Rucci, Enzo; Garcia, Carlos; Botella, Guillermo; Giusti, Armando E. De; Naiouf, Marcelo; Prieto-Matias, Manuel (2016-06-30). 'OSWALD: OpenCL Smith–Waterman on Altera's FPGA for Large Protein Databases'. International Journal of High Performance Computing Applications. 32 (3): 337–350. doi:10.1177/1094342016654215. ISSN1094-3420.
^Altschul SF, Madden TL, Schäffer AA, et al. (September 1997). 'Gapped BLAST and PSI-BLAST: a new generation of protein database search programs'. Nucleic Acids Research. 25 (17): 3389–402. doi:10.1093/nar/25.17.3389. PMC146917. PMID9254694.
^Li W, McWilliam H, Goujon M, et al. (June 2012). 'PSI-Search: iterative HOE-reduced profile SSEARCH searching'. Bioinformatics. 28 (12): 1650–1651. doi:10.1093/bioinformatics/bts240. PMC3371869. PMID22539666.
^Oehmen, C.; Nieplocha, J. (August 2006). 'ScalaBLAST: A scalable implementation of BLAST for high-performancemw-data:TemplateStyles:r886058088'>
^Hughey, R.; Karplus, K.; Krogh, A. (2003). SAM: sequence alignment and modeling software system. Technical report UCSC-CRL-99-11 (Report). University of California, Santa Cruz, CA.
^Rucci, Enzo; García, Carlos; Botella, Guillermo; De Giusti, Armando; Naiouf, Marcelo; Prieto-Matías, Manuel (2015-12-25). 'An energy-aware performance analysis of SWIMM: Smith–Waterman implementation on Intel's Multicore and Manycore architectures'. Concurrency and Computation: Practice and Experience. 27 (18): 5517–5537. doi:10.1002/cpe.3598. ISSN1532-0634.
^Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W; Kent; Smit; Zhang; Baertsch; Hardison; Haussler; Miller (2003). 'Human-mouse alignments with BLASTZ'. Genome Research. 13 (1): 103–107. doi:10.1101/gr.809403. PMC430961. PMID12529312.CS1 maint: multiple names: authors list (link)
^Harris R S (2007). Improved pairwise alignment of genomic DNA (Thesis).
^Sandes, Edans F. de O.; de Melo, Alba Cristina M.A. (May 2013). 'Retrieving Smith-Waterman Alignments with Optimizations for Megabase Biological Sequences Using GPU'. IEEE Transactions on Parallel and Distributed Systems. 24 (5): 1009–1021. doi:10.1109/TPDS.2012.194.
^Sandes, Edans F. de O.; Miranda, G.; De Melo, A.C.M.A.; Martorell, X.; Ayguade, E. (May 2014). CUDAlign 3.0: Parallel Biological Sequence Comparison in Large GPU Clusters. Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on. p. 160. doi:10.1109/CCGrid.2014.18.
^Sandes, Edans F. de O.; Miranda, G.; De Melo, A.C.M.A.; Martorell, X.; Ayguade, E. (August 2014). Fine-grain Parallel Megabase Sequence Comparison with Multiple Heterogeneous GPUs. Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. pp. 383–384. doi:10.1145/2555243.2555280.
^Chivian, D; Baker, D (2006). 'Homology modeling using parametric alignment ensemble generation with consensus and energy-based model selection'. Nucleic Acids Research. 34 (17): e112. doi:10.1093/nar/gkl480. PMC1635247. PMID16971460.
^Girdea, M; Noe, L; Kucherov, G (January 2010). 'Back-translation for discovering distant protein homologies in the presence of frameshift mutations'. Algorithms for Molecular Biology. 5 (6): 6. doi:10.1186/1748-7188-5-6. PMC2821327. PMID20047662.
^Ma, B.; Tromp, J.; Li, M. (2002). 'PatternHunter: faster and more sensitive homology search'. Bioinformatics. 18 (3): 440–445. doi:10.1093/bioinformatics/18.3.440. PMID11934743.
^Li, M.; Ma, B.; Kisman, D.; Tromp, J. (2004). 'Patternhunter II: highly sensitive and fast homology search'. Journal of Bioinformatics and Computational Biology. 2 (3): 417–439. CiteSeerX10.1.1.1.2393. doi:10.1142/S0219720004000661. PMID15359419.
^Gusfield, Dan (1997). Algorithms on strings, trees and sequences. Cambridge university press. ISBN978-0-521-58519-4.
^Rasmussen K, Stoye J, Myers EW; Stoye; Myers (2006). 'Efficient q-Gram Filters for Finding All epsilon-Matches over a Given Length'. Journal of Computational Biology. 13 (2): 296–308. CiteSeerX10.1.1.465.2084. doi:10.1089/cmb.2006.13.296. PMID16597241.CS1 maint: multiple names: authors list (link)
^Noe L, Kucherov G; Kucherov (2005). 'YASS: enhancing the sensitivity of DNA similarity search'. Nucleic Acids Research. 33 (suppl_2): W540–W543. doi:10.1093/nar/gki478. PMC1160238. PMID15980530.
^'Index of /admin/exe'.
^Wilton, Richard; Budavari, Tamas; Langmead, Ben; Wheelan, Sarah J.; Salzberg, Steven L.; Szalay, Alexander S. (2015). 'Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space'. PeerJ. 3: e808. doi:10.7717/peerj.808. PMC4358639. PMID25780763.
^Homer, Nils; Merriman, Barry; Nelson, Stanley F. (2009). 'BFAST: An Alignment Tool for Large Scale Genome Resequencing'. PLOS ONE. 4 (11): e7767. doi:10.1371/journal.pone.0007767. PMC2770639. PMID19907642.
^Abuín, J.M.; Pichel, J.C.; Pena, T.F.; Amigo, J. (2015). 'BigBWA: approaching the Burrows–Wheeler aligner to Big Data technologies'. Bioinformatics. 31 (24): 4003–5. doi:10.1093/bioinformatics/btv506. PMID26323715.
^Kent, W. J. (2002). 'BLAT---The BLAST-Like Alignment Tool'. Genome Research. 12 (4): 656–664. doi:10.1101/gr.229202. ISSN1088-9051. PMC187518. PMID11932250.
^Langmead, Ben; Trapnell, Cole; Pop, Mihai; Salzberg, Steven L (2009). 'Ultrafast and memory-efficient alignment of short DNA sequences to the human genome'. Genome Biology. 10 (3): R25. doi:10.1186/gb-2009-10-3-r25. ISSN1465-6906. PMC2690996. PMID19261174.
^Li, H.; Durbin, R. (2009). 'Fast and accurate short read alignment with Burrows-Wheeler transform'. Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. ISSN1367-4803. PMC2705234. PMID19451168.
^ ^a^bKerpedjiev, Peter; Frellsen, Jes; Lindgreen, Stinus; Krogh, Anders (2014). 'Adaptable probabilistic mapping of short reads using position specific scoring matrices'. BMC Bioinformatics. 15 (1): 100. doi:10.1186/1471-2105-15-100. ISSN1471-2105. PMC4021105. PMID24717095.
^Liu, Y.; Schmidt, B.; Maskell, D. L. (2012). 'CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform'. Bioinformatics. 28 (14): 1830–1837. doi:10.1093/bioinformatics/bts276. ISSN1367-4803. PMID22576173.
^Liu, Y.; Schmidt, B. (2012). 'Long read alignment based on maximal exact match seeds'. Bioinformatics. 28 (18): i318–i324. doi:10.1093/bioinformatics/bts414. ISSN1367-4803. PMC3436841. PMID22962447.
^Rizk, Guillaume; Lavenier, Dominique (2010). 'GASSST: global alignment short sequence search tool'. Bioinformatics. 26 (20): 2534–2540. doi:10.1093/bioinformatics/btq485. PMC2951093. PMID20739310.
^Marco-Sola, Santiago; Sammeth, Michael; Guigó, Roderic; Ribeca, Paolo (2012). 'The GEM mapper: fast, accurate and versatile alignment by filtration'. Nature Methods. 9 (12): 1185–1188. doi:10.1038/nmeth.2221. ISSN1548-7091. PMID23103880.
^Clement, N. L.; Snell, Q.; Clement, M. J.; Hollenhorst, P. C.; Purwar, J.; Graves, B. J.; Cairns, B. R.; Johnson, W. E. (2009). 'The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing'. Bioinformatics. 26 (1): 38–45. doi:10.1093/bioinformatics/btp614. ISSN1367-4803. PMC6276904. PMID19861355.
^Santana-Quintero, Luis; Dingerdissen, Hayley; Thierry-Mieg, Jean; Mazumder, Raja; Simonyan, Vahan (2014). 'HIVE-Hexagon: High-Performance, Parallelized Sequence Alignment for Next-Generation Sequencing Data Analysis'. PLOS ONE. 9 (6): 1754–1760. doi:10.1371/journal.pone.0099033. PMC4053384. PMID24918764.
^Kielbasa, S.M.; Wan, R.; Sato, K.; Horton, P.; Frith, M.C. (2011). 'Adaptive seeds tame genomic sequence comparison'. Genome Research. 21 (3): 487–493. doi:10.1101/gr.113985.110. PMC3044862. PMID21209072.
^Rivals, Eric; Salmela, Leena; Kiiskinen, Petteri; Kalsi, Petri; Tarhio, Jorma (2009). mpscan: Fast Localisation of Multiple Reads in Genomes. Algorithms in Bioinformatics. Lecture Notes in Computer Science. 5724. pp. 246–260. CiteSeerX10.1.1.156.928. doi:10.1007/978-3-642-04241-6_21. ISBN978-3-642-04240-9.
^Sedlazeck, Fritz J.; Rescheneder, Philipp; von Haeseler, Arndt (2013). 'NextGenMap: fast and accurate read mapping in highly polymorphic genomes'. Bioinformatics. 29 (21): 2790–2791. doi:10.1093/bioinformatics/btt468. PMID23975764.
^Chen, Yangho; Souaiaia, Tade; Chen, Ting (2009). 'PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds'. Bioinformatics. 25 (19): 2514–2521. doi:10.1093/bioinformatics/btp486. PMC2752623. PMID19675096.
^Searls, David B.; Hoffmann, Steve; Otto, Christian; Kurtz, Stefan; Sharma, Cynthia M.; Khaitovich, Philipp; Vogel, Jörg; Stadler, Peter F.; Hackermüller, Jörg (2009). 'Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures'. PLoS Computational Biology. 5 (9): e1000502. doi:10.1371/journal.pcbi.1000502. ISSN1553-7358. PMC2730575. PMID19750212.
^Rumble, Stephen M.; Lacroute, Phil; Dalca, Adrian V.; Fiume, Marc; Sidow, Arend; Brudno, Michael (2009). 'SHRiMP: Accurate Mapping of Short Color-space Reads'. PLOS Computational Biology. 5 (5): e1000386. doi:10.1371/journal.pcbi.1000386. PMC2678294. PMID19461883.
^David, Matei; Dzamba, Misko; Lister, Dan; Ilie, Lucian; Brudno, Michael (2011). 'SHRiMP2: Sensitive yet Practical Short Read Mapping'. Bioinformatics. 27 (7): 1011–1012. doi:10.1093/bioinformatics/btr046. PMID21278192.
^Malhis, Nawar; Butterfield, Yaron S. N.; Ester, Martin; Jones, Steven J. M. (2009). 'Slider – Maximum use of probability information for alignment of short sequence reads and SNP detection'. Bioinformatics. 1 (1): 6–13. doi:10.1093/bioinformatics/btn565. PMC2638935. PMID18974170.
^Malhis, Nawar; Jones, Steven J. M. (2010). 'High Quality SNP Calling Using Illumina Data at Shallow Coverage'. Bioinformatics. 26 (8): 1029–1035. doi:10.1093/bioinformatics/btq092. PMID20190250.
^Li, R.; Li, Y.; Kristiansen, K.; Wang, J. (2008). 'SOAP: short oligonucleotide alignment program'. Bioinformatics. 24 (5): 713–714. doi:10.1093/bioinformatics/btn025. ISSN1367-4803. PMID18227114.
^Li, R.; Yu, C.; Li, Y.; Lam, T.-W.; Yiu, S.-M.; Kristiansen, K.; Wang, J. (2009). 'SOAP2: an improved ultrafast tool for short read alignment'. Bioinformatics. 25 (15): 1966–1967. doi:10.1093/bioinformatics/btp336. ISSN1367-4803. PMID19497933.
^Abuín, José M.; Pichel, Juan C.; Pena, Tomás F.; Amigo, Jorge (2016-05-16). 'SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data'. PLOS ONE. 11 (5): e0155461. doi:10.1371/journal.pone.0155461. ISSN1932-6203. PMC4868289. PMID27182962.
^Lunter, G.; Goodson, M. (2010). 'Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads'. Genome Research. 21 (6): 936–939. doi:10.1101/gr.111120.110. ISSN1088-9051. PMC3106326. PMID20980556.
^Noe, L.; Girdea, M.; Kucherov, G. (2010). 'Designing efficient spaced seeds for SOLiD read mapping'. Advances in Bioinformatics. 2010: 708501. doi:10.1155/2010/708501. PMC2945724. PMID20936175.
^Lin, H.; Zhang, Z.; Zhang, M.Q.; Ma, B.; Li, M. (2008). 'ZOOM! Zillions of oligos mapped'. Bioinformatics. 24 (21): 2431–2437. doi:10.1093/bioinformatics/btn416. PMC2732274. PMID18684737.

Retrieved from 'https://en.wikipedia.org/w/index.php?title=List_of_sequence_alignment_software&oldid=915452706'

(Redirected from BLAST)

BLAST
Developer(s)
Stable release	2.9.0+ / 1 April 2019; 5 months ago
Operating system	UNIX, Linux, Mac, MS-Windows
Type	Bioinformatics tool
License	Public domain
Website	blast.ncbi.nlm.nih.gov/Blast.cgi

In bioinformatics, BLAST (basic local alignment search tool)^[1] is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.

Different types of BLASTs are available according to the query sequences. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In the letters, characters admit their faults and fears, their deepest desires, and their worst sins and schemes. Like us, they were meticulous about privacy (well, like some of us).

1Background
3Algorithm
4Program
5Alternatives to BLAST

Background[edit]

BLAST, which The New York Times called the Google of biological research,^[1] is one of the most widely used bioinformatics programs for sequence searching.^[2] It addresses a fundamental problem in bioinformatics research. The heuristic algorithm it uses is much faster than other approaches, such as calculating an optimal alignment. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster.

Before BLAST, FASTA was developed by David J. Lipman and William R. Pearson in 1985.^[3]

Before fast algorithms such as BLAST and FASTA were developed, searching databases for protein or nucleic sequences was very time consuming because a full alignment procedure (e.g., the Smith–Waterman algorithm) was used.

BLAST came from the 1990 stochastic model of Samuel Karlin and Stephen Altschul^[4] They 'proposed a method for estimating similarities between the known DNA sequence of one organism with that of another,'^[1] and their work has been described as 'the statistical foundation for BLAST.'^[5] Subsequently, Altschul, along with Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman at the National Institutes of Health designed the BLAST algorithm, which was published in the Journal of Molecular Biology in 1990 and cited over 75,000 times.^[6]

While BLAST is faster than any Smith-Waterman implementation for most cases, it cannot 'guarantee the optimal alignments of the query and database sequences' as Smith-Waterman algorithm does. The optimality of Smith-Waterman 'ensured the best performance on accuracy and the most precise results' at the expense of time and computer power.

BLAST is more time-efficient than FASTA by searching only for the more significant patterns in the sequences, yet with comparative sensitivity. This could be further realized by understanding the algorithm of BLAST introduced below.

Examples of other questions that researchers use BLAST to answer are:

Which bacterialspecies have a protein that is related in lineage to a certain protein with known amino-acid sequence
What other genes encode proteins that exhibit structures or motifs such as ones that have just been determined

BLAST is also often used as part of other algorithms that require approximate sequence matching.

BLAST is available on the web on the NCBI website. Alternative implementations include AB-BLAST (formerly known as WU-BLAST), FSA-BLAST (last updated in 2006), and ScalaBLAST.^[7]^[8]

The original paper by Altschul, et al.^[6] was the most highly cited paper published in the 1990s.^[9]

Input[edit]

Input sequences (in FASTA or Genbank format) and weight matrix.

Output[edit]

BLAST output can be delivered in a variety of formats. These formats include HTML, plain text, and XML formatting. For NCBI's web-page, the default format for output is HTML. When performing a BLAST on NCBI, the results are given in a graphical format showing the hits found, a table showing sequence identifiers for the hits with scoring related data, as well as alignments for the sequence of interest and the hits received with corresponding BLAST scores for these. The easiest to read and most informative of these is probably the table.

If one is attempting to search for a proprietary sequence or simply one that is unavailable in databases available to the general public through sources such as NCBI, there is a BLAST program available for download to any computer, at no cost. This can be found at BLAST+ executables. There are also commercial programs available for purchase. Databases can be found from the NCBI site, as well as from Index of BLAST databases (FTP).

Process[edit]

Using a heuristic method, BLAST finds similar sequences, by locating short matches between the two sequences. This process of finding similar sequences is called seeding. It is after this first match that BLAST begins to make local alignments. While attempting to find similarity in sequences, sets of common letters, known as words, are very important. For example, suppose that the sequence contains the following stretch of letters, GLKFA. If a BLAST was being conducted under normal conditions, the word size would be 3 letters. In this case, using the given stretch of letters, the searched words would be GLK, LKF, KFA. The heuristic algorithm of BLAST locates all common three-letter words between the sequence of interest and the hit sequence or sequences from the database. This result will then be used to build an alignment. After making words for the sequence of interest, the rest of the words are also assembled. These words must satisfy a requirement of having a score of at least the threshold T, when compared by using a scoring matrix.

One commonly used scoring matrix for BLAST searches is BLOSUM62,^[10] although the optimal scoring matrix depends on sequence similarity. Once both words and neighborhood words are assembled and compiled, they are compared to the sequences in the database in order to find matches. The threshold score T determines whether or not a particular word will be included in the alignment. Once seeding has been conducted, the alignment which is only 3 residues long, is extended in both directions by the algorithm used by BLAST. Each extension impacts the score of the alignment by either increasing or decreasing it. If this score is higher than a pre-determined T, the alignment will be included in the results given by BLAST. However, if this score is lower than this pre-determined T, the alignment will cease to extend, preventing the areas of poor alignment from being included in the BLAST results. Note that increasing the T score limits the amount of space available to search, decreasing the number of neighborhood words, while at the same time speeding up the process of BLAST

Algorithm[edit]

To run the software, BLAST requires a query sequence to search for, and a sequence to search against (also called the target sequence) or a sequence database containing multiple such sequences. BLAST will find sub-sequences in the database which are similar to sub sequences in the query. In typical usage, the query sequence is much smaller than the database, e.g., the query may be one thousand nucleotides while the database is several billion nucleotides.

Blast Load Calculator

The main idea of BLAST is that there are often High-scoring Segment Pairs (HSP) contained in a statistically significant alignment. BLAST searches for high scoring sequence alignments between the query sequence and the existing sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. However, the exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is less accurate than the Smith-Waterman algorithm but over 50 times faster. [8] The speed and relatively good accuracy of BLAST are among the key technical innovations of the BLAST programs.

An overview of the BLAST algorithm (a protein to protein search) is as follows:^[11]

Remove low-complexity region or sequence repeats in the query sequence.
'Low-complexity region' means a region of a sequence composed of few kinds of elements. These regions might give high scores that confuse the program to find the actual significant sequences in the database, so they should be filtered out. The regions will be marked with an X (protein sequences) or N (nucleic acid sequences) and then be ignored by the BLAST program. To filter out the low-complexity regions, the SEG program is used for protein sequences and the program DUST is used for DNA sequences. On the other hand, the program XNU is used to mask off the tandem repeats in protein sequences.
Make a k-letter word list of the query sequence.
Take k=3 for example, we list the words of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) 'sequentially', until the last letter of the query sequence is included. The method is illustrated in figure 1.
Fig. 1 The method to establish the k-letter query word list.^[12]
List the possible matching words.
This step is one of the main differences between BLAST and FASTA. FASTA cares about all of the common words in the database and query sequences that are listed in step 2; however, BLAST only cares about the high-scoring words. The scores are created by comparing the word in the list in step 2 with all the 3-letter words. By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is respectively 15 and 12 with the BLOSUM62 weighting scheme. For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.
Organize the remaining high-scoring words into an efficient search tree.
This allows the program to rapidly compare the high-scoring words to the database sequences.
Repeat step 3 to 4 for each k-letter word in the query sequence.
Scan the database sequences for exact matches with the remaining high-scoring words.
The BLAST program scans the database sequences for the remaining high-scoring word, such as PEG, of each position. If an exact match is found, this match is used to seed a possible un-gapped alignment between the query and database sequences.
Extend the exact matches to high-scoring segment pair (HSP).
- The original version of BLAST stretches a longer alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease. A simplified example is presented in figure 2.
  Fig. 2 The process to extend the exact match. Adapted from Biological Sequence Analysis I, Current Topics in Genome Analysis [2].
- To save more time, a newer version of BLAST, called BLAST2 or gapped BLAST, has been developed. BLAST2 adopts a lower neighborhood word score threshold to maintain the same level of sensitivity for detecting sequence similarity. Therefore, the possible matching words list in step 3 becomes longer. Next, the exact matched regions, within distance A from each other on the same diagonal in figure 3, will be joined as a longer new region. Finally, the new regions are then extended by the same method as in the original version of BLAST, and the HSPs' (High-scoring segment pair) scores of the extended regions are then created by using a substitution matrix as before.
List all of the HSPs in the database whose score is high enough to be considered.
We list the HSPs whose scores are greater than the empirically determined cutoff score S. By examining the distribution of the alignment scores modeled by comparing random sequences, a cutoff score S can be determined such that its value is large enough to guarantee the significance of the remaining HSPs.
Evaluate the significance of the HSP score.
BLAST next assesses the statistical significance of each HSP score by exploiting the Gumbel extreme value distribution (EVD). (It is proved that the distribution of Smith-Waterman local alignment scores between two random sequences follows the Gumbel EVD. For local alignments containing gaps it is not proved.). In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation
${displaystyle pleft(Sgeq xright)=1-exp left(-e^{-lambda left(x-mu right)}right)}$
where
${displaystyle mu ={frac {log left(Km$
The statistical parameters ${displaystyle lambda }$ and ${displaystyle mathrm {K} }$ are estimated by fitting the distribution of the un-gapped local alignment scores, of the query sequence and a lot of shuffled versions (Global or local shuffling) of a database sequence, to the Gumbel extreme value distribution. Note that ${displaystyle lambda }$ and ${displaystyle mathrm {K} }$ depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies). ${displaystyle m$ and ${displaystyle n$ are the effective lengths of the query and database sequences, respectively. The original sequence length is shortened to the effective length to compensate for the edge effect (an alignment start near the end of one of the query or database sequence is likely not to have enough sequence to build an optimal alignment). They can be calculated as
${displaystyle m$
${displaystyle n$
where ${displaystyle mathrm {H} }$ is the average expected score per aligned pair of residues in an alignment of two random sequences. Altschul and Gish gave the typical values, ${displaystyle lambda =0.318}$ , ${displaystyle mathrm {K} =0.13}$ , and ${displaystyle mathrm {H} =0.40}$ , for un-gapped local alignment using BLOSUM62 as the substitution matrix. Using the typical values for assessing the significance is called the lookup table method; it is not accurate. The expect score E of a database match is the number of times that an unrelated database sequence would obtain a score S higher than x by chance. The expectation E obtained in a search for a database of D sequences is given by
${displaystyle Eapprox 1-e^{-pleft(s>xright)D}}$
Furthermore, when ${displaystyle p<0.1}$ , E could be approximated by the Poisson distribution as
${displaystyle Eapprox pD}$
This expectation or expect value 'E' (often called an E score or E-value or e-value) assessing the significance of the HSP score for un-gapped local alignment is reported in the BLAST results. The calculation shown here is modified if individual HSPs are combined, such as when producing gapped alignments (described below), due to the variation of the statistical parameters.
Make two or more HSP regions into a longer alignment.
Sometimes, we find two or more HSP regions in one database sequence that can be made into a longer alignment. This provides additional evidence of the relation between the query and database sequence. There are two methods, the Poisson method and the sum-of-scores method, to compare the significance of the newly combined HSP regions. Suppose that there are two combined HSP regions with the pairs of scores (65, 40) and (52, 45), respectively. The Poisson method gives more significance to the set with the maximal lower score (45>40). However, the sum-of-scores method prefers the first set, because 65+40 (105) is greater than 52+45(97). The original BLAST uses the Poisson method; gapped BLAST and the WU-BLAST uses the sum-of scores method.
Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences.
- The original BLAST only generates un-gapped alignments including the initially found HSPs individually, even when there is more than one HSP found in one database sequence.
- BLAST2 produces a single alignment with gaps that can include all of the initially found HSP regions. Note that the computation of the score and its corresponding E-value involves use of adequate gap penalties.
Report every match whose expect score is lower than a threshold parameter E.

Parallel BLAST[edit]

Parallel BLAST versions of split databases are implemented using MPI and Pthreads, and have been ported to various platforms including Windows, Linux, Solaris, Mac OS X, and AIX. Popular approaches to parallelize BLAST include query distribution, hash table segmentation, computation parallelization, and database segmentation (partition). Databases are split into equal sized pieces and stored locally on each node. Each query is run on all nodes in parallel and the resultant BLAST output files from all nodes merged to yield the final output. Specific implementations include MPIblast, ScalaBLAST, DCBLAST and so on.^[13]

Program[edit]

The BLAST program can either be downloaded and run as a command-line utility 'blastall' or accessed for free over the web. The BLAST web server, hosted by the NCBI, allows anyone with a web browser to perform similarity searches against constantly updated databases of proteins and DNA that include most of the newly sequenced organisms.

The BLAST program is based on an open-source format, giving everyone access to it and enabling them to have the ability to change the program code. This has led to the creation of several BLAST 'spin-offs'.

There are now a handful of different BLAST programs available, which can be used depending on what one is attempting to do and what they are working with. These different programs vary in query sequence input, the database being searched, and what is being compared. These programs and their details are listed below:

BLAST is actually a family of programs (all included in the blastall executable). These include:^[14]

Nucleotide-nucleotide BLAST (blastn): This program, given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies.
Protein-protein BLAST (blastp): This program, given a protein query, returns the most similar protein sequences from the protein database that the user specifies.
Position-Specific Iterative BLAST (PSI-BLAST) (blastpgp): This program is used to find distant relatives of a protein. First, a list of all closely related proteins is created. These proteins are combined into a general 'profile' sequence, which summarises significant features present in these sequences. A query against the protein database is then run using this profile, and a larger group of proteins is found. This larger group is used to construct another profile, and the process is repeated.; By including related proteins in the search, PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than a standard protein-protein BLAST.
Nucleotide 6-frame translation-protein (blastx): This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx): This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.
Protein-nucleotide 6-frame translation (tblastn): This program compares a protein query against the all six reading frames of a nucleotide sequence database.
Large numbers of query sequences (megablast): When comparing large numbers of input sequences via the command-line BLAST, 'megablast' is much faster than running BLAST multiple times. It concatenates many input sequences together to form a large sequence before searching the BLAST database, then post-analyzes the search results to glean individual alignments and statistical values.

Of these programs, BLASTn and BLASTp are the most commonly used^{[citation needed]} because they use direct comparisons, and do not require translations. However, since protein sequences are better conserved evolutionarily than nucleotide sequences, tBLASTn, tBLASTx, and BLASTx, produce more reliable and accurate results when dealing with coding DNA. They also enable one to be able to directly see the function of the protein sequence, since by translating the sequence of interest before searching often gives you annotated protein hits.

Alternative versions[edit]

A version designed for comparing large genomes or DNA is BLASTZ.

CS-BLAST (Context-Specific BLAST) is an extended version of BLAST for searching protein sequences that finds twice as many remotely related sequences as BLAST at the same speed and error rate. In CS-BLAST, the mutation probabilities between amino acids depend not only on the single amino acid, as in BLAST, but also on its local sequence context. Washington University produced an alternative version of NCBI BLAST, called WU-BLAST. The rights have since been acquired to Advanced Biocomputing, LLC.

In 2009, NCBI has released a new set of BLAST executables, the C++ based BLAST+, and has released C versions until 2.2.26.^[15] Starting with version 2.2.27 (April 2013), only BLAST+ executables are available. Among the changes is the replacement of the blastall executable with separate executables for the different BLAST programs, and changes in option handling. The formatdb utility (C based) has been replaced by makeblastdb (C++ based) and databases formatted by either one should be compatible for identical blast releases. The algorithms remain similar, however, the number of hits found and their order can vary significantly between the older and the newer version. BLAST+ since

Accelerated versions[edit]

TimeLogic offers an FPGA-accelerated implementation of the BLAST algorithm called Tera-BLAST that is hundreds of times faster.

Other formerly supported versions include:

FPGA-accelerated
- Prior to their acquisition by Qiagen, CLC bio collaborated with SciEngines GmbH on an FPGA accelerator they claimed will give 188x acceleration of BLAST.
- The Mitrion-C Open Bio Project was an effort to port BLAST to run on Mitrion FPGAs.
GPU-accelerated
- GPU-Blast^[16] is an accelerated version of NCBI BLASTP for CUDA which is 3x~4x faster than NCBI Blast.
- CUDA-BLASTP^[17] is a version of BLASTP that is GPU-accelerated and is claimed to run up to 10x faster than NCBI BLAST.
- G-BLASTN^[18] is an accelerated version of NCBI blastn and megablast, whose speedup varies from 4x to 14x (compared to the same runs with 4 CPU threads). Its current limitation is that the database must fit into the GPU memory.
CPU-accelerated
- MPIBlast is a parallel implementation of NCBI BLAST using Message Passing Interface. By efficiently utilizing distributed computational resources through database fragmentation, query segmentation, intelligent scheduling, and parallel I/O, mpiBLAST improves NCBI BLAST performance by several orders of magnitude while scaling to hundreds of processors.
- CaBLAST^[19] makes search on large databases orders of magnitude faster by exploiting redundancy in data.
- Paracel BLAST was a commercial parallel implementation of NCBI BLAST, supporting hundreds of processors.
- QuickBLAST (kblastp) from NCBI is an implementation accelerated by prefiltering based on Jaccard index estimates with hashed pentameric fragments. The filtering slightly reduces sensitivity, but increases performance by an order of magnitude.^[20] NCBI only makes the search available on their non-redundant (nr) protein collection, and does not offer downloads.

Alternatives to BLAST[edit]

The predecessor to BLAST, FASTA, can also be used for protein and DNA similarity searching. FASTA provides a similar set of programs for comparing proteins to protein and DNA databases, DNA to DNA and protein databases, and includes additional programs for working with unordered short peptides and DNA sequences. In addition, the FASTA package provides SSEARCH, a vectorized implementation of the rigorous Smith-Waterman algorithm. FASTA is slower than BLAST, but provides a much wider range of scoring matrices, making it easier to tailor a search to a specific evolutionary distance.

An extremely fast but considerably less sensitive alternative to BLAST is BLAT (Blast Like Alignment Tool). While BLAST does a linear search, BLAT relies on k-mer indexing the database, and can thus often find seeds faster.^[21] Another software alternative similar to BLAT is PatternHunter.

Advances in sequencing technology in the late 2000s has made searching for very similar nucleotide matches an important problem. New alignment programs tailored for this use typically use BWT-indexing of the target database (typically a genome). Input sequences can then be mapped very quickly, and output is typically in the form of a BAM file. Example alignment programs are BWA, SOAP, and Bowtie.

For protein identification, searching for known domains (for instance from Pfam) by matching with Hidden Markov Models is a popular alternative, such as HMMER.

An alternative to BLAST for comparing two banks of sequences is PLAST. PLAST provides a high-performance general purpose bank to bank sequence similarity search tool relying on the PLAST^[22] and ORIS^[23] algorithms. Results of PLAST are very similar to BLAST, but PLAST is significantly faster and capable of comparing large sets of sequences with a small memory (i.e. RAM) footprint.

For applications in metagenomics, where the task is to compare billions of short DNA reads against tens of millions of protein references, DIAMOND^[24] runs at up to 20,000 times as fast as BLASTX, while maintaining a high level of sensitivity.

The open-source software MMseqs is an alternative to BLAST/PSI-BLAST, which improves on current search tools over the full range of speed-sensitivity trade-off, achieving sensitivities better than PSI-BLAST at more than 400 times its speed.^[25]

Comparing BLAST and the Smith-Waterman Process[edit]

While both Smith-Waterman and BLAST are used to find homologous sequences by searching and comparing a query sequence with those in the databases, they do have their differences.

Due to the fact that BLAST is based on a heuristic algorithm, the results received through BLAST, in terms of the hits found, may not be the best possible results, as it will not provide you with all the hits within the database. BLAST misses hard to find matches.

A better alternative in order to find the best possible results would be to use the Smith-Waterman algorithm. This method varies from the BLAST method in two areas, accuracy and speed. The Smith-Waterman option provides better accuracy, in that it finds matches that BLAST cannot, because it does not miss any information. Therefore, it is necessary for remote homology. However, when compared to BLAST, it is more time consuming, not to mention that it requires large amounts of computer usage and space. However, technologies to speed up the Smith-Waterman process have been found to improve the time necessary to perform a search dramatically. These technologies include FPGA chips and SIMD technology.

In order to receive better results from BLAST, the settings can be changed from their default settings. However, there is no given or set way of changing these settings in order to receive the best results for a given sequence. The settings available for change are E-Value, gap costs, filters, word size, and substitution matrix. Note, that the algorithm used for BLAST was developed from the algorithm used for Smith-Waterman. BLAST employs an alignment which finds 'local alignments between sequences by finding short matches and from these initial matches (local) alignments are created'.^[26]

BLAST output visualization[edit]

To help users interpreting BLAST results, different software is available. According to installation and use, analysis features and technology, here are some available tools:^[27]

NCBI BLAST service
general BLAST output interpreters, GUI-based: JAMBLAST, Blast Viewer, BLASTGrabber
integrated BLAST environments: PLAN, BlastStation-Free
BLAST output parsers: MuSeqBox, Zerg, BioParser, BLAST-Explorer
specialized BLAST-related tools: MEGAN, BLAST2GENE, BOV, Circoletto

Uses of BLAST[edit]

BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.

Identifying species: With the use of BLAST, you can possibly correctly identify a species or find homologous species. This can be useful, for example, when you are working with a DNA sequence from an unknown species.

Locating domains: When working with a protein sequence you can input it into BLAST, to locate known domains within the sequence of interest.

Establishing phylogeny: Using the results received through BLAST you can create a phylogenetic tree using the BLAST web-page. Phylogenies based on BLAST alone are less reliable than other purpose-built computational phylogenetic methods, so should only be relied upon for 'first pass' phylogenetic analyses.

DNA mapping: When working with a known species, and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest, to relevant sequences in the database(s). NCBI has a 'Magic-BLAST' tool built around BLAST for this purpose.^[28]

Comparison: When working with genes, BLAST can locate common genes in two related species, and can be used to map annotations from one organism to another.

References[edit]