Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
bwa(1)			     Bioinformatics tools			bwa(1)

NAME
       bwa - Burrows-Wheeler Alignment Tool

SYNOPSIS
       bwa index ref.fa

       bwa mem ref.fa reads.fq > aln-se.sam

       bwa mem ref.fa read1.fq read2.fq	> aln-pe.sam

       bwa aln ref.fa short_read.fq > aln_sa.sai

       bwa samse ref.fa	aln_sa.sai short_read.fq > aln-se.sam

       bwa sampe ref.fa	aln_sa1.sai aln_sa2.sai	read1.fq read2.fq > aln-pe.sam

       bwa bwasw ref.fa	long_read.fq > aln.sam

DESCRIPTION
       BWA is a	software package for mapping low-divergent sequences against a
       large  reference	genome,	such as	the human genome. It consists of three
       algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first	 algorithm  is
       designed	 for  Illumina	sequence reads up to 100bp, while the rest two
       for longer sequences ranged from	70bp to	1Mbp. BWA-MEM and BWA-SW share
       similar features	such as	long-read support  and	split  alignment,  but
       BWA-MEM,	which is the latest, is	generally recommended for high-quality
       queries	as  it	is  faster and more accurate.  BWA-MEM also has	better
       performance than	BWA-backtrack for 70-100bp Illumina reads.

       For all the algorithms, BWA first needs to construct the	 FM-index  for
       the  reference genome (the index	command). Alignment algorithms are in-
       voked with different sub-commands: aln/samse/sampe  for	BWA-backtrack,
       bwasw for BWA-SW	and mem	for the	BWA-MEM	algorithm.

COMMANDS AND OPTIONS
       index  bwa index	[-p prefix] [-a	algoType] db.fa

	      Index database sequences in the FASTA format.

	      OPTIONS:

	      -p STR	Prefix of the output database [same as db filename]

	      -a STR	Algorithm  for	constructing BWT index.	BWA implements
			three algorithms for BWT construction: is,  bwtsw  and
			rb2.  The first	algorithm is a little faster for small
			database  but requires large RAM and does not work for
			databases with total length longer than	2GB. The  sec-
			ond  algorithm is adapted from the BWT-SW source code.
			It in theory works with	 database  with	 trillions  of
			bases.	When  this option is not specified, the	appro-
			priate algorithm will be chosen	automatically.

       mem    bwa mem [-aCHjMpP] [-t nThreads] [-k minSeedLen] [-w  bandWidth]
	      [-d  zDropoff]  [-r seedSplitRatio] [-c maxOcc] [-D chainShadow]
	      [-m maxMateSW] [-W minSeedMatch] [-A matchScore] [-B  mmPenalty]
	      [-O  gapOpenPen]	[-E gapExtPen] [-L clipPen] [-U	unpairPen] [-x
	      readType]	[-R RGline] [-H	HDlines] [-v  verboseLevel]  db.prefix
	      reads.fq [mates.fq]

	      Align  70bp-1Mbp	query  sequences  with	the BWA-MEM algorithm.
	      Briefly, the algorithm works by seeding alignments with  maximal
	      exact  matches  (MEMs) and then extending	seeds with the affine-
	      gap Smith-Waterman algorithm (SW).

	      If mates.fq file is absent and option -p is not set,  this  com-
	      mand regards input reads are single-end. If mates.fq is present,
	      this command assumes the i-th read in reads.fq and the i-th read
	      in  mates.fq  constitute a read pair. If -p is used, the command
	      assumes the 2i-th	and the	(2i+1)-th read in reads.fq  constitute
	      a	read pair (such	input file is said to be interleaved). In this
	      case,  mates.fq is ignored. In the paired-end mode, the mem com-
	      mand will	infer the read orientation and the insert size distri-
	      bution from a batch of reads.

	      The BWA-MEM algorithm performs local alignment. It  may  produce
	      multiple	primary	 alignments  for different part	of a query se-
	      quence. This is a	crucial	feature	for long  sequences.  However,
	      some tools may not work with split alignments.  One may consider
	      to use option -M to flag shorter split hits as secondary.

	      ALGORITHM	OPTIONS:

	      -t INT	Number of threads [1]

	      -k INT	Minimum	 seed length. Matches shorter than INT will be
			missed.	The alignment speed is usually insensitive  to
			this  value  unless it significantly deviates from 20.
			[19]

	      -w INT	Band width. Essentially, gaps longer than INT will not
			be found. Note that the	maximum	gap length is also af-
			fected by the scoring matrix and the hit  length,  not
			solely determined by this option. [100]

	      -d INT	Off-diagonal  X-dropoff	 (Z-dropoff).  Stop  extension
			when the difference between the	best and  the  current
			extension  score  is  above |i-j|*A+INT, where i and j
			are the	current	positions of the query and  reference,
			respectively,  and  A is the matching score. Z-dropoff
			is similar to BLAST's X-dropoff	except that it doesn't
			penalize gaps in one of	the sequences  in  the	align-
			ment. Z-dropoff	not only avoids	unnecessary extension,
			but  also  reduces  poor alignments inside a long good
			alignment. [100]

	      -r FLOAT	Trigger	 re-seeding  for  a  MEM  longer   than	  min-
			SeedLen*FLOAT.	 This is a key heuristic parameter for
			tuning the  performance.  Larger  value	 yields	 fewer
			seeds, which leads to faster alignment speed but lower
			accuracy. [1.5]

	      -c INT	Discard	 a  MEM	 if it has more	than INT occurrence in
			the genome. This is an insensitive parameter. [500]

	      -D FLOAT	Drop chains shorter than FLOAT fraction	of the longest
			overlapping chain [0.5]

	      -m INT	Perform	at most	INT rounds of mate-SW [50]

	      -W INT	Drop a chain if	 the  number  of  bases	 in  seeds  is
			smaller	 than  INT.  This option is primarily used for
			longer contigs/reads. When positive, it	 also  affects
			seed filtering.	[0]

	      -P	In  the	 paired-end mode, perform SW to	rescue missing
			hits only but do not try  to  find  hits  that	fit  a
			proper pair.

	      SCORING OPTIONS:

	      -A INT	Matching score.	[1]

	      -B INT	Mismatch  penalty. The sequence	error rate is approxi-
			mately:	{.75 * exp[-log(4) * B/A]}. [4]

	      -O INT[,INT]
			Gap open penalty. If two numbers  are  specified,  the
			first  is  the	penalty	 of opening a deletion and the
			second for opening an insertion. [6]

	      -E INT[,INT]
			Gap extension penalty. If two numbers  are  specified,
			the  first  is the penalty of extending	a deletion and
			second for extending an	insertion. A gap of  length  k
			costs  O  + k*E	(i.e.  -O is for opening a zero-length
			gap). [1]

	      -L INT[,INT]
			Clipping penalty. When performing SW  extension,  BWA-
			MEM  keeps track of the	best score reaching the	end of
			query. If this score is	larger than the	best SW	 score
			minus  the  clipping penalty, clipping will not	be ap-
			plied. Note that in this case, the SAM AS tag  reports
			the best SW score; clipping penalty is not deduced. If
			two  numbers  are  provided,  the  first is for	5'-end
			clipping and second for	3'-end clipping. [5]

	      -U INT	Penalty	for an unpaired	read pair. BWA-MEM  scores  an
			unpaired  read	pair  as scoreRead1+scoreRead2-INT and
			scores	a  paired   as	 scoreRead1+scoreRead2-insert-
			Penalty.  It  compares	these  two scores to determine
			whether	we should force	pairing. A larger value	 leads
			to more	aggressive read	pair. [17]

	      -x STR	Read type. Changes multiple parameters unless overrid-
			den [null]

			pacbio:	  -k17	-W40  -r10 -A1 -B1 -O1 -E1 -L0 (PacBio
				  reads	to ref)

			ont2d:	  -k14 -W20 -r10 -A1 -B1 -O1 -E1  -L0  (Oxford
				  Nanopore 2D-reads to ref)

			intractg: -B9 -O16 -L5 (intra-species contigs to ref)

	      INPUT/OUTPUT OPTIONS:

	      -p	Smart  pairing.	 If  two  adjacent reads have the same
			name, they are considered to form a  read  pair.  This
			way, paired-end	and single-end reads can be mixed in a
			single FASTA/Q stream.

	      -R STR	Complete  read	group header line. '\t'	can be used in
			STR and	will be	converted to a TAB in the output  SAM.
			The  read  group  ID will be attached to every read in
			the  output.  An  example  is	'@RG\tID:foo\tSM:bar'.
			[null]

	      -H ARG	If  ARG	 starts	 with @, it is interpreted as a	string
			and gets inserted into the output SAM  header;	other-
			wise,  ARG  is	interpreted  as	 a file	with all lines
			starting with @	in the	file  inserted	into  the  SAM
			header.	[null]

	      -o FILE	Write  the output SAM file to FILE.  For compatibility
			with other BWA commands, this option may also be given
			as -f FILE.  [standard output]

	      -q
			 Don't reduce the mapping quality of  split  alignment
			of lower alignment score.

	      -5	For  split alignment, mark the segment with the	small-
			est coordinate as the primary.	It  automatically  ap-
			plies option -q	as well. This option may help some Hi-
			C pipelines. By	default, BWA-MEM marks highest scoring
			segment	as primary.

	      -K  INT	Process	 INT  input  bases in each batch regardless of
			the number of threads in use [10000000*nThreads].   By
			default,  the batch size is proportional to the	number
			of threads in use.  Because the	inferred  insert  size
			distribution slightly depends on the batch size, using
			different number of threads may	produce	different out-
			put.  Specifying this option helps reproducibility.

	      -T INT	Don't  output  alignment  with	score  lower than INT.
			This option affects output and occasionally  SAM  flag
			2. [30]

	      -j	Treat  ALT  contigs  as	 part  of the primary assembly
			(i.e. ignore the db.prefix.alt file).

	      -h INT[,INT2]
			If a query has not  more  than	INT  hits  with	 score
			higher	than  80%  of the best hit, output them	all in
			the XA tag.  If	INT2 is	specified, BWA-MEM outputs  up
			to INT2	hits if	the list contains a hit	to an ALT con-
			tig. [5,200]

	      -a	Output all found alignments for	single-end or unpaired
			paired-end  reads. These alignments will be flagged as
			secondary alignments.

	      -C	Append FASTA/Q comment to SAM output. This option  can
			be  used  to transfer read meta	information (e.g. bar-
			code) to the SAM output. Note that the FASTA/Q comment
			(the string after a space in  the  header  line)  must
			conform	the SAM	spec (e.g. BC:Z:CGTAC).	Malformed com-
			ments lead to incorrect	SAM output.

	      -Y	Use  soft  clipping  CIGAR operation for supplementary
			alignments. By default,	BWA-MEM	uses soft clipping for
			the primary alignment and hard clipping	for supplemen-
			tary alignments.

	      -M	Mark shorter split hits	as secondary

	      -v INT	Control	the verbosity level of the output. This	option
			has not	been fully supported throughout	BWA.  Ideally,
			a  value  0  for disabling all the output to stderr; 1
			for outputting errors only; 2 for warnings and errors;
			3 for all normal messages; 4 or	higher for  debugging.
			When this option takes value 4,	the output is not SAM.
			[3]

	      -I FLOAT[,FLOAT[,INT[,INT]]]
			Specify	 the mean, standard deviation (10% of the mean
			if absent), max	(4 sigma from the mean if absent)  and
			min  (4	 sigma if absent) of the insert	size distribu-
			tion. Only applicable to the FR	 orientation.  By  de-
			fault,	BWA-MEM	infers these numbers and the pair ori-
			entations given	enough reads. [inferred]

       aln    bwa aln [-n maxDiff] [-o maxGapO]	[-e maxGapE] [-d nDelTail] [-i
	      nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN]  [-M
	      misMsc]  [-O  gapOsc]  [-E  gapEsc]  [-q trimQual] <in.db.fasta>
	      <in.query.fq> > <out.sai>

	      Find the SA coordinates of the input reads. Maximum  maxSeedDiff
	      differences  are	allowed	 in  the first seedLen subsequence and
	      maximum maxDiff differences are allowed in the whole sequence.

	      OPTIONS:

	      -n NUM	Maximum	edit distance if the  value  is	 INT,  or  the
			fraction  of  missing alignments given 2% uniform base
			error rate if FLOAT. In	the latter case,  the  maximum
			edit  distance	is  automatically chosen for different
			read lengths. [0.04]

	      -o INT	Maximum	number of gap opens [1]

	      -e INT	Maximum	number of gap extensions, -1 for  k-difference
			mode (disallowing long gaps) [-1]

	      -d INT	Disallow  a  long  deletion  within INT	bp towards the
			3'-end [16]

	      -i INT	Disallow an indel within INT bp	towards	the ends [5]

	      -l INT	Take the first INT subsequence	as  seed.  If  INT  is
			larger	than  the query	sequence, seeding will be dis-
			abled. For long	reads, this option is typically	ranged
			from 25	to 35 for `-k 2'. [inf]

	      -k INT	Maximum	edit distance in the seed [2]

	      -t INT	Number of threads (multi-threading mode) [1]

	      -M INT	Mismatch penalty. BWA will not search  for  suboptimal
			hits with a score lower	than (bestScore-misMsc). [3]

	      -O INT	Gap open penalty [11]

	      -E INT	Gap extension penalty [4]

	      -R INT	Proceed	 with  suboptimal  alignments  if there	are no
			more than INT equally best hits. This option only  af-
			fects  paired-end  mapping.  Increasing	this threshold
			helps to improve the pairing accuracy at the  cost  of
			speed, especially for short reads (~32bp).

	      -c	Reverse	query but not complement it, which is required
			for  alignment	in  the	 color	space. (Disabled since
			0.6.x)

	      -N	Disable	iterative search. All hits with	no  more  than
			maxDiff	 differences  will be found. This mode is much
			slower than the	default.

	      -q INT	Parameter for read trimming. BWA trims a read down  to
			argmax_x{\sum_{i=x+1}^l(INT-q_i)}  if  q_l<INT where l
			is the original	read length. [0]

	      -I	The input is in	the Illumina 1.3+ read format (quality
			equals ASCII-64).

	      -B INT	Length of barcode starting from	the 5'-end.  When  INT
			is  positive, the barcode of each read will be trimmed
			before mapping and will	be written at the BC SAM  tag.
			For  paired-end	 reads,	the barcode from both ends are
			concatenated. [0]

	      -b	Specify	the input read sequence	file is	the  BAM  for-
			mat.  For  paired-end data, two	ends in	a pair must be
			grouped	together and options -1	or -2 are usually  ap-
			plied  to  specify which end should be mapped. Typical
			command	lines for mapping pair-end  data  in  the  BAM
			format are:

			    bwa	aln ref.fa -b1 reads.bam > 1.sai
			    bwa	aln ref.fa -b2 reads.bam > 2.sai
			    bwa	sampe ref.fa 1.sai 2.sai reads.bam reads.bam >
			aln.sam

	      -0	When  -b  is  specified,  only use single-end reads in
			mapping.

	      -1	When -b	is specified, only use the  first  read	 in  a
			read  pair  in	mapping	(skip single-end reads and the
			second reads).

	      -2	When -b	is specified, only use the second  read	 in  a
			read pair in mapping.

       samse  bwa samse	[-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>

	      Generate	alignments  in	the SAM	format given single-end	reads.
	      Repetitive hits will be randomly chosen.

	      OPTIONS:

	      -n INT	Maximum	number of alignments to	output in the  XA  tag
			for reads paired properly. If a	read has more than INT
			hits, the XA tag will not be written. [3]

	      -r STR	Specify	   the	 read	group	in   a	 format	  like
			`@RG\tID:foo\tSM:bar'. [null]

       sampe  bwa sampe	[-a maxInsSize]	[-o maxOcc] [-n	maxHitPaired] [-N max-
	      HitDis] [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq>
	      >	<out.sam>

	      Generate alignments in the SAM format  given  paired-end	reads.
	      Repetitive read pairs will be placed randomly.

	      OPTIONS:

	      -a INT  Maximum insert size for a	read pair to be	considered be-
		      ing  mapped  properly.  Since 0.4.5, this	option is only
		      used when	there are not enough good alignment  to	 infer
		      the distribution of insert sizes.	[500]

	      -o INT  Maximum  occurrences  of a read for pairing. A read with
		      more occurrneces will be treated as a  single-end	 read.
		      Reducing this parameter helps faster pairing. [100000]

	      -P      Load  the	entire FM-index	into memory to reduce disk op-
		      erations (base-space reads only).	With this  option,  at
		      least 1.25N bytes	of memory are required,	where N	is the
		      length of	the genome.

	      -n INT  Maximum number of	alignments to output in	the XA tag for
		      reads paired properly. If	a read has more	than INT hits,
		      the XA tag will not be written. [3]

	      -N INT  Maximum number of	alignments to output in	the XA tag for
		      disconcordant  read  pairs  (excluding singletons). If a
		      read has more than INT hits, the	XA  tag	 will  not  be
		      written. [10]

	      -r STR  Specify	 the	read	group	in   a	 format	  like
		      `@RG\tID:foo\tSM:bar'. [null]

       bwasw  bwa  bwasw  [-a  matchScore]  [-b	 mmPen]	 [-q  gapOpenPen]  [-r
	      gapExtPen]  [-t nThreads]	[-w bandWidth] [-T thres] [-s hspIntv]
	      [-z zBest] [-N nHspRev]  [-c  thresCoef]	<in.db.fasta>  <in.fq>
	      [mate.fq]

	      Align  query  sequences  in  the	in.fq  file.  When  mate.fq is
	      present, perform paired-end alignment. The paired-end mode  only
	      works  for reads Illumina	short-insert libraries.	In the paired-
	      end mode,	BWA-SW may still output	split alignments but they  are
	      all  marked  as not properly paired; the mate positions will not
	      be written if the	mate has multiple local	hits.

	      OPTIONS:

	      -a INT	Score of a match [1]

	      -b INT	Mismatch penalty [3]

	      -q INT	Gap open penalty [5]

	      -r INT	Gap extension penalty. The penalty  for	 a  contiguous
			gap of size k is q+k*r.	[2]

	      -t INT	Number of threads in the multi-threading mode [1]

	      -w INT	Band width in the banded alignment [33]

	      -T INT	Minimum	score threshold	divided	by a [37]

	      -c FLOAT	Coefficient  for  threshold  adjustment	 according  to
			query length. Given an l-long query, the threshold for
			a hit to be retained is	a*max{T,c*log(l)}. [5.5]

	      -z INT	Z-best heuristics. Higher -z increases accuracy	at the
			cost of	speed. [1]

	      -s INT	Maximum	SA interval size for initiating	a seed.	Higher
			-s increases accuracy at the cost of speed. [3]

	      -N INT	Minimum	 number	 of  seeds  supporting	the  resultant
			alignment to skip reverse alignment. [5]

SAM ALIGNMENT FORMAT
       The  output  of	the  `aln'  command is binary and designed for BWA use
       only. BWA outputs the final  alignment  in  the	SAM  (Sequence	Align-
       ment/Map) format. Each line consists of:

     +-----+-------+----------------------------------------------------------+
     | Col | Field |			   Description			      |
     +-----+-------+----------------------------------------------------------+
     |	1  | QNAME | Query (pair) NAME					      |
     |	2  | FLAG  | bitwise FLAG					      |
     |	3  | RNAME | Reference sequence	NAME				      |
     |	4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
     |	5  | MAPQ  | MAPping Quality (Phred-scaled)			      |
     |	6  | CIAGR | extended CIGAR string				      |
     |	7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
     |	8  | MPOS  | 1-based Mate POSistion				      |
     |	9  | ISIZE | Inferred insert SIZE				      |
     | 10  | SEQ   | query SEQuence on the same	strand as the reference	      |
     | 11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
     | 12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
     +-----+-------+----------------------------------------------------------+

       Each bit	in the FLAG field is defined as:

	      +-----+--------+---------------------------------------+
	      |	Chr |  Flag  |		    Description		     |
	      +-----+--------+---------------------------------------+
	      |	 p  | 0x0001 | the read	is paired in sequencing	     |
	      |	 P  | 0x0002 | the read	is mapped in a proper pair   |
	      |	 u  | 0x0004 | the query sequence itself is unmapped |
	      |	 U  | 0x0008 | the mate	is unmapped		     |
	      |	 r  | 0x0010 | strand of the query (1 for reverse)   |
	      |	 R  | 0x0020 | strand of the mate		     |
	      |	 1  | 0x0040 | the read	is the first read in a pair  |
	      |	 2  | 0x0080 | the read	is the second read in a	pair |
	      |	 s  | 0x0100 | the alignment is	not primary	     |
	      |	 f  | 0x0200 | QC failure			     |
	      |	 d  | 0x0400 | optical or PCR duplicate		     |
	      |	 S  | 0x0800 | supplementary alignment		     |
	      +-----+--------+---------------------------------------+

       The Please check	<http://samtools.sourceforge.net> for the format spec-
       ification and the tools for post-processing the alignment.

       BWA generates the following optional fields. Tags starting with `X' are
       specific	to BWA.

	     +-----+--------------------------------------------------+
	     | Tag |			 Meaning		      |
	     +-----+--------------------------------------------------+
	     | NM  | Edit distance				      |
	     | MD  | Mismatching positions/bases		      |
	     | AS  | Alignment score				      |
	     | BC  | Barcode sequence				      |
	     | SA  | Supplementary alignments			      |
	     +-----+--------------------------------------------------+
	     | X0  | Number of best hits			      |
	     | X1  | Number of suboptimal hits found by	BWA	      |
	     | XN  | Number of ambiguous bases in the referenece      |
	     | XM  | Number of mismatches in the alignment	      |
	     | XO  | Number of gap opens			      |
	     | XG  | Number of gap extensions			      |
	     | XT  | Type: Unique/Repeat/N/Mate-sw		      |
	     | XA  | Alternative hits; format: /(chr,pos,CIGAR,NM;)*/ |
	     +-----+--------------------------------------------------+
	     | XS  | Suboptimal	alignment score			      |
	     | XF  | Support from forward/reverse alignment	      |
	     | XE  | Number of supporting seeds			      |
	     +-----+--------------------------------------------------+

       Note  that XO and XG are	generated by BWT search	while the CIGAR	string
       by Smith-Waterman alignment. These two tags may	be  inconsistent  with
       the CIGAR string. This is not a bug.

NOTES ON SHORT-READ ALIGNMENT
   Alignment Accuracy
       When  seeding is	disabled, BWA guarantees to find an alignment contain-
       ing maximum maxDiff differences including maxGapO gap  opens  which  do
       not  occur  within nIndelEnd bp towards either end of the query.	Longer
       gaps may	be found if maxGapE is positive, but it	is not	guaranteed  to
       find  all  hits.	When seeding is	enabled, BWA further requires that the
       first seedLen subsequence contains no  more  than  maxSeedDiff  differ-
       ences.

       When gapped alignment is	disabled, BWA is expected to generate the same
       alignment  as Eland version 1, the Illumina alignment program. However,
       as BWA change `N' in the	database sequence to random nucleotides,  hits
       to  these  random sequences will	also be	counted. As a consequence, BWA
       may mark	a unique hit as	a repeat, if the random	sequences happen to be
       identical to the	sequences which	should be unqiue in the	database.

       By default, if the best hit is not  highly  repetitive  (controlled  by
       -R), BWA	also finds all hits contains one more mismatch;	otherwise, BWA
       finds  all  equally  best  hits only. Base quality is NOT considered in
       evaluating hits.	In the paired-end mode,	BWA pairs all hits  it	found.
       It further performs Smith-Waterman alignment for	unmapped reads to res-
       cue  reads  with	a high erro rate, and for high-quality anomalous pairs
       to fix potential	alignment errors.

   Estimating Insert Size Distribution
       BWA estimates the insert	size distribution per 256*1024 read pairs.  It
       first  collects	pairs of reads with both ends mapped with a single-end
       quality 20 or higher and	then calculates	median (Q2), lower and	higher
       quartile	(Q1 and	Q3). It	estimates the mean and the variance of the in-
       sert  size distribution from pairs whose	insert sizes are within	inter-
       val [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair  con-
       sidered	to  be properly	paired (SAM flag 0x2) is calculated by solving
       equation	Phi((x-mu)/sigma)=x/L*p0, where	mu is the mean,	sigma  is  the
       standard	 error of the insert size distribution,	L is the length	of the
       genome, p0 is prior of anomalous	pair and Phi() is the standard cumula-
       tive distribution function. For mapping Illumina	short-insert reads  to
       the  human  genome, x is	about 6-7 sigma	away from the mean. Quartiles,
       mean, variance and x will be printed to the standard error output.

   Memory Requirement
       With bwtsw algorithm, 5GB memory	is required for	indexing the  complete
       human  genome  sequences.  For short reads, the aln command uses	~3.2GB
       memory and the sampe command uses ~5.4GB.

   Speed
       Indexing	the human genome sequences takes 3 hours with bwtsw algorithm.
       Indexing	smaller	genomes	with IS	algorithms  is	faster,	 but  requires
       more memory.

       The  speed  of alignment	is largely determined by the error rate	of the
       query sequences (r). Firstly, BWA runs much  faster  for	 near  perfect
       hits  than for hits with	many differences, and it stops searching for a
       hit with	l+2 differences	if a l-difference hit is found.	This means BWA
       will be very slow if r is high because in this case BWA	has  to	 visit
       hits  with  many	 differences  and looking for these hits is expensive.
       Secondly, the alignment algorithm behind	makes the speed	 sensitive  to
       [k log(N)/m], where k is	the maximum allowed differences, N the size of
       database	and m the length of a query. In	practice, we choose k w.r.t. r
       and therefore r is the leading factor. I	would not recommend to use BWA
       on data with r>0.02.

       Pairing	is  slower  for	 shorter reads.	This is	mainly because shorter
       reads have more spurious	hits and converting SA coordinates to  chromo-
       somal coordinates are very costly.

CHANGES	IN BWA-0.6
       Since  version  0.6,  BWA has been able to work with a reference	genome
       longer than 4GB.	 This feature makes it possible	to integrate the  for-
       ward  and  reverse complemented genome in one FM-index, which speeds up
       both BWA-short and BWA-SW. As a tradeoff, BWA uses more memory  because
       it has to keep all positions and	ranks in 64-bit	integers, twice	larger
       than 32-bit integers used in the	previous versions.

       The latest BWA-SW also works for	paired-end reads longer	than 100bp. In
       comparison  to  BWA-short,  BWA-SW tends	to be more accurate for	highly
       unique reads and	more robust to relative	 long  INDELs  and  structural
       variants.   Nonetheless,	 BWA-short usually has higher power to distin-
       guish the optimal hit from many suboptimal hits.	The choice of the map-
       ping algorithm may depend on the	application.

SEE ALSO
       BWA   website   <http://bio-bwa.sourceforge.net>,   Samtools    website
       <http://samtools.sourceforge.net>

AUTHOR
       Heng  Li	 at  the Sanger	Institute wrote	the key	source codes and inte-
       grated	the   following	  codes	  for	 BWT	construction:	 bwtsw
       <http://i.cs.hku.hk/~ckwong3/bwtsw/>,  implemented by Chi-Kwong Wong at
       the	 University	  of	   Hong	      Kong	 and	    IS
       <http://yuta.256.googlepages.com/sais>  originally  proposed by Nong Ge
       <http://www.cs.sysu.edu.cn/nong/> at the	Sun Yat-Sen University and im-
       plemented by Yuta Mori.

LICENSE	AND CITATION
       The full	BWA package is distributed under GPLv3 as it uses source codes
       from BWT-SW which is covered by GPL. Sorting, hash table,  BWT  and  IS
       libraries are distributed under the MIT license.

       If  you	use the	BWA-backtrack algorithm, please	cite the following pa-
       per:

       Li H. and Durbin	R. (2009) Fast and accurate short read alignment  with
       Burrows-Wheeler	 transform.   Bioinformatics,  25,  1754-1760.	[PMID:
       19451168]

       If you use the BWA-SW algorithm,	please cite:

       Li H. and Durbin	R. (2010) Fast and accurate long-read  alignment  with
       Burrows-Wheeler	 transform.   Bioinformatics,	26,   589-595.	[PMID:
       20080505]

       If you use BWA-MEM or the fastmap component of BWA, please cite:

       Li H. (2013) Aligning sequence reads, clone sequences and assembly con-
       tigs with BWA-MEM. arXiv:1303.3997v1 [q-bio.GN].

       It is likely that the BWA-MEM manuscript	will not appear	in a  peer-re-
       viewed journal.

HISTORY
       BWA  is	largely	influenced by BWT-SW. It uses source codes from	BWT-SW
       and mimics its binary file formats; BWA-SW resembles BWT-SW in  several
       ways.  The  initial  idea  about	BWT-based alignment also came from the
       group who developed BWT-SW. At the same time, BWA is  different	enough
       from  BWT-SW. The short-read alignment algorithm	bears no similarity to
       Smith-Waterman algorithm	any more. While	BWA-SW learns from BWT-SW,  it
       introduces  heuristics that can hardly be applied to the	original algo-
       rithm. In all, BWA does not guarantee to	find all local	hits  as  what
       BWT-SW  is  designed  to	 do, but it is much faster than	BWT-SW on both
       short and long query sequences.

       I started to write the first piece of codes on 24 May 2008 and got  the
       initial	stable	version	on 02 June 2008. During	this period, I was ac-
       quainted	that Professor Tak-Wah Lam, the	first author of	BWT-SW	paper,
       was collaborating with Beijing Genomics Institute on SOAP2, the succes-
       sor  to	SOAP  (Short Oligonucleotide Analysis Package).	SOAP2 has come
       out in November 2008. According to the SourceForge download  page,  the
       third  BWT-based	 short read aligner, bowtie, was first released	in Au-
       gust 2008. At the time of writing this manual, at least three more BWT-
       based short-read	aligners are being implemented.

       The BWA-SW algorithm is a new component of BWA. It was conceived	in No-
       vember 2008 and implemented ten months later.

       The BWA-MEM algorithm is	based on an  algorithm	finding	 super-maximal
       exact  matches (SMEMs), which was first published with the fermi	assem-
       bler paper in 2012. I first implemented the basic SMEM algorithm	in the
       fastmap command for an experiment and then extended the basic algorithm
       and added the extension part in Feburary	2013 to	make BWA-MEM  a	 fully
       featured	mapper.

bwa-0.7.19-r1273		 22 March 2025				bwa(1)

Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=bwa&sektion=1&manpath=FreeBSD+Ports+14.3.quarterly>

home | help