Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
samtools-consensus(1)	     Bioinformatics tools	 samtools-consensus(1)

NAME
       samtools	consensus - produces a consensus FASTA/FASTQ/PILEUP

SYNOPSIS
       samtools	 consensus  [-saAMq] [-r region] [-f format] [-l line-len] [-d
       min-depth] [-C cutoff] [-c call-fract] [-H het-fract] in.bam

DESCRIPTION
       Generate	consensus from a SAM, BAM or CRAM file based on	 the  contents
       of  the	alignment  records.  The consensus is written either as	FASTA,
       FASTQ, or a pileup oriented format.  This is selected using the -f FOR-
       MAT option.

       The default output for FASTA and	FASTQ formats  include	one  base  per
       non-gap consensus.  Hence insertions with respect to the	aligned	refer-
       ence  will  be  included	 and deletions removed.	 This behaviour	can be
       controlled with the --show-ins and --show-del options.  This  could  be
       used  to	 compute  a new	reference from sequences assemblies to realign
       against.

       The pileup-style	format strictly	adheres	to one row per consensus loca-
       tion, differing from the	one row	per reference based used  in  the  re-
       lated  "samtools	 mpileup" command.  This means the base	quality	values
       for inserted columns are	reported.  The base quality value of gaps (ei-
       ther within an insertion	or otherwise) are determined as	the average of
       the surrounding non-gap bases.  The columns  shown  are	the  reference
       name,  position,	 nth base at that position (zero if not	an insertion),
       consensus call, consensus confidence,  sequences	 and  quality  values.
       Note  even  when	 a reference is	supplied, the consensus	base is	always
       report (if non-zero depth) in the 5th column.

       Two consensus calling algorithms	are offered.  The default  computes  a
       heterozygous  consensus	in  a Bayesian manner, derived from the	"Gap5"
       consensus algorithm.  Quality values are	also tweaked to	take into  ac-
       count  other nearby low quality values.	This can also be disabled, us-
       ing the --no-adj-qual option.

       This method also	utilises the mapping qualities,	unless the --no-use-MQ
       option is used.	Mapping	qualities are also auto-scaled	to  take  into
       account	the  local reference variation by processing the MD:Z tag, un-
       less --no-adj-MQ	is used.  Mapping qualities can	be  capped  between  a
       minimum	(--low-MQ)  and	maximum	(--high-MQ), although the defaults are
       liberal and trust the data to be	true.  Finally an overall scale	on the
       resulting mapping quality can be	supplied  (--scale-MQ,	defaulting  to
       1.0).   This has	the effect of favouring	more calls with	a higher false
       positive	rate (values greater than 1.0) or  being  more	cautious  with
       higher  false negative rates and	lower false positive (values less than
       1.0).

       The second method is a simple frequency counting	algorithm, summing ei-
       ther +1 for each	base type or +qual if the --use-qual option is	speci-
       fied.  This is enabled with the --mode simple option.

       The  summed  share of a specific	base type is then compared against the
       total possible and if this is above the --call-fract fraction parameter
       then the	most likely base type is called, or "N"	otherwise  (or	absent
       if  it  is  a gap).  The	--ambig	option permits generation of ambiguity
       codes instead of	"N", provided the minimum fraction of the second  most
       common base type	to the most common is above the	--het-fract fraction.

OPTIONS
       General options that apply to both algorithms:

       -r REG, --region	REG
		 Limit the query to region REG.	 This requires an index.

       -f FMT, --format	FMT
		 Produce  format  FMT,	with  "fastq", "fasta" and "pileup" as
		 permitted options.

       -l N, --line-len	N
		 Sets the maximum line length of line-wrapped fasta and	 fastq
		 formats to N.

       -o FILE,	--output FILE
		 Output	consensus to FILE instead of stdout.

       -m STR, --mode STR
		 Select	 the  consensus	 algorithm.   Valid modes are "simple"
		 frequency counting and	the "bayesian"	(Gap5)	methods,  with
		 Bayesian  being  the default.	(Note case does	not matter, so
		 "Bayesian" is accepted	too.)  There are a variety of bayesian
		 methods.  Straight "bayesian" is the best  set	 suitable  for
		 the  other parameters selected.  The choice of	internal para-
		 meters	may change depending on	the "--P-indel"	 score.	  This
		 method	 distinguishes	between	 substitution  and indel error
		 rates.	 The old Samtools consensus in version	1.16  did  not
		 distinguish  types  of	 errors,  but  for  compatibility  the
		 "bayesian_116"	mode may be selected to	replicate this.

       -a	 Outputs all bases, from start to end of reference, even  when
		 the  aligned  data does not extend to the ends.  This is most
		 useful	for construction of a full length reference sequence.

       -a -a, -aa
		 Output	absolutely all positions, including references with no
		 data aligned against them.

       --rf, --incl-flags STR|INT
		 Only include reads with at least one FLAG bit set.   Defaults
		 to zero, which	filters	no reads.

       --ff, --excl-flags STR|INT
		 Exclude reads with any	FLAG bit set.  Defaults	to "UNMAP,SEC-
		 ONDARY,QCFAIL,DUP".

       --min-MQ	INT
		 Filters out reads with	a mapping quality below	INT.  This de-
		 faults	to zero.

       --min-BQ	INT
		 Filters  out  bases  with a base quality below	INT.  This de-
		 faults	to zero.

       --show-del yes/no
		 Whether to show deletions as "*" (yes)	or to  omit  from  the
		 output	(no).  Defaults	to no.

       --show-ins yes/no
		 Whether  to  show  insertions	in the consensus.  Defaults to
		 yes.

       --mark-ins
		 Insertions, when shown, are normally recorded in the  consen-
		 sus  with  plain 7-bit	ASCII (ACGT, or	acgt if	heterozygous).
		 However this makes it impossible to identify the mapping  be-
		 tween	consensus coordinates and the original reference coor-
		 dinates.  If fasta output is selected then the	option adds an
		 underscore before every inserted base,	plus  a	 corresponding
		 character in the quality for fastq format.  When used in con-
		 junction with -a --show-del yes, this permits an easy deriva-
		 tion of the consensus to reference coordinate mapping.

       -A, --ambig
		 Enables IUPAC ambiguity codes in the consensus	output.	 With-
		 out this the output will be limited to	A, C, G, T, N and *.

       -d D, --min-depth D
		 The  minimum  depth  required to make a call.	Defaults to 1.
		 Failing this depth check will produce consensus "N",  or  ab-
		 sent if it is an insertion.  Note this	check is performed af-
		 ter filtering by flags	and mapping/base quality.

       -T ref.fa, --reference ref.fa
		 For  base positions with zero coverage, use the supplied ref-
		 erence	instead	of "N".	 Note this does	 not  replace  minimum
		 depth	or  minimum  quality  filters as the base is known but
		 considiered low quality so the	ambiguity is retained.

       --ref-qual INT
		 When --reference is given this	specifies the quality value to
		 use for reference-derived bases.  This	defaults to zero.

       The following options apply only	to the simple consensus	mode:

       -q, --use-qual
		 For the simple	consensus algorithm, this enables use of  base
		 quality  values.   Instead  of	 summing 1 per base called, it
		 sums the base quality instead.	 These sums are	also  used  in
		 the  --call-fract  and	 --het-fract  parameters too.  Quality
		 values	are always used	for the	"Gap5"	consensus  method  and
		 this  option  has  no effect.	Note currently	quality	values
		 only affect SNPs and not inserted sequences, which still  get
		 scores	with a fixed +1	per base type occurrence.

       -H H, --het-fract H
		 For  consensus	columns	containing multiple base types,	if the
		 second	most frequent type is at least H fraction of the  most
		 common	type then a heterozygous base type will	be reported in
		 the  consensus.  Otherwise the	most common base is used, pro-
		 vided it meets	the --call-fract  parameter  (otherwise	 "N").
		 The  fractions	computed may be	modified by the	use of quality
		 values	if the -q option is enabled.  Note although IUPAC  has
		 ambiguity  codes for A,C,G,T vs any other A,C,G,T it does not
		 have codes for	A,C,G,T	vs gap	(such  as  in  a  heterozygous
		 deletion).   Given  the  lack	of  any	 official code,	we use
		 lower-case letter to symbolise	a half-present base type.

       -c C, --call-fract C
		 Only used for the simple  consensus  algorithm.   Require  at
		 least	C fraction of bases agreeing with the most likely con-
		 sensus	call to	emit that base type.  This defaults  to	 0.75.
		 Failing this check will output	"N".

       -@ NTHREADS
		 Specify the number of additional threads to use for computing
		 the consensus.	 Note if no index is present threads will only
		 be  used  for	parallel decompression meaning asking for more
		 than 2	threads	is unlikely to speed up	processing.   With  an
		 index the consensus is	computed for multiple regions simulta-
		 neously, offering near	linear speed ups.

       -Z BASE_COUNT
		 When  using  multiple	threads	 this  specifies the number of
		 bases per threading job.   The	 default  is  500,000  bp  for
		 fasta/fastq  output  and  100,000  for	pileup output.	Larger
		 blocks	may yield improved threading performance at a cost  of
		 more memory.

       The following options apply only	to Bayesian consensus mode enabled
       (default	on).

       -C C, --cutoff C
		 Only used for the Gap5	consensus mode,	which produces a Phred
		 style	score for the final consensus quality.	If this	is be-
		 low C then the	consensus is called as "N".

       --use-MQ, --no-use-MQ
		 Enable	or disable the use of mapping qualities.  Defaults  to
		 on.

       --adj-MQ, --no-adj-MQ
		 If mapping qualities are used,	this controls whether they are
		 scaled	 by  the  local	number of mismatches to	the reference.
		 The reference is unknown by this tool,	so this	 data  is  ob-
		 tained	 from  the  MD:Z  auxiliary  tag  (or  ignored	if not
		 present).  Defaults to	on.

       --NM-halo INT
		 Specifies the distance	either side of	the  base  call	 being
		 considered for	computing the number of	local mismatches.

       --low-MQ	MIN, --high-MQ MAX
		 Specifies a minimum and maximum value of the mapping quality.
		 These	are not	filters	and instead simply put upper and lower
		 caps on the values.  The defaults are 0 and 60.

       --scale-MQ FLOAT
		 This is a general  multiplicative   mapping  quality  scaling
		 factor.  The effect is	to globally raise or lower the quality
		 values	 used  in  the	consensus algorithm.  Defaults to 1.0,
		 which leaves the values unchanged.

       --P-het FLOAT
		 Controls the likelihood of any	position being a  heterozygous
		 site.	 This  is used in the priors for the Bayesian calcula-
		 tions,	and has	little difference on deep data.	  Defaults  to
		 1e-3.	 Smaller  numbers  makes  the algorithm	more likely to
		 call a	pure base type.	 Note the algorithm will  always  com-
		 pute  the  probability	 of  the base being homozygous vs het-
		 erozygous, irrespective of whether the	output is reported  as
		 ambiguous  (it	will be	"N" if deemed to be heterozygous with-
		 out --ambig mode enabled).

       --P-indel FLOAT
		 Controls the likelihood of small indels.  This	is used	in the
		 priors	for the	Bayesian calculations, and has little  differ-
		 ence on deep data.  Defaults to 2e-4.

       --het-scale FLOAT
		 This  is a multiplicative correction applied per base quality
		 before	adding to the heterozygous  hypotheses.	  Reducing  it
		 means fewer heterozygous calls	are made.  This	oftens leads a
		 significant  reduction	 in false positive het calls, for some
		 increase in false negatives (mislabelling  real  heterozygous
		 sites	as  homozygous).   It  is usually beneficial to	reduce
		 this on instruments where a significant proportion  of	 bases
		 may  be  aligned  in  the  wrong column due to	insertions and
		 deletions leading to alignment	errors and reference bias.  It
		 can be	considered as a	het sensitivity	tuning parameter.  De-
		 faults	to 1.0 (nop).

       -p, --homopoly-fix
		 Some technologies that	call runs of the same  base  type  to-
		 gether	 always	put the	lowest quality calls at	one end.  This
		 can cause problems when reverse complementing	and  comparing
		 alignments  with  indels.  This option	averages the qualities
		 at both ends to avoid orientation  biases.   Recommended  for
		 old 454 or PacBio HiFi	data sets.

       --homopoly-score	FLOAT
		 The -p	option also reduces confidence values within homopoly-
		 mers due to an	additional likelihood of sequence specific er-
		 rors.	 The quality values are	multiplied by FLOAT.  This de-
		 faults	to 0.5,	but is not used	if -p was not specified.   Ad-
		 justing this score also automatically enables -p.

       -t, --qual-calibration FILE
		 Loads	a  quality calibration table from FILE.	 The format of
		 this is a series of lines with	 the  following	 fields,  each
		 starting with the literal text	"QUAL":

		     QUAL value	substitution undercall overcall

		 Lines	starting  with	a  "#"	are ignored.  Each line	maps a
		 recorded quality value	to the Phred equivalent	score for sub-
		 stitution, undercall and overcall errors.  Quality values are
		 expected to be	sorted in increasing numerical order, but  may
		 skip values.  This allows the consensus algorithm to know the
		 most  likely cause of an error, and whether the instrument is
		 more likely to	have indel errors (more	common	in  some  long
		 read  technologies)  or  substitution	errors (more common in
		 clocked short-read instruments).

		 Some pre-defined calibration tables are built in.  These  are
		 specified  with  a  fake filename starting with a colon.  See
		 the -X	option for more	details.

		 Note due to the additional heuristics applied by the  consen-
		 sus  algorithm, these recalibration tables are	not a true re-
		 flection of the instrument error rates	 and  are  a  work  in
		 progress.

       -X, --config STR
		 Specifies  predefined	sets of	configuration parameters.  Ac-
		 ceptable values for STR are defined  below,  along  with  the
		 list of parameters they are equivalent	to.

		 hiseq	   --qual-calibration :hiseq

		 hifi	   --qual-calibration  :hifi --homopoly-fix 0.3	--low-
			   MQ 5	--scale-MQ 1.5 --het-scale 0.37

		 r10.4_sup --qual-calibration  :r10.4_sup  --homopoly-fix  0.3
			   --low-MQ 5 --scale-MQ 1.5 --het-scale 0.37

		 r10.4_dup --qual-calibration  :r10.4_dup  --homopoly-fix  0.3
			   --low-MQ 5 --scale-MQ 1.5 --het-scale 0.37

		 ultima	   --qual-calibration	:ultima	  --homopoly-fix   0.3
			   --low-MQ 10 --scale-MQ 2 --het-scale	0.37

EXAMPLES
       -      Create a modified	FASTA reference	that has a 1:1 coordinate cor-
	      respondence with the original reference used in alignment.

		samtools consensus -a --show-ins no --show-del yes in.bam -o ref.fa

       -      Create a FASTQ file for the contigs with aligned data, including
	      insertions.

		samtools consensus -f fastq in.bam -o cons.fq

AUTHOR
       Written by James	Bonfield from the Sanger Institute.

SEE ALSO
       samtools(1), samtools-mpileup(1),

       Samtools	website: <http://www.htslib.org/>

samtools-1.22			  30 May 2025		 samtools-consensus(1)

Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=samtools-consensus&sektion=1&manpath=FreeBSD+Ports+15.0>

home | help