Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
samtools-consensus(1)	     Bioinformatics tools	 samtools-consensus(1)

NAME
       samtools	consensus - produces a consensus FASTA/FASTQ/PILEUP

SYNOPSIS
       samtools	 consensus  [-saAMq] [-r region] [-f format] [-l line-len] [-d
       min-depth] [-C cutoff] [-c call-fract] [-H het-fract] in.bam

DESCRIPTION
       Generate	consensus from a SAM, BAM or CRAM file based on	 the  contents
       of  the	alignment  records.  The consensus is written either as	FASTA,
       FASTQ, or a pileup oriented format.  This is selected using the -f FOR-
       MAT option.

       The default output for FASTA and	FASTQ formats  include	one  base  per
       non-gap consensus.  Hence insertions with respect to the	aligned	refer-
       ence  will  be  included	 and deletions removed.	 This behaviour	can be
       controlled with the --show-ins and --show-del options.  This  could  be
       used  to	 compute  a new	reference from sequences assemblies to realign
       against.

       The pileup-style	format strictly	adheres	to one row per consensus loca-
       tion, differing from the	one row	per reference based used  in  the  re-
       lated  "samtools	 mpileup" command.  This means the base	quality	values
       for inserted columns are	reported.  The base quality value of gaps (ei-
       ther within an insertion	or otherwise) are determined as	the average of
       the surrounding non-gap bases.  The columns  shown  are	the  reference
       name,  position,	 nth base at that position (zero if not	an insertion),
       consensus call, consensus confidence, sequences and quality values.

       Two consensus calling algorithms	are offered.  The default  computes  a
       heterozygous  consensus	in  a Bayesian manner, derived from the	"Gap5"
       consensus algorithm.  Quality values are	also tweaked to	take into  ac-
       count  other nearby low quality values.	This can also be disabled, us-
       ing the --no-adj-qual option.

       This method also	utilises the mapping qualities,	unless the --no-use-MQ
       option is used.	Mapping	qualities are also auto-scaled	to  take  into
       account	the  local reference variation by processing the MD:Z tag, un-
       less --no-adj-MQ	is used.  Mapping qualities can	be  capped  between  a
       minimum	(--low-MQ)  and	maximum	(--high-MQ), although the defaults are
       liberal and trust the data to be	true.  Finally an overall scale	on the
       resulting mapping quality can be	supplied  (--scale-MQ,	defaulting  to
       1.0).   This has	the effect of favouring	more calls with	a higher false
       positive	rate (values greater than 1.0) or  being  more	cautious  with
       higher  false negative rates and	lower false positive (values less than
       1.0).

       The second method is a simple frequency counting	algorithm, summing ei-
       ther +1 for each	base type or +qual if the --use-qual option is	speci-
       fied.  This is enabled with the --mode simple option.

       The  summed  share of a specific	base type is then compared against the
       total possible and if this is above the --call-fract fraction parameter
       then the	most likely base type is called, or "N"	otherwise  (or	absent
       if  it  is  a gap).  The	--ambig	option permits generation of ambiguity
       codes instead of	"N", provided the minimum fraction of the second  most
       common base type	to the most common is above the	--het-fract fraction.

OPTIONS
       General options that apply to both algorithms:

       -r REG, --region	REG
		 Limit the query to region REG.	 This requires an index.

       -f FMT, --format	FMT
		 Produce  format  FMT,	with  "fastq", "fasta" and "pileup" as
		 permitted options.

       -l N, --line-len	N
		 Sets the maximum line length of line-wrapped fasta and	 fastq
		 formats to N.

       -o FILE,	--output FILE
		 Output	consensus to FILE instead of stdout.

       -m STR, --mode STR
		 Select	 the  consensus	 algorithm.   Valid modes are "simple"
		 frequency counting and	the "bayesian"	(Gap5)	methods,  with
		 Bayesian  being  the default.	(Note case does	not matter, so
		 "Bayesian" is accepted	too.)  There are a variety of bayesian
		 methods.  Straight "bayesian" is the best  set	 suitable  for
		 the  other parameters selected.  The choice of	internal para-
		 meters	may change depending on	the "--P-indel"	 score.	  This
		 method	 distinguishes	between	 substitution  and indel error
		 rates.	 The old Samtools consensus in version	1.16  did  not
		 distinguish  types  of	 errors,  but  for  compatibility  the
		 "bayesian_116"	mode may be selected to	replicate this.

       -a	 Outputs all bases, from start to end of reference, even  when
		 the  aligned  data does not extend to the ends.  This is most
		 useful	for construction of a full length reference sequence.

       -a -a, -aa
		 Output	absolutely all positions, including references with no
		 data aligned against them.

       --rf, --incl-flags STR|INT
		 Only include reads with at least one FLAG bit set.   Defaults
		 to zero, which	filters	no reads.

       --ff, --excl-flags STR|INT
		 Exclude reads with any	FLAG bit set.  Defaults	to "UNMAP,SEC-
		 ONDARY,QCFAIL,DUP".

       --min-MQ	INT
		 Filters out reads with	a mapping quality below	INT.  This de-
		 faults	to zero.

       --min-BQ	INT
		 Filters  out  bases  with a base quality below	INT.  This de-
		 faults	to zero.

       --show-del yes/no
		 Whether to show deletions as "*" (yes)	or to  omit  from  the
		 output	(no).  Defaults	to no.

       --show-ins yes/no
		 Whether  to  show  insertions	in the consensus.  Defaults to
		 yes.

       --mark-ins
		 Insertions, when shown, are normally recorded in the  consen-
		 sus  with  plain 7-bit	ASCII (ACGT, or	acgt if	heterozygous).
		 However this makes it impossible to identify the mapping  be-
		 tween	consensus coordinates and the original reference coor-
		 dinates.  If fasta output is selected then the	option adds an
		 underscore before every inserted base,	plus  a	 corresponding
		 character in the quality for fastq format.  When used in con-
		 junction with -a --show-del yes, this permits an easy deriva-
		 tion of the consensus to reference coordinate mapping.

       -A, --ambig
		 Enables IUPAC ambiguity codes in the consensus	output.	 With-
		 out this the output will be limited to	A, C, G, T, N and *.

       -d D, --min-depth D
		 The  minimum  depth  required to make a call.	Defaults to 1.
		 Failing this depth check will produce consensus "N",  or  ab-
		 sent if it is an insertion.  Note this	check is performed af-
		 ter filtering by flags	and mapping/base quality.

       The following options apply only	to the simple consensus	mode:

       -q, --use-qual
		 For  the simple consensus algorithm, this enables use of base
		 quality values.  Instead of summing 1	per  base  called,  it
		 sums  the  base quality instead.  These sums are also used in
		 the --call-fract and  --het-fract  parameters	too.   Quality
		 values	 are  always  used for the "Gap5" consensus method and
		 this option has no affect.  Note  currently   quality	values
		 only  affect SNPs and not inserted sequences, which still get
		 scores	with a fixed +1	per base type occurrence.

       -H H, --het-fract H
		 For consensus columns containing multiple base	types, if  the
		 second	 most frequent type is at least	H fraction of the most
		 common	type then a heterozygous base type will	be reported in
		 the consensus.	 Otherwise the most common base	is used,  pro-
		 vided	it  meets  the --call-fract parameter (otherwise "N").
		 The fractions computed	may be modified	by the use of  quality
		 values	 if the	-q option is enabled.  Note although IUPAC has
		 ambiguity codes for A,C,G,T vs	any other A,C,G,T it does  not
		 have  codes  for  A,C,G,T  vs	gap (such as in	a heterozygous
		 deletion).  Given the lack  of	 any  official	code,  we  use
		 lower-case letter to symbolise	a half-present base type.

       -c C, --call-fract C
		 Only  used  for  the  simple consensus	algorithm.  Require at
		 least C fraction of bases agreeing with the most likely  con-
		 sensus	 call  to emit that base type.	This defaults to 0.75.
		 Failing this check will output	"N".

       The following options apply only	to Bayesian consensus mode enabled
       (default	on).

       -C C, --cutoff C
		 Only used for the Gap5	consensus mode,	which produces a Phred
		 style score for the final consensus quality.  If this is  be-
		 low C then the	consensus is called as "N".

       --use-MQ, --no-use-MQ
		 Enable	 or disable the	use of mapping qualities.  Defaults to
		 on.

       --adj-MQ, --no-adj-MQ
		 If mapping qualities are used,	this controls whether they are
		 scaled	by the local number of mismatches  to  the  reference.
		 The  reference	 is  unknown by	this tool, so this data	is ob-
		 tained	from  the  MD:Z	 auxiliary  tag	 (or  ignored  if  not
		 present).  Defaults to	on.

       --NM-halo INT
		 Specifies  the	 distance  either  side	of the base call being
		 considered for	computing the number of	local mismatches.

       --low-MQ	MIN, --high-MQ MAX
		 Specifies a minimum and maximum value of the mapping quality.
		 These are not filters and instead simply put upper and	 lower
		 caps on the values.  The defaults are 0 and 60.

       --scale-MQ FLOAT
		 This  is  a  general  multiplicative  mapping quality scaling
		 factor.  The effect is	to globally raise or lower the quality
		 values	used in	the consensus  algorithm.   Defaults  to  1.0,
		 which leaves the values unchanged.

       --P-het FLOAT
		 Controls  the likelihood of any position being	a heterozygous
		 site.	This is	used in	the priors for the  Bayesian  calcula-
		 tions,	 and  has little difference on deep data.  Defaults to
		 1e-3.	Smaller	numbers	makes the  algorithm  more  likely  to
		 call  a  pure base type.  Note	the algorithm will always com-
		 pute the probability of the base  being  homozygous  vs  het-
		 erozygous,  irrespective of whether the output	is reported as
		 ambiguous (it will be "N" if deemed to	be heterozygous	 with-
		 out --ambig mode enabled).

       --P-indel FLOAT
		 Controls the likelihood of small indels.  This	is used	in the
		 priors	 for the Bayesian calculations,	and has	little differ-
		 ence on deep data.  Defaults to 2e-4.

       --het-scale FLOAT
		 This is a multiplicative correction applied per base  quality
		 before	 adding	 to  the heterozygous hypotheses.  Reducing it
		 means fewer heterozygous calls	are made.  This	oftens leads a
		 significant reduction in false	positive het calls,  for  some
		 increase  in  false negatives (mislabelling real heterozygous
		 sites as homozygous).	It is  usually	beneficial  to	reduce
		 this  on  instruments where a significant proportion of bases
		 may be	aligned	in the wrong  column  due  to  insertions  and
		 deletions leading to alignment	errors and reference bias.  It
		 can be	considered as a	het sensitivity	tuning parameter.  De-
		 faults	to 1.0 (nop).

       -p, --homopoly-fix
		 Some  technologies  that  call	runs of	the same base type to-
		 gether	always put the lowest quality calls at one end.	  This
		 can  cause  problems when reverse complementing and comparing
		 alignments with indels.  This option averages	the  qualities
		 at  both  ends	 to avoid orientation biases.  Recommended for
		 old 454 or PacBio HiFi	data sets.

       --homopoly-score	FLOAT
		 The -p	option also reduces confidence values within homopoly-
		 mers due to an	additional likelihood of sequence specific er-
		 rors.	The quality values are multiplied by FLOAT.  This  de-
		 faults	 to 0.5, but is	not used if -p was not specified.  Ad-
		 justing this score also automatically enables -p.

       -t, --qual-calibration FILE
		 Loads a quality calibration table from	FILE.  The  format  of
		 this  is  a  series  of lines with the	following fields, each
		 starting with the literal text	"QUAL":

		     QUAL value	substitution undercall overcall

		 Lines starting	with a "#" are	ignored.   Each	 line  maps  a
		 recorded quality value	to the Phred equivalent	score for sub-
		 stitution, undercall and overcall errors.  Quality values are
		 expected  to be sorted	in increasing numerical	order, but may
		 skip values.  This allows the consensus algorithm to know the
		 most likely cause of an error,	and whether the	instrument  is
		 more  likely  to  have	indel errors (more common in some long
		 read technologies) or substitution  errors  (more  common  in
		 clocked short-read instruments).

		 Some  pre-defined calibration tables are built	in.  These are
		 specified with	a fake filename	starting with  a  colon.   See
		 the -X	option for more	details.

		 Note  due to the additional heuristics	applied	by the consen-
		 sus algorithm,	these recalibration tables are not a true  re-
		 flection  of  the  instrument	error  rates and are a work in
		 progress.

       -X, --config STR
		 Specifies predefined sets of configuration  parameters.   Ac-
		 ceptable  values  for	STR  are defined below,	along with the
		 list of parameters they are equivalent	to.

		 hiseq	   --qual-calibration :hiseq

		 hifi	   --qual-calibration :hifi --homopoly-fix 0.3	--low-
			   MQ 5	--scale-MQ 1.5 --het-scale 0.37

		 r10.4_sup --qual-calibration  :r10.4_sup  --homopoly-fix  0.3
			   --low-MQ 5 --scale-MQ 1.5 --het-scale 0.37

		 r10.4_dup --qual-calibration  :r10.4_dup  --homopoly-fix  0.3
			   --low-MQ 5 --scale-MQ 1.5 --het-scale 0.37

		 ultima	   --qual-calibration	:ultima	  --homopoly-fix   0.3
			   --low-MQ 10 --scale-MQ 2 --het-scale	0.37

EXAMPLES
       -      Create a modified	FASTA reference	that has a 1:1 coordinate cor-
	      respondence with the original reference used in alignment.

		samtools consensus -a --show-ins no in.bam -o ref.fa

       -      Create a FASTQ file for the contigs with aligned data, including
	      insertions.

		samtools consensus -f fastq in.bam -o cons.fq

AUTHOR
       Written by James	Bonfield from the Sanger Institute.

SEE ALSO
       samtools(1), samtools-mpileup(1),

       Samtools	website: <http://www.htslib.org/>

samtools-1.21		       12 September 2024	 samtools-consensus(1)

Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=samtools-consensus&sektion=1&manpath=FreeBSD+Ports+14.3.quarterly>

home | help