FreeBSD Manual Pages

samtools-mpileup(1)	     Bioinformatics tools	   samtools-mpileup(1)

NAME
       samtools	mpileup	- produces "pileup" textual format from	an alignment

SYNOPSIS
       samtools	 mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-Q
       minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

DESCRIPTION
       Generate	text pileup output for one or multiple BAM files.  Each	 input
       file produces a separate	group of pileup	columns	in the output.

       Note that there are two orthogonal ways to specify locations in the
       input file: via -r region and -l file.  The former uses (and requires) an
       index  to  do  random  access while the latter streams through the file
       contents	filtering out the specified regions, requiring no index.   The
       two  may	be used	in conjunction.	 For example a BED file	containing lo-
       cations of genes	in chromosome 20 could be specified  using  -r	20  -l
       chr20.bed,  meaning  that  the  index is	used to	find chromosome	20 and
       then it is filtered for the regions listed in the bed file.

       Unmapped	reads are not considered and are always	discarded.  By default
       secondary alignments, QC	failures and duplicate reads will be  omitted,
       along with low quality bases and	some reads in high depth regions.  See
       the --ff, -Q and	-d options for changing	this.

   Pileup Format
       Pileup  format  consists	 of TAB-separated lines, with each line	repre-
       senting the pileup of reads at a	single genomic position.

       Several columns contain numeric quality values  encoded	as  individual
       ASCII  characters.  Each	character can range from "!" to	"~" and	is de-
       coded by	taking its ASCII value and subtracting 33; e.g.,  "A"  encodes
       the numeric value 32.
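
       The decoding rule above can be sketched in a few lines of Python.
       This is only an illustration; the function names are hypothetical and
       are not part of samtools.

```python
from typing import List


def decode_quality(ch: str) -> int:
    """Decode one pileup quality character: ASCII value minus 33."""
    if len(ch) != 1 or not ("!" <= ch <= "~"):
        raise ValueError(f"not a valid quality character: {ch!r}")
    return ord(ch) - 33


def decode_quality_string(column: str) -> List[int]:
    """Decode a whole quality column, one value per character."""
    return [decode_quality(c) for c in column]


if __name__ == "__main__":
    print(decode_quality("A"))            # the example from the text
    print(decode_quality_string("!A~"))   # lowest, example and highest characters
```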

       The first three columns give the	position and reference:

       O Chromosome name.

       O 1-based position on the chromosome.

       O Reference  base  at  this  position (this will	be "N" on all lines if
	 -f/--fasta-ref	has not	been used).

       The remaining columns show the pileup data, and are repeated  for  each
       input BAM file specified:

       O Number	of reads covering this position.

       O Read bases.  This encodes information on matches, mismatches, indels,
	 strand, mapping quality, and starts and ends of reads.

	 For each read covering	the position, this column contains:

	 • If  this is the first position covered by the read, a "^" character
	   followed by the alignment's mapping quality	encoded	 as  an	 ASCII
	   character.

	 • A single character indicating the read base and the strand to which
	   the read has	been mapped:
	   Forward   Reverse			Meaning
	   ---------------------------------------------------------------
	    . dot    , comma   Base matches the	reference base
	    ACGTN     acgtn    Base is a mismatch to the reference base
	      >		<      Reference skip (due to CIGAR "N")
	      *	       */#     Deletion	of the reference base (CIGAR "D")

	   Deleted bases are shown as "*" on both strands unless --reverse-del
	   is used, in which case they are shown as "#"	on the reverse strand.

	 • If  there  is  an  insertion	 after	this  read base, text matching
	   "\+[0-9]+[ACGTNacgtn*#]+": a	"+" character followed by  an  integer
	   giving  the length of the insertion and then	the inserted sequence.
	   Pads	are shown as "*" unless	--reverse-del is used, in  which  case
	   pads	on the reverse strand will be shown as "#".

	 • If  there  is  a  deletion  after  this  read  base,	 text matching
	   "-[0-9]+[ACGTNacgtn]+": a "-" character  followed  by  the  deleted
	   reference  bases  represented  similarly.  (Subsequent pileup lines
	   will	contain	"*" for	this read indicating the deleted bases.)

	 • If this is the last position	covered	by the read, a "$" character.

       O Base qualities, encoded as ASCII characters.

       O Alignment mapping qualities, encoded as  ASCII	 characters.   (Column
	 only present when -s/--output-MQ is used.)

       O Comma-separated  1-based positions within the alignments, in the ori-
	 entation shown	in the input file.  E.g., 5 indicates that it  is  the
	 fifth	base  of the corresponding read	that is	mapped to this genomic
	 position.  (Column only present when -O/--output-BP is	used.)

       O Additional comma-separated read field columns, as selected via
	 --output-extra.  The fields selected appear in the same order as in
	 SAM: QNAME, FLAG, RNAME, POS, MAPQ (displayed numerically), RNEXT,
	 PNEXT, followed by RLEN for unclipped read length.

       O Comma-separated 1-based positions within the alignments, in 5' to 3'
	 orientation.  E.g., 5 indicates that it is the fifth base of the
	 corresponding read, as produced by the sequencing instrument, that
	 is mapped to this genomic position.  (Column only present when
	 --output-BP-5 is used.)

       O Additional read tag field columns, as selected via --output-extra.
	 These columns are formatted as determined by --output-sep and
	 --output-empty (comma-separated by default), and appear in the same
	 order as the tags are given in --output-extra.

	 Any output column that would be empty, for example because a tag is
	 not present or the filtered sequence depth is zero, is reported as
	 "*".  This ensures a consistent number of columns across all
	 reported positions.
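
       The read bases column described above cannot be split with a simple
       regular expression, because the length after "+" or "-" says how many
       characters follow.  The Python sketch below tokenizes the column into
       one record per read.  It assumes default output (no --output-mods,
       --no-output-ins, --no-output-del or --no-output-ends) and well-formed
       input; the function name and record layout are hypothetical.  Callers
       should check the depth column first, since an empty column is
       reported as "*".

```python
import re


def parse_read_bases(bases: str):
    """Split a pileup read-bases column into one dict per read."""
    reads = []
    pending_mapq = None  # set by "^" and attached to the next base
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Read start: the next character is the mapping quality + 33.
            pending_mapq = ord(bases[i + 1]) - 33
            i += 2
        elif c == "$":
            # Read end: applies to the most recent base.
            reads[-1]["end"] = True
            i += 1
        elif c in "+-":
            # Indel after the most recent base: sign, length, sequence.
            digits = re.match(r"\d+", bases[i + 1:]).group()
            start = i + 1 + len(digits)
            seq = bases[start:start + int(digits)]
            reads[-1]["insertion" if c == "+" else "deletion"] = seq
            i = start + int(digits)
        else:
            # One of . , ACGTN acgtn > < * #
            reads.append({"base": c, "start_mapq": pending_mapq,
                          "end": False, "insertion": None, "deletion": None})
            pending_mapq = None
            i += 1
    return reads


if __name__ == "__main__":
    for read in parse_read_bases("^A.,+2acT$"):
        print(read)
```

       Skipping two characters after "^" matters: the mapping quality
       character may itself be "+", "-" or "$", and must not be read as
       markup.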

OPTIONS
       -6, --illumina1.3+
		 Assume	the quality is in the Illumina 1.3+ encoding.

       -A, --count-orphans
		 Do not	skip anomalous read pairs in variant  calling.	 Anom-
		 alous read pairs are those marked in the FLAG field as	paired
		 in sequencing but without the properly-paired flag set.

       -b, --bam-list FILE
		 List of input BAM files, one file per line [null]

       -B, --no-BAQ
		 Disable  base	alignment  quality (BAQ) computation.  See BAQ
		 below.

       -C, --adjust-MQ INT
		 Coefficient for downgrading mapping quality  for  reads  con-
		 taining  excessive  mismatches.   Mismatches are counted as a
		 proportion of the number of aligned bases ("M",  "X"  or  "="
		 CIGAR operations), along with their quality, to derive	an up-
		 per-bound of the mapping quality.  Original mapping qualities
		 lower than this are left intact, while	higher ones are	capped
		 at the	new adjusted score.

		 The exact formula is complex and likely tuned to specific in-
		 struments  and	 specific  alignment  tools, so	this option is
		 disabled by default  (indicated  as  having  a	 zero  value).
		 Variables  in	the formulae and their meaning are defined be-
		 low.

		 Variable   Meaning / formula
		 -----------------------------------------------------------
		 M	    The	number of matching CIGAR  bases	 (operation
			    "M", "X" or	"=").
		 X	    The	number of substitutions	with quality >=	13.
		 SubQ	    The	 summed	 quality  of substitution bases	in-
			    cluded in X, capped	at a maximum of	quality	 33
			    per	mismatching base.
		 ClipQ	    The	 summed	 quality  of  soft-clipped or hard-
			    clipped bases. This	has no minimum	or  maximum
			    quality  threshold	per base.  For hard-clipped
			    bases the per-base quality is taken	as 13.

		 T	    SubQ - 10 *	log10(M^X / X!)	+ ClipQ/5
		 Cap	    MAX(0, INT * sqrt((INT - T)	/ INT))

		 Some notes on the impact of this option:

		 O As the number of mismatches increases, the mapping  quality
		   cap reduces,	eventually resulting in	discarded alignments.

		 O High quality mismatches reduce the cap faster than low
		   quality mismatches.

		 O The starting	INT value also acts as a hard cap  on  mapping
		   quality, even when zero mismatches are observed.

		 O Indels have no impact on the	mapping	quality.
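
		 The formulae above can be evaluated with the Python sketch
		 below.  It is an illustration of the documented arithmetic
		 only, not the samtools implementation.  Two points are
		 assumptions: T is clamped to be non-negative so that INT
		 remains a hard cap, and a T at or above INT yields a cap
		 of zero.  The function and parameter names are
		 hypothetical.

```python
import math


def adjusted_mapq_cap(coef: int, m: int, x: int, sub_q: float, clip_q: float) -> float:
    """Evaluate Cap = MAX(0, INT * sqrt((INT - T) / INT)) for -C INT.

    coef   -- the INT given to -C / --adjust-MQ (must be > 0)
    m      -- number of matching CIGAR bases (M, X or =)
    x      -- number of substitutions with quality >= 13
    sub_q  -- summed (capped) quality of those substitutions
    clip_q -- summed quality of soft- and hard-clipped bases
    """
    if coef <= 0:
        raise ValueError("the option is disabled when INT is zero")
    # log10(M^X / X!) computed in log space to avoid huge intermediates.
    log10_term = 0.0
    if x > 0:
        log10_term = x * math.log10(m) - math.lgamma(x + 1) / math.log(10)
    t = sub_q - 10.0 * log10_term + clip_q / 5.0
    t = max(t, 0.0)   # assumption: keeps the cap from exceeding INT
    if t >= coef:     # assumption: avoids sqrt of a negative number
        return 0.0
    return max(0.0, coef * math.sqrt((coef - t) / coef))


if __name__ == "__main__":
    print(adjusted_mapq_cap(50, 100, 0, 0, 0))    # no mismatches
    print(adjusted_mapq_cap(50, 100, 1, 30, 0))   # one Q30 mismatch
```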

		 The  intent  of  this	option is to work around aligners that
		 compute a mapping quality using  a  local  alignment  without
		 having	 any regard to the degree of clipping required or con-
		 sideration of potential contamination or large	 scale	inser-
		 tions	with  respect  to  the	reference.  A record may align
		 uniquely and have no close second match, but  having  a  high
		 number	 of  mismatches	 may still imply that the reference is
		 not the correct site.

		 However we do not recommend use of this parameter unless  you
		 fully understand the impact of	it and have determined that it
		 is appropriate	for your sequencing technology.

       -d, --max-depth INT
		 At a position, read at most INT reads per input file.
		 Setting this limit reduces the amount of memory and time needed
		 to process regions with very high coverage.  Passing zero for
		 this  option  sets  it	 to the	highest	possible value,	effec-
		 tively	removing the depth limit. [8000]

		 Note that up to release 1.8, samtools would enforce a minimum
		 value for this	option.	 This no longer	happens	and the	 limit
		 is set	exactly	as specified.

       -E, --redo-BAQ
		 Recalculate BAQ on the	fly, ignore existing BQ	tags.  See BAQ
		 below.

       -f, --fasta-ref FILE
		 The  faidx-indexed  reference	file  in the FASTA format. The
		 file can be optionally	compressed by bgzip.  [null]

		 Supplying a reference file will enable	base alignment quality
		 calculation for all reads aligned to a	reference in the file.
		 See BAQ below.

       -G, --exclude-RG	FILE
		 Exclude reads from read groups	listed in FILE (one @RG-ID per
		 line)

       -l, --positions FILE
		 BED or	position list file containing a	 list  of  regions  or
		 sites	where pileup or	BCF should be generated. Position list
		 files contain two columns (chromosome and position) and start
		 counting from 1.  BED files contain at	least 3	columns	(chro-
		 mosome, start and end position) and are 0-based half-open.
		 While it is possible to mix both position-list and BED
		 coordinates in the same file, this is strongly discouraged due
		 to the differing coordinate systems. [null]
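
		 The difference between the two coordinate systems can be
		 illustrated with the following Python sketch, which
		 converts between a 1-based position and a 0-based
		 half-open BED interval.  The helper names are
		 hypothetical.

```python
def position_to_bed(chrom: str, pos: int):
    """Convert a 1-based position to a 0-based half-open BED interval."""
    if pos < 1:
        raise ValueError("position lists count from 1")
    return (chrom, pos - 1, pos)


def bed_to_positions(chrom: str, start: int, end: int):
    """List the 1-based positions covered by a BED interval [start, end)."""
    return [(chrom, p) for p in range(start + 1, end + 1)]


if __name__ == "__main__":
    print(position_to_bed("chr20", 100))
    print(bed_to_positions("chr20", 99, 102))
```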

       -q, --min-MQ INT
		 Minimum mapping quality for an	alignment to be	used [0]

       -Q, --min-BQ INT
		 Minimum base quality for a base to be considered. [13]

		 Note that base quality 0 is used as a filtering mechanism for
		 overlap removal, which marks bases as having quality zero and
		 lets the base quality filter remove them.  Hence using
		 --min-BQ 0 will make the overlapping bases reappear, albeit
		 with quality zero.

       -r, --region STR
		 Only generate pileup in region. Requires the BAM files	to  be
		 indexed.  If used in conjunction with -l then considers the
		 intersection of the two requests. [all sites]

       -R, --ignore-RG
		 Ignore	RG tags. Treat all reads in one	BAM as one sample.

       --rf, --incl-flags STR|INT
		 Required flags: only include reads with any of	the mask  bits
		 set  [null].	Note this is implemented as a filter-out rule,
		 rejecting reads that have none	of the mask bits  set.	 Hence
		 this does not override	the --excl-flags option.

       --ff, --excl-flags STR|INT
		 Filter	flags: skip reads with any of the mask bits set.  This
		 defaults to SECONDARY,QCFAIL,DUP.  The option is not
		 cumulative, so specifying e.g. --ff QCFAIL will re-enable output
		 of secondary and duplicate alignments.  Note this does not
		 override the --incl-flags option.
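
		 How the two masks combine can be sketched in Python as
		 below.  This models only the behaviour described here
		 (unmapped reads always dropped, --rf as "any bit set",
		 --ff as "no bit set") and is not the samtools code.  The
		 symbolic names and bit values follow the SAM FLAG
		 definitions; the function names are hypothetical.

```python
FLAGS = {
    "PAIRED": 1, "PROPER_PAIR": 2, "UNMAP": 4, "MUNMAP": 8,
    "REVERSE": 16, "MREVERSE": 32, "READ1": 64, "READ2": 128,
    "SECONDARY": 256, "QCFAIL": 512, "DUP": 1024, "SUPPLEMENTARY": 2048,
}


def parse_mask(value) -> int:
    """Turn 'READ2,REVERSE', '144' or 144 into an integer bit mask."""
    if isinstance(value, int):
        return value
    mask = 0
    for token in str(value).split(","):
        token = token.strip()
        mask |= FLAGS[token] if token in FLAGS else int(token, 0)
    return mask


DEFAULT_EXCL = parse_mask("SECONDARY,QCFAIL,DUP")


def passes(flag: int, incl: int = 0, excl: int = DEFAULT_EXCL) -> bool:
    """Return True if a read with this FLAG would be kept."""
    if flag & FLAGS["UNMAP"]:
        return False                  # unmapped reads are always discarded
    if incl and not (flag & incl):
        return False                  # --rf: none of the required bits set
    if flag & excl:
        return False                  # --ff: at least one filter bit set
    return True


if __name__ == "__main__":
    print(parse_mask("READ2,REVERSE"))
    print(passes(FLAGS["SECONDARY"], excl=parse_mask("QCFAIL")))
```

		 The last call shows the non-cumulative behaviour: replacing
		 the default mask with QCFAIL alone lets secondary
		 alignments through.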

       -x, --ignore-overlaps-removal, --disable-overlap-removal
		 Overlap detection and removal is enabled  by  default.	  This
		 option	turns it off.

		 When enabled, where the ends of a read-pair overlap the over-
		 lapping  region will have one base selected and the duplicate
		 base nullified	by setting its phred score to zero.   It  will
		 then be discarded by the --min-BQ option unless this is zero.

		 The  quality  values  of  the retained	base within an overlap
		 will be the summation of the two bases	if they	agree, or  0.8
		 times	the higher of the two bases if they disagree, with the
		 base nucleotide also being the	higher confident call.
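
		 The rule described above can be sketched in Python as
		 follows.  This follows only the text of this page; the
		 tie-breaking when both qualities are equal and the
		 rounding of the 0.8 factor (integer truncation here) are
		 assumptions, and any upper limit that samtools may apply
		 to the summed quality is not modelled.  The function name
		 is hypothetical.

```python
def resolve_overlap(base_a: str, qual_a: int, base_b: str, qual_b: int):
    """Return ((kept_base, kept_qual), (nullified_base, 0)) for one
    position covered by both ends of a read pair."""
    if base_a == base_b:
        # Agreement: keep one copy with the summed quality.
        return (base_a, qual_a + qual_b), (base_b, 0)
    # Disagreement: keep the more confident call at 0.8 times its quality.
    if qual_a >= qual_b:              # assumption: ties favour the first read
        return (base_a, qual_a * 8 // 10), (base_b, 0)
    return (base_b, qual_b * 8 // 10), (base_a, 0)


if __name__ == "__main__":
    print(resolve_overlap("A", 30, "A", 20))
    print(resolve_overlap("A", 30, "C", 20))
```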

       -X	 Include customized index files as part of the arguments.  See
		 the EXAMPLES section for sample usage.

       Output Options:

       -o, --output FILE
		 Write pileup output to	FILE, rather than the default of stan-
		 dard output.

       -O, --output-BP
		 Output	base positions on reads	in orientation listed  in  the
		 SAM file (left	to right).

       --output-BP-5
		 Output	 base  positions  on  reads in their original 5' to 3'
		 orientation.

       -s, --output-MQ
		 Output	mapping	qualities encoded as ASCII characters.

       --output-QNAME
		 Output	an extra column	containing comma-separated read	names.
		 Equivalent to --output-extra QNAME.

       --output-extra STR
		 Output	extra columns  containing  comma-separated  values  of
		 read  fields  or  read	tags. The names	of the selected	fields
		 have to be provided as they are described in the SAM
		 Specification (page 6) and will be output by the mpileup command in
		 the  same  order  as  in  the	document  (i.e.	  QNAME, FLAG,
		 RNAME,...)  The names are case	sensitive. Currently, only the
		 following fields are supported:

		 QNAME,	FLAG, RNAME, POS, MAPQ,	RNEXT, PNEXT, RLEN

		 Anything that is not on this list is treated as  a  potential
		 tag,  although	 only  two character tags are accepted.	In the
		 mpileup output, tag columns are displayed in the  order  they
		 were provided by the user in the command line.	 Field and tag
		 names	have to	be provided in a comma-separated string	to the
		 mpileup command.  Tags of type B (byte array) are not
		 supported.  An absent or unsupported tag will be listed as
		 "*".  E.g.

		 samtools mpileup --output-extra FLAG,QNAME,RG,NM in.bam

		 will display four extra columns in the	 mpileup  output,  the
		 first being a list of comma-separated read names, followed by
		 a  list of flag values, a list	of RG tag values and a list of
		 NM tag	values.	Field values are always	displayed  before  tag
		 values.

       --output-sep CHAR
		 Specify  a different separator	character for tag value	lists,
		 when those values might contain one or	more commas (,), which
		 is the	default	list  separator.   This	 option	 only  affects
		 columns  for  two-letter  tags	 like NM; standard fields like
		 FLAG or QNAME will always be separated	by commas.

       --output-empty CHAR
		 Specify a different 'no value'	character for tag list entries
		 corresponding to reads	that don't have	a tag  requested  with
		 the --output-extra option. The	default	is *.

		 This  option only applies to rows that	have at	least one read
		 in the	pileup,	and  only  to  columns	for  two-letter	 tags.
		 Columns for empty rows	will always be printed as *.

       -M, --output-mods
		 Adds base modification	markup into the	sequence column.  This
		 uses the Mm and Ml auxiliary tags (or their uppercase equiva-
		 lents).  Any base in the sequence output may be followed by a
		 series	 of strand code	quality	strings	enclosed within	square
		 brackets where	strand is "+" or "-", code is a	single charac-
		 ter (such as "m" or "h") or a ChEBI numeric  in  parentheses,
		 and  quality is an optional numeric quality value.  For exam-
		 ple a "C" base	with possible 5mC and 5hmC  base  modification
		 may be	reported as "C[+m179+h40]".

		 Quality values are from 0 to 255 inclusive, representing a
		 linear scale of probability 0.0 to 1.0 in increments of
		 1/256.  If quality values are absent (no Ml tag) these are
		 omitted, giving an example string of "C[+m+h]".

		 Note the base modifications may be identified on the  reverse
		 strand,  either  due to the native ability for	this detection
		 by the	sequencing instrument or by the	sequence  subsequently
		 being	reverse	 complemented.	 This can lead to modification
		 codes,	such as	"m" meaning 5mC, being shown for their comple-
		 mentary bases,	such as	"G[-m50]".

		 When --output-mods is selected	base modifications can	appear
		 on  any  base in the sequence output, including during	inser-
		 tions.	 This may make parsing the  string  more  complex,  so
		 also see the --no-output-ins-mods and --no-output-ins options
		 to simplify this process.
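
		 As an illustration of the markup, the Python sketch below
		 parses a single base followed by its bracketed
		 modification list, such as "C[+m179+h40]".  It does not
		 parse a whole sequence column.  The function names are
		 hypothetical, and mapping a quality value q to q/256 (the
		 lower edge of its probability bin) is an assumption made
		 for the example.

```python
import re

# strand (+ or -), code (one letter or a ChEBI number in parentheses),
# then an optional numeric quality.
_MOD = re.compile(r"([+-])([A-Za-z]|\(\d+\))(\d*)")


def parse_mods(markup: str):
    """Parse 'C[+m179+h40]' into ('C', [('+', 'm', 179), ('+', 'h', 40)])."""
    outer = re.fullmatch(r"(.)\[(.+)\]", markup)
    if outer is None:
        raise ValueError(f"not a base with modification markup: {markup!r}")
    base, body = outer.groups()
    mods = []
    consumed = 0
    for hit in _MOD.finditer(body):
        if hit.start() != consumed:
            raise ValueError(f"unexpected text in {markup!r}")
        strand, code, qual = hit.groups()
        mods.append((strand, code.strip("()"), int(qual) if qual else None))
        consumed = hit.end()
    if consumed != len(body):
        raise ValueError(f"unexpected text in {markup!r}")
    return base, mods


def quality_to_probability(q: int) -> float:
    """Map a 0..255 quality to a probability (assumption: q / 256)."""
    if not 0 <= q <= 255:
        raise ValueError("quality must be in 0..255")
    return q / 256.0


if __name__ == "__main__":
    print(parse_mods("C[+m179+h40]"))
    print(parse_mods("G[-m50]"))
```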

       --no-output-ins
		 Do  not  output  the  inserted	 bases in the sequence column.
		 Usually this is reported as "+length sequence", but with this
		 option	it becomes simply "+length".  For example an insertion
		 of AGT	in a  pileup  column  changes  from  "CCC+3AGTGCC"  to
		 "CCC+3GCC".

		 Specifying  this option twice also removes the	"+length" por-
		 tion, changing	the example above to "CCCGCC".

		 The purpose of	this change is to simplify parsing using basic
		 regular  expressions,	which  traditionally  cannot   perform
		 counting operations.  It is particularly beneficial when used
		 in  conjunction  with	--output-mods as the syntax of the in-
		 serted	sequence is adjusted to	also report possible base mod-
		 ifications, but see also --no-output-ins-mods as an  alterna-
		 tive.

       --no-output-ins-mods
		 Outputs the inserted bases in the sequence, but excluding any
		 base modifications.  This only affects output when
		 --output-mods is also used.

       --no-output-del
		 Do not output deleted reference bases in the sequence column.
		 Normally this is reported as "-length sequence", but with
		 this option it becomes simply "-length".  For example a
		 deletion of 3 unknown bases (due to no reference being
		 specified) would normally be seen in a column as e.g.
		 "CCC-3NNNGCC", but will be reported as "CCC-3GCC" with this
		 option.

		 Specifying this option	twice also removes the "-length"  por-
		 tion, changing	the example above to "CCCGCC".

		 The purpose of	this change is to simplify parsing using basic
		 regular   expressions,	 which	traditionally  cannot  perform
		 counting operations.  See also	--no-output-ins.

       --no-output-ends
		 Removes the "^" (with mapping quality)	and  "$"  markup  from
		 the sequence column.

       --reverse-del
		 Mark  the  deletions on the reverse strand with the character
		 #, instead of the usual *.

       -a	 Output	all positions, including those with zero depth.

       -a -a, -aa
		 Output	absolutely all positions, including  unused  reference
		 sequences.   Note  that  when	used in	conjunction with a BED
		 file the -a option may	sometimes operate as if	-aa was	speci-
		 fied if the reference sequence	has coverage  outside  of  the
		 region	specified in the BED file.

       BAQ (Base Alignment Quality)

       BAQ  is	the  Phred-scaled probability of a read	base being misaligned.
       It greatly helps	to reduce false	SNPs caused by misalignments.  BAQ  is
       calculated  using the probabilistic realignment method described	in the
       paper "Improving SNP discovery by base alignment quality", Heng Li,
       Bioinformatics, Volume 27, Issue 8
       <https://doi.org/10.1093/bioinformatics/btr076>

       BAQ is applied to modify	quality	values before the -Q filtering happens
       and before the choice of	which sequence to retain when  removing	 over-
       laps.

       BAQ is turned on	when a reference file is supplied using	the -f option.
       To disable it, use the -B option.

       It  is  possible	 to  store precalculated BAQ values in a SAM BQ:Z tag.
       Samtools	mpileup	will use the precalculated values if  it  finds	 them.
       The  -E	option	can be used to make it ignore the contents of the BQ:Z
       tag and force it	to recalculate the BAQ scores by making	a  new	align-
       ment.

EXAMPLES
       Using	range:	 With	implicit   index   files   in1.bam.<ext>   and
       in2.sam.gz.<ext>,

	 samtools mpileup in1.bam in2.sam.gz -r	chr10:100000-200000

       With explicit index files,

	 samtools mpileup in1.bam in2.sam.gz idx/in1.csi idx/in2.csi -X	-r chr10:100000-200000

       With fofn being a file of input file names, and	implicit  index	 files
       present with inputs,

	 samtools mpileup -b fofn -r chr10:100000-200000

       Using  flags:  To  get reads with flags READ2 or	REVERSE	and not	having
       any of SECONDARY,QCFAIL,DUP,

	 samtools mpileup --rf READ2,REVERSE in.sam

       or

	 samtools mpileup --rf 144 in.sam

       To get reads with flag SECONDARY,

	 samtools mpileup --rf SECONDARY --ff QCFAIL,DUP in.sam

       Using all possible alignments: To show all possible alignments, either
       of the two equivalent commands below may be used,

	 samtools mpileup --count-orphans --no-BAQ --max-depth 0 --fasta-ref ref_file.fa \
	 --min-BQ 0 --excl-flags 0 --disable-overlap-removal in.sam

	 samtools mpileup -A -B	-d 0 -f	ref_file.fa -Q 0 --ff 0	-x in.sam

AUTHOR
       Written by Heng Li from the Sanger Institute.

SEE ALSO
       samtools(1), samtools-depth(1), samtools-sort(1), bcftools(1)

       Samtools	website: <http://www.htslib.org/>

samtools-1.21		       12 September 2024	   samtools-mpileup(1)