samtools-stats(1)	     Bioinformatics tools	     samtools-stats(1)

NAME
       samtools stats - produces comprehensive statistics from an alignment file

SYNOPSIS
       samtools	stats [options]	in.sam|in.bam|in.cram [region...]

DESCRIPTION
       samtools stats collects statistics from SAM, BAM and CRAM files and
       outputs them in a text format.  The output can be visualized
       graphically using plot-bamstats.

       A summary of output sections is listed below, followed by more detailed
       descriptions.

       CHK    Checksum
       SN     Summary numbers
       FFQ    First fragment qualities
       LFQ    Last fragment qualities
       GCF    GC content of first fragments
       GCL    GC content of last fragments
       GCC    ACGT content per cycle
       GCT    ACGT content per cycle, read oriented
       FBC    ACGT content per cycle for first fragments only
       FTC    ACGT raw counters	for first fragments
       LBC    ACGT content per cycle for last fragments	only
       LTC    ACGT raw counters	for last fragments
       BCC    ACGT content per cycle for BC barcode
       CRC    ACGT content per cycle for CR barcode
       OXC    ACGT content per cycle for OX barcode
       RXC    ACGT content per cycle for RX barcode
       MPC    Mismatch distribution per	cycle
       QTQ    Quality distribution for BC barcode
       CYQ    Quality distribution for CR barcode
       BZQ    Quality distribution for OX barcode
       QXQ    Quality distribution for RX barcode
       IS     Insert sizes
       RL     Read lengths
       FRL    Read lengths for first fragments only
       LRL    Read lengths for last fragments only
       MAPQ   Mapping qualities
       ID     Indel size distribution
       IC     Indels per cycle
       COV    Coverage (depth) distribution
       GCD    GC-depth

       The  "cycle" terminology	used here originates from the Illumina instru-
       ments, but it is	interpreted more generally as the Nth base reported in
       the original read orientation (starting from 1).

       Not all sections	will be	reported as some depend	on the data being  co-
       ordinate	 sorted	 while	others	are only present when specific barcode
       tags are	in use.

       Some of the statistics are collected for	"first"	or  "last"  fragments.
       Records	are  put  into	these categories using the PAIRED (0x1), READ1
       (0x40) and READ2	(0x80) flag bits, as follows:

          Unpaired reads (i.e.	PAIRED is not set) are all "first"  fragments.
	   For these records, the READ1	and READ2 flags	are ignored.

          Reads  where	 PAIRED	 and  READ1  are set, and READ2	is not set are
	   "first" fragments.

          Reads where PAIRED and READ2	are set, and  READ1  is	 not  set  are
	   "last" fragments.

          Reads  where	 PAIRED	is set and either both READ1 and READ2 are set
	   or neither is set are not counted in	either category.
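       The four rules above can be expressed as a small function.  This is a
       minimal sketch in Python, not code taken from samtools; the function
       name fragment_category is hypothetical.

```python
PAIRED, READ1, READ2 = 0x1, 0x40, 0x80  # SAM FLAG bits named in the text


def fragment_category(flag):
    """Return "first", "last" or None (not counted) for a SAM FLAG value."""
    if not flag & PAIRED:
        return "first"            # unpaired: READ1 and READ2 are ignored
    read1 = bool(flag & READ1)
    read2 = bool(flag & READ2)
    if read1 and not read2:
        return "first"
    if read2 and not read1:
        return "last"
    return None                   # both or neither set: no category
```

       For instance, fragment_category(0x41) gives "first" and
       fragment_category(0x81) gives "last".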

       Information on the meaning of the flags is given	in the SAM  specifica-
       tion document <https://samtools.github.io/hts-specs/SAMv1.pdf>.

       The  CHK	row contains distinct CRC32 checksums of read names, sequences
       and quality values.  The	checksums are computed	per  alignment	record
       and  summed, meaning the	checksum does not change if the	input file has
       the sort-order changed.
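       The order independence follows from addition being commutative.  The
       sketch below illustrates the idea with Python's zlib.crc32; the exact
       bytes samtools feeds into each CRC32 and the wrap-around to 32 bits
       are assumptions made for illustration only.

```python
import zlib


def summed_crc32(values):
    """Sum per-record CRC32 values modulo 2**32.

    Because addition is commutative, the result does not depend on the
    order in which the records are visited.
    """
    total = 0
    for value in values:
        total = (total + zlib.crc32(value)) & 0xFFFFFFFF
    return total


# Hypothetical read names, visited in two different orders.
names = [b"read1", b"read2", b"read3"]
same = summed_crc32(names) == summed_crc32(list(reversed(names)))
```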

       The SN section contains a series	of counts, percentages,	and  averages,
       in a similar style to samtools flagstat,	but more comprehensive.

              raw total sequences - total number of reads in a file, excluding
              supplementary and secondary reads.  Same number reported by
              samtools view -c -F 0x900.

	      filtered	sequences - number of discarded	reads when using -f or
	      -F option.

	      sequences	- number of processed reads.

	      is sorted	- flag	indicating  whether  the  file	is  coordinate
	      sorted (1) or not	(0).

	      1st  fragments  -	number of first	fragment reads (flags 0x01 not
	      set; or flags 0x01 and 0x40 set, 0x80 not	set).

	      last fragments - number of last fragment reads (flags  0x01  and
	      0x80 set,	0x40 not set).

	      reads  mapped  -	number	of  reads,  paired or single, that are
	      mapped (flag 0x4 or 0x8 not set).

	      reads mapped and paired -	number of mapped  paired  reads	 (flag
	      0x1 is set and flags 0x4 and 0x8 are not set).

	      reads unmapped - number of unmapped reads	(flag 0x4 is set).

	      reads  properly paired - number of mapped	paired reads with flag
	      0x2 set.

	      paired - number of paired	reads, mapped or  unmapped,  that  are
	      neither  secondary  nor supplementary (flag 0x1 is set and flags
	      0x100 (256) and 0x800 (2048) are not set).

	      reads duplicated - number	of duplicate reads (flag 0x400	(1024)
	      is set).

	      reads MQ0	- number of mapped reads with mapping quality 0.

	      reads QC failed -	number of reads	that failed the	quality	checks
	      (flag 0x200 (512)	is set).

	      non-primary  alignments  - number	of secondary reads (flag 0x100
	      (256) set).

	      supplementary alignments - number	of supplementary  reads	 (flag
	      0x800 (2048) set).

	      total  length  -	number	of processed bases from	reads that are
	      neither secondary	nor supplementary (flags 0x100 (256) and 0x800
	      (2048) are not set).

	      total first fragment length - number of processed	bases that be-
	      long to first fragments.

	      total last fragment length - number of processed bases that  be-
	      long to last fragments.

	      bases  mapped  -	number of processed bases that belong to reads
	      mapped.

	      bases mapped (cigar) - number of mapped bases  filtered  by  the
	      CIGAR  string  corresponding  to	the  read they belong to. Only
	      alignment	matches(M), inserts(I),	sequence  matches(=)  and  se-
	      quence mismatches(X) are counted.

	      bases  trimmed  -	number of bases	trimmed	by bwa,	that belong to
	      non secondary and	non supplementary reads. Enabled by -q option.

	      bases duplicated - number	of bases that belong to	 reads	dupli-
	      cated.

	      mismatches  -  number of mismatched bases, as reported by	the NM
	      tag associated with a read, if present.

	      error rate - ratio between mismatches and	bases mapped (cigar).

	      average length - ratio between total length and sequences.

	      average first fragment length - ratio between total first	 frag-
	      ment length and 1st fragments.

	      average last fragment length - ratio between total last fragment
	      length and last fragments.

	      maximum  length  -  length  of  the longest read (includes hard-
	      clipped bases).

	      maximum first fragment length -  length  of  the	longest	 first
	      fragment read (includes hard-clipped bases).

	      maximum  last fragment length - length of	the longest last frag-
	      ment read	(includes hard-clipped bases).

	      average quality -	ratio between the sum of  base	qualities  and
	      total length.

	      insert  size  average - the average absolute template length for
	      paired and mapped	reads.

	      insert size standard deviation - standard	deviation for the  av-
	      erage template length distribution.

	      inward  oriented	pairs  - number	of paired reads	with flag 0x40
	      (64) set and flag	0x10 (16) not set or with flag 0x80 (128)  set
	      and flag 0x10 (16) set.

	      outward  oriented	 pairs - number	of paired reads	with flag 0x40
	      (64) set and flag	0x10 (16) set or with flag 0x80	(128) set  and
	      flag 0x10	(16) not set.

	      pairs with other orientation - number of paired reads that don't
	      fall in any of the above two categories.

              pairs on different chromosomes - number of pairs where one read
              is on one chromosome and its mate is on a different
              chromosome.

	      percentage of properly paired reads - percentage of reads	 prop-
	      erly paired out of sequences.

	      bases  inside the	target - number	of bases inside	the target re-
	      gion(s) (when a target file is specified with -t option).

	      percentage of target genome with coverage	> VAL -	percentage  of
	      target bases with	a coverage larger than VAL. By default,	VAL is
	      0,  but  a  custom value can be supplied by the user with	-g op-
	      tion.
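       Several SN values are ratios of other SN values, so they can be
       recomputed as a sanity check.  The following Python sketch assumes
       each SN row has the tab-separated form "SN", "name:", "value"; the
       helper names and the sample numbers are hypothetical.

```python
def parse_sn(text):
    """Collect SN rows into a dict, assuming 'SN<TAB>name:<TAB>value' rows."""
    values = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "SN":
            values[fields[1].rstrip(":")] = float(fields[2])
    return values


def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Made-up sample rows, for illustration only.
sample = "\n".join([
    "SN\tsequences:\t4",
    "SN\ttotal length:\t400",
    "SN\tbases mapped (cigar):\t380",
    "SN\tmismatches:\t19",
])
sn = parse_sn(sample)
# error rate = mismatches / bases mapped (cigar)
error_rate = ratio(sn["mismatches"], sn["bases mapped (cigar)"])
# average length = total length / sequences
average_length = ratio(sn["total length"], sn["sequences"])
```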

       The FFQ and LFQ sections report the quality distribution per first/last
       fragment and per cycle number.  They have one row per cycle (reported
       as the first column after the FFQ/LFQ key) with remaining columns being
       the observed integer counts per quality value, starting at quality 0 in
       the left-most column and ending at the largest observed quality.  Thus
       each row forms its own quality distribution and any cycle specific
       quality artefacts can be observed.
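       Since each row is a histogram indexed by quality value, a per-cycle
       mean quality can be derived from it.  A minimal Python sketch,
       assuming tab-separated fields; the function name and the sample row
       are hypothetical.

```python
def mean_quality_per_cycle(row):
    """Given one FFQ/LFQ row split into fields, return (cycle, mean quality).

    row[0] is the section key, row[1] the cycle, and row[2:] the counts
    for quality 0, 1, 2, ... in that order.
    """
    cycle = int(row[1])
    counts = [int(c) for c in row[2:]]
    total = sum(counts)
    weighted = sum(quality * n for quality, n in enumerate(counts))
    return cycle, (weighted / total if total else 0.0)


# Made-up row: cycle 1, ten bases at quality 2 and thirty at quality 3.
row = "FFQ\t1\t0\t0\t10\t30".split("\t")
cycle, mean_q = mean_quality_per_cycle(row)
```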

       GCF  and	 GCL  report  the total	GC content of each fragment, separated
       into first and last fragments.  The columns show	the GC percentile (be-
       tween 0 and 100)	and an integer count of	fragments at that percentile.

       GCC, FBC	and LBC	report the nucleotide content per  cycle  either  com-
       bined  (GCC)  or	 split into first (FBC)	and last (LBC) fragments.  The
       columns are cycle number	(integer), and percentage counts for A,	C,  G,
       T,  N  and  other  (typically  containing  ambiguity  codes) normalised
       against the total counts	of A, C, G and T only (excluding N and other).
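       Because the denominator excludes N and other, the A, C, G and T
       percentages sum to 100 while N and other are reported on top of that.
       A small Python sketch of this normalisation, with hypothetical names
       and counts:

```python
def base_percentages(counts):
    """Normalise per-cycle counts against the A+C+G+T total only.

    counts maps "A", "C", "G", "T", "N" and "other" to integer counts.
    """
    acgt = sum(counts[base] for base in "ACGT")
    if acgt == 0:
        return {key: 0.0 for key in counts}
    return {key: 100.0 * value / acgt for key, value in counts.items()}


# Made-up counts for one cycle.
pct = base_percentages(
    {"A": 25, "C": 25, "G": 25, "T": 25, "N": 10, "other": 0}
)
```

       Here each of A, C, G and T is 25%, and N is 10% even though the ACGT
       columns already add up to 100%.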

       GCT offers a similar report to GCC, but whereas GCC counts  nucleotides
       as  they	appear in the SAM output (in reference orientation), GCT takes
       into account whether a nucleotide belongs  to  a	 reverse  complemented
       read  and  counts it in the original read orientation.  If there	are no
       reverse complemented reads in a file, the GCC and GCT reports  will  be
       identical.
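       Recovering the original read orientation amounts to reverse
       complementing records that have the reverse-strand flag (0x10) set.
       The Python sketch below shows only that step; it is not the samtools
       implementation and the function name is hypothetical.

```python
# Complement table for upper-case bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def original_orientation(seq, flag):
    """Return seq as originally read, undoing the reverse complement
    stored for reverse-strand records (FLAG bit 0x10)."""
    if flag & 0x10:
        return seq.translate(COMPLEMENT)[::-1]
    return seq


forward = original_orientation("AACC", 0)      # unchanged
reverse = original_orientation("AACC", 0x10)   # reverse complemented
```

       With no reverse-strand records the function is the identity, which
       is why the two reports then agree.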

       FTC  and	LTC report the total numbers of	nucleotides for	first and last
       fragments, respectively.	The columns are	the raw	counters for A,	C,  G,
       T and N bases.

       MPC  reports  the number	of mismatches per cycle	and per	quality	value.
       The MPC statistics are only included when a reference is	specified  via
       the  -r	option.	 There is one row per cycle number.  Each row includes
       the cycle number, the number of N bases (not counted  in	 the  per-qual
       columns),  followed  by	one column per quality value (starting at zero
       and incrementing	by one each time) listing the  number  of  non-N  mis-
       matches	with  that quality.  A mismatch	is defined as an ACGT sequence
       base mismatching	an ACGT	reference base.	 Ambiguity codes  are  ignored
       (except	for  sequence N	as mentioned above, which is counted even when
       the reference is	also N).
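       The counting rule for one cycle can be sketched as follows.  This is
       an illustrative Python version of the definition above, assuming
       upper-case bases and qualities no larger than max_qual; the function
       name and input layout are hypothetical.

```python
ACGT = set("ACGT")


def mpc_row(pairs, max_qual):
    """Tally one cycle of (read_base, ref_base, quality) triples.

    Returns [N count, mismatches at quality 0, 1, ..., max_qual].
    """
    n_count = 0
    per_qual = [0] * (max_qual + 1)
    for read_base, ref_base, qual in pairs:
        if read_base == "N":
            n_count += 1          # counted even when the reference is N
        elif read_base in ACGT and ref_base in ACGT and read_base != ref_base:
            per_qual[qual] += 1   # ACGT base against a different ACGT base
        # any other ambiguity code is ignored
    return [n_count] + per_qual


# Made-up observations for a single cycle.
row = mpc_row(
    [("A", "A", 3), ("A", "C", 2), ("N", "N", 0), ("R", "A", 1), ("G", "T", 2)],
    max_qual=3,
)
```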

       BCC, CRC, OXC and RXC are the barcode equivalent	of  GCC,  showing  nu-
       cleotide	 content  for the barcode tags BC, CR, OX and RX respectively.
       Their quality values distributions are in the QTQ,  CYQ,	 BZQ  and  QXQ
       sections, corresponding to the BC/QT, CR/CY, OX/BZ and RX/QX SAM	format
       sequence/quality	 tags.	 These	quality	value distributions follow the
       same format used	in the FFQ and LFQ sections. All these	section	 names
       are  followed  by  a number (1 or 2), indicating	that the stats figures
       below them correspond to	the first or second barcode (in	 the  case  of
       dual  indexing).	 Thus,	these sections will appear as BCC1, CRC1, OXC1
       and RXC1, accompanied by	their quality correspondents QTQ1, CYQ1,  BZQ1
       and  QXQ1. If a separator is present in the barcode sequence (usually a
       hyphen),	indicating dual	indexing, then sections	 ending	 in  "2"  will
       also  be	reported to show the second tag	statistics (e.g. both BCC1 and
       BCC2 are	present).
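       The split into first and second barcode can be pictured as below.  A
       minimal Python sketch assuming a hyphen separator; the helper names
       and the sample barcodes are hypothetical.

```python
def split_barcode(barcode, separator="-"):
    """Return (first, second); second is None without a separator."""
    first, sep, second = barcode.partition(separator)
    return first, (second if sep else None)


def section_names(prefix, barcode, separator="-"):
    """List the section names expected for one barcode value,
    e.g. prefix "BCC" gives ["BCC1"] or ["BCC1", "BCC2"]."""
    _, second = split_barcode(barcode, separator)
    names = [prefix + "1"]
    if second is not None:
        names.append(prefix + "2")
    return names
```

       For example, section_names("BCC", "ACGT-TTGA") yields both BCC1 and
       BCC2, while section_names("BCC", "ACGT") yields only BCC1.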

       IS reports insert size distributions with one row per size, reported in
       the first column, with subsequent columns for the frequency of total
       pairs, inward oriented pairs, outward oriented pairs and other
       orientation pairs.  The -i option specifies the maximum insert size
       reported.

       RL reports the distribution for all read lengths, with one row per
       observed length (up to the maximum specified by the -l option).
       Columns are read length and frequency.  FRL and LRL contain the same
       information separated into first and last fragments.

       MAPQ reports the mapping qualities for the mapped reads, ignoring
       duplicate, supplementary, secondary and QC-failed reads.

       ID reports the distribution of indel sizes, with	one row	 per  observed
       size.  The  columns  are	size, frequency	of insertions at that size and
       frequency of deletions at that size.

       IC reports the frequency	of indels occurring per	cycle, broken down  by
       both  insertion	/  deletion and	by first / last	read.  Note for	multi-
       base indels this	only counts the	first base location.  Columns are  cy-
       cle,  number  of	insertions in first fragments, number of insertions in
       last fragments, number of deletions in first fragments, and  number  of
       deletions in last fragments.

       COV reports a distribution of the alignment depth per covered reference
       site.   For  example  an	 average depth of 50 would ideally result in a
       normal distribution centred on 50, but the presence of repeats or copy-
       number variation	may reveal multiple peaks at approximate multiples  of
       50.   The  first	 column	 is an inclusive coverage range	in the form of
       [min-max].  The next columns are	a repeat of the	maximum	portion	of the
       depth range (now	as a single integer)  and  the	frequency  that	 depth
       range  was observed.  The minimum, maximum and range step size are con-
       trolled by the -c option.  Depths above and below the minimum and maxi-
       mum are reported	with ranges [<min] and [max<].
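       The binning controlled by MIN, MAX and STEP can be sketched as
       follows.  This Python example only illustrates the range labels
       described above; how samtools treats the exact boundary values is an
       assumption here, and the function name is hypothetical.

```python
def coverage_label(depth, cov_min=1, cov_max=1000, step=1):
    """Return the range label for a depth, given MIN, MAX and STEP.

    Assumes depths below MIN fall in [<MIN], depths above MAX fall in
    [MAX<], and everything else in an inclusive [low-high] bin.
    """
    if depth < cov_min:
        return f"[<{cov_min}]"
    if depth > cov_max:
        return f"[{cov_max}<]"
    low = cov_min + ((depth - cov_min) // step) * step
    high = min(low + step - 1, cov_max)
    return f"[{low}-{high}]"
```

       With the default 1,1000,1 every depth from 1 to 1000 has its own
       bin, so a depth of 50 falls in [50-50].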

       GCD reports the GC content of the reference data	 aligned  against  per
       alignment  record,  with	one row	per observed GC	percentage reported as
       the first column	and sorted on this column.  The	second column is a to-
       tal sequence percentile,	as a running  total  (ending  at  100%).   The
       first  and  second columns may be used to produce a simple distribution
       of GC content.  Subsequent columns list the  coverage  depth  at	 10th,
       25th,  50th, 75th and 90th GC percentiles for this specific GC percent-
       age, revealing any GC bias in  mapping.	 These	columns	 are  averaged
       depths, so are floating point with no maximum value.

OPTIONS
       -c, --coverage MIN,MAX,STEP
	       Set  coverage  distribution  to	the specified range (MIN, MAX,
	       STEP all	given as integers) [1,1000,1]

       -d, --remove-dups
	       Exclude from statistics reads marked as duplicates

       -f, --required-flag STR|INT
	       Required	flag, 0	for unset. See also `samtools flags` [0]

       -F, --filtering-flag STR|INT
	       Filtering flag, 0 for unset. See	also `samtools flags` [0]

       --GC-depth FLOAT
	       the size	of GC-depth bins (decreasing bin size increases	memory
	       requirement) [2e4]

       -h, --help
	       This help message

       -i, --insert-size INT
	       Maximum insert size [8000]

       -I, --id	STR
	       Include only listed read	group or sample	name []

       -l, --read-length INT
	       Include in the statistics only reads with the given read	length
	       [-1]

       -m, --most-inserts FLOAT
	       Report only the main part of inserts [0.99]

       -P, --split-prefix STR
	       A path or string	prefix to prepend  to  filenames  output  when
	       creating	 categorised statistics	files with -S/--split.	[input
	       filename]

       -q, --trim-quality INT
	       The BWA trimming	parameter [0]

       -r, --ref-seq FILE
	       Reference sequence (required for	GC-depth  and  mismatches-per-
	       cycle calculation).  []

       -S, --split TAG
	       In addition to the complete statistics, also output categorised
	       statistics  based on the	tagged field TAG (e.g.,	use --split RG
	       to split	into read groups).

	       Categorised  statistics	are  written  to  files	 named	 <pre-
	       fix>_<value>.bamstat,  where prefix is as given by --split-pre-
	       fix (or the input filename by default) and value	has  been  en-
	       countered  as the specified tagged field's value	in one or more
	       alignment records.
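               The naming scheme can be illustrated with a short Python
               sketch.  The function name, prefix and tag values below are
               hypothetical; only the <prefix>_<value>.bamstat pattern comes
               from the description above.

```python
def split_stats_filenames(tag_values, prefix):
    """Map each distinct tag value to '<prefix>_<value>.bamstat'."""
    names = {}
    for value in tag_values:
        names.setdefault(value, f"{prefix}_{value}.bamstat")
    return names


# Made-up RG values as they might be seen across alignment records.
files = split_stats_filenames(["grp1", "grp2", "grp1"], "sample.bam")
```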

       -t, --target-regions FILE
	       Do stats	in these regions only. Tab-delimited file chr,from,to,
	       1-based,	inclusive.  []
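               Intervals kept in a 0-based, half-open convention (as in BED
               files) need shifting before use here.  A minimal Python
               sketch of that conversion; the function name and coordinates
               are hypothetical.

```python
def bed_to_target_line(chrom, start0, end0):
    """Convert a 0-based half-open interval [start0, end0) into a
    tab-delimited, 1-based inclusive chr,from,to line."""
    return f"{chrom}\t{start0 + 1}\t{end0}"


# The first 100 bases of a chromosome: BED 0..100 becomes 1..100.
line = bed_to_target_line("chr1", 0, 100)
```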

       -x, --sparse
	       Suppress	outputting IS rows where there are no insertions.

       -p, --remove-overlaps
	       Remove overlaps of paired-end  reads  from  coverage  and  base
	       count computations.

       -g, --cov-threshold INT
	       Only  bases  with coverage above	this value will	be included in
	       the target percentage computation [0]

       -X      If this option is set, the user can specify customized index
               file location(s) when the data folder does not contain an
               index file.  Example usage: samtools stats [options] -X
               /data_folder/data.bam /index_folder/data.bai chrM:1-10

       -@, --threads INT
	       Number of input/output compression threads to use  in  addition
	       to main thread [0].

AUTHOR
       Written by Petr Danecek with major modifications by Nicholas Clarke,
       Martin Pollard, Josh Randall, and Valeriu Ohan, all from the Sanger
       Institute.

SEE ALSO
       samtools(1), samtools-flagstat(1), samtools-idxstats(1)

       Samtools	website: <http://www.htslib.org/>

samtools-1.21		       12 September 2024	     samtools-stats(1)
