samtools-checksum(1)	     Bioinformatics tools	  samtools-checksum(1)

NAME
       samtools	checksum - produces checksums of SAM / BAM / CRAM content

SYNOPSIS
       samtools	checksum [options] in.sam|in.bam|in.cram|in.fastq [ ...	]
       samtools	checksum -m [options] in.checksum [ ...	]

DESCRIPTION
       With no options, this produces an order agnostic checksum of sequence,
       quality, read-name and barcode related aux data in a SAM, BAM, CRAM or
       FASTQ file.  A CRC32 checksum is computed per record and these are
       combined together by multiplication in a prime field of size
       (1<<31)-1, that is 2^31-1.

       The purpose of this mode	is to validate that no data has	been  lost  in
       data  processing	 through  the  various steps of	alignment, sorting and
       processing.  Only primary alignments are	 recorded  and	the  checksums
       computed	 are order agnostic so the same	checksums are produced in name
       collated	or position sorted output files.
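
       The order agnostic combination described above can be illustrated with
       a minimal Python sketch.  It is not the samtools implementation: the
       modulus 2^31-1, the remapping of zero residues to 1, the way fields
       are fed to CRC32 and the function names are all assumptions made for
       illustration.

```python
import zlib

MOD = (1 << 31) - 1  # assumed modulus: the Mersenne prime 2^31-1


def record_crc(*fields: bytes) -> int:
    """CRC32 over the given fields, fed in order (illustrative layout only)."""
    crc = 0
    for field in fields:
        crc = zlib.crc32(field, crc)
    return crc


def combine(crcs) -> int:
    """Multiply CRCs together modulo MOD; the result ignores record order."""
    product = 1
    for crc in crcs:
        crc %= MOD
        if crc == 0:
            crc = 1  # assumption: avoid collapsing the whole product to zero
        product = (product * crc) % MOD
    return product


# Hypothetical records: (read name, sequence).
records = [(b"read1", b"ACGT"), (b"read2", b"GGCA"), (b"read3", b"TTAG")]
crcs = [record_crc(name, seq) for name, seq in records]

# The same records in a different order give the same combined value.
assert combine(crcs) == combine(list(reversed(crcs)))
```

       Because modular multiplication does not depend on the order of its
       operands, any reordering of the records yields the same product.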

       One set of checksums is produced per read-group, as well as a combined
       set for the whole file and a set for records that have no read-group
       assigned.  This allows for validation of merging multiple runs and
       splitting pools by their read-group.  The checksums are also reported
       for QC-pass only and QC-fail only (indicated by the QCFAIL BAM flag),
       so checksums of data identified and removed as contamination can also
       be tracked.

       All of the above	are compatible with  Biobambam2's  bamseqchksum	 tool,
       which  was  the	inspiration  for this samtools command.	 The -B	option
       further enhances	compatibility by using the  same  output  format,  al-
       though  it limits the functionality to the order	agnostic checksums and
       fewer types validated.

       The -m or --merge option can be used to merge previously generated
       checksums.  The input filenames are checksum outputs from this tool
       (via shell redirection or the -o option).  The intended use of this is
       to validate that no data is lost or corrupted when merging read-group
       specific files, by algorithmically computing the expected checksum
       output.
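
       Why merging can be done on the checksums alone may be seen from the
       following Python sketch.  It assumes the order agnostic checksum is a
       product of per-record values modulo 2^31-1; the helper names and the
       CRC values are hypothetical and this is not the samtools code.

```python
MOD = (1 << 31) - 1  # assumed modulus: the Mersenne prime 2^31-1


def combine(crcs) -> int:
    """Order agnostic product of per-record CRCs modulo MOD."""
    product = 1
    for crc in crcs:
        crc %= MOD
        if crc == 0:
            crc = 1  # assumption: zero residues are remapped
        product = (product * crc) % MOD
    return product


def merge(parts):
    """Merge (count, checksum) pairs without re-reading any records."""
    total, product = 0, 1
    for count, checksum in parts:
        total += count
        product = (product * checksum) % MOD
    return total, product


# Hypothetical per-record CRC values for two read-group specific files.
rg1 = [0x1234ABCD, 0x0BADF00D]
rg2 = [0x80000004, 0x00C0FFEE]

merged = merge([(len(rg1), combine(rg1)), (len(rg2), combine(rg2))])

# Merging the two reports matches checksumming the merged data directly.
assert merged == (4, combine(rg1 + rg2))
```

       Counts add and checksums multiply, so the expected report for a merged
       file can be predicted before the merge is run.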

       Additionally checksum can track other columns including BAM flags, map-
       ping  information  (MAPQ	and CIGAR), pair information (RNEXT, PNEXT and
       TLEN), as well as a wider list of tags.

       With the	-O option the checksums	become record  order  specific.	  Com-
       bined together with the -a option this can be used to validate SAM, BAM
       and  CRAM  format  conversions.	 The  CRCs per record are XORed	with a
       record counter for the Nth record per read group.  See the detailed de-
       scription below for single -O vs	double and  the	 implications  on  re-
       ordering	between	read-groups.

       When performing such validation,	it is also useful to enable data sani-
       tisation	first, as CRAM can fix up certain types	of inconsistencies in-
       cluding	common	issues	such  as  MAPQ and CIGAR strings for unaligned
       data.

OUTPUT
       The output format consists of a machine readable	table of checksums and
       human readable text starting with a "#" character.

       For compatibility with bamseqchksum the data is CRCed in	 specific  or-
       ders  before  combining	together  to form a checksum column.  The last
       column reported is then the combination of all checksums	in  that  row,
       permitting easy comparison by looking at	a single value.

       The columns reported are	as follows.

	   Group     The  read	group  name.   There  is always	an "all" group
		     which represents all records.  This is  followed  by  one
		     checksum set per read-group found in the file.

	   QC	     This is either "all" or "pass".  "Pass" refers to records
		     that do not have the QCFAIL BAM flag specified.

	   flag+seq  The checksum of SAM FLAG +	SEQ fields

	   +name     The checksum of SAM QNAME + FLAG +	SEQ fields

	   +qual     The checksum of SAM FLAG +	SEQ + QUAL fields

	   +aux	     The  checksum  of	SAM  FLAG  +  SEQ + selected auxiliary
		     fields

	   +chr/pos  The checksum of SAM FLAG + SEQ + RNAME (chromosome) + POS
		     (position) fields

	   +mate     The checksum of SAM FLAG + SEQ + RNEXT + PNEXT + TLEN
		     (ISIZE) fields.

	   combined  The  combined  checksum of	all columns prior to this col-
		     umn.  The first row will be for all  alignments,  so  the
		     combined checksum on the first row	may be used as a whole
		     file combined checksum.

       An example output can be	seen below.

	 # Checksum for	file: NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.bam
	 # Aux tags:	      BC,FI,QT,RT,TC
	 # BAM flags:	      PAIRED,READ1,READ2

	 # Group    QC	      count  flag+seq  +name	 +qual	   +aux	     combined
	 all	    all	   42890086  71169bbb  633fd9f7	 2a2e693f  71169bbb  09d03ed4
	 SRR010946  all	     262249  2957df86  3b6dcbc9	 66be71f7  2957df86  58e89c25
	 SRR002165  all	      97846  47ff17e0  6ff8fc7b	 58366bf5  47ff17e0  796eecb0
	 [...cut...]
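
       A report in this layout can be read by a short script.  The Python
       sketch below is one possible parser, assuming whitespace separated
       columns, a header line starting with "# Group" and group names that
       contain no spaces; the function name is hypothetical.

```python
def parse_report(text):
    """Return {(group, qc): {column: value}} from a checksum report."""
    columns = None
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#":
            # The "# Group ..." comment names the columns; skip other comments.
            if len(fields) > 1 and fields[1] == "Group":
                columns = fields[1:]
            continue
        if columns is None or len(fields) != len(columns):
            continue  # not a data row, e.g. "[...cut...]"
        row = dict(zip(columns, fields))
        rows[(row["Group"], row["QC"])] = row
    return rows


report = """\
# Checksum for file: NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.bam
# Aux tags:          BC,FI,QT,RT,TC
# BAM flags:         PAIRED,READ1,READ2

# Group    QC        count  flag+seq  +name     +qual     +aux      combined
all        all    42890086  71169bbb  633fd9f7  2a2e693f  71169bbb  09d03ed4
SRR010946  all      262249  2957df86  3b6dcbc9  66be71f7  2957df86  58e89c25
SRR002165  all       97846  47ff17e0  6ff8fc7b  58366bf5  47ff17e0  796eecb0
"""

rows = parse_report(report)
whole_file = rows[("all", "all")]["combined"]
```

       Here whole_file holds the combined checksum from the first row, which
       is the single value to compare between two files.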

OPTIONS
       -@ COUNT	 Uses  COUNT  compute threads in decoding the file.  Typically
		 this does not gain much speed beyond 2	or 3.  The default  is
		 to use	a single thread.

       -B, --bamseqchksum
		 Produces  a  report compatible	with biobambam2's bamseqchksum
		 default output. Note this is only  expected  to  work	if  no
		 other	format	options	 have  been enabled.  Specifically the
		 header	line is	not updated to reflect additional  columns  if
		 requested.

		 Bamseqchksum  has  more  output  modes	 and  many alternative
		 checksums.  We	only support the default CRC32 method.

       -F FLAG,	--exclude-flags	FLAG
		 Specifies which alignment FLAGs to filter out.	 This defaults
		 to secondary and supplementary	alignments  (0x900)  as	 these
		 can be	duplicates of the primary alignment.  This ensures the
		 same  number  of  records  are	 checksummed  in unaligned and
		 aligned files.

       -f FLAG,	--require-flags	FLAG
		 A list	of FLAGs that are required.  Defaults to zero.	An ex-
		 ample use of this may be to checksum QCFAIL only.

       -b FLAG,	--flag-mask FLAG
		 The BAM FLAG is masked	first before  checksumming.   The  un-
		 aligned  flags	 will  contain data about the sequencing run -
		 whether it is paired in sequencing and	if so whether this  is
		 READ1	or  READ2.  These flags	will not change	post-alignment
		 and so	everything except these	three are  masked  out.	  FLAG
		 defaults to PAIRED,READ1,READ2	(0xc1).
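
		 The effect of the -F, -f and -b options on a single FLAG
		 value can be sketched in Python.  The bit values are those
		 of the SAM specification.  That -f requires all listed bits
		 and -F rejects any listed bit is an assumption carried over
		 from samtools view, and the function names are hypothetical.

```python
# FLAG bits as defined by the SAM specification.
PAIRED, UNMAP, MUNMAP = 0x1, 0x4, 0x8
READ1, READ2 = 0x40, 0x80
SECONDARY, QCFAIL, SUPPLEMENTARY = 0x100, 0x200, 0x800


def keep(flag, require=0, exclude=SECONDARY | SUPPLEMENTARY):
    """Filter step: defaults follow -f 0 and -F 0x900."""
    return (flag & require) == require and (flag & exclude) == 0


def masked(flag, mask=PAIRED | READ1 | READ2):
    """Mask step: default follows -b PAIRED,READ1,READ2 (0xc1)."""
    return flag & mask


# Flag 77 is an unaligned first read of a pair; flag 99 is the same read
# after alignment as a properly paired forward-strand record.
assert masked(77) == masked(99) == 0x41

# A supplementary alignment of that read is filtered out by default.
assert not keep(99 | SUPPLEMENTARY)
```

		 After masking, the aligned and unaligned forms of a read
		 contribute the same FLAG value to the checksum.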

       -c, --no-rev-comp
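		 By default the sequence and quality strings of reverse strand
		 alignments are reverse complemented before checksumming (the
		 quality string being simply reversed).  This restores the
		 original read orientation, so that alignment does not affect
		 the checksums.  This option disables this and checksums the
		 data as-is.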
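
		 The default orientation handling can be sketched in Python as
		 below.  This is an illustration only: it handles just A, C,
		 G, T and N, and the function name is hypothetical.

```python
REVERSE = 0x10  # SAM FLAG bit: read is stored reverse complemented

# Complement table for the plain bases only (ambiguity codes are ignored).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(flag, seq, qual):
    """Return (seq, qual) as they would have come off the sequencer."""
    if flag & REVERSE:
        # Complement each base and reverse; qualities are only reversed.
        return seq.translate(COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


# A reverse-strand alignment stores the read reverse complemented ...
stored = original_orientation(REVERSE, "AACGT", "ABCDE")
# ... so undoing that recovers the orientation seen in the FASTQ.
assert stored == ("ACGTT", "EDCBA")
```

		 With -c this step is skipped and the stored strings are
		 checksummed directly.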

       -t STR, --tags STR
		 Specifies  a  comma-separated	list  of aux tags to checksum.
		 These are concatenated	together in their canonical BAM	encod-
		 ing in	the order listed in STR, prior to computing the	check-
		 sums.

		 If STR	begins with "*"	then all tags are used.	 This can then
		 be followed by	a comma	separated list	of  tags  to  exclude.
		 For  example "*,MD,NM"	is all tags except MD and NM.  In this
		 mode, the tags	are combined in	alphanumeric order.

		 The default value is "BC,FI,QT,RT,TC".
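
		 How STR selects and orders tags can be sketched in Python.
		 The sketch assumes that tags absent from a record are simply
		 skipped and that "alphanumeric order" is a plain string
		 sort; the function name is hypothetical.

```python
def select_tags(spec, present):
    """Return the tags of one record to checksum, in checksum order.

    spec    -- the STR argument, e.g. "BC,FI,QT,RT,TC" or "*,MD,NM"
    present -- the tag names found on the record
    """
    parts = spec.split(",")
    if parts[0] == "*":
        # All tags except those listed after "*", in sorted order.
        excluded = set(parts[1:])
        return sorted(tag for tag in present if tag not in excluded)
    # Explicit list: keep the order given in STR.
    return [tag for tag in parts if tag in present]


record_tags = ["NM", "RT", "MD", "BC", "AS"]  # hypothetical record

default_choice = select_tags("BC,FI,QT,RT,TC", record_tags)
wildcard_choice = select_tags("*,MD,NM", record_tags)
```

		 For this record default_choice is BC then RT, while
		 wildcard_choice is AS, BC, RT.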

       -O, --in-order

		 By default the CRCs are combined in a multiplicative field
		 that is order agnostic, as multiplication is a commutative
		 and associative operation.  This option XORs the CRC with a
		 number indicating the Nth record number for this grouping
		 prior to the multiply step, making the final multiplicative
		 checksum dependent on the order of the input data.

		 For the "all" row the count is taken from the Nth record in
		 the read-group associated with this record (or the "-" row
		 for read-group-less data).  This ensures that the checksums
		 can be subsequently merged together algorithmically using the
		 -m option, but it does mean there is no validation of record
		 swaps between read-groups.  Note, however, that because of
		 the way ties are resolved, samtools merge out.bam rg1.bam
		 rg2.bam may order records differently from a merge of the
		 same two files given in the opposite order.  This can happen
		 when two read-groups have alignments at the same position
		 with the same BAM flags.  Hence, when checking that a
		 samtools split followed by samtools merge round trip works,
		 this per read-group counter is a benefit.

		 However, if absolute ordering needs to be validated regardless
		 of read-groups, specifying the -O option twice will compute
		 the "all" row by combining the CRC with the Nth record in the
		 file rather than the Nth record in its read-group.  This
		 output can no longer be merged using checksum -m.
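
		 The difference between a single and a double -O can be seen
		 in the following Python sketch.  It assumes counters start
		 at 1 and a modulus of 2^31-1, uses small made-up CRC values,
		 and is not the samtools implementation; names are
		 hypothetical.

```python
MOD = (1 << 31) - 1  # assumed modulus: the Mersenne prime 2^31-1


def mul(product, value):
    """Fold one value into a running product modulo MOD."""
    value %= MOD
    if value == 0:
        value = 1  # assumption: zero residues are remapped
    return (product * value) % MOD


def checksums(records, absolute_all=False):
    """records: iterable of (read_group, crc).  absolute_all mimics -OO."""
    products = {"all": 1}
    seen = {}
    for file_n, (rg, crc) in enumerate(records, start=1):
        group_n = seen[rg] = seen.get(rg, 0) + 1
        products[rg] = mul(products.get(rg, 1), crc ^ group_n)
        n_all = file_n if absolute_all else group_n
        products["all"] = mul(products["all"], crc ^ n_all)
    return products


# Two interleavings that keep the order within each read-group.
a = [("rg1", 100), ("rg2", 200), ("rg1", 300)]
b = [("rg2", 200), ("rg1", 100), ("rg1", 300)]

# Single -O: identical, and "all" is the product of the group rows.
assert checksums(a) == checksums(b)
# Double -O: the "all" row now depends on position in the file.
assert checksums(a, True)["all"] != checksums(b, True)["all"]
```

		 With a single -O the "all" row remains the product of the
		 per read-group rows, which is what allows checksum -m to
		 rebuild it.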

       -P, --check-pos
		 Adds a	column to the output with combined chromosome and  po-
		 sition	 checksums.   This also	incorporates the flag/sequence
		 CRC.

       -C, --check-cigar
		 Adds a	column to the output with combined mapping quality and
		 CIGAR checksums.  This	also  incorporates  the	 flag/sequence
		 CRC.

       -M, --check-mate
		 Adds  a  column  to  the output with combined mate reference,
		 mate position and template length checksums.  This  also  in-
		 corporates the	flag/sequence CRC.

       -z FLAGS, --sanitize FLAGS
		 Perform data sanitization prior to checksumming.  This	is off
		 by default.  See samtools view	for the	FLAG terms accepted.

       -N COUNT, --count COUNT
		 Limits	 the  checksumming to the first	COUNT records from the
		 file.

       -a, --all Checksum all data.  This is equivalent	to -PCMOc -b 0xfff -f0
		 -F0 -z	all,cigarx -t *,cF,MD,NM.   It is useful for  validat-
		 ing round-trips between file formats, such as BAM to CRAM.

       -T, --tabs
		 Use tabs for separating columns instead of aligned spaces.

       -q, --show-qc
		 Also  show  QC	 pass and fail rows per	read-group.  These are
		 based on the QCFAIL BAM flag.

       -o FILE,	--output FILE
		 Output	checksum report	to FILE	instead	of stdout.

       -m FILE,	--merge	FILE...
		 Merge checksum	outputs	produced by the	-o option.   This  can
		 be  used  to  simulate	 or  validate  the effect of computing
		 checksum on the output	of a samtools merge command.

		 The columns to	report are read	from the "# Group" line.   The
		 rows  to  report  are still governed by the -q, -v and	-T op-
		 tions so this can also	be used	for reformatting of  a	single
		 file.

		 Note the "all" row merging cannot be done when the two levels
		 of order-specific checksums (-OO) have been used.

       -v, --verbose
		 Increase  verbosity.	At  level  1 or	higher this also shows
		 rows that have	zero count values, which can aid machine pars-
		 ing.

EXAMPLES
       o To check that an aligned and position sorted file contains  the  same
	 data as the pre-alignment FASTQ:

	   samtools checksum -q	pos-aln.bam
	   samtools import -u -1 rg1.fastq.gz -2 rg2.fastq.gz |	samtools checksum -q

	 The output for	this consists of some human readable comments starting
	 with "#" and a	series of checksum lines per read-group	and QC status.

	   # Checksum for file:	SRR554369.P_aeruginosa.cram
	   # Aux tags:		BC,FI,QT,RT,TC
	   # BAM flags:		PAIRED,READ1,READ2

	   # Group    QC	count  flag+seq	 +name	   +qual     +aux      combined
	   all	      all     3315742  4a812bf2	 22d15cfe  507f0f57  4a812bf2  035e2f5b
	   all	      pass    3315742  4a812bf2	 22d15cfe  507f0f57  4a812bf2  035e2f5b

	 Note  as  no barcode tags exist, the "+aux" column is the same	as the
	 "flag+seq" column it is based upon.

       o To check round-tripping from BAM to CRAM and back again we  can  con-
	 vert  the  BAM	 to  CRAM  and then run	the checksum on	the CRAM file.
	 This does not need explicitly converting back to BAM as  htslib  will
	 decode	the CRAM and convert it	back to	the same in-memory representa-
	 tion that is utilised in BAM.

	   samtools checksum -a	9827_2#49.1m.bam
	   [...cut...]
	   samtools view -@8 -C	-T $HREF 9827_2#49.1m.bam | samtools checksum -a
	   # Checksum for file:	-
	   # Aux tags:		*,cF,MD,NM
	   # BAM flags:		PAIRED,PROPER_PAIR,UNMAP,MUNMAP,REVERSE,MREVERSE,READ1,READ2,SECONDARY,QCFAIL,DUP,SUPPLEMENTARY

	   # Group    QC	count  flag+seq	 +name	   +qual     +aux      +chr/pos	 +cigar	   +mate     combined
	   all	      all	99890  066a0706	 0805371d  5506e19f  6b0eec58  60e2347c	 09a2c3ba  347a3214  66c5e2de
	   1#49	      all	99890  066a0706	 0805371d  5506e19f  6b0eec58  60e2347c	 09a2c3ba  347a3214  66c5e2de

       o To validate that splitting a file by read-group retains all the
	 data, we can compute checksums on the split BAMs and merge the
	 checksum reports together to compare against the original unsplit
	 file.  (Note in the example below diff will report the filename
	 changing, which is expected.)

	   samtools split -u /tmp/split/noRG.bam -f '/tmp/split/%!.%.' in.cram
	   samtools checksum -a	in.cram	-o in.chksum
	   s=$(for i in	/tmp/split/*.bam;do echo "<(samtools checksum -a $i)";done)
	   eval	samtools checksum -m $s	-o split.chksum
	   diff	in.chksum split.chksum
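
	 As diff always flags the filename line, a comparison that ignores it
	 may be more convenient in scripts.  The Python sketch below is one
	 way to do this, assuming the filename comment starts with
	 "# Checksum for file:" as in the example output above; the function
	 names and the sample reports are hypothetical.

```python
FILE_COMMENT = "# Checksum for file:"


def significant_lines(report_text):
    """Split each line into fields, dropping blanks and the filename line."""
    return [
        line.split()
        for line in report_text.splitlines()
        if line.strip() and not line.lstrip().startswith(FILE_COMMENT)
    ]


def same_checksums(report_a, report_b):
    """True if two reports agree on everything except the filename."""
    return significant_lines(report_a) == significant_lines(report_b)


# Made-up reports differing only in the filename comment.
original = """\
# Checksum for file: in.cram
# Group  QC   count  flag+seq  combined
all      all      2  0000abcd  00001234
"""
rebuilt = original.replace("in.cram", "-")

assert same_checksums(original, rebuilt)
```

	 Comparing split fields rather than raw lines also makes the check
	 insensitive to column padding.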

AUTHOR
       Written by James	Bonfield from the Sanger Institute.
       Inspired	 by bamseqchksum, written by David Jackson of Sanger Institute
       and amended by German Tischler.

SEE ALSO
       samtools(1), samtools-view(1)

       Samtools	website: <http://www.htslib.org/>

samtools-1.22			  30 May 2025		  samtools-checksum(1)
