annot-tsv(1)		     Bioinformatics tools		  annot-tsv(1)

NAME
       annot-tsv  -  transfer  annotations from	one TSV	(tab-separated values)
       file into another

SYNOPSIS
       annot-tsv [OPTIONS]

DESCRIPTION
       The program finds overlaps in two sets of genomic regions (for  example
       two CNV call sets) and annotates	regions	of the target file (-t,	--tar-
       get-file)  with information from	overlapping regions of the source file
       (-s, --source-file).

       It can transfer one or more columns (-f, --transfer), and the transfer
       can be made conditional on matching values in one or more columns
       (-m, --match).  In addition to column transfer (-f) and special
       annotations (-a, --annotate), the program can operate in a simple
       grep-like mode and print matching lines (when neither -f nor -a is
       given) or drop matching lines (-x, --drop-overlaps).

       All indexes and coordinates are 1-based and inclusive.
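
       The 1-based, inclusive convention affects how overlap lengths are
       counted. The following is a minimal sketch in Python, not the
       program's own code; the function name overlap_len is hypothetical and
       used only for illustration.

```python
def overlap_len(beg1, end1, beg2, end2):
    """Number of positions shared by two 1-based, inclusive intervals."""
    lo = max(beg1, beg2)
    hi = min(end1, end2)
    # Both end points belong to the interval, hence the "+ 1".
    return max(0, hi - lo + 1)


print(overlap_len(100, 200, 150, 200))  # 51
print(overlap_len(100, 200, 200, 300))  # 1 (the intervals share position 200)
print(overlap_len(100, 200, 201, 300))  # 0
```

       With inclusive coordinates, two regions that merely touch at one
       position still overlap by one base.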

OPTIONS
   Common Options
       -c, --core SRC:TGT
	   List of names of the core columns, in the order chromosome, start
	   position, end position, irrespective of the header names and the
	   order in which the columns appear in the source or target file
	   (for example "chr,beg,end:CHROM,START,END"). If both files use the
	   same header names, the TGT names can be omitted (for example
	   "chr,beg,end"). If the SRC or TGT file has no header, 1-based
	   indexes can be given instead (for example "chr,beg,end:3,1,2").
	   Note that regions are not required; the program can also work with
	   a list of positions (for example "chr,beg,end:CHROM,POS,POS").

       -f, --transfer SRC:TGT
	   Comma-separated list of columns to transfer. If the SRC column does
	   not exist, it is interpreted as the default value to fill in when
	   a match is found, with a dot (".") written when no match is found.
	   If the TGT column does not exist, a new column is created. If the
	   TGT column already exists, its values are overwritten when an
	   overlap is found and left as is otherwise.

       -m, --match SRC:TGT
	   Columns whose values are required to be identical in the source
	   and target files

       -o, --output FILE
	   Output file name. By default the result is printed on standard
	   output

       -s, --source-file FILE
	   Source file with annotations	to transfer

       -t, --target-file FILE
	   Target file to be extended with annotations from -s, --source-file

   Other Options
       --allow-dups
	   Add the same	annotations multiple times if  multiple	 overlaps  are
	   found

       --help
	   This	help message

       --max-annots INT
	   Add at most INT annotations per column to save time when many over-
	   laps	are found with a single	region

       --version
	   Print version string	and exit

       -a, --annotate LIST
	   Add one or more special annotations, each optionally followed by
	   its target name, separated by ':'. If no target name is given, the
	   special annotation's name will be used in the output header.

	   cnt
	       number of overlapping regions

	   frac
	       fraction	of the target region with an overlap

	   nbp
	       number of source	base pairs in the overlap
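
	   The three special annotations can be illustrated with a small
	   Python sketch. It is not the program's implementation: the
	   function name special_annotations is hypothetical, all regions are
	   assumed to lie on the same chromosome, and the source regions are
	   assumed not to overlap each other, so that summing per-region
	   overlaps does not count any base twice.

```python
def special_annotations(tgt, sources):
    """Compute cnt, frac and nbp for one target region.

    tgt and each element of sources are (beg, end) tuples with 1-based,
    inclusive coordinates. Assumption: source regions are disjoint.
    """
    tbeg, tend = tgt
    cnt = 0  # number of overlapping source regions
    nbp = 0  # number of overlapping base pairs
    for sbeg, send in sources:
        shared = min(tend, send) - max(tbeg, sbeg) + 1
        if shared > 0:
            cnt += 1
            nbp += shared
    frac = nbp / (tend - tbeg + 1)  # fraction of the target that is covered
    return {"cnt": cnt, "frac": frac, "nbp": nbp}


print(special_annotations((150, 200), [(100, 200), (300, 400)]))
# {'cnt': 1, 'frac': 1.0, 'nbp': 51}
print(special_annotations((150, 249), [(100, 174), (225, 300)]))
# {'cnt': 2, 'frac': 0.5, 'nbp': 50}
```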

       -d, --delim SRC:TGT
	   Column delimiter in the source and the target file. For example, if
	   both	 files	are  comma-delimited, run with "--delim	,:," or	simply
	   "--delim ,".	If the source file is comma-delimited and  the	target
	   file	is tab-delimited, run with "-d $',:\t'".

       -h, --headers SRC:TGT
	   Line number of the header row with column names. By default the
	   first line is interpreted as the header if it starts with the
	   comment character ("#"); otherwise numeric indices are expected.
	   However, if the first line does not start with "#" but still
	   contains the column names, use "--headers 1:1". To ignore an
	   existing header (skip comment lines) and use numeric indices, use
	   "--headers 0:0", which is equivalent to "--ignore-headers". When a
	   negative value is given, it is interpreted as the number of lines
	   from the end of the comment block. Specifically, "--headers -1"
	   takes the column names from the last line of the comment block
	   (e.g., the "#CHROM" line in the VCF format).
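
	   The header selection rules above can be sketched in Python as
	   follows. This is an illustration of the documented behaviour under
	   stated assumptions, not the program's code: the function name
	   pick_header is hypothetical, None stands for "option not given" or
	   "no header", and out-of-range values are simply treated as "no
	   header".

```python
def pick_header(lines, headers=None, comment="#"):
    """Return the line holding the column names, or None for numeric indices."""
    # Count the lines of the leading comment block.
    n_comment = 0
    for line in lines:
        if not line.startswith(comment):
            break
        n_comment += 1

    if headers is None:
        # Default: the first line is the header only if it is a comment line.
        return lines[0] if n_comment > 0 else None
    if headers == 0:
        # Ignore any header and use numeric indices.
        return None
    if headers > 0:
        # Positive value: 1-based line number of the header row.
        return lines[headers - 1] if headers <= len(lines) else None
    # Negative value: count back from the end of the comment block.
    idx = n_comment + headers
    return lines[idx] if 0 <= idx < n_comment else None


vcf_like = ["##fileformat=VCFv4.2", "##source=demo", "#CHROM\tPOS\tID", "chr1\t100\t."]
plain = ["chr\tbeg\tend", "chr1\t100\t200"]

print(repr(pick_header(vcf_like, -1)))  # '#CHROM\tPOS\tID'
print(repr(pick_header(plain)))         # None
print(repr(pick_header(plain, 1)))      # 'chr\tbeg\tend'
```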

       -H, --ignore-headers
	   Ignore  the	headers	completely and use numeric indexes even	when a
	   header exists

       -I, --no-hdr-idx
	   Suppress index numbers in the printed header. If given twice,  drop
	   the entire header.

       -O, --overlap FLOAT[,FLOAT]
	   Minimum overlap as a fraction of region length in SRC and TGT,
	   respectively (with two numbers), or in at least one of the
	   overlapping regions (with a single number). If -r, --reciprocal is
	   also given, require at least FLOAT overlap with respect to both
	   regions. Two identical numbers are equivalent to running with -r,
	   --reciprocal.

       -r, --reciprocal
	   Require the -O, --overlap threshold to be met with respect to both
	   overlapping regions
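
	   The interplay of -O and -r can be sketched in Python as below.
	   This is an illustration only: the function name passes_overlap and
	   its parameters are hypothetical, and treating the threshold as
	   "greater than or equal to" is an assumption.

```python
def passes_overlap(src, tgt, min_src, min_tgt=None, reciprocal=False):
    """Decide whether two overlapping regions meet the minimum overlap.

    src and tgt are (beg, end) tuples, 1-based and inclusive.
    min_src alone models "-O FLOAT"; min_src with min_tgt models
    "-O FLOAT,FLOAT"; reciprocal=True models "-r".
    """
    sbeg, send = src
    tbeg, tend = tgt
    shared = max(0, min(send, tend) - max(sbeg, tbeg) + 1)
    if shared == 0:
        return False
    src_frac = shared / (send - sbeg + 1)
    tgt_frac = shared / (tend - tbeg + 1)
    if min_tgt is not None:
        # Two numbers: separate thresholds for SRC and TGT.
        return src_frac >= min_src and tgt_frac >= min_tgt
    if reciprocal:
        # One number plus -r: both regions must meet it.
        return src_frac >= min_src and tgt_frac >= min_src
    # One number: at least one of the regions must meet it.
    return src_frac >= min_src or tgt_frac >= min_src


src, tgt = (100, 200), (150, 200)  # 51 shared bases: ~50% of SRC, 100% of TGT
print(passes_overlap(src, tgt, 0.8))                   # True
print(passes_overlap(src, tgt, 0.8, reciprocal=True))  # False
print(passes_overlap(src, tgt, 0.8, 0.8))              # False
print(passes_overlap(src, tgt, 0.5, 0.9))              # True
```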

       -x, --drop-overlaps
	   Drop	overlapping regions (cannot be combined	with -f, --transfer)

EXAMPLE
       Both SRC and TGT input files are expected to be tab-delimited unless
       -d, --delim is given, and may or may not have a header. Their columns
       can be named differently and can appear in arbitrary order. For
       example, consider the source file

       #chr   beg   end	  sample   type	  qual
       chr1   100   200	  smpl1	   DEL	  10
       chr1   300   400	  smpl2	   DUP	  30

       and the target file

       150   200   chr1	  smpl1
       150   200   chr1	  smpl2
       350   400   chr1	  smpl1
       350   400   chr1	  smpl2

       In the first example we transfer type and quality, but only for
       regions with a matching sample. Notice that the header is present in
       SRC but not in TGT; therefore we use column indexes for the latter:

       annot-tsv -s src.txt.gz -t tgt.txt.gz -c	chr,beg,end:3,1,2 -m sample:4 -f type,qual
       150   200   chr1	  smpl1	  DEL	10
       150   200   chr1	  smpl2	  .	.
       350   400   chr1	  smpl1	  .	.
       350   400   chr1	  smpl2	  DUP	30
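
       The logic of this first example can be reproduced with a short Python
       sketch. It only illustrates the documented behaviour and is not the
       program's implementation; the function name annotate is hypothetical,
       and taking the first matching source region is a simplifying
       assumption.

```python
# Source regions, keyed by the header names of the source file above.
src = [
    {"chr": "chr1", "beg": 100, "end": 200, "sample": "smpl1", "type": "DEL", "qual": "10"},
    {"chr": "chr1", "beg": 300, "end": 400, "sample": "smpl2", "type": "DUP", "qual": "30"},
]

# Header-less target rows: column 1 = start, 2 = end, 3 = chromosome, 4 = sample.
tgt = [
    [150, 200, "chr1", "smpl1"],
    [150, 200, "chr1", "smpl2"],
    [350, 400, "chr1", "smpl1"],
    [350, 400, "chr1", "smpl2"],
]


def annotate(tgt_rows, src_rows):
    """Append type and qual from an overlapping source region with the same sample."""
    out = []
    for beg, end, chrom, sample in tgt_rows:
        values = [".", "."]  # written when no match is found
        for s in src_rows:
            same_chr = s["chr"] == chrom
            overlaps = s["beg"] <= end and beg <= s["end"]  # inclusive coordinates
            if same_chr and overlaps and s["sample"] == sample:
                values = [s["type"], s["qual"]]
                break  # assumption: the first match is enough for this sketch
        out.append([beg, end, chrom, sample] + values)
    return out


for row in annotate(tgt, src):
    print("\t".join(str(v) for v in row))
```

       The printed rows correspond to the output shown above: only the
       first and the last target region receive DEL/10 and DUP/30, the
       other two are filled with dots.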

       The next example demonstrates the special annotations nbp and cnt,
       written to the output under the target names pair and count. In this
       case we use a target file with a header so that the column names are
       copied to the output:

       #from	 to   chrom	sample
       150  200	 chr1 smpl1
       150  200	 chr1 smpl2
       350  400	 chr1 smpl1
       350  400	 chr1 smpl2

       annot-tsv -s src.txt.gz -t tgt_hdr.txt.gz -c chr,beg,end:chrom,from,to -m sample	-f type,qual -a	nbp,cnt:pair,count
       #[1]from	 [2]to	   [3]chrom  [4]sample [5]type	 [6]qual   [7]pair   [8]count
       150  200	 chr1 smpl1	DEL  10	  51   1
       150  200	 chr1 smpl2	.    .	  0    0
       350  400	 chr1 smpl1	.    .	  0    0
       350  400	 chr1 smpl2	DUP  30	  51   1

       Either the SRC or the TGT file can be streamed from stdin:

       cat src.txt | annot-tsv -t tgt.txt -c chr,beg,end:3,1,2 -m sample:4 -f type,qual -o output.txt
       cat tgt.txt | annot-tsv -s src.txt -c chr,beg,end:3,1,2 -m sample:4 -f type,qual -o output.txt

       The program can be used in a grep-like mode to print only the
       matching regions of the target file without modifying the records:

       annot-tsv -s src.txt -t tgt.txt -c chr,beg,end:3,1,2 -m sample:4
       150   200   chr1	  smpl1
       350   400   chr1	  smpl2

AUTHORS
       The program was written by Petr Danecek and was originally published on
       GitHub as annot-regs.

COPYING
       The MIT/Expat License, see the LICENSE document for details.
       Copyright (c) Genome Research Ltd.

htslib-1.21		       12 September 2024		  annot-tsv(1)
