ref-cache(1)		     Bioinformatics tools		  ref-cache(1)

NAME
       ref-cache - CRAM	reference caching proxy

SYNOPSIS
       ref-cache [-bhLsUv] [-l LOG_DIR] [-m NETWORKS] [-n PROCESSES] [-r COUNT]
                 [-R SIZE] [-u URL] -d CACHE_DIR -p PORT

DESCRIPTION
       ref-cache  is a caching proxy for reference sequences, for use when en-
       coding and decoding CRAM	format sequence	alignment files.

       CRAM can	use reference based  compression  where	 individual  bases  in
       aligned	records	are compared against a known reference sequence, stor-
       ing only	the bases that differ.	This gives better compression, but re-
       quires the reference sequence to	be supplied from an  external  source.
       One way to get these sequences is by querying a server implementing the
       GA4GH refget standard <https://ga4gh.github.io/refget/>.  However, this
       can lead to excessive network traffic and server load if, as is often
       the case, the same reference is needed more than once.  ref-cache makes
       reference handling easier by keeping copies of downloaded files,
       allowing them to be reused when they are needed again.

       As  it  has been	specifically designed to serve reference sequences for
       CRAM encoders and decoders, ref-cache  behaves  rather  differently  to
       general-purpose caching web proxies:

       * It only makes requests to a single upstream server.

       * Sequences are requested using MD5 checksum identifiers, as stored in
         the M5 tag in CRAM @SQ header lines.

       * The requested sequence is returned as a single string in uppercase
         ASCII text with no line terminators or other formatting characters.

       * Downloaded sequences are checked to ensure they have the correct MD5
         checksum before being stored in the cache.

       * Cached sequences are never removed.

       * Cached sequences are stored in a way that allows them to be accessed
         directly on the filesystem.  This can allow the web server to be
         bypassed in some set-ups (for example where the cache is on a shared
         drive), allowing already-downloaded files to be accessed more
         efficiently.

       * When started, ref-cache tests whether it is already running on the
         specified port, and exits silently if it finds that it is.  This
         enables a simple way of ensuring the server is up, by trying to
         restart it every few minutes.
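
       The restart-based approach in the last point could be driven from
       cron(8) or a similar scheduler.  The following is a minimal sketch and
       not part of ref-cache itself: the function name start_ref_cache and the
       REF_CACHE_BIN override are hypothetical, and the script path in the
       crontab comment is a placeholder.  It relies only on the behaviour
       described above, that a second instance exits silently when the port is
       already being served.

```shell
#!/bin/sh
# Hypothetical helper: (re)start ref-cache in the background.
# If an instance is already serving the port, the new one exits silently,
# so calling this repeatedly is harmless.
# REF_CACHE_BIN is an assumed override for the executable name.
start_ref_cache() {
    cache_dir=$1
    log_dir=$2
    port=$3
    "${REF_CACHE_BIN:-ref-cache}" -b -d "$cache_dir" -l "$log_dir" -p "$port"
}

# Example crontab entry (every five minutes), assuming the function above is
# saved in a script that calls it with your own directories and port:
#   */5 * * * * /path/to/start-ref-cache.sh
```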

QUICK-START GUIDE
       Create  directories  for	 the  cache  and (optionally) log files.  Then
       start up	the server in the background, listening	on port	8080 and  with
       the EBI's CRAM reference	server as the upstream source.

	 mkdir cached_refs
	 mkdir logs
	 ref-cache -b -d cached_refs -l	logs -p	8080 -u	https://www.ebi.ac.uk/ena/cram/md5/

       To make SAMtools	and HTSlib use the server, set its URL in the REF_PATH
       environment variable (note that colons should be	doubled	up in the URL,
       and you should substitute the hostname of your actual server).

	 REF_PATH='http:://myserver.example.com::8080/%s'
	 export	REF_PATH
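
       Doubling the colons can be done mechanically.  The sketch below uses a
       hypothetical helper, ref_path_escape, to turn an ordinary URL into the
       form REF_PATH expects; the hostname is the same placeholder as above.

```shell
# Hypothetical helper: double every colon in a URL so that it can be used as
# a single REF_PATH element (REF_PATH itself uses ':' as its separator).
ref_path_escape() {
    printf '%s\n' "$1" | sed 's/:/::/g'
}

REF_PATH=$(ref_path_escape 'http://myserver.example.com:8080/%s')
export REF_PATH
echo "$REF_PATH"
# prints http:://myserver.example.com::8080/%s
```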

       If the cache directory can be made visible to SAMtools/HTSlib
       processes, it can also be added directly to REF_PATH by putting it
       before the web server URL.  It is necessary to use the full path to the
       directory, followed by "/%2s/%2s/%s" for the file location, because of
       the way files are laid out inside the cache.

	 REF_PATH='/path/to/cache/%2s/%2s/%s:http:://myserver.example.com::8080/%s'
	 export	REF_PATH

       This is useful as accessing the files directly is more efficient than
       using HTTP.  Files are downloaded to a temporary name and then renamed
       after validation, so processes directly using the cache will never try
       to use a partly downloaded file.  By putting the URL at the end, the
       web server will pick up any requests for references not already in the
       cache, download them, provide them to the requester, and store them in
       the cache.
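
       To illustrate the "/%2s/%2s/%s" layout, the sketch below maps an MD5
       checksum to a path inside the cache and checks a file against its
       expected checksum.  It assumes the usual HTSlib reading of the pattern
       (the first two characters, the next two, then the remainder).  The
       helper names cache_path and verify_cached are hypothetical, md5sum(1)
       is assumed to be installed, and the checksum shown is only a sample
       value.

```shell
# Hypothetical helper: map an MD5 checksum to its location under a cache
# directory laid out as %2s/%2s/%s.
cache_path() {
    cp_dir=$1
    cp_md5=$2
    cp_a=$(printf '%s' "$cp_md5" | cut -c1-2)    # first two characters
    cp_b=$(printf '%s' "$cp_md5" | cut -c3-4)    # next two characters
    cp_rest=$(printf '%s' "$cp_md5" | cut -c5-)  # remainder of the checksum
    printf '%s/%s/%s/%s\n' "$cp_dir" "$cp_a" "$cp_b" "$cp_rest"
}

# Hypothetical helper: succeed if the MD5 of file $1 equals checksum $2.
# Assumes md5sum(1) is available; some systems provide md5(1) instead.
verify_cached() {
    [ "$(md5sum < "$1" | cut -d' ' -f1)" = "$2" ]
}

cache_path /path/to/cache d41d8cd98f00b204e9800998ecf8427e
# prints /path/to/cache/d4/1d/8cd98f00b204e9800998ecf8427e
```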

OPTIONS
       -b	 Run  in  the background as a System V-style daemon.  This op-
		 tion must not be used with -s.

       -d <dir>	 Directory where cached	files will be stored

       -h	 Show help

       -l <dir>	 Directory for log files.  If not set and running in the fore-
		 ground, logs will be sent to stdout

       -L	 Don't log

       -m all|default|localhost|<network-list>
		 Reply to connections from the listed network(s).  This	option
		 can be	given more than	once, with the final allow list	 being
		 the  union of all listed networks along with localhost	(which
		 is always enabled).  See CLIENT ADDRESS CHECKING below.

       -n <1-4>	 Number	of server processes to run

       -p <port> Port number to	listen on

       -r <num>	 Number	of request log files to	keep

       -R <num>	 Maximum size of a request log file (MiB)

       -s	 Run as	a systemd-style	socket service.	 As the	 service  man-
		 ager handles socket allocation, the -p	option is ignored when
		 running in this mode.	This option must not be	used with -b.

       -u <url>	 URL  of  the upstream server.	If not set or overridden using
		 -U, the  EBI's	 server	 (https://www.ebi.ac.uk/ena/cram/md5/)
		 will be used.

       -U	 Do  not  attempt  to get files	from an	upstream server.  Only
		 files already in the local cache will be served.

       -v	 Turn on debugging output

CLIENT ADDRESS CHECKING
       ref-cache is designed to	serve references to local networks.  To	ensure
       that it only responds to	the desired clients, it	has an allow  list  of
       address	ranges	that  it  will talk to.	 If a connection attempt comes
       from an IP address not in the allowed set, it will  be  closed  immedi-
       ately.	(N.B.: Rejected	clients	will see a connection open and immedi-
       ately close, as it's necessary for connections to  be  opened  for  the
       server to discover the peer address.  If	you want to drop or reject un-
       wanted requests without opening them, you will need to use your operat-
       ing system's firewall.)

       The  address  ranges  can be set	using the -m option, which may be used
       more than once.	Networks can be	specified either as a  comma-separated
       list of CIDR-format blocks (e.g.	 192.0.2.0/24, 2001:db8::/32) or using
       one of the following synonyms:

	   all	  Any address (not recommended)

	   default
		  10.0.0.0/8,	172.16.0.0/12,	 192.168.0.0/16	 (the  private
		  ranges listed	in RFC 1918); fc00::/7 (the local IPv6 Unicast
		  address range	in RFC 4193); and fe80::/10  (IPv6  link-local
		  addresses)

	   localhost
		  127.0.0.0/8 and ::1/128 (loop-back addresses)

       If  no -m option	is given, the "default"	list will be used, as most or-
       ganisations will	be using one or	more of	these internally.   This  will
       be  overridden  if any -m option	appears, in which case -m default will
       need to be specified explicitly if you also want	to reply to  addresses
       in the IPv4 and IPv6 private ranges.  For example:

		  ref-cache -m 192.0.2.0/24 -m default ...

       ref-cache will always reply to connections from the loop-back address,
       even if this was not specified.  Using -m localhost on its own limits
       it to responding only to loop-back requests.
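
       The allow-list matching described in this section can be illustrated
       with a small IPv4-only sketch.  It is not ref-cache's implementation:
       the function names ip_to_int, in_cidr and in_any are hypothetical, IPv6
       is not handled, and the block list must be comma-separated with no
       spaces.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if IPv4 address $1 lies inside CIDR block $2 (e.g. 192.0.2.0/24).
in_cidr() {
    cidr_addr=$(ip_to_int "$1")
    cidr_net=$(ip_to_int "${2%/*}")
    cidr_bits=${2#*/}
    if [ "$cidr_bits" -eq 0 ]; then
        cidr_mask=0
    else
        cidr_mask=$(( (0xFFFFFFFF << (32 - $cidr_bits)) & 0xFFFFFFFF ))
    fi
    [ $(( $cidr_addr & $cidr_mask )) -eq $(( $cidr_net & $cidr_mask )) ]
}

# Succeed if address $1 is inside any block of the comma-separated list $2.
in_any() {
    any_ip=$1
    any_old_ifs=$IFS
    IFS=,
    set -- $2
    IFS=$any_old_ifs
    for any_block in "$@"; do
        in_cidr "$any_ip" "$any_block" && return 0
    done
    return 1
}

# The IPv4 part of the "default" list from this section.
DEFAULT_V4=10.0.0.0/8,172.16.0.0/12,192.168.0.0/16

in_any 192.168.1.10 "$DEFAULT_V4" && echo "192.168.1.10 allowed"
in_any 192.0.2.55 "$DEFAULT_V4" || echo "192.0.2.55 rejected"
```

       With only the default list, 192.0.2.55 is rejected; adding
       192.0.2.0/24 to the list (the equivalent of -m 192.0.2.0/24 -m default)
       would allow it.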

AUTHOR
       Written by Rob Davies from the Wellcome Sanger Institute.

SEE ALSO
       samtools(1)

       Samtools	website: <http://www.htslib.org/>

       CRAM specification: <https://samtools.github.io/hts-specs/CRAMv3.pdf>

       Refget website: <https://ga4gh.github.io/refget/>

htslib-1.22			  30 May 2025			  ref-cache(1)
