cmcalibrate(1)			Infernal Manual			cmcalibrate(1)

NAME
       cmcalibrate - fit exponential tails for covariance model	E-value	deter-
       mination

SYNOPSIS
       cmcalibrate [options] cmfile

DESCRIPTION
       cmcalibrate determines exponential tail parameters for E-value determi-
       nation  by  generating random sequences,	searching them with the	CM and
       collecting the scores of	the resulting hits. A  histogram  of  the  bit
       scores of the hits is fit to an exponential tail, and the parameters of
       the fitted tail are saved to the	CM file. The exponential tail  parame-
       ters  are  then	used  to estimate the statistical significance of hits
       found in	cmsearch and cmscan.
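       As an illustration of how fitted tail parameters can turn a bit score
       into a significance estimate, here is a minimal sketch of a generic
       exponential-tail calculation. It is not Infernal's code. The function
       name and the hits_at_mu argument (the expected number of random hits
       scoring at or above mu in the target being searched) are assumptions
       made for this example.

```python
import math


def tail_evalue(score, mu, lam, hits_at_mu):
    """Expected number of random hits scoring >= score under an exponential tail.

    mu         -- score at which the fitted tail begins
    lam        -- fitted decay rate (lambda) of the tail
    hits_at_mu -- hypothetical expected number of random hits scoring >= mu
    """
    if score < mu:
        raise ValueError("score lies below the fitted tail")
    # Survival function of an exponential tail: exp(-lambda * (score - mu)).
    return hits_at_mu * math.exp(-lam * (score - mu))
```

       With lambda equal to ln 2, each additional bit of score halves the
       estimate.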

       A CM file must be calibrated with cmcalibrate before it can be used  in
       cmsearch	 or  cmscan,  with  a single exception:	it is not necessary to
       calibrate CM files that include only models with	zero basepairs	before
       running cmsearch.

       cmcalibrate is very slow. It takes a couple of hours to calibrate a
       single average-sized CM on a single CPU. cmcalibrate will run in
       parallel on all available cores if Infernal was built on a system
       that supports POSIX threading (see the Installation section of the
       user guide for more information). Using <n> cores will result in
       roughly <n>-fold acceleration versus a single CPU. MPI (Message
       Passing Interface) can also be used for parallelization with the
       --mpi option if Infernal was built with MPI enabled, but using more
       than 161 processors is not recommended because increasing past 161
       won't accelerate the calibration. See the Installation section of
       the user guide for more information.
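       The scaling described above can be turned into a back-of-the-envelope
       estimate. The sketch below assumes ideal n-fold speedup and applies
       the 161-processor ceiling only to MPI runs, as the text states. The
       function name and the 2-hour baseline are illustrative assumptions,
       not measurements.

```python
MPI_USEFUL_MAX = 161  # per the text: more MPI processors do not speed things up


def estimated_hours(single_cpu_hours, workers, mpi=False):
    """Rough wall-clock estimate assuming ideal n-fold acceleration."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    effective = min(workers, MPI_USEFUL_MAX) if mpi else workers
    return single_cpu_hours / effective


print(estimated_hours(2.0, 8))  # 0.25
```

       For a measured prediction on a particular machine, the --forecast
       option is the better tool.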

       The --forecast option can be used to estimate how long the program will
       take to run for a given cmfile on the current machine. To predict the
       running time on _n_ processors with MPI, additionally use the
       --nforecast _n_ option.

       The random sequences searched in	cmcalibrate are	generated  by  an  HMM
       that  was  trained  on real genomic sequences with various GC contents.
       The goal	is to have the GC distributions	in  the	 random	 sequences  be
       similar to those	in actual genomic sequences.

       Four  rounds  of	searches and subsequent	exponential tail fits are per-
       formed, one each	for the	four different CM algorithms that can be  used
       in  cmsearch and	cmscan:	glocal CYK, glocal Inside, local CYK and local
       Inside.

       The E-value parameters determined by cmcalibrate are only used by the
       cmsearch and cmscan programs. If you are not going to use these
       programs, then do not waste time calibrating your models.

OPTIONS
       -h     Help; print a brief reminder of command line usage and available
	      options.

       -L _x_ Set  the	total  length  of  random  sequences  to search	to _x_
	      megabases	(Mb). By default, _x_ is 1.6 Mb. Increasing  _x_  will
	      make  the	 exponential  tail fits	more precise and E-values more
	      accurate,	but will take longer (doubling _x_ will	roughly	double
	      the running time).  Decreasing _x_ is not	recommended as it will
	      make the fits less precise and the E-values less accurate.

OPTIONS	FOR PREDICTING REQUIRED	TIME AND MEMORY
       --forecast
	      Predict the running time of the calibration of cmfile (with pro-
	      vided  options) on the current machine and exit. The calibration
	      is not performed.	 The predictions should	 be  considered	 rough
	      estimates.  If  multithreading is	enabled	(see Installation sec-
	      tion of user guide), the timing will take	into account the  num-
	      ber of available cores.

       --nforecast _n_
	      With  --forecast,	 specify  that _n_ processors will be used for
	      the calibration.	This might be useful for predicting  the  run-
	      ning time	of an MPI run with _n_ processors.

       --memreq
	      Predict  the  amount  of	required memory	for calibrating	cmfile
	      (with provided options) on the current  machine  and  exit.  The
	      calibration is not performed.
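       A small sketch of assembling the prediction commands from a script,
       using only the options documented in this section. The helper names
       and the file name my.cm are hypothetical. The argument lists follow
       the SYNOPSIS order (options first, then cmfile).

```python
import shlex


def forecast_command(cmfile, nprocs=None):
    """Build the argv for a running-time prediction (nothing is calibrated)."""
    argv = ["cmcalibrate", "--forecast"]
    if nprocs is not None:
        # --nforecast is only meaningful together with --forecast
        argv += ["--nforecast", str(nprocs)]
    argv.append(cmfile)
    return argv


def memreq_command(cmfile):
    """Build the argv for a memory-requirement prediction."""
    return ["cmcalibrate", "--memreq", cmfile]


print(shlex.join(forecast_command("my.cm", nprocs=32)))
# cmcalibrate --forecast --nforecast 32 my.cm
```

       The returned lists could be passed to subprocess.run() on a machine
       where Infernal is installed.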

OPTIONS	CONTROLLING EXPONENTIAL	TAIL FITS
       --gtailn _x_
              Fit the exponential tail for glocal Inside and glocal CYK to the
              _n_ highest scores in the histogram tail, where _n_ is _x_ times
              the number of Mb searched. The default value of _x_ is 250. The
              value 250 was chosen because it works well empirically relative
              to other values.

       --ltailn _x_
              Fit the exponential tail for local Inside and local CYK to the
              _n_ highest scores in the histogram tail, where _n_ is _x_ times
              the number of Mb searched. The default value of _x_ is 750. The
              value 750 was chosen because it works well empirically relative
              to other values.

       --tailp _x_
              Ignore the --gtailn and --ltailn options and, for all search
              modes, fit the exponential tail to the highest-scoring fraction
              _x_ of the histogram.
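       To make the tail-size arithmetic concrete, the sketch below computes
       _n_ for the default settings (1.6 Mb searched) and then fits an
       exponential to the _n_ highest scores. The fit shown is the textbook
       maximum-likelihood estimate for a shifted exponential and is an
       assumption for illustration. Infernal's own fitting routine may differ
       in detail, and both function names are hypothetical.

```python
def tail_size(x_per_mb, megabases=1.6):
    """Number of top-scoring hits used in the fit: x times the Mb searched."""
    return int(round(x_per_mb * megabases))


def fit_exponential_tail(scores, n_tail):
    """Fit an exponential to the n_tail highest scores.

    Returns (mu, lam), where mu is the lowest score in the tail and
    lam = 1 / mean(score - mu) over the tail.
    """
    tail = sorted(scores, reverse=True)[:n_tail]
    if len(tail) < 2:
        raise ValueError("need at least two scores in the tail")
    mu = tail[-1]
    mean_excess = sum(s - mu for s in tail) / len(tail)
    if mean_excess <= 0:
        raise ValueError("tail scores are all identical")
    return mu, 1.0 / mean_excess


print(tail_size(250))  # 400  (default --gtailn with 1.6 Mb)
print(tail_size(750))  # 1200 (default --ltailn with 1.6 Mb)
```

       A --tailp style fit would choose the tail size as a fraction of the
       hits rather than as a count per Mb.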

OPTIONAL OUTPUT	FILES
       --hfile _f_
	      Save the histograms fit to file _f_.  The	format of this file is
	      two space	delimited columns per line. The	first column is	the x-
	      axis  values of bit scores of each bin. The second column	is the
	      y-axis values of number of hits per bin. Each series  is	delim-
	      ited  by	a line with a single character "&". The	file will con-
	      tain one series for each of the four exponential	tail  fits  in
	      the  following  order: glocal CYK, glocal	Inside,	local CYK, and
	      local Inside.

       --sfile _f_
	      Save survival plot information to	file _f_.  The format of  this
	      file  is	two space delimited columns per	line. The first	column
	      is the x-axis values of bit scores of each bin. The second  col-
	      umn is the y-axis	values of fraction of hits that	meet or	exceed
	      the score	for each bin. Each series is delimited by a line  with
	      a	 single	 character "&".	 The file will contain three series of
	      data for each of the four	CM search modes	in the	following  or-
	      der:  glocal  CYK,  glocal  Inside, local	CYK, and local Inside.
	      The first	series is the empirical	survival plot  from  the  his-
	      togram  of hits to the random sequence. The second series	is the
	      exponential tail fit to the empirical  distribution.  The	 third
              series is the exponential tail fit if lambda were fixed and set
              as the natural log of 2 (0.69314718).

       --qqfile	_f_
	      Save quantile-quantile plot information to file _f_.  The	format
	      of  this file is two space delimited columns per line. The first
	      column is	the x-axis values, and the second column is the	y-axis
              values. The distance of the points from the identity line (y=x)
              is a measure of how good the exponential tail fit is: the closer
              the points are to the identity line, the better the fit.
	      Each series is delimited by a line with a	single character  "&".
	      The  file	 will contain one series of empirical data for each of
	      the four exponential tail	fits in	the  following	order:	glocal
	      CYK, glocal Inside, local	CYK and	local Inside.

       --ffile _f_
	      Save  space  delimited  statistics of different exponential tail
	      fits to file _f_.	 The file will contain the lambda and mu  val-
	      ues  for	exponential  tails fit to histogram tails of different
	      sizes. The fields	in the file are	labelled informatively.

       --xfile _f_
	      Save a list of the scores	in each	fit  histogram	tail  to  file
	      _f_.   Each  line	of this	file will have a different score indi-
	      cating one hit existed in	the tail with that score.  Each	series
	      is  delimited  by	 a  line with a	single character "&". The file
	      will contain one series for each of the  four  exponential  tail
	      fits  in	the  following order: glocal CYK, glocal Inside, local
	      CYK, and local Inside.
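       The two-column output files (--hfile, --sfile, --qqfile) share the
       same layout, so one small reader can handle them. This is a sketch
       under the format described above. Whether the final series is
       followed by a closing "&" line is not stated here, so the reader
       accepts either case. The function names are hypothetical.

```python
MODES = ["glocal CYK", "glocal Inside", "local CYK", "local Inside"]


def read_series(lines):
    """Split two-column lines into series at each line holding only '&'."""
    series, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "&":
            series.append(current)
            current = []
            continue
        x, y = line.split()
        current.append((float(x), float(y)))
    if current:  # tolerate a file without a trailing delimiter
        series.append(current)
    return series


def label_by_mode(series):
    """Attach search-mode names to a four-series file such as --hfile output."""
    if len(series) != len(MODES):
        raise ValueError("expected one series per search mode")
    return dict(zip(MODES, series))
```

       For --sfile output, which holds three series per search mode, the
       list returned by read_series() would need to be grouped in threes
       before labelling.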

OTHER OPTIONS
       --seed _n_
	      Seed the random number generator with _n_, an integer >= 0.   If
	      _n_ is nonzero, stochastic simulations will be reproducible; the
	      same command will	give the same results.	If _n_ is 0, the  ran-
	      dom number generator is seeded arbitrarily, and stochastic simu-
	      lations will vary	from run to run	of the same command.  The  de-
	      fault seed is 181.

       --beta _x_
              By default, query-dependent banding (QDB) is used to accelerate
              the CM search algorithms with a beta tail loss probability of
              1E-15. This beta value can be changed to _x_ with --beta _x_.
              The beta parameter is the amount of probability mass excluded
              during band calculation; higher values of beta give greater
              speedups but sacrifice more accuracy than lower values. (For
              more information on QDB see Nawrocki and Eddy, PLoS
              Computational Biology 3(3): e56.)

       --nonbanded
	      Turn off QDB during E-value calibration.	This  will  slow  down
	      calibration.

       --nonull3
	      Turn  off	 the null3 post	hoc additional null model. This	is not
	      recommended unless you plan on using the same option to cmsearch
	      and/or cmscan.

       --random
	      Use  the	background null	model of the CM	to generate the	random
	      sequences, instead of the	more realistic HMM. Unless the CM  was
	      built  using  the	 --null	option to cmbuild, the background null
	      model will be 25%	each A,	C, G and U.

       --gc _f_
	      Generate the random sequences using the nucleotide  distribution
	      from the sequence	file _f_.

       --cpu _n_
	      Specify  that _n_	parallel CPU workers be	used. If _n_ is	set as
	      "0", then	the program will be run	in serial mode,	without	 using
	      threads.	 You  can also control this number by setting an envi-
	      ronment variable,	 INFERNAL_NCPU.	  This	option	will  only  be
	      available	 if the	machine	on which Infernal was built is capable
	      of using POSIX threading (see the	Installation  section  of  the
	      user guide for more information).

       --mpi  Run as an MPI parallel program. This option will only be
              available if Infernal has been configured and built with the
              "--enable-mpi" flag (see the Installation section of the user
              guide for more information).
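       The --seed convention (nonzero means reproducible, 0 means arbitrary)
       and the uniform background used by --random can be mimicked in a few
       lines. This is only an analogy written with Python's standard random
       module. It does not reproduce Infernal's random number generator or
       its sequences, and the function names are hypothetical.

```python
import random


def make_rng(seed=181):
    """Nonzero seed: reproducible stream. Seed 0: arbitrarily seeded."""
    if seed < 0:
        raise ValueError("seed must be an integer >= 0")
    return random.Random() if seed == 0 else random.Random(seed)


def random_rna(rng, length):
    """Draw an i.i.d. sequence with 25% each of A, C, G and U."""
    return "".join(rng.choice("ACGU") for _ in range(length))


first = random_rna(make_rng(181), 20)
second = random_rna(make_rng(181), 20)
print(first == second)  # True: the same nonzero seed gives the same sequence
```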

SEE ALSO
       See infernal(1) for a master man	page with a list of all	the individual
       man pages for programs in the Infernal package.

       For complete documentation, see the user guide that came with your
       Infernal distribution (Userguide.pdf); or see the Infernal web page.

COPYRIGHT
       Copyright (C) 2019 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your Infernal source distribution, or see the
       Infernal web page.

AUTHOR
       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147	USA
       http://eddylab.org

Infernal 1.1.3			   Nov 2019			cmcalibrate(1)
