FreeBSD Manual Pages

cmcalibrate(1)			Infernal Manual			cmcalibrate(1)

NAME
       cmcalibrate - fit exponential tails for covariance model	E-value	deter-
       mination

SYNOPSIS
       cmcalibrate [options] cmfile

DESCRIPTION
       cmcalibrate determines exponential tail parameters for E-value determi-
       nation  by  generating random sequences,	searching them with the	CM and
       collecting the scores of	the resulting hits. A  histogram  of  the  bit
       scores of the hits is fit to an exponential tail, and the parameters of
       the  fitted tail	are saved to the CM file. The exponential tail parame-
       ters are	then used to estimate the  statistical	significance  of  hits
       found in	cmsearch and cmscan.
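
       As an illustration of the idea only, the following Python sketch fits
       an exponential tail to the highest scores in a list and evaluates the
       fitted tail probability. It assumes a plain maximum-likelihood fit
       with mu fixed at the lowest score in the tail; the function names are
       hypothetical and this is not necessarily the procedure cmcalibrate
       uses internally.

```python
import math


def fit_exponential_tail(scores, tail_n):
    """Fit P(S >= x) = exp(-lam * (x - mu)) to the tail_n highest scores.

    Assumption: mu is fixed at the lowest score in the tail and lam is the
    maximum-likelihood rate for the excesses over mu. This is a sketch, not
    cmcalibrate's own fitting code.
    """
    tail = sorted(scores, reverse=True)[:tail_n]
    mu = tail[-1]                          # location: start of the tail
    excess = [s - mu for s in tail]
    mean_excess = sum(excess) / len(excess)
    if mean_excess <= 0.0:
        raise ValueError("tail scores are all identical; cannot fit")
    lam = 1.0 / mean_excess                # MLE of an exponential rate
    return lam, mu


def tail_probability(score, lam, mu):
    """Fitted probability that a tail score meets or exceeds `score`."""
    return min(1.0, math.exp(-lam * (score - mu)))
```

       A steeper tail (larger lambda) makes a given high bit score less
       likely to arise by chance.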

       A  CM file must be calibrated with cmcalibrate before it	can be used in
       cmsearch	or cmscan, with	a single exception: it	is  not	 necessary  to
       calibrate  CM files that	include	only models with zero basepairs	before
       running cmsearch.

       cmcalibrate is very slow. It takes a couple of hours to calibrate a
       single average-sized CM on a single CPU. By default, cmcalibrate runs
       in parallel on four cores if Infernal was built on a system that
       supports POSIX threading (see the Installation section of the user
       guide for more information) and that system has at least four cores.
       Using <n> cores gives roughly <n>-fold acceleration versus a single
       CPU. You can set the number of cores to use with the --cpu <n>
       option. MPI (Message Passing Interface) can also be used for
       parallelization with the --mpi option if Infernal was built with MPI
       enabled, but using more than 161 processors is not recommended because
       going past 161 won't accelerate the calibration further. See the
       Installation section of the user guide for more information.
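
       The scaling rules stated in this manual page can be combined into a
       back-of-the-envelope estimate. The sketch below is an assumption-laden
       approximation (ideal <n>-fold speedup, running time linear in the -L
       value, no MPI gain past 161 processors); the function name and the
       single-CPU timing you pass in are hypothetical. Use --forecast for a
       prediction from the program itself.

```python
def estimate_hours(single_cpu_hours, cores=4, length_mb=1.6, mpi=False):
    """Rough wall-clock estimate for one calibration.

    single_cpu_hours: measured or guessed time on one CPU at the default
                      -L value of 1.6 Mb (an input, not a known constant).
    Assumes ideal n-fold speedup and running time proportional to -L.
    """
    if cores < 1 or length_mb <= 0:
        raise ValueError("cores must be >= 1 and length_mb must be > 0")
    # Per the text above, MPI runs do not get faster past 161 processors.
    effective = min(cores, 161) if mpi else cores
    return single_cpu_hours * (length_mb / 1.6) / effective
```

       For example, a model that needs 2 hours on one CPU would be expected
       to take about half an hour on the default four threads.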

       The --forecast option can be used to estimate how long the program will
       take to run for a given cmfile on the current machine. To predict the
       running time on <n> processors with MPI, additionally use the
       --nforecast <n> option.
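
       A minimal sketch of assembling such a forecast command from Python,
       using only the options described above. The helper name and the file
       name my.cm are hypothetical; the argument list follows the SYNOPSIS
       order (options first, then cmfile).

```python
import shlex


def forecast_argv(cmfile, nproc=None, extra_options=()):
    """Build the argument list for a cmcalibrate running-time forecast."""
    argv = ["cmcalibrate", "--forecast"]
    if nproc is not None:
        # Predict for nproc processors, e.g. for a planned MPI run.
        argv += ["--nforecast", str(nproc)]
    argv += list(extra_options)        # options the real run would use
    argv.append(cmfile)
    return argv


# Show the command line that would be run for a 16-processor forecast.
print(shlex.join(forecast_argv("my.cm", nproc=16)))
# prints: cmcalibrate --forecast --nforecast 16 my.cm
```

       The list can be passed to subprocess.run() on a machine where
       cmcalibrate is installed.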

       Some  large models require a lot	of memory to calibrate.	You can	deter-
       mine how	much memory is required	with the --memreq  option.  For	 these
       models, you may be limited by the available RAM on your system. Another
       strategy	for parallelization that can be	useful when a lot of memory is
       required	 per core is to	split the calibration into <n> separate	compu-
       tations or partitions, each of which can	be performed  separately,  po-
       tentially in parallel if	you have access	to a computer cluster. The re-
       sults  from  each computation can then be merged	together for the final
       calibration. To do this, first run cmcalibrate with the --split, --ptot
       <n> and --cfile <f> options, which will save the <n> separate partition
       commands into the file <f>. After all of these commands have been
       executed, you can then combine the results and create a calibrated
       model file by calling cmcalibrate again with the --merge and --ptot <n>
       options. See the "Parallelizing calibration of large models by
       splitting into partitions" subsection of the tutorial in the user's
       guide for more information.
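
       The three stages (split, run the partitions, merge) can be sketched
       as follows. The helper names and file names are hypothetical, and the
       reader assumes the --cfile file holds one shell command per line,
       which this manual page does not specify. The authoritative merge
       command is the one printed by the --split run itself.

```python
import shlex


def split_argv(cmfile, n_parts, commands_file):
    """Stage 1: prepare n_parts partitions; commands go to commands_file."""
    return ["cmcalibrate", "--split", "--ptot", str(n_parts),
            "--cfile", commands_file, cmfile]


def read_partition_commands(text):
    """Stage 2 helper: turn the commands file into argv lists.

    Assumption: one command per line; blank lines and '#' comments skipped.
    """
    commands = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            commands.append(shlex.split(line))
    return commands


def merge_argv(cmfile, n_parts):
    """Stage 3: merge the partition scores and write the calibrated model."""
    return ["cmcalibrate", "--merge", "--ptot", str(n_parts), cmfile]
```

       Each argv list from stage 2 can be submitted as a separate cluster
       job; the merge step should only start once all of them have finished.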

       The  random  sequences  searched	in cmcalibrate are generated by	an HMM
       that was	trained	on real	genomic	sequences with	various	 GC  contents.
       The  goal  is  to  have the GC distributions in the random sequences be
       similar to those	in actual genomic sequences.

       Four rounds of searches and subsequent exponential tail fits  are  per-
       formed,	one each for the four different	CM algorithms that can be used
       in cmsearch and cmscan: glocal CYK, glocal Inside, local	CYK and	 local
       Inside.

       The E-value parameters determined by cmcalibrate are only used by the
       cmsearch and cmscan programs. If you are not going to use these
       programs then do not waste time calibrating your models.

OPTIONS
       -h     Help; print a brief reminder of command line usage and available
	      options.

       -L <x> Set  the	total  length  of  random  sequences  to search	to <x>
	      megabases	(Mb). By default, <x> is 1.6 Mb. Increasing  <x>  will
	      make  the	 exponential  tail fits	more precise and E-values more
	      accurate,	but will take longer (doubling <x> will	roughly	double
	      the running time).  Decreasing <x> is not	recommended as it will
	      make the fits less precise and the E-values less accurate.

OPTIONS	FOR PREDICTING REQUIRED	TIME AND MEMORY
       --forecast
	      Predict the running time of the calibration of cmfile (with pro-
	      vided options) on	the current machine and	exit. The  calibration
	      is  not  performed.   The	predictions should be considered rough
	      estimates. If multithreading is enabled (see  Installation  sec-
	      tion  of user guide), the	timing will take into account the num-
	      ber of available cores.

       --nforecast <n>
	      With --forecast, specify that <n>	processors will	 be  used  for
	      the  calibration.	  This might be	useful for predicting the run-
	      ning time	of an MPI run with <n> processors.

       --memreq
	      Predict the amount of required  memory  for  calibrating	cmfile
	      (with  provided  options)	 on  the current machine and exit. The
	      calibration is not performed.

OPTIONS	CONTROLLING EXPONENTIAL	TAIL FITS
       --gtailn	<x>
              Fit the exponential tail for glocal Inside and glocal CYK to the
              <n> highest scores in the histogram tail, where <n> is <x> times
              the number of Mb searched. The default value of <x> is 250. The
              value 250 was chosen because it works well empirically relative
              to other values.

       --ltailn	<x>
              Fit the exponential tail for local Inside and local CYK to the
              <n> highest scores in the histogram tail, where <n> is <x> times
              the number of Mb searched. The default value of <x> is 750. The
              value 750 was chosen because it works well empirically relative
              to other values.

       --tailp <x>
              Ignore the --gtailn and --ltailn options and instead fit the
              top <x> fraction of the histogram to an exponential tail, for
              all search modes.
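
       As a worked example of the --gtailn and --ltailn arithmetic, the
       sketch below computes how many of the highest scores would be fit for
       a given -L value. The function name is hypothetical, and rounding to
       the nearest integer is an assumption; the manual page does not say
       how a fractional count is handled.

```python
def tail_hit_count(per_mb, length_mb=1.6):
    """Number of top-scoring hits fit: <x> times the Mb searched.

    per_mb    : the --gtailn or --ltailn value (<x>)
    length_mb : the -L value in megabases (default 1.6)
    Rounding to the nearest integer is an assumption of this sketch.
    """
    if per_mb <= 0 or length_mb <= 0:
        raise ValueError("per_mb and length_mb must be positive")
    return round(per_mb * length_mb)


glocal_n = tail_hit_count(250)   # default --gtailn with default -L
local_n = tail_hit_count(750)    # default --ltailn with default -L
print(glocal_n, local_n)         # prints: 400 1200
```

       With the defaults, the glocal fits use the 400 highest scores and the
       local fits use the 1200 highest scores.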

OPTIONAL OUTPUT	FILES
       --hfile <f>
	      Save the histograms fit to file <f>.  The	format of this file is
	      two space	delimited columns per line. The	first column is	the x-
	      axis  values of bit scores of each bin. The second column	is the
	      y-axis values of number of hits per bin. Each series  is	delim-
	      ited  by	a line with a single character "&". The	file will con-
	      tain one series for each of the four exponential	tail  fits  in
	      the  following  order: glocal CYK, glocal	Inside,	local CYK, and
	      local Inside.

       --sfile <f>
	      Save survival plot information to	file <f>.  The format of  this
	      file  is	two space delimited columns per	line. The first	column
	      is the x-axis values of bit scores of each bin. The second  col-
	      umn is the y-axis	values of fraction of hits that	meet or	exceed
	      the  score for each bin. Each series is delimited	by a line with
	      a	single character "&".  The file	will contain three  series  of
	      data  for	 each of the four CM search modes in the following or-
	      der: glocal CYK, glocal Inside, local  CYK,  and	local  Inside.
	      The  first  series  is the empirical survival plot from the his-
	      togram of	hits to	the random sequence. The second	series is  the
	      exponential  tail	 fit  to the empirical distribution. The third
              series is the exponential tail fit if lambda were fixed and set
              to the natural log of 2 (0.693147181).

       --qqfile	<f>
	      Save quantile-quantile plot information to file <f>.  The	format
	      of  this file is two space delimited columns per line. The first
	      column is	the x-axis values, and the second column is the	y-axis
              values. The distance of the points from the identity line (y=x)
              is a measure of how good the exponential tail fit is: the closer
              the points are to the identity line, the better the fit.
	      Each series is delimited by a line with a	single character  "&".
	      The  file	 will contain one series of empirical data for each of
	      the four exponential tail	fits in	the  following	order:	glocal
	      CYK, glocal Inside, local	CYK and	local Inside.

       --ffile <f>
	      Save  space  delimited  statistics of different exponential tail
	      fits to file <f>.	 The file will contain the lambda and mu  val-
	      ues  for	exponential  tails fit to histogram tails of different
	      sizes. The fields	in the file are	labelled informatively.

       --xfile <f>
	      Save a list of the scores	in each	fit  histogram	tail  to  file
	      <f>.   Each  line	of this	file will have a different score indi-
	      cating one hit existed in	the tail with that score.  Each	series
	      is delimited by a	line with a single  character  "&".  The  file
	      will  contain  one  series for each of the four exponential tail
	      fits in the following order: glocal CYK,	glocal	Inside,	 local
	      CYK, and local Inside.
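
       The two-column, "&"-delimited layout shared by the --hfile, --sfile
       and --qqfile outputs can be read with a few lines of Python. This is
       a minimal sketch based only on the format described above; the
       function name is hypothetical and the parser has not been checked
       against real cmcalibrate output.

```python
# Order of the four fits, as documented for these output files.
MODE_ORDER = ["glocal CYK", "glocal Inside", "local CYK", "local Inside"]


def parse_series(text):
    """Split '&'-delimited two-column text into a list of series.

    Each series is a list of (x, y) float pairs. Blank lines are skipped.
    """
    series, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "&":                # end of the current series
            series.append(current)
            current = []
            continue
        x, y = line.split()[:2]
        current.append((float(x), float(y)))
    if current:                        # tolerate a missing final '&'
        series.append(current)
    return series
```

       For an --hfile, the four returned series can be paired with
       MODE_ORDER via zip(). An --sfile holds three series per search mode,
       so its series need to be taken in groups of three.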

OPTIONS CONTROLLING SPLIT, PARTITION AND MERGE MODES
       --split
	      Prepare  a  partitioned  calibration.  This option only works in
	      combination with the --ptot <n> and  --cfile  <f>	 options,  and
	      will  prepare  a calibration split into <n> separate partitions.
              The commands to run all of the partitions will be in the file
              <f>.

       --cfile <f>
              With --split, save the commands for all partitions to file <f>.

       --proot <s>
	      With  --split,  specify  that  the per-partition scores files be
	      named <s>.<n> where <n> is the partition index.  By default they
	      will be named <s>.calib.<n> where	<s> is the name	of the CM file
	      to be calibrated (including path).

       --part <n>
              Specify that this is partition <n> out of <n2> from --ptot <n2>.
              Must be used in combination with --ptot and --pfile.

       --ptot <n>
	      With --split, --part or --merge, specify that there are <n>  to-
	      tal partitions.

       --pfile <f>
              With --part, specify that scores for this partition be saved to
              file <f>.

       --merge
	      Merge  scores  from  multiple previously executed	partitions and
              calibrate CMs. If you used the option --proot <s> with
              cmcalibrate when you ran it with --split to set up the
              partitions, use --proot <s> again with --merge. The full
              cmcalibrate --merge
	      command to use will have been output to standard output when the
	      initial cmcalibrate --split command was executed.

OTHER OPTIONS
       --seed <n>
	      Seed  the	random number generator	with <n>, an integer >=	0.  If
	      <n> is nonzero, stochastic simulations will be reproducible; the
	      same command will	give the same results.	If <n> is 0, the  ran-
	      dom number generator is seeded arbitrarily, and stochastic simu-
	      lations  will vary from run to run of the	same command.  The de-
	      fault seed is 181.

       --beta <x>
              By default, query-dependent banding (QDB) is used to accelerate
              the CM search algorithms with a beta tail loss probability of
              1E-15. This beta value can be changed to <x> with --beta <x>.
              The beta parameter is the amount of probability mass excluded
              during band calculation; higher values of beta give greater
              speedups but sacrifice more accuracy than lower values. (For
              more information on QDB see Nawrocki and Eddy, PLoS
              Computational Biology 3(3): e56.)

       --nonbanded
	      Turn  off	 QDB  during  E-value calibration. This	will slow down
	      calibration.

       --nonull3
	      Turn off the null3 post hoc additional null model. This  is  not
	      recommended unless you plan on using the same option to cmsearch
	      and/or cmscan.

       --random
	      Use  the	background null	model of the CM	to generate the	random
	      sequences, instead of the	more realistic HMM. Unless the CM  was
	      built  using  the	 --null	option to cmbuild, the background null
	      model will be 25%	each A,	C, G and U.

       --gc <f>
	      Generate the random sequences using the nucleotide  distribution
	      from the sequence	file <f>.

       --cpu <n>
	      Set  the number of parallel worker threads to <n>.  On multicore
	      machines,	the default is 4.  You can also	control	this number by
	      setting an environment variable, INFERNAL_NCPU.  There is	also a
	      master thread, so	the actual number  of  threads	that  Infernal
	      spawns  is  <n>+1.  This option is not available if Infernal was
	      compiled with POSIX threads support turned off.

       --mpi  Run as an MPI parallel program. This option will only be
              available if Infernal has been configured and built with the
              "--enable-mpi" flag (see the Installation section of the user
              guide for more information).

SEE ALSO
       See infernal(1) for a master man	page with a list of all	the individual
       man pages for programs in the Infernal package.

       For  complete documentation, see	the user guide that came with your In-
       fernal distribution (Userguide.pdf);  or	 see  the  Infernal  web  page
       (http://eddylab.org/infernal/).

COPYRIGHT
       Copyright (C) 2023 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For  additional	information  on	 copyright and licensing, see the file
       called COPYRIGHT	in your	Infernal source	distribution, or see  the  In-
       fernal web page (http://eddylab.org/infernal/).

AUTHOR
       http://eddylab.org

Infernal 1.1.5			   Sep 2023			cmcalibrate(1)