FreeBSD Manual Pages

home | help
cmsearch(1)			Infernal Manual			   cmsearch(1)

NAME
       cmsearch	- search covariance model(s) against a sequence	database

SYNOPSIS
       cmsearch	[options] <cmfile> <seqdb>

DESCRIPTION
       cmsearch	 is used to search one or more covariance models (CMs) against
       a sequence database.  For each CM in <cmfile>, use  that	 query	CM  to
       search  the  target database of sequences in <seqdb>, and output	ranked
       lists of	the sequences with the most significant	matches	to the CM.  To
       build CMs from multiple alignments, see cmbuild.

       The query <cmfile> must have been calibrated for	E-values with  cmcali-
       brate.	As  a special exception, any models in <cmfile>	that have zero
       basepairs need not be calibrated. For these models, profile HMM	search
       algorithms will be used instead of CM ones, as discussed	further	below.

       The  query  <cmfile>  may  be '-' (a dash character), in	which case the
       query CM	input will be read from	a <stdin> pipe instead of from a file.
       The <seqdb> may not be '-' because the current implementation needs  to
       be able to rewind the database, which is	not possible with stdin	input.

       The output format is designed to	be human-readable, but is often	so vo-
       luminous	 that reading it is impractical, and parsing it	is a pain. The
       --tblout	option saves output in a simple	tabular	format that is concise
       and easier to parse.  The -o option allows redirecting the main output,
       including throwing it away in /dev/null.

       cmsearch	reexamines the 5' and 3' termini  of  target  sequences	 using
       specialized  algorithms	for detection of truncated hits, in which part
       of the 5' and/or	3' end of the actual full length  homologous  sequence
       is  missing  in	the  target sequence file. These types of hits will be
       most common in sequence	files  consisting  of  unassembled  sequencing
       reads.  By  default,  any  5'  truncated	hit is required	to include the
       first residue of	the target sequence it derives from  in	 <seqdb>,  and
       any  3'	truncated  hit is required to include the final	residue	of the
       target sequence it derives from.	Any 5' and 3' truncated	hit  must  in-
       clude  the  first  and  final residue of	the target sequence it derives
       from. The --anytrunc option will	relax the requirements for hit	inclu-
       sion of sequence	endpoints, and truncated hits are allowed to start and
       stop  at	 any  positions	 of target sequences. Importantly though, with
       --anytrunc, hit E-values	will be	less accurate because  model  calibra-
       tion  does  not	consider  the possibility of truncated hits, so	use it
       with caution.  The --notrunc option can be used to turn	off  truncated
       hit  detection.	 --notrunc  will  reduce the running time of cmsearch,
       most significantly for target <seqdb> files that	include	many short se-
       quences.

       Truncated hit detection is automatically	turned	off  when  the	--max,
       --nohmm,	 --qdb,	 or  --nonbanded options are used because it relies on
       the use of an accelerated HMM banded alignment strategy that is	turned
       off by any of those options.

OPTIONS
       -h     Help;  print  a  brief  reminder	of  command line usage and all
	      available	options.

       -g     Turn on the glocal alignment algorithm, global with  respect  to
	      the  query  model	and local with respect to the target database.
	      By default, the local alignment algorithm	is used	which is local
	      with respect to both the target sequence and the model. In local
	      mode, the	alignment to span two or more subsequences  if	neces-
	      sary  (e.g.  if the structures of	the query model	and target se-
	      quence are only partially	shared), allowing certain large	inser-
	      tions and	deletions in the structure to be penalized differently
	      than normal indels. Local	 mode  performs	 better	 on  empirical
	      benchmarks and is	significantly more sensitive for remote	homol-
	      ogy  detection.  Empirically,  glocal searches return many fewer
	      hits than	local searches,	so glocal may be desired for some  ap-
	      plications.  With	 -g, all models	must be	calibrated, even those
	      with zero	basepairs.

       -Z <x> Calculate	E-values as if the search space	size was <x> megabases
	      (Mb). Without the	use of this option, the	search space  size  is
	      defined  as  the total number of nucleotides in <seqdb> times 2,
	      because both strands of each target sequence will	be searched.

       --devhelp
	      Print help, as with -h , but also	include	 expert	 options  that
	      are  not	displayed  with	-h .  These expert options are not ex-
	      pected to	be relevant for	the vast majority of users and so  are
	      not described in the manual page.	 The only resources for	under-
	      standing	what  they actually do are the brief one-line descrip-
	      tions output when	--devhelp is enabled, and the source code.

OPTIONS	FOR CONTROLLING	OUTPUT
       -o <f> Direct the main human-readable output to a file <f>  instead  of
	      the default stdout.

       -A <f> Save  a multiple alignment of all	significant hits (those	satis-
	      fying inclusion thresholds) to the file <f>.

       --tblout	<f>
	      Save a simple tabular  (space-delimited)	file  summarizing  the
	      hits  found, with	one data line per hit. The format of this file
	      is described in section 6	of the Infernal	user guide.

       --acc  Use accessions instead of	names in the main output, where	avail-
	      able for profiles	and/or sequences.

       --noali
	      Omit the alignment  section  from	 the  main  output.  This  can
	      greatly reduce the output	volume.

       --notextw
	      Unlimit  the length of each line in the main output. The default
	      is a limit of 120	characters per line, which helps in displaying
	      the output cleanly on terminals and in editors, but can truncate
	      target profile description lines.

       --textw <n>
	      Set the main output's line length	limit to  <n>  characters  per
	      line. The	default	is 120.

       --verbose
	      Include extra search pipeline statistics in the main output, in-
	      cluding  filter  survival	statistics for truncated hit detection
	      and number of envelopes discarded	due to matrix size overflows.

       --nomiss
	      With -A, in the output alignment,	do not mark truncated hits  as
	      fragmentary  by  using  missing  data  '~' characters before the
	      first residue (for 5' truncated hits)  and/or  after  the	 final
	      residue  (for 3' truncated hits).	By default, missing data char-
	      acters ('~') will	be used	in this	way to mark truncated hits  as
	      fragments.  The  cmbuild	program	can use	these fragment defini-
	      tions in input alignment files  if  the  --fraggiven  option  is
	      used.

OPTIONS	CONTROLLING REPORTING THRESHOLDS
       Reporting  thresholds  control  which hits are reported in output files
       (the main output	and --tblout) Hits are ranked by statistical  signifi-
       cance  (E-value).   By  default,	all hits with an E-value <= 10 are re-
       ported.	The following options allow you	to change the default  E-value
       reporting thresholds, or	to use bit score thresholds instead.

       -E <x> In  the  per-target  output,  report target sequences with an E-
	      value of <= <x>.	The default is 10.0, meaning that on  average,
	      about  10	false positives	will be	reported per query, so you can
	      see the top of the noise and decide for yourself if it's	really
	      noise.

       -T <x> Instead  of thresholding per-CM output on	E-value, report	target
	      sequences	with a bit score of >= <x>.

OPTIONS	FOR INCLUSION THRESHOLDS
       Inclusion thresholds are	stricter than reporting	thresholds.  Inclusion
       thresholds control which	hits are considered to be reliable  enough  to
       be  included  in	an output alignment or in a possible subsequent	search
       round, or marked	as significant ("!") as	opposed	to questionable	 ("?")
       in hit output.

       --incE <x>
	      Use  an  E-value	of <= <x> as the hit inclusion threshold.  The
	      default is 0.01, meaning that on average,	about 1	false positive
	      would be expected	in every 100 searches with different query se-
	      quences.

       --incT <x>
	      Instead of using E-values	for setting the	 inclusion  threshold,
	      instead  use  a bit score	of >= <x> as the hit inclusion thresh-
	      old.  By default this option is unset.

OPTIONS	FOR MODEL-SPECIFIC SCORE THRESHOLDING
       Curated CM databases may	define specific	bit score thresholds for  each
       CM,  superseding	 any  thresholding  based  on statistical significance
       alone.

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or NC) optional score threshold annotation; this is	picked	up  by
       cmbuild from Stockholm format alignment files. Each thresholding	option
       has  a score of <x> bits, and acts as if	-T <x> --incT <x> has been ap-
       plied specifically using	each model's curated thresholds.

       --cut_ga
	      Use the GA (gathering) bit scores	in the model to	 set  hit  re-
	      porting  and  inclusion  thresholds. GA thresholds are generally
	      considered to be the reliable curated thresholds defining	family
	      membership; for example, in Rfam,	these thresholds  define  what
	      gets  included  in  Rfam	Full alignments	based on searches with
	      Rfam Seed	models.

       --cut_nc
	      Use the NC (noise	cutoff)	bit score thresholds in	the  model  to
	      set  hit	reporting  and inclusion thresholds. NC	thresholds are
	      generally	considered to be  the  score  of  the  highest-scoring
	      known false positive.

       --cut_tc
	      Use the TC (trusted cutoff) bit score thresholds in the model to
	      set  hit	reporting  and inclusion thresholds. TC	thresholds are
	      generally	considered to be the score of the lowest-scoring known
	      true positive that is above all known false positives.

OPTIONS	CONTROLLING THE	ACCELERATION PIPELINE
       Infernal	1.1 searches are accelerated in	a six-stage  filter  pipeline.
       The  first  five	 stages	use a profile HMM to define envelopes that are
       passed to the stage six CM CYK filter. Any envelopes that  survive  all
       filters	are  assigned  final scores using the the CM Inside algorithm.
       (See the	user guide for more information.)

       The profile HMM filter is built by the cmbuild program and is stored in
       <cmfile>.

       Each successive filter is slower	than the previous one, but better than
       it at disciminating between subsequences	that may contain  high-scoring
       CM  hits	 and  those that do not. The first three HMM filter stages are
       the same	as those used in HMMER3.  Stage	1 (F1) is the  local  HMM  SSV
       filter  modified	 for  long  sequences.	Stage  2 (F2) is the local HMM
       Viterbi filter. Stage 3 (F3) is the local HMM Forward filter.  Each  of
       the first three stages uses the profile HMM in local mode, which	allows
       a target	subsequence to align to	any region of the HMM. Stage 4 (F4) is
       a  glocal  HMM  filter, which requires a	target subsequence to align to
       the full-length profile HMM. Stage 5 (F5) is the	 glocal	 HMM  envelope
       definition filter, which	uses HMMER3's domain identification heursitics
       to define envelope boundaries. After each stage from 2 to 5 a bias fil-
       ter  step (F2b, F3b, F4b, and F5b) is used to remove sequences that ap-
       pear to have passed the filter due to biased composition	alone. Any en-
       velopes that survive stages F1 through F5b are then passed with the lo-
       cal CM CYK filter. The CYK filter uses constraints (bands) derived from
       an HMM alignment	of the envelope	to reduce the number of	required  cal-
       culations  and  save time.  Any envelopes that pass CYK are scored with
       the local CM Inside algorithm, again using HMM bands for	acceleration.

       The default filter thresholds that define the  minimum  score  required
       for  a  subsequence to survive each stage are defined based on the size
       of the database in <seqdb> (or the size <x> in megabases	(Mb) specified
       by the -Z <x> or	--FZ <x> options).  For	larger databases, the  filters
       are  more strict	leading	to more	acceleration but potentially a greater
       loss of sensitivity. The	rationale is that for larger  databases,  hits
       must  have  higher  scores  to  achieve	statistical  significance,  so
       stricter	filtering that removes lower scoring insignificant hits	is ac-
       ceptable.

       The P-value thresholds for all possible search space sizes and all fil-
       ter stages are listed next. (A P-value threshold	 of  0.01  means  that
       roughly	1%  of	the  highest scoring nonhomologous subsequence are ex-
       pected to pass the filter.) Z is	defined	as the number  of  nucleotides
       in  the complete	target sequence	file times 2 because both strands will
       be searched with	each model.

       If Z is less than 2 Mb: F1 is 0.35; F2 and F2b are off;	F3,  F3b,  F4,
       F4b and F5 are 0.02; F6 is 0.0001.

       If  Z  is  between  2 Mb	and 20 Mb: F1 is 0.35; F2 and F2b are off; F3,
       F3b, F4,	F4b and	F5 are 0.005; F6 is 0.0001.

       If Z is between 20 Mb and 200 Mb: F1 is 0.35; F2	and F2b	are 0.15;  F3,
       F3b, F4,	F4b and	F5 are 0.003; F6 is 0.0001.

       If  Z  is between 200 Mb	and 2 Gb: F1 is	0.15; F2 and F2b are 0.15; F3,
       F3b, F4,	F4b, F5, and F5b are 0.0008; and F6 is 0.0001.

       If Z is between 2 Gb and	20 Gb: F1 is 0.15; F2 and F2b  are  0.15;  F3,
       F3b, F4,	F4b, F5, and F5b are 0.0002; and F6 is 0.0001.

       If  Z is	more than 20 Gb: F1 is 0.06; F2	and F2b	are 0.02; F3, F3b, F4,
       F4b, F5,	and F5b	are 0.0002; and	F6 is 0.0001.

       These thresholds	were chosen based on performance on an internal	bench-
       mark testing many different possible settings.

       There are five options for controlling  the  general  filtering	level.
       These  options are, in order from least strict (slowest but most	sensi-
       tive) to	most strict (fastest but  least	 sensitive):  --max,  --nohmm,
       --mid,  --default,  (this  is  the default setting), --rfam.  and --hm-
       monly.  With --default the filter thresholds will be database-size  de-
       pendent.	 See the explanation of	each of	these individual options below
       for more	information.

       Additionally, an	expert user can	precisely control  each	 filter	 stage
       score  threshold	 with the --F1,	--F1b, --F2, --F2b, --F3, --F3b, --F4,
       --F4b, --F5, --F5b, and --F6 options. As	well as	turn each stage	on  or
       off with	the --noF1, --doF1b, --noF2, --noF2b, --noF3, --noF3b, --noF4,
       --noF4b,	 --noF5,  and  --noF6.	 options.  These options are only dis-
       played if the --devhelp option is used to keep the number of  displayed
       options	with  -h  reasonable, and because they are only	expected to be
       useful to a small minority of users.

       As a special case, for any models in <cmfile>  which  have  zero	 base-
       pairs,  profile	HMM searches are run instead of	CM searches. HMM algo-
       rithms are more efficient than CM algorithms, and the benefit of	CM al-
       gorithms	is lost	for models with	no  secondary  structure  (zero	 base-
       pairs).	These  profile HMM searches will run significantly faster than
       the CM searches.	You can	force HMM-only searches	with the --hmmonly op-
       tion. For more information on HMM-only searches see the description  of
       the --hmmonly option below, and the user	guide.

       --max  Turn  off	 all filters, and run non-banded Inside	on every full-
	      length target sequence. This increases sensitivity somewhat,  at
	      an extremely large cost in speed.

       --nohmm
	      Turn off all HMM filter stages (F1 through F5b). The CYK filter,
	      using QDBs, will be run on every full-length target sequence and
	      will  enforce  a	P-value	 threshold of 0.0001. Each subsequence
	      that survives CYK	will be	passed to Inside, which	will also  use
	      QDBs (but	a looser set). This increases sensitivity somewhat, at
	      a	very large cost	in speed.

       --mid  Turn off the HMM SSV and Viterbi filter stages (F1 through F2b).
	      Set  remaining HMM filter	thresholds (F3 through F5b) to 0.02 by
	      default, but changeable to <x> with --Fmid  <x>  sequence.  This
	      may increase sensitivity,	at a significant cost in speed.

       --default
	      Use  the	default	 filtering  strategy. This option is on	by de-
	      fault. The filter	thresholds are determined based	on  the	 data-
	      base size.

       --rfam Use  a  strict  filtering	 strategy  devised for large databases
	      (more than 20 Gb). This will accelerate the search at  a	poten-
	      tial cost	to sensitivity.	It will	have no	effect if the database
	      is larger	than 20	Gb.

       --hmmonly
	      Only use the filter profile HMM for searches, do not use the CM.
	      Only  filter stages F1 through F3	will be	executed, using	strict
	      P-value thresholds (0.02 for F1, 0.001 for F2  and  0.00001  for
	      F3).   Additionally  a bias composition filter is	used after the
	      F1 stage (with P=0.02 survival threshold).  Any  hit  that  sur-
	      vives  all  stages and has an HMM	E-value	or bit score above the
	      reporting	threshold will be output.  The	user  can  change  the
	      HMM-only	filter	thresholds  and	options	with --hmmF1, --hmmF2,
	      --hmmF3, --hmmnobias, --hmmnonull2, and --hmmmax.	  By  default,
	      searches	for  any model with zero basepairs will	be run in HMM-
	      only mode. This can be turned off, forcing CM searches for these
	      models with the --nohmmonly option.  These options are only dis-
	      played if	the --devhelp option is	used.

       --FZ <x>
	      Set filter thresholds as the defaults used if the	database  were
	      <x>  megabases (Mb). If used with	<x> greater than 20000 (20 Gb)
	      this option has the same effect as --rfam.

       --Fmid <x>
	      With the --mid option set	the HMM	filter thresholds (F3  through
	      F5b) to <x>.  By default,	<x> is 0.02.

OTHER OPTIONS
       --notrunc
	      Turn off truncated hit detection.

       --anytrunc
	      Allow  truncated hits to begin and end at	any position in	a tar-
	      get sequence. By default,	5' truncated  hits  must  include  the
	      first  residue  of  their	 target	sequence and 3'	truncated hits
	      must include the final residue of	their  target  sequence.  With
	      this  option  you	may observe fewer full length hits that	extend
	      to the beginning and end of the query CM.	As of  version	1.1.5,
	      truncated	 hits that end at sequence terminii with a lower score
	      penalty than  internally	truncated  hits	 are  also  considered
	      (these were not considered in 1.1x versions prior	to 1.1.5).  To
	      reproduce	 the  behavior	of  this  option  from v1.1.4, use the
	      --inttrunc option	instead.

       --nonull3
	      Turn off the null3 CM score corrections for biased  composition.
	      This correction is not used during the HMM filter	stages.

       --mxsize	<x>
	      Set the maximum allowable	CM DP matrix size to <x> megabytes. By
	      default  this  size  is 128 Mb.  This should be large enough for
	      the vast majority	of searches, especially	with  smaller  models.
	      If  cmsearch  encounters	an envelope in the CYK or Inside stage
	      that requires a larger matrix, the envelope will	be  discounted
	      from  consideration.  This behavior is like an additional	filter
	      that prevents expensive (slow) CM	DP calculations, but at	a  po-
	      tential cost to sensitivity.  Note that if cmsearch is being run
	      in  <n> multiple threads on a multicore machine then each	thread
	      may have an allocated matrix of up to size <x> Mb	at  any	 given
	      time.

       --smxsize <x>
	      Set  the	maximum	 allowable  CM	search	DP  matrix size	to <x>
	      megabytes. By default this size is 128 Mb.  This option is  only
	      relevant if the CM will not use HMM banded matrices, i.e.	if the
	      --max,  --nohmm, --qdb, --fqdb, --nonbanded, or --fnonbanded op-
	      tions are	also used. Note	that if	cmsearch is being run  in  <n>
	      multiple	threads	 on  a	multicore machine then each thread may
	      have an allocated	matrix of up to	size <x> Mb at any given time.

       --cyk  Use the CYK algorithm, not Inside, to determine the final	 score
	      of all hits.

       --acyk Use   the	  CYK	algorithm  to  align  hits.  By	 default,  the
	      Durbin/Holmes optimal accuracy algorithm is  used,  which	 finds
	      the  alignment  that  maximizes  the  expected  accuracy	of all
	      aligned residues.

       --wcx <x>
	      For each CM, set the W parameter,	the expected maximum length of
	      a	hit, to	<x> times the consensus	length of the  model.  By  de-
	      fault,  the  W parameter is read from the	CM file	and was	calcu-
	      lated based on the transition probabilities of the model by  cm-
	      build.  You can find out what the	default	W is for a model using
	      cmstat.	This  option should be used with caution as it impacts
	      the filtering pipeline at	several	different stages in nonobvious
	      ways. It is only recommended for expert users searching for hits
	      that are much longer than	any of the homologs used to build  the
	      model  in	 cmbuild,  e.g.	ones with large	introns	or other large
	      insertions.  This	option cannot be used in combination with  the
	      --nohmm,	--fqdb	or  --qdb  options because in those cases W is
	      limited by query-dependent bands.

       --toponly
	      Only search the top (Watson) strand of target sequences in  <se-
	      qdb>.   By  default,  both strands are searched. This will halve
	      the database size	(Z).

       --bottomonly
	      Only search the bottom (Crick) strand  of	 target	 sequences  in
	      <seqdb>.	 By  default,  both  strands  are searched.  This will
	      halve the	database size (Z).

       --tformat <s>
	      Assert that the target sequence database file is in format  <s>.
	      Accepted	formats	include	fasta, embl, genbank, ddbj, stockholm,
	      pfam, a2m, afa, clustal, and phylip The default is to autodetect
	      the format of the	file.

       --cpu <n>
	      Set the number of	parallel worker	threads	to <n>.	 On  multicore
	      machines,	the default is 4.  You can also	control	this number by
	      setting an environment variable, INFERNAL_NCPU.  There is	also a
	      master  thread,  so  the	actual number of threads that Infernal
	      spawns is	<n>+1.	This option is not available if	 Infernal  was
	      compiled with POSIX threads support turned off.

       --stall
	      For  debugging the MPI master/worker version: pause after	start,
	      to enable	the developer to attach	debuggers to the running  mas-
	      ter  and worker(s) processes. Send SIGCONT signal	to release the
	      pause.  (Under gdb: (gdb)	signal SIGCONT)	(Only available	if op-
	      tional MPI support was enabled at	compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun.  To use --mpi,  the
	      sequence	file  must  have  first	 been 'indexed'	using the esl-
	      sfetch  program,	which  is  included  with  Infernal,  in   the
	      easel/miniapps/  subdirectory.   (Only available if optional MPI
	      support was enabled at compile-time.)

SEE ALSO
       See infernal(1) for a master man	page with a list of all	the individual
       man pages for programs in the Infernal package.

       For complete documentation, see the user	guide that came	with your  In-
       fernal  distribution  (Userguide.pdf);  or  see	the  Infernal web page
       (http://eddylab.org/infernal/).

COPYRIGHT
       Copyright (C) 2023 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For additional information on copyright and  licensing,	see  the  file
       called  COPYRIGHT  in your Infernal source distribution,	or see the In-
       fernal web page (http://eddylab.org/infernal/).

AUTHOR
       http://eddylab.org

Infernal 1.1.5			   Sep 2023			   cmsearch(1)
Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=cmsearch&sektion=1&manpath=FreeBSD+Ports+15.0>
home | help
Header And Logo

Peripheral Links

Site Navigation

FreeBSD Manual Pages

Header And Logo

Peripheral Links

Search

Site Navigation

FreeBSD Manual Pages