FreeBSD Manual Pages

cmscan(1)			Infernal Manual			     cmscan(1)

NAME
       cmscan -	search sequence(s) against a covariance	model database

SYNOPSIS
       cmscan [options]	<cmdb> <seqfile>

DESCRIPTION
       cmscan  is  used	 to search sequences against collections of covariance
       models.	For each sequence in <seqfile>,	use  that  query  sequence  to
       search the target database of CMs in <cmdb>, and	output ranked lists of
       the CMs with the	most significant matches to the	sequence.

       The  <seqfile>  may  contain more than one query	sequence. It can be in
       FASTA format, or	several	other common sequence file  formats  (genbank,
       embl, among others), or in alignment file formats (stockholm,
       aligned fasta, and others). See the --qformat  option  for  a  complete
       list.

       The <cmdb> needs	to be press'ed using cmpress before it can be searched
       with  cmscan.  This creates four	binary files, suffixed .i1{fimp}.  Ad-
       ditionally, <cmdb> must have been calibrated for	E-values with  cmcali-
       brate before being press'ed with	cmpress.

       The  query  <seqfile>  may be '-' (a dash character), in	which case the
       query sequences are read	from a <stdin> pipe instead of	from  a	 file.
       The  <cmdb>  cannot  be read from a <stdin> stream, because it needs to
       have those four auxiliary binary	files generated	by cmpress.

       The output format is designed to	be human-readable, but is often	so vo-
       luminous	that reading it	is impractical,	and parsing it is a pain.  The
       --tblout	option saves output in a simple	tabular	format that is concise
       and easier to parse. The	--fmt 2	option modifies	the format of the tab-
       ular  output  by	adding several fields, including markup	of overlapping
       hits, as	described in section 6 of the Infernal user guide.  The	-o op-
       tion allows redirecting the main	output,	including throwing it away  in
       /dev/null.
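
       As a minimal sketch of the output options described above, the
       following Python helper assembles (but does not run) a cmscan command
       line that discards the main output and keeps only the tabular file.
       The helper name and the file names models.cm, queries.fa, and hits.tbl
       are hypothetical; only the -o, --tblout, and --fmt options come from
       this page.

```python
import shlex


def cmscan_argv(cmdb, seqfile, tblout, fmt=2, main_output="/dev/null"):
    """Assemble a cmscan argument list that keeps only the tabular output.

    Hypothetical helper: it only builds the list, it does not run cmscan.
    """
    if fmt not in (1, 2):
        raise ValueError("--fmt accepts only 1 or 2")
    argv = ["cmscan"]
    argv += ["-o", main_output]      # redirect the main human-readable output
    argv += ["--tblout", tblout]     # save the concise tabular summary
    argv += ["--fmt", str(fmt)]      # 2 adds the overlap-related fields
    argv += [cmdb, seqfile]          # positional arguments: <cmdb> <seqfile>
    return argv


if __name__ == "__main__":
    # File names below are placeholders for illustration only.
    print(shlex.join(cmscan_argv("models.cm", "queries.fa", "hits.tbl")))
```

       The resulting list could be handed to subprocess.run() once <cmdb> has
       been calibrated and press'ed as described above.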

       cmscan  reexamines the 5' and 3'	termini	of target sequences using spe-
       cialized	algorithms for detection of truncated hits, in which  part  of
       the  5'	and/or 3' end of the actual full length	homologous sequence is
       missing in the target sequence file. These types	of hits	will  be  most
       common in sequence files	consisting of unassembled sequencing reads. By
       default,	 any 5'	truncated hit is required to include the first residue
       of the target sequence it derives from in <seqfile>, and	any  3'	 trun-
       cated  hit  is  required	to include the final residue of	the target se-
       quence it derives from. Any 5' and 3' truncated hit  must  include  the
       first  and  final  residue  of the target sequence it derives from. The
       --anytrunc option will relax the	requirements for hit inclusion of  se-
       quence  endpoints,  and truncated hits are allowed to start and stop at
       any  positions  of  target   sequences.	  Importantly	though,	  with
       --anytrunc,  hit	 E-values will be less accurate	because	model calibra-
       tion does not consider the possibility of truncated  hits,  so  use  it
       with  caution.	The --notrunc option can be used to turn off truncated
       hit detection.  --notrunc will reduce the running time of cmscan,  most
       significantly  for  target  <seqfile> files that	include	many short se-
       quences.	 Truncated hit detection is automatically turned off when  the
       --max,  --nohmm,	 --qdb,	or --nonbanded options are used	because	it re-
       lies on the use of an accelerated HMM banded alignment strategy that is
       turned off by any of those options.

OPTIONS
       -h     Help; print a brief reminder  of	command	 line  usage  and  all
	      available	options.

       -g     Turn  on	the glocal alignment algorithm,	global with respect to
	      the query	model and local	with respect to	the  target  database.
	      By default, the local alignment algorithm	is used	which is local
	      with respect to both the target sequence and the model. In local
	      mode, the alignment is allowed to span two or more subsequences
	      if necessary (e.g. if the structures of the query model and
	      target sequence are only partially shared), allowing certain
	      large insertions and deletions in the structure to be penalized
	      differently
	      than  normal  indels.  Local  mode  performs better on empirical
	      benchmarks and is	significantly more sensitive for remote	homol-
	      ogy detection. Empirically, glocal searches  return  many	 fewer
	      hits  than local searches, so glocal may be desired for some ap-
	      plications.

       -Z <x> Calculate	E-values as if the search space	size was <x> megabases
	      (Mb). Without the	use of this  option,  the  search  space  size
	      changes for each query sequence; it is defined as the length of
	      the current query	sequence times 2 (because both strands of  the
	      sequence will be searched) times the number of CMs in <cmdb>.
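
	      The default search space size can be worked out by hand. The
	      sketch below is illustrative only: the function name is
	      hypothetical, and it assumes that 1 Mb means 1,000,000
	      nucleotides. The strands argument reflects the statement
	      elsewhere on this page that searching a single strand halves
	      the search space.

```python
def search_space_mb(query_len_nt, num_cms, strands=2):
    """Default search space size Z in megabases (Mb) for one query sequence.

    Follows the definition on this page: query length times the number of
    strands searched (2 by default) times the number of CMs in <cmdb>.
    Assumption: 1 Mb = 1,000,000 nucleotides.
    """
    if strands not in (1, 2):
        raise ValueError("strands must be 1 or 2")
    return query_len_nt * strands * num_cms / 1_000_000


if __name__ == "__main__":
    # Hypothetical numbers: a 5,000 nt query against 4,000 CMs.
    print(search_space_mb(5000, 4000))             # both strands
    print(search_space_mb(5000, 4000, strands=1))  # one strand only
```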

       --devhelp
	      Print help, as with -h, but also include expert options that
	      are not displayed with -h. These expert options are not expected
	      to be relevant for the vast majority of users and so are
	      not described in the manual page.	 The only resources for	under-
	      standing what they actually do are the brief  one-line  descrip-
	      tions output when	--devhelp is enabled, and the source code.

OPTIONS	FOR CONTROLLING	OUTPUT
       -o <f> Direct  the  main	human-readable output to a file	<f> instead of
	      the default stdout.

       --tblout	<f>
	      Save a simple tabular  (space-delimited)	file  summarizing  the
	      hits found, with one data	line per hit.  The format of this file
	      is described in section 6	of the Infernal	user guide.

       --fmt <n>
	      Specify that the tabular output file written with --tblout <f>
	      be in format <n>. Possible values for <n> are 1 or 2. By
	      default, <n> is 1 when --tblout is used without --fmt. With
	      --fmt 2, nine additional fields are added to the tabular output
	      file, most of which pertain to the annotation of overlapping
	      hits. See section 6 of the Infernal user guide for a
	      description of both formats.

       --acc  Use accessions instead of	names in the main output, where	avail-
	      able for profiles	and/or sequences.

       --noali
	      Omit the alignment  section  from	 the  main  output.  This  can
	      greatly reduce the output	volume.

       --notextw
	      Unlimit  the length of each line in the main output. The default
	      is a limit of 120	characters per line, which helps in displaying
	      the output cleanly on terminals and in editors, but can truncate
	      target profile description lines.

       --textw <n>
	      Set the main output's line length	limit to  <n>  characters  per
	      line. The	default	is 120.

       --verbose
	      Include extra search pipeline statistics in the main output, in-
	      cluding  filter  survival	statistics for truncated hit detection
	      and number of envelopes discarded	due to matrix size overflows.

OPTIONS	CONTROLLING REPORTING THRESHOLDS
       Reporting thresholds control which hits are reported in output files
       (the main output and --tblout). Hits are ranked by statistical
       significance (E-value). By default, all hits with an E-value <= 10 are
       reported. The following options allow you to change the default E-value
       reporting thresholds, or to use bit score thresholds instead.

       -E <x> In the per-target	output,	report target  sequences  with	an  E-
	      value  of	<= <x>.	 The default is	10.0, meaning that on average,
	      about 10 false positives will be reported	per query, so you  can
	      see  the top of the noise	and decide for yourself	if it's	really
	      noise.

       -T <x> Instead of thresholding per-CM output on E-value,	report	target
	      sequences	with a bit score of >= <x>.

OPTIONS	FOR INCLUSION THRESHOLDS
       Inclusion thresholds are	stricter than reporting	thresholds.  Inclusion
       thresholds  control  which hits are considered to be reliable enough to
       be included in a	possible subsequent search round, or marked as signif-
       icant ("!") as opposed to questionable ("?") in hit output.

       --incE <x>
	      Use an E-value of	<= <x> as the hit  inclusion  threshold.   The
	      default is 0.01, meaning that on average,	about 1	false positive
	      would be expected	in every 100 searches with different query se-
	      quences.

       --incT <x>
	      Instead of using E-values for setting the inclusion threshold,
	      use a bit score of >= <x> as the hit inclusion threshold. By
	      default this option is unset.
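
	      How the reporting and inclusion thresholds interact can be
	      sketched as follows. This is an illustration of the E-value
	      defaults described above (-E 10.0 and --incE 0.01), not
	      Infernal's actual code; the function name is hypothetical and
	      the sketch assumes the inclusion threshold is no looser than
	      the reporting threshold.

```python
def classify_hit(evalue, report_e=10.0, inc_e=0.01):
    """Return "!" for an included hit, "?" for a reported-only hit,
    or None for a hit that would not be reported at all.

    Defaults mirror the -E and --incE defaults given on this page.
    Assumption: inc_e <= report_e.
    """
    if evalue <= inc_e:
        return "!"   # passes the stricter inclusion threshold
    if evalue <= report_e:
        return "?"   # reported, but marked as questionable
    return None      # above the reporting threshold


if __name__ == "__main__":
    for e in (1e-5, 0.5, 50.0):
        print(e, classify_hit(e))
```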

OPTIONS	FOR MODEL-SPECIFIC SCORE THRESHOLDING
       Curated	CM databases may define	specific bit score thresholds for each
       CM, superseding any  thresholding  based	 on  statistical  significance
       alone.

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or  NC)  optional  score threshold annotation; this is picked up by
       cmbuild from Stockholm format alignment files. Each thresholding	option
       has a score of <x> bits,	and acts as if -T <x> --incT <x> has been  ap-
       plied specifically using	each model's curated thresholds.

       --cut_ga
	      Use  the	GA  (gathering)	bit scores in the model	to set hit re-
	      porting and inclusion thresholds.	GA  thresholds	are  generally
	      considered to be the reliable curated thresholds defining	family
	      membership;  for	example, in Rfam, these	thresholds define what
	      gets included in Rfam Full alignments  based  on	searches  with
	      Rfam Seed	models.

       --cut_nc
	      Use  the	NC (noise cutoff) bit score thresholds in the model to
	      set hit reporting	and inclusion thresholds.  NC  thresholds  are
	      generally	 considered  to	 be  the  score	of the highest-scoring
	      known false positive.

       --cut_tc
	      Use the TC (trusted cutoff) bit score thresholds in the model to
	      set hit reporting	and inclusion thresholds.  TC  thresholds  are
	      generally	considered to be the score of the lowest-scoring known
	      true positive that is above all known false positives.

OPTIONS	CONTROLLING THE	ACCELERATION PIPELINE
       Infernal	 searches  are accelerated in a	six-stage filter pipeline. The
       first five stages use a profile HMM to define envelopes that are	passed
       to the stage six	CM CYK filter. Any envelopes that survive all  filters
       are assigned final scores using the CM Inside algorithm.

       The profile HMM filter is built by the cmbuild program and is stored in
       <cmdb>.

       Each successive filter is slower	than the previous one, but better than
       it at discriminating between subsequences that may contain high-scoring
       CM hits and those that do not. The first	three HMM  filter  stages  are
       the  same  as  those used in HMMER3.  Stage 1 (F1) is the local HMM SSV
       filter modified for long	sequences. Stage  2  (F2)  is  the  local  HMM
       Viterbi	filter.	 Stage 3 (F3) is the local HMM Forward filter. Each of
       the first three stages uses the profile HMM in local mode, which	allows
       a target	subsequence to align to	any region of the HMM. Stage 4 (F4) is
       a glocal	HMM filter, which requires a target subsequence	 to  align  to
       the  full-length	 profile  HMM. Stage 5 (F5) is the glocal HMM envelope
       definition filter, which uses HMMER3's domain identification heuristics
       to define envelope boundaries. After each stage from 2 to 5, a bias
       filter step (F2b, F3b, F4b, and F5b) is used to remove sequences that
       appear to have passed the filter due to biased composition alone. Any
       envelopes that survive stages F1 through F5b are then passed to the
       local CM CYK filter. The CYK filter uses constraints (bands) derived from
       an  HMM alignment of the	envelope to reduce the number of required cal-
       culations and save time.	 Any envelopes that pass CYK are  scored  with
       the local CM Inside algorithm, again using HMM bands for	acceleration.

       The  default  filter  thresholds	that define the	minimum	score required
       for a subsequence to survive each stage are defined based on  the  size
       of  the search space (Z), which is defined as the length	of the current
       query sequence times 2 (because both strands will  be  searched)	 times
       the  number  of	profiles  in <cmdb>.  However, if either the -Z	<x> or
       --FZ <x>	options	are used then the search space will be	considered  to
       be <x> for purposes of defining the filter thresholds.

       For  larger  databases, the filters are more strict leading to more ac-
       celeration but potentially a greater loss of sensitivity. The rationale
       is that for larger databases, hits must have higher scores  to  achieve
       statistical  significance,  so  stricter	 filtering  that removes lower
       scoring insignificant hits is acceptable.

       The P-value thresholds for all possible search space sizes and all
       filter stages are listed next. (A P-value threshold of 0.01 means that
       roughly 1% of the highest scoring nonhomologous subsequences are
       expected to pass the filter.) As defined above, Z is the length of the
       current query sequence times 2 (because both strands will be searched)
       times the number of CMs in <cmdb>.

       If Z is less than 2 Mb: F1 is 0.35; F2 and F2b are off;	F3,  F3b,  F4,
       F4b and F5 are 0.02; F6 is 0.0001.

       If  Z  is  between  2 Mb	and 20 Mb: F1 is 0.35; F2 and F2b are off; F3,
       F3b, F4,	F4b and	F5 are 0.005; F6 is 0.0001.

       If Z is between 20 Mb and 200 Mb: F1 is 0.35; F2	and F2b	are 0.15;  F3,
       F3b, F4,	F4b and	F5 are 0.003; F6 is 0.0001.

       If  Z  is between 200 Mb	and 2 Gb: F1 is	0.15; F2 and F2b are 0.15; F3,
       F3b, F4,	F4b, F5, and F5b are 0.0008; and F6 is 0.0001.

       If Z is between 2 Gb and	20 Gb: F1 is 0.15; F2 and F2b  are  0.15;  F3,
       F3b, F4,	F4b, F5, and F5b are 0.0002; and F6 is 0.0001.

       If  Z is	more than 20 Gb: F1 is 0.06; F2	and F2b	are 0.02; F3, F3b, F4,
       F4b, F5,	and F5b	are 0.0002; and	F6 is 0.0001.

       These thresholds	were chosen based on performance on an internal	bench-
       mark testing many different possible settings.
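
       The tiers listed above can be expressed as a small lookup. The sketch
       below is illustrative only and is not taken from the Infernal source:
       the function and constant names are hypothetical, a value of None
       stands for a stage that is "off", and a Z that falls exactly on a
       boundary is assigned to the larger tier, which is an assumption
       because the text does not say how boundaries are handled. The "F3-F5"
       entry covers the group of stages listed together above (F3, F3b, F4,
       F4b, F5, and, where the text lists it, F5b).

```python
from bisect import bisect_right

# Upper bounds of the first five tiers, in Mb (20000 Mb = 20 Gb).
UPPER_BOUNDS_MB = [2, 20, 200, 2000, 20000]

# Per tier: (F1, F2 and F2b, F3 through F5 group). None means "off".
# The text does not list F5b for the first three tiers.
TIERS = [
    (0.35, None, 0.02),
    (0.35, None, 0.005),
    (0.35, 0.15, 0.003),
    (0.15, 0.15, 0.0008),
    (0.15, 0.15, 0.0002),
    (0.06, 0.02, 0.0002),
]

F6_THRESHOLD = 0.0001  # the same in every tier


def default_filter_thresholds(z_mb):
    """Look up the default P-value thresholds for a search space of z_mb Mb.

    Assumption: boundary values belong to the larger tier.
    """
    f1, f2, f3_to_f5 = TIERS[bisect_right(UPPER_BOUNDS_MB, z_mb)]
    return {"F1": f1, "F2": f2, "F3-F5": f3_to_f5, "F6": F6_THRESHOLD}


if __name__ == "__main__":
    print(default_filter_thresholds(40))  # falls in the 20 Mb to 200 Mb tier
```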

       There are six options for controlling the general filtering level.
       These options are, in order from least strict (slowest but most
       sensitive) to most strict (fastest but least sensitive): --max, --nohmm,
       --mid, --default (this is the default setting), --rfam, and --hmmonly.
       With --default the filter thresholds will be database-size dependent.
       See the explanation of each of these individual options below for more
       information.

       Additionally, an expert user can precisely control each filter stage
       score threshold with the --F1, --F1b, --F2, --F2b, --F3, --F3b, --F4,
       --F4b, --F5, --F5b, and --F6 options, and can turn each stage on or off
       with the --noF1, --doF1b, --noF2, --noF2b, --noF3, --noF3b, --noF4,
       --noF4b, --noF5, and --noF6 options. These options are only displayed
       if the --devhelp option is used, to keep the number of options
       displayed with -h reasonable, and because they are only expected to be
       useful to a small minority of users.

       As a special case, for any models in <cmdb> which have zero basepairs,
       profile HMM searches are run instead of CM searches. HMM algorithms
       are more efficient than CM algorithms, and the benefit of CM
       algorithms is lost for models with no secondary structure (zero
       basepairs). These profile HMM searches will run significantly faster
       than the CM searches. You can force HMM-only searches with the
       --hmmonly option. For more information on HMM-only searches see the
       user guide.

       --max  Turn  off	 all filters, and run non-banded Inside	on every full-
	      length target sequence. This increases sensitivity somewhat,  at
	      an extremely large cost in speed.

       --nohmm
	      Turn off all HMM filter stages (F1 through F5b). The CYK filter,
	      using QDBs, will be run on every full-length target sequence and
	      will  enforce  a	P-value	 threshold of 0.0001. Each subsequence
	      that survives CYK	will be	passed to Inside, which	will also  use
	      QDBs (but	a looser set). This increases sensitivity somewhat, at
	      a	very large cost	in speed.

       --mid  Turn off the HMM SSV and Viterbi filter stages (F1 through F2b).
	      Set  remaining HMM filter	thresholds (F3 through F5b) to 0.02 by
	      default, but changeable to <x> with --Fmid <x>. This
	      may increase sensitivity,	at a significant cost in speed.

       --default
	      Use  the	default	 filtering  strategy. This option is on	by de-
	      fault. The filter	thresholds are determined based	on  the	 data-
	      base size.

       --rfam Use  a  strict  filtering	 strategy  devised for large databases
	      (more than 20 Gb). This will accelerate the search at  a	poten-
	      tial cost	to sensitivity.

       --hmmonly
	      Only use the filter profile HMM for searches; do not use the CM.
	      Only  filter stages F1 through F3	will be	executed, using	strict
	      P-value thresholds (0.02 for F1, 0.001 for F2  and  0.00001  for
	      F3).   Additionally  a bias composition filter is	used after the
	      F1 stage (with P=0.02 survival threshold).  Any  hit  that  sur-
	      vives  all  stages and has an HMM	E-value	or bit score above the
	      reporting	threshold will be output.  The	user  can  change  the
	      HMM-only	filter	thresholds  and	options	with --hmmF1, --hmmF2,
	      --hmmF3, --hmmnobias, --hmmnonull2, and --hmmmax.	  By  default,
	      searches	for  any model with zero basepairs will	be run in HMM-
	      only mode. This can be turned off, forcing CM searches for these
	      models with the --nohmmonly option.

       --FZ <x>
	      Set filter thresholds as the defaults used if the	database  were
	      <x>  megabases (Mb). If used with	<x> greater than 20000 (20 Gb)
	      this option has the same effect as --rfam.

       --Fmid <x>
	      With the --mid option, set the HMM filter thresholds (F3 through
	      F5b) to <x>. By default, <x> is 0.02.

OTHER OPTIONS
       --notrunc
	      Turn off truncated hit detection.

       --anytrunc
	      Allow  truncated hits to begin and end at	any position in	a tar-
	      get sequence. By default,	5' truncated  hits  must  include  the
	      first  residue  of  their	 target	sequence and 3'	truncated hits
	      must include the final residue of	their  target  sequence.  With
	      this  option  you	may observe fewer full length hits that	extend
	      to the beginning and end of the query CM.	As of  version	1.1.5,
	      truncated hits that end at sequence termini with a lower score
	      penalty than internally truncated hits are also considered
	      (these were not considered in 1.1.x versions prior to 1.1.5). To
	      reproduce	 the  behavior	of  this  option  from v1.1.4, use the
	      --inttrunc option	instead.

       --nonull3
	      Turn off the null3 CM score corrections for biased  composition.
	      This correction is not used during the HMM filter	stages.

       --mxsize	<x>
	      Set the maximum allowable	CM DP matrix size to <x> megabytes. By
	      default  this  size  is 128 Mb.  This should be large enough for
	      the vast majority	of searches, especially	with  smaller  models.
	      If cmscan	encounters an envelope in the CYK or Inside stage that
	      requires	a  larger matrix, the envelope will be discounted from
	      consideration. This behavior is like an additional  filter  that
	      prevents expensive (slow)	CM DP calculations, but	at a potential
	      cost  to	sensitivity.   Note that if cmscan is being run	in <n>
	      multiple threads on a multicore machine  then  each  thread  may
	      have an allocated	matrix of up to	size <x> Mb at any given time.

       --smxsize <x>
	      Set  the	maximum	 allowable  CM	search	DP  matrix size	to <x>
	      megabytes. By default this size is 128 Mb.  This option is  only
	      relevant if the CM will not use HMM banded matrices, i.e.	if the
	      --max, --nohmm, --qdb, --fqdb, --nonbanded, or --fnonbanded
	      options are also used. Note that if cmscan is being run in <n>
	      multiple	threads	 on  a	multicore machine then each thread may
	      have an allocated	matrix of up to	size <x> Mb at any given time.

       --cyk  Use the CYK algorithm, not Inside, to determine the final	 score
	      of all hits.

       --acyk Use   the	  CYK	algorithm  to  align  hits.  By	 default,  the
	      Durbin/Holmes optimal accuracy algorithm is  used,  which	 finds
	      the  alignment  that  maximizes  the  expected  accuracy	of all
	      aligned residues.

       --wcx <x>
	      For each CM, set the W parameter,	the expected maximum length of
	      a	hit, to	<x> times the consensus	length of the  model.  By  de-
	      fault,  the  W parameter is read from the	CM file	and was	calcu-
	      lated based on the transition probabilities of the model by  cm-
	      build.  You can find out what the	default	W is for a model using
	      cmstat.	This  option should be used with caution as it impacts
	      the filtering pipeline at	several	different stages in nonobvious
	      ways. It is only recommended for expert users searching for hits
	      that are much longer than	any of the homologs used to build  the
	      model  in	 cmbuild,  e.g.	ones with large	introns	or other large
	      insertions.  It cannot be	used in	combination with the  --nohmm,
	      --fqdb  or  --qdb	options	because	in those cases W is limited by
	      query-dependent bands.

       --toponly
	      Only search the top (Watson) strand of target sequences in <seq-
	      file>.  By default, both strands are searched. This  will	 halve
	      the search space size (Z).

       --bottomonly
	      Only  search  the	 bottom	 (Crick) strand	of target sequences in
	      <seqfile>.  By default, both strands  are	 searched.  This  will
	      halve the	search space size (Z).

       --qformat <s>
	      Assert  that  the	query sequence database	file is	in format <s>.
	      Accepted formats include fasta, embl, genbank, ddbj,  stockholm,
	      pfam, a2m, afa, clustal, and phylip. The default is to autodetect
	      the format of the	file.

       --glist <f>
	      Configure a subset of models from <cmdb> in glocal alignment
	      mode, instead of local mode, namely the models listed in file
	      <f>. Configure all other models (those not listed in <f>) in
	      local mode. This option is incompatible with -g. File <f> must
	      list valid names of models from <cmdb>, each separated by any
	      whitespace character (e.g. a newline character).

       --clanin	<f>
	      Read clan information on the models in <cmdb> from file <f>.
	      Not all models in <cmdb> need to be members of a clan. This
	      option must be used in combination with --fmt 2 and --tblout be-
	      cause  clan annotation is	only output in format 2	of the tabular
	      output file.  See	section	9 of the Infernal user guide for spec-
	      ifications on the	format of the clan input file <f>.

       --oclan
	      Only mark	overlaps between models	in the same clan.  This	option
	      must be used in combination with --fmt 2, --tblout, and --clanin
	      because clan annotation is only output in	format 2 of the	 tabu-
	      lar  output  file,  and clan information can only	be input using
	      the --clanin option.

       --oskip
	      Omit any hit h from the tabular output file that	satisfies  the
	      following:  another hit h2 overlaps with h and the E-value of h2
	      is lower than that of h, and h2 is itself	 not  omitted.	Hit  h
	      will  not	 appear	 in  the tabular output	file, although it will
	      still exist in the standard output.  This	option must be used in
	      combination with --fmt 2 and --tblout because overlap annotation is
	      only  output  in format 2	of the tabular output file.  When used
	      in combination with --oclan only hits h that satisfy the follow-
	      ing are omitted: another hit h2 overlaps with h, the E-value  of
	      h2 is lower than that of h, and both h and h2 are	hits to	models
	      that are in the same clan.
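
	      The omission rule above can be sketched as a greedy pass over
	      hits sorted by E-value. This is an illustration, not Infernal's
	      implementation. The Hit record and function names are
	      hypothetical, and the overlap test (same sequence, same strand,
	      at least one shared position) is an assumption, since this page
	      does not define what counts as an overlap. The sketch also
	      assumes that, in the --oclan case, the better hit must itself
	      not be omitted.

```python
from collections import namedtuple

# Hypothetical hit record; the fields are chosen for illustration only.
Hit = namedtuple("Hit", "model seq strand start end evalue clan")


def overlaps(a, b):
    """Assumed overlap test: same sequence and strand, one shared position."""
    if a.seq != b.seq or a.strand != b.strand:
        return False
    a_lo, a_hi = sorted((a.start, a.end))
    b_lo, b_hi = sorted((b.start, b.end))
    return a_lo <= b_hi and b_lo <= a_hi


def oskip(hits, same_clan_only=False):
    """Return the hits that would stay in the tabular output.

    A hit h is omitted if a kept hit h2 overlaps it and has a strictly
    lower E-value. With same_clan_only=True (mimicking --oclan), h and h2
    must also belong to the same clan.
    """
    kept = []
    for h in sorted(hits, key=lambda x: x.evalue):
        beaten = any(
            k.evalue < h.evalue
            and overlaps(k, h)
            and (not same_clan_only
                 or (h.clan is not None and h.clan == k.clan))
            for k in kept
        )
        if not beaten:
            kept.append(h)
    return kept


if __name__ == "__main__":
    a = Hit("m1", "s", "+", 1, 100, 1e-10, "CL1")
    b = Hit("m2", "s", "+", 50, 150, 1e-5, "CL1")
    c = Hit("m3", "s", "+", 140, 200, 1e-3, "CL2")
    # b is omitted because of a; c stays because b, its only overlap, is gone.
    print([h.model for h in oskip([c, b, a])])
```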

       --cpu <n>
	      Set  the number of parallel worker threads to <n>.  On multicore
	      machines,	the default is 4.  You can also	control	this number by
	      setting an environment variable, INFERNAL_NCPU.  There is	also a
	      master thread, so	the actual number  of  threads	that  Infernal
	      spawns  is  <n>+1.  This option is not available if Infernal was
	      compiled with POSIX threads support turned off.

       --stall
	      For debugging the MPI master/worker version: pause after start,
	      to enable the developer to attach debuggers to the running
	      master and worker processes. Send a SIGCONT signal to release
	      the pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
	      optional MPI support was enabled at compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun.  (Only available if
	      optional MPI support was enabled at compile-time.)

SEE ALSO
       See infernal(1) for a master man	page with a list of all	the individual
       man pages for programs in the Infernal package.

       For  complete documentation, see	the user guide that came with your In-
       fernal distribution (Userguide.pdf);  or	 see  the  Infernal  web  page
       (http://eddylab.org/infernal/).

COPYRIGHT
       Copyright (C) 2023 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For  additional	information  on	 copyright and licensing, see the file
       called COPYRIGHT	in your	Infernal source	distribution, or see  the  In-
       fernal web page (http://eddylab.org/infernal/).

AUTHOR
       http://eddylab.org

Infernal 1.1.5			   Sep 2023			     cmscan(1)