cmbuild(1)			Infernal Manual			    cmbuild(1)

NAME
       cmbuild - construct covariance model(s) from structurally annotated RNA
       multiple	sequence alignment(s)

SYNOPSIS
       cmbuild [options] <cmfile_out> <msafile>

DESCRIPTION
       For  each  multiple  sequence alignment in <msafile> build a covariance
       model and save it to a new file <cmfile_out>.

       The alignment file must be in Stockholm or SELEX	format,	and must  con-
       tain  consensus	secondary structure annotation.	 cmbuild uses the con-
       sensus structure	to determine the architecture of the CM.

       <msafile> may be '-' (dash), which means the input is read from stdin
       rather than from a file. To use '-', you must also specify the
       alignment file format with --informat <s>, as in --informat stockholm.
       (Because of a current limitation in our implementation, MSA file
       formats cannot be autodetected in a nonrewindable input stream.)

       <cmfile_out>  may  not  be '-' (stdout),	because	sending	the CM file to
       stdout would conflict with the other text output	of the program.
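
       The two restrictions above can be checked before cmbuild is ever
       started. The following is a minimal Python sketch of such a check;
       cmbuild_argv is a hypothetical helper written for this illustration
       and is not part of Infernal. It only assembles an argument list and
       does not run anything.

```python
def cmbuild_argv(cmfile_out, msafile, informat=None, extra=()):
    """Assemble an argument list for cmbuild (hypothetical helper, not part of Infernal)."""
    if cmfile_out == "-":
        # The CM file cannot go to stdout; it would mix with the summary output.
        raise ValueError("<cmfile_out> may not be '-' (stdout)")
    if msafile == "-" and informat is None:
        # Formats cannot be autodetected on a nonrewindable stream.
        raise ValueError("reading <msafile> from stdin requires --informat <s>")
    argv = ["cmbuild", *extra]
    if informat is not None:
        argv += ["--informat", informat]
    return argv + [cmfile_out, msafile]


print(" ".join(cmbuild_argv("my.cm", "-", informat="stockholm")))
# prints: cmbuild --informat stockholm my.cm -
```

       The resulting list could be handed to subprocess.run() together with
       the alignment on standard input; the file name my.cm is only a
       placeholder.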

       In addition to writing CM(s) to <cmfile_out>, cmbuild  also  outputs  a
       single line for each model created to stdout. Each line has the follow-
       ing  fields:  "aln":  the  index	of the alignment used to build the CM;
       "idx": the index	of the CM in the <cmfile_out>; "name": the name	of the
       CM; "nseq": the number of sequences in the alignment used to build  the
       CM;  "eff_nseq":	 the  effective	 number	of sequences used to build the
       model; "alen": the length of  the  alignment  used  to  build  the  CM;
       "clen":	the  number of columns from the	alignment defined as consensus
       (match) columns;	"bps": the number of basepairs in the CM; "bifs":  the
       number of bifurcations in the CM; "rel entropy: CM": the	total relative
       entropy	of  the	model divided by the number of consensus columns; "rel
       entropy: HMM": the total relative entropy of the model ignoring
       secondary structure, divided by the number of consensus columns;
       "description": the description of the model/alignment.

OPTIONS
       -h     Help; print a brief reminder of command line usage and available
	      options.

       -n <s> Name the new CM <s>. The default is to use the name of the
              alignment (if one is present in the <msafile>), or, failing
              that, the name of the <msafile>. If <msafile> contains more
              than one alignment, -n cannot be used, and every alignment must
              have a name annotated in the <msafile> (as in Stockholm #=GF ID
              annotation).

       -F     Allow <cmfile_out> to be overwritten. Without  this  option,  if
	      <cmfile_out> already exists, cmbuild exits with an error.

       -o <f> Direct the summary output	to file	<f>, rather than to stdout.

       -O <f> After each model is constructed, resave the annotated source
              alignments to a file <f> in Stockholm format. Sequences are
              annotated with the relative sequence weights that were
              assigned. The alignments are also annotated with a reference
              annotation line indicating which columns were assigned as
              consensus. If the source alignment had reference annotation
              ("#=GC RF"), it is replaced with the consensus residue of the
              model for consensus columns and '.' for insert columns, unless
              the --hand option was used to specify consensus positions, in
              which case it is left unchanged. Any sequences defined as
              fragments are annotated as well, using only ~ characters
              before the first residue and after the final residue, unless
              the --fraggiven option was used.

       --devhelp
              Print help, as with -h, but also include expert options that
              are not displayed with -h. These expert options are not
              expected to be relevant for the vast majority of users and so
              are not described in the manual page. The only resources for
              understanding what they actually do are the brief one-line
              descriptions output when --devhelp is enabled, and the source
              code.

OPTIONS	CONTROLLING MODEL CONSTRUCTION
       --fast Define consensus columns automatically  as  those	 that  have  a
	      fraction	>=  symfrac of residues	as opposed to gaps. (See below
	      for the --symfrac	option.) This is the default.

       --hand Use reference coordinate annotation (#=GC	RF line, in Stockholm)
	      to determine which columns are consensus,	and which are inserts.
	      Any non-gap character indicates a	consensus column.  (For	 exam-
	      ple,  mark  consensus  columns with "x", and insert columns with
	      ".".) This option	was called --rf	in previous versions of	Infer-
	      nal (0.1 through 1.0.2).
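
              The rule "any non-gap character indicates a consensus
              column" can be sketched in Python as follows. This is an
              illustration only: the function name is hypothetical, and
              the set of characters treated as gaps ("-", ".", "_", "~")
              is an assumption of this sketch rather than something
              stated on this page.

```python
def consensus_from_rf(rf_line, gap_chars="-._~"):
    """Return 0-based indices of columns marked as consensus in an RF line.

    gap_chars is an assumption of this sketch; every other character
    counts as a consensus marker.
    """
    return [j for j, ch in enumerate(rf_line) if ch not in gap_chars]


print(consensus_from_rf("xx.x"))  # prints: [0, 1, 3]
```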

       --symfrac <x>
	      Define the residue fraction threshold necessary to define	a con-
	      sensus column when not using --hand.  The	default	 is  0.5.  The
	      symbol  fraction in each column is calculated after taking rela-
	      tive sequence weighting into account.  Setting this to 0.0 means
	      that every alignment column will be assigned as consensus, which
	      may be useful in some cases. Setting it to 1.0 means  that  only
	      columns that include 0 gaps will be assigned as consensus.  This
              option replaces the --gapthresh <y> option from previous
              versions of Infernal (0.1 through 1.0.2), with <x> equal to
              (1.0 - <y>). For example, to reproduce the behavior of cmbuild
              --gapthresh 0.8 in a previous version, use cmbuild --symfrac
              0.2 with this version.
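
              A minimal Python sketch of the weighted residue-fraction
              test described above, together with the --gapthresh
              conversion. The function names and the gap character set
              are assumptions made for this illustration; cmbuild's own
              implementation may differ in detail.

```python
GAP_CHARS = set("-._~")  # assumption: characters treated as gaps


def consensus_columns(aligned_seqs, weights=None, symfrac=0.5):
    """Return 0-based columns whose weighted residue fraction is >= symfrac."""
    if weights is None:
        weights = [1.0] * len(aligned_seqs)
    total = sum(weights)
    columns = []
    for j in range(len(aligned_seqs[0])):
        # Sum the weights of the sequences that have a residue in column j.
        residue_weight = sum(
            w for seq, w in zip(aligned_seqs, weights) if seq[j] not in GAP_CHARS
        )
        if residue_weight / total >= symfrac:
            columns.append(j)
    return columns


def symfrac_from_gapthresh(y):
    """Convert an old --gapthresh value <y> to the equivalent --symfrac value."""
    return 1.0 - y


msa = ["GA-U", "GAC-", "G--U"]
print(consensus_columns(msa))                     # prints: [0, 1, 3]
print(consensus_columns(msa, weights=[1, 1, 4]))  # prints: [0, 3]
```

              In the second call the heavily weighted third sequence has
              a gap in column 1, which pulls that column below the 0.5
              threshold.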

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned
              sequence is known to be full-length, not if it is a fragment
              (for instance, because only part of it was sequenced). A
              fragment is defined as any aligned sequence for which the
              fractional span, defined as its aligned length from its first
              to last residue divided by the total alignment length, is less
              than <x> (0.8 by default). Note that this differs from the way
              HMMER defines fragments (as of v3.3.2). Setting --fragthresh 0
              will define no sequence as a fragment; you might want to do
              this if you know you have a carefully curated alignment of
              full-length sequences or want to mimic the behavior of older
              versions of Infernal (v1.1 to v1.1.4). Setting --fragthresh 1
              will define all sequences as fragments. The --fragnrfpos and
              --fraggiven options offer alternative ways to define
              fragments.
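
              The fractional span test can be sketched in Python as
              follows. The function names and the gap character set are
              assumptions of this illustration. The comparison follows
              the "less than" wording above; how exact ties are treated
              by cmbuild is not specified on this page.

```python
def fractional_span(aligned_seq, gap_chars="-._~"):
    """Aligned length from first to last residue, divided by alignment length."""
    positions = [j for j, ch in enumerate(aligned_seq) if ch not in gap_chars]
    if not positions:
        return 0.0
    return (positions[-1] - positions[0] + 1) / len(aligned_seq)


def is_fragment(aligned_seq, fragthresh=0.8):
    # Strict comparison, following the wording "less than" in the text.
    return fractional_span(aligned_seq) < fragthresh


print(fractional_span("--GAUC----"))  # prints: 0.4
print(is_fragment("--GAUC----"))      # prints: True
print(is_fragment("GA--UCGAUC"))      # prints: False
```

              Internal gaps do not shorten the span: the second sequence
              runs from the first to the last alignment column and so has
              a fractional span of 1.0.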

       --fragnrfpos <n>
	      Define  a	sequence as a fragment if it has more than <n> gaps in
	      terminal consensus positions at the 5' or	3' ends.  This	option
	      can  only	 be used in combination	with the --hand	option,	and if
	      it is used, the --fragthresh option is ignored.

       --fraggiven
	      Do not infer  which  sequences  are  fragments  based  on	 their
	      lengths  but do use fragment information in the input alignment,
	      if there is any.	For a sequence in the input  alignment	to  be
	      considered  a  fragment,	all positions before (5' of) the first
	      nucleotide and all positions after (3' of) the final  nucleotide
	      must  be	~ symbols. Importantly,	~ symbols are not allowed any-
	      where else in the	alignment.

       --noss Ignore the secondary structure annotation, if any, in  <msafile>
	      and  build  a CM with zero basepairs. This model will be similar
	      to a profile HMM and the cmsearch	and cmscan programs  will  use
	      HMM algorithms which are faster than CM ones for this model. Ad-
	      ditionally,  a  zero  basepair model need	not be calibrated with
	      cmcalibrate prior	to running cmsearch with it. The --noss	option
	      must be used if there is no secondary  structure	annotation  in
	      <msafile>.

       --rsearch <f>
	      Parameterize emission scores a la	RSEARCH, using the RIBOSUM ma-
	      trix  in	file  <f>.   With --rsearch enabled, all alignments in
	      <msafile>	must contain exactly one sequence or the --call	option
	      must also	be enabled. All	positions in  each  sequence  will  be
	      considered  consensus  "columns".	 Actually, the emission	scores
              for these models will not be identical to RIBOSUM scores due to
	      differences  in  the  modelling  strategy	 between  Infernal and
	      RSEARCH, but they	will be	as similar as possible.	  RIBOSUM  ma-
	      trix  files are included with Infernal in	the "matrices/"	subdi-
	      rectory of the top-level "infernal-xxx" directory.  RIBOSUM  ma-
	      trices  are substitution score matrices trained specifically for
	      structural RNAs with separate single stranded residue  and  base
	      pair  substitution  scores. For more information see the RSEARCH
	      publication (Klein and Eddy, BMC Bioinformatics 4:44, 2003).

       --consrf
	      With --hand use the model's consensus  sequence  for  the	 model
	      reference	annotation instead of the RF annotation	from the input
	      alignment.

OTHER MODEL CONSTRUCTION OPTIONS
       --null <f>
              Read a null model from <f>. The null model defines the
              probability of each RNA nucleotide in background sequence; the
              default is to use 0.25 for each nucleotide. The format of null
              files is specified in the user guide.
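
              To show why the background probabilities matter, here is a
              small Python sketch of a log-odds emission score in bits,
              log2(p / q), where q comes from the null model. The
              function name and the example emission probabilities are
              made up for this illustration, and the sketch does not read
              or write null model files.

```python
import math

# Default background stated above: 0.25 for each nucleotide.
UNIFORM_NULL = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}


def log_odds_bits(emission_probs, null=UNIFORM_NULL):
    """Log-odds score in bits of each nucleotide relative to the null model."""
    return {nt: math.log2(p / null[nt]) for nt, p in emission_probs.items() if p > 0}


emissions = {"A": 0.5, "C": 0.25, "G": 0.125, "U": 0.125}  # made-up example
print(log_odds_bits(emissions))
# prints: {'A': 1.0, 'C': 0.0, 'G': -1.0, 'U': -1.0}
```

              With a null model identical to the emission distribution,
              every score would be 0 bits, so a skewed background changes
              which residues count as informative.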

       --prior <f>
	      Read a Dirichlet prior from <f>, replacing the  default  mixture
	      Dirichlet.   The	format of prior	files is specified in the user
	      guide.

       Use --devhelp to	see additional,	 otherwise  undocumented,  model  con-
       struction options.

OPTIONS	CONTROLLING RELATIVE WEIGHTS
       cmbuild	uses  an  ad  hoc  sequence  weighting algorithm to downweight
       closely related sequences and upweight distantly	related	ones. This has
       the effect of making models less	biased by uneven  phylogenetic	repre-
       sentation.  For	example,  two identical	sequences would	typically each
       receive half the	weight that one	sequence would.	 These options control
       which algorithm gets used.

       --wpb  Use  the	Henikoff  position-based  sequence  weighting	scheme
	      [Henikoff	 and  Henikoff,	J. Mol.	Biol. 243:574, 1994].  This is
	      the default.
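
              A minimal Python sketch of the position-based scheme as
              described in the cited paper: in each column, a sequence
              receives 1 / (k * n), where k is the number of distinct
              residues in the column and n is the number of sequences
              sharing its residue. Skipping gaps, rescaling the weights
              to sum to the number of sequences, and the function name
              are assumptions of this sketch; cmbuild's implementation
              may differ in these details.

```python
from collections import Counter

GAP_CHARS = set("-._~")  # assumption: characters treated as gaps


def position_based_weights(aligned_seqs):
    """Henikoff-style position-based weights, rescaled to sum to nseq."""
    nseq = len(aligned_seqs)
    raw = [0.0] * nseq
    for column in zip(*aligned_seqs):
        counts = Counter(ch for ch in column if ch not in GAP_CHARS)
        k = len(counts)  # number of distinct residues in this column
        for i, ch in enumerate(column):
            if ch in counts:
                raw[i] += 1.0 / (k * counts[ch])
    total = sum(raw)
    return [w * nseq / total for w in raw]


weights = position_based_weights(["GACU", "GACU", "GGCA"])
print([round(w, 3) for w in weights])  # prints: [0.875, 0.875, 1.25]
```

              The two identical sequences share their weight, while the
              divergent third sequence is upweighted.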

       --wgsc Use the Gerstein/Sonnhammer/Chothia  weighting  algorithm	 [Ger-
	      stein et al, J. Mol. Biol. 235:1067, 1994].

       --wnone
              Turn sequence weighting off; i.e. explicitly set all sequence
              weights to 1.0.

       --wgiven
              Use sequence weights as given in annotation in the input
              alignment file. If no weights were given, assume they are all
              1.0. The default is to determine new sequence weights by the
              Henikoff position-based algorithm (--wpb), ignoring any
              annotated weights.

       --wblosum
              Use the BLOSUM filtering algorithm to weight the sequences,
              instead of the default position-based weighting. Cluster the
              sequences at a given percentage identity (see --wid); assign
              each cluster a total weight of 1.0, distributed equally among
              the members of that cluster.

       --wid <x>
	      Controls	the behavior of	the --wblosum weighting	option by set-
	      ting the percent identity	for clustering the alignment to	<x>.
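
              The clustering rule behind --wblosum and --wid can be
              sketched in Python as follows. Several details are
              assumptions of this illustration and are not stated on this
              page: single-linkage clustering, pairwise identity computed
              over columns where both sequences have a residue, and a
              threshold given as a fraction between 0 and 1.

```python
from collections import Counter


def pairwise_identity(a, b, gap_chars="-._~"):
    """Fraction of identical residues over columns where both have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(aligned_seqs, wid):
    """Each single-linkage cluster gets total weight 1.0, split equally."""
    n = len(aligned_seqs)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(aligned_seqs[i], aligned_seqs[j]) >= wid:
                parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(n))
    return [1.0 / sizes[find(i)] for i in range(n)]


print(blosum_weights(["GACU", "GACU", "CUGA"], wid=0.8))  # prints: [0.5, 0.5, 1.0]
```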

OPTIONS	CONTROLLING EFFECTIVE SEQUENCE NUMBER
       After relative weights are determined, they are normalized to sum to  a
       total  effective	sequence number, eff_nseq.  This number	may be the ac-
       tual number of sequences	in the alignment,  but	it  is	almost	always
       smaller	than  that.  The default entropy weighting method (--eent) re-
       duces the effective sequence number to reduce the  information  content
       (relative entropy, or average expected score on true homologs) per con-
       sensus position.	The target relative entropy is controlled by a two-pa-
       rameter	function, where	the two	parameters are settable	with --ere and
       --esigma.

       --eent Use the entropy weighting	strategy to  determine	the  effective
	      sequence	number	that  gives a target mean match	state relative
	      entropy. This option is the default, and can be turned off  with
	      --enone.	 The  default target mean match	state relative entropy
	      is 0.59 bits for models with at least 1 basepair and  0.38  bits
	      for  models  with	zero basepairs,	but can	be changed with	--ere.
	      The default of 0.59 or 0.38 bits is automatically	changed	if the
	      total relative entropy of	the model (summed match	state relative
	      entropy) is less than a cutoff, which is controlled by the --es-
	      igma option. If you really want to play with that	 option,  con-
	      sult the source code.  Additionally, the effective sequence num-
	      ber  cannot be larger than the number of sequences in the	align-
	      ment, although this can be overridden to set the maximum	possi-
	      ble effective sequence number with the --emaxseq option.
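
              The idea of entropy weighting can be sketched in Python as
              a search for the effective sequence number whose mean match
              state relative entropy equals the target. This is a
              simplified illustration: a flat pseudocount stands in for
              the mixture Dirichlet prior, only single-stranded columns
              are modeled, and all function names are hypothetical. It
              is not cmbuild's actual procedure.

```python
import math

NUCLEOTIDES = "ACGU"
BACKGROUND = {nt: 0.25 for nt in NUCLEOTIDES}


def mean_relative_entropy(columns, eff_nseq, pseudocount=1.0):
    """Mean relative entropy (bits) per column for a given effective count.

    columns: one dict of weighted residue frequencies (summing to 1) per
    consensus column. The flat pseudocount is a stand-in for the prior.
    """
    total = 0.0
    for freqs in columns:
        counts = {nt: eff_nseq * freqs.get(nt, 0.0) + pseudocount for nt in NUCLEOTIDES}
        denom = sum(counts.values())
        for nt in NUCLEOTIDES:
            p = counts[nt] / denom
            total += p * math.log2(p / BACKGROUND[nt])
    return total / len(columns)


def entropy_weight(columns, nseq, target, tol=1e-6):
    """Bisect for eff_nseq in [0, nseq] that hits the target relative entropy."""
    if mean_relative_entropy(columns, nseq) <= target:
        return float(nseq)  # already at or below target: cap at nseq
    lo, hi = 0.0, float(nseq)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_relative_entropy(columns, mid) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def scale_weights(rel_weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(rel_weights)
    return [w * eff_nseq / total for w in rel_weights]


columns = [{"G": 1.0}, {"A": 0.75, "U": 0.25}]  # made-up two-column example
eff = entropy_weight(columns, nseq=10, target=0.59)
print(0 < eff < 10)  # prints: True
```

              Reducing the effective count pulls the emission
              probabilities toward the prior, which lowers the relative
              entropy; the search stops where it meets the target.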

       --enone
	      Turn  off	the entropy weighting strategy.	The effective sequence
	      number is	just the number	of sequences in	the alignment.

       --ere <x>
	      Set the target mean match	state relative entropy as <x>.	By de-
	      fault the	target relative	entropy	per  match  position  is  0.59
	      bits  for	 models	 with  at least	1 basepair and 0.38 for	models
	      with zero	basepairs.

       --eminseq <x>
	      Define the minimum allowed effective sequence number as <x>.

       --emaxseq <x>
	      Define the maximum allowed effective  sequence  number  as  <x>.
	      This  number  can	 be larger than	the number of sequences	in the
	      alignment.

       --ehmmre	<x>
	      Set the target HMM mean match state  relative  entropy  as  <x>.
	      Entropy for basepairing match states is calculated using margin-
	      alized basepair emission probabilities.

       --eset <x>
	      Set the effective	sequence number	for entropy weighting as <x>.

OPTIONS	CONTROLLING FILTER P7 HMM CONSTRUCTION
       For  each  CM that cmbuild constructs, an accompanying filter p7	HMM is
       built from the input alignment as well. These  options  control	filter
       HMM construction:

       --p7ere <x>
	      Set  the target mean match state relative	entropy	for the	filter
	      p7 HMM as	<x>.  By default the target relative entropy per match
	      position is 0.38 bits.

       --p7ml Use a maximum likelihood p7 HMM built from the CM	as the	filter
	      HMM.  This  HMM  will be as similar as possible to the CM	(while
	      necessarily ignorant of secondary	structure).

       Use --devhelp to	see additional,	 otherwise  undocumented,  filter  HMM
       construction options.

OPTIONS	CONTROLLING FILTER P7 HMM CALIBRATION
       After  building each filter HMM,	cmbuild	determines appropriate E-value
       parameters to use during	filtering in cmsearch and cmscan by sampling a
       set of sequences	and searching them with	each HMM filter	 configuration
       and algorithm.

       --EmN <n>
              Set the number of sampled sequences for local MSV filter HMM
              calibration to <n>. The default is 200.

       --EvN <n>
              Set the number of sampled sequences for local Viterbi filter
              HMM calibration to <n>. The default is 200.

       --ElfN <n>
              Set the number of sampled sequences for local Forward filter
              HMM calibration to <n>. The default is 200.

       --EgfN <n>
              Set the number of sampled sequences for glocal Forward filter
              HMM calibration to <n>. The default is 200.

       Use --devhelp to	see additional,	 otherwise  undocumented,  filter  HMM
       calibration options.

OPTIONS	FOR REFINING THE INPUT ALIGNMENT
       --refine	<f>
	      Attempt to refine	the alignment before building the CM using ex-
	      pectation-maximization  (EM).  A CM is first built from the ini-
	      tial alignment as	usual. Then, the sequences  in	the  alignment
              are realigned optimally to the CM (with the HMM banded CYK
              algorithm; "optimal" means optimal given the bands), and a new CM is
	      built  from  the resulting alignment. The	sequences are then re-
	      aligned to the new CM, and a new CM is built  from  that	align-
	      ment. This is continued until convergence, specifically when the
	      alignments  for  two successive iterations are not significantly
	      different	(the summed bit	scores of all  the  sequences  in  the
	      alignment	 changes  less	than  1% between two successive	itera-
	      tions). The final	alignment (the alignment used to build the  CM
	      that gets	written	to <cmfile_out>) is written to <f>.
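
              The iteration and its stopping rule can be sketched in
              Python as follows. The build and realign steps are passed
              in as callbacks and replaced by toy stand-ins here; the
              function names and the max_iter safety limit are
              assumptions of this illustration, not cmbuild behavior.

```python
def refine(alignment, build_model, realign, rel_tol=0.01, max_iter=100):
    """Iterate build/realign until the summed bit score changes by < 1%.

    build_model(alignment) -> model
    realign(model, alignment) -> (new_alignment, summed_bit_score)
    max_iter is a safety limit added for this sketch.
    """
    model = build_model(alignment)  # CM from the initial alignment
    prev_score = None
    for iteration in range(1, max_iter + 1):
        alignment, score = realign(model, alignment)
        model = build_model(alignment)  # new CM from the new alignment
        if prev_score is not None and abs(score - prev_score) < rel_tol * abs(prev_score):
            break
        prev_score = score
    return alignment, model, iteration


# Toy stand-ins: the "model" is just a tagged copy of the alignment and the
# summed bit scores are scripted to level off.
toy_scores = iter([100.0, 150.0, 160.0, 161.0])


def toy_build(alignment):
    return ("model", tuple(alignment))


def toy_realign(model, alignment):
    return alignment, next(toy_scores)


final_aln, final_model, n_iter = refine(["GACU", "GGCU"], toy_build, toy_realign)
print(n_iter)  # prints: 4
```

              The loop stops at the fourth iteration because the score
              moves from 160 to 161, a change of less than 1%.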

       -l     With  --refine, turn on the local	alignment algorithm, which al-
	      lows the alignment to span two or	more subsequences if necessary
	      (e.g. if the structures of the query model and  target  sequence
	      are  only	 partially  shared), allowing certain large insertions
	      and deletions in the structure to	be penalized differently  than
	      normal indels.  The default is to	globally align the query model
	      to the target sequences.

       --gibbs
              Modify the behavior of --refine so that Gibbs sampling is used
              instead of EM. The difference is that during the alignment
              stage the alignment is not necessarily optimal; instead, an
              alignment (parsetree) for each sequence is sampled from the
              posterior distribution of alignments as determined by the
              Inside algorithm. Because of this sampling step, --gibbs is
              non-deterministic, so different runs with the same alignment
              may yield different results. This is not true when --refine is
              used without the --gibbs option, in which case the final
              alignment and CM will always be the same. When --gibbs is
              enabled, the --seed <n> option can be used to seed the random
              number generator predictably, making the results reproducible.
              The goal of the --gibbs option is to help expert RNA alignment
              curators refine structural alignments by allowing them to
              observe alternative high scoring alignments.

       --seed <n>
	      Seed the random number generator with  <n>,  an  integer	>=  0.
	      This  option  can	 only be used in combination with --gibbs.  If
	      <n> is nonzero, stochastic sampling of alignments	will be	repro-
	      ducible; the same	command	will give the same results.  If	<n> is
	      0, the random number generator is	seeded arbitrarily,  and  sto-
	      chastic  samplings may vary from run to run of the same command.
	      The default seed is 0.
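
              The difference between taking the best alignment and
              sampling one, and the effect of the seed, can be sketched
              in Python as follows. The candidate list with its posterior
              probabilities is a made-up stand-in for the distribution
              that the Inside algorithm would provide, and the function
              name is hypothetical.

```python
import random


def pick_alignment(candidates, gibbs=False, seed=0):
    """Choose one alignment from (label, posterior_probability) pairs.

    Without gibbs, take the highest-probability candidate. With gibbs,
    sample in proportion to probability; seed 0 means arbitrary seeding.
    """
    if not gibbs:
        return max(candidates, key=lambda c: c[1])[0]
    rng = random.Random(seed if seed != 0 else None)
    labels = [label for label, _ in candidates]
    probs = [prob for _, prob in candidates]
    return rng.choices(labels, weights=probs, k=1)[0]


candidates = [("aln_a", 0.6), ("aln_b", 0.3), ("aln_c", 0.1)]  # made-up values
print(pick_alignment(candidates))  # prints: aln_a
first = pick_alignment(candidates, gibbs=True, seed=42)
second = pick_alignment(candidates, gibbs=True, seed=42)
print(first == second)  # prints: True
```

              With a nonzero seed the sampled choice repeats from run to
              run; with seed 0 it may differ between runs.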

       --cyk  With --refine, align with	the CYK	algorithm. By default the  op-
	      timal  accuracy  algorithm is used. There	is more	information on
	      this in the cmalign manual page.

       --notrunc
              With --refine, turn off the truncated alignment algorithm.
	      There is more information	on this	in the cmalign manual page.

       --miss With --refine, in the final alignment and each intermediate
              alignment, consider all sequences with terminal gaps as
              fragments for purposes of building models from those
              alignments. You may want to do this if you have many sequences
              that are not full length, e.g. because only part of each was
              sequenced.

       Use  --devhelp to see additional, otherwise undocumented, alignment re-
       finement	options	as well	as other output	file options and  options  for
       building	multiple models	for a single alignment.

SEE ALSO
       See infernal(1) for a master man	page with a list of all	the individual
       man pages for programs in the Infernal package.

       For  complete documentation, see	the user guide that came with your In-
       fernal distribution (Userguide.pdf);  or	 see  the  Infernal  web  page
       (http://eddylab.org/infernal/).

COPYRIGHT
       Copyright (C) 2023 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For  additional	information  on	 copyright and licensing, see the file
       called COPYRIGHT	in your	Infernal source	distribution, or see  the  In-
       fernal web page (http://eddylab.org/infernal/).

AUTHOR
       http://eddylab.org

Infernal 1.1.5			   Sep 2023			    cmbuild(1)
