cmbuild(1)			Infernal Manual			    cmbuild(1)

NAME
       cmbuild - construct covariance model(s) from structurally annotated RNA
       multiple	sequence alignment(s)

SYNOPSIS
       cmbuild [options] _cmfile_out_ _msafile_

DESCRIPTION
       For each multiple sequence alignment in _msafile_, build a covariance
       model and save it to a new file _cmfile_out_.

       The  alignment file must	be in Stockholm	or SELEX format, and must con-
       tain consensus secondary	structure annotation.  cmbuild uses  the  con-
       sensus structure	to determine the architecture of the CM.

       _msafile_  may be '-' (dash), which means reading this input from stdin
       rather than a file.  To use '-',	you must also  specify	the  alignment
       file format with --informat _s_, as in --informat stockholm. (Because
       of a current limitation in our implementation, MSA file formats cannot
       be autodetected in a nonrewindable input stream.)
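
       The stdin usage described above can be sketched from a Python script.
       This is a minimal sketch, assuming cmbuild is installed and on PATH;
       the helper names cmbuild_argv and build_from_text and the file name
       my.cm are illustrative and not part of Infernal. Only flags described
       on this page are used.

```python
import shutil
import subprocess

def cmbuild_argv(cmfile_out, informat="stockholm", overwrite=False):
    """Build an argument list that makes cmbuild read the alignment from stdin."""
    argv = ["cmbuild"]
    if overwrite:
        argv.append("-F")  # allow cmfile_out to be overwritten
    # With '-' as msafile the format cannot be autodetected, so pass --informat.
    argv += ["--informat", informat, cmfile_out, "-"]
    return argv

def build_from_text(stockholm_text, cmfile_out):
    """Run cmbuild on an in-memory alignment and return its stdout text."""
    if shutil.which("cmbuild") is None:
        raise FileNotFoundError("cmbuild not found on PATH")
    result = subprocess.run(cmbuild_argv(cmfile_out), input=stockholm_text,
                            capture_output=True, text=True, check=True)
    return result.stdout
```

       Calling cmbuild_argv("my.cm") yields the equivalent of the command
       line cmbuild --informat stockholm my.cm -.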

       _cmfile_out_  may  not  be '-' (stdout),	because	sending	the CM file to
       stdout would conflict with the other text output	of the program.

       In addition to writing CM(s) to _cmfile_out_, cmbuild also outputs a
       single line for each model created to stdout. Each line has the
       following fields:

       "aln"               the index of the alignment used to build the CM
       "idx"               the index of the CM in the _cmfile_out_
       "name"              the name of the CM
       "nseq"              the number of sequences in the alignment used to
                           build the CM
       "eff_nseq"          the effective number of sequences used to build
                           the model
       "alen"              the length of the alignment used to build the CM
       "clen"              the number of columns from the alignment defined
                           as consensus (match) columns
       "bps"               the number of basepairs in the CM
       "bifs"              the number of bifurcations in the CM
       "rel entropy: CM"   the total relative entropy of the model divided
                           by the number of consensus columns
       "rel entropy: HMM"  the total relative entropy of the model ignoring
                           secondary structure, divided by the number of
                           consensus columns
       "description"       description of the model/alignment
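
       A script that consumes this summary could parse each line roughly as
       follows. This is a sketch under stated assumptions: the fields are
       taken to be whitespace-separated and in the order listed above, with
       the description last, and lines starting with '#' are treated as
       comments. This page does not specify the exact layout, so check it
       against real output. ModelSummary and parse_summary_line are
       hypothetical names.

```python
from typing import NamedTuple, Optional

class ModelSummary(NamedTuple):
    aln: int
    idx: int
    name: str
    nseq: int
    eff_nseq: float
    alen: int
    clen: int
    bps: int
    bifs: int
    re_cm: float    # "rel entropy: CM"
    re_hmm: float   # "rel entropy: HMM"
    description: str

def parse_summary_line(line: str) -> Optional[ModelSummary]:
    """Parse one per-model line; return None for blank or '#' lines."""
    text = line.strip()
    if not text or text.startswith("#"):  # assumed comment convention
        return None
    # Split off the first 11 fields; the remainder is the free-text description.
    parts = text.split(None, 11)
    if len(parts) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(parts)}")
    desc = parts[11] if len(parts) > 11 else ""
    return ModelSummary(int(parts[0]), int(parts[1]), parts[2], int(parts[3]),
                        float(parts[4]), int(parts[5]), int(parts[6]),
                        int(parts[7]), int(parts[8]), float(parts[9]),
                        float(parts[10]), desc)
```

       Keeping the description as the unsplit remainder lets it contain
       spaces without disturbing the numeric fields.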

OPTIONS
       -h     Help; print a brief reminder of command line usage and available
	      options.

       -n _s_ Name the new CM _s_.  The	default	is to  use  the	 name  of  the
	      alignment	 (if  one  is  present	in the _msafile_), or, failing
	      that, the	name of	the _msafile_.	 If  _msafile_	contains  more
	      than  one	 alignment,  -n	doesn't	work, and every	alignment must
	      have a name annotated in the _msafile_ (as in Stockholm #=GF  ID
	      annotation).

       -F     Allow  _cmfile_out_  to  be overwritten. Without this option, if
	      _cmfile_out_ already exists, cmbuild exits with an error.

       -o _f_ Direct the summary output	to file	_f_, rather than to stdout.

       -O _f_ After each model is constructed, resave annotated source
              alignments to a file _f_ in Stockholm format. Sequences are
              annotated with the relative sequence weights they were assigned.
              The alignments are also annotated with a reference annotation
              line indicating which columns were assigned as consensus. If the
              source alignment had reference annotation ("#=GC RF"), it is
              replaced with the consensus residue of the model for consensus
              columns and '.' for insert columns, unless the --hand option was
              used to specify consensus positions, in which case it is left
              unchanged.

       --devhelp
              Print help, as with -h, but also include expert options that
              are not displayed with -h. These expert options are not
              expected to be relevant for the vast majority of users and so
              are not described in the manual page. The only resources for
              understanding what they actually do are the brief one-line
              descriptions output when --devhelp is enabled, and the source
              code.

OPTIONS	CONTROLLING MODEL CONSTRUCTION
       These  options  control	how consensus columns are defined in an	align-
       ment.

       --fast Define consensus columns automatically  as  those	 that  have  a
	      fraction	>=  symfrac of residues	as opposed to gaps. (See below
	      for the --symfrac	option.) This is the default.

       --hand Use reference coordinate annotation (#=GC	RF line, in Stockholm)
	      to determine which columns are consensus,	and which are inserts.
	      Any non-gap character indicates a	consensus column.  (For	 exam-
	      ple,  mark  consensus  columns with "x", and insert columns with
	      ".".) This option	was called --rf	in previous versions of	Infer-
	      nal (0.1 through 1.0.2).

       --symfrac _x_
	      Define the residue fraction threshold necessary to define	a con-
	      sensus column when not using --hand.  The	default	 is  0.5.  The
	      symbol  fraction in each column is calculated after taking rela-
	      tive sequence weighting into account.  Setting this to 0.0 means
	      that every alignment column will be assigned as consensus, which
	      may be useful in some cases. Setting it to 1.0 means  that  only
	      columns that include 0 gaps will be assigned as consensus.  This
	      option replaces the --gapthresh _y_ option  from	previous  ver-
	      sions  of	Infernal (0.1 through 1.0.2), with _x_ equal to	(1.0 -
              _y_).  For example, to reproduce the behavior of cmbuild
              --gapthresh 0.8 from a previous version, use cmbuild --symfrac
              0.2 with this version.

       --noss Ignore the secondary structure annotation, if any, in  _msafile_
	      and  build  a CM with zero basepairs. This model will be similar
	      to a profile HMM and the cmsearch	and cmscan programs  will  use
              HMM algorithms, which are faster than CM ones, for this model.
              Additionally, a zero basepair model need not be calibrated with
              cmcalibrate prior to running cmsearch with it. The --noss option
              must be used if there is no secondary structure annotation in
              _msafile_.

       --rsearch _f_
	      Parameterize emission scores a la	RSEARCH, using the RIBOSUM ma-
	      trix in file _f_.	 With --rsearch	 enabled,  all	alignments  in
	      _msafile_	must contain exactly one sequence or the --call	option
	      must also	be enabled. All	positions in  each  sequence  will  be
              considered consensus "columns". Strictly, the emission scores
              for these models will not be identical to RIBOSUM scores, due to
              differences in the modelling strategy between Infernal and
	      RSEARCH, but they	will be	as similar as possible.	  RIBOSUM  ma-
	      trix  files are included with Infernal in	the "matrices/"	subdi-
	      rectory of the top-level "infernal-xxx" directory.  RIBOSUM  ma-
	      trices  are substitution score matrices trained specifically for
	      structural RNAs with separate single stranded residue  and  base
	      pair  substitution  scores. For more information see the RSEARCH
	      publication (Klein and Eddy, BMC Bioinformatics 4:44, 2003).
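
       The two ways of defining consensus columns described in this section
       (--fast with --symfrac, and --hand) can be illustrated with a short
       Python sketch. It is not cmbuild's implementation: the set of gap
       characters is an assumption, sequences are plain aligned strings, and
       all function names are hypothetical.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols; not specified on this page

def consensus_by_symfrac(seqs, weights=None, symfrac=0.5):
    """Return indices of columns whose weighted residue fraction >= symfrac."""
    if weights is None:
        weights = [1.0] * len(seqs)
    total = sum(weights)
    cols = []
    for j in range(len(seqs[0])):
        # Weighted count of sequences with a residue (not a gap) in column j.
        residues = sum(w for s, w in zip(seqs, weights) if s[j] not in GAP_CHARS)
        if residues / total >= symfrac:
            cols.append(j)
    return cols

def consensus_by_rf(rf_line):
    """Return indices of columns whose #=GC RF character is not a gap."""
    return [j for j, c in enumerate(rf_line) if c not in GAP_CHARS]

def symfrac_from_gapthresh(gapthresh):
    """Convert an old --gapthresh value y to the equivalent --symfrac x = 1 - y."""
    return 1.0 - gapthresh
```

       With symfrac set to 0.0 every column qualifies, and with 1.0 only
       gap-free columns do, matching the --symfrac description above.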

OTHER MODEL CONSTRUCTION OPTIONS
       --null _f_
              Read a null model from _f_. The null model defines the
              probability of each RNA nucleotide in background sequence; the
              default is to use 0.25 for each nucleotide. The format of null
              files is specified in the user guide.

       --prior _f_
	      Read  a  Dirichlet prior from _f_, replacing the default mixture
	      Dirichlet.  The format of	prior files is specified in  the  user
	      guide.

       Use  --devhelp  to  see	additional, otherwise undocumented, model con-
       struction options.
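
       To illustrate what a background (null) distribution is used for, the
       sketch below computes a log-odds score in bits for a single
       nucleotide. It uses the textbook log-odds form; this page does not
       describe cmbuild's internal scoring, so treat the function and its
       name as assumptions. Only the default of 0.25 per nucleotide comes
       from the text above.

```python
import math

# Default background from this page: 0.25 for each RNA nucleotide.
DEFAULT_NULL = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}

def log_odds_bits(emission_prob, nucleotide, null=DEFAULT_NULL):
    """Score in bits of emitting `nucleotide` relative to the null model."""
    return math.log2(emission_prob / null[nucleotide])
```

       A nucleotide emitted with exactly its background probability scores
       0 bits; one emitted twice as often as background scores 1 bit.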

OPTIONS	CONTROLLING RELATIVE WEIGHTS
       cmbuild uses an ad  hoc	sequence  weighting  algorithm	to  downweight
       closely related sequences and upweight distantly	related	ones. This has
       the effect of making models less	biased by uneven  phylogenetic	repre-
       sentation.  For	example,  two identical	sequences would	typically each
       receive half the	weight that one	sequence would.	 These options control
       which algorithm gets used.

       --wpb  Use   the	 Henikoff  position-based  sequence  weighting	scheme
	      [Henikoff	and Henikoff, J. Mol. Biol. 243:574, 1994].   This  is
	      the default.

       --wgsc Use  the	Gerstein/Sonnhammer/Chothia  weighting algorithm [Ger-
	      stein et al, J. Mol. Biol. 235:1067, 1994].

       --wnone
              Turn sequence weighting off; i.e., explicitly set all sequence
              weights to 1.0.

       --wgiven
              Use sequence weights as given in annotation in the input
              alignment file. If no weights were given, assume they are all
              1.0. The default is to determine new sequence weights by the
              Henikoff position-based algorithm (--wpb), ignoring any
              annotated weights.

       --wblosum
              Use the BLOSUM filtering algorithm to weight the sequences,
              instead of the default position-based weighting. Cluster the
              sequences at a given percentage identity (see --wid); assign
              each cluster a total weight of 1.0, distributed equally amongst
              the members of that cluster.

       --wid _x_
	      Controls	the behavior of	the --wblosum weighting	option by set-
	      ting the percent identity	for clustering the alignment to	_x_.
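
       The idea behind position-based weighting (the --wpb default) can be
       sketched in a few lines of Python. This is a simplified illustration
       of the Henikoff scheme, not cmbuild's code: gaps are treated like any
       other symbol, and the weights are normalized to sum to the number of
       sequences, both of which are assumptions made for the example.

```python
from collections import Counter

def position_based_weights(seqs):
    """Henikoff position-based weights, normalized to sum to len(seqs)."""
    nseq, alen = len(seqs), len(seqs[0])
    raw = [0.0] * nseq
    for j in range(alen):
        column = [s[j] for s in seqs]
        counts = Counter(column)
        k = len(counts)  # number of distinct symbols in this column
        for i, sym in enumerate(column):
            # Each distinct symbol shares 1/k of the column's weight,
            # split evenly among the sequences that carry it.
            raw[i] += 1.0 / (k * counts[sym])
    total = sum(raw)
    return [nseq * w / total for w in raw]
```

       For the toy alignment ["ACGU", "ACGU", "UGCA"], the two identical
       sequences each receive half the weight of the third, which is the
       downweighting effect described at the start of this section.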

OPTIONS	CONTROLLING EFFECTIVE SEQUENCE NUMBER
       After relative weights are determined, they are normalized to sum to  a
       total  effective	sequence number, eff_nseq.  This number	may be the ac-
       tual number of sequences	in the alignment,  but	it  is	almost	always
       smaller	than  that.  The default entropy weighting method (--eent) re-
       duces the effective sequence number to reduce the  information  content
       (relative entropy, or average expected score on true homologs) per con-
       sensus position.	The target relative entropy is controlled by a two-pa-
       rameter	function, where	the two	parameters are settable	with --ere and
       --esigma.

       --eent Use the entropy weighting	strategy to  determine	the  effective
	      sequence	number	that  gives a target mean match	state relative
	      entropy. This option is the default, and can be turned off  with
	      --enone.	 The  default target mean match	state relative entropy
	      is 0.59 bits for models with at least 1 basepair and  0.38  bits
	      for  models  with	zero basepairs,	but can	be changed with	--ere.
	      The default of 0.59 or 0.38 bits is automatically	changed	if the
	      total relative entropy of	the model (summed match	state relative
	      entropy) is less than a cutoff, which is controlled by the --es-
	      igma  option.  If	you really want	to play	with that option, con-
	      sult the source code.  Additionally, the effective sequence num-
	      ber  cannot be larger than the number of sequences in the	align-
	      ment, although this can be overridden to set the maximum	possi-
	      ble effective sequence number with the --emaxseq option.

       --enone
	      Turn  off	the entropy weighting strategy.	The effective sequence
	      number is	just the number	of sequences in	the alignment.

       --ere _x_
	      Set the target mean match	state relative entropy as _x_.	By de-
	      fault  the  target  relative  entropy per	match position is 0.59
	      bits for models with at least 1 basepair	and  0.38  for	models
	      with zero	basepairs.

       --eminseq _x_
	      Define the minimum allowed effective sequence number as _x_.

       --emaxseq _x_
	      Define  the  maximum  allowed  effective sequence	number as _x_.
	      This number can be larger	than the number	of  sequences  in  the
	      alignment.

       --ehmmre	_x_
	      Set  the	target	HMM  mean match	state relative entropy as _x_.
	      Entropy  for  basepairing	 match	states	is  calculated	 using
	      marginalized basepair emission probabilities.

       --eset _x_
	      Set the effective	sequence number	for entropy weighting as _x_.
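
       The entropy weighting idea in this section can be sketched as a
       search for the effective sequence number that brings the mean
       relative entropy per consensus column down to a target. The Python
       below is an illustration only: it swaps in a flat pseudocount for the
       mixture Dirichlet prior, uses a uniform null, ignores basepairs and
       the --esigma cutoff, and uses hypothetical function names. The
       eminseq and emaxseq parameters mirror --eminseq and --emaxseq.

```python
import math

ALPHABET = "ACGU"

def mean_relative_entropy(col_counts, eff_nseq, nseq, pseudo=1.0):
    """Mean per-column relative entropy (bits) against a uniform null, after
    scaling counts by eff_nseq / nseq and adding a flat pseudocount."""
    scale = eff_nseq / nseq
    total = 0.0
    for counts in col_counts:
        scaled = [counts.get(a, 0.0) * scale + pseudo for a in ALPHABET]
        z = sum(scaled)
        total += sum((c / z) * math.log2((c / z) / 0.25) for c in scaled)
    return total / len(col_counts)

def entropy_weight(col_counts, nseq, target, eminseq, emaxseq=None, iters=60):
    """Bisect for the effective sequence number whose mean relative entropy
    meets `target`, clamped to the range [eminseq, emaxseq]."""
    hi = float(nseq) if emaxseq is None else float(emaxseq)
    lo = float(eminseq)
    if mean_relative_entropy(col_counts, hi, nseq) <= target:
        return hi  # already at or below target: no reduction needed
    if mean_relative_entropy(col_counts, lo, nseq) >= target:
        return lo  # cannot go below the minimum allowed value
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_relative_entropy(col_counts, mid, nseq) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

       Bisection works here because, with a flat pseudocount, the relative
       entropy of each column grows monotonically as the counts are scaled
       up.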

OPTIONS	CONTROLLING FILTER P7 HMM CONSTRUCTION
       For  each  CM that cmbuild constructs, an accompanying filter p7	HMM is
       built from the input alignment as well. These  options  control	filter
       HMM construction:

       --p7ere _x_
	      Set  the target mean match state relative	entropy	for the	filter
	      p7 HMM as	_x_.  By default the target relative entropy per match
	      position is 0.38 bits.

       --p7ml Use  a maximum likelihood	p7 HMM built from the CM as the	filter
	      HMM. This	HMM will be as similar as possible to  the  CM	(while
	      necessarily ignorant of secondary	structure).

       Use  --devhelp  to  see	additional, otherwise undocumented, filter HMM
       construction options.

OPTIONS	CONTROLLING FILTER P7 HMM CALIBRATION
       After building each filter HMM, cmbuild determines appropriate  E-value
       parameters to use during	filtering in cmsearch and cmscan by sampling a
       set of sequences	and searching them with	each HMM filter	 configuration
       and algorithm.

       --EmN _n_
              Set the number of sampled sequences for local MSV filter HMM
              calibration to _n_. The default is 200.

       --EvN _n_
              Set the number of sampled sequences for local Viterbi filter
              HMM calibration to _n_. The default is 200.

       --ElfN _n_
              Set the number of sampled sequences for local Forward filter
              HMM calibration to _n_. The default is 200.

       --EgfN _n_
              Set the number of sampled sequences for glocal Forward filter
              HMM calibration to _n_. The default is 200.

       Use  --devhelp  to  see	additional, otherwise undocumented, filter HMM
       calibration options.

OPTIONS	FOR REFINING THE INPUT ALIGNMENT
       --refine	_f_
              Attempt to refine the alignment before building the CM using
              expectation-maximization (EM). A CM is first built from the
              initial alignment as usual. Then, the sequences in the alignment
              are realigned optimally to the CM (with the HMM banded CYK
              algorithm; optimal means optimal given the bands), and a new CM
              is built from the resulting alignment. The sequences are then
              realigned to the new CM, and a new CM is built from that
              alignment. This continues until convergence, specifically when
              the alignments for two successive iterations are not
              significantly different (the summed bit score of all the
              sequences in the alignment changes by less than 1% between two
              successive iterations). The final alignment (the alignment used
              to build the CM that gets written to _cmfile_out_) is written to
              _f_.

       -l     With --refine, turn on the local alignment algorithm, which  al-
	      lows the alignment to span two or	more subsequences if necessary
	      (e.g. if the structures of the query model and  target  sequence
	      are  only	 partially  shared), allowing certain large insertions
	      and deletions in the structure to	be penalized differently  than
	      normal indels.  The default is to	globally align the query model
	      to the target sequences.

       --gibbs
              Modifies the behavior of --refine so that Gibbs sampling is used
              instead of EM. The difference is that during the alignment stage
              the alignment is not necessarily optimal; instead, an alignment
              (parsetree) for each sequence is sampled from the posterior
              distribution of alignments as determined by the Inside
              algorithm. Because of this sampling step, --gibbs is
              non-deterministic, so different runs with the same alignment may
              yield different results. This is not true when --refine is used
              without the --gibbs option, in which case the final alignment
              and CM will always be the same. When --gibbs is enabled, the
              --seed _n_ option can be used to seed the random number
              generator predictably, making the results reproducible. The goal
              of the --gibbs option is to help expert RNA alignment curators
              refine structural alignments by allowing them to observe
              alternative high scoring alignments.

       --seed _n_
	      Seed the random number generator with  _n_,  an  integer	>=  0.
	      This  option  can	 only be used in combination with --gibbs.  If
	      _n_ is nonzero, stochastic sampling of alignments	will be	repro-
	      ducible; the same	command	will give the same results.  If	_n_ is
	      0, the random number generator is	seeded arbitrarily,  and  sto-
	      chastic  samplings may vary from run to run of the same command.
	      The default seed is 0.

       --cyk  With --refine, align with	the CYK	algorithm. By default the  op-
	      timal  accuracy  algorithm is used. There	is more	information on
	      this in the cmalign manual page.

       --notrunc
              With --refine, turn off the truncated alignment algorithm.
              There is more information on this in the cmalign manual page.

       Use  --devhelp to see additional, otherwise undocumented, alignment re-
       finement	options	as well	as other output	file options and  options  for
       building	multiple models	for a single alignment.
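
       The --refine iteration described in this section (build a model,
       realign, rebuild, stop when the summed bit score changes by less than
       1%) can be sketched as a generic loop. The Python below is an
       outline, not cmbuild's code: build_model and realign are
       caller-supplied stand-ins for model construction and alignment, and
       the max_iter safety limit is an assumption of this sketch.

```python
def refine(alignment, build_model, realign, tol=0.01, max_iter=100):
    """Iterate build -> realign until the summed bit score changes by < tol.

    build_model(alignment) returns a model; realign(model, alignment)
    returns (new_alignment, summed_bit_score). Both are hypothetical
    callables supplied by the caller.
    """
    model = build_model(alignment)
    prev_score = None
    for _ in range(max_iter):
        alignment, score = realign(model, alignment)
        model = build_model(alignment)
        # Converged when the relative change between iterations is under tol.
        if prev_score is not None and abs(score - prev_score) < tol * abs(prev_score):
            break
        prev_score = score
    return model, alignment
```

       The returned model is always the one built from the returned
       alignment, mirroring the statement that the final alignment is the
       one used to build the CM written to _cmfile_out_.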

SEE ALSO
       See infernal(1) for a master man	page with a list of all	the individual
       man pages for programs in the Infernal package.

       For complete documentation, see the user guide that came with your
       Infernal distribution (Userguide.pdf), or see the Infernal web page.

COPYRIGHT
       Copyright (C) 2019 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your Infernal source distribution, or see the
       Infernal web page.

AUTHOR
       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147	USA
       http://eddylab.org

Infernal 1.1.3			   Nov 2019			    cmbuild(1)
