FreeBSD Manual Pages

cmalign(1)			Infernal Manual			    cmalign(1)

NAME
       cmalign - align sequences to a covariance model

SYNOPSIS
       cmalign
	      [options]	<cmfile> <seqfile>

DESCRIPTION
       cmalign	aligns	the RNA	sequences in <seqfile> to the covariance model
       (CM) in <cmfile>.  The new alignment is output to stdout	 in  Stockholm
       format, but can be redirected to	a file <f> with	the -o <f> option.

       Either  <cmfile>	 or  <seqfile> (but not	both) may be '-' (dash), which
       means reading this input	from stdin rather than a file.

       The sequence file <seqfile> must	be in FASTA or Genbank format.

       cmalign uses an HMM banding technique to	accelerate  alignment  by  de-
       fault  as  described below for the --hbanded option. HMM	banding	can be
       turned off with the --nonbanded option.

       By default, cmalign computes the	alignment with maximum expected	 accu-
       racy  that  is consistent with constraints (bands) derived from an HMM,
       using a banded version of the Durbin/Holmes optimal accuracy algorithm.
       This behavior can be changed with the --cyk or --sample options.

       cmalign takes special care  to  correctly  align	 truncated  sequences,
       where  some  nucleotides	from the beginning (5')	and/or end (3')	of the
       actual full length biological sequence are not present in the input se-
       quence (see DL Kolbe and	SR Eddy, Bioinformatics, 25:1236-1243,	2009).
       This  behavior  is on by	default, but can be turned off with --notrunc.
       In previous versions of cmalign the --sub option	was required to	appro-
       priately	handle truncated sequences. The	--sub option is	 still	avail-
       able in this version, but the new default method	for handling truncated
       sequences should	be as good or superior to the sub method in nearly all
       cases.

       The --mapali <f> option includes the fixed training alignment that was
       used to build the CM, read from file <f>, in the output alignment of
       cmalign.

       It  is  possible	to merge two or	more alignments	created	by the same CM
       using the Easel miniapp esl-alimerge (included in  the  easel/miniapps/
       subdirectory  of	 Infernal).  Previous versions of cmalign included op-
       tions to	merge alignments but they were deprecated upon development  of
       esl-alimerge, which is significantly more memory	efficient.

       By default, cmalign will output the alignment to stdout.  The alignment
       can be redirected to an output file <f> with the -o <f> option. With
       -o, information on each aligned sequence, including score and model
       alignment boundaries, will be printed to stdout (more on this below).

       The  output  alignment will be in Stockholm format by default. This can
       be changed to Pfam, aligned FASTA (AFA),	A2M, Clustal, or Phylip	format
       using the --outformat <s> option, where <s> is the name of the  desired
       format.  As a special case, if the output alignment is large (more than
       10,000 sequences or more than 10,000,000 total nucleotides), then the
       output format will be Pfam format, with each sequence appearing on a
       single line, for reasons of memory efficiency. For alignments larger
       than this, using --ileaved will force interleaved Stockholm format, but
       the user should be aware that this may require a lot of memory.
       --ileaved will only work for alignments up to 100,000 sequences or
       100,000,000 total nucleotides.
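
       The size thresholds above can be summarized in a short Python sketch.
       The function name and the decision to raise an error when --ileaved
       exceeds its limits are illustrative assumptions, not part of cmalign;
       the text only states that --ileaved "will only work" up to those sizes.

```python
# Illustrative sketch only: names and error handling are assumptions,
# the numeric limits are the ones stated in the text above.
LARGE_NSEQ = 10_000            # more sequences than this -> Pfam
LARGE_NRES = 10_000_000        # more total nucleotides than this -> Pfam
ILEAVED_MAX_NSEQ = 100_000     # --ileaved limit on sequences
ILEAVED_MAX_NRES = 100_000_000 # --ileaved limit on total nucleotides


def default_output_layout(nseq, nres, ileaved=False):
    """Return the layout expected when --outformat is not given."""
    if ileaved:
        if nseq > ILEAVED_MAX_NSEQ or nres > ILEAVED_MAX_NRES:
            # Assumption: treat an oversized --ileaved request as an error.
            raise ValueError("alignment too large for --ileaved")
        return "interleaved Stockholm"
    if nseq > LARGE_NSEQ or nres > LARGE_NRES:
        return "Pfam"
    return "Stockholm"
```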

       If the output alignment format is Stockholm or Pfam, the	output	align-
       ment  will be annotated with posterior probabilities which estimate the
       confidence level	of each	aligned	nucleotide.  This  annotation  appears
       as  lines  beginning  with "#=GR	<seq name> PP",	one per	sequence, each
       immediately below the  corresponding  aligned  sequence	"<seq  name>".
       Characters  in PP lines have 12 possible	values:	"0-9", "*", or ".". If
       ".", the	position corresponds to	a gap in the sequence. A value of  "0"
       indicates  a  posterior	probability of between 0.0 and 0.05, "1" indi-
       cates between 0.05 and 0.15, "2"	indicates between 0.15 and 0.25	and so
       on up to	"9" which indicates between 0.85 and 0.95. A value of "*"  in-
       dicates	a posterior probability	of between 0.95	and 1.0. Higher	poste-
       rior probabilities correspond to	greater	confidence  that  the  aligned
       nucleotide  belongs  where  it  appears	in the alignment.  With	--non-
       banded, the calculation of the posterior	 probabilities	considers  all
       possible	 alignments  of	 the target sequence to	the CM.	Without	--non-
       banded (i.e. in default mode), the calculation considers	only  possible
       alignments  within  the HMM bands. Further, the posterior probabilities
       are conditional on the truncation mode of the alignment.  For example,
       if the sequence alignment is truncated 5', a PP value of "9" indicates
       that a fraction between 0.85 and 0.95 of all 5' truncated alignments
       includes the given nucleotide at the given position.  The posterior
       annotation can be
       turned off with the --noprob option. If --small is  enabled,  posterior
       annotation must also be turned off using	--noprob.
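
       The PP encoding described above can be decoded with a few lines of
       Python.  This is a minimal sketch: the function names are hypothetical,
       the line layout is assumed to be four whitespace-separated tokens
       ("#=GR", the sequence name, "PP", and the annotation string), and the
       0.5 threshold is an arbitrary example value.

```python
def pp_range(ch):
    """Map one PP character to its (low, high) posterior probability range."""
    if ch == ".":
        return None                      # gap in the sequence: no probability
    if ch == "*":
        return (0.95, 1.0)
    if ch == "0":
        return (0.0, 0.05)
    if ch in "123456789":
        mid = int(ch) / 10.0             # "1" -> 0.1, ..., "9" -> 0.9
        return (round(mid - 0.05, 2), round(mid + 0.05, 2))
    raise ValueError(f"unexpected PP character: {ch!r}")


def parse_pp_line(line):
    """Split a '#=GR <seq name> PP <annotation>' line into (name, annotation)."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "#=GR" or fields[2] != "PP":
        raise ValueError("not a PP annotation line")
    return fields[1], fields[3]


def low_confidence_columns(annotation, threshold=0.5):
    """Return 0-based columns whose PP upper bound is below the threshold."""
    cols = []
    for i, ch in enumerate(annotation):
        rng = pp_range(ch)
        if rng is not None and rng[1] < threshold:
            cols.append(i)
    return cols
```

       Gap columns (".") are skipped rather than counted as low confidence,
       since they carry no posterior probability.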

       The tabular output that is printed to stdout if the -o option is used
       includes one line per sequence and twelve fields per line: "idx": the
       index of the sequence in the input file; "seq name": the sequence name;
       "length": the length of the sequence; "cm from" and "cm to": the model
       start and end positions of the alignment; "trunc": "no" if the sequence
       is not truncated, "5'" if the beginning of the sequence is truncated,
       "3'" if the end of the sequence is truncated, and "5'&3'" if both the
       beginning and the end are truncated; "bit sc": the bit score of the
       alignment; "avg pp": the average posterior probability of all aligned
       nucleotides in the alignment; "band calc", "alignment" and "total": the
       time in seconds required for calculating HMM bands, computing the
       alignment, and complete processing of the sequence, respectively; "mem
       (Mb)": the size in Mb of all dynamic programming matrices required for
       aligning the sequence.  This tabular data can be saved to file <f> with
       the --sfile <f> option.
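
       The twelve fields can be read into a dictionary with a small Python
       sketch.  The exact column layout is not specified above, so this code
       assumes that fields are separated by whitespace, that sequence names
       contain no whitespace, that every numeric field is populated, and that
       header lines start with "#"; all names in the code are hypothetical.

```python
# Field names as listed in the text; the layout handling is an assumption.
FIELDS = ("idx", "seq name", "length", "cm from", "cm to", "trunc",
          "bit sc", "avg pp", "band calc", "alignment", "total", "mem (Mb)")
INT_FIELDS = {"idx", "length", "cm from", "cm to"}
TEXT_FIELDS = {"seq name", "trunc"}


def parse_score_line(line):
    """Parse one tabular line into a dict, or return None for comments/blanks."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    tokens = stripped.split()
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    row = {}
    for name, token in zip(FIELDS, tokens):
        if name in TEXT_FIELDS:
            row[name] = token
        elif name in INT_FIELDS:
            row[name] = int(token)
        else:
            row[name] = float(token)     # scores, times and memory
    return row
```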

OPTIONS
       -h     Help; print a brief reminder of command line usage and available
	      options.

       -o <f> Save  the	 alignment in Stockholm	format to a file <f>.  The de-
	      fault is to write	it to standard output.

       -g     Configure the model for global alignment of the query model to
              the target sequences.  By default, the model is configured for
              local alignment. Local alignment allows large insertions and
              deletions in the structure, called "local ends", to be penalized
              differently than normal indels. These are annotated as "~"
              columns in the RF line of the output alignment. The -g option
              can be used to disallow these local ends.  The -g option is
              required if the --sub option is also used.

OPTIONS	FOR CONTROLLING	THE ALIGNMENT ALGORITHM
       --optacc
	      Align  sequences	using the Durbin/Holmes	optimal	accuracy algo-
	      rithm. This is the default.  The optimal accuracy	alignment will
	      be constrained by	HMM bands for acceleration unless  the	--non-
	      banded option is enabled.	 The optimal accuracy algorithm	deter-
	      mines  the  alignment that maximizes the posterior probabilities
              of the aligned nucleotides within it.  The posterior
              probabilities are determined using (possibly HMM banded)
              variants of the Inside and Outside algorithms.

       --cyk  Do not use the Durbin/Holmes optimal accuracy algorithm to align
              the sequences; instead, use the CYK algorithm, which determines
              the optimally scoring (maximum likelihood) alignment of the
              sequence to the model, given the HMM bands (unless --nonbanded
              is also enabled).

       --sample
	      Sample  an  alignment  from the posterior	distribution of	align-
	      ments.  The posterior distribution is determined	using  an  HMM
	      banded (unless --nonbanded) variant of the Inside	algorithm.

       --seed <n>
	      Seed  the	 random	 number	 generator  with <n>, an integer >= 0.
	      This option can only be used in combination with	--sample.   If
	      <n> is nonzero, stochastic sampling of alignments	will be	repro-
	      ducible; the same	command	will give the same results.  If	<n> is
	      0,  the  random number generator is seeded arbitrarily, and sto-
	      chastic samplings	may vary from run to run of the	same  command.
	      The default seed is 181.

       --notrunc
	      Turn  off	 truncated alignment algorithms.  All sequences	in the
	      input file will be assumed to be full length,  unless  --sub  is
	      also  used, in which case	the program can	still handle truncated
	      sequences	but will use an	alternative strategy for their	align-
	      ment.

       --sub  Turn  on the sub model construction and alignment	procedure. For
	      each sequence, an	HMM is first used to predict the  model	 start
	      and  end consensus columns, and a	new sub	CM is constructed that
	      only models consensus columns from start to end. The sequence is
	      then aligned to this sub CM.  Sub	alignment is an	 older	method
	      than  the	 default  one for aligning sequences that are possibly
	      truncated. By default, cmalign uses  special  DP	algorithms  to
	      handle  truncated	 sequences  which should be more accurate than
	      the sub method in	most cases.  --sub is still included as	an op-
	      tion mainly for testing against this default truncated  sequence
	      handling.	  This	"sub CM" procedure is not the same as the "sub
	      CMs" described by	Weinberg and Ruzzo.

OPTIONS	FOR CONTROLLING	SPEED AND MEMORY REQUIREMENTS
       --hbanded
	      This option is turned on by  default.  Accelerate	 alignment  by
	      pruning  away regions of the CM DP matrix	that are deemed	negli-
	      gible by an HMM.	First, each sequence is	scored with a CM  plan
	      9	HMM derived from the CM	using the Forward and Backward HMM al-
	      gorithms	to  calculate  posterior  probabilities	 that each nu-
	      cleotide aligns to each state of the HMM.	These posterior	proba-
	      bilities are used	to derive constraints (bands) on the CM	DP ma-
	      trix. Finally, the target	sequence is aligned to	the  CM	 using
	      the  banded  DP matrix, during which cells outside the bands are
	      ignored. Usually most of the full	DP  matrix  lies  outside  the
	      bands  (often  more  than	95%), making this technique faster be-
	      cause fewer DP calculations are required,	and more memory	 effi-
	      cient because only cells within the bands	need be	allocated.

              Importantly, HMM banding sacrifices the guarantee of determining
              the optimally accurate or optimally scoring alignment, which
              will be missed if it lies outside the bands. The tau parameter
              is the amount of probability mass considered negligible during
              HMM band calculation; higher values of tau yield greater
              speedups but also a greater chance of missing the optimal
              alignment. The default tau is 1E-7, determined empirically as a
              good tradeoff between sensitivity and speed, though this value
              can be changed with the --tau <x> option. The level of
              acceleration increases with both the length and primary sequence
              conservation level of the family. For example, with the default
              tau of 1E-7, tRNA models (low primary sequence conservation with
              length of about 75 nucleotides) show about 10X acceleration, and
              SSU bacterial rRNA models (high primary sequence conservation
              with length of about 1500 nucleotides) show about 700X
              acceleration.  HMM banding can be turned off with the
              --nonbanded option.
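
       The role of tau can be illustrated with a toy Python sketch.  This is
       not Infernal's implementation: it assumes a single HMM state with a
       posterior distribution over sequence positions, and it assumes that up
       to tau of probability mass is discarded from each tail independently.
       The function name and the example distribution are hypothetical.

```python
def band_from_posteriors(post, tau):
    """Return inclusive (lo, hi) position bounds after trimming both tails.

    post is a list of posterior probabilities over sequence positions for
    one state. At most tau of probability mass is dropped from each tail
    (an assumption made for this illustration).
    """
    lo, mass = 0, 0.0
    while lo < len(post) - 1 and mass + post[lo] <= tau:
        mass += post[lo]
        lo += 1
    hi, mass = len(post) - 1, 0.0
    while hi > lo and mass + post[hi] <= tau:
        mass += post[hi]
        hi -= 1
    return lo, hi


# Toy posterior over five positions (sums to 1.0).
example = [1e-9, 1e-3, 0.9, 0.097, 0.001999999]
narrow = band_from_posteriors(example, 0.01)   # larger tau, tighter band
wide = band_from_posteriors(example, 1e-7)     # default-sized tau, wider band
```

       With the larger tau the band covers fewer positions, which is why
       raising tau reduces the number of DP cells that must be filled.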

       --tau <x>
	      Set the tail loss	probability used during	HMM  band  calculation
	      to  <x>.	 This is the amount of probability mass	within the HMM
	      posterior	probabilities that is considered negligible.  The  de-
	      fault  value  is 1E-7.  In general, higher values	will result in
	      greater acceleration, but	increase the chance of missing the op-
	      timal alignment due to the HMM bands.

       --mxsize	<x>
              Set the maximum allowable total DP matrix size to <x> megabytes.
              By default this size is 1024 Mb.  This should be large enough
              for the vast majority of alignments. If it is not, cmalign will
              attempt to iteratively tighten the HMM bands it uses to
              constrain the alignment by raising the tau parameter and
              recalculating the bands until the total matrix size needed falls
              below <x> megabytes or the maximum allowable tau value (0.05 by
              default, but changeable with --maxtau) is reached. At each
              iteration of band tightening, tau is multiplied by 2.0.  The
              band tightening strategy can be turned off with the --fixedtau
              option.  If the maximum tau is reached and the required matrix
              size still exceeds <x>, or if HMM banding is not being used and
              the required matrix size exceeds <x>, then cmalign will exit
              prematurely and report an error message that the matrix exceeded
              its maximum allowable size. In this case, the --mxsize option
              can be used to raise the size limit, or the maximum tau can be
              raised with --maxtau.  The limit will commonly be exceeded when
              the --nonbanded option is used without the --small option, but
              can still occur when --nonbanded is not used. Note that if
              cmalign is being run in <n> threads on a multicore machine, then
              each thread may have an allocated matrix of up to size <x> Mb at
              any given time.
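
       The band tightening loop described above can be sketched in Python as
       follows.  The callable required_mb stands in for the band calculation
       and is purely hypothetical, as are the function and exception names.
       Capping tau at the maximum rather than stopping just before it is also
       an assumption; the text does not say which of the two cmalign does.

```python
class MatrixTooLargeError(RuntimeError):
    """Raised when the DP matrix cannot be brought under the size limit."""


def tighten_tau(required_mb, tau=1e-7, mxsize=1024.0, maxtau=0.05,
                fixedtau=False):
    """Double tau until the matrix fits in mxsize Mb or maxtau is reached.

    required_mb(tau) is a hypothetical callable returning the matrix size
    in Mb needed with bands computed at the given tau.
    """
    size = required_mb(tau)
    while size > mxsize:
        if fixedtau or tau >= maxtau:
            raise MatrixTooLargeError(
                f"matrix needs {size:.1f} Mb, limit is {mxsize:.1f} Mb")
        tau = min(tau * 2.0, maxtau)     # assumption: cap tau at maxtau
        size = required_mb(tau)
    return tau, size


def toy_size(tau):
    """Toy size model: 3000 Mb with loose bands, 500 Mb once tau >= 1e-6."""
    return 3000.0 if tau < 1e-6 else 500.0


final_tau, final_size = tighten_tau(toy_size)
```

       In this toy run tau is doubled four times, from 1E-7 to 1.6E-6, before
       the matrix fits under the default 1024 Mb limit.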

       --fixedtau
	      Turn off the HMM band tightening strategy	described in  the  ex-
	      planation	of the --mxsize	option above.

       --maxtau	<x>
	      Set  the	maximum	 allowed value for tau during band tightening,
	      described	in the explanation of --mxsize above, to <x>.  By  de-
	      fault this value is 0.05.

       --nonbanded
	      Turns  off  HMM banding. The returned alignment is guaranteed to
	      be the globally optimally	accurate one (by default) or the glob-
	      ally optimally scoring one (if --cyk is enabled).	  The  --small
	      option  is  recommended in combination with this option, because
	      standard alignment without HMM banding requires a	lot of	memory
	      (see --small ).

       --small
              Use the divide and conquer CYK alignment algorithm described in
              SR Eddy, BMC Bioinformatics 3:18, 2002. The --nonbanded option
              must be used in combination with this option.  Also, it is
              recommended whenever --nonbanded is used that --small is also
              used, because standard CM alignment without HMM banding requires
              a lot of memory, especially for large RNAs.  --small allows CM
              alignment within practical memory limits, reducing the memory
              required for aligning LSU rRNA, the largest known RNAs, from 150
              Gb to less than 300 Mb.  This option can only be used in
              combination with --noprob, --nonbanded, --notrunc, and --cyk.

OPTIONAL OUTPUT	FILES
       --sfile <f>
              Dump per-sequence alignment score and timing information to file
	      <f>.   The format	of this	file is	described above	(it's the same
	      data in the same format as the tabular stdout output when	the -o
	      option is	used).

       --tfile <f>
	      Dump tabular sequence tracebacks for each	individual sequence to
	      a	file <f>.  Primarily useful for	debugging.

       --ifile <f>
	      Dump per-sequence	insert information to file <f>.	 The format of
	      the file is described by "#"-prefixed comment lines included  at
	      the  top	of the file <f>.  The insert information is valid even
	      when the --matchonly option is used.

       --elfile	<f>
	      Dump per-sequence	EL state (local	 end)  insert  information  to
	      file  <f>.   The format of the file is described by "#"-prefixed
	      comment lines included at	the top	of the file <f>.  The  EL  in-
	      sert  information	 is  valid even	when the --matchonly option is
	      used.

OTHER OPTIONS
       --mapali	<f>
              Read the alignment in file <f> that was used to build the model
              and align it as a single object to the CM; i.e. the alignment in
              <f> is held fixed.  This allows you to align sequences to a
              model with cmalign and view them in the context of an existing
              trusted multiple alignment.  <f> must be the alignment file that
              the CM was built from. The program verifies that the checksum of
              the file matches that of the file used to construct the CM. A
              similar option to this one was called --withali in previous
              versions of cmalign.

       --mapstr
	      Must be used in combination with --mapali	<f>.  Propagate	struc-
	      tural information	for any	pseudoknots that exist in <f>  to  the
	      output  alignment.  A  similar  option  to  this	one was	called
	      --withstr	in previous versions of	cmalign.

       --informat <s>
	      Assert that the input <seqfile> is in format <s>.	  Do  not  run
              Babelfish format autodetection. This increases the reliability of
	      the program somewhat, because the	Babelfish can  make  mistakes;
	      particularly recommended for unattended, high-throughput runs of
	      Infernal.	  Acceptable  formats  are:  FASTA, GENBANK, and DDBJ.
	      <s> is case-insensitive.

       --outformat <s>
	      Specify the output alignment format as <s>.  Acceptable  formats
	      are: Pfam, AFA, A2M, Clustal, and	Phylip.	 AFA is	aligned	fasta.
	      Only Pfam	and Stockholm alignment	formats	will include consensus
	      structure	 annotation  and  posterior  probability annotation of
	      aligned residues.

       --dnaout
	      Output the alignments as DNA sequence alignments,	instead	of RNA
	      ones.

       --noprob
	      Do not annotate the output alignment with	 posterior  probabili-
	      ties.

       --matchonly
	      Only  include  match columns in the output alignment, do not in-
	      clude any	insertions relative to the consensus model.  This  op-
	      tion  may	be useful when creating	very large alignments that re-
	      quire a lot of memory and	disk space, most of which is necessary
	      only to deal with	insert columns	that  are  gaps	 in  most  se-
	      quences.

       --miss In  the  output alignment, use missing data characters ('~') be-
	      fore the first residue and after the final residue of  each  se-
	      quence  to  indicate  the	 sequence was aligned with a truncated
	      alignment	algorithm. The aligned sequences would	be  considered
	      fragments	if the alignment was used subsequently as input	to cm-
	      build  with the --fraggiven option. This option has no effect if
	      --notrunc	is also	used.

       --ileaved
	      Output the alignment in interleaved Stockholm format of a	 fixed
	      width  that may be more convenient for examination. This was the
	      default output alignment format of previous versions of cmalign.
	      Note that	cmalign	requires more memory when this option is used.
	      For this reason, --ileaved will only work	for alignments	of  up
	      to  100,000  sequences  or  a  total  of 100,000,000 aligned nu-
	      cleotides.

       --flanktoins <x1>
              Change the transition probabilities from the ROOT_S state to the
              ROOT_IL and ROOT_IR states, and from the ROOT_IL to the ROOT_IR
              state, to <x1>.  This option is meant to be helpful only when
              aligning sequences that include extra sequence at the 5' and/or
              3' ends.  Without this option, cmalign tends to produce poor
              alignments at the ends, especially for models with zero
              basepairs.  This option should not be necessary when aligning
              sequences identified by cmsearch or cmscan because they should
              not include extra sequence at the ends.  This option must be
              used in combination with the --flankselfins <x2> option.
              Recommended values are 0.1 for <x1> and 0.8 for <x2>, but the
              best performing pair of values may vary for different models.
              <x1> must be greater than 0.0 and less than 0.4, and the sum of
              <x1> and <x2> must be less than 0.95.

       --flankselfins <x2>
	      Change the self-transition probabilities	for  the  ROOT_IL  and
	      ROOT_IR states to	<x2>.  This option must	be used	in combination
	      with  the	 --flanktoins <x1> option. See the explanation of that
	      option above for more information.
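
       The numeric constraints on <x1> and <x2> stated for --flanktoins can
       be checked before launching a run.  The helper below is a hypothetical
       Python sketch that tests only the two constraints given in the text; it
       does not claim to reproduce any other checks cmalign may perform.

```python
def check_flank_values(x1, x2):
    """Return a list of violated constraints (empty if the pair is acceptable)."""
    problems = []
    if not (0.0 < x1 < 0.4):
        problems.append("<x1> must be greater than 0.0 and less than 0.4")
    if not (x1 + x2 < 0.95):
        problems.append("the sum of <x1> and <x2> must be less than 0.95")
    return problems


# The recommended starting pair from the text passes both checks.
recommended_ok = check_flank_values(0.1, 0.8) == []
```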

       --regress <s>
	      Save an additional copy of the output alignment with  no	author
	      information to file <s>.

       --verbose
	      Output additional	information in the tabular scores output (out-
	      put  to stdout if	-o is used, or to <f> if --sfile <f> is	used).
              This information is mainly useful for testing and debugging.

       --cpu <n>
	      Set the number of	parallel worker	threads	to <n>.	 On  multicore
	      machines,	the default is 4.  You can also	control	this number by
	      setting an environment variable, INFERNAL_NCPU.  There is	also a
	      master  thread,  so  the	actual number of threads that Infernal
	      spawns is	<n>+1.	This option is not available if	 Infernal  was
	      compiled with POSIX threads support turned off.

       --mpi  Run  as an MPI parallel program. This option will	only be	avail-
	      able if Infernal has been	configured and built with  the	"--en-
	      able-mpi"	 flag  (see the	Installation section of	the user guide
	      for more information).
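
       Several options described above may only be used together with other
       options.  The following Python sketch collects those stated
       dependencies in one table and reports which companion options are
       missing from a planned command line.  The table and function names are
       hypothetical, and the table covers only the "must be used with"
       relationships stated on this page, not every check cmalign performs.

```python
# Each key may only be used when every option in its value set is also given.
REQUIRES = {
    "--sub": {"-g"},
    "--seed": {"--sample"},
    "--small": {"--noprob", "--nonbanded", "--notrunc", "--cyk"},
    "--mapstr": {"--mapali"},
    "--flanktoins": {"--flankselfins"},
    "--flankselfins": {"--flanktoins"},
}


def missing_companions(options):
    """Map each given option to the sorted list of companions it still needs."""
    given = set(options)
    missing = {}
    for opt in sorted(given):
        needed = REQUIRES.get(opt, set()) - given
        if needed:
            missing[opt] = sorted(needed)
    return missing


# Example: --small was requested without --noprob and --notrunc.
report = missing_companions(["--small", "--nonbanded", "--cyk"])
```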

SEE ALSO
       See infernal(1) for a master man	page with a list of all	the individual
       man pages for programs in the Infernal package.

       For complete documentation, see the user	guide that came	with your  In-
       fernal  distribution  (Userguide.pdf);  or  see	the  Infernal web page
       (http://eddylab.org/infernal/).

COPYRIGHT
       Copyright (C) 2023 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For additional information on copyright and  licensing,	see  the  file
       called  COPYRIGHT  in your Infernal source distribution,	or see the In-
       fernal web page (http://eddylab.org/infernal/).

AUTHOR
       http://eddylab.org

Infernal 1.1.5			   Sep 2023			    cmalign(1)