FreeBSD Manual Pages

hmmbuild(1)			 HMMER Manual			   hmmbuild(1)

NAME
       hmmbuild	- construct profiles from multiple sequence alignments

SYNOPSIS
       hmmbuild	[options] hmmfile msafile

DESCRIPTION
       For each	multiple sequence alignment in msafile build a profile HMM and
       save it to a new	file hmmfile.

       msafile	may  be	 '-' (dash), which means reading this input from stdin
       rather than a file.

       hmmfile may not be '-' (stdout),	because	sending	the HMM	file to	stdout
       would conflict with the other text output of the	program.
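
       As an illustration only, the calling convention above can be sketched
       as a small Python helper that assembles an argument vector.  The
       function name hmmbuild_argv is hypothetical and not part of HMMER; the
       helper builds the list and does not run the program.

```python
from typing import List, Sequence


def hmmbuild_argv(hmmfile: str, msafile: str,
                  options: Sequence[str] = ()) -> List[str]:
    """Assemble arguments in SYNOPSIS order: hmmbuild [options] hmmfile msafile.

    Hypothetical helper for illustration; it is not shipped with HMMER.
    """
    if hmmfile == "-":
        # The HMM file cannot go to stdout (see DESCRIPTION).
        raise ValueError("hmmfile may not be '-'")
    # msafile may be '-' to read the alignment from stdin.
    return ["hmmbuild", *options, hmmfile, msafile]
```

       The resulting list could be handed to subprocess.run(), with the
       alignment supplied on stdin when msafile is '-'.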

OPTIONS
       -h     Help; print a brief reminder  of	command	 line  usage  and  all
	      available	options.

       -n <s> Name the new profile <s>.  The default is to use the name of the
              alignment (if one is present in the msafile) or, failing that,
              the name of the hmmfile.  If msafile contains more than one
              alignment, -n cannot be used, and every alignment must have a
              name annotated in the msafile (as in Stockholm #=GF ID
              annotation).
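
              The naming rules can be sketched in Python as follows.  This is
              a reading of the paragraph above, not HMMER's code: the
              function name is hypothetical, and using the hmmfile string
              unchanged as the fallback name is an assumption.

```python
from typing import List, Optional


def resolve_profile_names(n_option: Optional[str],
                          alignment_names: List[Optional[str]],
                          hmmfile: str) -> List[str]:
    """Return one profile name per alignment, following the -n rules."""
    if not alignment_names:
        return []
    if len(alignment_names) > 1:
        if n_option is not None:
            raise ValueError("-n cannot be used with more than one alignment")
        if any(name is None for name in alignment_names):
            raise ValueError("every alignment must carry a name annotation")
        return [name for name in alignment_names if name is not None]
    if n_option is not None:
        return [n_option]
    only = alignment_names[0]
    # Assumption: the hmmfile string is used as-is when no name is present.
    return [only if only is not None else hmmfile]
```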

       -o <f> Direct the summary output	to file	<f>, rather than to stdout.

       -O <f> After each model is constructed, resave annotated, possibly mod-
	      ified  source alignments to a file <f> in	Stockholm format.  The
	      alignments are annotated with a reference	annotation line	 indi-
	      cating  which  columns were assigned as consensus, and sequences
	      are annotated with what relative sequence	weights	were assigned.
	      Some residues of the alignment may have been shifted to accommo-
	      date restrictions	of the Plan7 profile architecture, which  dis-
	      allows transitions between insert	and delete states.

OPTIONS	FOR SPECIFYING THE ALPHABET
       --amino
	      Assert that sequences in msafile are protein, bypassing alphabet
	      autodetection.

       --dna  Assert that sequences in msafile are DNA,	bypassing alphabet au-
	      todetection.

       --rna  Assert that sequences in msafile are RNA,	bypassing alphabet au-
	      todetection.

OPTIONS	CONTROLLING PROFILE CONSTRUCTION
       These  options  control	how consensus columns are defined in an	align-
       ment.

       --fast Define consensus columns as those	that have a fraction  >=  sym-
	      frac  of	residues as opposed to gaps. (See below	for the	--sym-
	      frac option.) This is the	default.

       --hand Define consensus columns in the next profile using the reference
              annotation of the multiple alignment.  This allows you to define
              any consensus columns you like.

       --symfrac <x>
	      Define the residue fraction threshold necessary to define	a con-
	      sensus  column when using	the --fast option. The default is 0.5.
	      The symbol fraction in each column is  calculated	 after	taking
	      relative sequence	weighting into account,	and ignoring gap char-
	      acters  corresponding  to	ends of	sequence fragments (as opposed
	      to internal insertions/deletions).  Setting this	to  0.0	 means
	      that every alignment column will be assigned as consensus, which
	      may  be  useful in some cases. Setting it	to 1.0 means that only
	      columns that include 0 gaps (internal insertions/deletions) will
	      be assigned as consensus.
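
              A simplified Python sketch of the --fast rule follows.  It
              applies the weighted residue fraction test described above but
              does not model the special handling of sequence fragments; the
              function name and the gap symbols '-' and '.' are assumptions
              made for the example.

```python
from typing import List, Sequence

GAP_CHARS = set("-.")  # assumed gap symbols for this sketch


def consensus_columns(rows: Sequence[str], weights: Sequence[float],
                      symfrac: float = 0.5) -> List[bool]:
    """Mark a column as consensus when its weighted fraction of residues
    (as opposed to gaps) is >= symfrac. Fragment handling is omitted."""
    total = sum(weights)
    marks = []
    for j in range(len(rows[0])):
        residue_weight = sum(w for row, w in zip(rows, weights)
                             if row[j] not in GAP_CHARS)
        marks.append(residue_weight / total >= symfrac)
    return marks
```

              With symfrac set to 0.0 every column passes the test, and with
              1.0 only gap-free columns do, matching the two limiting cases
              described above.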

       --fragthresh <x>
	      We only want to count terminal gaps as deletions if the  aligned
	      sequence	is  known  to  be full-length, not if it is a fragment
	      (for instance, because only part of  it  was  sequenced).	 HMMER
	      uses  a  simple  rule  to	infer fragments: if the	range of a se-
	      quence in	the alignment (the number of alignment columns between
	      the first	and last positions of the sequence) is	less  than  or
	      equal  to	 a fraction <x>	times the alignment length in columns,
	      then the sequence	is handled as a	fragment. The default is  0.5.
	      Setting  --fragthresh  0 will define no (nonempty) sequence as a
	      fragment;	you might want to do this if you  know	you've	got  a
	      carefully	 curated  alignment of full-length sequences.  Setting
	      --fragthresh 1 will define all sequences as fragments; you might
	      want to do this if you know your alignment is entirely  composed
	      of  fragments,  such  as	translated  short reads	in metagenomic
	      shotgun data.
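
              The fragment rule can be sketched in Python as below.  The
              sketch assumes that the range of a sequence counts columns from
              its first to its last residue inclusive, and that '-' and '.'
              are gap symbols; the function name is hypothetical.

```python
GAP_CHARS = set("-.")  # assumed gap symbols for this sketch


def is_fragment(row: str, fragthresh: float = 0.5) -> bool:
    """Apply the rule: range <= fragthresh * alignment length."""
    positions = [i for i, ch in enumerate(row) if ch not in GAP_CHARS]
    # Assumption: the range is inclusive; an empty row has range 0.
    span = positions[-1] - positions[0] + 1 if positions else 0
    return span <= fragthresh * len(row)
```

              Under these assumptions a nonempty row always has a range of at
              least one column, so a threshold of 0 marks none of them, while
              a threshold of 1 marks every row.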

OPTIONS	CONTROLLING RELATIVE WEIGHTS
       HMMER uses an ad	hoc sequence weighting algorithm to downweight closely
       related sequences and upweight distantly	related	ones. This has the ef-
       fect of making models less biased by  uneven  phylogenetic  representa-
       tion. For example, two identical	sequences would	typically each receive
       half  the  weight that one sequence would.  These options control which
       algorithm gets used.

       --wpb  Use  the	Henikoff  position-based  sequence  weighting	scheme
	      [Henikoff	 and  Henikoff,	J. Mol.	Biol. 243:574, 1994].  This is
	      the default.
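
              For orientation, the textbook form of position-based weighting
              can be sketched in Python.  This is the published scheme in its
              simplest form, with raw, unnormalized weights; HMMER's own
              implementation may differ in detail.  The function name and gap
              symbols are assumptions.

```python
from collections import Counter
from typing import List, Sequence

GAP_CHARS = set("-.")  # assumed gap symbols for this sketch


def position_based_weights(rows: Sequence[str]) -> List[float]:
    """In each column, a sequence earns 1 / (k * n), where k is the number
    of distinct residues in the column and n is how many sequences share
    this sequence's residue. Weights are summed over columns."""
    weights = [0.0] * len(rows)
    for j in range(len(rows[0])):
        counts = Counter(row[j] for row in rows if row[j] not in GAP_CHARS)
        k = len(counts)
        for i, row in enumerate(rows):
            if row[j] not in GAP_CHARS:
                weights[i] += 1.0 / (k * counts[row[j]])
    return weights
```

              For the toy alignment AC, AC, GT, the two identical rows each
              end up with half the weight of the third row.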

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm
              [Gerstein et al., J. Mol. Biol. 235:1067, 1994].

       --wblosum
	      Use  the	same clustering	scheme that was	used to	weight data in
	      calculating BLOSUM substitution matrices [Henikoff and Henikoff,
              Proc. Natl. Acad. Sci. 89:10915, 1992].  Sequences are
              single-linkage clustered at an identity threshold (default 0.62;
              see --wid), and within each cluster of c sequences, each
              sequence gets relative weight 1/c.

       --wnone
	      No relative weights. All sequences are assigned uniform weight.

       --wid <x>
	      Sets  the	 identity  threshold used by single-linkage clustering
	      when using --wblosum.  Invalid with any other weighting  scheme.
	      Default is 0.62.
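
              The --wblosum scheme can be sketched in Python as below.  Two
              details are assumptions made for the example: pairwise identity
              is measured over columns where both rows have a residue, and a
              pair is linked when its identity is >= the threshold.  The
              function names are hypothetical.

```python
from typing import List, Sequence

GAP_CHARS = set("-.")  # assumed gap symbols for this sketch


def pairwise_identity(a: str, b: str) -> float:
    """Identical residues / columns where both rows have a residue
    (an assumed definition, chosen for simplicity)."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in GAP_CHARS and y not in GAP_CHARS]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(rows: Sequence[str], wid: float = 0.62) -> List[float]:
    """Single-linkage cluster rows at identity >= wid, then give each
    member of a cluster of c sequences the weight 1/c."""
    n = len(rows)
    parent = list(range(n))  # union-find forest over row indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(rows[i], rows[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    return [1.0 / roots.count(r) for r in roots]
```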

OPTIONS	CONTROLLING EFFECTIVE SEQUENCE NUMBER
       After  relative weights are determined, they are	normalized to sum to a
       total effective sequence	number,	eff_nseq.  This	number may be the  ac-
       tual  number  of	 sequences  in	the alignment, but it is almost	always
       smaller than that.  The default entropy weighting method	 (--eent)  re-
       duces  the  effective sequence number to	reduce the information content
       (relative entropy, or average expected score on true homologs) per con-
       sensus position.	The target relative entropy is controlled by a two-pa-
       rameter function, where the two parameters are settable with --ere  and
       --esigma.

       --eent Adjust  effective	sequence number	to achieve a specific relative
	      entropy per position (see	--ere).	 This is the default.

       --eclust
	      Set effective sequence number to the  number  of	single-linkage
	      clusters at a specific identity threshold	(see --eid).  This op-
	      tion  is	not  recommended;  it's	for experiments	evaluating how
	      much better --eent is.

       --enone
	      Turn off effective sequence number determination	and  just  use
	      the  actual number of sequences. One reason you might want to do
	      this is to try to	maximize the relative entropy/position of your
	      model, which may be useful for short models.

       --eset <x>
	      Explicitly set the effective sequence number for all  models  to
	      <x>.

       --ere <x>
	      Set  the	minimum	 relative entropy/position target to <x>.  Re-
	      quires --eent.  Default depends on the  sequence	alphabet.  For
	      protein  sequences, it is	0.59 bits/position; for	nucleotide se-
	      quences, it is 0.45 bits/position.

       --esigma	<x>
	      Sets the minimum relative	entropy	contributed by an entire model
	      alignment, over its whole	length.	This has the effect of	making
	      short  models  have  higher  relative  entropy per position than
	      --ere alone would	give. The default is 45.0 bits.
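
              One simplified way to picture how --ere and --esigma interact
              is sketched below in Python.  The formula is an assumption
              chosen to match the description (a per-position floor from
              --ere and a whole-model floor from --esigma); the function
              HMMER actually uses may differ in detail.

```python
def target_relent_per_position(model_length: int, ere: float = 0.59,
                               esigma: float = 45.0) -> float:
    """Assumed form: the per-position target is the larger of ere and the
    value needed for the whole model to reach esigma bits in total."""
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    return max(ere, esigma / model_length)
```

              Under this assumed form, a protein model of length 50 gets a
              higher per-position target than --ere alone would give, while
              a model of length 200 stays at the --ere value.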

       --eid <x>
              Sets the fractional pairwise identity cutoff used by
              single-linkage clustering with the --eclust option.  The default
              is 0.62.

OPTIONS	CONTROLLING PRIORS
       By default, weighted counts are converted to mean posterior probability
       parameter estimates using mixture Dirichlet  priors.   Default  mixture
       Dirichlet prior parameters for protein models and for nucleic acid (RNA
       and  DNA) models	are built in. The following options allow you to over-
       ride the	default	priors.

       --pnone
	      Don't use	any priors. Probability	parameters will	simply be  the
	      observed frequencies, after relative sequence weighting.

       --plaplace
	      Use a Laplace +1 prior in	place of the default mixture Dirichlet
	      prior.
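
              The two alternatives can be illustrated in Python with a single
              set of weighted counts.  This sketch covers only --pnone and
              --plaplace, not the default mixture Dirichlet priors; the
              function name is hypothetical.

```python
from typing import Dict


def estimate_probabilities(counts: Dict[str, float],
                           prior: str = "none") -> Dict[str, float]:
    """prior="none": observed frequencies.
    prior="laplace": add one pseudocount to every symbol first."""
    pseudo = {"none": 0.0, "laplace": 1.0}[prior]
    total = sum(counts.values()) + pseudo * len(counts)
    # With prior="none" and all-zero counts, total is 0 and this raises
    # ZeroDivisionError; the sketch does not handle that case.
    return {sym: (c + pseudo) / total for sym, c in counts.items()}
```

              With no prior, a symbol that was never observed gets
              probability zero; with the Laplace prior it keeps a small
              nonzero probability.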

OPTIONS	CONTROLLING SINGLE SEQUENCE SCORING
       By default, if a query is a single sequence from a file in FASTA
       format, hmmbuild constructs a search model from that sequence and a
       standard 20x20 substitution matrix for residue probabilities, along
       with two additional parameters for position-independent gap open and
       gap extend probabilities.  These options allow the default
       single-sequence scoring parameters to be changed, and allow
       single-sequence scoring to be applied to a single sequence coming from
       an aligned format.

       --singlemx
              If a single sequence query comes from a multiple sequence
              alignment file, such as one in Stockholm format, the search
              model is by default constructed as is typically done for
              multiple sequence alignments.  This option forces hmmbuild to
              use the single-sequence method with a substitution score matrix.

       --mx <s>
	      Obtain residue alignment probabilities from the built-in substi-
	      tution  matrix  named <s>.  Several standard matrices are	built-
	      in, and do not need to be	read from files.  The matrix name  <s>
	      can  be  PAM30,  PAM70, PAM120, PAM240, BLOSUM45,	BLOSUM50, BLO-
	      SUM62, BLOSUM80, BLOSUM90, or DNA1.  Only	one of	the  --mx  and
	      --mxfile options may be used.

       --mxfile	<mxfile>
	      Obtain residue alignment probabilities from the substitution ma-
	      trix in file <mxfile>.  The default score	matrix is BLOSUM62 for
	      protein  sequences, and DNA1 for nucleotide sequences (these ma-
	      trices are internal to HMMER and do not need to be available  as
	      a	 file).	  The  format of a substitution	matrix <mxfile>	is the
	      standard format accepted by BLAST,  FASTA,  and  other  sequence
	      analysis software.  See ftp.ncbi.nlm.nih.gov/blast/matrices/ for
	      example  files.  (The  only exception: we	require	matrices to be
	      square, so for DNA, use files like NCBI's	NUC.4.4, not NUC.4.2.)

       --popen <x>
	      Set the gap open probability for a single	sequence  query	 model
	      to <x>.  The default is 0.02.  <x> must be >= 0 and < 0.5.

       --pextend <x>
	      Set the gap extend probability for a single sequence query model
	      to <x>.  The default is 0.4.  <x>	must be	>= 0 and < 1.0.

OPTIONS	CONTROLLING E-VALUE CALIBRATION
       The  location  parameters  for the expected score distributions for MSV
       filter scores, Viterbi filter scores, and Forward scores	require	 three
       short random sequence simulations.

       --EmL <n>
	      Sets  the	sequence length	in simulation that estimates the loca-
	      tion parameter mu	for MSV	filter E-values. Default is 200.

       --EmN <n>
	      Sets the number of sequences in simulation  that	estimates  the
	      location parameter mu for	MSV filter E-values. Default is	200.

       --EvL <n>
	      Sets  the	sequence length	in simulation that estimates the loca-
	      tion parameter mu	for Viterbi filter E-values. Default is	200.

       --EvN <n>
	      Sets the number of sequences in simulation  that	estimates  the
	      location	parameter  mu  for Viterbi filter E-values. Default is
	      200.

       --EfL <n>
	      Sets the sequence	length in simulation that estimates the	 loca-
	      tion parameter tau for Forward E-values. Default is 100.

       --EfN <n>
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

OTHER OPTIONS
       --cpu <n>
	      Set  the number of parallel worker threads to <n>.  On multicore
	      machines,	the default is 2.  You can also	control	this number by
	      setting an environment variable, HMMER_NCPU.  There  is  also  a
	      master thread, so	the actual number of threads that HMMER	spawns
	      is <n>+1.

	      This  option  is	not available if HMMER was compiled with POSIX
	      threads support turned off.

       --informat <s>
              Assert that input msafile is in alignment format <s>, bypassing
              format autodetection.  Common choices for <s> include:
              stockholm, a2m, afa, psiblast, clustal, phylip.  For more
              information, and for codes for some less common formats, see
              the main documentation.  The string <s> is case-insensitive (a2m
              or A2M both work).

       --seed <n>
	      Seed the random number generator with <n>, an integer >= 0.   If
	      <n> is nonzero, any stochastic simulations will be reproducible;
	      the  same	 command will give the same results.  If <n> is	0, the
	      random number generator is seeded	 arbitrarily,  and  stochastic
	      simulations  will	vary from run to run of	the same command.  The
	      default seed is 42.
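
              The seeding convention can be mimicked in Python as an analogy.
              The sketch uses Python's own random module, not HMMER's
              generator, so the numbers it produces are unrelated to
              HMMER's; the function name is hypothetical.

```python
import random


def make_rng(seed: int = 42) -> random.Random:
    """Nonzero seed: reproducible stream. Zero: arbitrary seeding."""
    if seed < 0:
        raise ValueError("seed must be an integer >= 0")
    if seed == 0:
        # No explicit seed: Python seeds from an OS-provided source.
        return random.Random()
    return random.Random(seed)
```

              Two generators created with the same nonzero seed produce the
              same stream, which is the property that makes repeated runs of
              the same command give the same results.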

       --w_beta	<x>
	      Window length tail mass.	The upper bound, W, on the  length  at
	      which  nhmmer  expects  to  find an instance of the model	is set
	      such that	the fraction of	all sequences generated	by  the	 model
	      with length >= W is less than <x>.  The default is 1e-7.
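
              The tail-mass criterion can be sketched in Python for a known
              length distribution.  The input list length_pmf is a
              hypothetical stand-in for the model's length distribution, and
              choosing the smallest W that satisfies the criterion is an
              assumption; how HMMER derives the distribution is not shown.

```python
from typing import Sequence


def window_length(length_pmf: Sequence[float], w_beta: float = 1e-7) -> int:
    """length_pmf[L] is the probability of generating a sequence of
    length L. Return the smallest W with P(length >= W) < w_beta."""
    tail = 0.0
    best = len(length_pmf)  # beyond the support the tail mass is 0
    # Walk down from the longest length, accumulating P(length >= w).
    for w in range(len(length_pmf) - 1, -1, -1):
        tail += length_pmf[w]
        if tail >= w_beta:
            break
        best = w
    return best
```

              A smaller <x> pushes W further into the tail of the
              distribution, which yields a larger window.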

       --w_length <n>
	      Override the model instance length upper bound, W, which is oth-
	      erwise  controlled  by  --w_beta.	  It should be larger than the
	      model length. The	value of W is used deep	 in  the  acceleration
	      pipeline,	 and modest changes are	not expected to	impact results
	      (though larger values of W do lead to longer run time).

       --mpi  Run as a parallel MPI program.  Each alignment is assigned to an
              MPI worker node for construction.  (Therefore, the maximum
              parallelization cannot exceed the number of alignments in the
              input msafile.)  This is useful when building large profile
              libraries.  This option is only available if optional MPI
              capability was enabled at compile-time.

       --stall
	      For debugging MPI	parallelization: arrest	program	execution  im-
	      mediately	 after start, and wait for a debugger to attach	to the
	      running process and release the arrest.

       --maxinsertlen <n>
	      Restrict insert length parameterization such that	 the  expected
	      insert length at each position of	the model is no	more than <n>.

SEE ALSO
       See  hmmer(1)  for  a master man	page with a list of all	the individual
       man pages for programs in the HMMER package.

       For complete documentation, see the user guide that came with your
       HMMER distribution (Userguide.pdf); or see the HMMER web page
       (http://hmmer.org/).

COPYRIGHT
       Copyright (C) 2023 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For  additional	information  on	 copyright and licensing, see the file
       called COPYRIGHT	in your	HMMER source distribution, or  see  the	 HMMER
       web page	(http://hmmer.org/).

AUTHOR
       http://eddylab.org

HMMER 3.4			   Aug 2023			   hmmbuild(1)