hmmbuild(1)			 HMMER Manual			   hmmbuild(1)

       hmmbuild	- construct profiles from multiple sequence alignments

       hmmbuild	[options] hmmfile msafile

       For each	multiple sequence alignment in msafile build a profile HMM and
       save it to a new	file hmmfile.

       msafile may be '-' (dash), which	means reading this  input  from	 stdin
       rather than a file.

       hmmfile may not be '-' (stdout),	because	sending	the HMM	file to	stdout
       would conflict with the other text output of the	program.
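       The argument rules above can be sketched as a small helper that
       assembles a command line. This is only an illustration: the function
       name hmmbuild_command and the file name globins.hmm are hypothetical,
       and only the -n and -o options documented on this page are used.

```python
import shlex


def hmmbuild_command(hmmfile, msafile, name=None, summary_file=None):
    """Assemble an argument list for `hmmbuild [options] hmmfile msafile`.

    Hypothetical helper, not part of HMMER. It only encodes the rules
    stated in the man page text.
    """
    if hmmfile == "-":
        # The HMM file cannot go to stdout; it would mix with the summary text.
        raise ValueError("hmmfile may not be '-'")
    args = ["hmmbuild"]
    if name is not None:
        args += ["-n", name]          # name the new profile
    if summary_file is not None:
        args += ["-o", summary_file]  # send summary output to a file
    args += [hmmfile, msafile]        # msafile may be '-' to read stdin
    return args


# Read the alignment from stdin and name the resulting profile.
cmd = hmmbuild_command("globins.hmm", "-", name="globins")
print(shlex.join(cmd))
```

       The resulting list could be handed to subprocess.run() with the
       alignment supplied on standard input.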

       -h     Help; print a brief reminder  of	command	 line  usage  and  all
	      available	options.

       -n _s_ Name the new profile _s_.  The default is to use the name of the
              alignment (if one is present in the msafile) or, failing that,
              the name of the hmmfile.  If msafile contains more than one
              alignment, -n cannot be used, and every alignment must have a
              name annotated in the msafile (as in Stockholm #=GF ID
              annotation).

       -o _f_ Direct the summary output	to file	_f_, rather than to stdout.

       -O _f_ After each model is constructed, resave annotated, possibly mod-
	      ified source alignments to a file	_f_ in Stockholm format.   The
	      alignments  are annotated	with a reference annotation line indi-
	      cating which columns were	assigned as consensus,	and  sequences
	      are annotated with what relative sequence	weights	were assigned.
	      Some residues of the alignment may have been shifted to accommo-
	      date  restrictions of the	Plan7 profile architecture, which dis-
	      allows transitions between insert	and delete states.

       --amino
              Assert that sequences in msafile are protein, bypassing alphabet
              autodetection.

       --dna  Assert that sequences in msafile are DNA, bypassing alphabet
              autodetection.

       --rna  Assert that sequences in msafile are RNA, bypassing alphabet
              autodetection.

       These options control how consensus columns are defined in an
       alignment.

       --fast Define consensus columns as those	that have a fraction  >=  sym-
	      frac  of	residues as opposed to gaps. (See below	for the	--sym-
	      frac option.) This is the	default.

       --hand Define consensus columns in the next profile using the reference
              annotation of the multiple alignment.  This allows you to define
              any consensus columns you like.

       --symfrac _x_
	      Define the residue fraction threshold necessary to define	a con-
	      sensus  column when using	the --fast option. The default is 0.5.
	      The symbol fraction in each column is  calculated	 after	taking
	      relative sequence	weighting into account,	and ignoring gap char-
	      acters corresponding to ends of sequence fragments  (as  opposed
	      to  internal  insertions/deletions).   Setting this to 0.0 means
	      that every alignment column will be assigned as consensus, which
	      may  be  useful in some cases. Setting it	to 1.0 means that only
	      columns that include 0 gaps (internal insertions/deletions) will
	      be assigned as consensus.

       --fragthresh _x_
	      We  only want to count terminal gaps as deletions	if the aligned
	      sequence is known	to be full-length, not if  it  is  a  fragment
	      (for  instance,  because	only  part of it was sequenced). HMMER
	      uses a simple rule to infer fragments: if	the  range  of	a  se-
	      quence in	the alignment (the number of alignment columns between
	      the first	and last positions of the sequence) is	less  than  or
	      equal  to	 a fraction _x_	times the alignment length in columns,
	      then the sequence	is handled as a	fragment. The default is  0.5.
	      Setting  --fragthresh  0 will define no (nonempty) sequence as a
	      fragment;	you might want to do this if you  know	you've	got  a
	      carefully	 curated  alignment of full-length sequences.  Setting
	      --fragthresh 1 will define all sequences as fragments; you might
	      want  to do this if you know your	alignment is entirely composed
	      of fragments, such as  translated	 short	reads  in  metagenomic
	      shotgun data.
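       The fragment rule and the --fast consensus rule described above can be
       sketched together as follows. This is a simplified illustration, not
       HMMER's code: it assumes uniform sequence weights unless weights are
       passed in, treats '-' and '.' as gap characters, and skips all-gap
       rows. The function names are hypothetical.

```python
GAPS = "-."


def residue_range(aligned):
    """Return (first, last) column indices holding a residue, or None if empty."""
    cols = [i for i, ch in enumerate(aligned) if ch not in GAPS]
    return (cols[0], cols[-1]) if cols else None


def is_fragment(aligned, fragthresh=0.5):
    """Fragment rule: range of the sequence <= fragthresh * alignment length."""
    rng = residue_range(aligned)
    span = 0 if rng is None else rng[1] - rng[0] + 1
    return span <= fragthresh * len(aligned)


def consensus_columns(msa, symfrac=0.5, fragthresh=0.5, weights=None):
    """Mark columns whose weighted residue fraction is >= symfrac.

    Gaps outside the residue range of a fragment are ignored, so they
    count neither as residues nor as deletions.
    """
    if weights is None:
        weights = [1.0] * len(msa)  # assumption: uniform weights
    ranges = [residue_range(s) for s in msa]
    frags = [is_fragment(s, fragthresh) for s in msa]
    result = []
    for col in range(len(msa[0])):
        residues = total = 0.0
        for seq, w, rng, frag in zip(msa, weights, ranges, frags):
            if rng is None:
                continue  # all-gap row contributes nothing
            if frag and not (rng[0] <= col <= rng[1]):
                continue  # terminal gap of a fragment: ignored
            total += w
            if seq[col] not in GAPS:
                residues += w
        frac = residues / total if total > 0 else 0.0
        result.append(frac >= symfrac)
    return result


msa = [
    "ACDEFGHIKL",
    "ACD-FGHIKL",
    "---EFG----",  # spans 3 of 10 columns, so a fragment at the default 0.5
    "AC--FGHIKL",
]
print(consensus_columns(msa, symfrac=1.0))
```

       With --fragthresh 0 the third row would be treated as full-length, and
       its terminal gaps would lower the residue fraction of the flanking
       columns.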

       HMMER uses an ad	hoc sequence weighting algorithm to downweight closely
       related sequences and upweight distantly	related	ones. This has the ef-
       fect  of	 making	 models	less biased by uneven phylogenetic representa-
       tion. For example, two identical	sequences would	typically each receive
       half  the  weight that one sequence would.  These options control which
       algorithm gets used.
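       To illustrate the downweighting of identical sequences, here is a
       minimal sketch of position-based weighting in its textbook form. It is
       an assumption that this matches HMMER only in spirit: the real
       implementation may treat gaps and normalization differently. The
       function name is hypothetical.

```python
from collections import Counter

GAP_CHARS = "-."


def position_based_weights(msa):
    """Textbook position-based weights, normalized to sum to the number of sequences.

    In each column with k distinct residue types, a sequence whose residue
    occurs n times in that column receives 1 / (k * n).
    """
    nseq = len(msa)
    weights = [0.0] * nseq
    for col in range(len(msa[0])):
        counts = Counter(s[col] for s in msa if s[col] not in GAP_CHARS)
        k = len(counts)
        for i, s in enumerate(msa):
            if s[col] not in GAP_CHARS:
                weights[i] += 1.0 / (k * counts[s[col]])
    total = sum(weights)
    if total == 0:
        return [1.0] * nseq  # degenerate all-gap input: fall back to uniform
    return [w * nseq / total for w in weights]


# Two identical sequences plus one unrelated sequence.
example = ["ACDE", "ACDE", "GHIK"]
print(position_based_weights(example))
```

       In this toy alignment each of the two identical rows ends up with half
       the weight of the third row.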

       --wpb  Use  the	Henikoff  position-based  sequence  weighting	scheme
	      [Henikoff	 and  Henikoff,	J. Mol.	Biol. 243:574, 1994].  This is
	      the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia  weighting  algorithm	 [Ger-
	      stein et al, J. Mol. Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in
              calculating BLOSUM substitution matrices [Henikoff and Henikoff,
              Proc. Natl. Acad. Sci 89:10915, 1992]. Sequences are
              single-linkage clustered at an identity threshold (default 0.62;
              see --wid) and within each cluster of c sequences, each sequence
              gets relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid _x_
	      Sets the identity	threshold used	by  single-linkage  clustering
	      when  using --wblosum.  Invalid with any other weighting scheme.
	      Default is 0.62.
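       The single-linkage clustering behind --wblosum and --wid can be
       sketched as below. This is an illustration under stated assumptions:
       pairwise identity is taken here as the fraction of matching residues
       over columns where both sequences have a residue, which may not be the
       definition HMMER uses, and the function names are hypothetical.

```python
from collections import Counter

GAP_SYMBOLS = "-."


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where both rows have one (assumed definition)."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in GAP_SYMBOLS and y not in GAP_SYMBOLS]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(msa, wid=0.62):
    """Single-linkage cluster at identity >= wid; each member of a cluster of c gets 1/c."""
    n = len(msa)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(msa[i], msa[j]) >= wid:
                parent[find(i)] = find(j)  # link the two clusters

    roots = [find(i) for i in range(n)]
    sizes = Counter(roots)
    return [1.0 / sizes[r] for r in roots]


# Rows 0 and 2 are only 40% identical, but both are 70% identical to row 1,
# so single linkage chains all three into one cluster.
rows = ["AAAAAAAAAA", "AAAAAAACCC", "AAAACCCCCC", "WWWWWWWWWW"]
print(blosum_weights(rows))
```

       Raising the threshold (for example wid=0.8) breaks the chain in this
       example, and every row then forms its own cluster with weight 1.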

       After relative weights are determined, they are normalized to sum to  a
       total  effective	sequence number, eff_nseq.  This number	may be the ac-
       tual number of sequences	in the alignment,  but	it  is	almost	always
       smaller	than  that.  The default entropy weighting method (--eent) re-
       duces the effective sequence number to reduce the  information  content
       (relative entropy, or average expected score on true homologs) per
       consensus position. The target relative entropy is controlled by a
       two-parameter function, where the two parameters are settable with
       --ere and --esigma.
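       The normalization step mentioned above amounts to rescaling the
       relative weights so that they sum to eff_nseq. A minimal sketch, with
       a hypothetical function name; how eff_nseq itself is chosen by the
       entropy weighting method is not shown here:

```python
def scale_to_effective_number(weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w * eff_nseq / total for w in weights]


# Three sequences whose relative weights are rescaled to an effective
# sequence number smaller than the actual count of 3.
scaled = scale_to_effective_number([0.75, 0.75, 1.5], eff_nseq=1.2)
print(scaled, sum(scaled))
```

       The ratios between the weights are preserved; only their total changes.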

       --eent Adjust effective sequence	number to achieve a specific  relative
	      entropy per position (see	--ere).	 This is the default.

       --eclust
              Set effective sequence number to the number of single-linkage
              clusters at a specific identity threshold (see --eid).  This
              option is not recommended; it's for experiments evaluating how
              much better --eent is.

       --enone
              Turn off effective sequence number determination and just use
              the actual number of sequences. One reason you might want to do
              this is to try to maximize the relative entropy/position of your
              model, which may be useful for short models.

       --eset _x_
              Explicitly set the effective sequence number for all models to
              _x_.

       --ere _x_
	      Set the minimum relative entropy/position	target	to  _x_.   Re-
	      quires  --eent.	Default	 depends on the	sequence alphabet. For
	      protein sequences, it is 0.59 bits/position; for nucleotide  se-
	      quences, it is 0.45 bits/position.

       --esigma	_x_
	      Sets the minimum relative	entropy	contributed by an entire model
	      alignment, over its whole	length.	This has the effect of	making
	      short  models  have  higher  relative  entropy per position than
	      --ere alone would	give. The default is 45.0 bits.

       --eid _x_
	      Sets the fractional pairwise  identity  cutoff  used  by	single
	      linkage  clustering  with	 the  --eclust	option.	The default is

       By default, weighted counts are converted to mean posterior probability
       parameter  estimates  using  mixture Dirichlet priors.  Default mixture
       Dirichlet prior parameters for protein models and for nucleic acid (RNA
       and  DNA) models	are built in. The following options allow you to over-
       ride the	default	priors.

       --pnone
              Don't use any priors. Probability parameters will simply be the
              observed frequencies, after relative sequence weighting.

       --plaplace
              Use a Laplace +1 prior in place of the default mixture Dirichlet
              prior.

       By default, if a	query is a single sequence from	a file in  fasta  for-
       mat,  hmmbuild constructs a search model	from that sequence and a stan-
       dard 20x20 substitution matrix for residue  probabilities,  along  with
       two additional parameters for position-independent gap open and gap ex-
       tend probabilities. These options  allow	 the  default  single-sequence
       scoring	parameters  to be changed, and for single-sequence scoring op-
       tions to	be applied to a	single sequence	coming from an aligned format.

       --singlemx
              If a single sequence query comes from a multiple sequence
              alignment file, such as in stockholm format, the search model is
              by default constructed as is typically done for multiple
              sequence alignments.  This option forces hmmbuild to use the
              single-sequence method with a substitution score matrix.

       --mx _s_
	      Obtain residue alignment probabilities from the built-in substi-
	      tution  matrix  named _s_.  Several standard matrices are	built-
	      in, and do not need to be	read from files.  The matrix name  _s_
	      can  be  PAM30,  PAM70, PAM120, PAM240, BLOSUM45,	BLOSUM50, BLO-
	      SUM62, BLOSUM80, BLOSUM90, or DNA1.  Only	one of	the  --mx  and
	      --mxfile options may be used.

       --mxfile	_mxfile_
	      Obtain residue alignment probabilities from the substitution ma-
	      trix in file _mxfile_.  The default score	matrix is BLOSUM62 for
	      protein  sequences, and DNA1 for nucleotide sequences (these ma-
	      trices are internal to HMMER and do not need to be available  as
	      a	 file).	  The  format of a substitution	matrix _mxfile_	is the
	      standard format accepted by BLAST,  FASTA,  and  other  sequence
	      analysis software.  See for
	      example files. (The only exception: we require  matrices	to  be
	      square, so for DNA, use files like NCBI's	NUC.4.4, not NUC.4.2.)

       --popen _x_
	      Set  the	gap open probability for a single sequence query model
	      to _x_.  The default is 0.02.  _x_ must be >= 0 and < 0.5.

       --pextend _x_
	      Set the gap extend probability for a single sequence query model
	      to _x_.  The default is 0.4.  _x_	must be	>= 0 and < 1.0.

       The  location  parameters  for the expected score distributions for MSV
       filter scores, Viterbi filter scores, and Forward scores	require	 three
       short random sequence simulations.

       --EmL _n_
	      Sets  the	sequence length	in simulation that estimates the loca-
	      tion parameter mu	for MSV	filter E-values. Default is 200.

       --EmN _n_
	      Sets the number of sequences in simulation  that	estimates  the
	      location parameter mu for	MSV filter E-values. Default is	200.

       --EvL _n_
	      Sets  the	sequence length	in simulation that estimates the loca-
	      tion parameter mu	for Viterbi filter E-values. Default is	200.

       --EvN _n_
	      Sets the number of sequences in simulation  that	estimates  the
	      location	parameter  mu  for Viterbi filter E-values. Default is

       --EfL _n_
	      Sets the sequence	length in simulation that estimates the	 loca-
	      tion parameter tau for Forward E-values. Default is 100.

       --EfN _n_
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter tau for Forward E-values. Default is 200.

       --Eft _x_
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values. Default is

       --cpu _n_
	      Set the number of	parallel worker	threads	to _n_.	 On  multicore
	      machines,	the default is 2.  You can also	control	this number by
	      setting an environment variable, HMMER_NCPU.  There  is  also  a
	      master thread, so	the actual number of threads that HMMER	spawns
	      is _n_+1.

	      This option is not available if HMMER was	 compiled  with	 POSIX
	      threads support turned off.

       --informat _s_
	      Assert  that input msafile is in alignment format	_s_, bypassing
              format autodetection.  Common choices for _s_ include:
              stockholm, a2m, afa, psiblast, clustal, phylip.  For more
              information, and for codes for some less common formats, see
              the main documentation.  The string _s_ is case-insensitive (a2m
              or A2M both work).

       --seed _n_
	      Seed the random number generator with _n_, an integer >= 0.   If
	      _n_ is nonzero, any stochastic simulations will be reproducible;
	      the same command will give the same results.  If _n_ is  0,  the
	      random  number  generator	 is seeded arbitrarily,	and stochastic
	      simulations will vary from run to	run of the same	command.   The
	      default seed is 42.
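       The seeding convention described above (nonzero seed reproducible,
       zero seed arbitrary) can be mimicked in a few lines. This sketch uses
       Python's random module purely as a stand-in; it is not HMMER's
       generator, and make_rng is a hypothetical name.

```python
import random


def make_rng(seed=42):
    """Return a generator following the --seed convention described above."""
    if seed < 0:
        raise ValueError("seed must be an integer >= 0")
    if seed == 0:
        return random.Random()  # arbitrary seed: results vary from run to run
    return random.Random(seed)  # fixed seed: results are reproducible


# Two generators built with the same nonzero seed yield the same stream.
first = [make_rng(42).random() for _ in range(3)]
second = [make_rng(42).random() for _ in range(3)]
print(first == second)
```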

       --w_beta	_x_
	      Window  length  tail mass.  The upper bound, W, on the length at
	      which nhmmer expects to find an instance of  the	model  is  set
	      such  that  the fraction of all sequences	generated by the model
              with length >= W is less than _x_.  The default is 1e-7.

       --w_length _n_
	      Override the model instance length upper bound, W, which is oth-
	      erwise  controlled  by  --w_beta.	  It should be larger than the
	      model length. The	value of W is used deep	 in  the  acceleration
	      pipeline,	 and modest changes are	not expected to	impact results
	      (though larger values of W do lead to longer run time).

       --mpi  Run as a parallel MPI program. Each alignment is assigned to an
              MPI worker node for construction. (Therefore, the maximum paral-
	      lelization cannot	exceed the number of alignments	in  the	 input
	      msafile.)	 This is useful	when building large profile libraries.
	      This option is only available if optional	MPI capability was en-
	      abled at compile-time.

       --stall
              For debugging MPI parallelization: arrest program execution
              immediately after start, and wait for a debugger to attach to
              the running process and release the arrest.

       --maxinsertlen _n_
	      Restrict	insert	length parameterization	such that the expected
	      insert length at each position of	the model is no	more than _n_.

       See hmmer(1) for	a master man page with a list of  all  the  individual
       man pages for programs in the HMMER package.

       For  complete documentation, see	the user guide that came with your HM-
       MER distribution	(Userguide.pdf); or see	the HMMER web page (http://hm-

       Copyright (C) 2019 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For  additional	information  on	 copyright and licensing, see the file
       called COPYRIGHT	in your	HMMER source distribution, or  see  the	 HMMER
       web page	(


HMMER 3.3			   Nov 2019			   hmmbuild(1)

