FreeBSD Manual Pages

nhmmer(1)			 HMMER Manual			     nhmmer(1)

NAME
       nhmmer -	search DNA queries against a DNA sequence database

SYNOPSIS
       nhmmer [options]	queryfile seqdb

DESCRIPTION
       nhmmer  is  used	to search one or more nucleotide queries against a nu-
       cleotide	sequence database.  For	each  query  in	 queryfile,  use  that
       query to	search the target database of sequences	in seqdb, and output a
       ranked list of the hits with the	most significant matches to the	query.
       A  query	may be either a	profile	model built using hmmbuild, a sequence
       alignment, or a single sequence.	Sequence based queries	can  be	 in  a
       number  of  formats (see	--qformat), and	can typically be autodetected.
       Note that only Stockholm	format supports	queries	made up	of  more  than
       one sequence alignment.

       Either the query	queryfile or the target	seqdb may be '-' (a dash char-
       acter),	in  which case the query file or target	database input will be
       read from a <stdin> pipe	instead	of from	a file.	Only one input	source
       can  come  through  <stdin>,  not both.	If the queryfile contains more
       than one	query, then seqdb cannot come from  stdin,  because  we	 can't
       rewind the streaming target database to search it with another profile.
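
       The stdin rules above can be written as a small pre-flight check.  The
       following Python sketch is illustrative only: check_inputs and its
       n_queries parameter are hypothetical names, not part of HMMER, and the
       caller is assumed to already know how many queries queryfile holds.

```python
def check_inputs(queryfile: str, seqdb: str, n_queries: int = 1) -> None:
    """Raise ValueError if the queryfile/seqdb pair breaks the stdin rules.

    Hypothetical helper; nhmmer performs its own checks at run time.
    """
    if queryfile == "-" and seqdb == "-":
        raise ValueError("only one of queryfile and seqdb may be read from stdin")
    if seqdb == "-" and n_queries > 1:
        # A streamed target database cannot be rewound for a second query.
        raise ValueError("seqdb cannot come from stdin when there are several queries")


check_inputs("query.hmm", "-")                # one query, streamed target
check_inputs("-", "genome.fa", n_queries=3)   # streamed queries, target on disk
```

       The file names are placeholders.  A check like this only saves a
       failed run; it does not change how nhmmer itself reads its inputs.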

       If  the query is	sequence-based (unaligned or aligned), a new file con-
       taining the HMM(s) built	from the input(s) in queryfile may  optionally
       be produced, with the filename set using	the --hmmout flag.

       The output format is designed to	be human-readable, but is often	so vo-
       luminous	 that reading it is impractical, and parsing it	is a pain. The
       --tblout	option saves output in a simple	tabular	format that is concise
       and easier to parse.  The -o option allows redirecting the main output,
       including throwing it away in /dev/null.
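
       As a sketch of the combination described above (tabular output kept,
       main output discarded), the following Python helper assembles an
       argument list.  The helper name nhmmer_argv and the file names are
       hypothetical; only the -o and --tblout options come from this page.

```python
import shlex


def nhmmer_argv(queryfile, seqdb, tblout=None, main_output=None):
    """Assemble an nhmmer command line as a list of arguments (hypothetical helper)."""
    argv = ["nhmmer"]
    if tblout is not None:
        argv += ["--tblout", tblout]    # concise tabular summary
    if main_output is not None:
        argv += ["-o", main_output]     # redirect the human-readable output
    argv += [queryfile, seqdb]          # options precede the two positional arguments
    return argv


argv = nhmmer_argv("query.hmm", "genome.fa",
                   tblout="hits.tbl", main_output="/dev/null")
print(shlex.join(argv))
# prints: nhmmer --tblout hits.tbl -o /dev/null query.hmm genome.fa
```

       Where nhmmer is installed, the list can be passed to
       subprocess.run(argv, check=True).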

OPTIONS
       -h     Help; print a brief reminder  of	command	 line  usage  and  all
	      available	options.

OPTIONS	FOR CONTROLLING	OUTPUT
       -o <f> Direct  the  main	human-readable output to a file	<f> instead of
	      the default stdout.

       -A <f> Save a multiple alignment	of all significant hits	(those	satis-
	      fying "inclusion thresholds") to the file	<f>.

       --tblout	<f>
	      Save  a  simple  tabular	(space-delimited) file summarizing the
	      per-target output, with one data line per	homologous target  se-
	      quence found.

       --dfamtblout <f>
	      Save  a  tabular	(space-delimited) file summarizing the per-hit
	      output, similar to --tblout but more succinct.

       --aliscoresout <f>
	      Save to file a list of per-position scores for each  hit.	  This
	      is  useful,  for	example,  in identifying regions of high score
	      density for use in resolving  overlapping	 hits  from  different
	      models.

       --hmmout	<f>
	      If  queryfile  is	 sequence-based, write the internally-computed
	      HMM(s) to	file <f>.

       --acc  Use accessions instead of	names in the main output, where	avail-
	      able for profiles	and/or sequences.

       --noali
	      Omit the alignment  section  from	 the  main  output.  This  can
	      greatly reduce the output	volume.

       --notextw
	      Remove the limit on the length of each line in the main output.
	      The default is a limit of 120 characters per line, which helps
	      in displaying the output cleanly on terminals and in editors,
	      but can truncate target profile description lines.

       --textw <n>
	      Set the main output's line length	limit to  <n>  characters  per
	      line. The	default	is 120.

OPTIONS	CONTROLLING SINGLE SEQUENCE SCORING
       By default, if a query is a single sequence from a file in fasta
       format, nhmmer uses a search model constructed from that sequence and
       a standard nucleotide substitution matrix for residue probabilities,
       along with two additional parameters for position-independent gap open
       and gap extend probabilities. These options allow the default
       single-sequence scoring parameters to be changed, and allow
       single-sequence scoring to be applied to a single sequence coming from
       an aligned format.

       --singlemx
	      If a single sequence query comes from a multiple sequence	align-
	      ment file, such as in Stockholm format, the search model	is  by
	      default  constructed  as is typically done for multiple sequence
	      alignments. This option forces nhmmer to use the single-sequence
	      method with substitution score matrix.

       --mxfile <mxfile>
	      Obtain residue alignment probabilities from the substitution ma-
	      trix in file mxfile.  The	default	score matrix is	DNA1 (this ma-
	      trix is internal to HMMER	and does not have to be	available as a
	      file).  The format of a substitution matrix mxfile is the	 stan-
	      dard  format accepted by BLAST, FASTA, and other sequence	analy-
	      sis software.  See ftp.ncbi.nlm.nih.gov/blast/matrices/ for  ex-
	      ample  files.  (The  only	 exception:  we	require	matrices to be
	      square, so for DNA, use files like NCBI's	NUC.4.4, not NUC.4.2.)

       --popen <x>
	      Set the gap open probability for a single	sequence  query	 model
	      to <x>.  The default is 0.02.  <x> must be >= 0 and < 0.5.

       --pextend <x>
	      Set the gap extend probability for a single sequence query model
	      to <x>.  The default is 0.4.  <x>	must be	>= 0 and < 1.0.
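
       The documented ranges for --popen and --pextend can be checked before
       launching a search.  This is a minimal Python sketch; the function
       name single_seq_gap_options is hypothetical, and the defaults and
       bounds are the ones stated above.

```python
def single_seq_gap_options(popen=0.02, pextend=0.4):
    """Return nhmmer arguments for the gap parameters after a range check.

    Hypothetical helper; bounds follow the option descriptions:
    0 <= popen < 0.5 and 0 <= pextend < 1.0.
    """
    if not 0 <= popen < 0.5:
        raise ValueError("--popen must be >= 0 and < 0.5")
    if not 0 <= pextend < 1.0:
        raise ValueError("--pextend must be >= 0 and < 1.0")
    return ["--popen", str(popen), "--pextend", str(pextend)]


print(single_seq_gap_options())            # documented defaults
print(single_seq_gap_options(popen=0.05))  # slightly more permissive gap opening
```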

OPTIONS	CONTROLLING REPORTING THRESHOLDS
       Reporting  thresholds  control  which hits are reported in output files
       (the main output, --tblout, and --dfamtblout).  Hits are	ranked by sta-
       tistical	significance (E-value).

       -E <x> Report target sequences with an E-value of <= <x>.  The  default
	      is  10.0,	meaning	that on	average, about 10 false	positives will
	      be reported per query, so	you can	see the	top of the  noise  and
	      decide for yourself if it's really noise.

       -T <x> Instead of thresholding output on E-value, report target
	      sequences with a bit score of >= <x>.

OPTIONS	FOR INCLUSION THRESHOLDS
       Inclusion thresholds are	stricter than reporting	thresholds.  Inclusion
       thresholds  control  which hits are considered to be reliable enough to
       be included in an output	alignment or a	subsequent  search  round,  or
       marked  as  significant	("!") as opposed to questionable ("?")	in hit
       output.

       --incE <x>
	      Use an E-value of	<= <x> as the inclusion	 threshold.   The  de-
	      fault  is	 0.01, meaning that on average,	about 1	false positive
	      would be expected	in every 100 searches with different query se-
	      quences.

       --incT <x>
	      Instead of using E-values	for setting the	 inclusion  threshold,
	      use  a  bit  score of >= <x> as the inclusion threshold.	By de-
	      fault this option	is unset.
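
       How the reporting and inclusion E-value thresholds interact can be
       sketched in a few lines of Python.  The function classify_hit is
       hypothetical and covers only the E-value case (-E and --incE with
       their documented defaults), not the bit score variants.

```python
def classify_hit(evalue, report_e=10.0, incl_e=0.01):
    """Return "!" for an included hit, "?" for a reported but questionable
    hit, and None for a hit that would not be reported at all.

    Hypothetical helper mirroring the -E and --incE defaults.
    """
    if evalue > report_e:
        return None                      # fails the reporting threshold
    return "!" if evalue <= incl_e else "?"


for e in (1e-30, 0.5, 25.0):
    print(e, classify_hit(e))
```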

OPTIONS	FOR MODEL-SPECIFIC SCORE THRESHOLDING
       Curated profile databases may define specific bit score thresholds  for
       each profile, superseding any thresholding based	on statistical signif-
       icance alone.

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or  NC)  optional  score threshold annotation; this is picked up by
       hmmbuild	from Stockholm format alignment	files. For a nucleotide	model,
       each thresholding option has a single per-hit threshold <x>. This acts
       as  if  -T  <x>	--incT	<x>  has  been applied specifically using each
       model's curated thresholds.

       --cut_ga
	      Use the GA (gathering) bit score threshold in the	model  to  set
	      per-hit  reporting  and  inclusion thresholds. GA	thresholds are
	      generally	considered  to	be  the	 reliable  curated  thresholds
	      defining	family membership; for example,	in Dfam, these thresh-
	      olds are applied when annotating a genome	with a model of	a fam-
	      ily known	to be found in that organism. They may allow for mini-
	      mal expected false discovery rate.

       --cut_nc
	      Use the NC (noise	cutoff)	bit score threshold in	the  model  to
	      set  per-hit  reporting  and inclusion thresholds. NC thresholds
	      are less stringent than GA; in the context  of  Pfam,  they  are
	      generally	 used  to store	the score of the highest-scoring known
	      false positive.

       --cut_tc
	      Use the TC (trusted cutoff) bit score threshold in the model  to
	      set  per-hit  reporting  and inclusion thresholds. TC thresholds
	      are more stringent than GA, and are generally considered	to  be
	      the  score  of  the  lowest-scoring  known true positive that is
	      above all	known false positives; for  example,  in  Dfam,	 these
	      thresholds  are applied when annotating a	genome with a model of
	      a	family not known to be found in	that organism.
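
       The equivalence stated above (a curated cutoff behaves like -T <x>
       --incT <x> with the model's own value) can be illustrated as follows.
       This Python sketch is hypothetical: equivalent_threshold_flags is not
       part of HMMER, and the annotation values are made up for the example.

```python
def equivalent_threshold_flags(cutoff, annotations):
    """Map a curated cutoff name ("GA", "TC" or "NC") to the -T/--incT pair
    it is described as being equivalent to for one model.

    `annotations` is a hypothetical dict of the model's per-hit bit score
    thresholds, for example as read from its GA/TC/NC annotation.
    """
    if cutoff not in ("GA", "TC", "NC"):
        raise ValueError("cutoff must be one of GA, TC, NC")
    if cutoff not in annotations:
        raise KeyError(f"model has no {cutoff} annotation")
    x = annotations[cutoff]
    return ["-T", str(x), "--incT", str(x)]


model_cutoffs = {"GA": 25.0, "TC": 30.5, "NC": 20.0}   # illustrative values only
print(equivalent_threshold_flags("GA", model_cutoffs))
```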

OPTIONS	CONTROLLING THE	ACCELERATION PIPELINE
       HMMER3 searches are accelerated in a three-step	filter	pipeline:  the
       scanning-SSV  filter,  the  Viterbi filter, and the Forward filter. The
       first filter is the fastest and most approximate; the last is the  full
       Forward scoring algorithm. There	is also	a bias filter step between SSV
       and  Viterbi.  Targets  that  pass  all	the  steps in the acceleration
       pipeline	are then subjected to postprocessing --	domain	identification
       and scoring using the Forward/Backward algorithm.

       Changing	 filter	 thresholds only removes or includes targets from con-
       sideration; changing filter thresholds does not alter  bit  scores,  E-
       values,	or  alignments,	all of which are determined solely in postpro-
       cessing.

       --max  Turn off (nearly)	all filters, including the  bias  filter,  and
	      run  full	 Forward/Backward postprocessing on most of the	target
	      sequence.	 In contrast to	phmmer and hmmsearch, where this  flag
	      really  does  turn  off  the filters entirely, the --max flag in
	      nhmmer sets the scanning-SSV filter threshold to 0.4,  not  1.0.
	      Use of this flag increases sensitivity somewhat, at a large cost
	      in speed.

       --F1 <x>
	      Set  the P-value threshold for the SSV filter step.  The default
	      is 0.02, meaning that roughly 2% of the highest  scoring	nonho-
	      mologous targets are expected to pass the	filter.

       --F2 <x>
	      Set  the P-value threshold for the Viterbi filter	step.  The de-
	      fault is 0.001.

       --F3 <x>
	      Set the P-value threshold	for the	Forward	filter step.  The  de-
	      fault is 1e-5.

       --nobias
	      Turn  off	 the bias filter. This increases sensitivity somewhat,
	      but can come at a	high cost in speed, especially	if  the	 query
	      has  biased  residue  composition	(such as a repetitive sequence
	      region, or if it is a membrane protein with large	regions	of hy-
	      drophobicity). Without the bias filter, too many	sequences  may
	      pass  the	filter with biased queries, leading to slower than ex-
	      pected  performance  as  the  computationally   intensive	  For-
	      ward/Backward algorithms shoulder	an abnormally heavy load.
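
       To get a feel for the default --F1, --F2 and --F3 values, the
       following Python sketch estimates how many nonhomologous targets
       survive each filter.  It rests on a simplifying assumption that is
       not stated on this page: each P-value threshold is treated as the
       fraction of all nonhomologous targets that reaches past that stage.
       The function name expected_survivors is hypothetical.

```python
def expected_survivors(n_targets, f1=0.02, f2=0.001, f3=1e-5):
    """Rough expected counts of nonhomologous targets passing each filter.

    Assumption: a P-value threshold p lets about n_targets * p
    nonhomologous targets through, counted against the original pool.
    """
    return {
        "SSV": n_targets * f1,
        "Viterbi": n_targets * f2,
        "Forward": n_targets * f3,
    }


for stage, count in expected_survivors(1_000_000).items():
    print(f"{stage:8s} ~{count:,.0f}")
```

       Loosening a threshold raises the corresponding count, which is why
       options such as --max cost so much run time.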

OPTIONS	FOR SPECIFYING THE ALPHABET
       --dna  Assert that the input sequences are DNA, bypassing alphabet
	      autodetection.

       --rna  Assert that the input sequences are RNA, bypassing alphabet
	      autodetection.

OPTIONS	CONTROLLING SEED SEARCH	HEURISTIC
       When searching with nhmmer, one may optionally precompute a binary ver-
       sion  of	 the  target  database,	using makehmmerdb, then	search against
       that database. Using default settings, this yields  a  roughly  10-fold
       acceleration  with  small  loss	of sensitivity on benchmarks.  This is
       achieved	using a	heuristic method that  searches	 for  seeds  (ungapped
       alignments) around which	full processing	is done. This is essentially a
       replacement for the SSV stage. (This method has been extensively tested,
       but  should  still be treated as	somewhat experimental.)	 The following
       options only impact nhmmer if the value of --tformat is hmmerdb.

       Changing	parameters for this seed-finding step will impact  both	 speed
       and sensitivity - typically faster search leads to lower	sensitivity.

       --seed_max_depth	<n>
	      The  seed	 step requires that a seed reach a specified bit score
	      in length	no longer than <n>.  By	default,  this	value  is  15.
	      Longer  seeds  allow  a  greater chance of meeting the bit score
	      threshold, leading to diminished filtering (greater sensitivity,
	      slower run time).

       --seed_sc_thresh	<x>
	      The seed must reach score	<x> (in	bits).	The  default  is  15.0
	      bits. A higher threshold increases filtering stringency, leading
	      to faster	run times and lower sensitivity.

       --seed_sc_density <x>
	      Either all prefixes or all suffixes of a seed must have bit den-
	      sity  (bits  per aligned position) of at least <x>.  The default
	      is 0.8 bits/position. An increase	 in  the  density  requirement
	      leads  to	 increased filtering stringency, thus faster run times
	      and lower	sensitivity.

       --seed_drop_max_len <n>
	      A	seed may not have a run	of length <n> in which the score drops
	      by --seed_drop_lim or more. Basically, this prunes seeds that go
	      through long slightly-negative seed extensions. The  default  is
	      4.   Increasing the limit	causes (slightly) diminished filtering
	      efficiency, thus slower run times	and higher sensitivity.	(minor
	      tuning option)

       --seed_drop_lim <x>
	      In a seed, there may be no run of length --seed_drop_max_len in
	      which the score drops by <x> or more.  The default is 0.3 bits.
	      Larger numbers mean less filtering.  (minor tuning option)

       --seed_req_pos <n>
	      A	seed must contain a  run  of  at  least	 <n>  positive-scoring
	      matches.	The default is 5. Larger values	mean increased filter-
	      ing.  (minor tuning option)

       --seed_ssv_length <n>
	      After finding a short seed, an ungapped alignment	is extended in
	      both directions in an attempt to meet the	--F1 score  threshold.
	      The  window  through  which  this	 ungapped alignment extends is
	      length <n>.  The default is 70.  Decreasing this value  slightly
	      reduces run time,	at a small risk	of reduced sensitivity.	(minor
	      tuning option)
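
       The seed criteria above can be read as a set of tests on the
       per-position scores of a candidate seed.  The Python sketch below is
       one literal reading of the option descriptions, not HMMER's actual
       implementation: seed_passes is a hypothetical function, a "drop" is
       taken to mean a window of exactly --seed_drop_max_len positions whose
       scores sum to minus --seed_drop_lim or lower, and the
       --seed_ssv_length extension step is left out.

```python
from itertools import accumulate


def seed_passes(scores, max_depth=15, sc_thresh=15.0, sc_density=0.8,
                drop_max_len=4, drop_lim=0.3, req_pos=5):
    """Apply the documented seed criteria to a list of per-position bit scores."""
    n = len(scores)
    if n == 0 or n > max_depth:          # --seed_max_depth
        return False
    if sum(scores) < sc_thresh:          # --seed_sc_thresh
        return False

    def dense(seq):
        # Every leading stretch of seq must average at least sc_density bits.
        return all(total / length >= sc_density
                   for length, total in enumerate(accumulate(seq), start=1))

    # --seed_sc_density: all prefixes, or all suffixes, must be dense enough.
    if not (dense(scores) or dense(scores[::-1])):
        return False

    # --seed_drop_max_len / --seed_drop_lim: reject windows that lose too much.
    for start in range(n - drop_max_len + 1):
        if sum(scores[start:start + drop_max_len]) <= -drop_lim:
            return False

    # --seed_req_pos: longest run of positive-scoring positions.
    longest = current = 0
    for s in scores:
        current = current + 1 if s > 0 else 0
        longest = max(longest, current)
    return longest >= req_pos


print(seed_passes([1.5] * 12))   # long enough and dense enough
print(seed_passes([1.5] * 9))    # total score below the 15-bit threshold
```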

OTHER OPTIONS
       --qformat <s>
	      Assert  that  input  queryfile  is a sequence file (unaligned or
	      aligned),	in format <s>, bypassing format	autodetection.	Common
	      choices for <s> include: fasta, embl, genbank.   Alignment  for-
	      mats  also  work,	and will serve as the basis for	automatic cre-
	      ation of a profile HMM used for searching;  common  choices  in-
	      clude: stockholm,	a2m, afa, psiblast, clustal, phylip.  For more
	      information,  and	 for  codes  for some less common formats, see
	      main documentation.

       --qsingle_seqs
	      Force queryfile to be read as individual sequences, even	if  it
	      is  in  an  msa  format. For example, if the input is in aligned
	      stockholm format, the --qsingle_seqs flag will cause each
	      sequence in that alignment to be used as a separate query
	      sequence.

       --tformat <s>
	      Assert that target sequence database seqdb is in format <s>, by-
	      passing format autodetection.  Common choices for	 <s>  include:
	      fasta,  embl,  genbank,  fmindex.	  Alignment formats also work;
	      common choices include: stockholm, a2m, afa, psiblast,  clustal,
	      phylip.	For more information, and for codes for	some less com-
	      mon formats, see main documentation.  The	string <s> is case-in-
	      sensitive	(fasta or FASTA	both work).  The format	fmindex	 indi-
	      cates  that  the	database  file is a binary file	produced using
	      makehmmerdb.

       --nonull2
	      Turn off the null2 score corrections for biased composition.

       -Z <x> For the purposes of per-hit E-value calculations, assert that
	      the total size of the target database is <x> million
	      nucleotides, rather than the actual number of targets seen.

       --seed <n>
	      Set the random number seed to <n>.  Some steps in	postprocessing
	      require Monte Carlo simulation.  The default is to use  a	 fixed
	      seed  (42),  so that results are exactly reproducible. Any other
	      positive integer will give different (but	also reproducible) re-
	      sults. A choice of 0 uses	a randomly chosen seed.

       --w_beta	<x>
	      Window length tail mass.	The upper bound, W, on the  length  at
	      which  nhmmer  expects  to  find an instance of the model	is set
	      such that	the fraction of	all sequences generated	by  the	 model
	      with  length  >= W is less than <x>.  The	default	is 1e-7.  This
	      flag may be used to override the value of	W established for  the
	      model by hmmbuild, or when the query is sequence-based.

       --w_length <n>
	      Override the model instance length upper bound, W, which is oth-
	      erwise  controlled  by  --w_beta.	  It should be larger than the
	      model length. The	value of W is used deep	 in  the  acceleration
	      pipeline,	 and modest changes are	not expected to	impact results
	      (though larger values of W do lead to longer  run	 time).	  This
	      flag  may	be used	to override the	value of W established for the
	      model by hmmbuild, or when the query is sequence-based.

       --watson
	      Only search the top strand. By default both the  query  sequence
	      and its reverse-complement are searched.

       --crick
	      Only  search  the	bottom (reverse-complement) strand. By default
	      both the query sequence and its reverse-complement are searched.

       --cpu <n>
	      Set the number of	parallel worker	threads	to <n>.	 On  multicore
	      machines,	the default is 2.  You can also	control	this number by
	      setting  an  environment	variable, HMMER_NCPU.  There is	also a
	      master thread, so	the actual number of threads that HMMER	spawns
	      is <n>+1.

	      This option is not available if HMMER was	 compiled  with	 POSIX
	      threads support turned off.

       --stall
	      For  debugging the MPI master/worker version: pause after	start,
	      to enable	the developer to attach	debuggers to the running  mas-
	      ter  and worker(s) processes. Send SIGCONT signal	to release the
	      pause.  (Under gdb: (gdb)	signal SIGCONT)	(Only available	if op-
	      tional MPI support was enabled at	compile-time.)

       --mpi  Run under	MPI control with master/worker parallelization	(using
	      mpirun,  for example, or equivalent). Only available if optional
	      MPI support was enabled at compile-time.
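
       The thread accounting described under --cpu can be sketched in
       Python.  The function worker_threads is hypothetical, and the
       precedence it uses (an explicit --cpu value wins over HMMER_NCPU) is
       an assumption; this page does not say which takes priority.

```python
import os


def worker_threads(cpu_option=None, environ=None):
    """Return (workers, total_threads) for a threaded nhmmer run.

    Assumption: --cpu, when given, overrides the HMMER_NCPU variable.
    """
    environ = os.environ if environ is None else environ
    if cpu_option is not None:
        workers = cpu_option
    elif "HMMER_NCPU" in environ:
        workers = int(environ["HMMER_NCPU"])
    else:
        workers = 2                  # documented default on multicore machines
    return workers, workers + 1      # one extra master thread


print(worker_threads(cpu_option=8, environ={}))
print(worker_threads(environ={"HMMER_NCPU": "4"}))
```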

SEE ALSO
       See hmmer(1) for	a master man page with a list of  all  the  individual
       man pages for programs in the HMMER package.

       For complete documentation, see the user guide that came with your
       HMMER distribution (Userguide.pdf); or see the HMMER web page
       (http://hmmer.org/).

COPYRIGHT
       Copyright (C) 2023 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For additional information on copyright and  licensing,	see  the  file
       called  COPYRIGHT  in  your HMMER source	distribution, or see the HMMER
       web page	(http://hmmer.org/).

AUTHOR
       http://eddylab.org

HMMER 3.4			   Aug 2023			     nhmmer(1)