FreeBSD Manual Pages

jackhmmer(1)			 HMMER Manual			  jackhmmer(1)

NAME
       jackhmmer - iteratively search sequence(s) against a sequence database

SYNOPSIS
       jackhmmer [options] seqfile seqdb

DESCRIPTION
       jackhmmer  iteratively  searches	each query sequence in seqfile against
       the target sequence(s) in seqdb.	 The first iteration is	identical to a
       phmmer search.  For the next iteration, a multiple alignment of the
       query together with all target sequences satisfying inclusion
       thresholds is assembled, a profile is constructed from this alignment
       (identical to using hmmbuild on the alignment), and a profile search of
       the seqdb is done (identical to an hmmsearch with the profile).

       The query seqfile may be	'-' (a dash  character),  in  which  case  the
       query sequences are read	from a stdin pipe instead of from a file.

       The  seqdb needs	to be a	'normal' sequence file.	It cannot be read from
       a stdin stream, because jackhmmer needs to do multiple passes over  the
       database.  It  cannot be	a compressed (gzipped) file either, because we
       treat gzipped files essentially as stdin	streams, calling  an  external
       decompression program.

       The output format is designed to	be human-readable, but is often	so vo-
       luminous	 that reading it is impractical, and parsing it	is a pain. The
       --tblout	and --domtblout	options	save output in simple tabular  formats
       that are	concise	and easier to parse.  The -o option allows redirecting
       the main	output,	including throwing it away in /dev/null.
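
       As an illustration of the invocation described above, the following
       Python sketch assembles a jackhmmer argument list. The helper name
       build_jackhmmer_argv and the file names are hypothetical; only the
       -N, -o, --tblout and --domtblout options come from this page. The
       ".gz" suffix check is a rough heuristic, not something jackhmmer
       itself is known to do.

```python
import shlex


def build_jackhmmer_argv(seqfile, seqdb, iterations=5, tblout=None,
                         domtblout=None, main_output=None):
    """Assemble an argument list for jackhmmer (hypothetical helper)."""
    # seqdb is read in multiple passes, so stdin and gzipped files are
    # not accepted. The suffix test is only a heuristic for this sketch.
    if seqdb == "-" or seqdb.endswith(".gz"):
        raise ValueError("seqdb must be a regular, uncompressed file")
    argv = ["jackhmmer", "-N", str(iterations)]
    if main_output is not None:
        argv += ["-o", main_output]
    if tblout is not None:
        argv += ["--tblout", tblout]
    if domtblout is not None:
        argv += ["--domtblout", domtblout]
    # seqfile may be "-" to read the queries from stdin.
    argv += [seqfile, seqdb]
    return argv


argv = build_jackhmmer_argv("query.fa", "targets.fa", iterations=3,
                            tblout="hits.tbl", main_output="/dev/null")
print(shlex.join(argv))
```

       The resulting list could be handed to subprocess.run() on a machine
       where jackhmmer is installed.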

OPTIONS
       -h     Help;  print  a  brief  reminder	of  command line usage and all
	      available	options.

       -N <n> Set the maximum number of iterations to <n>.  The default is 5.
              If <n> is 1, the result is equivalent to a phmmer search.

OPTIONS	CONTROLLING OUTPUT
       By  default,  output for	each iteration appears on stdout in a somewhat
       human readable, somewhat	parseable format. These	 options  allow	 redi-
       recting	that output or saving additional kinds of output to files, in-
       cluding checkpoint files	for each iteration.

       -o <f> Direct the human-readable	output to a file <f>.

       -A <f> After the	final iteration, save an annotated multiple  alignment
	      of  all hits satisfying inclusion	thresholds (also including the
	      original query) to <f> in	Stockholm format.

       --tblout	<f>
	      After the	final iteration, save a	tabular	 summary  of  top  se-
	      quence hits to <f> in a readily parseable, columnar, whitespace-
	      delimited	format.

       --domtblout <f>
	      After  the final iteration, save a tabular summary of top	domain
	      hits to <f> in a readily parseable, columnar,  whitespace-delim-
	      ited format.

       --chkhmm	prefix
	      At the start of each iteration, checkpoint the query HMM,	saving
	      it  to a file named prefix-n.hmm where n is the iteration	number
	      (from 1..N).

       --chkali	prefix
              At the end of each iteration, checkpoint an alignment of all
              domains satisfying inclusion thresholds (i.e. what will become
              the query HMM for the next iteration), saving it to a file named
              prefix-n.sto in Stockholm format, where n is the iteration
              number (from 1..N).

       --acc  Use accessions instead of	names in the main output, where	avail-
	      able for profiles	and/or sequences.

       --noali
	      Omit the alignment  section  from	 the  main  output.  This  can
	      greatly reduce the output	volume.

       --notextw
              Do not limit the length of each line in the main output. The
              default is a limit of 120 characters per line, which helps in
              displaying the output cleanly on terminals and in editors, but
              can truncate target description lines.

       --textw <n>
	      Set the main output's line length	limit to  <n>  characters  per
	      line. The	default	is 120.
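
       The --tblout file is meant to be parsed by scripts. The sketch below
       reads such a file in Python. It assumes the per-sequence column
       order given in the HMMER user guide (18 whitespace-delimited fields
       followed by a free-text description, with comment lines starting
       with '#'); that layout is not spelled out on this page, so check it
       against your HMMER version. The function name and sample line are
       made up for illustration.

```python
def parse_tblout(lines):
    """Parse per-sequence hits from --tblout text (illustrative sketch)."""
    hits = []
    for line in lines:
        # Skip blank lines and '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split off at most 18 fields; the remainder is the description,
        # which may itself contain spaces.
        fields = line.split(None, 18)
        hits.append({
            "target": fields[0],
            "query": fields[2],
            "evalue": float(fields[4]),   # full-sequence E-value
            "score": float(fields[5]),    # full-sequence bit score
            "description": fields[18].strip() if len(fields) > 18 else "",
        })
    return hits


sample = [
    "# target name  accession  query name ...",
    "seqA - query1 - 1.2e-30 105.3 0.1 1.5e-30 105.0 0.1 "
    "1.0 1 0 0 1 1 1 1 Example protein kinase",
]
for hit in parse_tblout(sample):
    print(hit["target"], hit["evalue"], hit["description"])
```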

OPTIONS	CONTROLLING SINGLE SEQUENCE SCORING (FIRST ITERATION)
       By  default, the	first iteration	uses a search model constructed	from a
       single query sequence. This model is constructed	using a	standard 20x20
       substitution matrix for residue probabilities, and two additional para-
       meters for position-independent gap open	and gap	extend	probabilities.
       These  options  allow the default single-sequence scoring parameters to
       be changed.

       --popen <x>
	      Set the gap open probability for a single	sequence  query	 model
	      to <x>.  The default is 0.02.  <x> must be >= 0 and < 0.5.

       --pextend <x>
	      Set the gap extend probability for a single sequence query model
	      to <x>.  The default is 0.4.  <x>	must be	>= 0 and < 1.0.

       --mx <s>
	      Obtain residue alignment probabilities from the built-in substi-
	      tution  matrix  named <s>.  Several standard matrices are	built-
	      in, and do not need to be	read from files.  The matrix name  <s>
	      can  be  PAM30,  PAM70, PAM120, PAM240, BLOSUM45,	BLOSUM50, BLO-
	      SUM62, BLOSUM80, or BLOSUM90.  Only one of the --mx and --mxfile
	      options may be used.

       --mxfile	mxfile
	      Obtain residue alignment probabilities from the substitution ma-
	      trix in file mxfile.  The	default	score matrix is	BLOSUM62 (this
	      matrix is	internal to HMMER and does not have to be available as
	      a	file).	The format of a	 substitution  matrix  mxfile  is  the
	      standard	format	accepted  by  BLAST, FASTA, and	other sequence
	      analysis software.  See ftp.ncbi.nlm.nih.gov/blast/matrices/ for
	      example files. (The only exception: we require  matrices	to  be
	      square, so for DNA, use files like NCBI's	NUC.4.4, not NUC.4.2.)
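
       The constraints on these options can be checked before launching a
       search. The Python sketch below is a hypothetical wrapper-side
       check, not part of HMMER: it enforces the ranges stated above for
       --popen and --pextend, the built-in matrix names, and the rule that
       --mx and --mxfile are mutually exclusive.

```python
BUILTIN_MATRICES = {
    "PAM30", "PAM70", "PAM120", "PAM240",
    "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
}


def scoring_args(popen=0.02, pextend=0.4, mx=None, mxfile=None):
    """Return single-sequence scoring options as an argument list."""
    if not (0.0 <= popen < 0.5):
        raise ValueError("--popen must be >= 0 and < 0.5")
    if not (0.0 <= pextend < 1.0):
        raise ValueError("--pextend must be >= 0 and < 1.0")
    if mx is not None and mxfile is not None:
        raise ValueError("only one of --mx and --mxfile may be used")
    if mx is not None and mx not in BUILTIN_MATRICES:
        raise ValueError("unknown built-in matrix: " + mx)
    args = ["--popen", str(popen), "--pextend", str(pextend)]
    if mx is not None:
        args += ["--mx", mx]
    if mxfile is not None:
        args += ["--mxfile", mxfile]
    return args


print(scoring_args(mx="BLOSUM80"))
```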

OPTIONS	CONTROLLING REPORTING THRESHOLDS
       Reporting  thresholds  control  which hits are reported in output files
       (the main output, --tblout, and --domtblout).  In each  iteration,  se-
       quence  hits and	domain hits are	ranked by statistical significance (E-
       value) and output is generated in two sections  called  per-target  and
       per-domain  output. In per-target output, by default, all sequence hits
       with an E-value <= 10 are reported. In the per-domain output, for  each
       target  that  has  passed  per-target reporting thresholds, all domains
       satisfying per-domain reporting thresholds are  reported.  By  default,
       these are domains with conditional E-values of <= 10. The following op-
       tions  allow you	to change the default E-value reporting	thresholds, or
       to use bit score	thresholds instead.

       -E <x> Report sequences with E-values <=	<x>  in	 per-sequence  output.
	      The default is 10.0.

       -T <x> Use  a bit score threshold for per-sequence output instead of an
	      E-value threshold	(any setting of	-E  is	ignored).  Report  se-
	      quences  with  a bit score of >= <x>.  By	default	this option is
	      unset.

       -Z <x> Declare the total	size of	the database to	be <x> sequences,  for
              purposes of E-value calculation.  Normally E-values are
              calculated relative to the size of the database you actually
              searched (i.e. the number of sequences in seqdb).  In some cases
              (for instance, if you've split your target sequence database
              into multiple files for parallelization of your search), you may
              know better what the actual size of your search space is.

       --domE <x>
	      Report domains with conditional E-values <=  <x>	in  per-domain
	      output,  in  addition  to	the top-scoring	domain per significant
	      sequence hit. The	default	is 10.0.

       --domT <x>
	      Use a bit	score threshold	for per-domain output instead of an E-
	      value threshold (any setting of --domE is	ignored).  Report  do-
	      mains  with a bit	score of >= <x>	in per-domain output, in addi-
	      tion to the top-scoring domain per significant sequence hit.  By
	      default this option is unset.

       --domZ <x>
              Declare the number of significant sequences to be <x>, for
              purposes of conditional E-value calculation for additional
              domain significance.  Normally conditional E-values are
              calculated relative to the number of sequences passing the
              per-sequence reporting threshold.
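
       To see why -Z matters when a database is split, the Python sketch
       below rescales an E-value from one search-space size to another. It
       assumes the E-value is proportional to the number of target
       sequences (E = P-value x Z), which is the usual reading of
       "calculated relative to the size of the database". The function
       name is hypothetical.

```python
def rescale_evalue(evalue, searched_size, declared_size):
    """Convert an E-value computed for searched_size targets into the
    E-value it would have for declared_size targets, assuming E is
    proportional to the number of targets."""
    if searched_size <= 0 or declared_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * declared_size / searched_size


# A hit with E = 0.5 in one chunk of 1000 sequences, where the full
# database holds 4000 sequences:
print(rescale_evalue(0.5, searched_size=1000, declared_size=4000))  # 2.0
```

       Passing -Z with the full database size to each per-chunk search
       should give the same effect directly, without post-hoc rescaling.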

OPTIONS	CONTROLLING INCLUSION THRESHOLDS
       Inclusion thresholds control which hits are included  in	 the  multiple
       alignment  and  profile	constructed for	the next search	iteration.  By
       default, a sequence must have a per-sequence E-value of <= 0.001 (see
       the --incE option) to be included, and any additional domains in it
       besides the top-scoring one must have a conditional E-value of <= 0.001
       (see the --incdomE option). The difference between reporting thresholds
       and inclusion thresholds is that inclusion thresholds control which
       hits actually get used in the next iteration (or the final output
       multiple alignment if the -A option is used), whereas reporting
       thresholds control what you see in output. Reporting thresholds are
       generally looser so you can see borderline hits at the top of the noise
       that might be of interest.

       --incE <x>
              Include sequences with E-values <= <x> in the subsequent
              iteration or in the final alignment output by -A.  The default
              is 0.001.

       --incT <x>
	      Use  a bit score threshold for per-sequence inclusion instead of
	      an E-value threshold (any	setting	of --incE is ignored). Include
	      sequences	with a bit score of >= <x>.  By	default	this option is
	      unset.

       --incdomE <x>
              Include domains with conditional E-values <= <x> in the
              subsequent iteration or in the final alignment output by -A, in
              addition to the top-scoring domain per significant sequence hit.
              The default is 0.001.

       --incdomT <x>
	      Use a bit	score threshold	for per-domain inclusion instead of an
	      E-value threshold	(any setting of	--incdomE is ignored). Include
	      domains with a bit score of >= <x>.  By default this  option  is
	      unset.
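
       The interplay of reporting and inclusion thresholds for a sequence
       hit can be sketched in Python as follows. The function names are
       hypothetical, and the sketch assumes that only reported hits are
       eligible for inclusion; this page does not state that explicitly.
       As described above, a bit score threshold, when given, replaces the
       corresponding E-value threshold.

```python
def passes(evalue, score, e_threshold, t_threshold=None):
    """Apply a bit score threshold if one is set, else the E-value one."""
    if t_threshold is not None:
        return score >= t_threshold
    return evalue <= e_threshold


def classify_sequence_hit(evalue, score, E=10.0, T=None,
                          incE=0.001, incT=None):
    """Classify one per-sequence hit using the default thresholds."""
    if not passes(evalue, score, E, T):
        return "not reported"
    # Assumption: inclusion is only considered for reported hits.
    if passes(evalue, score, incE, incT):
        return "reported and included"
    return "reported only"


print(classify_sequence_hit(1e-20, 80.0))   # reported and included
print(classify_sequence_hit(0.5, 12.0))     # reported only
print(classify_sequence_hit(50.0, 2.0))     # not reported
```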

OPTIONS	CONTROLLING ACCELERATION HEURISTICS
       HMMER3  searches	 are  accelerated in a three-step filter pipeline: the
       MSV filter, the Viterbi filter, and the Forward filter. The first  fil-
       ter  is	the fastest and	most approximate; the last is the full Forward
       scoring algorithm, slowest but most accurate. There is also a bias fil-
       ter step	between	MSV and	Viterbi. Targets that pass all	the  steps  in
       the  acceleration  pipeline are then subjected to postprocessing	-- do-
       main identification and scoring using the Forward/Backward algorithm.

       Essentially the only free parameters that control HMMER's heuristic
       filters are the P-value thresholds controlling the expected fraction of
       nonhomologous sequences that pass the filters. Setting the thresholds
       higher than the defaults passes a higher proportion of nonhomologous
       sequences, increasing sensitivity at the expense of speed; conversely,
       setting lower P-value thresholds passes a smaller proportion,
       decreasing sensitivity and increasing speed. Setting a filter's P-value
       threshold to 1.0 means it passes all sequences, which effectively
       disables the filter.

       Changing	 filter	 thresholds only removes or includes targets from con-
       sideration; changing filter thresholds does not alter  bit  scores,  E-
       values,	or  alignments,	all of which are determined solely in postpro-
       cessing.
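
       Because each threshold is the expected fraction of nonhomologous
       targets passing that step, the workload reaching each stage can be
       estimated roughly. The Python sketch below uses the default values
       of --F1, --F2 and --F3 listed in this section; it ignores the bias
       filter and true homologs, and the function name is hypothetical.

```python
def expected_nonhomologs_passing(n_targets, f1=0.02, f2=1e-3, f3=1e-5):
    """Rough expected number of nonhomologous targets passing each
    filter, taking each P-value threshold at face value."""
    return {
        "MSV": n_targets * f1,
        "Viterbi": n_targets * f2,
        "Forward": n_targets * f3,
    }


# For a database of one million unrelated sequences:
for step, count in expected_nonhomologs_passing(1_000_000).items():
    print(step, round(count))
```

       With the defaults, roughly 2% of such a database survives the MSV
       filter and only a handful of nonhomologs reach postprocessing.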

       --max  Maximum sensitivity.  Turn off all filters, including  the  bias
	      filter,  and  run	 full Forward/Backward postprocessing on every
	      target. This increases sensitivity slightly, at a	large cost  in
	      speed.

       --F1 <x>
	      First  filter  threshold;	 set the P-value threshold for the MSV
	      filter step.  The	default	is 0.02, meaning that  roughly	2%  of
	      the  highest  scoring nonhomologous targets are expected to pass
	      the filter.

       --F2 <x>
	      Second filter threshold;	set  the  P-value  threshold  for  the
	      Viterbi filter step.  The	default	is 0.001.

       --F3 <x>
	      Third  filter  threshold;	set the	P-value	threshold for the For-
	      ward filter step.	 The default is	1e-5.

       --nobias
	      Turn off the bias	filter.	This increases	sensitivity  somewhat,
	      but  can	come  at a high	cost in	speed, especially if the query
	      has biased residue composition (such as  a  repetitive  sequence
	      region, or if it is a membrane protein with large	regions	of hy-
	      drophobicity).  Without  the bias	filter,	too many sequences may
	      pass the filter with biased queries, leading to slower than  ex-
	      pected   performance   as	 the  computationally  intensive  For-
	      ward/Backward algorithms shoulder	an abnormally heavy load.

OPTIONS	CONTROLLING PROFILE CONSTRUCTION (LATER	ITERATIONS)
       jackhmmer always	includes your original query sequence in the alignment
       result at every iteration, and consensus	positions are  always  defined
       by that query sequence. That is,	a jackhmmer profile is always the same
       length as your original query, at every iteration.  Therefore jackhmmer
       gives you less control over profile construction	than hmmbuild does; it
       does  not  have	the --fast, or --hand, or --symfrac options.  The only
       profile construction option available in	jackhmmer is --fragthresh:

       --fragthresh <x>
	      We only want to count terminal gaps as deletions if the  aligned
	      sequence	is  known  to  be full-length, not if it is a fragment
	      (for instance, because only part of  it  was  sequenced).	 HMMER
	      uses  a simple rule to infer fragments: if the sequence length L
	      is less than or equal to a  fraction  <x>	 times	the  alignment
	      length  in  columns, then	the sequence is	handled	as a fragment.
	      The default is 0.5.  Setting --fragthresh	0 will define no (non-
	      empty) sequence as a fragment; you might want to do this if  you
	      know you've got a	carefully curated alignment of full-length se-
	      quences.	 Setting  --fragthresh	1 will define all sequences as
	      fragments; you might want	to do this if you know your  alignment
	      is  entirely  composed  of  fragments,  such as translated short
	      reads in metagenomic shotgun data.
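
       The fragment rule described for --fragthresh can be written as a
       one-line test. This Python sketch follows the wording above; the
       function and argument names are hypothetical, and seq_len is taken
       to be the number of residues in the sequence, not counting gaps.

```python
def is_fragment(seq_len, alignment_columns, fragthresh=0.5):
    """Return True if a sequence of seq_len residues is handled as a
    fragment in an alignment with alignment_columns columns."""
    return seq_len <= fragthresh * alignment_columns


print(is_fragment(40, 100))                  # True
print(is_fragment(51, 100))                  # False
print(is_fragment(1, 100, fragthresh=0))     # False: no nonempty fragment
print(is_fragment(100, 100, fragthresh=1))   # True: everything is one
```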

OPTIONS	CONTROLLING RELATIVE WEIGHTS
       Whenever	a profile is built from	a multiple alignment, HMMER uses an ad
       hoc sequence weighting algorithm	 to  downweight	 closely  related  se-
       quences	and  upweight  distantly  related ones.	This has the effect of
       making models less biased by uneven  phylogenetic  representation.  For
       example,	 two identical sequences would typically each receive half the
       weight that one sequence	would (and this	is why	jackhmmer  isn't  con-
       cerned  about always including your original query sequence in each it-
       eration's alignment, even if it finds it	again in the  database	you're
       searching). These options control which algorithm gets used.
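
       The claim about identical sequences can be checked with a small
       Python sketch of position-based weighting in the style of Henikoff
       and Henikoff. This is a simplified illustration, not HMMER's actual
       implementation: skipping gap characters and normalizing the weights
       to sum to the number of sequences are assumptions of the sketch.

```python
from collections import Counter


def position_based_weights(aligned_seqs, gap_chars="-."):
    """Simplified position-based weights for equal-length aligned rows."""
    n = len(aligned_seqs)
    raw = [0.0] * n
    for column in zip(*aligned_seqs):
        counts = Counter(c for c in column if c not in gap_chars)
        k = len(counts)  # number of distinct residues in this column
        for i, c in enumerate(column):
            if c not in gap_chars:
                # A residue shared by many sequences earns each less credit.
                raw[i] += 1.0 / (k * counts[c])
    total = sum(raw)
    if total == 0:
        return [1.0] * n
    # Normalize so that the weights sum to the number of sequences.
    return [w * n / total for w in raw]


# Two identical sequences and one unrelated sequence:
print(position_based_weights(["ACDE", "ACDE", "FGHI"]))  # [0.75, 0.75, 1.5]
```

       Each of the two identical sequences receives half the weight of the
       unrelated one, so a duplicated query does not dominate the profile.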

       --wpb  Use   the	 Henikoff  position-based  sequence  weighting	scheme
	      [Henikoff	and Henikoff, J. Mol. Biol. 243:574, 1994].   This  is
	      the default.

       --wgsc Use  the	Gerstein/Sonnhammer/Chothia  weighting algorithm [Ger-
	      stein et al, J. Mol. Biol. 235:1067, 1994].

       --wblosum
	      Use the same clustering scheme that was used to weight  data  in
	      calculating BLOSUM substitution matrices [Henikoff and Henikoff,
	      Proc.  Natl.  Acad.  Sci	89:10915, 1992]. Sequences are single-
	      linkage clustered	at an identity threshold  (default  0.62;  see
	      --wid)  and  within  each	 cluster of c sequences, each sequence
	      gets relative weight 1/c.

       --wnone
	      No relative weights. All sequences are assigned uniform weight.

       --wid <x>
	      Sets the identity	threshold used	by  single-linkage  clustering
	      when  using --wblosum.  Invalid with any other weighting scheme.
	      Default is 0.62.
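
       The --wblosum scheme can be sketched in Python as follows. The
       clustering and the 1/c weights follow the description above; the
       definition of pairwise identity (identical residues divided by
       columns where both sequences have a residue) and the use of >= at
       the threshold are assumptions of this sketch and may differ from
       HMMER's implementation.

```python
def pairwise_identity(a, b, gap_chars="-."):
    """Fractional identity over columns where both rows have residues."""
    matches = aligned = 0
    for x, y in zip(a, b):
        if x in gap_chars or y in gap_chars:
            continue
        aligned += 1
        if x == y:
            matches += 1
    return matches / aligned if aligned else 0.0


def blosum_weights(aligned_seqs, wid=0.62):
    """Single-linkage cluster at identity >= wid; weight each row 1/c."""
    n = len(aligned_seqs)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(aligned_seqs[i], aligned_seqs[j]) >= wid:
                parent[find(i)] = find(j)  # merge the two clusters

    roots = [find(i) for i in range(n)]
    sizes = {}
    for r in roots:
        sizes[r] = sizes.get(r, 0) + 1
    return [1.0 / sizes[r] for r in roots]


seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
print(blosum_weights(seqs))  # [0.5, 0.5, 1.0]
```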

OPTIONS	CONTROLLING EFFECTIVE SEQUENCE NUMBER
       After relative weights are determined, they are normalized to sum to  a
       total  effective	sequence number, eff_nseq.  This number	may be the ac-
       tual number of sequences	in the alignment,  but	it  is	almost	always
       smaller	than  that.  The default entropy weighting method (--eent) re-
       duces the effective sequence number to reduce the  information  content
       (relative entropy, or average expected score on true homologs) per con-
       sensus position.	The target relative entropy is controlled by a two-pa-
       rameter	function, where	the two	parameters are settable	with --ere and
       --esigma.
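
       The normalization step itself is simple to illustrate. In the
       Python sketch below, relative weights are rescaled so that they sum
       to a chosen effective sequence number. How eff_nseq is chosen under
       --eent is not shown; the function name and the example values are
       hypothetical.

```python
def scale_to_effective_number(relative_weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(relative_weights)
    if total <= 0:
        raise ValueError("relative weights must have a positive sum")
    return [w * eff_nseq / total for w in relative_weights]


# Three sequences whose weights are squeezed to an effective number of 1.5:
scaled = scale_to_effective_number([0.75, 0.75, 1.5], eff_nseq=1.5)
print(scaled, sum(scaled))
```

       The ratios between sequences are preserved; only the total amount
       of evidence counted toward the profile changes.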

       --eent Adjust effective sequence	number to achieve a specific  relative
	      entropy per position (see	--ere).	 This is the default.

       --eclust
	      Set  effective  sequence	number to the number of	single-linkage
	      clusters at a specific identity threshold	(see --eid).  This op-
	      tion is not recommended; it's  for  experiments  evaluating  how
	      much better --eent is.

       --enone
	      Turn  off	 effective  sequence number determination and just use
	      the actual number	of sequences. One reason you might want	to  do
	      this is to try to	maximize the relative entropy/position of your
	      model, which may be useful for short models.

       --eset <x>
	      Explicitly  set  the effective sequence number for all models to
	      <x>.

       --ere <x>
	      Set the minimum relative entropy/position	target	to  <x>.   Re-
	      quires  --eent.	Default	 depends on the	sequence alphabet; for
	      protein sequences, it is 0.59 bits/position.

       --esigma	<x>
	      Sets the minimum relative	entropy	contributed by an entire model
	      alignment, over its whole	length.	This has the effect of	making
	      short  models  have  higher  relative  entropy per position than
	      --ere alone would	give. The default is 45.0 bits.

       --eid <x>
	      Sets the fractional pairwise  identity  cutoff  used  by	single
	      linkage  clustering  with	 the  --eclust	option.	The default is
	      0.62.

OPTIONS	CONTROLLING PRIORS
       In profile construction,	by default, weighted counts are	 converted  to
       mean  posterior probability parameter estimates using mixture Dirichlet
       priors.	Default	mixture	Dirichlet prior	parameters for protein	models
       and  for	 nucleic acid (RNA and DNA) models are built in. The following
       options allow you to override the default priors.

       --pnone
	      Don't use	any priors. Probability	parameters will	simply be  the
	      observed frequencies, after relative sequence weighting.

       --plaplace
	      Use a Laplace +1 prior in	place of the default mixture Dirichlet
	      prior.
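
       The two alternatives to the default prior are easy to state
       explicitly. The Python sketch below converts weighted counts for
       one distribution into probabilities, either as plain observed
       frequencies (as with --pnone) or with one pseudocount added to each
       outcome (as with --plaplace). The function name and the
       four-symbol example are hypothetical; the default mixture Dirichlet
       prior is not implemented here.

```python
def estimate_probabilities(weighted_counts, prior="laplace"):
    """Turn weighted counts into probability parameter estimates."""
    k = len(weighted_counts)
    total = sum(weighted_counts)
    if prior == "none":
        if total == 0:
            raise ValueError("no counts observed; frequencies undefined")
        return [c / total for c in weighted_counts]
    if prior == "laplace":
        # Add one pseudocount per possible outcome.
        return [(c + 1.0) / (total + k) for c in weighted_counts]
    raise ValueError("prior must be 'none' or 'laplace'")


counts = [3.0, 1.0, 0.0, 0.0]
print(estimate_probabilities(counts, prior="none"))     # [0.75, 0.25, 0.0, 0.0]
print(estimate_probabilities(counts, prior="laplace"))  # [0.5, 0.25, 0.125, 0.125]
```

       Without a prior, unobserved residues get probability zero, which is
       one reason priors are used by default.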

OPTIONS	CONTROLLING E-VALUE CALIBRATION
       Estimating the location parameters for the expected score distributions
       for  MSV	 filter	 scores, Viterbi filter	scores,	and Forward scores re-
       quires three short random sequence simulations.

       --EmL <n>
	      Sets the sequence	length in simulation that estimates the	 loca-
	      tion parameter mu	for MSV	filter E-values. Default is 200.

       --EmN <n>
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter mu for	MSV filter E-values. Default is	200.

       --EvL <n>
	      Sets the sequence	length in simulation that estimates the	 loca-
	      tion parameter mu	for Viterbi filter E-values. Default is	200.

       --EvN <n>
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter mu for	Viterbi	filter	E-values.  Default  is
	      200.

       --EfL <n>
	      Sets  the	sequence length	in simulation that estimates the loca-
	      tion parameter tau for Forward E-values. Default is 100.

       --EfN <n>
	      Sets the number of sequences in simulation  that	estimates  the
	      location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

OTHER OPTIONS
       --nonull2
	      Turn off the null2 score corrections for biased composition.

       -Z <x> Assert that the total number of targets in your searches is <x>,
	      for the purposes of per-sequence	E-value	 calculations,	rather
	      than the actual number of	targets	seen.

       --domZ <x>
	      Assert that the total number of targets in your searches is <x>,
	      for the purposes of per-domain conditional E-value calculations,
	      rather  than  the	 number	 of  targets that passed the reporting
	      thresholds.

       --seed <n>
	      Seed the random number generator with <n>, an integer >= 0.   If
	      <n>  is >0, any stochastic simulations will be reproducible; the
	      same command will	give the same results.	If <n> is 0, the  ran-
	      dom number generator is seeded arbitrarily, and stochastic simu-
	      lations  will vary from run to run of the	same command.  The de-
	      fault seed is 42.

       --qformat <s>
	      Assert that input	query seqfile is in format <s>,	bypassing for-
	      mat autodetection.  Common choices for <s> include: fasta, embl,
	      genbank.	Alignment formats also work; common  choices  include:
	      stockholm,  a2m,	afa, psiblast, clustal,	phylip.	 jackhmmer al-
	      ways uses	a single sequence query	to start its search,  so  when
	      the  input  seqfile  is an alignment, jackhmmer reads it one un-
	      aligned query sequence at	a time,	not as an alignment.  For more
              information, and for codes for some less common formats, see
              the main documentation.  The string <s> is case-insensitive
              (fasta or FASTA both work).

       --tformat <s>
	      Assert that the input target sequence seqdb is  in  format  <s>.
	      See --qformat above for accepted choices for <s>.

       --cpu <n>
	      Set  the number of parallel worker threads to <n>.  On multicore
	      machines,	the default is 2.  You can also	control	this number by
	      setting an environment variable, HMMER_NCPU.  There  is  also  a
	      master thread, so	the actual number of threads that HMMER	spawns
	      is <n>+1.

	      This  option  is	not available if HMMER was compiled with POSIX
	      threads support turned off.

       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker processes. Send a SIGCONT signal to release
              the pause.  (Under gdb: (gdb) signal SIGCONT)  (Only available
              if optional MPI support was enabled at compile-time.)

       --mpi  Run  under MPI control with master/worker	parallelization	(using
	      mpirun, for example, or equivalent). Only	available if  optional
	      MPI support was enabled at compile-time.

SEE ALSO
       See  hmmer(1)  for  a master man	page with a list of all	the individual
       man pages for programs in the HMMER package.

       For complete documentation, see the user guide that came with your
       HMMER distribution (Userguide.pdf); or see the HMMER web page
       (http://hmmer.org/).

COPYRIGHT
       Copyright (C) 2023 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For  additional	information  on	 copyright and licensing, see the file
       called COPYRIGHT	in your	HMMER source distribution, or  see  the	 HMMER
       web page	(http://hmmer.org/).

AUTHOR
       http://eddylab.org

HMMER 3.4			   Aug 2023			  jackhmmer(1)