FreeBSD Manual Pages

hmmsim(1)			 HMMER Manual			     hmmsim(1)

NAME
       hmmsim -	collect	profile	score distributions on random sequences

SYNOPSIS
       hmmsim [options]	hmmfile

DESCRIPTION
       The  hmmsim  program  generates	random sequences, scores them with the
       model(s)	in hmmfile, and	outputs	various	sorts  of  histograms,	plots,
       and fitted distributions	for the	resulting scores.

       hmmsim  is  not	a  mainstream part of the HMMER	package	and most users
       would have no reason to use it. It is used to develop and test the sta-
       tistical	methods	used to	determine P-values and E-values	in HMMER3. For
       example,	it was used to generate	most of	the results in a 2008 paper on
       H3's local  alignment  statistics  (PLoS	 Comp  Bio  4:e1000069,	 2008;
       http://www.ploscompbiol.org/doi/pcbi.1000069).

       Because it is a research testbed, you should not expect it to be as
       robust as other programs in the package. For example, options may
       interact in unexpected ways; we have neither tested nor tried to
       anticipate all possible combinations.

       The main task is to fit a maximum likelihood Gumbel distribution to
       Viterbi scores, or a maximum likelihood exponential tail to
       high-scoring Forward scores, and to test that these fitted
       distributions obey the conjecture that lambda ~ log_2 (the natural
       logarithm of 2, about 0.693) for both the Viterbi Gumbel and the
       Forward exponential tail.

       The  output is a	table of numbers, one row for each model. Four differ-
       ent parametric fits to the score	data are tested: (1)  maximum  likeli-
       hood  fits to both location (mu/tau) and	slope (lambda) parameters; (2)
       assuming	lambda=log_2, maximum likelihood fit to	the location parameter
       only; (3) same but assuming an  edge-corrected  lambda,	using  current
       procedures in H3	[Eddy, 2008]; and (4) using both parameters determined
       by  H3's	 current procedures. The standard simple, quick	and dirty sta-
       tistic for goodness-of-fit is 'E@10', the  calculated  E-value  of  the
       10th ranked top hit, which we expect to be about	10.
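
       As a rough illustration of the E@10 statistic, the following Python
       sketch computes the E-value of the 10th ranked score under a fitted
       Gumbel distribution, taking the E-value to be N times the survival
       probability P(S>x). The function names are hypothetical and are not
       part of HMMER, and the exponential tail case used for Forward scores
       is not shown.

```python
import math


def gumbel_survival(x, mu, lam):
    """P(S > x) for a Gumbel distribution with location mu and slope lam."""
    # Equivalent to 1 - exp(-exp(-lam * (x - mu))); expm1 keeps precision
    # when the survival probability is very small.
    return -math.expm1(-math.exp(-lam * (x - mu)))


def e_value_at_rank(scores, mu, lam, rank=10):
    """E-value of the rank-th highest score among N = len(scores)."""
    if not 1 <= rank <= len(scores):
        raise ValueError("rank must be between 1 and the number of scores")
    ranked = sorted(scores, reverse=True)
    return len(scores) * gumbel_survival(ranked[rank - 1], mu, lam)
```

       If mu and lambda describe the score distribution well, the value
       returned for rank 10 should be roughly 10, up to sampling noise.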

       In detail, the columns of the output are:

       name   Name of the model.

       tailp  Fraction of the highest scores used to fit the distribution. For
	      Viterbi,	MSV, and Hybrid	scores,	this defaults to 1.0 (a	Gumbel
	      distribution is fitted to	all the	 data).	 For  Forward  scores,
	      this  defaults  to  0.02	(an  exponential tail is fitted	to the
	      highest 2% scores).

       mu/tau Location parameter for the maximum likelihood fit	to the data.

       lambda Slope parameter for the maximum likelihood fit to	the data.

       E@10   The E-value calculated for the 10th ranked high score ('E@10')
	      using the ML mu/tau and lambda. By definition, this is expected
	      to be about 10 if E-value estimation is accurate.

       mufix  Location parameter, for a	maximum	likelihood fit	with  a	 known
	      (fixed) slope parameter lambda of	log_2 (0.693).

       E@10fix
	      The E-value calculated for the 10th ranked score using mufix and
	      the expected lambda = log_2 = 0.693.

       mufix2 Location	parameter,  for	a maximum likelihood fit with an edge-
	      effect-corrected lambda.

       E@10fix2
	      The E-value calculated for the 10th ranked  score	 using	mufix2
	      and the edge-effect-corrected lambda.

       pmu    Location parameter as determined by H3's estimation procedures.

       plambda
	      Slope parameter as determined by H3's estimation procedures.

       pE@10  The  E-value  calculated	for  the  10th ranked score using pmu,
	      plambda.
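
       The mufix column above is a location fit with the slope held fixed.
       For a Gumbel distribution, setting the derivative of the log
       likelihood with respect to mu to zero gives a closed form, sketched
       below in Python. This is an illustration of the standard estimator,
       not HMMER's own code; the function name is hypothetical.

```python
import math


def gumbel_fit_location(scores, lam=math.log(2)):
    """ML estimate of the Gumbel location mu when the slope lam is known.

    Solves d(log L)/d(mu) = 0, which gives
        mu = -(1/lam) * log((1/N) * sum_i exp(-lam * x_i)).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    # Shift by the minimum score so every exponent is <= 0 (no overflow).
    x0 = min(scores)
    total = sum(math.exp(-lam * (x - x0)) for x in scores)
    return x0 - math.log(total / n) / lam
```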

       At the end of this table, one more line is printed, starting with # and
       summarizing the overall CPU time	used by	the simulations.
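
       A minimal Python sketch for reading this table is shown below. It
       assumes that fields are separated by whitespace, that model names
       contain no spaces, and that the columns appear in the order listed
       above; none of this is guaranteed by this page. Any additional
       fields (such as those added by -a) are kept in an "extra" list.

```python
COLUMNS = ["name", "tailp", "mu/tau", "lambda", "E@10", "mufix", "E@10fix",
           "mufix2", "E@10fix2", "pmu", "plambda", "pE@10"]


def parse_hmmsim_table(text):
    """Return one dict per data row, skipping blank and '#' lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < len(COLUMNS):
            raise ValueError("too few fields in row: " + line)
        row = {"name": fields[0]}
        # All columns after the model name are numeric.
        for key, value in zip(COLUMNS[1:], fields[1:]):
            row[key] = float(value)
        row["extra"] = [float(v) for v in fields[len(COLUMNS):]]
        rows.append(row)
    return rows
```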

       Some of the optional output files are in	xmgrace	xy format. xmgrace  is
       powerful	and freely available graph-plotting software.

OPTIONS
       -h     Help;  print  a  brief  reminder	of  command line usage and all
	      available	options.

       -a     Collect expected Viterbi alignment length	statistics  from  each
	      simulated	sequence. This only works with Viterbi scores (the de-
	      fault;  see  --vit).   Two  additional fields are	printed	in the
	      output table for each model: the mean length of  Viterbi	align-
	      ments, and the standard deviation.

       -v     (Verbose). Print the scores too, one score per line.

       -L <n> Set the length of	the randomly sampled (nonhomologous) sequences
	      to <n>.  The default is 100.

       -N <n> Set  the	number	of randomly sampled sequences to <n>.  The de-
	      fault is 1000.

       --mpi  Run under	MPI control with master/worker parallelization	(using
	      mpirun,  for example, or equivalent). Only available if optional
	      MPI support was enabled at compile-time.

	      It is parallelized at the	level of sending one profile at	a time
	      to an MPI	worker process,	so parallelization only	helps  if  you
	      have  more than one profile in the hmmfile, and you want to have
	      at least as many profiles	as MPI worker processes.

OPTIONS	CONTROLLING OUTPUT
       -o <f> Save the main output table to a file <f> rather than sending  it
	      to stdout.

       --afile <f>
	      When  collecting	Viterbi	 alignment statistics (the -a option),
	      for each sampled sequence, output	two fields per line to a  file
	      <f>:  the	 length	 of the	optimal	alignment, and the Viterbi bit
	      score.  Requires that the	-a option is also used.

       --efile <f>
	      Output a rank vs.	E-value	plot in	XMGRACE	xy format to file <f>.
	      The x-axis is the	rank of	this sequence, from highest  score  to
	      lowest;  the y-axis is the E-value calculated for	this sequence.
	      E-values are calculated using H3's default procedures (i.e.  the
	      pmu, plambda parameters in the output table). You	expect a rough
	      match  between rank and E-value if E-values are accurately esti-
	      mated.

       --ffile <f>
	      Output a "filter power" file to <f>: for each model, a line with
	      three fields: model name,	number of  sequences  passing  the  P-
	      value  threshold,	 and fraction of sequences passing the P-value
	      threshold. See --pthresh	for  setting  the  P-value  threshold,
	      which defaults to	0.02 (the default MSV filter threshold in H3).
	      The  P-values  are as determined by H3's default procedures (the
	      pmu,plambda parameters in	the output table).  If	all  is	 well,
	      you  expect  to  see filter power	equal to the predicted P-value
	      setting of the threshold.

       --pfile <f>
	      Output cumulative	survival plots (P(S>x))	to file	<f> in XMGRACE
	      xy format. There are three plots:	(1) the	observed score distri-
	      bution; (2) the maximum likelihood fitted	 distribution;	(3)  a
	      maximum likelihood fit to the location parameter (mu/tau)
	      while assuming lambda=log_2.

       --xfile <f>
	      Output  the  bit	scores	as  a binary array of double-precision
	      floats (8	bytes per score) to file <f>.  Programs	 like  Easel's
	      esl-histplot  can	 read  such  binary files. This	is useful when
	      generating extremely large sample	sizes.
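
       A file written with --xfile can also be read with a few lines of
       Python, sketched below. This assumes the file was written on a
       machine with the same byte order as the one reading it, and that it
       contains nothing but the 8-byte scores; the function name is
       hypothetical.

```python
import array


def read_xfile(path):
    """Read a raw binary array of double-precision scores into a list."""
    scores = array.array("d")  # native byte order, 8 bytes per item
    with open(path, "rb") as fh:
        data = fh.read()
    if len(data) % scores.itemsize != 0:
        raise ValueError("file size is not a multiple of 8 bytes")
    scores.frombytes(data)
    return scores.tolist()
```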

OPTIONS	CONTROLLING MODEL CONFIGURATION	(MODE)
       H3 only uses multihit local alignment (--fs mode), and this is the
       mode where we trust the statistical fits. Unihit local alignment
       scores (Smith/Waterman; --sw mode) also obey our statistical
       conjectures. Glocal alignment statistics (either multihit or unihit)
       are still neither adequately understood nor adequately fitted.

       --fs   Collect multihit local alignment scores. This  is	 the  default.
	      "fs" comes from HMMER2's historical terminology for multihit lo-
	      cal alignment as 'fragment search	mode'.

       --sw   Collect  unihit  local  alignment	scores.	The H3 J state is dis-
	      abled.  "sw" comes from HMMER2's historical terminology for uni-
	      hit local	alignment as 'Smith/Waterman search mode'.

       --ls   Collect multihit glocal alignment scores. In glocal
	      (global/local) alignment, the entire model must align to a
	      subsequence of the target. The H3 local entry/exit transition
	      probabilities are disabled. 'ls' comes from HMMER2's historical
	      terminology for multihit glocal alignment as 'local search
	      mode'.

       --s    Collect unihit glocal alignment scores.  Both the	H3 J state and
	      local entry/exit	transition  probabilities  are	disabled.  's'
	      comes  from  HMMER2's  historical	 terminology for unihit	glocal
	      alignment.

OPTIONS	CONTROLLING SCORING ALGORITHM
       --vit  Collect Viterbi maximum likelihood alignment scores. This	is the
	      default.

       --fwd  Collect Forward log-odds likelihood scores, summed  over	align-
	      ment ensemble.

       --hyb  Collect 'Hybrid' scores, as described in papers by Yu and Hwa
	      (for instance, Bioinformatics 18:864, 2002). These involve
	      calculating a Forward matrix and taking the maximum cell value.
	      The number itself is statistically somewhat unmotivated, but
	      the distribution is expected to be a well-behaved extreme value
	      distribution (Gumbel).

       --msv  Collect MSV (multiple ungapped segment  Viterbi)	scores,	 using
	      H3's main	acceleration heuristic.

       --fast For  any of the above options, use H3's optimized	production im-
	      plementation (using SIMD vectorization). The default is  to  use
	      the  "generic" implementation (slow and non-vectorized). The op-
	      timized implementations sacrifice	a small	 amount	 of  numerical
	      precision. This can introduce confounding	noise into statistical
	      simulations and fits, so when one	gets super-concerned about ex-
	      act  details,  it's  better  to be able to factor	that source of
	      noise out.

OPTIONS	CONTROLLING FITTED TAIL	MASSES FOR FORWARD
       In some experiments, it was useful to fit Forward scores	to a range  of
       different  tail	masses,	 rather	than just one. These options provide a
       mechanism for fitting an	evenly-spaced range of different tail  masses.
       For each	different tail mass, a line is generated in the	output.

       --tmin <x>
	      Set  the lower bound on the tail mass distribution. (The default
	      is 0.02 for the default single tail mass.)

       --tmax <x>
	      Set the upper bound on the tail mass distribution. (The  default
	      is 0.02 for the default single tail mass.)

       --tpoints <n>
	      Set  the	number	of tail	masses to sample, starting from	--tmin
	      and ending at --tmax.  (The default is 1,	for the	 default  0.02
	      single tail mass.)

       --tlinear
	      Sample  a	 range of tail masses with uniform linear spacing. The
	      default is to use	uniform	logarithmic spacing.
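
       The two spacing schemes can be sketched in Python as follows. This
       assumes both endpoints are included and that the points are spaced
       evenly in either the tail mass or its logarithm; the exact formula
       hmmsim uses is not given here, and the function name is
       hypothetical.

```python
import math


def tail_masses(tmin=0.02, tmax=0.02, tpoints=1, linear=False):
    """Return tpoints tail masses from tmin to tmax, inclusive."""
    if tpoints < 1:
        raise ValueError("tpoints must be at least 1")
    if not 0.0 < tmin <= tmax <= 1.0:
        raise ValueError("need 0 < tmin <= tmax <= 1")
    if tpoints == 1:
        return [tmin]
    if linear:
        step = (tmax - tmin) / (tpoints - 1)
        return [tmin + i * step for i in range(tpoints)]
    # Default: evenly spaced in log(tail mass).
    log_min = math.log(tmin)
    log_step = (math.log(tmax) - log_min) / (tpoints - 1)
    return [math.exp(log_min + i * log_step) for i in range(tpoints)]
```

       For example, tail_masses(0.001, 0.1, 3) gives values close to
       0.001, 0.01, and 0.1, one per output line for each model.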

OPTIONS	CONTROLLING H3 PARAMETER ESTIMATION METHODS
       H3 uses three short random sequence simulations to estimate the
       location parameters for the expected score distributions for MSV
       scores, Viterbi scores, and Forward scores. These options allow
       these simulations to be modified.

       --EmL <n>
	      Sets the sequence	length in simulation that estimates the	 loca-
	      tion parameter mu	for MSV	E-values. Default is 200.

       --EmN <n>
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter mu for	MSV E-values. Default is 200.

       --EvL <n>
	      Sets the sequence	length in simulation that estimates the	 loca-
	      tion parameter mu	for Viterbi E-values. Default is 200.

       --EvN <n>
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter mu for	Viterbi	E-values. Default is 200.

       --EfL <n>
	      Sets the sequence	length in simulation that estimates the	 loca-
	      tion parameter tau for Forward E-values. Default is 100.

       --EfN <n>
	      Sets  the	 number	 of sequences in simulation that estimates the
	      location parameter tau for Forward E-values. Default is 200.

       --Eft <x>
	      Sets the tail mass fraction to fit in the simulation that
	      estimates the location parameter tau for Forward E-values.
	      Default is 0.04.

DEBUGGING OPTIONS
       --stall
	      For  debugging the MPI master/worker version: pause after	start,
	      to enable	the developer to attach	debuggers to the running  mas-
	      ter  and worker(s) processes. Send SIGCONT signal	to release the
	      pause.  (Under gdb: (gdb)	signal SIGCONT)	(Only available	if op-
	      tional MPI support was enabled at	compile-time.)

       --seed <n>
	      Set the random number seed to <n>.   The	default	 is  0,	 which
	      makes the	random number generator	use an arbitrary seed, so that
	      different	 runs  of hmmsim will almost certainly generate	a dif-
	      ferent statistical sample.  For debugging, it is useful to force
	      reproducible results, by fixing a	random number seed.

EXPERIMENTAL OPTIONS
       These options were used in a small variety of different exploratory ex-
       periments.

       --bgflat
	      Set the background residue distribution to a  uniform  distribu-
	      tion,  both  for	purposes of the	null model used	in calculating
	      scores, and for generating the random sequences. The default  is
	      to use a standard	amino acid background frequency	distribution.

       --bgcomp
	      Set  the background residue distribution to the mean composition
	      of the profile. This was used in exploring some of  the  effects
	      of biased	composition.

       --x-no-lengthmodel
	      Turn the H3 target sequence length model off. Set	the self-tran-
	      sitions  for  N,C,J  and the null	model to 350/351 instead; this
	      emulates HMMER2.	Not a good idea	in general. This was  used  to
	      demonstrate one of the main H2 vs. H3 differences.

       --nu <x>
	      Set the nu parameter for the MSV algorithm: the expected number
	      of ungapped local alignments per target sequence. The default
	      is 2.0, corresponding to an E->J transition probability of 0.5.
	      This was used to test whether varying nu has a significant
	      effect on the results (it doesn't seem to, within reason). This
	      option only works if --msv is selected (it only affects MSV),
	      and it will not work with --fast (because the optimized
	      implementations are hardwired to assume nu=2.0).
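
	      The relation between nu and the E->J probability can be
	      sketched in Python as below. This assumes the number of
	      ungapped segments is geometrically distributed, so that its
	      mean is 1/(1 - P(E->J)); that assumption reproduces the
	      stated 2.0 -> 0.5 correspondence but is not spelled out on
	      this page. The function name is hypothetical.

```python
def msv_exit_transitions(nu=2.0):
    """Return (P(E->J), P(E->C)) for an expected nu segments per target."""
    if nu < 1.0:
        raise ValueError("nu must be at least 1 (one segment per hit)")
    p_ej = (nu - 1.0) / nu  # loop back for another segment
    p_ec = 1.0 / nu         # leave the core model
    return p_ej, p_ec
```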

       --pthresh <x>
	      Set the filter P-value threshold to  use	in  generating	filter
	      power  files  with --ffile.  The default is 0.02 (which would be
	      appropriate for testing MSV scores, since	this  is  the  default
	      MSV  filter  threshold in	H3's acceleration pipeline.) Other ap-
	      propriate	 choices  (matching  defaults  in   the	  acceleration
	      pipeline)	would be 0.001 for Viterbi, and	1e-5 for Forward.
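
       The filter power reported with --ffile is a simple passing fraction,
       which can be sketched in Python as below. Whether a P-value exactly
       equal to the threshold counts as passing is an assumption here, and
       the function names are hypothetical. The binomial standard error
       gives a sense of how far the observed fraction can stray from the
       threshold by chance alone.

```python
import math


def filter_power(pvalues, pthresh=0.02):
    """Return (number passing, fraction passing) at the given threshold."""
    if not pvalues:
        raise ValueError("need at least one P-value")
    n_pass = sum(1 for p in pvalues if p <= pthresh)
    return n_pass, n_pass / len(pvalues)


def binomial_standard_error(pthresh, n):
    """Standard error of the passing fraction if P-values are accurate."""
    return math.sqrt(pthresh * (1.0 - pthresh) / n)
```

       With the default of 1000 sequences and a threshold of 0.02, the
       standard error is about 0.0044, so observed fractions within a few
       multiples of that are consistent with accurate P-values.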

SEE ALSO
       See  hmmer(1)  for  a master man	page with a list of all	the individual
       man pages for programs in the HMMER package.

       For complete documentation, see the user	guide that came	with your  HM-
       MER distribution	(Userguide.pdf); or see	the HMMER web page (http://hm-
       mer.org/).

COPYRIGHT
       Copyright (C) 2023 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For  additional	information  on	 copyright and licensing, see the file
       called COPYRIGHT	in your	HMMER source distribution, or  see  the	 HMMER
       web page	(http://hmmer.org/).

AUTHOR
       http://eddylab.org

HMMER 3.4			   Aug 2023			     hmmsim(1)