exonerate(1)		   sequence comparison tool		  exonerate(1)

NAME
       exonerate - a generic tool for sequence comparison

SYNOPSIS
       exonerate [ options ] <query path> <target path>

DESCRIPTION
       exonerate is a general tool for sequence	comparison.

       It  uses	the C4 dynamic programming library.  It	is designed to be both
       general and fast.  It can produce either	gapped or ungapped alignments,
       according to a variety of different alignment models.  The  C4  library
       allows  sequence	 alignment using a reduced space full dynamic program-
       ming implementation, but	also allows automated generation of heuristics
       from the	alignment models, using	bounded	sparse dynamic programming, so
       that these alignments may also be rapidly generated.  Alignments	gener-
       ated using these	heuristics will	represent a  valid  path  through  the
       alignment  model,  yet  (unlike the exhaustive alignments), the results
       are not guaranteed to be	optimal.

CONVENTIONS
       A number of conventions (and idiosyncrasies) are used within exonerate.
       An understanding of them facilitates interpretation of the output.

       Coordinates
	      An in-between coordinate system is used, where the positions are
	      counted between the symbols, rather than on the  symbols.	  This
	      numbering	 scheme	starts from zero.  This	numbering is shown be-
	      low for the sequence "ACGT":

	       A C G T
	      0	1 2 3 4

	      Hence the	 subsequence  "CG"  would  have	 start=1,  end=3,  and
	      length=2.	  This coordinate system is used internally in exoner-
	      ate, and for all the output formats produced with	the  exception
	      of  the  "human  readable"  alignment display and	the GFF	output
	      where convention and standards dictate otherwise.
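
	      As an illustration only, in-between coordinates behave like
	      Python slice indices.  The helper names below are hypothetical,
	      and the conversion assumes a 1-based inclusive convention such
	      as the one used by GFF:

```python
def subsequence(seq: str, start: int, end: int) -> str:
    """Return the symbols lying between in-between positions start and end."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError("coordinates out of range")
    # Position 0 is before the first symbol, len(seq) is after the last,
    # which is exactly how Python slicing counts.
    return seq[start:end]


def to_one_based_inclusive(start: int, end: int) -> tuple:
    """Convert in-between (start, end) to 1-based inclusive symbol positions."""
    return start + 1, end


print(subsequence("ACGT", 1, 3))      # CG
print(to_one_based_inclusive(1, 3))   # (2, 3)
```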

       Reverse Complements
	      When an alignment	is reported on the reverse complement of a se-
	      quence, the coordinates are simply given on the reverse  comple-
	      ment copy	of the sequence.  Hence	positions on the sequences are
	      never  negative.	 Generally, the	forward	strand is indicated by
	      '+', the reverse strand by '-', and an unknown or	not-applicable
	      strand (as in the	case of	a protein sequence)  is	 indicated  by
	      '.'
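
	      A minimal sketch of mapping such coordinates back to the
	      forward strand, assuming in-between coordinates given on the
	      reverse complement copy as described above.  The function names
	      are hypothetical and not part of exonerate:

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def revcomp_to_forward(start: int, end: int, length: int) -> tuple:
    """Map in-between coordinates on the reverse complement copy of a
    sequence of the given length onto the forward strand."""
    if not 0 <= start <= end <= length:
        raise ValueError("coordinates out of range")
    return length - end, length - start


seq = "AAACCG"
rc = reverse_complement(seq)                        # CGGTTT
fwd_start, fwd_end = revcomp_to_forward(0, 3, len(seq))
print(rc[0:3])                                      # CGG
print(fwd_start, fwd_end)                           # 3 6
print(reverse_complement(seq[fwd_start:fwd_end]))   # CGG
```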

       Alignment Scores
	      Currently, only the raw alignment scores are displayed.  This
	      score is simply the sum of the transition scores used in the
	      dynamic programming.  For example, in the case of a
	      Smith-Waterman alignment, this will be the sum of the
	      substitution matrix scores and the gap penalties.

GENERAL	OPTIONS
       Most  arguments	have  short  and  long forms.  The long	forms are more
       likely to be stable over	time, and hence	 should	 be  used  in  scripts
       which call exonerate.

       -h | --shorthelp	<boolean>
	      Show help.  This will display a concise summary of the available
	      options, defaults	and values currently set.

       --help <boolean>
	      This  shows  all	the  help  options including the defaults, the
	      value currently set, and the environment variable	which  may  be
	      used  to	set  each  parameter.	There will be an indication of
	      which options are	mandatory.  Mandatory options have no default,
	      and must have a value supplied for exonerate to run.  If	manda-
	      tory  options are	used in	order, their flags may be skipped from
	      the command line (see examples below).  Unlike  this  man	 page,
	      the  information from this option	will always be up to date with
	      the latest version of the	program.

       -v | --version <boolean>
	      Display the version number.   Also  displays  other  information
	      such as the build	date and glib version used.

SEQUENCE INPUT OPTIONS
       Pairwise	 comparisons will be performed between all query sequences and
       all target sequences.  Generally, for the best performance, shorter se-
       quences (eg. ESTs, shotgun reads, proteins) should be used as the query
       sequences, and longer sequences (eg. genomic sequences) should be  used
       as the target sequences.

       -q | --query  <paths>
	      Specify the query sequences required.  These must be in a FASTA
	      format file.  Single or multiple query sequences may be
	      supplied.  Additionally, multiple FASTA files may be supplied
	      following a single --query flag, or by using multiple --query
	      flags.

       -t | --target <paths>
	      Specify the target sequences required.  These must also be in a
	      FASTA format file.  As with the query sequences, single or
	      multiple target sequences and files may be supplied.  The target
	      filename may be replaced by a server name and port number in the
	      form hostname:port when using exonerate-server.  See the man page for
	      exonerate-server	for  more  information on running exonerate in
	      client:server mode.  NEW(v2.4.0):	multiple servers  may  now  be
	      used.   These  will  be  queried in parallel if you have set the
	      --cores option.  NEW(v2.4.0): If an input	file is	 not  a	 FASTA
	      format  file,  it	 is  assumed  to contain a list	of other fasta
	      files, directories or servers (one per line).

       -Q | --querytype	<dna | protein>
	      Specify the query	type to	use.  If this  is  not	supplied,  the
	      query  type  is assumed to be DNA	when the first sequence	in the
	      file contains more than 85% [ACGTN] bases.  Otherwise, it	is as-
	      sumed to be peptide.  This option	forces the query type as  some
	      nucleotide  and  peptide	sequences can fall either side of this
	      threshold.
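
	      The detection rule above can be sketched as follows.  This is
	      an illustration, not exonerate's code: the function name is
	      hypothetical, and treating lower-case bases as equivalent to
	      upper-case ones is an assumption.

```python
def guess_sequence_type(seq: str, threshold: float = 0.85) -> str:
    """Guess 'dna' when more than `threshold` of the symbols are A, C, G, T
    or N, otherwise 'protein'."""
    if not seq:
        raise ValueError("empty sequence")
    # Assumption: case is ignored when counting bases.
    dna_like = sum(1 for symbol in seq.upper() if symbol in "ACGTN")
    return "dna" if dna_like / len(seq) > threshold else "protein"


print(guess_sequence_type("ACGTACGTNN"))   # dna
print(guess_sequence_type("MKVLAAGIVG"))   # protein
```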

       -T | --targettype <dna |	protein>
	      Specify the  target  type	 to  use.   The	 same  as  --querytype
	      (above),	except	that it	applies	to the target.	Specifying the
	      sequence type will avoid the overhead  of	 having	 to  read  the
	      first  sequence  in the database twice (which may	be significant
	      with chromosome-sized sequences).

       --querychunkid <id>

       --querychunktotal <total>

       --targetchunkid <id>

       --targetchunktotal <total>
	      These options facilitate running exonerate on compute farms,
	      and  avoid  having  to  split  up	 sequence databases into small
	      chunks to	run on different nodes.	 If, for example,  you	wished
	      to  split	 the  target  database into three parts, you would run
	      three exonerate jobs on different	nodes including	the options:

	      --targetchunkid 1	--targetchunktotal 3
	      --targetchunkid 2	--targetchunktotal 3
	      --targetchunkid 3	--targetchunktotal 3
	      NB. The granularity offered by this option only goes down	 to  a
	      single sequence, so when there are more chunks than sequences in
	      the database, some processes will	do nothing.
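
	      A small sketch of generating one command line per chunk, for
	      submission to a job scheduler of your choice.  The file names
	      are placeholders and the helper is hypothetical.  Only options
	      documented on this page are used:

```python
def chunk_commands(query: str, target: str, total: int) -> list:
    """Build one exonerate argument list per target chunk (ids 1..total)."""
    if total < 1:
        raise ValueError("total must be at least 1")
    return [
        ["exonerate", "--query", query, "--target", target,
         "--targetchunkid", str(chunk_id),
         "--targetchunktotal", str(total)]
        for chunk_id in range(1, total + 1)
    ]


# Placeholder paths; each printed line is one job to run on a separate node.
for command in chunk_commands("query.fasta", "target.fasta", 3):
    print(" ".join(command))
```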

       -V | --verbose <int>
	      Be  verbose - show information about what	is going on during the
	      analysis.	 The default is	1 (little information),	the higher the
	      number given, the more information is printed.  To silence all
	      the default output from exonerate, use:

	      --verbose 0 --showalignment no --showvulgar no

ANALYSIS OPTIONS
       -E | --exhaustive <boolean>
	      Specify whether or not exhaustive	alignment should be used.   By
	      default,	this  is FALSE,	and alignment heuristics will be used.
	      If it is set to TRUE, an exhaustive  alignment  will  be	calcu-
	      lated.   This  requires  quadratic  time,	and will be much, much
	      slower, but will provide the optimal result for the given	model.
       -B | --bigseq <int>
	      Perform alignment of large (multi-megabase) sequences.  This is
	      very memory efficient and fast when both sequences are
	      chromosome-sized, but does not currently permit the use of a
	      word neighbourhood (ie. exactly matching seeds only).
       --revcomp <boolean>
	      Include comparison of the	reverse	complement of  the  query  and
	      target  where possible.  By default, this	option is enabled, but
	      when you know the	gene is	definitely on the  forward  strand  of
	      the  query  and  target, this option can halve the time taken to
	      compute alignments.
       --forcescan <none | query | target>
	      Force the	FSM to scan the	query sequence rather than the target.
	      This option is useful, for example, if you have a	 single	 piece
	      of genomic sequence and you wish to compare it to the whole of
	      dbEST.  By scanning the database,	rather	than  the  query,  the
	      analysis	will  be completed much	more quickly, as the overheads
	      of multiple query	FSM construction, multiple target reading  and
	      splice  site predictions will be removed.	 By default, exonerate
	      will guess the  optimal  strategy	 based	on  database  sequence
	      sizes.
       --saturatethreshold <number>
	      When  set	 to  zero,  this option	does nothing.  Otherwise, once
	      more than	this number of words (in addition to the expected num-
	      ber of words by chance) have matched a position  on  the	query,
	      the  position  on	 the  query  will  be 'numbed' (ignore further
	      matches) for the current pairwise	comparison.
       --customserver <command>
	      When using exonerate in client:server mode with  a  non-standard
	      server,  this command allows you to send a custom	command	to the
	      server.  This command is sent by the client  (exonerate)	before
	      any  other commands, and is provided as a	way of passing parame-
	      ters or other commands specific to the custom server.   See  the
	      exonerate-server	man page for more information on running exon-
	      erate in client:server mode.
       --cores <number>
	      The number of cores/CPUs/threads that  should  be	 used.	 On  a
	      multi-core or multi-CPU machine, increasing this amount allows
	      alignment	 computations  to  run	 in   parallel	 on   separate
	      CPUs/cores.   NB.	  Generally,  it  is better to parallelise the
	      analysis by splitting it up into separate	jobs, but this	option
	      may  prove  useful  for problems such as interactive single-gene
	      queries.

FASTA DATABASE OPTIONS
       --fastasuffix <extension>
	      If any of the inputs given with --query or --target are
	      directories, then exonerate will recursively descend these
	      directories, reading all files ending with this suffix as FASTA
	      format input.

GAPPED ALIGNMENT OPTIONS
       -m | --model <alignment model>
	      Specify the alignment model to use.  The models  currently  sup-
	      ported are:
	      ungapped
		     The simplest type of model, used by default.  An
		     appropriate model will be selected automatically for the
		     type of input sequences provided.
	      ungapped:trans
		     This ungapped model includes translation of all frames of
		     both  the query and target	sequences.  This is similar to
		     an	ungapped tblastx type search.
	      affine:global
		     This performs gapped global  alignment,  similar  to  the
		     Needleman-Wunsch  algorithm,  except  with	 affine	 gaps.
		     Global alignment requires	that  both  the	 sequences  in
		     their entirety are	included in the	alignment.
	      affine:bestfit
		     This  performs  a	best fit or best location alignment of
		     the query onto the	target sequence.  The entire query se-
		     quence will be included in	the alignment,	but  only  the
		     best location for its alignment on	the target sequence.
	      affine:local
		     This  is local alignment with affine gaps,	similar	to the
		     Smith-Waterman-Gotoh algorithm.  A	general-purpose	align-
		     ment algorithm.  As this is local alignment,  any	subse-
		     quence of the query and target sequence may appear	in the
		     alignment.
	      affine:overlap
		     This type of alignment finds the best overlap between the
		     query and target.	The overlap alignment must include the
		     start  of the query or target and the end of the query or
		     the target	sequence, to align sequences which overlap  at
		     the ends, or in the mid-section of a longer sequence.
		     This is the type of alignment frequently used in assembly
		     algorithms.
	      est2genome
		     This model	is similar to the affine:local model,  but  it
		     also  includes intron modelling on	the target sequence to
		     allow alignment of	spliced	to unspliced coding  sequences
		     for  both forward and reversed genes.  This is similar to
		     the alignment models used in programs such	as  EST_GENOME
		     and sim4.
	      ner    NERs are non-equivalenced regions - large regions in both
		     the  query	 and target which are not aligned.  This model
		     can be used for protein alignments	 where	strongly  con-
		     served  helix  regions  will  be aligned, but weakly con-
		     served loop regions are not.  Similarly, this model could
		     be	used to	look for co-linearly conserved regions in com-
		     parison of	genomic	sequences.
	      protein2dna
		     This model	compares a protein sequence to a DNA sequence,
		     incorporating all the appropriate gaps and	frameshifts.
	      protein2dna:bestfit
		     This is a bestfit version of the protein2dna model,  with
		     which  the	 entire	 protein is included in	the alignment.
		     It	is currently  only  available  when  using  exhaustive
		     alignment.
	      protein2genome
		     This model allows alignment of a protein sequence to
		     genomic DNA.  This is similar to the protein2dna model,
		     with the addition of modelling of introns and intron
		     phases.  This model is similar to those used by genewise.
	      protein2genome:bestfit
		     This is a bestfit version of  the	protein2genome	model,
		     with  which  the entire protein is	included in the	align-
		     ment.  It is currently only available when	using  exhaus-
		     tive alignment.
	      coding2coding
		     This model	is similar to the ungapped:trans model,	except
		     that  gaps	and frameshifts	are allowed.  It is similar to
		     a gapped tblastx search.
	      coding2genome
		     This is similar to	the est2genome model, except that  the
		     query  sequence is	translated during comparison, allowing
		     a more sensitive comparison.
	      cdna2genome
		     This combines properties of the est2genome and
		     coding2genome models, to allow modelling of a whole cDNA
		     where a central coding region can be flanked by
		     non-coding UTRs.  When the CDS start and end are known,
		     they may be specified using the --annotation option (see
		     below) to permit only the correct coding region to appear
		     in the alignment.
	      genome2genome
		     This model	is similar to the coding2coding	model,	except
		     introns  are  modelled  on	 both sequences.  (not working
		     well yet)

       The short names u, u:t, a:g, a:b, a:l, a:o, e2g, ner, p2d, p2d:b, p2g,
       p2g:b, c2c, c2g, cd2g and g2g can also be used for specifying models.

       -s | --score <threshold>
	      This is the overall score	threshold.  Alignments will not	be re-
	      ported  below  this  threshold.	For  heuristic alignments, the
	      higher this threshold, the less time the analysis	will take.

       --percent <percentage>
	      Report only alignments scoring at least this percentage of the
	      maximal score for each query.  eg. use --percent 90 to report
	      alignments with at least 90% of the maximal score obtainable for
	      that query.  This option is useful not only because it reduces
	      the spurious matches in the output, but because it generates
	      query-specific thresholds (unlike --score) for a set of queries
	      of differing lengths, and will also speed up the search
	      considerably.  NB. with this option, it is possible to have a
	      cDNA match its corresponding gene exactly, yet still score less
	      than 100%, due to the addition of the intron penalty scores,
	      hence this option must be used with caution.

       --showalignment <boolean>
	      Show the alignments in a human-readable form.

       --showsugar <boolean>
	      Display "sugar" output for ungapped alignments.  Sugar is	Simple
	      UnGapped Alignment Report, which	displays  ungapped  alignments
	      one-per-line.   The  sugar  line starts with the string "sugar:"
	      for easy extraction from the output, and is followed by the
	      following 9 fields in the order below:

	      query_id	      Query identifier
	      query_start     Query position at	alignment start
	      query_end	      Query position at alignment end
	      query_strand    Strand of	query matched
	      target_id	      |
	      target_start    |	the same 4 fields
	      target_end      |	for the	target sequence
	      target_strand   |
	      score	      The raw alignment	score
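
	      A minimal sketch of extracting sugar lines from exonerate
	      output.  It assumes the 9 fields are separated by whitespace and
	      that identifiers contain no whitespace; the example line is made
	      up for illustration:

```python
from typing import NamedTuple, Optional


class Sugar(NamedTuple):
    query_id: str
    query_start: int
    query_end: int
    query_strand: str
    target_id: str
    target_start: int
    target_end: int
    target_strand: str
    score: int


def parse_sugar(line: str) -> Optional[Sugar]:
    """Parse one 'sugar:' line; return None for any other line."""
    prefix = "sugar:"
    if not line.startswith(prefix):
        return None
    fields = line[len(prefix):].split()
    if len(fields) != 9:
        raise ValueError("expected 9 sugar fields, got %d" % len(fields))
    q_id, q_start, q_end, q_strand, t_id, t_start, t_end, t_strand, score = fields
    return Sugar(q_id, int(q_start), int(q_end), q_strand,
                 t_id, int(t_start), int(t_end), t_strand, int(score))


# Hypothetical line, not real exonerate output.
hit = parse_sugar("sugar: est1 0 100 + chr1 500 600 + 480")
print(hit.query_id, hit.target_start, hit.score)   # est1 500 480
```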

       --showcigar <boolean>
	      Show the alignments in "cigar" format.  Cigar is a Compact Idio-
	      syncratic	 Gapped	Alignment Report, which	displays gapped	align-
	      ments one-per-line.  The format starts with the same 9 fields as
	      sugar output (see	above),	and is followed	by a series of <opera-
	      tion, length> pairs where	operation is one of match,  insert  or
	      delete, and the length describes the number of times this	opera-
	      tion is repeated.

       --showvulgar <boolean>
	      Show the alignments in "vulgar" format.  Vulgar is Verbose
	      Useful Labelled Gapped Alignment Report.  This format also
	      starts with the same 9 fields as sugar output (see above), and
	      is followed by a series of <label, query_length, target_length>
	      triplets.  The label may be one of the following:

	      M	     Match
	      C	     Codon
	      G	     Gap
	      N	     Non-equivalenced region
	      5	     5'	splice site
	      3	     3'	splice site
	      I	     Intron
	      S	     Split codon
	      F	     Frameshift
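
	      A sketch of reading a vulgar line into its triplets, assuming
	      whitespace-separated fields.  The example line is invented for
	      illustration, and the length check only shows how the triplets
	      relate to the coordinates of a forward-strand alignment:

```python
def parse_vulgar(line: str):
    """Split a 'vulgar:' line into its 9 sugar fields and a list of
    (label, query_length, target_length) triplets."""
    prefix = "vulgar:"
    if not line.startswith(prefix):
        raise ValueError("not a vulgar line")
    fields = line[len(prefix):].split()
    head, rest = fields[:9], fields[9:]
    if len(head) != 9 or len(rest) % 3 != 0:
        raise ValueError("malformed vulgar line")
    triplets = [(rest[i], int(rest[i + 1]), int(rest[i + 2]))
                for i in range(0, len(rest), 3)]
    return head, triplets


# Hypothetical line: two 5-symbol matches around a 3-symbol target gap.
head, triplets = parse_vulgar(
    "vulgar: q1 0 10 + t1 100 113 + 42 M 5 5 G 0 3 M 5 5")
query_span = sum(q_len for _, q_len, _ in triplets)
target_span = sum(t_len for _, _, t_len in triplets)
print(triplets)                  # [('M', 5, 5), ('G', 0, 3), ('M', 5, 5)]
print(query_span, target_span)   # 10 13
```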

       --showquerygff <boolean>
	      Report GFF output	for  features  on  the	query  sequence.   See
	      http://www.sanger.ac.uk/Software/formats/GFF  for	 more informa-
	      tion.

       --showtargetgff <boolean>
	      Report GFF output	for features on	the target sequence.

       --ryo <format>
	      Roll-your-own output format.  This  allows  specification	 of  a
	      printf-esque format line which is	used to	specify	which informa-
	      tion  to	include	in the output, and how it is to	be shown.  The
	      format field may contain the following fields:

	      %[qt][idlsSt]
		     For either {query,target}, report the
		     {id,definition,length,sequence,Strand,type}.  Sequences
		     are reported in a fasta-format like block (no headers).
	      %[qt]a[bels]
		     For either	{query,target}	region	which  occurs  in  the
		     alignment,	report the {begin,end,length,sequence}
	      %[qt]c[bels]
		     For either {query,target} region which occurs in the
		     coding sequence in the alignment, report the
		     {begin,end,length,sequence}
	      %s     The raw score
	      %r     The rank (in results from a bestn search)
	      %m     Model name
	      %e[tism]
		     Equivalenced {total,id,similarity,mismatches} (ie.	%em ==
		     (%et - %ei))
	      %p[isS]
		     Percent {id,similarity,Self} over the  equivalenced  por-
		     tions  of	the  alignment.	 (ie. %pi == 100*(%ei /	%et)).
		     Percent Self is the score over the	equivalenced  portions
		     of	 the  alignment	as a percentage	of the self comparison
		     score of the query	sequence.
	      %g     Gene orientation ('+' = forward, '-' = reverse, '.' = un-
		     known)
	      %S     Sugar block (the 9 fields used in sugar output; see
		     above)
	      %C     Cigar  block  (the	fields of a cigar line after the sugar
		     portion)
	      %V     Vulgar block (the fields of a vulgar line after the sugar
		     portion)
	      %%     Expands to	a percentage sign (%)
	      \n     Newline
	      \t     Tab
	      \\     Expands to	a backslash (\)
	      \{     Open curly	brace
	      \}     Close curly brace
	      {	     Begin per-transition output section
	      }	     End per-transition	output section
	      %P[qt][sabe]
		     Per-transition output for {query,target}
		     {sequence,advance,begin,end}
	      %P[nsl]
		     Per-transition output for {name,score,label}

       This  option  is	 very useful and flexible.  For	example, to report all
       the sections of query sequences which feature in	 alignments  in	 fasta
       format, use:

       --ryo ">%qi %qd\n%qas\n"

       To  output  all	the  symbols and scores	in an alignment, try something
       like:

       --ryo "%V{%Pqs %Pts %Ps\n}"
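
       As a further sketch, a tab-delimited --ryo format can be built and
       parsed from a script.  The helper names and file paths below are
       hypothetical.  The raw string keeps \t and \n as literal backslash
       sequences, on the assumption that exonerate expands them as listed
       above when no shell is involved:

```python
# Fields: query id, target id, raw score, equivalenced total, equivalenced ids.
RYO_FORMAT = r"%qi\t%ti\t%s\t%et\t%ei\n"


def ryo_command(query: str, target: str) -> list:
    """Argument list that prints only the roll-your-own lines."""
    return ["exonerate", "--query", query, "--target", target,
            "--verbose", "0", "--showalignment", "no", "--showvulgar", "no",
            "--ryo", RYO_FORMAT]


def parse_ryo_line(line: str) -> dict:
    """Parse one line produced with RYO_FORMAT."""
    query_id, target_id, score, total, identical = line.rstrip("\n").split("\t")
    total, identical = int(total), int(identical)
    return {
        "query": query_id,
        "target": target_id,
        "score": int(score),
        # %em == (%et - %ei)
        "mismatches": total - identical,
        # %pi == 100 * (%ei / %et)
        "percent_id": 100.0 * identical / total if total else 0.0,
    }


# Made-up output line for illustration.
print(parse_ryo_line("est1\tchr1\t480\t100\t95\n"))
```

       The argument list can be handed to subprocess.run() without a shell,
       which avoids a second layer of quoting around the format string.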

       -n | --bestn <number>
	      Report the best N results for each query.  (Only results scoring
	      better than the score threshold will be reported.)  This option
	      reduces the amount of output generated, and also allows
	      exonerate to speed up the search.

       -S | --subopt <boolean>
	      This  option allows for the reporting of (Waterman-Eggert	style)
	      suboptimal alignments.  (It is on	by default.)   All  suboptimal
	      (ie. non-intersecting) alignments	will be	reported for each pair
	      of sequences scoring at least the	threshold provided by --score.

	      When  this  option  is  used with	exhaustive alignments, several
	      full quadratic time passes will be required, so the running time
	      will be considerably increased.

       -g | --gappedextension <boolean>
	      Causes a gapped extension	stage to be performed ie. dynamic pro-
	      gramming is applied in arbitrarily shaped	and dynamically	 sized
	      regions  surrounding HSP seeds.  The extension threshold is con-
	      trolled by the --extensionthreshold option.

	      Although sometimes slower	than BSDP, gapped  extension  improves
	      sensitivity with weak, gap-rich alignments such as during	cross-
	      species comparison.

	      NB. This option is now the default.  Set it to false to revert
	      to the old BSDP type alignments.  This option may be slower than
	      BSDP for some large scale	analyses with simple alignment models.

       --refine	<strategy>
	      Force exonerate to refine	alignments generated by	heuristics us-
	      ing dynamic programming over larger regions.   This  takes  more
	      time, but	improves the quality of	the final alignments.

	      The strategies available for refinement are:

	      none   The default - no refinement is used.
	      full   An	 exhaustive  alignment	is calculated from the pair of
		     sequences in their	entirety.
	      region DP	is applied just	to the region of the sequences covered
		     by	the heuristic alignment.

       --refineboundary	<size>
	      Specify an extra boundary	to be included in the  region  subject
	      to alignment during refinement by	region.

VITERBI ALGORITHM OPTIONS
       -D | --dpmemory <Mb>
	      The  exhaustive  alignment traceback routines use	a Hughey-style
	      reduced memory technique.	 This option specifies how much	memory
	      will be used for this.  Generally, the more memory is  permitted
	      here, the	faster the alignments will be produced.

CODE GENERATION	OPTIONS
       -C | --compiled <boolean>
	      This  option allows disabling of generated code for dynamic pro-
	      gramming.	 It is mainly used during  development	of  exonerate.
	      When  set	to FALSE, an "interpreted" version of the dynamic pro-
	      gramming implementation is used, which is	much slower.

HEURISTIC OPTIONS
       --terminalrangeint
       --terminalrangeext
       --joinrangeint
       --joinrangeext
       --spanrangeint
       --spanrangeext
	      These options are	used to	specify	the size of the	 sub-alignment
	      regions  to  which  DP  is  applied around the ends of the HSPs.
	      This can be at the HSP ends (terminal range), between HSPs (join
	      range), or between HSPs which may	be connected by	a large	region
	      such as an  intron  or  non-equivalenced	region	(span  range).
	      These  ranges can	be specified for a number of matches back onto
	      the HSP (internal	range) or out from the HSP (external range).

SEEDED DYNAMIC PROGRAMMING OPTIONS
       -x | --extensionthreshold <score>
	      This is the amount by which the score will be allowed to degrade
	      during SDP.  This	is the equivalent of the hspdropoff penalties,
	      except it	is applied during dynamic programming, not HSP	exten-
	      sion.   Decreasing this parameter	will increase the speed	of the
	      SDP, and increasing it will increase the sensitivity.

       --singlepass  <boolean>
	      By default the suboptimal SDP alignments are reported by a
	      singlepass algorithm, which may miss some suboptimal alignments
	      that are close together.  This option can be used to force the
	      use of a multipass suboptimal alignment algorithm for SDP,
	      resulting in higher quality suboptimal alignments.

BSDP OPTIONS
       --joinfilter <limit>
	      (experimental)

	      Only consider this number of SARs for joining HSPs together.
	      The SARs with the highest potential for appearing in a
	      high-scoring alignment are considered.  This option is useful for
	      limiting time and	memory usage when searching unmasked data with
	      repetitive  sequences,  but  should not be set too low, as valid
	      matches may be ignored.  Something like --joinfilter 32 seems to
	      work well.

SEQUENCE OPTIONS
       --annotation <path>
	      Specify basic sequence annotation	 information.	This  is  most
	      useful with the cdna2genome model, but will work with other mod-
	      els.  The	annotation file	contains four fields per line:

	      <id> <strand> <cds_start>	<cds_length>

	      Here is a	simple example of such a file for 4 cDNAs:

	      dhh.human.cdna + 308 1191
	      dhh.mouse.cdna + 250 1191
	      csn7a.human.cdna + 178 828
	      csn7a.mouse.cdna + 126 828
	      These  annotation	 lines	will also work when only the first two
	      fields are used.	This can be used when specifying which	strand
	      of a specific sequence should be included	in a comparison.
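
	      A minimal sketch of reading such a file, assuming
	      whitespace-separated fields and accepting both the four-field
	      and the two-field form.  The function name is hypothetical:

```python
def parse_annotation(text: str) -> dict:
    """Map each sequence id to (strand, cds), where cds is a
    (cds_start, cds_length) tuple, or None for two-field lines."""
    records = {}
    for number, line in enumerate(text.splitlines(), 1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 4:
            seq_id, strand, cds_start, cds_length = fields
            cds = (int(cds_start), int(cds_length))
        elif len(fields) == 2:
            seq_id, strand = fields
            cds = None
        else:
            raise ValueError("line %d: expected 2 or 4 fields" % number)
        if strand not in ("+", "-", "."):
            raise ValueError("line %d: bad strand %r" % (number, strand))
        records[seq_id] = (strand, cds)
    return records


example = """\
dhh.human.cdna + 308 1191
dhh.mouse.cdna + 250 1191
csn7a.human.cdna +
"""
print(parse_annotation(example)["dhh.human.cdna"])   # ('+', (308, 1191))
```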

SYMBOL COMPARISON OPTIONS
       --softmaskquery <boolean>
	      Indicate	that  the  query is softmasked.	 See description below
	      for --softmasktarget
       --softmasktarget	<boolean>
	      Indicate that the	target is softmasked.	In  a  softmasked  se-
	      quence  file,  instead  of  masking regions by Ns	or Xs they are
	      masked by	putting	those regions in lower case (and with unmasked
	      regions in upper case).  This option allows the  masking	to  be
	      ignored  by  some	 parts	of the program,	combining the speed of
	      searching	masked data with  sensitivity  of  searching  unmasked
	      data.  The utility fastasoftmask, which is supplied with
	      exonerate, can be used for producing softmasked sequence from
	      conventionally masked sequence.
       -d | --dnasubmat	<name>
	      Specify the substitution matrix to be used for DNA comparison.
	      This should be a path to a substitution matrix in the same
	      format as that which is used by blast.
       -p | --proteinsubmat <name>
	      Specify the substitution matrix to be used for protein
	      comparison.  (Both DNA and protein substitution matrices are
	      required for some types of analysis.)  The use of the special
	      names nucleic, blosum62, pam250, edit or identity will cause
	      built-in substitution matrices to be used.
ALIGNMENT SEEDING OPTIONS
       -M | --fsmmemory	<Mb>
	      Specify the amount of memory to use for  the  FSM	 in  heuristic
	      analyses.	  exonerate multiplexes	the query to accelerate	large-
	      throughput database queries.  This figure	should always be  less
	      than  the	 physical  memory  on  the machine, but	when searching
	      large databases, generally, the more memory  it  is  allowed  to
	      use, the faster it will go.
       --forcefsm <none	| normal | compact>
	      Force the	use of more compact finite state machines for analyses
	      involving	 big  sequences	and large word neighbourhoods.	By de-
	      fault, exonerate will pick a sensible strategy, so  this	option
	      will rarely need to be set.
       --wordjump <int>
	      The  jump	 between query words used to yield the word neighbour-
	      hood.  If	set to 1, every	word is	used, if set to	2, every other
	      word is used, and	if set to the wordlength, only non-overlapping
	      words will be used.  This	option reduces the memory requirements
	      when using very large query sequences, and makes the search  run
	      faster,  but it also damages search sensitivity when high	values
	      are set.
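
	      The effect of the jump on the set of query words can be
	      pictured with the sketch below.  It is an illustration only:
	      the function is hypothetical, and starting from the first
	      position of the query is an assumption.

```python
def query_words(seq: str, wordlen: int, jump: int = 1) -> list:
    """Return the words of length wordlen taken every `jump` positions."""
    if wordlen < 1 or jump < 1:
        raise ValueError("wordlen and jump must be positive")
    return [seq[i:i + wordlen]
            for i in range(0, len(seq) - wordlen + 1, jump)]


print(query_words("ACGTAC", 3, 1))   # ['ACG', 'CGT', 'GTA', 'TAC']
print(query_words("ACGTAC", 3, 2))   # ['ACG', 'GTA']
print(query_words("ACGTAC", 3, 3))   # ['ACG', 'TAC']
```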
       --wordambiguity <limit>
	      This option may be used to allow alignment seeds containing  IU-
	      PAC  ambiguity  symbols.	The limit is the maximum number	of am-
	      biguous words allowed at a single	position.  If  this  limit  is
	      reached  then  the  position  is not used	for alignment seeding.
	      Using this option	may slow down a	search.	 For  large  datasets,
	      it  is  recommended  to  use esd2esi --wordambiguity instead, as
	      then the speed overhead is only incurred during indexing,	rather
	      than during the database searching itself.  NB. This option only
	      works for	IUPAC symbols in the  target  sequence.	  Query	 words
	      containing IUPAC symbols are (currently) excluded	from seeding.
AFFINE MODEL OPTIONS
       -o | --gapopen <penalty>
	      This is the gap open penalty.
       -e | --gapextend	<penalty>
	      This is the gap extension	penalty.
       --codongapopen <penalty>
	      This is the codon	gap open penalty.
       --codongapextend	<penalty>
	      This is the codon	gap extension penalty.
NER OPTIONS
       --minner <length>
	      Minimum NER length allowed.
       --maxner	<length>
	      Maximum  NER  length  allowed.   NB.  this  option  only affects
	      heuristic	alignments.
       --neropen <penalty>
	      Penalty for opening a non-equivalenced region.
INTRON MODELLING OPTIONS
       --minintron <length>
	      Minimum intron length limit.  NB. this option only affects
	      heuristic alignments.  This is not a hard limit - it only
	      affects the size of introns which are sought during heuristic
	      alignment.
       --maxintron <length>
	      Maximum intron length limit.  See notes above for --minintron.
       -i | --intronpenalty <penalty>
	      Penalty for introduction of an intron.
FRAMESHIFT MODELLING OPTIONS
       -f | --frameshift <penalty>
	      The penalty for the inclusion of a frameshift in an alignment.
ALPHABET OPTIONS
       --useaatla <boolean>
	      Use  three-letter	abbreviations for AA names.  ie. when display-
	      ing alignment "Met" is used instead of " M "
TRANSLATION OPTIONS
       --geneticcode <code>
	      Specify an alternative genetic code.  The	default	 code  (1)  is
	      the standard genetic code.  Other genetic codes may be specified
	      in shorthand or longhand form.

	      In  shorthand form, a number between 1 and 23 is used to specify
	      one of 17	built-in genetic code  variants.   These  are  genetic
	      code variants taken from:

	      http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

	      These are:
	      1	     The Standard Code
	      2	     The Vertebrate Mitochondrial Code
	      3	     The Yeast Mitochondrial Code
	      4	     The  Mold,	Protozoan, and Coelenterate Mitochondrial Code
		     and the Mycoplasma/Spiroplasma Code
	      5	     The Invertebrate Mitochondrial Code
	      6	     The Ciliate, Dasycladacean	and Hexamita Nuclear Code
	      9	     The Echinoderm and	Flatworm Mitochondrial Code
	      10     The Euplotid Nuclear Code
	      11     The Bacterial and Plant Plastid Code
	      12     The Alternative Yeast Nuclear Code
	      13     The Ascidian Mitochondrial	Code
	      14     The Alternative Flatworm Mitochondrial Code
	      15     Blepharisma Nuclear Code
	      16     Chlorophycean Mitochondrial Code
	      21     Trematode Mitochondrial Code
	      22     Scenedesmus obliquus mitochondrial	Code
	      23     Thraustochytrium Mitochondrial Code

	      In longhand form, a genetic code variant may be provided as a
	      64-byte string in TCAG order, e.g. the standard genetic code in
	      this form would be:

	      FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
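
	      As an illustration of how a longhand string can be read, the
	      Python sketch below looks up a codon by treating its three bases
	      as base-4 digits in TCAG order, first base most significant.
	      The function names are hypothetical and are not part of
	      exonerate; codons with ambiguous bases are returned as "X" purely
	      as a convention of this sketch.

```python
# Sketch only: decode a 64-character longhand genetic code in TCAG order.
# translate_codon() and translate() are illustrative names, not exonerate APIs.
STANDARD_CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
BASE_ORDER = "TCAG"


def translate_codon(codon, code=STANDARD_CODE):
    """Return the amino acid for one codon, or 'X' if a base is ambiguous."""
    if len(code) != 64:
        raise ValueError("a longhand genetic code must have exactly 64 symbols")
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three bases")
    index = 0
    for base in codon.upper().replace("U", "T"):
        digit = BASE_ORDER.find(base)
        if digit < 0:
            return "X"  # assumption of this sketch: unknown base -> 'X'
        index = index * 4 + digit
    return code[index]


def translate(dna, code=STANDARD_CODE):
    """Translate a DNA string in frame 0, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(translate_codon(dna[i:i + 3], code)
                   for i in range(0, usable, 3))
```

	      For example, translate("ATGGCCTAA") reads ATG, GCC and TAA from
	      the standard code string above.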

HSP CREATION OPTIONS
       --hspfilter <threshold>
	      Use aggressive HSP filtering to  speed  up  heuristic  searches.
	      The threshold specifies the number of HSPs centred about a point
	      in  the query which will be stored.  Any lower scoring HSPs will
	      be discarded.  This is an	experimental option  to	 handle	 speed
	      problems	caused	by some	sequences.  A value of about 100 seems
	      to work well.
       --useworddropoff	<boolean>
	      When this is TRUE, the score threshold for admitting words into
	      the word neighbourhood is set to be the initial word score minus
	      the word threshold (see below), so the cutoff is relative to the
	      score of each initial word.  When this is FALSE, the word
	      threshold is taken to be an absolute value.
       --seedrepeat <count>
	      The seedrepeat parameter sets the	number of seeds	which must  be
	      found on the same	diagonal or reading frame before HSP extension
	      will occur.  Increasing the value for --seedrepeat will speed up
	      searches, and is usually a better option than using longer word
	      lengths, particularly when using the exonerate-server, where
	      increasing word lengths requires recomputing the index and
	      greatly increases memory requirements.
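
	      The idea of requiring several seeds on one diagonal can be
	      sketched in Python as below.  This is a simplified illustration,
	      not exonerate's implementation: it ignores reading frames and any
	      distance limit between seeds, and the function name is
	      hypothetical.

```python
from collections import Counter


def diagonals_to_extend(seeds, seedrepeat):
    """Return the diagonals that hold at least `seedrepeat` seeds.

    seeds is an iterable of (query_pos, target_pos) word matches.
    A diagonal is identified by target_pos - query_pos.
    """
    counts = Counter(tpos - qpos for qpos, tpos in seeds)
    return sorted(diag for diag, n in counts.items() if n >= seedrepeat)
```

	      With a higher seedrepeat, isolated seeds no longer trigger HSP
	      extension, which is where the speed-up comes from.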
       -w | --dnawordlen <bases>
       -W | --proteinwordlen <residues>
       -W | --codonwordlen <bases>
	      The word length used for DNA, protein or codon words.  When
	      performing DNA vs protein comparisons, the DNA wordlength will
	      always (automatically) be triple the protein wordlength.
       --dnahspdropoff <score>
       --proteinhspdropoff <score>
       --codonhspdropoff <score>
	      The amount by which an HSP score will be allowed to degrade
	      during HSP extension.  Separate thresholds can be set for DNA,
	      protein or codon comparisons.
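
	      A drop-off of this kind is commonly implemented as in the Python
	      sketch below, which extends an ungapped match to the right and
	      stops once the running score has fallen more than the drop-off
	      below the best score seen.  The match and mismatch scores are
	      hypothetical stand-ins for a substitution matrix, and exonerate's
	      own extension code may differ in detail.

```python
def extend_right(query, target, qpos, tpos, dropoff, match=5, mismatch=-4):
    """Extend an ungapped HSP rightwards from (qpos, tpos).

    Returns (best_score, best_length). Extension stops when the running
    score is more than `dropoff` below the best score seen so far.
    """
    score = best_score = 0
    length = best_length = 0
    while qpos + length < len(query) and tpos + length < len(target):
        same = query[qpos + length] == target[tpos + length]
        score += match if same else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > dropoff:
            break  # score has degraded too far; give up extending
    return best_score, best_length
```

	      A larger drop-off lets the extension bridge short mismatching
	      runs at the cost of more work per seed.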
       --dnahspthreshold <score>
       --proteinhspthreshold <score>
       --codonhspthreshold <score>
	      The HSP score thresholds.	 An HSP	must score at least this  much
	      before  it  will	be  reported  or  be  used in preparation of a
	      heuristic	alignment.
       --dnawordlimit  <score>
       --proteinwordlimit  <score>
       --codonwordlimit	 <score>
	      The threshold for	admitting DNA or protein words into  the  word
	      neighbourhood.   The  behaviour of this option is	altered	by the
	      --useworddropoff option (see above).
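
	      The interaction between the word limit and --useworddropoff can
	      be sketched as follows.  The function names and the example
	      scores are hypothetical, and treating the threshold as inclusive
	      (a word is admitted when its score is at least the threshold) is
	      an assumption of this sketch.

```python
def word_admission_threshold(initial_word_score, word_limit, use_word_dropoff):
    """Return the score a neighbouring word must reach to be admitted."""
    if use_word_dropoff:
        # Relative cutoff: allow scores up to word_limit below the initial word.
        return initial_word_score - word_limit
    # Absolute cutoff: the word limit is used as given.
    return word_limit


def admitted_words(scored_words, initial_word_score, word_limit,
                   use_word_dropoff):
    """Filter (word, score) pairs by the admission threshold."""
    threshold = word_admission_threshold(
        initial_word_score, word_limit, use_word_dropoff)
    return [word for word, score in scored_words if score >= threshold]
```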

       --geneseed <threshold>
	      Exclude HSPs from gapped alignment computation which cannot
	      feature in an alignment containing at least one HSP scoring at
	      least this threshold.

	      This option provides considerable	speed up for gapped  alignment
	      computation,  but	 may cause some	very gap-rich alignments to be
	      missed.

	      It is useful when aligning similar sequences back onto a genome
	      quickly, e.g. try --geneseed 250
       --geneseedrepeat	<count>
	      The geneseedrepeat parameter is like the seedrepeat parameter,
	      but is only applied when looking for the geneseed HSPs.  Using a
	      larger value for --geneseedrepeat will speed up searches when
	      the --geneseed parameter is also used.  (experimental,
	      implementation incomplete)
ALIGNMENT OPTIONS
       --alignmentwidth	<width>
	      Width of alignment display.  The default is 80.
       --forwardcoordinates <boolean>
	      By  default, all coordinates are reported	on the forward strand.
	      Setting this option  to  false  reverts  to  the	old  behaviour
	      (pre-0.8.3)  whereby  alignments	on the reverse complement of a
	      sequence are reported using coordinates on the  reverse  comple-
	      ment.
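
	      With the in-between coordinate system, a position on the reverse
	      complement maps to the forward strand as (sequence length -
	      position).  The Python sketch below applies that arithmetic to an
	      interval.  The function name is hypothetical, and the sketch does
	      not show how exonerate itself orders the start and end values in
	      its output.

```python
def revcomp_to_forward(start, end, seq_length):
    """Map an in-between interval on the reverse complement to the
    equivalent forward-strand interval, returned as (start, end)."""
    if not 0 <= start <= end <= seq_length:
        raise ValueError("interval must satisfy 0 <= start <= end <= length")
    return seq_length - end, seq_length - start
```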
SUB-ALIGNMENT REGION OPTIONS
       --quality <percent>
	      This  option  excludes HSPs from BSDP when their components out-
	      side of the SARs fall below this quality threshold.
SPLICE SITE PREDICTION OPTIONS
       --splice3 <path>
       --splice5 <path>
	      Provide a	file containing	a custom PSSM (position	specific score
	      matrix) for prediction of	the intron splice sites.

	      The file format for splice data is simple: lines beginning  with
	      '#'  are	comments, a line containing just the word 'splice' de-
	      notes the	position of the	splice site, and the other lines  show
	      the  observed  relative  frequencies  of	the bases flanking the
	      splice sites in the chosen organism (in ACGT order).

	      Example 5' splice	data file:

	       # start of example 5' splice data
	       # A C G T
	       28 40  17  14
	       59 14  13  14
		8  5  81   6
	       splice
		0  0 100   0
		0  0   0 100
	       54  2  42   2
	       74  8  11   8
		5  6  85   4
	       16 18  21  45
	       # end of	test 5'	splice data

	      Example 3' splice	data file:

	       # start of example 3' splice data
	       # A C G T
		10  31	14  44
		 8  36	14  43
		 6  34	12  48
		 6  34	 8  52
		 9  37	 9  45
		 9  38	10  44
		 8  44	 9  40
		 9  41	 8  41
		 6  44	 6  45
		 6  40	 6  48
		23  28	26  23
		 2  79	 1  18
	       100   0	 0   0
		 0   0 100   0
	       splice
		28  14	47  11
	       # end of	example	3' splice data
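
	      The file format described above is simple enough to read with a
	      few lines of code.  The Python sketch below is one possible
	      reader, not the parser used by exonerate: the function name is
	      hypothetical, and skipping blank lines and rejecting a second
	      'splice' line are assumptions of this sketch.

```python
def parse_splice_data(text):
    """Parse splice data text into (rows, splice_index).

    rows is a list of [A, C, G, T] relative frequencies, one per position.
    splice_index is the number of rows that precede the 'splice' line.
    """
    rows = []
    splice_index = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line (assumed ignorable) or comment
        if stripped == "splice":
            if splice_index is not None:
                raise ValueError("more than one 'splice' line")
            splice_index = len(rows)
            continue
        fields = stripped.split()
        if len(fields) != 4:
            raise ValueError("expected four values in ACGT order: %r" % line)
        rows.append([float(field) for field in fields])
    if splice_index is None:
        raise ValueError("no 'splice' line found")
    return rows, splice_index
```

	      Applied to the example 5' file above, this yields nine rows with
	      the splice site after the third row.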

       --forcegtag <boolean>
	      Only allow splice sites at gt....ag sites (or ct....ac sites
	      when the gene is reversed).  With this restriction in place, the
	      splice site prediction scores are still used and allow tie
	      breaking when there is more than one possible splice site.

STRATEGIES FOR SPEED
       Keep all	data on	local disks.

       Apply  the  highest  acceptable score thresholds	using a	combination of
       --score,	--percent and --bestn.

       Repeat mask and dust the	genomic	(target)  sequence.   (Softmask	 these
       sequences and use --softmasktarget).

       Increase	the --fsmmemory	option to allow	more query multiplexing.

       Increase the value for --seedrepeat.

       When  using  an	alignment  model containing introns, set --geneseed as
       high as possible.

       If you are compiling exonerate yourself,	see the	README	file  supplied
       with the	source code for	details	of compile-time	optimisations.

STRATEGIES FOR SENSITIVITY
       Not documented yet.

       Increase	the word neighbourhood.	 Decrease the HSP threshold.  Increase
       the SAR ranges.	Run exhaustively.

ENVIRONMENT
       Not documented yet.

EXAMPLES
       exonerate cdna.fasta genomic.fasta
	      This is the simplest way in which exonerate may be used.  By
	      default, an ungapped alignment model will be used.

       exonerate --exhaustive y --model est2genome cdna.fasta genomic.masked.fasta
	      Exhaustively align cDNAs to genomic sequence.  This will be
	      much, much slower, but more accurate.  This option causes
	      exonerate to behave like EST_GENOME.

       exonerate --exhaustive --model affine:local query.fasta target.fasta
	      If the affine:local model	is used	with exhaustive	alignment, you
	      have the Smith-Waterman algorithm.

       exonerate --exhaustive --model affine:global protein.fasta protein.fasta
	      Switch to	a global model,	and you	have Needleman-Wunsch.

       exonerate --wordthreshold 1 --gapped no --showhsp yes protein.fasta genome.fasta
	      Generate ungapped Protein:DNA alignments.

       exonerate --model coding2coding --score 1000 --bigseq yes --proteinhspthreshold 90 chr21.fa chr22.fa
	      Perform quick-and-dirty translated  pairwise  alignment  of  two
	      very large DNA sequences.

       Many similar combinations should	work.  Try them	out.

VERSION
       This documentation accompanies version 2.2.0 of the exonerate package.
AUTHOR
       Guy  St.C. Slater.  <guy@ebi.ac.uk>.  See the AUTHORS file accompanying
       the source code for a list of contributors.
AVAILABILITY
       The source code for the exonerate package is available under the terms
       of the GNU General Public Licence.

       Please see the file COPYING which was distributed with this package, or
       http://www.gnu.org/licenses/gpl.txt for details.

       This package has	been developed as part of the ensembl project.	Please
       see http://www.ensembl.org/ for more information.
SEE ALSO
       exonerate-server(1), ipcress(1),	blast(1L).

exonerate			 November 2002			  exonerate(1)
