fasta36/ssea...36/lalign36    1(local)	fasta36/ssea...36/lalign36    1(local)

NAME
       fasta36 - scan a	protein	or DNA sequence	library	for similar sequences

       fastx36	 - compare a DNA sequence to a protein sequence	database, com-
       paring the translated DNA sequence in forward and reverse frames.

       tfastx36	 - compare a protein sequence to a DNA sequence	database, cal-
       culating	similarities with frameshifts to the forward and reverse  ori-
       entations.

       fasty36	 - compare a DNA sequence to a protein sequence	database, com-
       paring the translated DNA sequence in forward and reverse frames.

       tfasty36	 - compare a protein sequence to a DNA sequence	database, cal-
       culating	similarities with frameshifts to the forward and reverse  ori-
       entations.

       fasts36 - compare unordered peptides to a protein sequence database

       fastm36	-  compare ordered peptides (or	short DNA sequences) to	a pro-
       tein (DNA) sequence database

       tfasts36	- compare unordered peptides  to  a  translated	 DNA  sequence
       database

       fastf36 - compare mixed peptides	to a protein sequence database

       tfastf36	- compare mixed	peptides to a translated DNA sequence database

       ssearch36  -  compare  a	protein	or DNA sequence	to a sequence database
       using the Smith-Waterman	algorithm.

       ggsearch36 - compare a protein or DNA sequence to a  sequence  database
       using a global alignment	(Needleman-Wunsch)

       glsearch36  -  compare a	protein	or DNA sequence	to a sequence database
       with alignments that are	global in the query and	local in the  database
       sequence	(global-local).

       lalign36 - produce multiple non-overlapping alignments for protein and
       DNA sequences using the Huang and Miller sim implementation of the
       Waterman-Eggert algorithm.

       prss36,	prfx36	-  discontinued;  all the FASTA	programs will estimate
       statistical significance	using 500 shuffled sequence scores if two  se-
       quences are compared.

DESCRIPTION
       Release	3.6  of	 the  FASTA package provides a modular set of sequence
       comparison programs that	can run	on conventional	single processor  com-
       puters  or  in  parallel	on multiprocessor computers. More than a dozen
       programs	    -	  fasta36,     fastx36/tfastx36,     fasty36/tfasty36,
       fasts36/tfasts36, fastm36, fastf36/tfastf36, ssearch36, ggsearch36, and
       glsearch36 - are	currently available.

       All  the	comparison programs share a set	of basic command line options;
       additional options are available	for individual comparison functions.

       Threaded versions of the FASTA programs (built by default under
       Unix/Linux/MacOS) run in parallel on modern Linux and Unix multi-core
       or multi-processor computers.  Accelerated versions of the
       Smith-Waterman algorithm are available for the Intel SSE2 and Altivec
       PowerPC architectures, and can speed up Smith-Waterman calculations
       10- to 20-fold.

       In  addition to the serial and threaded versions	of the FASTA programs,
       MPI parallel versions  are  available  as  fasta36_mpi,	ssearch36_mpi,
       fastx36_mpi,  etc.  The MPI parallel versions use the same command line
       options as the serial and threaded versions.

Running	the FASTA programs
       By default, the FASTA programs are no longer interactive; they are run
       from the command line by specifying the program, query.file, and
       library.file.  Program options must precede the query.file and
       library.file arguments:

     fasta36 -option1 -option2 -option3	query.file library.file	> fasta.output

       The  "classic" interactive mode,	which prompts for a query.file and li-
       brary.file, is available	with the -I option.   Typing  a	 program  name
       without	any  arguments (ssearch36) provides a short help message; pro-
       gram_name -help provides	a complete set of program options.

       Program options MUST precede the query.file and library.file arguments.
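
       The ordering rule above can be sketched in Python.  This is a minimal
       illustration, not part of the FASTA distribution: the helper name
       build_fasta_command and the file names are hypothetical; only the
       program name and the -E and -m options are taken from this page.

```python
def build_fasta_command(program, options, query_file, library_file):
    """Return an argv list with every option placed before the file arguments."""
    argv = [program]
    argv.extend(options)        # options come first ...
    argv.append(query_file)     # ... then query.file
    argv.append(library_file)   # ... then library.file
    return argv


# Hypothetical file names, for illustration only.
cmd = build_fasta_command(
    "ssearch36", ["-E", "0.001", "-m", "8"], "query.aa", "library.fa"
)
print(" ".join(cmd))
```

       The resulting list can be handed to subprocess.run() unchanged; building
       it in one place makes it hard to append an option after the file names
       by accident.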

FASTA program options
       The default scoring matrix and gap penalties used by each of  the  pro-
       grams have been selected	for high sensitivity searches with the various
       algorithms.   The default program behavior can be modified by providing
       command line options before the query.file and library.file  arguments.
       Command line options can	also be	used in	interactive mode.

       Command line arguments come in several classes.

       (1) Commands that specify the comparison type. FASTA, FASTS, FASTM,
       SSEARCH, GGSEARCH, and GLSEARCH can compare either protein or DNA
       sequences, and attempt to recognize the comparison type by looking at
       the residue composition. -n and -p specify DNA (nucleotide) or protein
       comparison, respectively. -U specifies RNA comparison.

       (2) Commands that limit the set of sequences compared: -1, -3, -M.

       (3) Commands that modify the scoring parameters: -f gap-open penalty,
       -g gap-extend penalty, -j inter-codon frameshift, within-codon
       frameshift, -s scoring-matrix, -r match/mismatch score, -x X:X score.

       (4) Commands that modify the algorithm (mostly FASTA and [T]FASTX/Y):
       -c, -w, -y, -o. The -S option can be used to ignore lower-case (low
       complexity) residues during the initial score calculation.

       (5) Commands that modify the output: -A, -b number, -C width, -d
       number, -L, -m 0-11,B, -w line-width, -W context-width, -o
       offset1,offset2

       (6) Commands that affect	statistical estimates: -Z, -k.

Option summary:
       -1     Sort by "init1" score (obsolete)

       -3     ([t]fast[x,y] only) use only forward frame translations

       -a     Displays the full length (including unaligned regions) of both
              sequences with fasta36, ssearch36, glsearch36, and fasts36.

       -A (fasta36 only) For DNA:DNA, force Smith-Waterman alignment for
	      output.  Smith-Waterman is the default for FASTA protein	align-
	      ment  and	 [t]fast[x,y], but not for DNA comparisons with	FASTA.
	      For protein:protein, use band-alignment algorithm.

       -b #   number of	best scores/descriptions to show (must be  <  expecta-
	      tion  cutoff  if	-E  is	given).	 By default, this option is no
	      longer used; all scores better than the expectation (E())	cutoff
	      are listed. To guarantee the display of  #  descriptions/scores,
	      use  -b  =#,  i.e.  -b =100 ensures that 100 descriptions/scores
	      will be displayed.  To guarantee at  least  1  description,  but
	      possibly many more (limited by -E	e_cut),	use -b >1.

       -c "E-opt E-join"
	      threshold	for gap	joining	(E-join) and band optimization (E-opt)
	      in  FASTA	and [T]FASTX/Y.	 FASTA36 now uses BLAST-like statisti-
	      cal thresholds for joining and band optimization.	  The  default
	      statistical  thresholds  for  protein and	translated comparisons
              are E-opt=0.2, E-join=0.5; for DNA, E-join=0.1 and E-opt=0.02.
              The actual number of joins and optimizations is reported after
              the E-join and E-opt scoring parameters.  Statistical
              thresholds improve search speed 2- to 3-fold, and provide much
              more accurate statistical estimates for matrices other than
              BLOSUM50.  The "classic" joining/optimization thresholds that
              were the default in fasta35 and earlier programs are available
              using -c O (upper case O), possibly followed by a value > 1.0
              to set the optcut optimization threshold.

       -C #   length of	name abbreviation in alignments, default = 6.  Must be
	      less than	20.

       -d #   number of best alignments to show (must be < expectation (-E)
              cutoff and <= the -b description limit).

       -D     turn on debugging mode.  Enables checks on the sequence alphabet
              that cause problems with tfastx36 and tfasty36 (only available
              when enabled by a compile-time option).  Also preserves temp
              files with the -e expand_script.sh option.

       -e expand_script.sh
	      Run a script to expand the set  of  sequences  displayed/aligned
              based on the results of the initial search.  When the -e
              expand_script.sh option is used, after the initial scan and
              statistics calculation, but before the "Best scores" are shown,
              expand_script.sh is run with a single argument, the name of a
              file that contains the accession information (the text on the
              fasta description line between > and the first space) and the
              E()-value for the sequence.  expand_script.sh then uses this
              information to send a library of additional sequences to
              stdout.  These additional sequences are included in the list of
              high-scoring sequences (if their scores are significant) and
              aligned.  The additional sequences do not change the statistics
              or database size.

       -E e_cut	e_cut_r
	      expectation value	upper limit for	score and  alignment  display.
	      Defaults	are  10.0  for FASTA36 and SSEARCH36 protein searches,
	      5.0 for translated DNA/protein comparisons, and 2.0 for  DNA/DNA
              searches.  FASTA version 36 now reports additional alignments
              between the query and the library sequence; the second value
              sets the threshold for the subsequent alignments.  If not
              given, the threshold is e_cut/10.0.  If given and value > 1.0,
              e_cut_r = e_cut / value; for value < 1.0, e_cut_r = value.  If
              e_cut_r < 0, the additional alignment option is disabled.
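
              The rule for the second -E value can be written out as a small
              Python function.  This is a sketch of the rule as worded here,
              not code from the FASTA sources; the function name is
              hypothetical, and the handling of a value of exactly 1.0, which
              the text does not cover, is an assumption.

```python
def secondary_threshold(e_cut, value=None):
    """Return e_cut_r for additional alignments, or None when they are disabled."""
    if value is None:
        return e_cut / 10.0     # no second value: default is e_cut/10.0
    if value < 0:
        return None             # negative e_cut_r disables additional alignments
    if value > 1.0:
        return e_cut / value    # values above 1.0 act as a divisor
    # Values below 1.0 are used directly as the threshold.  Exactly 1.0 is
    # not covered by the text; treating it as an absolute threshold is an
    # assumption of this sketch.
    return value


print(secondary_threshold(10.0))         # default for -E 10.0
print(secondary_threshold(10.0, 100))    # -E "10.0 100"
print(secondary_threshold(10.0, 0.001))  # -E "10.0 0.001"
```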

       -f #   penalty for opening a gap.

       -F #   expectation  value  lower	limit for score	and alignment display.
	      -F 1e-6 prevents library sequences with  E()-values  lower  than
              1e-6 from being displayed. This allows the user to focus on more
              distant relationships.

       -g #   penalty for additional residues in a gap

       -h     Show short help message.

       -help  Show long	help message, with all options.

       -H     show histogram (with fasta-36.3.4, the histogram is not shown by
	      default).

       -i     (fasta DNA, [t]fast[x,y]) compare against only the reverse
              complement of the library sequence.

       -I     interactive mode;	prompt for query filename, library.

       -j # # ([t]fast[x,y] only) penalty for a	frameshift between two codons,
	      ([t]fasty	only) penalty for a frameshift within a	codon.

       -J     (lalign36	only) show identity alignment.

       -k     specify number of	shuffles for statistical parameter  estimation
	      (default=500).

       -l str specify FASTLIBS file

       -L     report  long sequence description	in alignments (up to 200 char-
	      acters).

       -m 0,1,2,3,4,5,6,8,9,10,11,B,BB,"F# out.file" alignment display
	      options.	-m 0, 1, 2, 3 display different	types  of  alignments.
	      -m 4 provides an alignment "map" on the query. -m	5 combines the
	      alignment	 map and a -m 0	alignment.  -m 6 provides an HTML out-
	      put.

       -m 8 seeks to mimic BLAST -m 8 tabular output.  Only query and
	      library sequence names, and  identity,  mismatch,	 starts/stops,
	      E()-values,  and	bit  scores are	displayed.  -m 8C mimics BLAST
	      tabular format with comment lines.  -m 8	formats	 do  not  show
	      alignments.
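
              Output in this format is easy to post-process.  The sketch
              below assumes the 12 tab-separated columns of BLAST -m 8 output
              (query, subject, percent identity, alignment length,
              mismatches, gap openings, query start/end, subject start/end,
              E-value, bit score) and that comment lines written by -m 8C
              start with '#'.  The Hit type, the parse_m8 name, and the sample
              line are made up for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_m8(text):
    """Parse BLAST-style tabular lines into Hit records (assumed column order)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]),
                        int(f[5]), int(f[6]), int(f[7]), int(f[8]),
                        int(f[9]), float(f[10]), float(f[11])))
    return hits


# Made-up sample data, not real program output.
sample = "# made-up comment line\nq1\ts1\t87.50\t40\t5\t0\t1\t40\t11\t50\t1.2e-10\t55.1\n"
hits = parse_m8(sample)
print(hits[0].subject, hits[0].evalue)
```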

       -m 9 does not change the	alignment output, but provides
	      alignment	 coordinate  and percent identity information with the
	      best scores report.  -m 9c adds encoded alignment	information to
	      the -m 9;	-m 9C adds encoded alignment information  as  a	 CIGAR
              formatted string. To accommodate frameshifts, the CIGAR format
	      has been supplemented with F (forward) and R (reverse).	-m  9i
	      provides	only percent identity and alignment length information
	      with the best scores.  With current versions of the  FASTA  pro-
	      grams,  independent  -m options can be combined; e.g. -m 1 -m 9c
	      -m 6.

       -m 11 provides lav format output	from lalign36.	It does	not
	      currently	affect other alignment	algorithms.   The  lav2ps  and
	      lav2svg  programs	 can  be  used to convert lav format output to
	      postscript/SVG alignment "dot-plots".

       -m B provides BLAST-like	alignments.  Alignments	are labeled as
	      "Query" and "Sbjct", with	coordinates on the same	 line  as  the
	      sequences, and BLAST-like	symbols	for matches and	mismatches. -m
	      BB extends BLAST similarity to all the output, providing an out-
	      put that closely mimics BLAST output.

       -m "F# out.file"	allows one search to write different alignment
	      formats  to  different  files.   The 'F' indicates separate file
              output; the '#' is the output format (1-6,8,9,10,11,B,BB).
              Multiple compatible formats can be combined, separated by
              commas (',').

       -M #-# molecular	 weight	(residue) cutoffs.  -M "101-200" examines only
	      library sequences	that are 101-200 residues long.

       -n     force query to nucleotide	sequence

       -N #   break long library sequences into	blocks of # residues.	Useful
              for bacterial genomes, which have only one sequence entry.  -N
              2000 works well for bacterial genomes. (This option was
	      required when FASTA only	provided  one  alignment  between  the
	      query  and library sequence.  It is not as useful, now that mul-
	      tiple alignments are available.)

       -o "#,#"
	      offsets query, library sequence for numbering alignments

       -O file
	      send output to file.

       -p     force query to protein alphabet.

       -P pssm_file
	      (ssearch36,  ggsearch36,	glsearch36  only).   Provide  blastpgp
	      checkpoint file as the PSSM for searching. Two PSSM file formats
	      are  available,  which  must  be	provided  with	the  filename.
	      'pssm_file 0' uses a binary format  that	is  machine  specific;
	      'pssm_file 1' uses the "blastpgp -u 1 -C pssm_file" ASN.1	binary
	      format (preferred).

       -q/-Q  quiet option; do not prompt for input (on	by default)

       -r "+n/-m"
	      (DNA  only) values for match/mismatch for	DNA comparisons. +n is
              used for the maximum positive value and -m is used for the
              maximum negative value. Values between max and min are
              rescaled, but residue pairs having the value -1 continue to be
              -1.

       -R file
	      save all scores to statistics file (previously -r	file)

       -s name
	      specify  substitution  matrix.   BLOSUM50	 is  used  by default;
	      PAM250, PAM120, and BLOSUM62 can	be  specified  by  setting  -s
	      P120,  P250, or BL62.  Additional	scoring	matrices include: BLO-
	      SUM80 (BL80), and	MDM10, MDM20, MDM40 (Jones, Taylor, and	Thorn-
	      ton, 1992	CABIOS 8:275-282; specified as -s MD10,	 -s  MD20,  -s
	      MD40),  OPTIMA5  (-s  OPT5,  Kann	and Goldstein, (2002) Proteins
	      48:367-376), and VTML160 (-s VT160, Mueller and  Vingron	(2002)
	      J. Comp. Biol. 19:8-13).	Each scoring matrix has	associated de-
	      fault gap	penalties.  The	BLOSUM62 scoring matrix	and -11/-1 gap
	      penalties	can be specified with -s BP62.

	      Alternatively, a BLASTP format scoring matrix file can be	speci-
	      fied, e.g. -s matrix.filename.  DNA scoring matrices can also be
	      specified	with the "-r" option.

	      With  fasta36.3,	variable  scoring matrices can be specified by
              preceding the scoring matrix abbreviation with '?', e.g. -s
	      '?BP62'.	Variable  scoring matrices allow the FASTA programs to
	      choose an	alternative scoring  matrix  with  higher  information
	      content  (bit  score/position) when short	queries	are used.  For
	      example, a 90 nucleotide FASTX  query  can  produce  only	 a  30
	      amino-acid  alignment,  so a scoring matrix with 1.33 bits/posi-
	      tion is required to produce a 40 bit score. The  FASTA  programs
	      include  BLOSUM50	 (0.49	bits/pos) and BLOSUM62 (0.58 bits/pos)
	      but can range to MD10 (3.44 bits/position). The variable scoring
	      matrix option searches down the list of scoring matrices to find
	      one with information content high	enough to  produce  a  40  bit
	      alignment	score.
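
              The arithmetic behind this choice can be sketched as follows.
              Only the three matrices whose information content is quoted
              above are listed; the list the programs actually search is
              longer and is not reproduced here.  The function names, the
              ordering of the list, and starting the search at BLOSUM50 are
              assumptions of this sketch.

```python
# (matrix, bits/position) pairs quoted in the text above, ordered from
# lowest to highest information content.
MATRICES = [("BLOSUM50", 0.49), ("BLOSUM62", 0.58), ("MD10", 3.44)]
TARGET_BITS = 40.0


def required_bits_per_position(query_length, translated=False):
    """Bits per aligned position needed to reach a 40-bit score."""
    # A translated DNA query yields at most one amino acid per three bases.
    aligned = query_length // 3 if translated else query_length
    return TARGET_BITS / aligned


def pick_matrix(query_length, translated=False):
    """Return the first listed matrix with enough information content, or None."""
    need = required_bits_per_position(query_length, translated)
    for name, bits in MATRICES:
        if bits >= need:
            return name
    return None


# The 90-nucleotide FASTX example from the text: 40 / 30 = 1.33 bits/position.
print(round(required_bits_per_position(90, translated=True), 2))
print(pick_matrix(90, translated=True))
```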

       -S     treat  lower  case  letters in the query or database as low com-
	      plexity regions that are equivalent to 'X'  during  the  initial
	      database	scan, but are treated as normal	residues for the final
	      alignment	display.  Statistical estimates	are based on the 'X'ed
	      out sequence used	during the initial search.  Protein  databases
	      (and query sequences) can	be generated in	the appropriate	format
              using John Wootton's "pseg" program, available from
	      ftp://ftp.ncbi.nih.gov/pub/seg/pseg.  Once you have compiled the
	      "pseg" program, use the command:

	      pseg database.fasta -z 1 -q  > database.lc_seg

       -t #   Translation table - [t]fastx36 and [t]fasty36 support the BLAST
              translation tables.  See
              http://www.ncbi.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c/.

       -T #   (threaded, parallel only)	number of threads or  workers  to  use
	      (on  Linux/MacOS/Unix,  the default is to	use as many processors
	      as are available;	on Windows systems, 2 processors are used).

       -U     Do RNA sequence comparisons: treat 'T' as	'U',  allow  G:U  base
	      pairs (by	scoring	"G-A" and "T-C"	as score(G:G)-3).  Search only
	      one strand.

       -V "?$%*"
	      Allow  special  annotation  characters in	query sequence.	 These
	      characters will be displayed in the alignments on	the coordinate
	      number line.

       -w # line width for similarity score, sequence alignment, output.

       -W # context length (default is 1/2 of line width -w) for alignment
              programs, like fasta and ssearch, that provide additional
              sequence context.

       -X extended options.  Less used options.	Other options include
	      -XB, -XM4G, -Xo, -Xx, and	-Xy; see fasta_guide.pdf.

       -z 1, 2,	3, 4, 5, 6
	      Specify  the  statistical	calculation. Default is	-z 1 for local
	      similarity searches, which uses regression against the length of
	      the library sequence. -z -1 disables statistics.	-z 0 estimates
	      significance without normalizing for sequence length. -z 2  pro-
	      vides  maximum  likelihood estimates for lambda and K, censoring
	      the 250 lowest and 250 highest scores. -z	3  uses	 Altschul  and
	      Gish's statistical estimates for specific	protein	BLOSUM scoring
	      matrices	and  gap  penalties.  -z  4,5: an alternate regression
	      method.  -z 6 uses a composition based maximum likelihood	 esti-
	      mate  based  on  the  method  of	Mott  (1992) Bull. Math. Biol.
	      54:59-75.

       -z 11,12,14,15,16
	      compute the  regression  against	scores	of  randomly  shuffled
	      copies  of the library sequences.	 Twice as many comparisons are
	      performed, but accurate estimates	can be	generated  from	 data-
	      bases  of	 related  sequences.  -z  11  uses the -z 1 regression
	      strategy,	etc.

       -z 21, 22, 24, 25, 26
	      compute two E()-values.  The standard (library-based)  E()-value
	      is  calculated  in the standard way (-z 1, 2, etc), but a	second
	      E2() value is calculated by shuffling the	high-scoring sequences
	      (those with E()-values less than the threshold).	For  "average"
	      composition  proteins,  these  two  estimates  will  be  similar
	      (though the best-shuffle estimates  are  always  more  conserva-
	      tive).   For  biased composition proteins, the two estimates may
	      differ by	100-fold or more.  A second -z option, e.g. -z "21 2",
	      specifies	the estimation method for the  best-shuffle  E2()-val-
	      ues. Best-shuffle	E2()-values approximate	the estimates given by
	      PRSS (or in a pairwise SSEARCH).

       -Z db_size
	      Set the apparent database	size used for expectation value	calcu-
	      lations  (used  for  protein/protein  FASTA and SSEARCH, and for
	      [T]FASTX/Y).

Reading	sequences from STDIN
       The FASTA programs can accept a query sequence from  the	 unix  "stdin"
       data  stream.   This  makes it much easier to use fasta36 and its rela-
       tives as	part of	a WWW page. To indicate	that stdin is to be used,  use
       "@" as the query	sequence file name.  "@" can also be used to specify a
       subset of the query sequence to be used, e.g.:

     cat query.aa | fasta36 @:50-150 s

       would search the 's' database with residues 50-150 of query.aa.  FASTA
       cannot automatically detect the sequence type (protein vs DNA) when
       "stdin" is used and assumes protein comparisons by default; the '-n'
       option is required for DNA queries read from stdin.
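
       When a search is driven from another program, the same conventions
       apply.  The sketch below only builds the argument list; the helper
       name stdin_query_command is hypothetical, and the "@", "@:start-end"
       and -n conventions are the ones described in this section.

```python
def stdin_query_command(program, library, start=None, end=None, dna=False):
    """Build an argv list for a query that will be supplied on stdin."""
    # "@" reads the whole query from stdin; "@:start-end" selects a subset.
    if start is None or end is None:
        query = "@"
    else:
        query = f"@:{start}-{end}"
    argv = [program]
    if dna:
        argv.append("-n")   # sequence type is not auto-detected on stdin
    argv += [query, library]
    return argv


print(stdin_query_command("fasta36", "s", 50, 150))
print(stdin_query_command("fasta36", "s", dna=True))
```

       The list can then be passed to subprocess.run() with the sequence text
       given as its input argument, which plays the role of the "cat
       query.aa |" pipe in the shell example.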

Environment variables:
       FASTLIBS
	      location of library choice file (-l FASTLIBS)

       SRCH_URL1, SRCH_URL2
	      format strings used to define options to re-search the database.

       REF_URL
              the format string used to define the option to look up the
              library sequence in Entrez, or some other database.

AUTHOR
       Bill Pearson
       wrp@virginia.EDU

       Version:	$ Id: $	Revision: $Revision: 210 $

					fasta36/ssea...36/lalign36    1(local)