       fasta - scan a protein or DNA sequence library for similar sequences

       tfasta  -  compare a protein sequence to	a DNA sequence library,	trans-
       lating the DNA sequence library `on-the-fly'.

       lfasta -	compare	two protein or DNA sequences for local similarity  and
       show the	local sequence alignments

       plfasta - compare two sequences for local similarity and	plot the local
       sequence	alignments

       fasta [-a -A -b # -c # -d #  -E # -f # -g # -k #	-l file	-L  FASTLIBS
       -r STATFILE -m #	-o -O file -p #	-Q -s SMATRIX -w # -x "# #" -y # -z -1
       ] query-sequence-file library-file [ ktup ]

       fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file @library-name-file

       fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file "%PRMVI"

       fasta [-aAbcdEgHlmnoOprswyx] - interactive mode

       fastx [-aAbcdEfghHlmnoOprswyx] DNA-query-file protein-library [ ktup ]

       tfasta [-aAbcdEfgkmoOprswy3] protein-query-file DNA-library [ ktup ]

       tfastx [-abcdEfghHikmoOprswy3] protein-query-file DNA-library [ ktup ]

       lfasta [-afgmnpswx] sequence-file-1 sequence-file-2 [ ktup ]

       plfasta [-afgkmnpsxv] sequence-file-1 sequence-file-2 [ ktup ]

       fasta is used to compare a protein or DNA sequence to all of the
       entries in a sequence library.  For example, fasta can compare a
       protein sequence to all of the sequences in the NBRF PIR protein
       sequence database.  fasta automatically decides whether the query
       sequence is DNA or protein by reading the query sequence as protein
       and determining whether the `amino-acid composition' is more than
       85% A+C+G+T.  fasta uses an improved version, described in Pearson
       and Lipman, Proc. Natl. Acad. Sci. USA (1988) 85:2444, of the rapid
       sequence comparison algorithm described by Lipman and Pearson
       (Science (1985) 227:1427).  The program can be invoked either with
       command line arguments or in interactive mode.  The optional third
       argument, ktup, sets the sensitivity and speed of the search.  If
       ktup=2, similar regions in the two sequences being compared are
       found by looking at pairs of aligned residues; if ktup=1, single
       aligned amino acids are examined.  ktup can be set to 2 or 1 for
       protein sequences, or from 1 to 6 for DNA sequences.  If ktup is
       not specified, the default is 2 for proteins and 6 for DNA.
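
       The query-type test and the default ktup described above can be
       sketched as follows.  This is only an illustration, not the
       program's source: the function names are hypothetical, and counting
       only alphabetic characters is an assumption.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Return True when more than `threshold` of the letters are A, C, G or T.

    Assumption: only alphabetic characters are counted; digits, blanks and
    punctuation are skipped.
    """
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        return False
    acgt = sum(1 for ch in letters if ch in "ACGT")
    return acgt / len(letters) > threshold


def default_ktup(sequence):
    """Default ktup from the text above: 6 for DNA, 2 for protein."""
    return 6 if looks_like_dna(sequence) else 2


if __name__ == "__main__":
    print(default_ktup("ACGTACGTAC"))
    print(default_ktup("WILLLSQ"))
```

       A protein that happens to be rich in A, C, G and T residues would
       be misclassified by a test of this kind; the -n option forces the
       query to be treated as DNA.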

       fasta compares a query sequence to a sequence library, which consists
       of sequence data interspersed with comments (see below).  Normally
       fasta, fastx, tfasta, and tfastx search the libraries listed in the
       file pointed to by the environment variable FASTLIBS.  The format of
       this file is described in the file FASTA.DOC.

       tfasta compares a protein sequence to a DNA sequence database,
       translating the DNA sequence library in 6 frames `on-the-fly' (3
       frames with the -3 option).  The search uses the standard BLOSUM50
       scoring matrix and ktup=2 by default.  tfasta searches a DNA
       sequence database in the standard text format described below.

       tfastx, like tfasta, compares a protein sequence to a DNA sequence
       library.  However, tfastx compares the protein sequence to the
       forward and reverse three-frame translation of the DNA library
       sequence, allowing for frameshifts.  fastx compares a DNA sequence to
       a protein sequence database, translating the DNA sequence in three
       frames and allowing frameshifts in the alignment.

       The lfasta and plfasta programs compare two sequences, looking for
       local sequence similarities.  While fasta, fastx, and tfasta report
       only the best alignment between the query sequence and the library
       sequence, lfasta and plfasta report all of the alignments between
       the two sequences with scores greater than a cut-off value.  lfasta
       shows the actual local alignments between the two sequences and
       their scores, while plfasta produces a plot of the alignments that
       looks similar to a `dot-matrix' homology plot.  On Unix(TM) systems,
       plfasta generates postscript

       The fasta programs use a standard text format sequence file.  Lines
       beginning with '>' or ';' are considered comments and ignored;
       sequences can be upper or lower case, and blanks, tabs, and
       unrecognizable characters are ignored.  fasta expects sequences to
       use the single letter amino acid codes; see protcodes(1).  Library
       files for fasta should have the form shown below.
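
       A minimal reader for this text format might look like the sketch
       below.  It is not the parser used by the fasta programs.  The name
       read_sequences is hypothetical, and two choices are assumptions:
       every non-alphabetic character is treated as unrecognizable, and
       residues are folded to upper case.

```python
def read_sequences(text):
    """Split sequence-file text into (comment, residues) pairs.

    A line starting with '>' opens a new entry and supplies its comment.
    A line starting with ';' is skipped.  On every other line, only the
    alphabetic characters are kept (assumption) and upper-cased.
    """
    entries = []
    header, residues = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None or residues:
                entries.append((header, "".join(residues)))
            header, residues = line[1:].strip(), []
        elif line.startswith(";"):
            continue
        else:
            residues.extend(ch.upper() for ch in line if ch.isalpha())
    if header is not None or residues:
        entries.append((header, "".join(residues)))
    return entries


if __name__ == "__main__":
    sample = ">LCBO bovine preprolactin\nWILLLSQ\n; a comment\nwill lsq\n"
    for comment, seq in read_sequences(sample):
        print(comment, seq)
```

       A file with no '>' line yields a single entry whose comment is
       None.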

       fasta and the other programs can be directed to change the scoring
       matrix, search parameters, output format, and default search
       directories by entering options on the command line (preceded by a
       `-', or a `/' for MS-DOS).  All of the options should precede the
       file name and ktup arguments.  Alternatively, these options can be
       changed by setting environment variables.  The options and
       environment variables are:

       -1     Normally,	 the  top  scoring sequences are ranked	by the z-score
	      based on the opt score.  To rank sequences by  raw  scores,  use
	      the  -z  option. With the	-1 option, sequences are ranked	by the
	      z-score based on the init1 score.	With the

       -a     (SHOWALL) Modifies the display of the two sequences in
	      alignments.  Normally, both sequences are shown only where they
	      overlap (SHOWALL=0); if -a is given or the environment variable
	      SHOWALL=1, both sequences are shown in their entirety.

       -A     Force  use  of  unlimited	Smith-Waterman alignment for DNA FASTA
	      and TFASTA.  By default, the program uses	the older (and faster)
	      band-limited  Smith-Waterman  alignment for DNA FASTA and	TFASTA

       -b #   The number of similarity scores to be shown when the  -Q	option
	      is  used.	  This value is	usually	calculated based on the	actual

       -c #   (OPTCUT) The threshold for optimization with  the	 option.   The
	      OPTCUT value is normally calculated based	on sequence length.

       -d #   The number of alignments to be shown.  Normally, fasta shows the
	      same number of alignments	as similarity scores.  By using	 fasta
	      -Q -b 200	-d 50, one would see the top scoring 200 sequences and
	      alignments for the 50 best scores.

       -E #   The expectation value threshold for displaying similarity	scores
	      and  sequence  alignments.   fasta  -Q -E	2.0 would show all li-
	      brary sequences with scores expected to occur  no	 more  than  2
	      times by chance in a search of the library.

       -f #   Penalty for the first residue in a gap (-12 by default for fasta
	      with proteins, -16 for DNA).

       -g #   Penalty for additional residues in a  gap	 (-2  by  default  for
	      fasta with proteins, -4 for DNA).
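
	      As a worked illustration of the -f and -g defaults, the
	      sketch below totals the penalty for one gap.  It assumes the
	      cost is the -f value for the first residue plus the -g value
	      for each further residue, which is one reading of the two
	      option descriptions; gap_penalty is a hypothetical name.

```python
def gap_penalty(length, first=-12, extend=-2):
    """Total penalty for a gap of `length` residues.

    Assumption: the first residue costs `first` (-f) and each additional
    residue costs `extend` (-g).  Defaults are the protein values.
    """
    if length <= 0:
        return 0
    return first + (length - 1) * extend


if __name__ == "__main__":
    print(gap_penalty(3))                        # protein defaults
    print(gap_penalty(3, first=-16, extend=-4))  # DNA defaults
```

	      Under that assumption, a three-residue gap costs -16 with
	      the protein defaults and -24 with the DNA defaults.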

       -h #   (fastx, tfastx only) penalty for a +1 or -1 frameshift.

       -H     Do not display histogram of similarity scores.

       -i     (fasta,  fastx)  search with the reverse-complement of the query
	      DNA sequence.  (tfastx) search only the  reverse	complement  of
	      the DNA library sequence.

       -k #   (GAPCUT)	Sets the threshold for joining the initial regions for
	      calculating the initn score.

       -l file
	      (FASTLIBS) The name of the library  menu	file.	Normally  this
	      will  be	determined by the environment variable FASTLIBS.  How-
	      ever, a library menu file	can also be specified with -l.

       -L     display more information	about  the  library  sequence  in  the

       -m #   (MARKX) =0,1,2,3,4,10.  Alternate display of matches and
	      mismatches in alignments.  MARKX=0 uses ":", ".", and " " for
	      identities, conservative replacements, and non-conservative
	      replacements, respectively.  MARKX=1 uses " ", "x", and "X".
	      MARKX=2 does not show the second sequence, but uses the second
	      alignment line to display matches with a "." for identity, or
	      with the mismatched residue for mismatches.  MARKX=2 is useful
	      for aligning large numbers of similar sequences.  MARKX=3
	      writes out a file of library sequences in FASTA format.
	      MARKX=3 should always be used with the "SHOWALL" (-a) option,
	      but this does not completely ensure that all of the sequences
	      output will be aligned.  MARKX=4 displays a graph of the
	      alignment of the library sequence with respect to the query
	      sequence, so that one can identify the regions of the query
	      sequence that are conserved.  MARKX=10 is used to produce a
	      parseable output format.
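
	      The MARKX=2 display can be imitated with the short sketch
	      below.  It only illustrates the convention and does not
	      reproduce fasta's output code.  The function name is
	      hypothetical, and the inputs are assumed to be two rows of
	      an alignment that already have equal length.

```python
def markx2_line(query_row, library_row):
    """Build the second alignment line in the MARKX=2 style.

    Each position shows "." where the library residue is identical to the
    query residue, and the library residue itself where the two differ.
    Assumption: both rows are already aligned and equally long.
    """
    if len(query_row) != len(library_row):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "." if q == s else s for q, s in zip(query_row, library_row)
    )


if __name__ == "__main__":
    query = "WILLLSQ"
    print(query)
    print(markx2_line(query, "WIVLLSQ"))
```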

       -n     Forces the query sequence	to be treated as a DNA sequence.

       -O filename
	      Send a copy of the results to "filename".

       -o     Turns  off  default fasta	limited	optimization on	all of the se-
	      quences in the library with initn	scores	greater	 than  OPTCUT.
	      This option is now the reverse of	previous versions of fasta.

       -Q     Quiet option.  This allows fasta and tfasta to search a database
	      and report the results without asking any	 questions.  fasta  -Q
	      file  library  > output can be put in the	background or run at a
	      later time with the unix 'at' command.  The number of similarity
	      scores  and alignments displayed with the	-Q option can be modi-
	      fied with	the -b (scores)	and -d (alignments) options.

       -r     STATFILE Causes fasta to write out the sequence identifier,  su-
	      perfamily	 number	(if available),	and similarity scores to STAT-
	      FILE for every sequence in the library.  These results  are  not

       -s str (SMATRIX)	 the  filename	of an alternative scoring matrix file.
	      For protein sequences, BLOSUM50 is used by default;  PAM250  can
	      be used with the command line option -s 250.

       -v str (LINEVAL) (plfasta only) plfasta and pclfasta can use up to 4
	      different line styles to denote the scores of local alignments.
	      The scores that correspond to these line styles can be
	      specified with the environment variable LINEVAL, or with the -v
	      option.  In either case, a string with three numbers separated
	      by spaces should be given.  This string must be surrounded by
	      double quotation marks.  For example, LINEVAL="200 100 50"
	      tells plfasta to use solid lines for local alignments with
	      scores greater than 200, long dashed lines for scores between
	      100 and 200, short dashed lines for scores between 50 and 100,
	      and dotted lines for scores less than 50.
		   plfasta -v "200 100 50"
	      Normally,	 the  values are 200, 100, and 50 for protein sequence
	      comparisons and 400, 200,	and 100	for DNA	sequence comparisons.
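
	      The mapping from a score to a line style can be sketched as
	      below.  The function names are hypothetical, and the handling
	      of scores that fall exactly on a threshold is an assumption,
	      since the description above leaves those cases open.

```python
def parse_lineval(value):
    """Parse a LINEVAL string such as "200 100 50" into three integers."""
    parts = value.split()
    if len(parts) != 3:
        raise ValueError("LINEVAL needs three numbers separated by spaces")
    high, middle, low = (int(p) for p in parts)
    return high, middle, low


def line_style(score, lineval="200 100 50"):
    """Pick one of the four line styles for a local alignment score.

    Assumption: a score exactly equal to a threshold is placed in the
    lower of the two neighbouring bands, except at the lowest threshold,
    where it is drawn short dashed.
    """
    high, middle, low = parse_lineval(lineval)
    if score > high:
        return "solid"
    if score >= middle:
        return "long dashed"
    if score >= low:
        return "short dashed"
    return "dotted"


if __name__ == "__main__":
    for s in (250, 150, 75, 10):
        print(s, line_style(s))
```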

       -w #   (LINLEN) output line length for sequence alignments.   (normally
	      60, can be set up	to 200).

       -x "offset1 offset2"
	      Causes  fasta/lfasta/plfasta  to start numbering the aligned se-
	      quences starting with offset1 and	offset2, rather	than 1 and  1.
	      This  is	particularly useful for	showing	alignments of promoter

       -y     Set the band-width used for optimization.	 -y 16 is the  default
	      for  protein  when  ktup=2  and for all DNA alignments. -y 32 is
	      used for protein and ktup=1.  For	proteins,  optimization	 slows
	      comparison 2-fold	and is highly recommended.

       -z     Do  not  do  statistical	significance  calculation. Results are
	      ranked by	the unnormalized opt, initn, or	init1 score.

       -3     (tfasta, tfastx only) Normally tfasta and tfastx translate
	      sequences in the DNA sequence library in all six frames.  With
	      the -3 option, only the three forward frames are searched.
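
	      The difference between six-frame and three-frame translation
	      can be illustrated with the sketch below.  It is not tfasta's
	      translation code.  All names are hypothetical, the standard
	      genetic code is assumed, and codons containing other letters
	      are shown as "X" (also an assumption).

```python
from itertools import product

# Standard genetic code (assumption), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate one reading frame, starting at `offset` (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def translate_frames(dna, forward_only=False):
    """Three forward frames, plus three reverse frames unless forward_only."""
    frames = [translate(dna, k) for k in range(3)]
    if not forward_only:
        rc = reverse_complement(dna)
        frames += [translate(rc, k) for k in range(3)]
    return frames


if __name__ == "__main__":
    print(translate_frames("ATGGCCTAA"))
    print(translate_frames("ATGGCCTAA", forward_only=True))
```

	      Calling translate_frames with forward_only=True corresponds
	      to the restriction described for -3.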

       (1)    fasta musplfm.aa $AABANK

       Compare the amino acid sequence in the file musplfm.aa with the
       complete PIR protein sequence library using ktup = 2.

       Each "library" sequence (there need only be one) should start with a
       comment line which starts with a '>', e.g.

	    >LCBO bovine preprolactin
	    WILLLSQ ...
	    >LCHU human	...

       (2)    fasta -a -w 80 musplfm.aa	lcbo.aa	1

       Compare	the  amino  acid  sequence in the file musplfm.aa with the se-
       quences in the file lcbo.aa using ktup =	1.   Show  both	 sequences  in
       their entirety, with 80 residues	on each	output line.

       (3)    fasta

       Run the fasta program in interactive mode.  The program will prompt for
       the file name for the query sequence, list alternative libraries to be
       searched (if FASTLIBS is set), and prompt for the ktup.

       This  version of	fasta prompts for the library file to be searched from
       a list of file names that are saved in the file pointed to by the envi-
       ronment	variable  FASTLIBS.   If FASTLIBS = fastgb.list, then the file
       fastgb.list might have the entries:

	    NBRF Protein$0P/u/lib/aabank.lib 0
	    GB Primate$1P@/u/lib/gpri.nam
	    GB Rodent$1R@/u/lib/grod.nam
	    GB Mammal$1M@/u/lib/gmammal.nam

       Each line in this file has 4 fields: (1) the library name, separated
       from the remaining fields by a '$'; (2) a 0 or a 1, indicating a
       protein or DNA library respectively; (3) a single letter that will
       be used to choose the library; (4) the location of the library file
       itself.  The library file name can contain an optional library
       format specifier.  fasta recognizes the following library formats:
       0 - Pearson/FASTA; 1 - Genbank flat file; 2 - NBRF/PIR Codata;
       3 - EMBL/SWISS-PROT; 4 - Intelligenetics; 5 - NBRF/PIR VMS.  Note
       that this fourth field can contain an '@' character, which indicates
       that the library file is an indirect library file containing a list
       of library files, one per line.  An indirect library file might have
       the lines:
	    </usr/slib/genbank	(the directory for the library files)
	    gbpri.seq 1
	    gbrod.seq 1
	    gbmam.seq 1
	    gbvrl.seq 1
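
       The four-field layout of a FASTLIBS entry, as described above, can
       be taken apart with a sketch like the following.  It is based only
       on the example entries shown here and is not the parser used by
       fasta.  The function name and the dictionary keys are hypothetical,
       and file names containing spaces are assumed not to occur.

```python
def parse_fastlibs_line(line):
    """Split one FASTLIBS entry into its fields.

    Layout assumed from the examples above:
        <library name>$<0|1><letter>[@]<file name>[ <format number>]
    """
    name, sep, rest = line.strip().partition("$")
    if not sep or len(rest) < 3 or rest[0] not in "01":
        raise ValueError("not a FASTLIBS entry: %r" % line)
    location = rest[2:].strip()
    indirect = location.startswith("@")  # '@' marks an indirect library file
    if indirect:
        location = location[1:]
    parts = location.split()
    if not parts:
        raise ValueError("missing library file name: %r" % line)
    return {
        "name": name,
        "dna": rest[0] == "1",           # 0 = protein, 1 = DNA
        "letter": rest[1],               # letter used to choose the library
        "indirect": indirect,
        "path": parts[0],
        "format": int(parts[1]) if len(parts) > 1 else None,
    }


if __name__ == "__main__":
    print(parse_fastlibs_line("NBRF Protein$0P/u/lib/aabank.lib 0"))
    print(parse_fastlibs_line("GB Rodent$1R@/u/lib/grod.nam"))
```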

       You can use your own sequence files for fasta; just be certain to put
       a '>' and comment as the first line before the sequence.  Only one
       library file type, the standard NBRF library format, is supported by
       the VAX/VMS programs.  lfasta and plfasta do not require the '>' and
       comment line; fasta does.

       rdf2(1), protcodes(5), dnacodes(5)

       Bill Pearson

				     local   FASTA/TFASTA/FASTX/TFASTXv2.0u(1)

