FreeBSD Manual Pages

FASTA/TFAS...FASTXv2.0u(1)  General Commands Manual FASTA/TFAS...FASTXv2.0u(1)

NAME
       fasta - scan a protein or DNA sequence library for similar sequences

       tfasta  -  compare a protein sequence to	a DNA sequence library,	trans-
       lating the DNA sequence library `on-the-fly'.

       lfasta -	compare	two protein or DNA sequences for local similarity  and
       show the	local sequence alignments

       plfasta - compare two sequences for local similarity and	plot the local
       sequence	alignments

SYNOPSIS
       fasta  [-a  -A -b # -c #	-d #  -E # -f #	-g # -k	# -l file -L FASTLIBS
       -r STATFILE -m #	-o -O file -p #	-Q -s SMATRIX -w # -x "# #" -y # -z -1
       ] query-sequence-file library-file [ ktup ]

       fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file @library-name-file

       fasta [-QaAbcdEfgHiklmnoOprswxyz] query-file "%PRMVI"

       fasta [-aAbcdEgHlmnoOprswyx] - interactive mode

       fastx [-aAbcdEfghHlmnoOprswyx] DNA-query-file protein-library [ ktup ]

       tfasta [-aAbcdEfgkmoOprswy3] protein-query-file DNA-library [ ktup ]

       tfastx [-abcdEfghHikmoOprswy3] protein-query-file DNA-library [ ktup ]

       lfasta [-afgmnpswx] sequence-file-1 sequence-file-2 [ ktup ]

       plfasta [-afgkmnpsxv] sequence-file-1 sequence-file-2 [ ktup ]

DESCRIPTION
       fasta is	used to	compare	a protein or DNA sequence to all  of  the  en-
       tries  in a sequence library.  For example, fasta can compare a protein
       sequence	to all of the sequences	in the NBRF PIR	protein	sequence data-
       base.  fasta will automatically decide whether the  query  sequence  is
       DNA or protein by reading the query sequence as protein and determining
       whether	the  `amino-acid composition' is more than 85% A+C+G+T.	 fasta
       uses an improved version of the rapid sequence comparison algorithm of
       Lipman and Pearson (Science, (1985) 227:1427); the improved version is
       described in Pearson and Lipman, Proc. Natl. Acad. Sci. USA, (1988)
       85:2444.  The program can be invoked either with command line
       arguments or in interactive mode.  The optional third argument, ktup,
       sets the sensitivity
       and speed of the	search.	 If ktup=2, similar regions  in	 the  two  se-
       quences	being  compared	 are  found  by	 looking  at  pairs of aligned
       residues; if ktup=1, single aligned amino acids are examined.  ktup can
       be set to 2 or 1	for protein sequences, or from 1  to  6	 for  DNA  se-
       quences.	  The default if ktup is not specified is 2 for	proteins and 6
       for DNA.
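
       The query-type test and the ktup defaults described above can be
       sketched as follows.  This is a minimal illustration, not the program's
       own code: the function names are hypothetical, and the handling of
       case and of non-letter characters is an assumption.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Return True if more than `threshold` of the residues are A, C, G or T.

    Assumption: case is ignored and non-letter characters are skipped.
    """
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        return False
    acgt = sum(ch in "ACGT" for ch in residues)
    return acgt / len(residues) > threshold


def default_ktup(is_dna):
    """Default ktup when none is given: 6 for DNA, 2 for protein."""
    return 6 if is_dna else 2


def check_ktup(ktup, is_dna):
    """Return ktup if it is in the documented range, else raise ValueError.

    Documented ranges: 1 or 2 for protein, 1 to 6 for DNA.
    """
    allowed = range(1, 7) if is_dna else range(1, 3)
    if ktup not in allowed:
        kind = "DNA" if is_dna else "protein"
        raise ValueError(f"ktup={ktup} is out of range for {kind} sequences")
    return ktup
```

       A query made up only of A, C, G and T is classified as DNA and gets
       ktup=6 unless a smaller value is given on the command line.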

       fasta compares a	query sequence to a sequence library which consists of
       sequence	data interspersed with comments, see below.   Normally	fasta,
       fastx,  tfasta,	and  tfastx  search  the  libraries listed in the file
       pointed to by the environment variable FASTLIBS.	 The  format  of  this
       file is described in the	file FASTA.DOC.	 tfasta	compares a protein se-
       quence to a DNA sequence	database, translating the DNA sequence library
       in  6  frames  `on-the-fly'  (3 frames with the -3 option).  The	search
       uses the	standard BLOSUM50 scoring matrix, and uses  a  ktup=2  by  de-
       fault.	tfasta	searches  a DNA	sequence database in the standard text
       format described	below.	tfastx,	like tfasta, compares  a  protein  se-
       quence to a DNA sequence	library.  However, tfastx compares the protein
       sequence	 to the	forward	and reverse three-frame	translation of the DNA
       library sequence, allowing for frameshifts.  fastx compares a  DNA  se-
       quence  to a protein sequence database, translating the DNA sequence in
       three frames and	allowing frameshifts in	 the  alignment.   lfasta  and
       plfasta programs	compare	two sequences looking for local	sequence simi-
       larities.   While  fasta, fastx,	and tfasta report only the best	align-
       ment between the	query sequence and the library	sequence,  lfasta  and
       plfasta	will  report  all  of the alignments between the two sequences
       with scores greater than	a cut-off value.  lfasta shows the actual  lo-
       cal  alignments	between	 the  two  sequences  and  their scores, while
       plfasta produces a plot of the alignments that looks similar to a
       `dot-matrix' homology plot.  On Unix(TM) systems, plfasta generates
       PostScript output.
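
       The six-frame (or, with -3, three-frame) translation that tfasta
       applies to each DNA library sequence can be illustrated with the sketch
       below.  It assumes the standard genetic code and writes "X" for any
       codon containing a non-ACGT base; both choices, and all function names,
       are assumptions made for illustration and may differ from what the
       programs do internally.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset):
    """Translate one reading frame starting at `offset` (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    # Assumption: codons with ambiguous bases are written as "X".
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def translated_frames(dna, forward_only=False):
    """Return 3 forward frames, plus 3 reverse frames unless forward_only."""
    strands = [dna] if forward_only else [dna, reverse_complement(dna)]
    return [translate(strand, off) for strand in strands for off in range(3)]
```

       Calling translated_frames(seq, forward_only=True) corresponds to the
       behaviour described for the -3 option.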

       The fasta programs use a standard text format sequence file.  Lines
       beginning with '>' or ';' are considered comments and ignored;
       sequences can be upper or lower case; blanks, tabs, and unrecognizable
       characters are ignored.  fasta expects sequences to use the single
       letter amino acid codes; see protcodes(1).  Library files for fasta
       should have the form shown in EXAMPLES below.
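
       A reader for this text format might look like the sketch below.  It
       treats each '>' line as the start of a new entry, skips ';' lines, and
       keeps only letters.  Upper-casing the residues and treating every
       non-letter as "unrecognizable" are assumptions; the function name is
       hypothetical.

```python
def read_library(lines):
    """Parse an iterable of text lines into (comment, sequence) pairs."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A '>' comment line begins a new library sequence.
            entries.append([line[1:].strip(), []])
        elif line.startswith(";"):
            # ';' comment lines carry no sequence data.
            continue
        elif entries:
            # Drop blanks, tabs and anything that is not a letter.
            residues = "".join(ch for ch in line if ch.isalpha())
            entries[-1][1].append(residues.upper())
    return [(comment, "".join(parts)) for comment, parts in entries]
```

       With an open file object, read_library(handle) returns one pair per
       '>' line found in the file.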

OPTIONS
       fasta  and the other programs can be directed to	change the scoring ma-
       trix, search parameters,	output format, and default search  directories
       by entering options on the command line (preceded by a `-', or `/' for
       MS-DOS).  All of the options should precede the file name and ktup
       arguments.  Alternatively, these options can be changed by setting
       environment variables.  The options and environment variables are:

       -1     Normally,	 the  top  scoring sequences are ranked	by the z-score
	      based on the opt score.  To rank sequences by  raw  scores,  use
	      the  -z  option. With the	-1 option, sequences are ranked	by the
	      z-score based on the init1 score.	With the

       -a     (SHOWALL)	Modifies the display of	the two	 sequences  in	align-
	      ments.  Normally,	both sequences are shown only where they over-
	      lap (SHOWALL=0); If -a or	the environment	variable SHOWALL =  1,
	      both sequences are shown in their	entirety.

       -A     Force  use  of  unlimited	Smith-Waterman alignment for DNA FASTA
	      and TFASTA.  By default, the program uses	the older (and faster)
	      band-limited Smith-Waterman alignment for	DNA FASTA  and	TFASTA
	      alignments.

       -b #   The  number  of similarity scores	to be shown when the -Q	option
	      is used.	This value is usually calculated based on  the	actual
	      scores.

       -c #   (OPTCUT)	The  threshold	for optimization with the option.  The
	      OPTCUT value is normally calculated based	on sequence length.

       -d #   The number of alignments to be shown.  Normally, fasta shows the
	      same number of alignments	as similarity scores.  By using	 fasta
	      -Q -b 200	-d 50, one would see the top scoring 200 sequences and
	      alignments for the 50 best scores.

       -E #   The expectation value threshold for displaying similarity	scores
	      and  sequence  alignments.   fasta  -Q -E	2.0 would show all li-
	      brary sequences with scores expected to occur  no	 more  than  2
	      times by chance in a search of the library.

       -f #   Penalty for the first residue in a gap (-12 by default for fasta
	      with proteins, -16 for DNA).

       -g #   Penalty  for  additional	residues  in  a	gap (-2	by default for
	      fasta with proteins, -4 for DNA).
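
              Taken together, -f and -g describe a gap cost that depends on
              gap length.  The sketch below computes it from the two
              documented defaults; the function name is hypothetical, and the
              formula is simply the reading of the two option descriptions
              above (one "first residue" penalty plus one "additional
              residue" penalty for each further residue).

```python
def gap_penalty(length, first=-12, additional=-2):
    """Total penalty for a gap of `length` residues.

    Defaults are the documented protein values; use first=-16 and
    additional=-4 for DNA.
    """
    if length < 1:
        return 0
    return first + (length - 1) * additional
```

              With the protein defaults, a one-residue gap costs -12 and a
              three-residue gap costs -16.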

       -h #   (fastx, tfastx only) penalty for a +1 or -1 frameshift.

       -H     Do not display histogram of similarity scores.

       -i     (fasta, fastx) search with the reverse-complement	of  the	 query
	      DNA  sequence.   (tfastx)	 search	only the reverse complement of
	      the DNA library sequence.

       -k #   (GAPCUT) Sets the	threshold for joining the initial regions  for
	      calculating the initn score.

       -l file
	      (FASTLIBS)  The  name  of	 the library menu file.	 Normally this
	      will be determined by the	environment variable  FASTLIBS.	  How-
	      ever, a library menu file	can also be specified with -l.

       -L     Display more information about the library sequence in the
              alignment.

       -m #   (MARKX) =0,1,2,3,4,10. Alternate display	of  matches  and  mis-
	      matches in alignments. MARKX=0 uses ":","."," ", for identities,
              conservative replacements, and non-conservative replacements,
              respectively.  MARKX=1 uses " ","x", and "X".  MARKX=2 does not
	      show the second sequence,	but uses the second alignment line  to
	      display matches with a "."  for identity,	or with	the mismatched
	      residue  for  mismatches.	  MARKX=2 is useful for	aligning large
	      numbers of similar sequences.  MARKX=3 writes out	a file of  li-
	      brary  sequences in FASTA	format.	 MARKX=3 should	always be used
	      with the "SHOWALL" (-a) option, but this does not	completely en-
	      sure that	all of the sequences output will be  aligned.  MARKX=4
	      displays	a  graph of the	alignment of the library sequence with
              respect to the query sequence, so that one can identify the
              regions of the query sequence that are conserved. MARKX=10 is used
	      to produce a parseable output format.
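
              The marker conventions for MARKX=0, 1 and 2 can be illustrated
              with the sketch below.  It works on two already-aligned strings
              of equal length and does not handle gaps.  What counts as a
              "conservative" replacement depends on the scoring matrix, so
              the caller supplies that test; the is_conservative parameter
              and the function name are hypothetical.

```python
def match_line(query, library, markx=0, is_conservative=lambda a, b: False):
    """Build the line printed between or below two aligned sequences."""
    if len(query) != len(library):
        raise ValueError("aligned sequences must have the same length")
    # (identity, conservative, non-conservative) symbols per MARKX value.
    symbols = {0: (":", ".", " "), 1: (" ", "x", "X")}
    if markx not in (0, 1, 2):
        raise ValueError("this sketch only covers MARKX 0, 1 and 2")
    out = []
    for q, lib in zip(query, library):
        if markx == 2:
            # "." for identity, otherwise the mismatched library residue.
            out.append("." if q == lib else lib)
        else:
            same, conservative, other = symbols[markx]
            if q == lib:
                out.append(same)
            elif is_conservative(q, lib):
                out.append(conservative)
            else:
                out.append(other)
    return "".join(out)
```

              For example, if only D/E exchanges are treated as conservative,
              aligning ACDE against ACEF marks two identities, one
              conservative replacement and one non-conservative replacement.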

       -n     Forces the query sequence	to be treated as a DNA sequence.

       -O filename
              Send a copy of results to "filename".

       -o     Turns  off  default fasta	limited	optimization on	all of the se-
	      quences in the library with initn	scores	greater	 than  OPTCUT.
	      This option is now the reverse of	previous versions of fasta.

       -Q     Quiet option.  This allows fasta and tfasta to search a database
	      and  report  the	results	without	asking any questions. fasta -Q
	      file library > output can	be put in the background or run	 at  a
	      later time with the unix 'at' command.  The number of similarity
	      scores  and alignments displayed with the	-Q option can be modi-
	      fied with	the -b (scores)	and -d (alignments) options.

       -r STATFILE
              Causes fasta to write out the sequence identifier, superfamily
              number (if available), and similarity scores to STATFILE for
              every sequence in the library.  These results are not sorted.

       -s str (SMATRIX)	 the  filename	of an alternative scoring matrix file.
	      For protein sequences, BLOSUM50 is used by default;  PAM250  can
	      be used with the command line option -s 250.

       -v str (LINEVAL)	 (plfasta  only)  plfasta and pclfasta can use up to 4
	      different	line styles to denote the scores of local  alignments.
	      The scores that correspond to these line styles can be specified
              with the environment variable LINEVAL, or with the -v option.  In
	      either  case,  a	string	with three numbers separated by	spaces
	      should be	given.	This string must be surrounded by double  quo-
	      tation  marks.   For example, LINEVAL="200 100 50" tells plfasta
	      to use solid lines for local alignments with scores greater than
	      200, long	dashed lines for scores	between	 100  and  200,	 short
	      dashed lines for scores between 50 and 100, and dotted lines for
	      scores less than 50.
		   plfasta -v "200 100 50"
	      Normally,	 the  values are 200, 100, and 50 for protein sequence
	      comparisons and 400, 200,	and 100	for DNA	sequence comparisons.
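
              The mapping from score to line style can be sketched as below.
              The treatment of scores exactly equal to a threshold is not
              stated above, so this sketch assumes each style applies to
              scores strictly greater than its threshold; the function names
              are hypothetical.

```python
def parse_lineval(text):
    """Parse a LINEVAL string such as "200 100 50" into three integers."""
    values = [int(v) for v in text.split()]
    if len(values) != 3:
        raise ValueError("LINEVAL needs exactly three numbers")
    return tuple(values)


def line_style(score, lineval="200 100 50"):
    """Return the line style used for a local alignment with this score."""
    high, mid, low = parse_lineval(lineval)
    if score > high:
        return "solid"
    if score > mid:
        return "long dashed"
    if score > low:
        return "short dashed"
    return "dotted"
```

              With the DNA defaults ("400 200 100"), a score of 250 falls in
              the long dashed range rather than the solid one.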

       -w #   (LINLEN) output line length for sequence alignments.   (normally
	      60, can be set up	to 200).

       -x "offset1 offset2"
	      Causes  fasta/lfasta/plfasta  to start numbering the aligned se-
	      quences starting with offset1 and	offset2, rather	than 1 and  1.
	      This  is	particularly useful for	showing	alignments of promoter
	      regions.

       -y     Set the band-width used for optimization.	 -y 16 is the  default
	      for  protein  when  ktup=2  and for all DNA alignments. -y 32 is
	      used for protein and ktup=1.  For	proteins,  optimization	 slows
	      comparison 2-fold	and is highly recommended.

       -z     Do  not  do  statistical	significance  calculation. Results are
	      ranked by	the unnormalized opt, initn, or	init1 score.

       -3     (tfasta, tfastx only) Normally tfasta and tfastx translate
              sequences in the DNA sequence library in all six frames.  With
              the -3 option, only the three forward frames are searched.
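
       When a search is launched from a script, the argument order described
       above (options first, then the query file, the library, and an optional
       ktup) can be assembled as in the sketch below.  Only options documented
       on this page are used; the helper name and its parameters are
       hypothetical.

```python
def fasta_command(query, library, ktup=None, program="fasta", quiet=True,
                  scores=None, alignments=None, expect=None):
    """Build an argument list for a non-interactive search."""
    args = [program]
    if quiet:
        args.append("-Q")                      # do not ask any questions
    if scores is not None:
        args += ["-b", str(scores)]            # similarity scores to show
    if alignments is not None:
        args += ["-d", str(alignments)]        # alignments to show
    if expect is not None:
        args += ["-E", str(expect)]            # expectation value threshold
    args += [query, library]                   # file names follow the options
    if ktup is not None:
        args.append(str(ktup))                 # optional third argument
    return args
```

       The resulting list can be passed to subprocess.run() on a system where
       the fasta programs are installed, with standard output redirected to a
       file as in the -Q description.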

EXAMPLES
       (1)    fasta musplfm.aa $AABANK

       Compare the amino acid sequence in the file musplfm.aa with the
       complete PIR protein sequence library using ktup = 2.  Each "library"
       sequence (there need only be one) should start with a comment line
       which starts with a '>', e.g.

	    >LCBO bovine preprolactin
	    WILLLSQ ...
	    >LCHU human	...
	    ...

       (2)    fasta -a -w 80 musplfm.aa	lcbo.aa	1

       Compare	the  amino  acid  sequence in the file musplfm.aa with the se-
       quences in the file lcbo.aa using ktup =	1.   Show  both	 sequences  in
       their entirety, with 80 residues	on each	output line.

       (3)    fasta

       Run the fasta program in	interactive mode.  The program will prompt for
       the  file name for the query sequence, list alternative libraries to be
       searched (if FASTLIBS is set), and prompt for the ktup.

FILES
       This version of fasta prompts for the library file to be	searched  from
       a list of file names that are saved in the file pointed to by the envi-
       ronment	variable  FASTLIBS.   If FASTLIBS = fastgb.list, then the file
       fastgb.list might have the entries:

	    NBRF Protein$0P/u/lib/aabank.lib 0
	    GB Primate$1P@/u/lib/gpri.nam
	    GB Rodent$1R@/u/lib/grod.nam
	    GB Mammal$1M@/u/lib/gmammal.nam

       Each line in this file has 4 fields: (1) the library name, separated
       from the remaining fields by a '$'; (2) a 0 or a 1 indicating a protein
       or DNA library, respectively; (3) a single letter that will be used to
       choose the library; (4) the location of the library file itself.  The
       library file name can contain an optional library format specifier.
       fasta recognizes the following library formats: 0 - Pearson/FASTA; 1 -
       Genbank flat file; 2 - NBRF/PIR Codata; 3 - EMBL/SWISS-PROT; 4 -
       Intelligenetics; 5 - NBRF/PIR VMS.  Note that this fourth field can
       contain an '@' character, which indicates that the library file is an
       indirect library file containing a list of library files, one per line.
       An indirect library file might have the lines:
	    </usr/slib/genbank	(the directory for the library files)
	    gbpri.seq 1
	    gbrod.seq 1
	    gbmam.seq 1
	    ...
	    gbvrl.seq 1
	    ...

       You can use your own sequence files for fasta; just be certain to put a
       '>' and comment as the first line before the sequence.  Only one
       library file type, the standard NBRF library format, is supported by
       the VAX/VMS programs.  lfasta and plfasta do not require the '>' and
       comment line; fasta does.

SEE ALSO
       rdf2(1), protcodes(5), dnacodes(5)

AUTHOR
       Bill Pearson
       wrp@virginia.EDU

				     local	    FASTA/TFAS...FASTXv2.0u(1)