SPIDEY(1)		   NCBI	Tools User's Manual		     SPIDEY(1)

NAME
       spidey -	align mRNA sequences to	a genome

SYNOPSIS
       spidey [-] [-F N] [-G] [-L N] [-M filename] [-N filename] [-R filename]
       [-S p/m]	 [-T N]	 [-X]  [-a filename]  [-c N] [-d] [-e X] [-f X]	[-g X]
       -i filename  [-j]  [-k filename]	 [-l N]	 -m filename  [-n N]  [-o str]
       [-p N] [-r c/d/m/p/v] [-s] [-t filename]	[-u] [-w]

DESCRIPTION
       spidey is a tool	for aligning one or more mRNA sequences	to a given ge-
       nomic  sequence.	  spidey was written with two main goals in mind: find
       good alignments regardless of intron size; and avoid  getting  confused
       by  nearby  pseudogenes	and  paralogs.	Towards	the first goal,	spidey
       uses BLAST and DotView (another local alignment  tool)  to  find  its
       alignments; since these are both	local alignment	tools, spidey does not
       intrinsically favor shorter or longer introns and has no	maximum	intron
       size.   To  avoid  mistakenly including exons from paralogs and pseudo-
       genes, spidey first defines windows on the genomic  sequence  and  then
       performs	 the  mRNA-to-genomic alignment	separately within each window.
       Because of the way the windows are constructed, neighboring paralogs or
       pseudogenes should be in	separate windows and should not	be included in
       the final spliced alignment.

   Initial alignments and construction of genomic windows
       spidey takes as input a single genomic sequence and a set of  mRNA  ac-
       cessions	 or FASTA sequences.  All processing is	done one mRNA sequence
       at a time.  The first step for each mRNA	sequence is a  high-stringency
       BLAST against the genomic sequence.  The	resulting hits are analyzed to
       find the	genomic	windows.

       The BLAST alignments are	sorted by score	and then assigned into windows
       by  a  recursive	function which takes the first alignment and then goes
       down the	alignment list to find all alignments that are consistent with
       the first (same strand of mRNA, both the	mRNA and  genomic  coordinates
       are nonoverlapping and linearly consistent).  On	subsequent passes, the
       remaining  alignments  are examined and are put into their own nonover-
       lapping,	consistent windows, until no alignments	are  left.   Depending
       on how many gene	models are desired, the	top n windows are chosen to go
       on to the next step and the others are deleted.
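
       The window assignment described above can be sketched roughly as
       follows.  This is an illustration only, not spidey's actual code: the
       Hit record, the consistent() test and build_windows() are hypothetical
       names, the recursion is written as a loop, each candidate is checked
       against every hit already in the window (the text only mentions the
       first), and windows are ranked simply by the order in which they are
       created.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical layout); coordinates are 0-based, inclusive."""
    m_start: int   # start on the mRNA
    m_stop: int    # stop on the mRNA
    g_start: int   # start on the genomic sequence
    g_stop: int    # stop on the genomic sequence
    strand: str    # '+' or '-': strand of the mRNA in this hit
    score: float


def consistent(a: Hit, b: Hit) -> bool:
    """True if two hits could belong to the same gene model."""
    if a.strand != b.strand:
        return False
    # Order the pair by position on the mRNA.
    first, second = (a, b) if a.m_start <= b.m_start else (b, a)
    if first.m_stop >= second.m_start:
        return False  # overlapping on the mRNA
    if a.strand == '+':
        # Genomic order must follow mRNA order.
        return first.g_stop < second.g_start
    # Minus strand: genomic order is reversed relative to the mRNA.
    return second.g_stop < first.g_start


def build_windows(hits: List[Hit], top_n: int = 1) -> List[List[Hit]]:
    """Group score-sorted hits into consistent windows; keep the first top_n."""
    remaining = sorted(hits, key=lambda h: h.score, reverse=True)
    windows: List[List[Hit]] = []
    while remaining:
        window = [remaining[0]]          # seed with the best remaining hit
        leftover: List[Hit] = []
        for hit in remaining[1:]:
            if all(consistent(hit, member) for member in window):
                window.append(hit)
            else:
                leftover.append(hit)     # examined again on the next pass
        windows.append(window)
        remaining = leftover
    return windows[:top_n]
```

       With two copies of a two-exon gene on the same strand, the hits from
       each copy end up in separate windows, which is the behavior the text
       relies on to keep paralogs apart.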

   Aligning in each window
       Once  the genomic windows are constructed, the initial BLAST alignments
       are freed and another BLAST search is performed,	this time with the en-
       tire mRNA against the genomic region defined by the window,  and	 at  a
       lower  stringency  than	the initial search.  spidey then uses a	greedy
       algorithm to generate a	high-scoring,  nonoverlapping  subset  of  the
       alignments  from	 the second BLAST search.  This	consistent set is ana-
       lyzed carefully to make sure that the entire mRNA sequence  is  covered
       by the alignments.  When	gaps are found between the alignments, the ap-
       propriate  region  of  genomic sequence is searched against the missing
       mRNA, first using a very	low-stringency BLAST and, if the  BLAST	 fails
       to  find	 a hit,	using DotView functions	to locate the alignment.  When
       gaps are	found at the ends of the alignments,  the  BLAST  and  DotView
       searches	are actually allowed to	extend past the	boundaries of the win-
       dow.   If the 3'	end of the mRNA	does not align completely, it is first
       examined	for the	presence of a poly(A) tail.  No	 attempt  is  made  to
       align  the  portion  of the mRNA	that seems to be a poly(A) tail; some-
       times there is a	poly(A)	tail that does align to	the genomic  sequence,
       and  these are noted because they indicate the possibility of a pseudo-
       gene.
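
       A minimal sketch of the greedy selection and the coverage check
       described above, under simplifying assumptions: alignments are reduced
       to (mRNA start, mRNA stop, score) tuples with 0-based inclusive
       coordinates, any overlap on the mRNA counts as a conflict, and the
       function names are hypothetical.  The uncovered stretches it reports
       are the regions that would then be searched again at lower stringency.

```python
from typing import List, Tuple

# (mRNA start, mRNA stop, score); 0-based, inclusive coordinates (assumed layout)
Aln = Tuple[int, int, float]


def greedy_subset(alns: List[Aln]) -> List[Aln]:
    """Take alignments best score first, skipping any that overlap a pick."""
    chosen: List[Aln] = []
    for aln in sorted(alns, key=lambda a: a[2], reverse=True):
        # Keep the alignment only if it lies entirely before or after
        # every alignment chosen so far.
        if all(aln[1] < c[0] or aln[0] > c[1] for c in chosen):
            chosen.append(aln)
    return sorted(chosen)  # return in mRNA order


def uncovered(chosen: List[Aln], mrna_len: int) -> List[Tuple[int, int]]:
    """Return the mRNA intervals not covered by any chosen alignment."""
    gaps: List[Tuple[int, int]] = []
    pos = 0  # first mRNA position not yet known to be covered
    for start, stop, _score in sorted(chosen):
        if start > pos:
            gaps.append((pos, start - 1))
        pos = max(pos, stop + 1)
    if pos < mrna_len:
        gaps.append((pos, mrna_len - 1))  # unaligned 3' end
    return gaps
```

       An uncovered interval at the 3' end is the case in which the text
       first checks for a poly(A) tail before searching further.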

       Now that	the mRNA is completely covered by the set of  alignments,  the
       boundaries  of  the  alignments (there should be	one alignment per exon
       now) are	adjusted so that the alignments	abut each other	precisely  and
       so  that	 they  are  adjacent  to good splice donor and acceptor	sites.
       Commonly, the alignments of two adjacent exons overlap by as much as 20
       to 30 base pairs on the mRNA sequence.  The true exon boundary may lie
       anywhere	within this overlap, or	(as we have seen empirically)  even  a
       few  base  pairs	outside	the overlap.  To position the exon boundaries,
       the overlap plus	a few base pairs on each side is examined  for	splice
       donor  sites,  using  functions that have different splice matrices de-
       pending on the organism chosen.	The top	few  splice  donor  sites  (by
       score)  are  then  evaluated  as	 to  how much they affect the original
       alignment boundaries.  The site that affects the	boundaries  the	 least
       is  chosen,  and	 is  evaluated as to the presence of an	acceptor site.
       The alignments are truncated or extended	as necessary so	that they ter-
       minate at the splice donor site and so that they	do not overlap.

   Final result
       The windows are examined	carefully to  get  the	percent	 identity  per
       exon,  the  number  of gaps per exon, the overall percent identity, the
       percent coverage	of the mRNA, presence of an aligning  or  non-aligning
       poly(A)	tail, number of	splice donor sites and the presence or absence
       of splice donor and acceptor sites for each exon, and the occurrence of
       an mRNA that has	a 5' or	3' end (or both) that does not	align  to  the
       genomic	sequence.   If the overall percent identity and	percent	length
       coverage	are above  the	user-defined  cutoffs,	a  summary  report  is
       printed,	 and,  if  requested,  a text alignment	showing	identities and
       mismatches is also printed.
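
       As a rough illustration of the per-exon and overall statistics, the
       sketch below computes percent identity, gap counts and mRNA coverage
       from pairs of gapped alignment rows, then applies the two cutoffs.
       The input layout, the function names, the choice to count a run of
       '-' characters as one gap, and the cutoff defaults of 0 are all
       assumptions made for the example, not spidey's own definitions.

```python
from typing import Dict, List, Tuple

# One exon = (mRNA row, genomic row); equal length, '-' marks a gap (assumed layout)
ExonAln = Tuple[str, str]


def matches(mrna_row: str, genomic_row: str) -> int:
    """Number of alignment columns where both rows carry the same base."""
    if len(mrna_row) != len(genomic_row):
        raise ValueError("alignment rows must have equal length")
    return sum(1 for a, b in zip(mrna_row, genomic_row) if a == b and a != '-')


def percent_identity(mrna_row: str, genomic_row: str) -> float:
    """Identical columns as a percentage of all alignment columns."""
    if not mrna_row:
        return 0.0
    return 100.0 * matches(mrna_row, genomic_row) / len(mrna_row)


def count_gaps(mrna_row: str, genomic_row: str) -> int:
    """Number of gap runs in either row (one run of '-' counts once)."""
    gaps = 0
    for row in (mrna_row, genomic_row):
        prev = ''
        for ch in row:
            if ch == '-' and prev != '-':
                gaps += 1
            prev = ch
    return gaps


def summarize(exons: List[ExonAln], mrna_len: int,
              min_identity: float = 0.0,
              min_coverage: float = 0.0) -> Dict[str, object]:
    """Collect per-exon and overall figures and test them against cutoffs."""
    columns = sum(len(m) for m, _ in exons)
    identical = sum(matches(m, g) for m, g in exons)
    overall = 100.0 * identical / columns if columns else 0.0
    covered = sum(len(m.replace('-', '')) for m, _ in exons)
    coverage = 100.0 * covered / mrna_len if mrna_len else 0.0
    return {
        "exon_identity": [percent_identity(m, g) for m, g in exons],
        "exon_gaps": [count_gaps(m, g) for m, g in exons],
        "overall_identity": overall,
        "coverage": coverage,
        "passes": overall >= min_identity and coverage >= min_coverage,
    }
```

       The min_identity and min_coverage arguments play the role of the -c
       and -l options: a model that falls below either cutoff would not be
       reported.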

   Interspecies	alignments
       spidey is capable of performing interspecies alignments.  The major
       difference from intraspecies alignments is that the mRNA-genomic
       identity is not close to 100%; the alignments also contain numerous
       and lengthy gaps.  If spidey is used in its normal mode for
       interspecies alignments, it produces gene models with many short
       exons.  When the interspecies flag (-s) is set, spidey uses different
       BLAST parameters that encourage more and longer gaps and penalize
       mismatches less heavily.  As a result, the exon alignments are much
       longer and more closely approximate the actual gene structure.

   Extracting CDS alignments
       When  spidey  is	run in network-aware mode or when ASN.1	files are used
       for the mRNA records, it	is capable of extracting a CDS alignment  from
       an mRNA alignment and printing the CDS information also.	 Since the CDS
       alignment  is  just  a  subset  of the mRNA alignment, it is relatively
       straightforward to truncate the exon alignments	as  necessary  and  to
       generate	 a  CDS	 alignment.  Furthermore, the untranslated regions are
       now defined, so the percent identity for	the 5' and 3' untranslated re-
       gions is	also calculated.
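
       The truncation step can be sketched as clipping each exon to the CDS
       interval on the mRNA.  This simplified version assumes plus-strand,
       ungapped exon alignments, so that an offset on the mRNA maps directly
       to the same offset on the genomic sequence; the tuple layout and the
       function name are hypothetical.

```python
from typing import List, Tuple

# (mRNA start, mRNA stop, genomic start, genomic stop); 0-based, inclusive
Exon = Tuple[int, int, int, int]


def clip_to_cds(exons: List[Exon], cds_start: int, cds_stop: int) -> List[Exon]:
    """Truncate exon alignments to the CDS interval given in mRNA coordinates."""
    clipped: List[Exon] = []
    for m_start, m_stop, g_start, g_stop in exons:
        new_start = max(m_start, cds_start)
        new_stop = min(m_stop, cds_stop)
        if new_start > new_stop:
            continue  # exon lies entirely in an untranslated region
        clipped.append((
            new_start,
            new_stop,
            g_start + (new_start - m_start),  # trim the 5' side
            g_stop - (m_stop - new_stop),     # trim the 3' side
        ))
    return clipped
```

       Whatever is trimmed away before cds_start and after cds_stop makes up
       the 5' and 3' untranslated regions, for which identity can then be
       computed separately.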

OPTIONS
       A summary of options is included	below.

       -      Print usage message.

       -F N   Start of genomic interval	desired	(from; 0-based).

       -G     Input file is a GI list.

       -L N   The extra-large intron size to use (default = 220000).

       -M filename
	      File with	donor splice matrix.

       -N filename
	      File with	acceptor splice	matrix.

       -R filename
	      File (including path) to repeat blast database for filtering.

       -S p/m Restrict to plus (p) or minus (m)	strand of genomic sequence.

       -T N   Stop of genomic interval desired (to; 0-based).

       -X     Use extra-large intron sizes (increases the  limit  for  initial
	      and terminal introns from	100kb to 240kb and for all others from
	      35kb  to	120kb);	 may  result  in  significantly	longer compute
	      times.

       -a filename
	      Output file for alignments when directed to a separate file with
	      -p 3 (default = spidey.aln).

       -c N   Identity cutoff, in percent, for quality control purposes.

       -d     Also try to align	coding sequences corresponding	to  the	 given
	      mRNA records (may	require	network	access).

       -e X   First-pass  e-value (default = 1.0e-10).	Higher values increase
	      speed at the cost	of sensitivity.

       -f X   Second-pass e-value (default = 0.001).

       -g X   Third-pass e-value (default = 10).

       -i filename
	      Input file containing the	genomic	sequence  in  ASN.1  or	 FASTA
	      format.	If  your computer is running on	a network that can ac-
	      cess GenBank, you	can substitute the  desired  accession	number
	      for the filename.

       -j     Also print the alignment as ASN.1.

       -k filename
	      File for ASN.1 output with -j (default = spidey.asn).

       -l N   Length coverage cutoff, in percent.

       -m filename
	      Input  file  containing  the  mRNA sequence(s) in	ASN.1 or FASTA
	      format, or a list	of their accessions (with -G).	If  your  com-
	      puter  is	 running on a network that can access GenBank, you can
	      substitute a single accession number for the filename.

       -n N   Number of	gene models to return per input	mRNA (default =	1).

       -o str Main output file (default	= stdout; contents controlled by -p).

       -p N   What to print:
	      0	     summary and alignments together (default)
	      1	     just the summary
	      2	     just the alignments
	      3	     summary and alignments in different files

       -r c/d/m/p/v
	      Organism of genomic sequence, used to determine splice matrices.
	      c	     C.	elegans
	      d	     Drosophila
	      m	     Dictyostelium discoideum
	      p	     plant
	      v	     vertebrate	(default)

       -s     Tune for interspecies alignments.

       -t filename
	      File with	feature	table, in 4 tab-delimited columns:
	      seqid  (e.g., NM_04377.1)
	      name   (only repetitive_region is	currently supported)
	      start  (0-based)
	      stop   (0-based)

       -u     Make a multiple alignment	of all input mRNAs (which must overlap
	      on the genomic sequence).

       -w     Consider lowercase characters in input  FASTA  sequences	to  be
	      masked.
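
       As a usage illustration, the sketch below assembles a spidey command
       line and the lines of a feature table for -t, using only options
       listed above.  The helper names, the file names and the accession in
       the usage note are made up for the example; flags are passed bare, as
       in the SYNOPSIS, and nothing is executed.

```python
from typing import List, Optional, Tuple


def feature_table_lines(regions: List[Tuple[str, int, int]]) -> List[str]:
    """Format (seqid, start, stop) triples for -t; coordinates are 0-based.

    Each line has 4 tab-delimited columns: seqid, name, start, stop.
    Only the name repetitive_region is used, as it is the only one
    the manual lists as supported.
    """
    return [f"{seqid}\trepetitive_region\t{start}\t{stop}"
            for seqid, start, stop in regions]


def spidey_argv(genomic: str, mrna: str, models: int = 1,
                organism: str = "v", interspecies: bool = False,
                summary_only: bool = False,
                interval: Optional[Tuple[int, int]] = None,
                feature_table: Optional[str] = None) -> List[str]:
    """Build an argument list for spidey without running it."""
    if organism not in ("c", "d", "m", "p", "v"):
        raise ValueError("organism must be one of c, d, m, p, v")
    argv = ["spidey", "-i", genomic, "-m", mrna,
            "-n", str(models), "-r", organism]
    if interspecies:
        argv.append("-s")                 # tune for interspecies alignments
    if summary_only:
        argv += ["-p", "1"]               # just the summary
    if interval is not None:
        start, stop = interval            # 0-based genomic interval
        argv += ["-F", str(start), "-T", str(stop)]
    if feature_table is not None:
        argv += ["-t", feature_table]
    return argv
```

       For example, spidey_argv("genomic.fa", "mrna.fa", models=2,
       summary_only=True) yields the equivalent of
       spidey -i genomic.fa -m mrna.fa -n 2 -r v -p 1, and the resulting
       list could be handed to subprocess.run on a system where spidey is
       installed.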

AUTHOR
       Sarah  Wheelan  and others at the National Center for Biotechnology In-
       formation; Steffen Moeller contributed to this documentation.

SEE ALSO
       <http://www.ncbi.nlm.nih.gov/spidey>

NCBI				  2005-01-25			     SPIDEY(1)