faidx(5)		    Bioinformatics formats		      faidx(5)

NAME
       faidx - an index enabling random access to FASTA and FASTQ files

SYNOPSIS
       file.fa.fai, file.fasta.fai, file.fq.fai, file.fastq.fai

DESCRIPTION
       Using an fai index file in conjunction with a FASTA/FASTQ file
       containing reference sequences enables efficient access to arbitrary
       regions within those reference sequences.  The index file typically has
       the same filename as the corresponding FASTA/FASTQ file, with .fai
       appended.

       An fai index file is a text file consisting of lines each with five
       TAB-delimited columns for a FASTA file and six for FASTQ:
       NAME         Name of this reference sequence
       LENGTH       Total length of this reference sequence, in bases
       OFFSET       Offset in the FASTA/FASTQ file of this sequence's first base
       LINEBASES    The number of bases on each line
       LINEWIDTH    The number of bytes in each line, including the newline
       QUALOFFSET   Offset of sequence's first quality within the FASTQ file
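
       As an illustration of the column layout, the following is a minimal
       sketch of reading one index line in Python.  The names FaiRecord and
       parse_fai_line are hypothetical and are not part of samtools or htslib.

```python
from typing import NamedTuple, Optional


class FaiRecord(NamedTuple):
    """One line of an fai index (hypothetical helper type)."""
    name: str
    length: int
    offset: int
    linebases: int
    linewidth: int
    qualoffset: Optional[int] = None  # sixth column, FASTQ only


def parse_fai_line(line: str) -> FaiRecord:
    """Split a TAB-delimited index line into five (FASTA) or six (FASTQ) fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 TAB-delimited columns, got {len(fields)}")
    # The first column is text; every remaining column is an integer.
    return FaiRecord(fields[0], *(int(f) for f in fields[1:]))
```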

       The NAME and LENGTH columns contain the same data as would appear in
       the SN and LN fields of a SAM @SQ header for the same reference
       sequence.

       The OFFSET column contains the offset within the FASTA/FASTQ file, in
       bytes starting from zero, of the first base of this reference sequence,
       i.e., of the character following the newline at the end of the header
       line (the ">" line in FASTA, "@" in FASTQ).  The lines of a fai index
       file typically appear in the order in which the reference sequences
       appear in the FASTA/FASTQ file, so .fai files are usually sorted
       according to this column.

       The LINEBASES column contains the number of bases in each of the
       sequence lines that form the body of this reference sequence, apart
       from the final line, which may be shorter.  The LINEWIDTH column
       contains the number of bytes in each of the sequence lines (except
       perhaps the final line), thus differing from LINEBASES in that it also
       counts the bytes forming the line terminator.
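
       These three columns are enough to locate any base without scanning the
       file.  The sketch below shows the arithmetic, assuming a 0-based
       position within the sequence; base_byte_offset is a hypothetical
       helper name, not a samtools function.

```python
def base_byte_offset(offset: int, linebases: int, linewidth: int, pos: int) -> int:
    """Return the byte offset in the file of the base at 0-based position pos.

    offset, linebases and linewidth are the OFFSET, LINEBASES and LINEWIDTH
    columns of the index line for the sequence.
    """
    # Number of complete sequence lines before pos, and the column within its line.
    full_lines, within_line = divmod(pos, linebases)
    # Each complete line occupies linewidth bytes (bases plus line terminator).
    return offset + full_lines * linewidth + within_line
```

       For the sequence "one" in the EXAMPLE section (OFFSET 5, LINEBASES 30,
       LINEWIDTH 31), position 30 is the first base of the second line and
       maps to byte 5 + 1*31 + 0 = 36.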

       The QUALOFFSET column works the same way as OFFSET but for the first
       quality score of this reference sequence, i.e., the first character
       following the newline at the end of the "+" line.  This column is
       present for FASTQ files only.

   FASTA Files
       In order to be indexed with samtools faidx, a FASTA file must be a text
       file of the form

	      >name [description...]
	      ATGCATGCATGCATGCATGCATGCATGCAT
	      GCATGCATGCATGCATGCATGCATGCATGC
	      ATGCAT
	      >name [description...]
	      ATGCATGCATGCAT
	      GCATGCATGCATGC
	      [...]

       In particular, each reference sequence must be "well-formatted", i.e.,
       all of its sequence lines must be the same length, apart from the final
       sequence line, which may be shorter.  (While this sequence line length
       must be the same within each sequence, it may vary between different
       reference sequences in the same FASTA file.)

       This also means that although the FASTA file may have Unix- or
       Windows-style or other line termination, the newline characters present
       must be consistent, at least within each reference sequence.

       The samtools implementation uses the first word of the ">" header line
       text (i.e., up to the first whitespace character, having skipped any
       initial whitespace after the ">") as the NAME column.
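
       The naming rule just described can be sketched as follows.  This is an
       illustration of the stated rule only, not the samtools source, and
       fasta_name is a hypothetical helper name.

```python
def fasta_name(header_line: str) -> str:
    """Return the first word of a ">" header line, as used for the NAME column."""
    # Drop the leading ">" and skip any whitespace that follows it.
    text = header_line[1:].lstrip()
    # Keep everything up to the first whitespace character.
    parts = text.split(None, 1)
    return parts[0] if parts else ""
```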

   FASTQ Files
       FASTQ files are indexed in the same way as FASTA files, and must be of
       the form

	      @name [description...]
	      ATGCATGCATGCATGCATGCATGCATGCAT
	      GCATGCATGCATGCATGCATGCATGCATGC
	      ATGCAT
	      +
	      FFFA@@FFFFFFFFFFHHB:::@BFFFFGG
	      HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF
	      8011<<
	      @name [description...]
	      ATGCATGCATGCAT
	      GCATGCATGCATGC
	      +
	      IIA94445EEII==
	      =>IIIIIIIIICCC
	      [...]

       Quality lines must be wrapped at the same length as the corresponding
       sequence lines.

EXAMPLE
       For example, given this FASTA file

	      >one
	      ATGCATGCATGCATGCATGCATGCATGCAT
	      GCATGCATGCATGCATGCATGCATGCATGC
	      ATGCAT
	      >two another chromosome
	      ATGCATGCATGCAT
	      GCATGCATGCATGC

       formatted  with Unix-style (LF) line termination, the corresponding fai
       index would be
	      one   66	  5   30   31
	      two   28	 98   14   15

       If the FASTA file were formatted with Windows-style (CR-LF) line
       termination, the fai index would be
	      one   66	   6   30   32
	      two   28	 103   14   16
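
       Both index listings above can be reproduced with a short scan over the
       file contents.  The following is a minimal sketch under simplifying
       assumptions: it trusts that the input is well-formatted and does not
       check that line lengths are consistent, which samtools faidx itself
       requires.  The name build_fasta_fai is hypothetical.

```python
def build_fasta_fai(data: bytes):
    """Return (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) tuples for FASTA bytes."""
    records = []
    current = None  # [name, length, offset, linebases, linewidth] being built
    pos = 0         # byte offset of the start of the current line
    for line in data.splitlines(keepends=True):
        bases = len(line.rstrip(b"\r\n"))
        if line.startswith(b">"):
            if current:
                records.append(tuple(current))
            words = line[1:].split()
            name = words[0].decode() if words else ""
            # OFFSET is the byte immediately after the header's line terminator.
            current = [name, 0, pos + len(line), 0, 0]
        elif current is not None and bases:
            if current[1] == 0:
                # The first sequence line fixes LINEBASES and LINEWIDTH.
                current[3], current[4] = bases, len(line)
            current[1] += bases
        pos += len(line)
    if current:
        records.append(tuple(current))
    return records
```

       Writing each tuple as a TAB-joined line gives the index text.  Because
       LINEWIDTH and OFFSET count terminator bytes, the same sequences yield
       different values under LF and CR-LF termination, as the two listings
       show.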

       An example FASTQ file

	      @fastq1
	      ATGCATGCATGCATGCATGCATGCATGCAT
	      GCATGCATGCATGCATGCATGCATGCATGC
	      ATGCAT
	      +
	      FFFA@@FFFFFFFFFFHHB:::@BFFFFGG
	      HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF
	      8011<<
	      @fastq2
	      ATGCATGCATGCAT
	      GCATGCATGCATGC
	      +
	      IIA94445EEII==
	      =>IIIIIIIIICCC

       formatted with Unix-style (LF) line termination would give this fai
       index
	      fastq1   66     8	  30   31    79
	      fastq2   28   156	  14   15   188
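
       Because quality lines are wrapped like sequence lines, one routine can
       read either, starting from OFFSET or from QUALOFFSET.  The sketch below
       assumes the whole file is held in memory as bytes; read_wrapped is a
       hypothetical helper name.

```python
def read_wrapped(data: bytes, start: int, length: int,
                 linebases: int, linewidth: int) -> bytes:
    """Read length symbols from line-wrapped text beginning at byte start.

    Pass OFFSET as start to get bases, or QUALOFFSET to get quality scores.
    """
    out = bytearray()
    pos = start
    while len(out) < length:
        # Take at most one line's worth of symbols, skipping the terminator.
        take = min(linebases, length - len(out))
        chunk = data[pos:pos + take]
        if not chunk:
            break  # ran past the end of the data or linebases is zero
        out += chunk
        pos += linewidth
    return bytes(out)
```

       With the fastq2 index line above, read_wrapped(data, 156, 28, 14, 15)
       returns the 28 bases and read_wrapped(data, 188, 28, 14, 15) returns
       the 28 matching quality characters.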

SEE ALSO
       samtools(1)

       https://en.wikipedia.org/wiki/FASTA_format

       https://en.wikipedia.org/wiki/FASTQ_format

	      Further description of the FASTA and FASTQ formats

htslib				   June	2018			      faidx(5)
