FreeBSD Manual Pages

vcftools(man)			 2 August 2018			 vcftools(man)

NAME
       vcftools	 v0.1.17 - Utilities for the variant call format (VCF) and bi-
       nary variant call format	(BCF)

SYNOPSIS
       vcftools	[ --vcf	FILE | --gzvcf FILE | --bcf FILE] [ --out OUTPUT  PRE-
       FIX ] [ FILTERING OPTIONS ]  [ OUTPUT OPTIONS ]

DESCRIPTION
       vcftools is a suite of functions for use on genetic variation data in
       the form of VCF and BCF files. The tools provided are mainly used to
       summarize data, run calculations on data, filter out data, and convert
       data into other useful file formats.

EXAMPLES
       Output allele frequency for all sites in	the input vcf file from	 chro-
       mosome 1
	 vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

       Output  a  new  vcf file	from the input vcf file	that removes any indel
       sites
	 vcftools --vcf input_file.vcf --remove-indels --recode \
	   --recode-INFO-all --out SNPs_only

       Output file comparing the sites in two vcf files
	 vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz \
	   --diff-site --out in1_v_in2

       Output  a  new  vcf  file to standard out without any sites that	have a
       filter tag, then	compress it with gzip
	 vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode \
	   --stdout | gzip -c > output_PASS_only.vcf.gz

       Output  a  Hardy-Weinberg  p-value  for every site in the bcf file that
       does not	have any missing genotypes
	 vcftools --bcf input_file.bcf --hardy --max-missing 1.0 \
	   --out output_noMissing

       Output nucleotide diversity at a	list of	positions
	 zcat input_file.vcf.gz | vcftools --vcf - --site-pi \
	   --positions SNP_list.txt --out nucleotide_diversity

BASIC OPTIONS
       These options are used to specify the input and output files.

   INPUT FILE OPTIONS
	 --vcf <input_filename>
	   This	option defines the VCF file to be processed. VCFtools  expects
	   files  in  VCF  format  v4.0, v4.1 or v4.2. The latter two are sup-
	   ported with some small limitations. If the  user  provides  a  dash
	   character  '-' as a file name, the program expects a	VCF file to be
	   piped in through standard in.

	 --gzvcf <input_filename>
	   This	option can be used in place of the --vcf option	to  read  com-
	   pressed (gzipped) VCF files directly.

	 --bcf <input_filename>
	   This	 option	 can be	used in	place of the --vcf option to read BCF2
	   files directly. You do not need to specify if  this	file  is  com-
	   pressed  with  BGZF encoding. If the	user provides a	dash character
	   '-' as a file name, the program expects a BCF2 file to be piped  in
	   through standard in.

   OUTPUT FILE OPTIONS
	 --out <output_prefix>
	   This option defines the output filename prefix for all files
	   generated by vcftools. For example, if <output_prefix> is set to
	   output_filename, then all output files will be of the form
	   output_filename.***. If this option is omitted, all output files
	   will have the prefix "out." in the current working directory.

	 --stdout
	 -c
	   These  options direct the vcftools output to	standard out so	it can
	   be piped into another program or written directly to	a filename  of
	   choice. However, a select few output	functions cannot be written to
	   standard out.

	 --temp	<temporary_directory>
	   This	 option	 can  be  used	to  redirect  any temporary files that
	   vcftools creates into a specified directory.

SITE FILTERING OPTIONS
       These options are used to include or exclude  certain  sites  from  any
       analysis	being performed	by the program.

   POSITION FILTERING
	 --chr <chromosome>
	 --not-chr <chromosome>
	   Includes or excludes sites with identifiers matching <chromosome>.
	   These options may be	used multiple times to include or exclude more
	   than	one chromosome.

	 --from-bp <integer>
	 --to-bp <integer>
	   These options specify a lower bound and upper bound for a range of
	   sites to be processed. Sites with positions below the "--from-bp"
	   value or above the "--to-bp" value will be excluded. These options
	   can only be used in conjunction with a single usage of --chr. Using
	   one of these does not require use of the other.

	 --positions <filename>
	 --exclude-positions <filename>
	   Include or exclude a set of sites on the basis of a list of
	   positions in a file. Each line of the input file should contain a
	   (tab-separated) chromosome and position. The file can have comment
	   lines that start with a "#"; they will be ignored.
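
	   The file layout described above can be sketched in Python. This is
	   only an illustration of the format, not vcftools' own parser; the
	   function name read_positions and the example contents are
	   hypothetical.

```python
import io

def read_positions(handle):
    """Return a set of (chromosome, position) pairs from a positions file."""
    sites = set()
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines that start with "#".
        if not line or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        sites.add((chrom, int(pos)))
    return sites

# Hypothetical file contents, for illustration only.
example = io.StringIO("#chrom\tpos\n1\t12345\n1\t67890\nX\t555\n")
wanted = read_positions(example)

# An include-style filter keeps a site only if it is in the set.
print(("1", 12345) in wanted)
print(("2", 12345) in wanted)
```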

	 --positions-overlap <filename>
	 --exclude-positions-overlap <filename>
	   Include or exclude a set of sites on the basis of the reference
	   allele overlapping with a list of positions in a file. Each line of
	   the input file should contain a (tab-separated) chromosome and
	   position. The file can have comment lines that start with a "#";
	   they will be ignored.

	 --bed <filename>
	 --exclude-bed <filename>
	   Include or exclude a	set of sites on	the basis of a BED file.  Only
	   the	first  three  columns (chrom, chromStart and chromEnd) are re-
	   quired. The BED file	is expected to have a header line. A site will
	   be kept or excluded if any part of any allele (REF  or  ALT)	 at  a
	   site	is within the range of one of the BED entries.
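
	   One way to picture the overlap rule is the Python sketch below. It
	   assumes the usual conventions (VCF positions are 1-based, BED
	   intervals are 0-based and half-open) and treats a site as covering
	   the span of its longest allele. The function names are
	   hypothetical, and the exact boundary handling inside vcftools may
	   differ.

```python
def allele_span(pos, ref, alts):
    """Return the 0-based, half-open interval covered by the longest allele."""
    longest = max([len(ref)] + [len(a) for a in alts])
    start = pos - 1  # convert the 1-based VCF position to 0-based
    return start, start + longest

def overlaps(pos, ref, alts, bed_start, bed_end):
    """True if any part of the site's span falls inside the BED entry."""
    start, end = allele_span(pos, ref, alts)
    return start < bed_end and bed_start < end

# Hypothetical BED entry covering 0-based positions 100..199.
print(overlaps(101, "A", ["G"], 100, 200))    # first base inside the entry
print(overlaps(100, "A", ["G"], 100, 200))    # one base before the entry
print(overlaps(99, "ATG", ["A"], 100, 200))   # deletion reaching into the entry
```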

	 --thin	<integer>
	   Thin	 sites	so that	no two sites are within	the specified distance
	   from	one another.
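
	   Thinning can be pictured as a greedy left-to-right pass, as in the
	   Python sketch below. The greedy strategy and the treatment of a
	   gap exactly equal to the distance (kept here) are assumptions made
	   for illustration; thin_sites is a hypothetical name.

```python
def thin_sites(sites, min_distance):
    """Greedily keep sites so that kept sites on the same chromosome are
    at least min_distance bases apart. Input must be sorted by chromosome
    and position."""
    kept = []
    last = {}  # chromosome -> position of the most recently kept site
    for chrom, pos in sites:
        if chrom not in last or pos - last[chrom] >= min_distance:
            kept.append((chrom, pos))
            last[chrom] = pos
    return kept

sites = [("1", 100), ("1", 150), ("1", 1100), ("1", 1200), ("2", 120)]
thinned = thin_sites(sites, 1000)
print(thinned)
```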

	 --mask	<filename>
	 --invert-mask <filename>
	 --mask-min <integer>
	   These options are used to specify a FASTA-like mask file to	filter
	   with.  The mask file	contains a sequence of integer digits (between
	   0 and 9) for	each position on a chromosome that specify if  a  site
	   at that position should be filtered or not.
	   An example mask file	would look like:
	     >1
	     0000011111222...
	     >2
	     2222211111000...
	   In  this  example, sites in the VCF file located within the first 5
	   bases of the	start of chromosome 1 would be kept, whereas sites  at
	   position 6 onwards would be filtered	out. And sites before the 11th
	   position on chromosome 2 would be filtered out as well.
	   The "--invert-mask" option takes the same format mask file as the
	   "--mask" option, but inverts the mask file before filtering with
	   it.
	   The "--mask-min" option specifies a threshold mask value between 0
	   and 9 to filter positions by. The default threshold is 0, meaning
	   only sites with that value or lower will be kept.
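
	   The mask logic can be sketched in Python using the example mask
	   shown above. This is an illustration of the described behavior,
	   not vcftools' implementation; the function names are hypothetical,
	   and dropping positions that lie beyond the end of the mask is an
	   assumption.

```python
def parse_mask(text):
    """Parse a FASTA-like mask into {chromosome: string of digits}."""
    mask, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                mask[name] = "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        mask[name] = "".join(chunks)
    return mask

def keep_site(mask, chrom, pos, mask_min=0):
    """Keep a 1-based position if its mask digit is <= mask_min."""
    digits = mask.get(chrom, "")
    if pos < 1 or pos > len(digits):
        return False  # assumption: positions outside the mask are dropped
    return int(digits[pos - 1]) <= mask_min

mask = parse_mask(">1\n0000011111222\n>2\n2222211111000\n")
print(keep_site(mask, "1", 5), keep_site(mask, "1", 6))
print(keep_site(mask, "2", 10), keep_site(mask, "2", 11))
print(keep_site(mask, "1", 6, mask_min=1))
```

	   With the default threshold of 0 the sketch keeps positions 1 to 5
	   on chromosome 1 and positions from 11 onwards on chromosome 2,
	   matching the example; raising the threshold to 1 also keeps
	   positions whose mask digit is 1.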

   SITE	ID FILTERING
	 --snp <string>
	   Include  SNP(s)  with matching ID (e.g. a dbSNP rsID). This command
	   can be used multiple	times in order to include more than one	SNP.

	 --snps	<filename>
	 --exclude <filename>
	   Include or exclude a	list of	SNPs given in a	file. The file	should
	   contain a list of SNP IDs (e.g. dbSNP rsIDs), with one ID per line.
	   No header line is expected.

   VARIANT TYPE	FILTERING
	 --keep-only-indels
	 --remove-indels
	   Include  or	exclude	sites that contain an indel. For these options
	   "indel" means any variant that alters the length of the REF allele.

   FILTER FLAG FILTERING
	 --remove-filtered-all
	   Removes all sites with a FILTER flag	other than PASS.

	 --keep-filtered <string>
	 --remove-filtered <string>
	   Includes or excludes	all sites marked with a	specific FILTER	 flag.
	   These options may be	used more than once to specify multiple	FILTER
	   flags.

   INFO	FIELD FILTERING
	 --keep-INFO <string>
	 --remove-INFO <string>
	   Includes or excludes	all sites with a specific INFO flag. These op-
	   tions  only	filter	on the presence	of the flag and	not its	value.
	   These options can be	used multiple times to specify	multiple  INFO
	   flags.

   ALLELE FILTERING
	 --maf <float>
	 --max-maf <float>
	   Include  only  sites	 with a	Minor Allele Frequency greater than or
	   equal to the	"--maf"	value and less than or equal  to  the  "--max-
	   maf"	value. One of these options may	be used	without	the other. Al-
	   lele	 frequency is defined as the number of times an	allele appears
	   over	all individuals	at that	site, divided by the total  number  of
	   non-missing alleles at that site.
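
	   The allele frequency definition above can be written out as a
	   short Python sketch. It assumes genotypes are given as GT strings
	   and only handles bi-allelic sites sensibly; how vcftools picks the
	   minor allele at multi-allelic sites is not covered here. The
	   function names are hypothetical.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency} over non-missing alleles.
    Genotypes are GT strings such as "0/1", "1|1" or "./."."""
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[allele] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()} if total else {}

def passes_maf(genotypes, maf=0.0, max_maf=1.0):
    """Apply inclusive lower and upper bounds to the minor allele frequency."""
    freqs = allele_frequencies(genotypes)
    if not freqs:
        return False
    # Assumption: a site with a single observed allele has minor frequency 0.
    minor = min(freqs.values()) if len(freqs) > 1 else 0.0
    return maf <= minor <= max_maf

genos = ["0/0", "0/1", "1/1", "./.", "0/0"]
freqs = allele_frequencies(genos)   # 5 reference and 3 alternate alleles
print(freqs)
print(passes_maf(genos, maf=0.05), passes_maf(genos, maf=0.4))
```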

	 --non-ref-af <float>
	 --max-non-ref-af <float>
	 --non-ref-ac <integer>
	 --max-non-ref-ac <integer>

	 --non-ref-af-any <float>
	 --max-non-ref-af-any <float>
	 --non-ref-ac-any <integer>
	 --max-non-ref-ac-any <integer>
	   Include  only sites with all	Non-Reference (ALT) Allele Frequencies
	   (af)	or Counts (ac) within the range	specified, and	including  the
	   specified  value.  The  default options require all alleles to meet
	   the specified criteria, whereas the options appended	with "any" re-
	   quire only one allele to meet the criteria. The Allele frequency is
	   defined as the number of times an allele appears over all individu-
	   als at that site, divided by	the total number of non-missing	 alle-
	   les at that site.

	 --mac <integer>
	 --max-mac <integer>
	   Include only	sites with Minor Allele	Count greater than or equal to
	   the	"--mac"	value and less than or equal to	the "--max-mac"	value.
	   One of these	options	may be used without the	other. Allele count is
	   simply the number of	times that allele appears over all individuals
	   at that site.

	 --min-alleles <integer>
	 --max-alleles <integer>
	   Include only	sites with a number of alleles greater than  or	 equal
	   to  the "--min-alleles" value and less than or equal	to the "--max-
	   alleles" value. One of these	options	may be used without the	other.
	   For example,	to include only	bi-allelic sites, one could use:
	     vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

   GENOTYPE VALUE FILTERING
	 --min-meanDP <float>
	 --max-meanDP <float>
	   Includes only sites with mean depth values (over all	included indi-
	   viduals) greater than or equal to the "--min-meanDP"	value and less
	   than	or equal to the	"--max-meanDP" value. One of these options may
	   be used without the other. These options require that the "DP" FOR-
	   MAT tag is included for each	site.

	 --hwe <float>
	   Assesses sites for Hardy-Weinberg Equilibrium using an exact	 test,
	   as  defined	by Wigginton, Cutler and Abecasis (2005). Sites	with a
	   p-value below the threshold defined by this option are taken	to  be
	   out of HWE, and therefore excluded.

	 --max-missing <float>
	   Exclude  sites  on the basis	of the proportion of missing data (de-
	   fined to be between 0 and 1,	where 0	allows	sites  that  are  com-
	   pletely missing and 1 indicates no missing data allowed).
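
	   Because the value 1 means no missing data is allowed, the
	   threshold behaves like a minimum fraction of called genotypes. The
	   Python sketch below illustrates that reading; counting per
	   genotype (rather than per allele) and treating half-missing
	   genotypes as missing are assumptions, and the function names are
	   hypothetical.

```python
def fraction_called(genotypes):
    """Fraction of genotypes with no missing ('.') allele."""
    if not genotypes:
        return 0.0
    called = sum(1 for gt in genotypes if "." not in gt)
    return called / len(genotypes)

def passes_max_missing(genotypes, threshold):
    """Keep the site if the called fraction reaches the threshold."""
    return fraction_called(genotypes) >= threshold

site = ["0/0", "0/1", "./.", "1/1"]
print(fraction_called(site))
print(passes_max_missing(site, 0.5), passes_max_missing(site, 1.0))
```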

	 --max-missing-count <integer>
	   Exclude  sites with more than this number of	missing	genotypes over
	   all individuals.

	 --phased
	   Excludes all	sites that contain unphased genotypes.

   MISCELLANEOUS FILTERING
	 --minQ	<float>
	   Includes only sites with Quality value above	this threshold.

INDIVIDUAL FILTERING OPTIONS
       These options are used to include or exclude certain  individuals  from
       any analysis being performed by the program.
	 --indv	<string>
	 --remove-indv <string>
	   Specify an individual to be kept or removed from the analysis.
	   These options can be used multiple times to specify multiple
	   individuals. If both options are specified, then the "--indv"
	   option is executed before the "--remove-indv" option.

	 --keep	<filename>
	 --remove <filename>
	   Provide files containing a list of individuals to either include or
	   exclude in subsequent analysis. Each individual ID (as defined in
	   the VCF header line) should be included on a separate line. If both
	   options are used, then the "--keep" option is executed before the
	   "--remove" option. When multiple files are provided, the individuals
	   kept are the union of individuals from all keep files minus the
	   union of individuals from all remove files. No header line is
	   expected.
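
	   The set logic for multiple keep and remove files can be sketched
	   in Python as follows. The sample IDs and the function name
	   select_individuals are hypothetical; treating "no keep files" as
	   "keep everyone" is an assumption for the sketch.

```python
def select_individuals(all_ids, keep_lists=(), remove_lists=()):
    """Union of all keep lists minus union of all remove lists,
    returned in the order the individuals appear in the VCF header."""
    keep = set().union(*keep_lists) if keep_lists else set(all_ids)
    remove = set().union(*remove_lists)
    return [i for i in all_ids if i in keep and i not in remove]

all_ids = ["NA001", "NA002", "NA003", "NA004"]
selected = select_individuals(
    all_ids,
    keep_lists=[["NA001", "NA002"], ["NA003"]],
    remove_lists=[["NA002"]],
)
print(selected)
```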

	 --max-indv <integer>
	   Randomly thins individuals so that only the	specified  number  are
	   retained.

GENOTYPE FILTERING OPTIONS
       These  options  are  used  to exclude genotypes from any	analysis being
       performed by the	program. If excluded, these values will	be treated  as
       missing.
	 --remove-filtered-geno-all
	   Excludes all	genotypes with a FILTER	flag not equal to "." (a miss-
	   ing value) or PASS.

	 --remove-filtered-geno	<string>
	   Excludes genotypes with a specific FILTER flag.

	 --minGQ <float>
	   Exclude all genotypes with a	quality	below the threshold specified.
	   This	 option	requires that the "GQ" FORMAT tag is specified for all
	   sites.

	 --minDP <float>
	 --maxDP <float>
	   Includes only genotypes with depth greater than or equal to the
	   "--minDP" value and less than or equal to the "--maxDP" value.
	   These options require that the "DP" FORMAT tag is specified for all
	   sites.

OUTPUT OPTIONS
       These options specify which analyses or conversions to perform  on  the
       data that passed	through	all specified filters.

   OUTPUT ALLELE STATISTICS
	 --freq
	 --freq2
	   Outputs  the	allele frequency for each site in a file with the suf-
	   fix ".frq". The second option is used to suppress output of any in-
	   formation about the alleles.

	 --counts
	 --counts2
	   Outputs the raw allele counts for each site in a file with the suf-
	   fix ".frq.count". The second	option is used to suppress  output  of
	   any information about the alleles.

	 --derived
	   For	use  with  the previous	four frequency and count options only.
	   Re-orders the output	file columns so	that the ancestral allele  ap-
	   pears first.	This option relies on the ancestral allele being spec-
	   ified in the	VCF file using the AA tag in the INFO field.

   OUTPUT DEPTH	STATISTICS
	 --depth
	   Generates  a	 file  containing  the mean depth per individual. This
	   file	has the	suffix ".idepth".

	 --site-depth
	   Generates a file containing the depth per site  summed  across  all
	   individuals.	This output file has the suffix	".ldepth".

	 --site-mean-depth
	   Generates a file containing the mean	depth per site averaged	across
	   all individuals. This output	file has the suffix ".ldepth.mean".

	 --geno-depth
	   Generates  a	 (possibly  very  large) file containing the depth for
	   each	genotype in the	VCF file. Missing entries are given the	 value
	   -1. The file	has the	suffix ".gdepth".

   OUTPUT LD STATISTICS
	 --hap-r2
	   Outputs  a file reporting the r2, D,	and D' statistics using	phased
	   haplotypes. These are the traditional measures of LD	often reported
	   in the population genetics literature. The output file has the suf-
	   fix ".hap.ld". This option assumes that  the	 VCF  input  file  has
	   phased haplotypes.

	 --geno-r2
	   Calculates  the  squared  correlation coefficient between genotypes
	   encoded as 0, 1 and 2 to represent the number of non-reference  al-
	   leles  in  each  individual.	This is	the same as the	LD measure re-
	   ported by PLINK. The	D and D' statistics  are  only	available  for
	   phased genotypes. The output	file has the suffix ".geno.ld".
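
	   The genotype-based measure described above is a squared Pearson
	   correlation between 0/1/2 codes. A minimal Python sketch, assuming
	   diploid GT strings and skipping individuals with a missing
	   genotype at either site (the handling of missing data is an
	   assumption, and the function names are hypothetical):

```python
def dosage(gt):
    """Number of non-reference alleles, or None if the genotype is missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for a in alleles if a != "0")

def geno_r2(gts_a, gts_b):
    """Squared correlation between the dosages at two sites."""
    pairs = [(dosage(a), dosage(b)) for a, b in zip(gts_a, gts_b)]
    pairs = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # undefined when either site shows no variation
    return cov * cov / (var_x * var_y)

site_a = ["0/0", "0/1", "1/1", "0/1"]
site_b = ["1/1", "0/1", "0/0", "0/1"]   # perfectly anti-correlated with site_a
site_c = ["0/0", "0/0", "0/0", "0/0"]   # monomorphic
print(geno_r2(site_a, site_a), geno_r2(site_a, site_b), geno_r2(site_a, site_c))
```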

	 --geno-chisq
	   If  your  data contains sites with more than	two alleles, then this
	   option can be used to test for genotype independence	via  the  chi-
	   squared statistic. The output file has the suffix ".geno.chisq".

	 --hap-r2-positions <positions list file>
	 --geno-r2-positions <positions	list file>
	   Outputs  a  file reporting the r2 statistics	of the sites contained
	   in the provided file versus all other sites. The output files have
	   the	suffix	".list.hap.ld"	or ".list.geno.ld", depending on which
	   option is used.

	 --ld-window <integer>
	   This	optional parameter defines the maximum number of SNPs  between
	   the	SNPs  being  tested for	LD in the "--hap-r2", "--geno-r2", and
	   "--geno-chisq" functions.

	 --ld-window-bp	<integer>
	   This	optional parameter defines  the	 maximum  number  of  physical
	   bases  between  the	SNPs  being  tested  for LD in the "--hap-r2",
	   "--geno-r2",	and "--geno-chisq" functions.

	 --ld-window-min <integer>
	   This	optional parameter defines the minimum number of SNPs  between
	   the	SNPs  being  tested for	LD in the "--hap-r2", "--geno-r2", and
	   "--geno-chisq" functions.

	 --ld-window-bp-min <integer>
	   This	optional parameter defines  the	 minimum  number  of  physical
	   bases  between  the	SNPs  being  tested  for LD in the "--hap-r2",
	   "--geno-r2",	and "--geno-chisq" functions.

	 --min-r2 <float>
	   This	optional parameter sets	a minimum value	for  r2,  below	 which
	   the	LD  statistic  is not reported by the "--hap-r2", "--geno-r2",
	   and "--geno-chisq" functions.

	 --interchrom-hap-r2
	 --interchrom-geno-r2
	   Outputs a file reporting the	r2 statistics for sites	 on  different
	   chromosomes.	 The output files have the suffix ".interchrom.hap.ld"
	   or ".interchrom.geno.ld", depending on the option used.

   OUTPUT TRANSITION/TRANSVERSION STATISTICS
	 --TsTv	<integer>
	   Calculates the Transition / Transversion ratio in bins of size  de-
	   fined by this option. Only uses bi-allelic SNPs. The	resulting out-
	   put file has	the suffix ".TsTv".
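
	   The Ts/Tv statistics rest on classifying each bi-allelic SNP as a
	   transition (purine to purine, or pyrimidine to pyrimidine) or a
	   transversion. A minimal Python sketch of that classification
	   follows; it ignores binning, and the function names are
	   hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(ref, alt):
    """Return "Ts" for a transition and "Tv" for a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("not a single-base substitution")
    pair = {ref, alt}
    return "Ts" if pair <= PURINES or pair <= PYRIMIDINES else "Tv"

def tstv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ts = sum(1 for ref, alt in snps if classify(ref, alt) == "Ts")
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")

snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
print([classify(r, a) for r, a in snps], tstv_ratio(snps))
```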

	 --TsTv-summary
	   Calculates  a  simple summary of all	Transitions and	Transversions.
	   The output file has the suffix ".TsTv.summary".

	 --TsTv-by-count
	   Calculates the Transition / Transversion ratio as a function	of al-
	   ternative allele count. Only	uses bi-allelic	 SNPs.	The  resulting
	   output file has the suffix ".TsTv.count".

	 --TsTv-by-qual
	   Calculates the Transition / Transversion ratio as a function	of SNP
	   quality  threshold. Only uses bi-allelic SNPs. The resulting	output
	   file	has the	suffix ".TsTv.qual".

	 --FILTER-summary
	   Generates a summary of the number of	SNPs and Ts/Tv ratio for  each
	   FILTER category. The	output file has	the suffix ".FILTER.summary".

   OUTPUT NUCLEOTIDE DIVERGENCE	STATISTICS
	 --site-pi
	   Measures nucleotide diversity on a per-site basis. The output file
	   has the suffix ".sites.pi".
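
	   A common per-site estimator is the fraction of chromosome pairs
	   that carry different alleles. The Python sketch below uses that
	   textbook definition as an assumption; it is not taken from the
	   vcftools source, and site_pi is a hypothetical name.

```python
from collections import Counter

def site_pi(alleles):
    """Fraction of chromosome pairs at one site that differ.
    `alleles` lists one allele per non-missing chromosome."""
    n = len(alleles)
    if n < 2:
        return 0.0
    counts = Counter(alleles)
    # Ordered pairs of chromosomes that carry the same allele.
    same_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same_pairs / (n * (n - 1))

# Four chromosomes, one carrying the alternate allele:
# 3 of the 6 possible pairs differ.
print(site_pi(["0", "0", "0", "1"]))
```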

	 --window-pi <integer>
	 --window-pi-step <integer>
	   Measures  the nucleotide diversity in windows, with the number pro-
	   vided as the	window size. The output	file  has  the	suffix	".win-
	   dowed.pi".  The  latter is an optional argument used	to specify the
	   step	size in	between	windows.
	   Note: vcftools can make use of a mask (defined using	the --mask pa-
	   rameter) to define which sites have been well characterized for the
	   estimation of nucleotide diversity. Using  a	 mask  to  define  the
	   well-characterized  portion of the genome is	recommended when esti-
	   mating nucleotide diversity,	as (for	example) genetic variants  may
	   be poorly characterized in low-coverage or poorly sequenced regions
	   of the genome.

   OUTPUT FST STATISTICS
	 --weir-fst-pop	<filename>
	   This	 option	 is  used  to  calculate an Fst	estimate from Weir and
	   Cockerham's 1984 paper. This	is the preferred calculation  of  Fst.
	   The	provided file must contain a list of individuals (one individ-
	   ual per line) from the VCF file that	correspond to one  population.
	   This	 option	 can  be used multiple times to	calculate Fst for more
	   than	two populations. These files will also be included as "--keep"
	   options. By default,	calculations are done on a per-site basis. The
	   output file has the suffix ".weir.fst".

	 --fst-window-size <integer>
	 --fst-window-step <integer>
	   These options can be	used with "--weir-fst-pop" to do the Fst  cal-
	   culations  on  a  windowed basis instead of a per-site basis. These
	   arguments specify the desired window	size and the desired step size
	   between windows.

   OUTPUT OTHER	STATISTICS
	 --het
	   Calculates a	measure	of heterozygosity on a	per-individual	basis.
	   Specifically, the inbreeding coefficient, F, is estimated for each
	   individual using a method of	moments. The resulting	file  has  the
	   suffix ".het".

	 --hardy
	   Reports  a  p-value for each	site from a Hardy-Weinberg Equilibrium
	   test	(as defined by Wigginton, Cutler and Abecasis (2005)). The re-
	   sulting file	(with suffix ".hwe") also contains the	Observed  num-
	   bers	 of  Homozygotes  and  Heterozygotes and the corresponding Ex-
	   pected numbers under	HWE.
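
	   The expected genotype counts under HWE follow from the allele
	   frequency at the site. The Python sketch below covers only that
	   step for a bi-allelic site; it does not compute the exact-test
	   p-value, and the function name is hypothetical.

```python
def hwe_expected(n_hom_ref, n_het, n_hom_alt):
    """Expected (hom-ref, het, hom-alt) counts under Hardy-Weinberg
    Equilibrium, given the observed counts at a bi-allelic site."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return (0.0, 0.0, 0.0)
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    return (n * p * p, 2 * n * p * q, n * q * q)

expected = hwe_expected(30, 40, 30)   # observed: fewer heterozygotes than expected
print(expected)
```

	   In this example the allele frequency is 0.5, so 50 heterozygotes
	   are expected among 100 individuals while only 40 are observed.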

	 --TajimaD <integer>
	   Outputs Tajima's D statistic	in bins	with  size  of	the  specified
	   number. The output file has the suffix ".Tajima.D".

	 --indv-freq-burden
	   This	 option	calculates the number of variants within each individ-
	   ual of a specific frequency.	The  resulting	file  has  the	suffix
	   ".ifreqburden".

	 --LROH
	   This	option will identify and output	Long Runs of Homozygosity. The
	   output  file	has the	suffix ".LROH".	This function is experimental,
	   and will use	a lot of memory	if applied to large datasets.

	 --relatedness
	   This option is used to calculate and output a relatedness statistic
	   based on the method of Yang et al., Nature Genetics 2010
	   (doi:10.1038/ng.608). Specifically, it calculates the unadjusted
	   Ajk statistic. The expectation of Ajk is zero for individuals
	   within a population, and one for an individual with themselves. The
	   output file has the suffix ".relatedness".

	 --relatedness2
	   This	option is used to calculate and	output a relatedness statistic
	   based  on  the  method  of  Manichaikul et al., BIOINFORMATICS 2010
	   (doi:10.1093/bioinformatics/btq559).	The output file	has the	suffix
	   ".relatedness2".

	 --site-quality
	   Generates a file containing the per-site SNP	quality, as  found  in
	   the QUAL column of the VCF file. This file has the suffix ".lqual".

	 --missing-indv
	   Generates  a	file reporting the missingness on a per-individual ba-
	   sis.	The file has the suffix	".imiss".

	 --missing-site
	   Generates a file reporting the missingness on a per-site basis. The
	   file	has the	suffix ".lmiss".

	 --SNPdensity <integer>
	   Calculates the number and density of	SNPs in	bins of	 size  defined
	   by this option. The resulting output	file has the suffix ".snpden".

	 --kept-sites
	   Creates  a file listing all sites that have been kept after filter-
	   ing.	The file has the suffix	".kept.sites".

	 --removed-sites
	   Creates a file listing all sites that have been removed after  fil-
	   tering. The file has	the suffix ".removed.sites".

	 --singletons
	   This option will generate a file detailing the location of
	   singletons, and the individual they occur in. The file reports both
	   true singletons and private doubletons (i.e. SNPs where the minor
	   allele only occurs in a single individual and that individual is
	   homozygous for that allele). The output file has the suffix
	   ".singletons".
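
	   The distinction between the two cases can be sketched in Python
	   for a bi-allelic site. The function name, the returned labels and
	   the sample IDs are hypothetical and do not reflect the columns of
	   the actual output file.

```python
from collections import Counter

def classify_site(genotypes):
    """Return (kind, individual) for a singleton or private doubleton at a
    bi-allelic site, else None. `genotypes` maps individual -> GT string."""
    per_indiv = {}
    totals = Counter()
    for indiv, gt in genotypes.items():
        alleles = [a for a in gt.replace("|", "/").split("/") if a != "."]
        per_indiv[indiv] = Counter(alleles)
        totals.update(alleles)
    if len(totals) != 2:
        return None  # monomorphic or not bi-allelic
    minor, minor_count = min(totals.items(), key=lambda kv: kv[1])
    carriers = [i for i, c in per_indiv.items() if c[minor] > 0]
    if minor_count == 1:
        return ("singleton", carriers[0])
    if minor_count == 2 and len(carriers) == 1:
        return ("private doubleton", carriers[0])
    return None

print(classify_site({"s1": "0/0", "s2": "0/1", "s3": "0/0"}))
print(classify_site({"s1": "0/0", "s2": "1/1", "s3": "0/0"}))
print(classify_site({"s1": "0/1", "s2": "0/1", "s3": "0/0"}))
```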

	 --hist-indel-len
	   This	option will generate a histogram file of the length of all in-
	   dels	(including SNPs). It shows both	the count and  the  percentage
	   of all indels for indel lengths that	occur at least once in the in-
	   put	file.  SNPs are	considered indels with length zero. The	output
	   file	has the	suffix ".indel.hist".
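
	   The histogram can be pictured with the Python sketch below, which
	   takes the indel length to be the ALT length minus the REF length
	   so that SNPs have length zero. The sign convention and the
	   function name are assumptions made for illustration.

```python
from collections import Counter

def indel_length_histogram(variants):
    """Map each observed length to (count, percentage of all variants).
    `variants` is a list of (ref, alt) allele strings."""
    lengths = Counter(len(alt) - len(ref) for ref, alt in variants)
    total = sum(lengths.values())
    return {length: (n, 100.0 * n / total)
            for length, n in sorted(lengths.items())}

variants = [("A", "G"), ("AT", "A"), ("A", "ATT"), ("C", "T")]
hist = indel_length_histogram(variants)
print(hist)
```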

	 --hapcount <BED file>
	   This	option will output the number of unique	haplotypes within user
	   specified bins, as defined by the BED file. The output file has the
	   suffix ".hapcount".

	 --mendel <PED file>
	   This option is used to report Mendelian errors identified in trios.
	   The command requires a PLINK-style PED file, with the first four
	   columns specifying a family ID, the child ID, the father ID, and
	   the mother ID. The output of this command has the suffix ".mendel".

	 --extract-FORMAT-info <string>
	   Extract information from the genotype fields in the VCF file
	   relating to a specified FORMAT identifier. The resulting output
	   file has the suffix ".<FORMAT_ID>.FORMAT". For example, the
	   following command would extract all of the GT (i.e. Genotype)
	   entries:
	     vcftools --vcf file1.vcf --extract-FORMAT-info GT

	 --get-INFO <string>
	   This	option is used to extract information from the INFO  field  in
	   the	VCF  file.  The	<string> argument specifies the	INFO tag to be
	   extracted, and the option can be used multiple times	 in  order  to
	   extract  multiple  INFO  entries.  The  resulting file, with	suffix
	   ".INFO", contains the required INFO information in a	 tab-separated
	   table.  For	example, to extract the	NS and DB flags, one would use
	   the command:
	     vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

   OUTPUT VCF FORMAT
	 --recode
	 --recode-bcf
	   These options are used to generate a	new file in either VCF or  BCF
	   from	the input VCF or BCF file after	applying the filtering options
	   specified by	the user. The output file has the suffix ".recode.vcf"
	   or  ".recode.bcf". By default, the INFO fields are removed from the
	   output file,	as the INFO values may be invalidated by the  recoding
	   (e.g.  the  total  depth may	need to	be recalculated	if individuals
	   are removed). This behavior may be overridden by the following
	   options. By default, BCF files are written out as BGZF compressed
	   files.

	 --recode-INFO <string>
	 --recode-INFO-all
	   These options can be	used with the above recode options  to	define
	   an  INFO  key  name	to keep	in the output file. This option	can be
	   used	multiple times to keep more of the INFO	fields.	The second op-
	   tion	is used	to keep	all INFO values	in the original	file.

	 --contigs <string>
	   This option can be used in conjunction with "--recode-bcf" when the
	   input file does not have any contig declarations. This option
	   expects a file name with one contig header per line. These lines
	   are included in the output file.

   OUTPUT OTHER	FORMATS
	 --012
	   This	 option	 outputs  the genotypes	as a large matrix. Three files
	   are produced. The first, with suffix	".012",	contains the genotypes
	   of each individual on a separate line. Genotypes are	represented as
	   0, 1 and 2, where the number represents the number of non-reference
	   alleles. Missing genotypes are represented by -1. The second	 file,
	   with	 suffix	 ".012.indv"  details  the individuals included	in the
	   main	file. The third	file, with suffix ".012.pos" details the  site
	   locations included in the main file.
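
	   The genotype encoding used in the main file can be sketched in
	   Python as follows. It assumes GT strings as input and one row per
	   individual; the function names and sample IDs are hypothetical,
	   and the exact column layout of the ".012" file is not reproduced.

```python
def encode_012(gt):
    """Count non-reference alleles in a genotype; -1 if it is missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return -1
    return sum(1 for a in alleles if a != "0")

def genotype_matrix(samples):
    """One row per individual, one column per site."""
    return [[encode_012(gt) for gt in gts] for gts in samples.values()]

samples = {
    "s1": ["0/0", "0/1", "1/1"],
    "s2": ["./.", "1|0", "0/0"],
}
matrix = genotype_matrix(samples)
print(matrix)
```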

	 --IMPUTE
	   This	 option	 outputs  phased  haplotypes in	IMPUTE reference-panel
	   format. As IMPUTE requires phased data, using this option also  im-
	   plies  --phased.  Unphased  individuals and genotypes are therefore
	   excluded. Only bi-allelic sites are included	in the	output.	 Using
	   this	 option	 generates  three files. The IMPUTE haplotype file has
	   the suffix ".impute.hap", and the IMPUTE legend file	has the	suffix
	   ".impute.hap.legend".   The	 third	 file,	 with	suffix	 ".im-
	   pute.hap.indv",  details  the individuals included in the haplotype
	   file, although this file is not needed by IMPUTE.

	 --ldhat
	 --ldhelmet
	 --ldhat-geno
	   These options output data in LDhat/LDhelmet format. They require
	   the "--chr" filter option to also be used. The first two options
	   output phased data only, and therefore also imply "--phased",
	   leading to unphased individuals and genotypes being excluded. For
	   LDhelmet, only SNPs will be considered, and therefore it implies
	   "--remove-indels". The "--ldhat-geno" option treats all of the data
	   as unphased, and therefore outputs LDhat files in genotype/unphased
	   format. For LDhat, two output files are generated with the suffixes
	   ".ldhat.sites" and ".ldhat.locs", which correspond to the LDhat
	   "sites" and "locs" input files respectively; for LDhelmet, the two
	   files generated have the suffixes ".ldhelmet.snps" and
	   ".ldhelmet.pos", which correspond to the "SNPs" and "positions"
	   files.

	 --BEAGLE-GL
	 --BEAGLE-PL
	   These options output	genotype likelihood information	for input into
	   the BEAGLE program. The VCF file  is	 required  to  contain	FORMAT
	   fields with "GL" or "PL" tags, which	can generally be output	by SNP
	   callers  such as the	GATK. Use of this option requires a chromosome
	   to be specified via the "--chr" option. The resulting  output  file
	   has	the  suffix ".BEAGLE.GL" or ".BEAGLE.PL" and contains genotype
	   likelihoods for biallelic sites. This file is  suitable  for	 input
	   into	BEAGLE via the "like=" argument.

	 --plink
	 --plink-tped
	 --chrom-map
	   These  options  output  the genotype	data in	PLINK PED format. With
	   the first option, two files are generated, with suffixes ".ped" and
	   ".map". Note	that only bi-allelic loci will be output. Further  de-
	   tails of these files	can be found in	the PLINK documentation.
	   Note:  The  first  option can be very slow on large datasets. Using
	   the --chr option to divide up the dataset is	advised,  or  alterna-
	   tively  use	the --plink-tped option	which outputs the files	in the
	   PLINK transposed format with	suffixes ".tped" and ".tfam".
	   For usage with variant sites	in  species  other  than  humans,  the
	   --chrom-map	option	may  be	used to	specify	a file name that has a
	   tab-delimited mapping of chromosome name to a desired integer value
	   with	one line per chromosome. This file must	contain	a mapping  for
	   every chromosome value found	in the file.
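
	   The chromosome map layout can be sketched in Python, together with
	   a check that every chromosome has a mapping. The chromosome names
	   and function names are hypothetical examples.

```python
def read_chrom_map(text):
    """Parse tab-delimited 'name<TAB>integer' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, number = line.split("\t")
        mapping[name] = int(number)
    return mapping

def missing_chromosomes(mapping, chromosomes):
    """Chromosomes seen in the data that have no entry in the map."""
    return sorted(set(chromosomes) - set(mapping))

chrom_map = read_chrom_map("chrI\t1\nchrII\t2\nchrMito\t3\n")
print(chrom_map)
print(missing_chromosomes(chrom_map, ["chrI", "chrIII"]))
```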

COMPARISON OPTIONS
       These  options are used to compare the original variant file to another
       variant file and	output the results. All	of the diff functions  require
       both files to contain the same chromosomes and that the files be	sorted
       in  the	same  order. If	one of the files contains chromosomes that the
       other file does not, use	the --not-chr filter to	remove them  from  the
       analysis.

   DIFF	VCF FILE
	 --diff	<filename>
	 --gzdiff <filename>
	 --diff-bcf <filename>
	   These  options  compare  the	 original input	file to	this specified
	   VCF,	gzipped	VCF, or	BCF file. These	options	must be	specified with
	   one additional option described below in order to specify what type
	   of comparison is to be performed. See the examples section for typ-
	   ical	usage.

   DIFF	OPTIONS
	 --diff-site
	   Outputs the sites that are common / unique to each file. The	output
	   file	has the	suffix ".diff.sites_in_files".
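
	   Conceptually, the site comparison is a set operation on
	   (chromosome, position) pairs, as in the Python sketch below. The
	   dictionary keys are hypothetical labels and do not reflect the
	   columns of the actual output file; allele matching is ignored.

```python
def diff_sites(sites_1, sites_2):
    """Split sites into those common to both files and those unique to each."""
    s1, s2 = set(sites_1), set(sites_2)
    return {
        "both": sorted(s1 & s2),
        "only_1": sorted(s1 - s2),
        "only_2": sorted(s2 - s1),
    }

file_1 = [("1", 100), ("1", 200), ("2", 50)]
file_2 = [("1", 200), ("2", 50), ("2", 75)]
result = diff_sites(file_1, file_2)
print(result)
```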

	 --diff-indv
	   Outputs the individuals that	are common / unique to each file.  The
	   output file has the suffix ".diff.indv_in_files".

	 --diff-site-discordance
	   This	option calculates discordance on a site	by site	basis. The re-
	   sulting output file has the suffix ".diff.sites".

	 --diff-indv-discordance
	   This	 option	 calculates discordance	on a per-individual basis. The
	   resulting output file has the suffix	".diff.indv".

	 --diff-indv-map <filename>
	   This	option allows the user to specify a mapping of individual  IDs
	   in  the second file to those	in the first file. The program expects
	   the file to contain a tab-delimited line containing an individual's
	   name	in file	one followed by	that same individual's	name  in  file
	   two with one	mapping	per line.

	 --diff-discordance-matrix
	   This	option calculates a discordance	matrix.	This option only works
	   with	bi-allelic loci	with matching alleles that are present in both
	   files.  The	resulting  output  file	 has the suffix	".diff.discor-
	   dance.matrix".

	 --diff-switch-error
	   This	option calculates phasing  errors  (specifically  "switch  er-
	   rors"). This	option creates an output file describing switch	errors
	   found between sites,	with suffix ".diff.switch".

AUTHORS
       Adam Auton
       Anthony Marcketta
