VCF-SPLIT(1)		    General Commands Manual		  VCF-SPLIT(1)

NAME
       vcf-split  -  Efficiently  split	a multi-sample VCF stream into single-
       sample files

SYNOPSIS
       vcf-split \
	   [--het-only]	[--alt-only] [--max-calls N] \
	   [--sample-id-file file] [--output-fields field-spec]	\
	   output-file-prefix first-column last-column < file.vcf

       bcftools	view file.bcf |	vcf-split ...

OPTIONS AND ARGUMENTS
       --het-only
              Output only heterozygous sites.  When decoding a BCF file,
              filtering with "bcftools view --genotype het" slows down the
              "bcftools view" process, which is already saturating a CPU
              core, whereas vcf-split uses far less CPU and has cycles to
              spare.  Letting vcf-split perform the heterozygous site
              selection therefore increases pipeline performance considerably.

       --alt-only
	      Output only sites	with at	least one ALT allele.

       --max-calls N
	      Limit the	number of VCF calls processed (for quick testing with-
	      out the need to generate smaller test input files).

       --sample-id-file	filename
              File containing a whitespace-separated list of arbitrary sample
              IDs, which must match the sample names in the VCF input.
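
       Because the IDs must match the sample names in the input, it can be
       worth checking the ID file before starting a long run.  The following
       is a minimal sketch, not part of vcf-split: it assumes the standard
       VCF layout in which the "#CHROM" header line lists sample names from
       column 10 onward.  The function name missing_ids and the files
       demo.vcf and ids.txt are hypothetical and exist only for illustration.

```sh
#!/bin/sh
# Sketch: list the IDs in a sample ID file that do not appear in the
# VCF header.  demo.vcf and ids.txt are hypothetical demonstration files.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Tiny VCF: the "#CHROM" line names the samples from column 10 onward.
printf '##fileformat=VCFv4.2\n' > "$tmp/demo.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' >> "$tmp/demo.vcf"
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t0|0\t1|1\n' >> "$tmp/demo.vcf"

# Whitespace-separated IDs; S9 is deliberately absent from the VCF.
printf 'S1 S3\nS9\n' > "$tmp/ids.txt"

# missing_ids VCF-FILE ID-FILE
# Prints one line for each requested ID not found in the VCF header.
missing_ids() {
    # Read only up to the "#CHROM" line, then stop scanning the VCF.
    header=$(sed -n '/^#CHROM/{p;q;}' "$1")
    printf '%s\n' "$header" | awk '
        NR == FNR { for (i = 10; i <= NF; i++) have[$i] = 1; next }
        { for (i = 1; i <= NF; i++) if (!($i in have)) print $i }
    ' - "$2"
}

missing_ids "$tmp/demo.vcf" "$tmp/ids.txt"
```

       For BCF input, the header could be obtained by piping "bcftools view"
       output into the same check instead of reading a file directly.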

       --output-fields field-spec
              Indicates which fields to pass to the output.  Fields not listed
              here are replaced with a reasonable placeholder for that field,
              such as ".".  field-spec is a comma-separated list of one or
              more of chrom, pos, id, ref, alt, qual, filter, and info.

       output-file-prefix
	      Common filename prefix for all single-sample output  files  (see
	      Examples directory).

       first-column last-column
              1-based column numbers delimiting the samples processed in each
              run.  vcf-split opens one output stream for each sample, and
              many systems cannot support more than 30,000 to 40,000 open
              files at a time.  For example, for a 100,000-sample VCF stream,
              you may want to do multiple runs of 10,000 each (see EXAMPLES
              below).  To prevent system overload, the maximum number of open
              files is hard-coded at 10,000.  To override this limit, you must
              edit vcf-split.h and recompile.  Note that using
              --sample-id-file may limit the number of open files to fewer
              than last-column - first-column + 1 and may make multiple runs
              unnecessary.
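
       The column ranges for a series of runs can be generated rather than
       typed by hand.  The sketch below is one possible approach, not
       something shipped with vcf-split: batch_ranges is a hypothetical
       helper, and input.vcf and the "sample-" prefix are placeholder names.
       It only prints the commands (a dry run) so that the ranges can be
       inspected first.

```sh
#!/bin/sh
# batch_ranges TOTAL BATCH
# Print "first last" column pairs covering samples 1..TOTAL, with at
# most BATCH samples per pair.
batch_ranges() {
    total=$1
    batch=$2
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + batch - 1))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "$first $last"
        first=$((last + 1))
    done
}

# Dry run: print one vcf-split command per batch instead of running it.
# input.vcf and the "sample-" prefix are placeholders.
batch_ranges 100000 10000 | while read -r first last; do
    echo "vcf-split sample- $first $last < input.vcf"
done
```

       Replacing the echo with the command itself would run the batches one
       after another, rereading the input once per batch.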

PURPOSE
       vcf-split efficiently splits a multi-sample VCF stream into single-sam-
       ple VCF files.

DESCRIPTION
       Traditional  methods  for splitting a multi-sample VCF stream into sin-
       gle-sample files	involve	a loop or parallel job that rereads the	multi-
       sample input for	every sample.  This is grossly inefficient and can be-
       come a major bottleneck where there are many samples and/or  the	 input
       is compressed.  For example, using "bcftools view" with optimal filter-
       ing options to decode one human chromosome BCF with 137,977 samples and
       pipe the	VCF output through "wc"	took 12	hours on a fast	server using 2
       cores.  To split	it into	137,977	single-sample VCFs would therefore re-
       quire  about 137,977 * 12 * 2 = ~3 million core-hours.  This translates
       to 171 years on a single	server or 125 days using 1000 cores on an  HPC
       cluster,	 assuming  file	 I/O  is  not a	bottleneck with	1000 processes
       reading the same input file.  (The input would need to be prestaged on
       multiple local disks to avoid overloading the network file system.)
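
       The estimate above can be reproduced with shell integer arithmetic.
       This is only a worked check of the figures in the text; the variable
       names are illustrative.  Note that the year and day figures follow
       from the rounded value of 3 million core-hours, while the unrounded
       product is about 3.3 million, so they are best read as
       order-of-magnitude estimates.

```sh
#!/bin/sh
samples=137977        # samples in the example BCF
hours_per_pass=12     # wall-clock hours for one "bcftools view" pass
cores_per_pass=2      # cores used by each pass

# One full pass over the input per sample.
core_hours=$((samples * hours_per_pass * cores_per_pass))
echo "exact core-hours: $core_hours"

# The text rounds this to 3 million before converting to elapsed time.
rounded=3000000
years_one_server=$((rounded / cores_per_pass / 24 / 365))
days_1000_cores=$((rounded / 1000 / 24))
echo "years on one 2-core server: $years_one_server"
echo "days on 1000 cores: $days_1000_cores"
```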

       vcf-split solves	this problem by	writing	a large	number of  single-sam-
       ple VCFs	simultaneously during a	single read of the multi-sample	input.
       Modern  Unix  systems  support tens of thousands	of simultaneously open
       files, providing	a simple way to	achieve	enormous speedup.
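
       Whether a given system really allows that many open files per process
       depends on its resource limits.  The sketch below checks the current
       soft limit using "ulimit -n", which common shells such as bash, dash,
       ksh, and FreeBSD sh support.  The helper fits_limit and the reserve of
       16 descriptors are assumptions made for this illustration; the manual
       does not state how vcf-split itself handles the limit.

```sh
#!/bin/sh
# fits_limit NEEDED LIMIT
# Succeed if LIMIT open files is enough for NEEDED output files plus a
# small reserve for stdin, stdout, stderr, and similar descriptors.
# The reserve of 16 is an arbitrary assumption for this sketch.
fits_limit() {
    needed=$1
    limit=$2
    if [ "$limit" = "unlimited" ]; then
        return 0
    fi
    [ "$limit" -ge $((needed + 16)) ]
}

limit=$(ulimit -n)
if fits_limit 10000 "$limit"; then
    echo "soft limit $limit is enough for 10,000 output files"
else
    echo "soft limit $limit is too low for 10,000 output files"
fi
```

       If the soft limit is too low, it can usually be raised up to the hard
       limit with "ulimit -n" before starting the pipeline in the same shell.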

       To avoid	system overload, vcf-split has a hard-coded  limit  of	10,000
       samples	at  a  time.  Hence, vcf-split may reduce the time required to
       split a large VCF by a factor of	10,000 and can process 137,977 samples
       in 14 passes.
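
       The pass count is a ceiling division of the sample count by the
       per-run limit.  As a small worked check (variable names are
       illustrative):

```sh
#!/bin/sh
samples=137977
max_per_pass=10000    # hard-coded per-run limit described above

# Ceiling division: number of passes needed to cover every sample.
passes=$(( (samples + max_per_pass - 1) / max_per_pass ))

# Samples handled by the final, partial pass.
final=$(( samples - (passes - 1) * max_per_pass ))

echo "passes: $passes"
echo "samples in final pass: $final"
```

       Thirteen full passes cover 130,000 samples and the fourteenth covers
       the remaining 7,977.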

       vcf-split is written entirely in C and attempts to optimize CPU,
       memory, and disk access.  It does not read large amounts of data into
       RAM, so memory use is trivial and it runs mostly from CPU cache,
       making the computational code as fast as possible.

       The  example  BCF  file mentioned above can be split in a few days on a
       single server using the maximum of 10,000 samples per run.

SEE ALSO
       ad2vcf, vcf2hap,	haplohseq, biolibc

EXAMPLES
       Split a simple VCF file with 100	samples, filtering for specific	sample
       IDs:

       vcf-split < input.vcf --het-only	--sample-id-file samples.csv \
	   single-sample- 1 100

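       Split a large BCF file with 120,000 samples (too many for your open
       file limit) in four runs of 30,000 samples each.  Runs of this size
       exceed the hard-coded limit of 10,000 and assume it has been raised in
       vcf-split.h: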

       bcftools	view --min-ac 2	--exclude-types	indels \
	   freeze.8.chr1.pass_only.phased.bcf \
	   | vcf-split --het-only chr01. 1 30000

       bcftools	view --min-ac 2	--exclude-types	indels \
	   freeze.8.chr1.pass_only.phased.bcf \
	   | vcf-split --het-only chr01. 30001 60000

       bcftools	view --min-ac 2	--exclude-types	indels \
	   freeze.8.chr1.pass_only.phased.bcf \
	   | vcf-split --het-only chr01. 60001 90000

       bcftools	view --min-ac 2	--exclude-types	indels \
	   freeze.8.chr1.pass_only.phased.bcf \
	   | vcf-split --het-only chr01. 90001 120000

BUGS
       Please report bugs to the author and send patches in unified diff
       format.  (Run "man diff" for more information.)

AUTHOR
       Jason W.	Bacon
       Paul Auer Lab
       UW -- Milwaukee Zilber School of	Public Health

								  VCF-SPLIT(1)
