FreeBSD Manual Pages
VCF-SPLIT(1)                General Commands Manual               VCF-SPLIT(1)

NAME
       vcf-split - Efficiently split a multi-sample VCF stream into
       single-sample files

SYNOPSIS
       vcf-split \
           [--het-only] [--alt-only] [--max-calls N] \
           [--sample-id-file file] [--output-fields field-spec] \
           output-file-prefix first-column last-column < file.vcf

       bcftools view file.bcf | vcf-split ...

OPTIONS and ARGUMENTS
       --het-only
              Output only heterozygous sites.  When decoding a BCF file,
              "bcftools view --genotype het" slows down the "bcftools view"
              process, while vcf-split uses far less CPU.  Since bcftools
              is already saturating a CPU core and vcf-split has CPU cycles
              to spare, allowing vcf-split to perform the heterozygous site
              selection increases pipeline performance considerably.

       --alt-only
              Output only sites with at least one ALT allele.

       --max-calls N
              Limit the number of VCF calls processed (for quick testing
              without the need to generate smaller test input files).

       --sample-id-file filename
              File containing a whitespace-separated list of arbitrary
              sample IDs, which must match the sample names in the VCF
              input.

       --output-fields field-spec
              Indicates which fields to pass to the output.  Fields not
              indicated here are replaced with a reasonable placeholder for
              that field, such as ".".  field-spec is a comma-separated
              list of one or more of the following fields: chrom, pos, id,
              ref, alt, qual, filter, info.

       output-file-prefix
              Common filename prefix for all single-sample output files
              (see Examples directory).

       first-column last-column
              1-based column numbers limiting the number of samples for
              each run.  vcf-split opens one output stream for each sample
              and many systems cannot support more than 30,000 to 40,000
              open files at a time.  E.g. for a 100,000 sample VCF stream,
              you may want to do multiple runs of 10,000 each (see EXAMPLES
              below).  To prevent system overload, the maximum number of
              open files is hard-coded at 10,000.  To override this limit,
              you must edit vcf-split.h and recompile.
              Note that using --sample-id-file may limit the number of
              open files to fewer than last-column - first-column + 1 and
              may make multiple runs unnecessary.

PURPOSE
       vcf-split efficiently splits a multi-sample VCF stream into
       single-sample VCF files.

DESCRIPTION
       Traditional methods for splitting a multi-sample VCF stream into
       single-sample files involve a loop or parallel job that rereads the
       multi-sample input for every sample.  This is grossly inefficient and
       can become a major bottleneck when there are many samples and/or the
       input is compressed.  For example, using "bcftools view" with optimal
       filtering options to decode one human chromosome BCF with 137,977
       samples and pipe the VCF output through "wc" took 12 hours on a fast
       server using 2 cores.  To split it into 137,977 single-sample VCFs
       would therefore require about 137,977 * 12 * 2 = ~3 million
       core-hours.  This translates to roughly 171 years on a single server
       or 125 days using 1000 cores on an HPC cluster, assuming file I/O is
       not a bottleneck with 1000 processes reading the same input file.
       (The input would need to be prestaged on multiple local disks to
       avoid overloading the network file system.)

       vcf-split solves this problem by writing a large number of
       single-sample VCFs simultaneously during a single read of the
       multi-sample input.  Modern Unix systems support tens of thousands of
       simultaneously open files, providing a simple way to achieve enormous
       speedup.  To avoid system overload, vcf-split has a hard-coded limit
       of 10,000 samples at a time.  Hence, vcf-split may reduce the time
       required to split a large VCF by a factor of 10,000 and can process
       137,977 samples in 14 passes.

       vcf-split is written entirely in C and attempts to optimize CPU,
       memory, and disk access.  It does not inhale large amounts of data
       into RAM, so memory use is trivial and it runs mostly in cache RAM,
       making computational code as fast as possible.
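
       The multi-pass batching described above can be sketched in POSIX
       shell.  This is only an illustration and not part of vcf-split: the
       batch_ranges function and the output prefix "prefix." are hypothetical
       names, and the script prints the commands it would run rather than
       executing them.

```shell
#!/bin/sh
# Hypothetical helper (not part of vcf-split): print the first-column and
# last-column arguments for each run needed to cover all samples.
# $1 = total number of samples, $2 = maximum samples per run
batch_ranges()
{
    total=$1
    max_per_run=$2
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + max_per_run - 1))
        # The final run may cover fewer than max_per_run samples
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        printf '%s %s\n' "$first" "$last"
        first=$((last + 1))
    done
}

# Dry run: print (rather than execute) one pipeline per range, using the
# sample count and per-run limit quoted in the DESCRIPTION.
# "file.bcf" and "prefix." are placeholders.
batch_ranges 137977 10000 | while read -r first last; do
    printf 'bcftools view file.bcf | vcf-split prefix. %s %s\n' \
        "$first" "$last"
done
```

       With 137,977 samples and the default limit of 10,000 per run, this
       prints 14 commands, the last covering columns 130001 through 137977.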
       The example BCF file mentioned above can be split in a few days on a
       single server using the maximum of 10,000 samples per run.

SEE ALSO
       ad2vcf, vcf2hap, haplohseq, biolibc

EXAMPLES
       Split a simple VCF file with 100 samples, filtering for specific
       sample IDs:

       vcf-split < input.vcf --het-only --sample-id-file samples.csv \
           single-sample- 1 100

       Split a large BCF file with 120,000 samples (too many for your open
       file limit).  Each run below covers 30,000 samples, which exceeds
       the default hard-coded limit of 10,000, so vcf-split.h must be edited
       and vcf-split recompiled before these commands will work:

       bcftools view --min-ac 2 --exclude-types indels \
           freeze.8.chr1.pass_only.phased.bcf \
           | vcf-split --het-only chr01. 1 30000
       bcftools view --min-ac 2 --exclude-types indels \
           freeze.8.chr1.pass_only.phased.bcf \
           | vcf-split --het-only chr01. 30001 60000
       bcftools view --min-ac 2 --exclude-types indels \
           freeze.8.chr1.pass_only.phased.bcf \
           | vcf-split --het-only chr01. 60001 90000
       bcftools view --min-ac 2 --exclude-types indels \
           freeze.8.chr1.pass_only.phased.bcf \
           | vcf-split --het-only chr01. 90001 120000

BUGS
       Please report bugs to the author and send patches in unified diff
       format.  (Run "man diff" for more information.)

AUTHOR
       Jason W. Bacon
       Paul Auer Lab
       UW -- Milwaukee Zilber School of Public Health

                                                                  VCF-SPLIT(1)
