The Allele Frequency Calculator
VCF files of variant sites and genotypes released by the 1000 Genomes Project are usually annotated with allele frequencies (AF) at the global and continental super-population levels. Many users also want the AF of particular variants in a specific population of interest. The AF Calculator provides an interface for generating AF for the variants in a given genomic interval for a given population. If no population is specified, the tool calculates and outputs AF for every population in the input files.
The AF Calculator takes two input files: a VCF file containing individual genotypes and a sample-to-population mapping file. Both files are released by the 1000 Genomes Project as part of its standard releases on the project FTP site. The tool also works with external files independent of the project, provided they are publicly accessible. Local files can be used as input to the Calculator's Perl API script.
The VCF file format specifications can be found at the hts-specs website.
The VCF files must be indexed with tabix. All VCF files on the 1000 Genomes FTP site are indexed this way. The tabix software is available from the SAMtools SourceForge project.
The sample-to-population mapping file is a tab-delimited file: the first column is the sample ID and the second column is the population. Additional columns are ignored. An example can be found on the project FTP site in the phase1 integrated variant calls directory.
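The mapping format described above can be read with a few lines of Python. This is a minimal sketch, not the Calculator's own code: the function name read_sample_panel, the sample IDs, and the extra columns in the example data are hypothetical and used for illustration only.

```python
import csv
import io
from collections import defaultdict


def read_sample_panel(handle):
    """Return {population: [sample_id, ...]} from a tab-delimited mapping file.

    Column 1 is the sample ID, column 2 is the population; any further
    columns are ignored, as described in the text.
    """
    pop_to_samples = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or incomplete lines
        sample_id, population = row[0].strip(), row[1].strip()
        pop_to_samples[population].append(sample_id)
    return dict(pop_to_samples)


# Hypothetical sample IDs and extra columns, for illustration only.
example = io.StringIO(
    "SAMPLE_A\tGBR\tEUR\tILLUMINA\n"
    "SAMPLE_B\tGBR\tEUR\tILLUMINA\n"
    "SAMPLE_C\tYRI\tAFR\tILLUMINA\n"
)
panel = read_sample_panel(example)
print(panel)
```

Grouping the samples by population up front makes it straightforward to restrict the frequency calculation to one population or to loop over all of them.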
The calculator must be given a genomic interval that defines the sites for which frequencies are calculated. For the web-based tool we recommend an interval shorter than 5 Mb so that the tool returns results in a reasonable time.
After you click the “Next” button, you may choose a population from a dropdown menu. If you choose “ALL” as the population, the AF will be calculated for every population in the VCF file.
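As an illustration of the interval recommendation, the sketch below parses an interval and checks it against the 5 Mb guideline. It assumes the interval is written as chromosome:start-end with 1-based inclusive coordinates, the notation used by tabix; the text does not state the exact format the Calculator expects. The names parse_interval and within_web_limit are hypothetical.

```python
import re

MAX_WEB_INTERVAL = 5_000_000  # recommended upper bound for the web tool, in bases


def parse_interval(region):
    """Split 'chrom:start-end' into (chrom, start, end); assumed notation."""
    match = re.fullmatch(r"([^:\s]+):(\d+)-(\d+)", region.strip())
    if match is None:
        raise ValueError(f"expected chrom:start-end, got {region!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {region!r}")
    return chrom, start, end


def within_web_limit(start, end):
    """True if the inclusive interval is shorter than the 5 Mb guideline."""
    return (end - start + 1) < MAX_WEB_INTERVAL


chrom, start, end = parse_interval("1:1000000-2000000")
print(chrom, start, end, within_web_limit(start, end))
```

Larger regions can be split into several intervals under the limit and submitted one at a time.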
The output of the calculator can be previewed on the web page and an output file can be downloaded.
Here are a few lines from an example output file:
The columns in the header are:
POS: Start position of the variant
ID: Identification of the variant
REF: Reference allele
ALT: Alternative allele
TOTAL_CNT: Total number of alleles in samples of the chosen population(s)
ALT_CNT: Number of alternative alleles observed in samples of the chosen population(s)
FRQ: Ratio of ALT_CNT to TOTAL_CNT
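The relationship between the last three columns can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Calculator's implementation: the input is the GT sub-field of each sample in the chosen population at one site, missing alleles (".") are assumed to be excluded from TOTAL_CNT, and the function name allele_frequency is hypothetical.

```python
import re


def allele_frequency(genotypes, alt_index=1):
    """Return (TOTAL_CNT, ALT_CNT, FRQ) for one site.

    genotypes: GT strings such as "0|1" or "1/1" for the chosen samples.
    alt_index: which ALT allele to count (1 = first ALT allele).
    """
    total_cnt = 0
    alt_cnt = 0
    for gt in genotypes:
        for allele in re.split(r"[|/]", gt):
            if allele == ".":
                continue  # assumption: missing calls are not counted
            total_cnt += 1
            if int(allele) == alt_index:
                alt_cnt += 1
    frq = alt_cnt / total_cnt if total_cnt else 0.0
    return total_cnt, alt_cnt, frq


# Four hypothetical samples: six called alleles, three of them ALT.
gts = ["0|0", "0|1", "1|1", "./."]
total_cnt, alt_cnt, frq = allele_frequency(gts)
print(total_cnt, alt_cnt, frq)  # prints: 6 3 0.5
```

Splitting on both "|" and "/" handles phased and unphased genotypes alike, and counting per allele rather than per sample also covers haploid calls.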