Variation File Format - Definition and supported options

The Ensembl Variant Effect Predictor (VEP) tool, which appears as an option when you click on Manage your Data, allows you to upload a set of variation data and predict the effects of the variants.

Note that the input and output formats are completely different.

Input format

Data must be supplied in a simple tab-separated format containing five columns, all of which are required:

chromosome - just the name or number, with no 'chr' prefix
start
end
allele - pair of alleles separated by a '/', with the reference allele first
strand - defined as + (forward) or - (reverse).

1   881907    881906    -/C   +
5   140532    140532    T/C   +
12  1017956   1017956   T/A   +
2   946507    946507    G/C   +
14  19584687  19584687  C/T   -
19  66520     66520     G/A   +
8   150029    150029    A/T   +

An insertion is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand of chromosome 8 is indicated as follows:

8   12601     12600     -/C   +

A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 on the reverse strand of chromosome 8 is indicated as follows:

8   12600     12602     CGT/- -
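
The coordinate rules above can be checked mechanically before uploading a file. The following is a minimal Python sketch, not part of Ensembl VEP: the names Variant, parse_variant and classify are hypothetical, and the parser splits on any whitespace (an assumption made so that it accepts both real tab-separated files and the space-aligned examples on this page).

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """One line of the default five-column input format (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    alleles: str  # e.g. "T/C", reference allele first
    strand: str   # "+" (forward) or "-" (reverse)


def parse_variant(line: str) -> Variant:
    """Parse one input line; raises ValueError if it does not have five valid columns."""
    fields = line.split()  # assumption: accept tabs or runs of spaces
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    chrom, start, end, alleles, strand = fields
    if "/" not in alleles:
        raise ValueError(f"alleles must be separated by '/': {alleles!r}")
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-': {strand!r}")
    return Variant(chrom, int(start), int(end), alleles, strand)


def classify(variant: Variant) -> str:
    """Classify a variant using the coordinate conventions described above."""
    ref, alt = variant.alleles.split("/", 1)
    # Insertion: reference allele is '-' and start = end + 1.
    if ref == "-" and variant.start == variant.end + 1:
        return "insertion"
    # Deletion: variant allele is '-' and the coordinates span the deleted bases.
    if alt == "-" and variant.end - variant.start + 1 == len(ref):
        return "deletion"
    # Single-base substitution: start = end and both alleles are one base long.
    if variant.start == variant.end and len(ref) == 1 and len(alt) == 1:
        return "substitution"
    return "other"


print(classify(parse_variant("8\t12601\t12600\t-/C\t+")))
print(classify(parse_variant("8\t12600\t12602\tCGT/-\t-")))
```

Lines that return "other" are not necessarily invalid; the sketch only recognises the three cases described on this page.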

The following input file formats are also supported:

Variant Call Format (VCF) - see http://www.1000genomes.org/wiki/Analysis/vcf4.0 for details.
Pileup format
HGVS notations - see https://varnomen.hgvs.org/ for details. These must be relative to genomic or Ensembl transcript coordinates. It is possible, although less reliable, to use notations relative to RefSeq transcripts in the Ensembl VEP script.
Variant identifiers - for example dbSNP rsIDs, or any synonym for a variant present in the Ensembl Variation database. A list of identifier sources is available on the Ensembl website.

When using the web Ensembl VEP, ensure that you have the correct file format selected from the drop-down menu. The Ensembl VEP script is able to auto-detect the format of the input file.

Output format

For each variation, the tool predicts the consequence, the amino acid position and change (if the variation falls within a protein), and the identifiers of known variations that occur at this position. The output columns are:

Uploaded variation - as chromosome_start_alleles
Location - in standard coordinate format (chr:start or chr:start-end)
Allele - the variant allele used to calculate the consequence
Gene - Ensembl stable ID of affected gene
Feature - Ensembl stable ID of feature
Feature type - type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature.
Consequence - consequence type of this variation
Relative position in cDNA - base pair position in cDNA sequence
Relative position in CDS - base pair position in coding sequence
Relative position in protein - amino acid position in protein
Amino acid change - only given if the variation affects the protein-coding sequence
Codons - the alternate codons with the variant base highlighted as bold (HTML) or upper case (text)
Corresponding variation - identifier of existing variation
Extra - this column contains extra information as key=value pairs separated by ";". The keys are as follows:
- HGNC - the HGNC gene identifier
- ENSP - the Ensembl protein identifier of the affected transcript
- HGVSc - the HGVS coding sequence name
- HGVSp - the HGVS protein sequence name
- SIFT - the SIFT prediction and/or score, with both given as prediction(score)
- PolyPhen - the PolyPhen prediction and/or score
- Condel - the Condel consensus prediction and/or score
- MOTIF_NAME - the source and identifier of a transcription factor binding profile (TFBP) aligned at this position
- MOTIF_POS - the relative position of the variation in the aligned TFBP
- HIGH_INF_POS - a flag indicating if the variant falls in a high information position of the TFBP
- MOTIF_SCORE_CHANGE - the difference in motif score of the reference and variant sequences for the TFBP
- CANONICAL - a flag indicating if the transcript is denoted as the canonical transcript for this gene
- CCDS - the CCDS identifier for this transcript, where applicable
- INTRON - the intron number (out of total number)
- EXON - the exon number (out of total number)
- DOMAINS - the source and identifier of any overlapping protein domains

Empty values are denoted by '-'. Further fields in the Extra column can be added by plugins or using custom annotations in the Ensembl VEP script. Output fields can be configured using the --fields flag when running the Ensembl VEP script.
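
Because the Extra column packs several values into one field, downstream scripts usually need to unpack it. The following Python sketch shows one way to do so, based only on the key=value and prediction(score) conventions described above; the function names parse_extra and split_prediction are hypothetical and are not part of the Ensembl VEP script.

```python
import re
from typing import Dict, Optional, Tuple


def parse_extra(extra: str) -> Dict[str, str]:
    """Turn 'KEY=value;KEY=value' into a dict; an empty value ('-') gives {}."""
    if extra in ("", "-"):
        return {}
    result = {}
    for pair in extra.split(";"):
        if not pair:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = pair.partition("=")
        result[key] = value
    return result


def split_prediction(value: str) -> Tuple[Optional[str], Optional[float]]:
    """Split a SIFT/PolyPhen/Condel value into (prediction, score).

    Handles 'prediction(score)', a prediction alone, or a score alone.
    """
    match = re.fullmatch(r"([^()]+)\(([^()]+)\)", value)
    if match:
        return match.group(1), float(match.group(2))
    try:
        return None, float(value)  # score only
    except ValueError:
        return value, None         # prediction only


extra = parse_extra("SIFT=deleterious(0);PolyPhen=unknown(0)")
print(extra)
print(split_prediction(extra["SIFT"]))
```

Keys added by plugins or custom annotations pass through parse_extra unchanged, since the function makes no assumption about which keys are present.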

11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000525319  Transcript         NON_SYNONYMOUS_CODING   742  716  239  T/N  aCc/aAc  -  SIFT=deleterious(0);PolyPhen=unknown(0)
11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000534381  Transcript         5_PRIME_UTR             -    -    -    -    -        -  -
11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000529055  Transcript         DOWNSTREAM              -    -    -    -    -        -  -
11_224585_G/A    11:224585   A  ENSG00000142082  ENST00000529937  Transcript         INTRONIC,NMD_TRANSCRIPT -    -    -    -    -        -  HGVSc=ENST00000529937.1:c.136-346G>A
22_16084370_G/A  22:16084370 A  -                ENSR00000615113  RegulatoryFeature  REGULATORY_REGION       -    -    -    -    -        -  -
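
Rows like those above can be mapped onto the fourteen output columns with a few lines of code. The Python sketch below is illustrative only: the field names in COLUMNS are labels chosen here to mirror the column list on this page, not names defined by Ensembl VEP, and splitting on whitespace is an assumption that holds for the example rows shown.

```python
# Hypothetical field labels, in the order of the output columns listed above.
COLUMNS = [
    "uploaded_variation", "location", "allele", "gene", "feature",
    "feature_type", "consequence", "cdna_position", "cds_position",
    "protein_position", "amino_acids", "codons", "existing_variation", "extra",
]


def parse_output_line(line):
    """Map one output row to a dict; empty values ('-') become None."""
    values = line.split()  # assumption: no field contains whitespace
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    record = {name: (None if value == "-" else value)
              for name, value in zip(COLUMNS, values)}
    # A row may carry several comma-separated consequence types.
    record["consequence"] = record["consequence"].split(",") if record["consequence"] else []
    return record


def parse_location(location):
    """Split 'chr:start' or 'chr:start-end' into (chromosome, start, end)."""
    chrom, _, span = location.partition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), (int(end) if end else int(start))


row = parse_output_line(
    "11_224585_G/A 11:224585 A ENSG00000142082 ENST00000529937 Transcript "
    "INTRONIC,NMD_TRANSCRIPT - - - - - - HGVSc=ENST00000529937.1:c.136-346G>A"
)
print(row["consequence"])
print(parse_location(row["location"]))
```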

The Ensembl VEP script also adds a header to the output file. The header contains information about the databases connected to, and a key describing the key/value pairs used in the Extra column.

## ENSEMBL VARIANT EFFECT PREDICTOR v2.4
## Output produced at 2012-02-20 16:09:38
## Connected to homo_sapiens_core_66_37 on ensembldb.ensembl.org
## Using API version 66, DB version 66
## Extra column keys:
## CANONICAL    : Indicates if transcript is canonical for this gene
## CCDS         : Indicates if transcript is a CCDS transcript
## HGNC         : HGNC gene identifier
## ENSP         : Ensembl protein identifier
## HGVSc        : HGVS coding sequence name
## HGVSp        : HGVS protein sequence name
## SIFT         : SIFT prediction
## PolyPhen     : PolyPhen prediction
## Condel       : Condel SIFT/PolyPhen consensus prediction
## EXON         : Exon number
## INTRON       : Intron number
## DOMAINS      : The source and identifier of any overlapping protein domains
## MOTIF_NAME   : The source and identifier of a transcription factor binding profile (TFBP) aligned at this position
## MOTIF_POS    : The relative position of the variation in the aligned TFBP
## HIGH_INF_POS : A flag indicating if the variant falls in a high information position of the TFBP
## MOTIF_SCORE_CHANGE : The difference in motif score of the reference and variant sequences for the TFBP
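
Since the header documents the Extra column keys, a script can read the key descriptions from the file itself rather than hard-coding them. The following Python sketch assumes the header layout shown above ("##" lines, with key descriptions following the "Extra column keys:" line); the function name read_extra_keys is hypothetical.

```python
def read_extra_keys(lines):
    """Collect {key: description} from the '## Extra column keys:' header section."""
    keys = {}
    in_key_section = False
    for line in lines:
        if not line.startswith("##"):
            break  # the header has ended; data rows follow
        text = line[2:].strip()
        if text == "Extra column keys:":
            in_key_section = True
            continue
        if in_key_section and ":" in text:
            # Split on the first ':' only; descriptions may contain further colons.
            key, _, description = text.partition(":")
            keys[key.strip()] = description.strip()
    return keys


header = [
    "## ENSEMBL VARIANT EFFECT PREDICTOR v2.4",
    "## Output produced at 2012-02-20 16:09:38",
    "## Extra column keys:",
    "## SIFT         : SIFT prediction",
    "## Condel       : Condel SIFT/PolyPhen consensus prediction",
]
print(read_extra_keys(header))
```

Lines before the "Extra column keys:" marker are skipped even when they contain a colon, such as the timestamp line.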