Variant Effect Predictor
Custom annotations
VEP can integrate custom annotation from standard format files into your results by using the --custom flag.
These files may be hosted locally or remotely, with no limit to the number or size of the files. The files must be indexed using the tabix utility (BED, GFF, GTF, VCF); bigWig files contain their own indices.
Annotations typically appear as key=value pairs in the Extra column of the VEP output; they will also appear in the INFO column if using VCF format output. The value for a particular annotation is defined as the identifier for each feature; if not available, an identifier derived from the coordinates of the annotation is used. Annotations will appear in each line of output for the variant where multiple lines exist.
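As a minimal sketch of reading these key=value pairs back out of tab-delimited output, the shell function below prints the value of one key from the Extra column. It assumes that Extra is the last tab-separated column and that pairs are separated by semicolons. The function name `extra_value`, the key `Phenotype`, and the sample line are illustrative only and are not real VEP output.

```shell
# Print the value of KEY (first argument) from the Extra column of each
# data line read from stdin. Assumes Extra is the last tab-separated column
# and that its key=value pairs are separated by ';'.
extra_value() {
  awk -F '\t' -v key="$1" '
    /^#/ { next }                          # skip header/comment lines
    {
      n = split($NF, pairs, ";")
      for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")
        if (eq > 0 && substr(pairs[i], 1, eq - 1) == key)
          print substr(pairs[i], eq + 1)
      }
    }'
}

# Illustrative line only (made-up values, shortened to three columns).
printf 'var1\t1:10500\tSTRAND=1;Phenotype=Feature1\n' | extra_value Phenotype
```

For real results, the same function could be fed from the output file, for example `extra_value Phenotype < output.txt`, where `output.txt` is a placeholder name.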
Data formats
VEP supports the following annotation formats:
Format | Type | Description | Notes |
---|---|---|---|
GFF, GTF | Gene/transcript annotations | Formats to describe genes and other genomic features (format specifications: GFF3 and GTF) | Requires a FASTA file in offline mode or if the desired species or assembly is not part of the Ensembl species list. |
VCF | Variant data | A format used to describe genomic variants | VEP uses the 3rd column as the identifier. INFO and FILTER fields from records may be added to the VEP output. |
BED | Basic/uninterpreted data | A simple tab-delimited format containing 3-12 columns of data. The first 3 columns contain the coordinates of the feature. | VEP uses the 4th column (if available) as the feature identifier. |
bigWig | Basic/uninterpreted data | A format for storage of dense continuous data. | VEP uses the value for the given position as the identifier. BigWig files contain their own indices, and do not need to be indexed by tabix. Requires Bio::DB::BigFile. |
Any other files can be easily converted to be compatible with VEP; the easiest format to produce is a BED-like file containing coordinates and an (optional) identifier:
chr1 10000 11000 Feature1
chr3 25000 26000 Feature2
chrX 99000 99001 Feature3
Chromosomes can be denoted by either e.g. "chr7" or "7", "chrX" or "X".
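The conversion to a BED-like file can be sketched with awk. The input layout here (identifier, chromosome, 1-based start, end, tab-separated) is a hypothetical example, as is the function name `to_bed_like`; adjust the column numbers to your own file. Because BED start coordinates are 0-based, the sketch subtracts 1 from the 1-based start.

```shell
# Convert a hypothetical table (identifier, chromosome, 1-based start, end)
# into BED-like columns: chromosome, 0-based start, end, identifier.
to_bed_like() {
  awk -F '\t' 'BEGIN { OFS = "\t" }
    /^#/ { next }                          # drop comment lines
    { print $2, $3 - 1, $4, $1 }'
}

# Illustrative input; prints two tab-separated BED-like lines.
printf 'Feature1\tchr1\t10001\t11000\nFeature2\tchr3\t25001\t26000\n' | to_bed_like
```

The output still needs to be sorted, compressed with bgzip, and indexed with tabix before VEP can use it.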
Preparing files
Custom annotation files must be prepared in a particular way in order to work with tabix and therefore with VEP. Files must be stripped of comment lines, sorted in chromosome and position order, compressed using bgzip and finally indexed using tabix. Here are some examples of that process for:
- GFF file
grep -v "#" myData.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > myData.gff.gz
tabix -p gff myData.gff.gz
- BED file
grep -v "#" myData.bed | sort -k1,1 -k2,2n -k3,3n -t$'\t' | bgzip -c > myData.bed.gz
tabix -p bed myData.bed.gz
The tabix utility has several preset filetypes that it can process, and it can also process any arbitrary filetype containing at least a chromosome and position column. See the documentation for details.
If you are going to use the file remotely (i.e. over HTTP or FTP protocol), you should ensure the file is world-readable on your server.
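To see what the comment-stripping and sorting steps do before any compression or indexing, the sketch below runs them on a small inline BED sample. The function name `clean_and_sort_bed` and the sample records are illustrative. One deliberate difference from the commands above: it uses `grep -v '^#'`, which drops only lines that begin with `#`, whereas `grep -v "#"` drops any line containing that character.

```shell
# A literal tab character, obtained portably for use with sort -t.
tab=$(printf '\t')

# Strip comment lines, then sort by chromosome, start and end
# (the preparation steps for a BED file, without bgzip/tabix).
clean_and_sort_bed() {
  grep -v '^#' | sort -t "$tab" -k1,1 -k2,2n -k3,3n
}

# Illustrative unsorted input with one comment line.
printf '#comment\nchr3\t25000\t26000\tFeature2\nchr1\t10000\t11000\tFeature1\nchr1\t500\t900\tFeature0\n' |
  clean_and_sort_bed
```

The start and end columns are sorted numerically (`n`), so position 500 is placed before position 10000 on the same chromosome.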
Options
Since VEP 110, you can configure each custom file using a comma-separated list of key-value pairs:
./vep [...] --custom file=Filename,short_name=Short_name,format=File_type,type=Annotation_type,fields=VCF_fields
The order of the options is irrelevant and most options have sensible defaults as described below:
Option | Accepted values | Description |
---|---|---|
file | String with valid path to file | (Required) Filename: The path to the file. For tabix-indexed files, VEP will check if both the file and the corresponding index (.tbi) exist. For remote files, VEP will check that the tabix index is accessible on startup. |
format | bed, gff, gtf, vcf or bigwig | (Required) File format of file. |
short_name | Annotation filename (default) or any string without commas | Short name: A name for the annotation that will appear as the key in the key=value pairs in the results. If not defined, this will default to the annotation filename. |
fields | Percentage (%) separated list of INFO fields and/or FILTER | VCF fields: INFO fields to print (such as AC) present in the custom input VCF, or FILTER for the FILTER field, to add these as custom annotations. |
type | overlap (default), within, surrounding or exact | Annotation type. |
overlap_cutoff | From 0 (default) to 100 | Minimum percentage overlap (*) between annotation and variant. See also reciprocal. |
reciprocal | 0 (default) or 1 | Mode of calculating the overlap percentage (*). |
distance | 0 or a positive integer (disabled by default) | Distance (in base pairs) to the ends of the overlapping feature (*). |
coords | 0 (default) or 1 | Force report coordinates. |
same_type | 0 (default) or 1 | Only match identical variant classes (*). For instance, only match deletions with deletions. This is only available for VCF annotations. |
num_records | 50 (default), all, 0 or any positive integer | Number of matching records to display. Any remaining records are represented with ellipsis (...). Use num_records = all to display all matching records and num_records = 0 to only display ... if there are matching records. |
summary_stats | none (default), min, mean, max, count or sum | Summary statistics to display. A percentage-separated list may be used to calculate multiple summary statistics, such as min%mean%max%count%sum. |
When format = vcf, the features marked with (*) only work on structural variants.
Examples:
# BigWig file
./vep [...] --custom file=frequencies.bw,short_name=Frequency,format=bigwig,type=exact,coords=0

# BED file
./vep [...] --custom file=http://www.myserver.com/data/myPhenotypes.bed.gz,short_name=Phenotype,format=bed,type=exact,coords=1

# VCF file
./vep [...] --custom file=https://ftp.ensembl.org/pub/grch37/data_files/homo_sapiens/GRCh37/variation_genotype/TOPMED_GRCh37.vcf.gz,format=vcf,type=exact,coords=0,fields=TOPMED
./vep [...] --custom file=gnomad_v2.1_sv.sites.vcf.gz,short_name=gnomad,fields=PC%EVIDENCE%SVTYPE,format=vcf,type=within,reciprocal=1,overlap_cutoff=80

# For multiple custom files, use:
./vep [...] --custom file=clinvar.vcf.gz,short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN \
  --custom file=TOPMED_GRCh38_20180418.vcf.gz,short_name=topmed_20180418,format=vcf,type=exact,coords=0,fields=TOPMED \
  --custom file=UK10K_COHORT.20160215.sites.GRCh38.vcf.gz,short_name=uk10k,format=vcf,type=exact,coords=0,fields=AF_ALSPAC
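When the list of fields is long or generated by a script, the comma-separated key=value string can be assembled in the shell rather than typed by hand. The sketch below is one possible way to do it; the helper name `join_percent` and the variable names are illustrative, and the option values are copied from the ClinVar example above.

```shell
# Join all arguments with '%', the separator used by the fields option.
# The function body runs in a subshell so the IFS change does not leak.
join_percent() (
  IFS='%'
  printf '%s\n' "$*"
)

fields=$(join_percent CLNSIG CLNREVSTAT CLNDN)
custom="file=clinvar.vcf.gz,short_name=ClinVar,format=vcf,type=exact,coords=0,fields=$fields"

# Show the argument that would be passed to VEP.
printf '%s\n' "--custom $custom"
```

The resulting string can then be passed as `--custom "$custom"`. Quoting the variable keeps the whole value as a single argument.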
Example - ClinVar
We include the most recent public variant and phenotype data available in each Ensembl release, but some projects release data more frequently than we do.
If you want to have the very latest annotations, you can use the data files from your preferred projects (in any format listed in Data formats) and use them as a VEP custom annotation.
For instance, you can annotate your variants with VEP, using the latest ClinVar data as custom annotation.
ClinVar provides VCF files on their FTP site: GRCh37 and GRCh38.
The following example shows how to use ClinVar VCF files as a VEP custom annotation:
- Download the VCF files (you need the compressed VCF file and the index file), e.g.:
# Compressed VCF file
curl -O https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
# Index file
curl -O https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi
- Example of a command you can use:
./vep [...] --custom file=clinvar.vcf.gz,short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN

## Where the selected ClinVar INFO fields (from the ClinVar VCF file) are:
# - CLNSIG: Clinical significance for this single variant
# - CLNREVSTAT: ClinVar review status for the Variation ID
# - CLNDN: ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB
# Of course you can select the INFO fields you want in the ClinVar VCF file

# Quick example on GRCh38:
./vep --id "1 230710048 230710048 A/G 1" --species homo_sapiens -o /path/to/output/output.txt --cache --offline --assembly GRCh38 --custom file=/path/to/custom_files/clinvar.vcf.gz,short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN
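To decide which INFO fields to request, it can help to list the field IDs declared in the VCF header. The sketch below does this with sed, relying on the standard `##INFO=<ID=...,` header syntax. The function name `list_info_ids` is illustrative, and the header lines are a shortened stand-in rather than the real ClinVar header.

```shell
# List the INFO field IDs declared in a VCF header read from stdin.
list_info_ids() {
  sed -n 's/^##INFO=<ID=\([^,]*\),.*/\1/p'
}

# Illustrative header lines with placeholder descriptions.
printf '%s\n' \
  '##fileformat=VCFv4.1' \
  '##INFO=<ID=CLNSIG,Number=.,Type=String,Description="...">' \
  '##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="...">' \
  '##INFO=<ID=CLNDN,Number=.,Type=String,Description="...">' |
  list_info_ids

# For the downloaded file, something like this should work, since
# bgzip output can be read by gzip:
#   gzip -dc clinvar.vcf.gz | list_info_ids
```

The IDs printed this way can be joined with `%` and used as the value of the fields option.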
Using remote files
The tabix utility makes it possible to read annotation files from remote locations, for example over HTTP or FTP protocols.
In order to do this, the .tbi index file is downloaded locally (to the current working directory) when VEP is run. From this point on, only the portions of data requested by VEP (i.e. those overlapping the variants in your input file) are downloaded.
Be aware
bigWig files can also be used remotely in the same way as tabix-indexed files, although less stringent checks are carried out on VEP startup.
Example - phyloP and phastCons conservation scores
The UCSC Genome Browser provides multiple alignment files with phyloP and phastCons conservation scores for different genomes in the BigWig (.bw) format.
These files can be remotely used as VEP custom annotations by simply pointing to their URL. For instance, to include phyloP or phastCons 100 way conservation scores found in the Downloads section of the UCSC Genome Browser, you can use commands such as:
# Human GRCh38/hg38 phyloP100way scores
./vep [...] --custom file=http://hgdownload.soe.ucsc.edu/goldenPath/hg38/phyloP100way/hg38.phyloP100way.bw,short_name=phyloP100way,format=bigwig

# Human GRCh38/hg38 phastCons100way scores
./vep [...] --custom file=http://hgdownload.soe.ucsc.edu/goldenPath/hg38/phastCons100way/hg38.phastCons100way.bw,short_name=phastCons100way,format=bigwig