Variant Effect Predictor Annotation sources
VEP can use a variety of annotation sources to retrieve the transcript models used to predict consequence types.
- Cache - a downloadable file containing all transcript models, regulatory features and variant data for a species
- GFF or GTF - use transcript models defined in a tabix-indexed GFF or GTF file
- Database - connect to a MySQL database server hosting Ensembl databases
Data from VCF, BED and bigWig files can also be incorporated by VEP's Custom annotation feature.
Using a cache (--cache) is the fastest and most efficient way to use VEP, as in most cases only a single initial network connection is made and most data is read from local disk. Use offline mode to eliminate all network connections for speed and/or privacy.
Ensembl creates cache files for every species for each Ensembl release. They can be automatically downloaded and configured using INSTALL.pl.
If you are interested in RefSeq transcripts, you may download an alternative cache file (e.g. homo_sapiens_refseq), or a merged file of RefSeq and Ensembl transcripts (e.g. homo_sapiens_merged); remember to specify --refseq or --merged when running VEP to use the relevant cache. See documentation for full details.
Manually downloading caches
It is also simple to download and set up caches without using the installer. By default, VEP searches for caches in $HOME/.vep; to use a different directory when running VEP, use --dir_cache.
cd $HOME/.vep
curl -O ftp://ftp.ensembl.org/pub/release-94/variation/VEP/homo_sapiens_vep_94_GRCh38.tar.gz
tar xzf homo_sapiens_vep_94_GRCh38.tar.gz
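For other species, releases or assemblies, the download URL can be assembled from its parts. The following is a minimal sketch, not part of VEP: vep_cache_url is a hypothetical helper, and it assumes that other caches follow the same naming pattern as the human example above. Check the FTP directory listing if a download fails.

```shell
# Hypothetical helper: build the cache tarball URL for a species, release
# and assembly, following the naming pattern of the human example above.
# Assumption: other species and releases use the same pattern.
vep_cache_url() {
  species=$1
  release=$2
  assembly=$3
  printf 'ftp://ftp.ensembl.org/pub/release-%s/variation/VEP/%s_vep_%s_%s.tar.gz\n' \
    "$release" "$species" "$release" "$assembly"
}

# Print the URL first; download and unpack only once it looks right.
vep_cache_url homo_sapiens 94 GRCh38
# curl -O "$(vep_cache_url homo_sapiens 94 GRCh38)"
```

Printing the URL before fetching it keeps the network step explicit, which is useful when scripting downloads for several species.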
FTP directories by species grouping:
|Ensembl Genomes:||Bacteria | Fungi | Metazoa | Plants | Protists|
NB: When using Ensembl Genomes caches, you should use the --cache_version option to specify the relevant Ensembl Genomes version number as these differ from the concurrent Ensembl/VEP version numbers.
Data in the cache
The data content of VEP caches varies by species. This table shows the contents of the default human cache files in release 94.
|Source||Version (GRCh38)||Version (GRCh37)|
|Ensembl database version||94||94|
|1000 Genomes||Phase 3 (remapped)||Phase 3|
|gnomAD||r2.0 170228, exomes only (remapped)||r2.0 170228, exomes only|
Limitations of the cache
Convert with tabix
For those with Bio::DB::HTS (as set up by INSTALL.pl) or tabix installed on their systems, the speed of retrieving existing co-located variants can be greatly improved by converting the cache files using the supplied script, convert_cache.pl. This replaces the plain-text, chunked variant dumps with a single tabix-indexed file per chromosome. The script is simple to run:
perl convert_cache.pl -species [species] -version [vep_version]
To convert all species and all versions, use "all":
perl convert_cache.pl -species all -version all
A full description of the options can be seen using --help. Once conversion is complete, VEP will automatically detect and use the converted cache.
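Because the conversion relies on external executables, a quick pre-flight check may save a failed run. This is only a sketch: missing_tools is a hypothetical helper, not something shipped with VEP, and it only tests whether the named programs are visible on PATH.

```shell
# Hypothetical helper: print the name of each listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools tabix bgzip)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Not found on PATH:" $missing
else
  echo "tabix and bgzip found"
fi
```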
Note that tabix and bgzip must be installed on your system to convert a cache. INSTALL.pl downloads these when setting up Bio::DB::HTS; to enable convert_cache.pl to find them, run:
Data privacy and offline mode
When using the public database servers, VEP requests transcript and variation data that overlap the loci in your input file. As such, these coordinates are transmitted over the network to a public server, which may not be appropriate for those with sensitive or private data. Users should note that only the coordinates are transmitted to the server; no other information is sent.
To run VEP in an offline mode that does not use any network connections, use the flag --offline.
ERROR: Cannot use ID format in offline mode
VEP can use transcript annotations defined in GFF or GTF files. The files must be bgzipped and indexed with tabix, and VEP requires a FASTA file containing the genomic sequence in order to generate transcript models.
Your GFF or GTF file must be sorted in chromosomal order. VEP does not use header lines so it is safe to remove them.
grep -v "#" data.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > data.gff.gz
tabix -p gff data.gff.gz
./vep -i input.vcf -gff data.gff.gz -fasta genome.fa.gz
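If a file is thought to be sorted already, this can be verified before compressing and indexing it. The sketch below reuses the sort keys from the command above (sequence name, then start, then end). gff_is_sorted is a hypothetical helper, and it assumes a sort implementation that supports -s (GNU and BSD sort do), so that records with identical keys are not reported as out of order.

```shell
# Hypothetical helper: succeed if an uncompressed GFF/GTF file is already in
# the order given by the keys: sequence name, start (numeric), end (numeric).
# Header and comment lines are skipped, as VEP does not use them.
gff_is_sorted() {
  tab=$(printf '\t')
  grep -v '^#' "$1" | sort -c -s -t "$tab" -k1,1 -k4,4n -k5,5n 2>/dev/null
}

# Example use: only compress and index when the check passes.
# gff_is_sorted data.gff && bgzip -c data.gff > data.gff.gz && tabix -p gff data.gff.gz
```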
You may use any number of GFF/GTF files in this way, providing they refer to the same genome. You may also use them in concert with annotations from a cache or database source; annotations are distinguished by the SOURCE field in the VEP output:
./vep -i input.vcf -cache -gff data.gff.gz -fasta genome.fa.gz
This functionality uses VEP's custom annotation feature, and the --gff flag is a shortcut to:
You should use the longer form if you wish to customise the name of the GFF as it appears in the SOURCE field and VEP output header.
GFF format expectations
VEP has been tested on GFF files generated by Ensembl and NCBI (RefSeq). Due to inconsistency in the GFF specification and adherence to it, VEP may encounter problems parsing some GFF files. For the same reason, not all transcript biotypes defined in your GFF may be supported by VEP. VEP does not support GFF files with embedded FASTA sequence.
The following entity types (3rd column in the GFF) are supported by VEP. Lines of other types will be ignored; if this leads to an incomplete transcript model, the whole transcript model may be discarded.
Expected parameters in the 9th column
- Entities in the GFF are expected to be linked using a key named "parent" or "Parent" in the attributes (9th) column of the GFF.
- Unlinked entities (i.e. those with no parents or children) are discarded.
- Sibling entities (those that share the same parent) may have overlapping coordinates, e.g. for exon and CDS entities.
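Broken parent links are a common reason for entities being dropped, so it can help to look for them before running VEP. The following is a rough sketch rather than a reimplementation of VEP's parser: gff_orphan_parents is a hypothetical helper that lists every Parent (or parent) value with no matching ID in the same uncompressed GFF file. It assumes tab-separated columns and simple key=value attributes separated by semicolons.

```shell
# Hypothetical helper: print each Parent value that does not match any ID
# in the file. Comma-separated parent lists are split into single values.
gff_orphan_parents() {
  awk -F'\t' '
    /^#/ { next }
    {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        split(attrs[i], kv, "=")
        if (kv[1] == "ID") ids[kv[2]] = 1
        if (kv[1] == "Parent" || kv[1] == "parent") {
          m = split(kv[2], ps, ",")
          for (j = 1; j <= m; j++) parents[ps[j]] = 1
        }
      }
    }
    END { for (p in parents) if (!(p in ids)) print p }
  ' "$1" | sort
}

# Example use:
# gff_orphan_parents data.gff
```

An empty result means every parent reference resolves; it does not guarantee that VEP will accept every transcript model.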
Transcripts require a Sequence Ontology biotype to be defined in order to be parsed by VEP. The simplest way to define this is using an attribute named "biotype" on the transcript entity. Other configurations are supported in order for VEP to be able to parse GFF files from NCBI and other sources.
Here is an example:
##gff-version 3.2.1
##sequence-region 1 1 10000
1 Ensembl gene 1000 5000 . + . ID=gene1;Name=GENE1
1 Ensembl transcript 1100 4900 . + . ID=transcript1;Name=GENE1-001;Parent=gene1;biotype=protein_coding
1 Ensembl exon 1200 1300 . + . ID=exon1;Name=GENE1-001_1;Parent=transcript1
1 Ensembl exon 1500 3000 . + . ID=exon2;Name=GENE1-001_2;Parent=transcript1
1 Ensembl exon 3500 4000 . + . ID=exon3;Name=GENE1-001_2;Parent=transcript1
1 Ensembl CDS 1300 3800 . + . ID=cds1;Name=CDS0001;Parent=transcript1
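To find transcripts that lack the simple biotype attribute used in this example, something like the following sketch may be useful. gff_transcripts_without_biotype is a hypothetical helper; it assumes a tab-separated file and checks only records whose type is exactly "transcript". Since VEP supports other ways of defining the biotype, a transcript listed here is not necessarily unusable.

```shell
# Hypothetical helper: print the ID of each "transcript" record that has no
# biotype attribute in the 9th column.
gff_transcripts_without_biotype() {
  awk -F'\t' '
    /^#/ { next }
    $3 == "transcript" && $9 !~ /(^|;)biotype=/ {
      id = "(no ID)"
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++)
        if (index(attrs[i], "ID=") == 1) id = substr(attrs[i], 4)
      print id
    }
  ' "$1"
}

# Example use:
# gff_transcripts_without_biotype data.gff
```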
GTF format expectations
The following GTF entity types will be parsed by VEP:
- cds (or CDS)
Entities are linked by an attribute named for the parent entity type, e.g. exon is linked to transcript by transcript_id, and transcript is linked to gene by gene_id.
Transcript biotypes are defined in attributes named "biotype", "transcript_biotype" or "transcript_type". If none of these exist, VEP will attempt to interpret the source field (2nd column) of the GTF as the biotype.
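The fallback described above can be previewed on a GTF file to see which biotype each transcript would end up with. This is an illustrative sketch only, not VEP's own code: gtf_transcript_biotypes is a hypothetical helper, the order in which the three attribute names are tried is an assumption (the text does not state a precedence), and attribute values containing semicolons are not handled.

```shell
# Hypothetical helper: read GTF lines on stdin and print one biotype per
# "transcript" record. Attributes are tried in the order biotype,
# transcript_biotype, transcript_type (assumed order); if none is present,
# the source field (2nd column) is used instead.
gtf_transcript_biotypes() {
  awk -F'\t' '
    BEGIN { n = split("biotype transcript_biotype transcript_type", keys, " ") }
    /^#/ || $3 != "transcript" { next }
    {
      biotype = ""
      m = split($9, attrs, ";")
      for (k = 1; k <= n && biotype == ""; k++) {
        for (i = 1; i <= m; i++) {
          a = attrs[i]
          sub(/^ +/, "", a)
          if (index(a, keys[k] " ") == 1) {
            v = substr(a, length(keys[k]) + 2)
            gsub(/"/, "", v)
            biotype = v
            break
          }
        }
      }
      if (biotype == "") biotype = $2
      print biotype
    }
  '
}

# Example use: count transcripts per resolved biotype.
# grep -v '^#' data.gtf | gtf_transcript_biotypes | sort | uniq -c
```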
If the chromosome names used in your GFF/GTF differ from those used in the FASTA or your input VCF, you may see warnings like this when running VEP:
WARNING: Chromosome 21 not found in annotation sources or synonyms on line 160
To circumvent this you may provide VEP with a synonyms file. A synonym file is included in VEP's cache files, so if you have one of these for your species you can use it as follows:
./vep -i input.vcf -cache -gff data.gff.gz -fasta genome.fa.gz -synonyms ~/.vep/homo_sapiens/94_GRCh38/chr_synonyms.txt
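Before reaching for a synonyms file, it can be useful to see exactly which chromosome names disagree. The sketch below compares the first column of an uncompressed VCF against the first column of an uncompressed GFF/GTF. chroms_missing_from_annotation is a hypothetical helper, not a VEP tool, and it does no synonym lookup of its own.

```shell
# Hypothetical helper: print each chromosome name used in the VCF ($1) that
# never appears in the annotation file ($2), in order of first appearance.
chroms_missing_from_annotation() {
  awk -F'\t' '
    /^#/ { next }
    FILENAME == ARGV[1] { known[$1] = 1; next }
    !($1 in known) && !seen[$1]++ { print $1 }
  ' "$2" "$1"
}

# Example use:
# chroms_missing_from_annotation input.vcf data.gff
```

Any names printed are candidates for entries in a synonyms file, or a sign that one of the inputs uses a different naming scheme (for example "chr21" versus "21").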
FASTA files
By pointing VEP to a FASTA file (or directory containing several files), it is possible to retrieve reference sequence locally when using --cache or --offline. This enables VEP to retrieve HGVS notations (--hgvs), check the reference sequence given in input data (--check_ref), and construct transcript models from a GFF or GTF file without accessing a database.
FASTA files can be set up using the installer. Files installed this way are detected automatically by VEP when using --cache or --offline, so you should not need to specify them manually with --fasta.
To enable this, VEP uses one of two modules:
- The Bio::DB::HTS Perl XS module with HTSlib. This module uses compiled C code and can access compressed (bgzipped) or uncompressed FASTA files. It is set up by the VEP installer.
- The Bio::DB::Fasta module. This may be used on systems where installation of the Bio::DB::HTS module has not been possible. It can access only uncompressed FASTA files. It is also set up by the VEP installer and comes as part of the BioPerl package.
The first time you run VEP with a specific FASTA file, an index will be built. This can take a few minutes, depending on the size of the FASTA file and the speed of your system. On subsequent runs the index does not need to be rebuilt (if the FASTA file has been modified, VEP will force a rebuild of the index).
Ensembl provides suitable reference FASTA files as downloads from its FTP server. See the Downloads page for details. You should preferably use the installer as described above to fetch these files; manual instructions are provided for reference. In most cases it is best to download the single large "primary_assembly" file for your species. You should use the unmasked (without "_rm" or "_sm" in the name) sequences. Note that VEP requires that the file be either unzipped (Bio::DB::Fasta) or unzipped and then recompressed with bgzip (Bio::DB::HTS::Faidx) to run; when unzipped these files can be very large (25GB for human). An example set of commands for setting up the data for human follows:
curl -O ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa
./vep -i input.vcf --offline --hgvs --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
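A file ending in .gz may be either plain gzip or bgzip output, and only the latter suits Bio::DB::HTS. One way to tell them apart is to inspect the header bytes. The sketch below is a heuristic, not an htslib call: is_bgzf is a hypothetical helper that looks for the gzip magic bytes with the extra-field flag set, followed by the "BC" subfield identifier that bgzip writes at byte offsets 12 and 13. It assumes the BC subfield is the first extra subfield, which is what bgzip produces.

```shell
# Hypothetical helper: succeed if the file starts with a BGZF-style header.
# Bytes 0-3 must be 1f 8b 08 04 (gzip, deflate, FEXTRA set) and bytes
# 12-13 must be 42 43 ("BC").
is_bgzf() {
  head_hex=$(od -A n -t x1 -N 14 "$1" | tr -d ' \n')
  case "$head_hex" in
    1f8b0804????????????????4243) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use:
# is_bgzf Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz || echo "recompress with bgzip"
```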
VEP can use remote or local database servers to retrieve annotations.
- Using --cache (without --offline) uses the local cache on disk to fetch most annotations, but allows database connections for some features (see cache limitations)
- Using --database tells VEP to retrieve all annotations from the database. Please only use this for small input files or when using a local database server!
Public database servers
By default, VEP is configured to connect to Ensembl's public MySQL instance at ensembldb.ensembl.org. For users in the US (or for any user geographically closer to the East coast of the USA than to Ensembl's data centre in Cambridge, UK), a mirror server is available at useastdb.ensembl.org. To use the mirror, use the flag
Users of Ensembl Genomes species (e.g. plants, fungi, microbes) should use the Ensembl Genomes public MySQL instance; the connection parameters for this can be loaded automatically by using the flag --genomes.
Users with small data sets (100s of variants) should find using the default connection settings adequate. Those with larger data sets, or those who wish to use VEP in a batch manner, should consider one of the alternatives below.
Using a local database
It is possible to set up a local MySQL mirror with the databases for your species of interest installed. For instructions on installing a local mirror, see here. You will need a MySQL server that you can connect to from the machine where you will run VEP (this can be the same machine). For most of the functionality of VEP, you will only need the Core database (e.g. homo_sapiens_core_94_38) installed. In order to find co-located variants or to use SIFT or PolyPhen, it is also necessary to install the relevant variation database (e.g. homo_sapiens_variation_94_38).
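The two example database names above share a pattern of species, group, release and assembly version. When scripting the setup of a mirror, the names can be generated rather than typed. This is a small sketch under that assumption: ensembl_db_name is a hypothetical helper, and the pattern is inferred only from the two examples given here.

```shell
# Hypothetical helper: compose a database name from species, group,
# Ensembl release and assembly version, matching the examples above.
ensembl_db_name() {
  printf '%s_%s_%s_%s\n' "$1" "$2" "$3" "$4"
}

ensembl_db_name homo_sapiens core 94 38
ensembl_db_name homo_sapiens variation 94 38

# Example use with the standard mysql client (HOST, PORT and USER are
# placeholders for your own server):
# mysql -h HOST -P PORT -u USER -e "SHOW DATABASES LIKE '$(ensembl_db_name homo_sapiens core 94 38)'"
```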
Note that unless you have custom data to insert in the database, in most cases it will be much more efficient to use a pre-built cache in place of a local database.
To connect to your mirror, you can either set the connection parameters using --host, --port, --user and --password, or use a registry file. Registry files contain all the connection parameters for your database, as well as any species aliases you wish to set up:
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "core",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_core_94_38'
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "variation",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_variation_94_38'
);

Bio::EnsEMBL::Registry->add_alias("Homo_sapiens","human");
For more information on the registry and registry files, see here.
Cache - technical information
ADVANCED The cache consists of compressed files containing listrefs of serialised objects. These objects are initially created from the database as if using the Ensembl API normally. In order to reduce the size of the cache and allow the serialisation to occur, some changes are made to the objects before they are dumped to disk. This means that they will not behave in exactly the same way as an object retrieved from the database when writing, for example, a plugin that uses the cache.
The following hash keys are deleted from each transcript object:
- dbentries : this contains the external references retrieved when calling $transcript->get_all_DBEntries(); hence this call on a cached object will return no entries
- transcript_mapper : used to convert between genomic, cdna, cds and protein coordinates. A copy of this is cached separately by VEP as
As mentioned above, a special hash key "_variation_effect_feature_cache" is created on the transcript object and used to cache things used by VEP in predicting consequences, things which might otherwise have to be fetched from the database. Some of these are stored in place of equivalent keys that are deleted as described above. The following keys and data are stored:
- introns : listref of intron objects for the transcript. The adaptor, analysis, dbID, next, prev and seqname keys are stripped from each intron object
- translateable_seq : as returned by
- mapper : transcript mapper as described above
- peptide : the translated sequence as a string, as returned by
- protein_features : protein domains for the transcript's translation as returned by $transcript->translation->get_all_ProteinFeatures. Each protein feature is stripped of all keys but: start, end, analysis, hseqname
- codon_table : the codon table ID used to translate the transcript, as returned by
- protein_function_predictions : a hashref containing the keys "sift" and "polyphen"; each one contains a protein function prediction matrix as returned by e.g.
Similarly, some further data is cached directly on the transcript object under the following keys:
- _gene : gene object. This object has all keys but the following deleted: start, end, strand, stable_id
- _gene_symbol : the gene symbol
- _ccds : the CCDS identifier for the transcript
- _refseq : the "NM" RefSeq mRNA identifier for the transcript
- _protein : the Ensembl stable identifier of the translation
- _source_cache : the source of the transcript object. Only defined in the merged cache (values: Ensembl, RefSeq) or when using a GFF/GTF file (value: short name or filename)