Variant Effect Predictor FAQ


For any questions not covered here, please send an email to the Ensembl developers' mailing list (public) or contact the Ensembl Helpdesk (private). You can also report issues through our public GitHub repositories: use the ensembl-vep repository for general VEP issues and the VEP_plugins repository for issues with specific plugins.

General questions

Q: Why has my insertion/deletion variant encoded in VCF disappeared from the VEP output?

Ensembl treats unbalanced variants differently to VCF - your variant hasn't disappeared, it may have just changed slightly! You can solve this by giving your variants a unique identifier in the third column of the VCF file. See here for a full discussion.
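
As a rough illustration of how such a variant can appear changed, the sketch below assumes that VEP drops the leading base shared by REF and ALT in a VCF indel and moves the start position on by one. The function name vcf_to_ensembl_alleles is hypothetical and is not part of VEP.

```shell
# Hypothetical helper (not part of VEP): show how a VCF indel might look
# once the shared leading base is removed, as Ensembl represents it.
# Usage: vcf_to_ensembl_alleles POS REF ALT
vcf_to_ensembl_alleles() {
    awk -v pos="$1" -v ref="$2" -v alt="$3" 'BEGIN {
        # Only unbalanced variants (REF and ALT of different length) change.
        if (length(ref) != length(alt) && substr(ref, 1, 1) == substr(alt, 1, 1)) {
            ref = substr(ref, 2)
            alt = substr(alt, 2)
            pos++
        }
        # An empty allele is written as "-".
        if (ref == "") ref = "-"
        if (alt == "") alt = "-"
        print pos, ref "/" alt
    }'
}

vcf_to_ensembl_alleles 3 TC T     # deletion:  prints "4 C/-"
vcf_to_ensembl_alleles 3 T TCA    # insertion: prints "4 -/CA"
vcf_to_ensembl_alleles 3 A G      # SNV:       prints "3 A/G"
```

When the third VCF column is empty, the identifier reported in the output is likely to be built from the converted position and alleles, which is why a unique identifier of your own makes the variant easier to trace.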

 

Q: Why don't I see any co-located variants when using species X?

Ensembl only has variation databases for a subset of all Ensembl species - see this document for details.

 

Q: Why do I see multiple known variants mapped to my input variant?

VEP compares your input to known variants from the Ensembl variation database. In some cases one input variant can match multiple known variants:

  • Germline variants from dbSNP and somatic mutations from COSMIC may be found at the same locus
  • Some sources, e.g. HGMD, do not provide public access to allele-specific data, so an HGMD variant with unknown alleles may colocate with one from dbSNP with known alleles
  • Multiple alternate alleles from your input may match different variants as they are described in dbSNP
See here for a full discussion.

 

Q: VEP is not assigning a frequency to my input variant - why?

VEP's cache contains frequency data only for variants and alleles imported into Ensembl's variation database. See here for a full discussion.

 

Q: Why do I see so many lines of output for each variant in my input?

While it would be convenient to have a simple, one-word answer to the question "What is the consequence of this variant?", in reality biology is not this simple! Many genes have more than one transcript, so VEP provides a prediction for each transcript that a variant overlaps. VEP has options to help select results according to your requirements: the --canonical and --ccds options indicate which transcripts are canonical and which belong to the CCDS set respectively, while --pick, --per_gene, --summary and --most_severe give a more summary-level assessment per variant.

Furthermore, several "compound" consequences are also possible - if, for example, a variant falls in the final few bases of an exon, it may be considered to affect a splicing site, in addition to possibly affecting the coding sequence.
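
The options mentioned above could be combined as in the following sketch. The wrapper function names and the VEP variable are only illustrative, and the commands assume that a cache is installed for your species.

```shell
# Path to the VEP script; ./vep matches the other examples in this FAQ.
VEP="${VEP:-./vep}"

# Report one selected block of consequence data per variant.
vep_pick() {
    "$VEP" -i "$1" --cache --pick
}

# Report only the most severe consequence per variant.
vep_most_severe() {
    "$VEP" -i "$1" --cache --most_severe
}

# Keep every transcript, but flag canonical transcripts and CCDS members.
vep_flag_transcripts() {
    "$VEP" -i "$1" --cache --canonical --ccds
}
```

For example, vep_pick input.vcf runs ./vep -i input.vcf --cache --pick.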

 

Q: How do I reduce VEP's memory requirement?

There are a number of ways to do this:

  1. Ensure your input file is sorted by location. This can greatly reduce memory requirements and runtime.
  2. Consider reducing the buffer size. This reduces the number of variants annotated together in a batch, and can be changed in both the command line and web interfaces. Reducing the buffer size may increase runtime.
  3. Ensure you are using only the options you need, rather than --everything. Some data-rich options, such as regulatory annotation, have an impact on memory use.
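
The first two suggestions might look like this on the command line. This is a minimal sketch: the function names, the VEP variable and the output file name are illustrative, and 500 is an arbitrary buffer size below the documented default of 5000, not a recommendation.

```shell
# Path to the VEP script; ./vep matches the other examples in this FAQ.
VEP="${VEP:-./vep}"
TAB="$(printf '\t')"

# Print the VCF header unchanged, then the records ordered by
# chromosome name and numeric position.
sort_vcf() {
    grep '^#' "$1"
    grep -v '^#' "$1" | LC_ALL=C sort -t "$TAB" -k1,1 -k2,2n
}

# Sort the input, then run VEP on it with a smaller buffer.
vep_low_memory() {
    sort_vcf "$1" > "$1.sorted.vcf"
    "$VEP" -i "$1.sorted.vcf" --cache --buffer_size 500
}
```

Chromosome names are ordered as plain text here (1, 10, 11, 2 and so on), which still keeps the variants on each chromosome together and in position order.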

 

Q: How do I cite VEP?

If you use VEP, please cite our updated publication so we can continue to support VEP development.


Web VEP questions

Q: How do I access the web version of the Variant Effect Predictor?

You can find the web VEP on the Tools page.

 

Q: Why is the output I get for my input file different when I use the web VEP and command line VEP?

Ensure that the arguments you pass to the command line script are equivalent to the options you selected in the web version. If you are sure the outputs still differ, please report it on the ensembl-dev mailing list.

 

Q: Is there a tutorial for web VEP?

Yes, see our latest tutorial Annotating and prioritizing genomic variants using the Ensembl Variant Effect Predictor — A tutorial for more information on using the Ensembl VEP web interface.


Command line VEP questions

Q: How can I make VEP run faster?

There are a number of factors that influence how fast VEP runs. Have a look at our handy guide for tips on improving VEP runtime.

 

Q: Why do I see "N" as the reference allele in my HGVS strings?

Q: Why do I see the following error (or similar) in my VEP output?

substr outside of string at /nfs/users/nfs_w/wm2/Perl/ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/Sequence.pm line 511.
Use of uninitialized value $ref_allele in string eq at /nfs/users/nfs_w/wm2/Perl/ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/Sequence.pm line 514.
Use of uninitialized value in concatenation (.) or string at /nfs/users/nfs_w/wm2/Perl/ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/Sequence.pm line 643.

Both of these problems are usually seen when using a FASTA file for retrieving sequence. There are a couple of steps you can take to try to remedy them:

  1. The index alongside the FASTA can become corrupted. Delete [fastafile].index and re-run VEP to regenerate it. By default this file is located in your $HOME/.vep/[species]/[version]_[assembly] directory.
  2. The FASTA file itself may have been corrupted during download; delete the FASTA file and the index and re-download (you can use the VEP installer to do this).
  3. Older versions of BioPerl (1.2.3 in particular is known to have this problem) cannot properly index large FASTA files. Make sure you are using a later (>=1.6) version of BioPerl. The VEP installer installs 1.6.924 for you.

If you still see problems after taking these steps, or if you were not using a FASTA file in the first place, please contact us.
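
The first step could be scripted as below. This is only a sketch: remove_fasta_index is a hypothetical helper, and it assumes the index sits next to the FASTA file under the name [fastafile].index as described above.

```shell
# Hypothetical helper: remove the index stored next to a FASTA file so
# that VEP regenerates it on the next run. The FASTA file itself is kept.
# Usage: remove_fasta_index /path/to/fastafile
remove_fasta_index() {
    rm -f -- "$1.index"
}
```

Call it with the full path to the FASTA file in your $HOME/.vep/[species]/[version]_[assembly] directory, then re-run VEP.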

 

Q: Why do I see the following warning?

WARNING: Chromosome 21 not found in annotation sources or synonyms on line 160

This can occur if the chromosome names differ between your input variants and any annotation source that you are using (cache, database, GFF/GTF file, FASTA file, custom annotation file). To circumvent this you may provide VEP with a synonyms file. A synonyms file is included in VEP's cache files, so if you have one of these for your species you can use it as follows:

./vep -i input.vcf --cache --synonyms ~/.vep/homo_sapiens/111_GRCh38/chr_synonyms.txt

The file consists of lines containing pairs of tab-separated synonyms. Order is not important as synonyms can be used in both "directions".
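
If your species has no cache, a synonyms file in the format just described could be written by hand. The sketch below is illustrative only: the function names, the VEP variable and the chr-prefixed names are assumptions, so replace the pairs with the names used in your own files.

```shell
# Path to the VEP script; ./vep matches the other examples in this FAQ.
VEP="${VEP:-./vep}"

# Write one tab-separated pair of synonyms per line to the given file.
write_synonyms() {
    printf '21\tchr21\n'  > "$1"
    printf '22\tchr22\n' >> "$1"
    printf 'MT\tchrM\n'  >> "$1"
}

# Run VEP with the synonyms file given as the second argument.
vep_with_synonyms() {
    "$VEP" -i "$1" --cache --synonyms "$2"
}
```

For example, write_synonyms my_synonyms.txt followed by vep_with_synonyms input.vcf my_synonyms.txt.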

 

Q: Can I get gnomAD exomes and genomes frequencies in VEP?

Yes, see this guide.

 

Q: Why do I see the following error?

Could not connect to database homo_sapiens_core_63_37 as user anonymous using [DBI:mysql:database=homo_sapiens_core_63_37;host=ensembldb.ensembl.org;port=5306] as a locator:
Unknown MySQL server host 'ensembldb.ensembl.org' (2) at $HOME/src/ensembl/modules/Bio/EnsEMBL/DBSQL/DBConnection.pm line 290.

-------------------- EXCEPTION --------------------
MSG: Could not connect to database homo_sapiens_core_63_37 as user anonymous using [DBI:mysql:database=homo_sapiens_core_63_37;host=ensembldb.ensembl.org;port=5306] as a locator:
Unknown MySQL server host 'ensembldb.ensembl.org' (2)

By default VEP is configured to connect to the public MySQL server at ensembldb.ensembl.org. Occasionally the server may drop the connection to your process, which causes this error. This can happen when the server is busy or because of network problems. Consider using a local copy of the database, or the caching system.
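
A run that relies on the caching system alone might look like the sketch below. It assumes that a cache for your species and release is already installed; the function name and the VEP variable are illustrative.

```shell
# Path to the VEP script; ./vep matches the other examples in this FAQ.
VEP="${VEP:-./vep}"

# Read annotation from the local cache without opening any database connection.
vep_offline() {
    "$VEP" -i "$1" --cache --offline
}
```

For example, vep_offline input.vcf runs ./vep -i input.vcf --cache --offline.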

 

Q: Can I use VEP on Windows?

Yes - see the documentation for a few different ways to get VEP running on Windows.

 

Q: Can I use VEP with custom species and assemblies not available in Ensembl?

Yes - you can run VEP on any data you have by providing a custom GFF/GTF annotation and FASTA file, like so:

./vep -i input.vcf --gff data.gff.gz --fasta genome.fa.gz
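
As we understand the custom annotation requirements, the GFF/GTF file should be sorted by position, compressed with bgzip and indexed with tabix before it is passed to VEP. The sketch below shows one way to prepare data.gff; the function names are illustrative, and bgzip and tabix (from HTSlib) are assumed to be installed.

```shell
TAB="$(printf '\t')"

# Drop comment lines, then order records by sequence name, start and end.
sort_gff() {
    grep -v '^#' "$1" | LC_ALL=C sort -t "$TAB" -k1,1 -k4,4n -k5,5n
}

# Write a sorted, bgzip-compressed copy next to the input and index it.
prepare_gff() {
    sort_gff "$1" | bgzip -c > "$1.gz"
    tabix -p gff "$1.gz"
}
```

Running prepare_gff data.gff should leave data.gff.gz and its tabix index ready for the command above.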

 

Q: Can I download all of the SIFT and/or PolyPhen predictions?

The Ensembl Variation database and the human VEP cache file contain precalculated SIFT and PolyPhen-2 predictions for every possible amino acid change in every translated protein product in Ensembl. Since these data are huge, we store them in a compressed format. The best approach to extract them is to use our Perl API.

The format in which the data are stored in our database is described here.

The simplest way to access these matrices is to use an API script to fetch a ProteinFunctionPredictionMatrix for your protein of interest, then call its 'get_prediction' method to get the score for a particular position and amino acid, looping over all possible amino acids at that position. The class is documented in detail in the API documentation here.

You would need to work out which peptide position your codon maps to, but there are methods in the TranscriptVariation class that should help you (probably translation_start and translation_end).