Variant Effect Predictor FAQ
Q: Why don't I see any co-located variations when using species X?
A: Ensembl only has variation databases for a subset of all Ensembl species - see this document for details.
Q: Why has my insertion/deletion variant encoded in VCF disappeared from the VEP output?
A: Ensembl treats unbalanced variants differently to VCF, so your variant hasn't disappeared; it may have just changed slightly! You can solve this by giving your variants a unique identifier in the third column of the VCF file. See here for a full discussion.
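As a rough illustration of filling in the third (ID) column, the sketch below gives every record whose ID is "." an identifier built from its own chromosome, position and alleles. The function name add_vcf_ids and the CHROM_POS_REF_ALT naming scheme are our own assumptions for this example, not something VEP requires. The identifiers are only unique if no record appears twice in the file.

```shell
# Hypothetical helper: give every record with a missing ID (".") an identifier
# built from its own coordinates and alleles. Header lines pass through unchanged.
add_vcf_ids() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }
       $3 == "." { $3 = $1 "_" $2 "_" $4 "_" $5 }
       { print }' "$1"
}

# Illustrative input: a single deletion with no identifier.
tmp_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmp_vcf"
printf '20\t14370\t.\tCAG\tC\t.\t.\t.\n' >> "$tmp_vcf"

# Print the VCF with the ID column filled in.
add_vcf_ids "$tmp_vcf"
```

Redirect the output to a new file and pass that file to VEP with -i in place of the original.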
Q: Why do I see so many lines of output for each variant in my input?
A: While it can be convenient to search for an easy, one-word answer to the question "What is the consequence of this variant?", in reality biology does not make it this simple! Many genes have more than one transcript, so VEP provides a prediction for each transcript that a variant overlaps. The VEP script can help here; the --canonical and --ccds options indicate which transcripts are canonical and belong to the CCDS set respectively, while --pick, --per_gene, --summary and --most_severe allow you to give a more summary-level assessment per variant.
Furthermore, several "compound" consequences are also possible - if, for example, a variant falls in the final few bases of an exon, it may be considered to affect a splicing site, in addition to possibly affecting the coding sequence.
Since we cannot possibly predict the exact biology of what will happen, what we provide is the most conservative estimate that covers all reasonable scenarios. It is up to you, the user, to interpret this information!
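To see how many lines each variant produces, you can count them in the output file. The sketch below assumes VEP's default tab-delimited output, where header lines start with "#" and the first column holds the variant identifier. The function name count_lines_per_variant and the demo rows are made up for illustration, and the demo rows show only the first three columns.

```shell
# Hypothetical helper: count output lines per variant identifier (column 1),
# skipping header lines that start with "#".
count_lines_per_variant() {
  awk -F '\t' '!/^#/ { n[$1]++ } END { for (v in n) print v "\t" n[v] }' "$1" | sort
}

# Made-up demo rows standing in for real VEP output.
demo_out=$(mktemp)
printf '#Uploaded_variation\tLocation\tAllele\n' > "$demo_out"
printf 'var1\t1:1000\tA\n' >> "$demo_out"
printf 'var1\t1:1000\tA\n' >> "$demo_out"
printf 'var2\t2:2000\tT\n' >> "$demo_out"

# Print each identifier with its line count.
count_lines_per_variant "$demo_out"
```

Running the same count on output produced with --pick should show the effect of that option on the number of lines per variant.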
Web VEP questions
Q: How do I access the web version of the Variant Effect Predictor?
A: You can find the web VEP on the Tools page.
Q: Why is the output I get for my input file different when I use the web VEP and the VEP script?
A: Ensure that you are passing equivalent arguments to the script that you are using in the web version. If you are sure this is still a problem, please report it on the ensembl-dev mailing list.
VEP script questions
Q: How can I make VEP run faster?
A: There are a number of factors that influence how fast VEP runs. Have a look at our handy guide for tips on improving VEP runtime.
Q: Why do I see "N" as the reference allele in my HGVS strings?
Q: Why do I see the following error (or similar) in my VEP output?
substr outside of string at /nfs/users/nfs_w/wm2/Perl/ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/Sequence.pm line 511.
Use of uninitialized value $ref_allele in string eq at /nfs/users/nfs_w/wm2/Perl/ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/Sequence.pm line 514.
Use of uninitialized value in concatenation (.) or string at /nfs/users/nfs_w/wm2/Perl/ensembl-variation/modules/Bio/EnsEMBL/Variation/Utils/Sequence.pm line 643.
A: Both of these problems are usually seen when using a FASTA file for retrieving sequence. There are a few steps you can take to try to remedy them:
- The index alongside the FASTA can become corrupted. Delete [fastafile].index and re-run VEP to regenerate it. By default this file is located in your $HOME/.vep/[species]/[version]_[assembly] directory.
- The FASTA file itself may have been corrupted during download; delete the FASTA file and the index and re-download (you can use the VEP installer to do this).
- Older versions of BioPerl (1.2.3 in particular is known to have this problem) cannot properly index large FASTA files. Make sure you are using a later (>=1.6) version of BioPerl. The VEP installer installs 1.6.1 for you.
If you still see problems after taking these steps, or if you were not using a FASTA file in the first place, please contact us.
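If you suspect a damaged download, one quick check before re-downloading is gzip's built-in integrity test. This sketch assumes your FASTA file is gzip-compressed, which may not be true of your copy. The function name fasta_gz_ok is hypothetical, and the demo file is a tiny stand-in rather than a real genome.

```shell
# Hypothetical helper: succeed only if the gzip-compressed file passes
# the integrity test performed by "gzip -t".
fasta_gz_ok() {
  gzip -t < "$1" 2>/dev/null
}

# Tiny stand-in archive for demonstration; use the path to your own FASTA instead.
demo_fa=$(mktemp)
printf '>1\nACGT\n' | gzip -c > "$demo_fa"

if fasta_gz_ok "$demo_fa"; then
  echo "archive is intact"
else
  echo "archive is damaged: delete it and its index, then re-download"
fi
```

A passing test only shows that the compressed stream is intact; it does not rule out a stale or corrupted index file.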
Q: Why do I see the following warning?
WARNING: Chromosome 21 not found in annotation sources or synonyms on line 160
A: This can occur if the chromosome names differ between your input variants and any annotation source that you are using (cache, database, GFF/GTF file, FASTA file, custom annotation file). To circumvent this you may provide VEP with a synonyms file. A synonyms file is included in VEP's cache files, so if you have one of these for your species you can use it as follows:
./vep -i input.vcf -cache -synonyms ~/.vep/homo_sapiens/89_GRCh38/chr_synonyms.txt
The file consists of lines containing pairs of tab-separated synonyms. Order is not important as synonyms can be used in both "directions".
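If you need to write a synonyms file yourself, a minimal sketch looks like the following. The chromosome names are illustrative only, and check_synonyms is a hypothetical helper that confirms every line holds exactly two tab-separated fields.

```shell
# Write a small synonyms file: one pair of names per line, separated by a tab.
# The names below are examples; use the names that appear in your own files.
syn_file=$(mktemp)
printf '21\tchr21\n' >  "$syn_file"
printf 'MT\tchrM\n'  >> "$syn_file"

# Hypothetical sanity check: report any line without exactly two fields
# and return a non-zero status if one is found.
check_synonyms() {
  awk -F '\t' 'NF != 2 { bad = 1; print "line " NR ": expected 2 fields, got " NF }
               END { exit bad }' "$1"
}

check_synonyms "$syn_file" && echo "synonyms file looks well-formed"
```

The resulting file can then be passed to VEP with the -synonyms option shown above.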
Q: Can I get gnomAD allele frequencies in VEP?
A: Yes, see this guide.
Q: Why do I see the following error?
Could not connect to database homo_sapiens_core_63_37 as user anonymous using [DBI:mysql:database=homo_sapiens_core_63_37;host=ensembldb.ensembl.org;port=5306] as a locator: Unknown MySQL server host 'ensembldb.ensembl.org' (2) at $HOME/src/ensembl/modules/Bio/EnsEMBL/DBSQL/DBConnection.pm line 290.
-------------------- EXCEPTION --------------------
MSG: Could not connect to database homo_sapiens_core_63_37 as user anonymous using [DBI:mysql:database=homo_sapiens_core_63_37;host=ensembldb.ensembl.org;port=5306] as a locator: Unknown MySQL server host 'ensembldb.ensembl.org' (2)
A: By default the VEP script is configured to connect to the public MySQL server at ensembldb.ensembl.org. Occasionally the server may break connection with your script, which causes this error. This can happen when the server is busy, or due to various network issues. Consider using a local copy of the database, or the caching system.
Q: Can I use the VEP script on Windows?
A: Yes - see the documentation for a few different ways to get VEP running on Windows.
Q: Can I download all of the SIFT and/or PolyPhen predictions?
A: The Ensembl Variation database and the human VEP cache file contain precalculated SIFT and PolyPhen predictions for every possible amino acid change in every translated protein product in Ensembl. Since these data are huge, we store them in a compressed format. The best approach to extract them is to use our Perl API.
The format in which the data are stored in our database is described here.
The simplest way to access these matrices is to use an API script to fetch a ProteinFunctionPredictionMatrix for your protein of interest and then call its 'get_prediction' method to get the score for a particular position and amino acid, looping over all possible amino acids for your position. There is some detailed documentation on this class in the API documentation here.
You would need to work out which peptide position your codon maps to, but there are methods in the TranscriptVariationAllele class that should help you (probably translation_start and translation_end).
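If you already have a 1-based coordinate within the coding sequence rather than a peptide position, the arithmetic is simple: codon k covers CDS bases 3k-2 to 3k. The helper below is a hypothetical sketch of that calculation only. It is not part of the Ensembl API and does not replace the API methods mentioned above, which also handle the mapping from genomic coordinates.

```shell
# Hypothetical helper: convert a 1-based CDS coordinate to the 1-based
# peptide position of the codon that contains it, i.e. ceil(cds_pos / 3).
cds_to_peptide_pos() {
  echo $(( ($1 + 2) / 3 ))
}

# CDS base 301 is the first base of codon 101.
cds_to_peptide_pos 301
```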