Variant Effect Predictor Tutorial
Have you downloaded VEP yet? Use git to clone it:
git clone https://github.com/Ensembl/ensembl-vep cd ensembl-vep
VEP uses "cache files" or a remote database to read genomic data. Using cache files gives the best performance - let's set one up using the installer:
perl INSTALL.pl Hello! This installer is configured to install v100 of the Ensembl API for use by VEP. It will not affect any existing installations of the Ensembl API that you may have. It will also download and install cache files from Ensembl's FTP server. Checking for installed versions of the Ensembl API...done It looks like you already have v100 of the API installed. You shouldn't need to install the API Skip to the next step (n) to install cache files Do you want to continue installing the API (y/n)?
If you haven't yet installed the API, type "y" followed by enter, otherwise type "n" (perhaps if you ran the installer before). At the next prompt, type "y" to install cache files
Do you want to continue installing the API (y/n)? n - skipping API installation VEP can either connect to remote or local databases, or use local cache files. Cache files will be stored in /nfs/users/nfs_w/wm2/.vep Do you want to install any cache files (y/n)? y Downloading list of available cache files The following species/files are available; which do you want (can specify multiple separated by spaces): 1 : ailuropoda_melanoleuca_vep_100_ailMel1.tar.gz 2 : anas_platyrhynchos_vep_100_BGI_duck_1.0.tar.gz 3 : anolis_carolinensis_vep_100_AnoCar2.0.tar.gz ... 42 : homo_sapiens_vep_100_GRCh38.tar.gz ... ?
Type "42" (or the relevant number for homo_sapiens and GRCh38) to install the cache for the latest human assembly. This will take a little while to download and unpack! By default VEP assumes you are working in human; it's easy to switch to any other species using --species [species].
? 42 - downloading ftp://ftp.ensembl.org/pub/release-100/variation/vep/homo_sapiens_vep_100_GRCh38.tar.gz - unpacking homo_sapiens_vep_100_GRCh38.tar.gz Success
By default VEP installs cache files in a folder in your home area ($HOME/.vep); you can easily change this using the -d flag when running the installer. See the installer documentation for more details.
VEP needs some input containing variant positions to run. In their most basic form, this should just be a chromosomal location and a pair of alleles (reference and alternate). VEP can also use common formats such as VCF and HGVS as input. Have a look at the Data formats page for more information.
We can now use our cache file to run VEP on the supplied example file examples/homo_sapiens_GRCh38.vcf, which is a VCF file containing variants from the 1000 Genomes Project, remapped to GRCh38:
./vep -i examples/homo_sapiens_GRCh38.vcf --cache 2013-07-31 09:17:54 - Read existing cache info 2013-07-31 09:17:54 - Starting... ERROR: Output file variant_effect_output.txt already exists. Specify a different output file with --output_file or overwrite existing file with --force_overwrite
You may see this error message if you've already run VEP in the same directory. VEP tries not to trample over your existing files unless you tell it to. So let's tell it to using --force_overwrite
./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite
By default VEP writes to a file named "variant_effect_output.txt" - you can change this file name using -o. Let's have a look at the output.
head variant_effect_output.txt ## ENSEMBL VARIANT EFFECT PREDICTOR v100.0 ## Output produced at 2017-03-21 14:51:27 ## Connected to homo_sapiens_core_100_38 on ensembldb.ensembl.org ## Using cache in /homes/user/.vep/homo_sapiens/100_GRCh38 ## Using API version 100, DB version 100 ## polyphen version 2.2.2 ## sift version sift5.2.2 ## COSMIC version 78 ## ESP version 20141103 ## gencode version GENCODE 25 ## genebuild version 2014-07 ## HGMD-PUBLIC version 20162 ## regbuild version 16 ## assembly version GRCh38.p7 ## ClinVar version 201610 ## dbSNP version 147 ## Column descriptions: ## Uploaded_variation : Identifier of uploaded variant ## Location : Location of variant in standard coordinate format (chr:start or chr:start-end) ## Allele : The variant allele used to calculate the consequence ## Gene : Stable ID of affected gene ## Feature : Stable ID of feature ## Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature ## Consequence : Consequence type ## cDNA_position : Relative position of base pair in cDNA sequence ## CDS_position : Relative position of base pair in coding sequence ## Protein_position : Relative position of amino acid in protein ## Amino_acids : Reference and variant amino acids ## Codons : Reference and variant codon sequence ## Existing_variation : Identifier(s) of co-located known variants ## Extra column keys: ## IMPACT : Subjective impact classification of consequence type ## DISTANCE : Shortest distance from variant to transcript ## STRAND : Strand of the feature (1/-1) ## FLAGS : Transcript quality flags #Uploaded_variation Location Allele Gene Feature Feature_type Consequence ... rs7289170 22:17181903 G ENSG00000093072 ENST00000262607 Transcript synonymous_variant ... rs7289170 22:17181903 G ENSG00000093072 ENST00000330232 Transcript synonymous_variant ...
The lines starting with "#" are header or meta information lines. The final one of these (highlighted in blue above) gives the column names for the data that follows. To see more information about VEP's output format, see the Data formats page.
We can see two lines of output here, both for the uploaded variant named rs7289170. In many cases, a variant will fall in more than one transcript. Typically this is where a single gene has multiple splicing variants. Here our variant has a consequence for the transcripts ENST00000262607 and ENST00000330232.
In the consequence column, we can see the consequence term synonymous_variant. This is terms forms part of an ontology for describing the effects of sequence variants on genomic features, produced by the Sequence Ontology (SO). See our predicted data page for a guide to the consequence types that VEP and Ensembl uses.
Let's try something a little more interesting. SIFT is an algorithm for predicting whether a given change in a protein sequence will be deleterious to the function of that protein. VEP can give SIFT predictions for most of the missense variants that it predicts. To do this, simply add --sift b (the b means we want both the prediction and the score):
./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite --sift b
SIFT calls variants either "deleterious" or "tolerated". We can use the VEP's filtering tool to find only those that SIFT considers deleterious:
./filter_vep -i variant_effect_output.txt -filter "SIFT is deleterious" | grep -v "##" | head -n5 #Uploaded_variation Location Allele Gene Feature ... Extra rs2231495 22:17188416 C ENSG00000093072 ENST00000262607 ... SIFT=deleterious(0.05) rs2231495 22:17188416 C ENSG00000093072 ENST00000399837 ... SIFT=deleterious(0.05) rs2231495 22:17188416 C ENSG00000093072 ENST00000399839 ... SIFT=deleterious(0.05) rs115736959 22:19973143 A ENSG00000099889 ENST00000263207 ... SIFT=deleterious(0.01)
Note that the SIFT score appears in the "Extra" column, as a key/value pair. This column can contain multiple key/value pairs depending on the options you give to VEP. See the Data formats page for more information on the fields in the Extra column.
You can also configure how VEP writes its output using the --fields flag.
You'll also see that we have multiple results for the same gene, ENSG00000093072. Let's say we're only interested in what is considered the canonical transcript for this gene (--canonical), and that we want to know what the commonly used gene symbol from HGNC is for this gene (--symbol). We can also use a UNIX pipe to pass the output from VEP directly into the filtering tool:
./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite --sift b --canonical --symbol --tab --fields Uploaded_variation,SYMBOL,CANONICAL,SIFT -o STDOUT | \ ./filter_vep --filter "CANONICAL is YES and SIFT is deleterious" ... #Uploaded_variation SYMBOL CANONICAL SIFT rs2231495 CECR1 YES deleterious(0.05) rs115736959 ARVCF YES deleterious(0.01) rs116398106 ARVCF YES deleterious(0) rs116782322 ARVCF YES deleterious(0) ... ... ... ... rs115264708 PHF21B YES deleterious(0.03)
So now we can see all of the variants that have a deleterious effect on canonical transcripts, and the symbol for their genes. Nice!
For species with an Ensembl database of variants, VEP can annotate your input with identifiers and frequency data from variants co-located with your input data. For human, VEP's cache contains frequency data from 1000 Genomes, NHLBI-ESP and ExAC. Since our input file is from 1000 Genomes, let's add frequency data using --af_1kg:
./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite --af_1kg -o STDOUT | grep -v "##" | head -n2 #Uploaded_variation Location Allele Gene Feature ... Existing_variation Extra rs7289170 22:17181903 G ENSG00000093072 ENST00000262607 ... rs7289170 IMPACT=LOW;STRAND=-1;AFR_AF=0.2390;AMR_AF=0.2003;EAS_AF=0.0456;EUR_AF=0.3211;SAS_AF=0.1401
We can see frequency data for the AFR, AMR, EAS, EUR and SAS continental population groupings; these represent the frequency of the alternate (ALT) allele from our input (G in the case of rs7289170). Note that the Existing_variation column is populated by the identifier of the variant found in the VEP cache (and that it corresponds to the identifier from our input in Uploaded_variation). To retrieve only this information and not the frequency data, we could have used --check_existing (--af_1kg silently switches on --check_existing).
Over to you!
This has been just a short introduction to the capabilities of VEP - have a look through some more of the options, see them all on the command line using --help, or try using the shortcut --everything which switches on almost all available output fields! Try out the different options in the filtering tool, and if you're feeling adventurous why not use some of your own data to annotate your variants or have a go with a plugin or two.