Variation API Tutorial
This is a tutorial about getting the Ensembl Variation data programmatically, using a Perl API. To get a description about the Ensembl Variation data, please go to this page.
Introduction
This tutorial is an introduction to the Ensembl Variation API. Knowledge of the Ensembl Core API and of the coding conventions used in the Ensembl APIs is assumed.
Documentation about the Variation database schema is available here, and while not necessary for this tutorial, an understanding of the database tables may help as many of the adaptor modules are table-specific.
Below is a non exhaustive list of Ensembl Variation adaptors that are most often used:
- IndividualAdaptor to fetch Bio::EnsEMBL::Variation::Individual objects
- LDFeatureContainerAdaptor to fetch LDFeatureContainerAdaptor to fetch Bio::EnsEMBL::Variation::LDFeatureContainer objects
- PhenotypeFeatureAdaptor to fetch Bio::EnsEMBL::Variation::PhenotypeFeature objects
- PopulationAdaptor to fetch Bio::EnsEMBL::Variation::Population objects
- SampleAdaptor to fetch Bio::EnsEMBL::Variation::Sample objects
- TranscriptVariationAdaptor to fetch Bio::EnsEMBL::Variation::TranscriptVariation objects
- VariationAdaptor to fetch Bio::EnsEMBL::Variation::Variation objects
- VariationFeatureAdaptor to fetch Bio::EnsEMBL::Variation::VariationFeature objects
Only some of these adaptors will be used for illustration as part of this tutorial through commented perl scripts code.
Variation and VariationFeature
Variants occur as two different objects in the database: Variation and VariationFeature. Variation represents the variant itself, whereas VariationFeature represents the position of the variant on the genome. This is to allow for variants that map to more than one genomic position, such as those that fall within assembly exceptions, and means that one Variation may have multiple VariationFeatures attached to it.
The methods in the objects and adaptors for Variation and VariationFeature reflect this difference. The VariationAdaptor can fetch variants by their identifier, source and similar features, and the Variation object has methods to retrieve things like alleles, phenotypes, population frequencies and genotypes and source. The VariationFeatureAdaptor can fetch by genomic locus, and the VariationFeature object has methods to get locus. Both objects have methods to easily convert between them.
Getting variants by genomic region
One of the most important uses for the Variation database is to be able to get all variants in a certain region in the genome. Since you are working with a location, you need to get a VariationFeature. Below is a simple commented perl script to illustrate how to get all variants in chromosome 25 in zebrafish.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); #get the database adaptor for Slice objects my $slice_adaptor = $registry->get_adaptor('danio_rerio', 'core', 'slice'); #get chromosome 25 in zebrafish my $slice = $slice_adaptor->fetch_by_region('chromosome',25); #get adaptor to VariationFeature object my $vf_adaptor = $registry->get_adaptor('danio_rerio', 'variation', 'variationfeature'); #return ALL variations defined in $slice my $vfs = $vf_adaptor->fetch_all_by_Slice($slice); foreach my $vf (@{$vfs}){ print "Variation: ", $vf->variation_name, " with alleles ", $vf->allele_string, " in chromosome ", $slice->seq_region_name, " and position ", $vf->start, "-", $vf->end, "\n"; }
- SliceAdaptor (Core API)
- VariationFeatureAdaptor
- Slice (Core API)
- VariationFeature
Registry
Getting variants by ID
Below is a complete example on how to use the Variation API to retrieve different data from the database. In this particular example, we want to get, for a list of variant names, information about alleles and flanking sequences.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # get the different adaptors for the different objects needed my $va_adaptor = $registry->get_adaptor('human', 'variation', 'variation'); my $vf_adaptor = $registry->get_adaptor('human', 'variation', 'variationfeature'); # define variants and move through the list my @rsIds = qw(rs1367827 rs1367830); foreach my $id (@rsIds) { # get Variation object from the database using the name my $var = $va_adaptor->fetch_by_name($id); # Get the VariationFeatures and move through the list. There can be more than one VariationFeature for a variant. foreach my $vf (@{$vf_adaptor->fetch_all_by_Variation($var)}){ # print name and alleles print $vf->variation_name(),",", $vf->allele_string(),"\n"; # print sequence with the alleles in the middle print substr($var->five_prime_flanking_seq,-10) , "[",$vf->allele_string,"]", substr($var->three_prime_flanking_seq,0,10), "\n\n"; } }
Consequences of variants
Another common use of the Variation database is to retrieve the effects that variations have on a transcript. In human, Ensembl also provides SIFT and PolyPhen predictions of the effects of non-synonymous protein changes. For a complete list of the consequence types predicted by Ensembl, click here.
Consequences are represented by a hierarchy of objects as shown below:
- TranscriptVariation - represents the overlap of a variation feature and a transcript
- TranscriptVariationAllele - the component represented by a particular allele of a variation
- OverlapConsequence - represents the consequence itself
- TranscriptVariationAllele - the component represented by a particular allele of a variation
In the example below, it is explained how to get all variants in a particular human transcript and see what is the effect of that variant in the transcript, including the PolyPhen and SIFT predictions. It is also shown how to retrieve the Sequence Ontology terms for the consequences.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous', ); # get the TranscriptAdaptor and use it to get the transcript object my $stable_id = 'ENST00000393489'; my $transcript_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'transcript'); my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id); # get the TranscriptVariationAdaptor and use it to getch all variants in the transcript my $trv_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'transcriptvariation'); my @trvs = @{ $trv_adaptor->fetch_all_by_Transcripts([$transcript]) }; # Move through the TranscriptVariation objects and get the TranscriptVariationAlleles for each while (my $tv = shift @trvs) { my $tvas = $tv->get_all_alternate_TranscriptVariationAlleles(); foreach my $tva (@{$tvas}) { # Get all the consequences my @ocs = @{ $tva->get_all_OverlapConsequences() }; # Create an array of consequences and add the terms to the array my @consequences; foreach my $oc (@ocs) { push @consequences, $oc->SO_term; } # print the consequence and SIFT/PolyPhen predictions for the variants print "Variant ", $tv->variation_feature->variation_name, " allele ", $tva->variation_feature_seq, " has consequence ", join(",", @consequences); if (defined($tva->sift_prediction)) { print " (SIFT=", $tva->sift_prediction; } if (defined($tva->polyphen_prediction)) { print ", PolyPhen=", $tva->polyphen_prediction, ")"; } print "\n"; } }
HGVS notation
HGVS notation is a popular way to express variants in terms of the transcripts and proteins they fall within. The TranscriptVariationAllele module can display the genomic, transcript or protein HGVS; the notation appears as strings that can be printed:
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # Get the variant rs699 my $rsid = 'rs699'; my $va = $registry->get_adaptor('human','variation','variation'); my $v = $va->fetch_by_name($rsid); # Get the VariationFeatures, loop through getting the TranscriptVariations, then loop through these getting the TranscriptVariationAlleles my @vfs = @{ $v->get_all_VariationFeatures }; foreach my $vf (@vfs) { my @tvs = @{ $vf->get_all_TranscriptVariations }; foreach my $tv (@tvs) { my $tvas = $tv->get_all_alternate_TranscriptVariationAlleles(); # Loop through the alleles getting the HGVS notation foreach my $tva (@{$tvas}) { my $genomichgvs = $tva->hgvs_genomic; my $cdshgvs = $tva->hgvs_transcript; my $pephgvs = $tva->hgvs_protein; print $genomichgvs, ", ", $cdshgvs, ", ", $pephgvs, "\n"; } } }
Ensembl VEP
It is also possible to calculate consequence types for variants not currently in the Ensembl database. In the example below, we create a VariationFeature object from scratch, given a slice and VariationFeatureAdaptor object, and use this to calculate the consequence of our new SNP. Here we use the aggregation methods in the TranscriptVariation object instead of retrieving each TranscriptVariationAllele and OverlapConsequence object.
We have developed a tool called the Ensembl Variant Effect Predictor (VEP) which uses this API. It is optimised to calculate consequences for large numbers of variants and has a large number of additional analysis options.
use strict; use Bio::EnsEMBL::Registry; # get registry my $reg = 'Bio::EnsEMBL::Registry'; $reg->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); my $vfa = $reg->get_adaptor('human', 'variation', 'variationfeature'); my $sa = $reg->get_adaptor('human', 'core', 'slice'); # get a slice for the new feature to be attached to my $slice = $sa->fetch_by_region('chromosome', 13); # create a new VariationFeature object my $new_vf = Bio::EnsEMBL::Variation::VariationFeature->new( -start => 114268626, -end => 114268626, -slice => $slice, # the variation must be attached to a slice -allele_string => 'A/C', # the first allele should be the reference allele -strand => 1, -map_weight => 1, -adaptor => $vfa, # we must attach a variation feature adaptor -variation_name => 'newSNP', ); # get the consequence types my $cons = $new_vf->get_all_TranscriptVariations(); print "Number of TranscriptVariations: " . scalar(@{$cons}) . "\n"; foreach my $con (@{$cons}) { foreach my $string (@{$con->consequence_type}) { print "Variation ", $new_vf->variation_name, " at position ", $new_vf->seq_region_start, " on chromosome ", $new_vf->seq_region_name, " has consequence ", $string, " in transcript ", $con->transcript->stable_id, "\n"; } }
- SliceAdaptor (Core API)
- VariationFeatureAdaptor
- Slice (Core API)
- Transcript (Core API)
- VariationFeature
- TranscriptVariation
Failed variants
In the Variation pipeline, quality checks are performed in order to identify variants that contain errors or inconsistencies. These variants are flagged as being failed but all related data is kept in the database and can be retrieved via the API or the web interface.
The default API behaviour when fetching multiple objects is not to return data for variants that have been flagged as failed but this can be modified in the Bio::EnsEMBL::Variation::DBSQL::DBAdaptor module by setting the include_failed_variations flag. This will affect all adaptors that have been created (and will be created) using the connection until the behaviour is explicitly changed again (or the connection to the database is closed). It is also good to clear the cache if one was used and the same slice is fetched.
Below is a simple commented perl script to illustrate how to get data related to variants that have been flagged as failed.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # Get the database adaptor for Slice objects for human my $slice_adaptor = $registry->get_adaptor('human', 'core', 'slice'); # Get a slice from chromosome 6 my $slice = $slice_adaptor->fetch_by_region('chromosome','6',133017695,133161157); # Get adaptor to VariationFeature object my $vf_adaptor = $registry->get_adaptor('human', 'variation', 'variationfeature'); # First, get all variations on the slice. The default behaviour is to exclude variants that have been flagged as failed my $vfs = $vf_adaptor->fetch_all_by_Slice($slice); # Count the number of variants returned print "There are " . scalar(@{$vfs}) . " variants in the " . $slice->display_id() . " region.\n"; # Now, tell the DBAdaptor that we want the failed variations as well. We get the DBAdaptor via the db() method in our adaptor $vf_adaptor->db->include_failed_variations(1); $vf_adaptor->clear_cache(); # Again, fetch all variants on the slice but this time include the failed variants $vfs = $vf_adaptor->fetch_all_by_Slice($slice); # Count the number of variants returned, this number is likely to be higher than when failed variants were omitted print "There are " . scalar(@{$vfs}) . " variants in the " . $slice->display_id() . " region.\n";
There are a few methods to query the failed status of a variant. In the code snippet below we loop over the variants we got and find out how many are flagged as failed and why.
while (my $vf = shift @{$vfs}) { # Check whether the variation for this variation feature is failed if ($vf->variation->is_failed()) { print $vf->display_id, " failed because ", join(", ", @{ $vf->variation->get_all_failed_descriptions() }), ".\n" } }
NOTE: some fetch methods are unaffected by the include_failed_variations flag in DBAdaptor. These are methods that return data related to specific variants, rather than a general batch fetch. For example, the fetch_by_dbID and fetch_all_by_VariationFeatures methods in TranscriptVariationAdaptor will not be affected by the flag whereas the fetch_all_by_Transcripts method will be.
Failed alleles
Even though a variant has not been failed in the Variation pipeline, some of the alleles associated with it may have been. Variants having failed alleles are not themselves flagged as failed but we can query them to see if any associated alleles have been flagged as failed.
foreach my $vf (@{$vfs}) { # Check whether the variation for this variation feature has failed alleles if ($vf->variation->has_failed_alleles()) { print "Variation " . $vf->variation_name() . " has the following failed alleles:\n"; # Get all the alleles associated with the variation and check which ones have failed and why foreach my $allele (@{$vf->variation->get_all_Alleles()}) { # Check if the allele is flagged as failed if ($allele->is_failed()) { # Get the reason why print $allele->allele, " has been flagged as failed because ", $allele->failed_description(), "\n"; } } } }
- SliceAdaptor (Core API)
- VariationFeatureAdaptor
- Slice (Core API)
- VariationFeature
- Variation
- Allele
Phenotype annotations
The Ensembl Variation API provides some methods to retrieve phenotype data associated with various Ensembl object types (variations, structural variations, genes, and QTLs). These data, stored into the Ensembl Variation database, are imported from a variety of sources (e.g. NHGRI-EBI GWAS catalog, OMIM, UniProt, EGA, ...).
There are two phenotype objects: Phenotype and PhenotypeFeature. The Phenotype object represents the observable phenotype, and contains the name and description of the phenotype, along with phenotype ontology terms. The PhenotypeFeature object represents the link between the Phenotype, and genomic features such as variants, genes and QTLs. The PhenotypeFeature links to the Phenotype, links to the variants/genes/QTLs and inherits location methods from the Feature object in the Core API. Since the PhenotypeFeature object has the name and description, you only need to use the Phenotype object when you're looking to use phenotype ontologies.
The following script starts with a variant name and uses the PhenotypeFeatures to get associated phenotypes:
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # Fetch a variation object my $var_adaptor = $registry->get_adaptor('human', 'variation', 'variation'); my $var = $var_adaptor->fetch_by_name('rs1421085'); # Fetch all the phenotype features associated with the variation my $pf_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'phenotypefeature'); foreach my $pf (@{$pf_adaptor->fetch_all_by_Variation($var)}) { print "Variation ", $pf->variation_names, " is associated with the phenotype '", $pf->phenotype->description, "' in source ", $pf->source_name; if (defined($pf->p_value)) { print " with a p-value of ",$pf->p_value ; } print ".\n"; if (defined($pf->risk_allele)){ print "The risk allele is ", $pf->risk_allele, ".\n" ; } }
It is also possible to retrieve all features associated with a phenotype description and an optional source name.
my $source_name = 'NHGRI-EBI GWAS catalog'; my $phenotype = 'Cardiac hypertrophy'; my $pf_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'phenotypefeature'); foreach my $pf (@{$pf_adaptor->fetch_all_by_phenotype_description_source_name($phenotype, $source_name)}) { my $external_reference = $pf->external_reference; $external_reference =~ s/\// ID: /; print $pf->type, " ", $pf->object_id, " is associated with the phenotype '", $phenotype, "' in source ", $source_name," (", $external_reference,")\n"; }
Variant sets
Variant sets group variants that share some properties eg. from the same sequencing project, data source or with assays on the same genotyping chip.
For example, all the variants identified in the 1000 Genomes - phase 3 study have been grouped into one variation set ('1000 Genomes 3 - All').
Sets can be part of supersets or divided into subsets depending on the hierarchical relationships between them.
> For example, the set representing variations identified in the 1000 Genomes - phase 3 study is named
'1000 Genomes 3 - All' and has five subsets based on population: '1000 Genomes 3 - EUR', '1000 Genomes 3 - EAS', '1000 Genomes 3 - SAS', '1000 Genomes 3 - AMR', '1000 Genomes 3 - AFR'.
A list of all the variant sets with their description is available on the Variation Data Description page.
NOTE: The variant sets and the populations are different! e.g. The variant set associated with the European 1000 Genomes data is "1000 Genomes 3 - EUR", whereas the associated population is "1000GENOMES:phase_3:EUR".
In the Ensembl Variation API, variant sets are handled by the module VariationSet, and the accompanying adaptor module, VariationSetAdaptor, that interfaces with the underlying database. A VariationSet has a name attribute, a brief description and also a short name attribute. In the example below, we will connect to a Human Variation database, get a VariationSetAdaptor, fetch all available variant sets and print out their names and hierarchical relationships.
First, we get a connection and an adaptor:
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; # Load the registry from db $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # Get a VariationSetAdaptor on the Human Variation database my $vs_adaptor = $registry->get_adaptor('human','variation','variationset'); # Get all top-level variation sets my $top_vss = $vs_adaptor->fetch_all_top_VariationSets(); # Loop over the top level variant sets and recursively print the subsets foreach my $top_vs (@{$top_vss}) { print_set($top_vs); } # We define a function that will help us recurse over the set hierarchy and print the data sub print_set { my $set = shift; my $indent = shift || ""; # Print the set attributes printf("\%s\%s (\%s)\n",$indent,$set->name(),$set->short_name()); # Get the subsets that have the current set as immediate parent my $subsets = $set->get_all_sub_VariationSets(1); # Call the print subroutine for each of the subsets with an increased indentation foreach my $subset (@{$subsets}) { print_set($subset,$indent . " "); } }
Running the code in the example above on the Human Variation database produces a table similar to the one on the Variation Data Description page.
Using the Iterator method to handle larger datasets
In order to get objects for the variants assigned to a variation set, we use the get_Variation_Iterator method. Note that rather than returning a reference to a list of all variation objects, this returns an Iterator object. The Iterator allows you to iterate over large numbers of objects without trying to fit them all into memory at once which could otherwise easily cause your script to crash. The main methods we will use on the iterator are the has_next() and next() methods. In the example below, we fetch a variant set that contains variants that have been linked to phenotypes and print the first 10 examples. Note that getting variants from a variant set that has subsets below it automatically returns the variants from the subsets.
# Get the variant set for the phenotype-associated variants. my $vs = $vs_adaptor->fetch_by_short_name('ph_variants'); printf("\%s (\%s):\n\t\%s\n",$vs->name(),$vs->short_name(),$vs->description()); my $limit = 10; my $fetched = 0; my $it = $vs->get_Variation_Iterator(); # Get the first 10 examples and print some data from them while ($fetched < $limit && $it->has_next()) { # Print the name of the variant my $var = $it->next(); printf("\t\%s has been found to be associated with:\n",$var->name()); # Get the PhenotypeFeature objects for the variant my $pfs = $var->get_all_PhenotypeFeatures(); # Loop over the annotations and print the phenotypes foreach my $pf (@{$pfs}) { my @output = (); push @output, 'Desc: ' . $pf->phenotype->description; push @output, 'p-value: ' . $pf->p_value if (defined $pf->p_value); push @output, 'study_type: ' . $pf->study->type if (defined $pf->study && defined $pf->study->type); push @output, 'study_name: ' . $pf->study->name if (defined $pf->study && defined $pf->study->name); print join(' ', @output), "\n"; } $fetched++; }
Get variant sets by Slice
You can also use Slices to retrieve VariationFeatures belonging to a desired variant set in a particular genomic region. In the example below, we fetch all VariationFeatures discovered by the 1000 Genomes Project phase 3 data on the RIC8A gene.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; # Load the registry from db $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # Get a VariationSetAdaptor on the Human Variation database my $vs_adaptor = $registry->get_adaptor('human','variation','variationset'); # Get the variation set object my $vs = $vs_adaptor->fetch_by_short_name('1kg_3'); # Get a gene adaptor and the gene object for the RIC8A gene my $gene_adaptor = $registry->get_adaptor('human','core','gene'); my $gene = $gene_adaptor->fetch_by_display_label('RIC8A'); my $transcript = $gene->canonical_transcript; # Get the variation features on the slice belonging to the variation set my $vfs = $vs->get_all_VariationFeatures_by_Slice($transcript->feature_Slice()); # Loop over the variation features and print the variation using HGVS notation relative to the canonical transcript foreach my $vf (@{$vfs}) { my @tvs = @{ $vf->get_all_TranscriptVariations([$transcript]) }; my $tv = $tvs[0]; print $tv->transcript_stable_id; my $tvas = $tv->get_all_alternate_TranscriptVariationAlleles(); # Loop through the alleles getting the HGVS notation foreach my $tva (@{$tvas}) { my $genomichgvs = $tva->hgvs_genomic; my $cdshgvs = $tva->hgvs_transcript; my $pephgvs; if ($tva->hgvs_protein) { $pephgvs = $tva->hgvs_protein; } else { $pephgvs = "No protein HGVS"; } print $genomichgvs, ", ", $cdshgvs, ", ", $pephgvs, "\n"; } }
- GeneAdaptor (Core API)
- VariationSetAdaptor
- Gene (Core API)
- Transcript (Core API)
- Slice (Core API)
- TranscriptVariation
- VariationFeature
- VariationSet
Structural variant variation sets
The variant sets can also be retrieved from the structural variants.
# Get the StructuralVariation object my $sv_adaptor = $registry->get_adaptor( 'human', 'variation', 'structuralvariation' ); my $sv = $sv_adaptor->fetch_by_name('esv3626563'); # Get the structural variation features my $svfs = $sv->get_all_StructuralVariationFeatures(); # Loop over the structural variation features and print the corresponding variant set(s) foreach my $svf (@{$svfs}) { my $vss = $svf->get_all_VariationSets(); foreach my $vs (@{$vss}) { print "The structural variation belongs to the set ", $vs->name, "\n"; } }
Alleles and Genotypes
Requirement for 1000 Genomes data and other large genotyping projects
Alleles and frequencies
Many variants available through the Ensembl Variation API have allele frequency information. Each allele object associated with a variant represents an observation of a variant allele in a given population, and may have an associated frequency. Populations are also represented as objects.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); my $variation_adaptor = $registry->get_adaptor('mus_musculus', 'variation', 'variation'); my $variation = $variation_adaptor->fetch_by_name('rs4224826'); my $alleles = $variation->get_all_Alleles(); foreach my $allele (@{$alleles}) { next unless (defined $allele->population); my $allele_string = $allele->allele; my $frequency = $allele->frequency || 'NA'; my $population_name = $allele->population->name; printf("Allele %s has frequency: %s in population %s.\n", $allele_string, $frequency, $population_name); }
Retrieve external data (e.g. 1000 Genomes phase 3)
Sample genotypes
Many variants in the Ensembl Variation database, especially those genotyped in the 1000 Genomes projects, have associated genotypes. Genotypes are stored per sample as SampleGenotype objects, and are associated with a Variation and Sample object. They can be retrieved from a SampleGenotypeAdaptor, or directly from a Variation object.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation'); # OPTIONAL: uncomment the following line to retrieve 1000 genomes phase 3 data also # $variation_adaptor->db->use_vcf(1); my $variation = $variation_adaptor->fetch_by_name('rs1333049'); my $genotypes = $variation->get_all_SampleGenotypes(); foreach my $genotype (@{$genotypes}) { print "Sample ", $genotype->sample->name, " has genotype ", $genotype->genotype_string, "\n"; }
Sample genotype features
The SampleGenotypeFeatureAdaptor provides methods for retrieving SampleGenotypeFeature objects in a genomic region. A SampleGenotypeFeature stores information about the SampleGenotype and the genomic location of the SampleGenotype which can be accessed through the VariationFeature object that is stored in a SampleGenotypeFeature object. The SampleGenotypeFeatureAdaptor provides methods for retrieving only unique genotypes for a given Sample and Population and a method that returns all SampleGenotypeFeatures in a region and also stores genotypes for samples that differ from the given sample's genotype and are also different from the reference.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); my $sgfa = $registry->get_adaptor('mouse', 'variation', 'samplegenotypefeature'); $sgfa->db->use_vcf(1); my $slice_adaptor = $registry->get_adaptor('mouse', 'core', 'slice'); my $chr = 18; my $start = 68_260_187; my $end = 68_300_333; my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $start, $end); my $pa = $registry->get_adaptor('mouse', 'variation', 'population'); my $sa = $registry->get_adaptor('mouse', 'variation', 'sample'); my $population = $pa->fetch_by_name('Mouse Genomes Project'); my $samples = $sa->fetch_all_by_name('MGP:CAST/EiJ'); my $sample = $samples->[0]; my $unique_sgfs = $sgfa->fetch_all_unique_by_Slice($slice, $sample, $population); foreach my $sgf (@$unique_sgfs) { my $genotype_string = $sgf->genotype_string; my $vf = $sgf->variation_feature; my $variation_name = $vf->variation_name; my $seq_region_name = $vf->seq_region_name; my $seq_region_start = $vf->seq_region_start; my $seq_region_end = $vf->seq_region_end; my $allele_string = $vf->allele_string; print "$variation_name $genotype_string $allele_string $seq_region_name $seq_region_start $seq_region_end\n"; } my $differences_sgfs = $sgfa->fetch_all_differences_by_Slice($slice, $sample, $population); foreach my $sgf (@$differences_sgfs) { my $genotype_string = $sgf->genotype_string; my $vf = $sgf->variation_feature; my $variation_name = $vf->variation_name; my $seq_region_name = $vf->seq_region_name; my $seq_region_start = $vf->seq_region_start; my $seq_region_end = $vf->seq_region_end; my $allele_string = $vf->allele_string; print "$variation_name $genotype_string $allele_string $seq_region_name $seq_region_start $seq_region_end\n"; my $differences = $sgf->differences; foreach my $sample_name (keys %$differences) { my $genotype = $differences->{$sample_name}; print " $sample_name $genotype\n"; } }
Population genotypes
Many variants in the Ensembl Variation database, especially those genotyped in the 1000 Genomes project, have associated genotypes at the population level. Population genotypes are stored as PopulationGenotype objects, and are associated with a Variation and a Population object. They can be retrieved from a PopulationGenotypeAdaptor, or directly from a Variation object.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation'); # OPTIONAL: uncomment the following line to retrieve 1000 genomes phase 3 data also # $variation_adaptor->db->use_vcf(1); my $variation = $variation_adaptor->fetch_by_name('rs1333049'); my $genotypes = $variation->get_all_PopulationGenotypes(); foreach my $genotype (@{$genotypes}) { print "Population ", $genotype->population->name, " has genotype ", $genotype->genotype_string, " with the frequency ", $genotype->frequency, "\n"; }
LD calculation
Requirement
At present, Linkage Disequilibrium (LD) is only available for the 1000 Genomes populations in human. To calculate LD, you need to specify the population you want data for. You can fetch LD using the
If you get all r2 or D' across a region using the LDFeatureContainer, they will return a hash with the keys: variation1, variation2 (both VariationFeatures), r2 or d_prime and population_id. If you get them for a pair of variants, they will return just the score.
In the example below, it calculates the LD in a region in human chromosome 6 for a 1000 Genomes project population, but only prints when there is a high LD
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # Define the region in chromosome 6 and get the Slice my $chr = 6; my $start = 25_834_000; my $end = 25_854_000; my $slice_adaptor = $registry->get_adaptor( 'human', 'core', 'slice' ); my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $chr, $start, $end ); # Define the Population and fetch it using the PopulationAdaptor my $population_name = '1000GENOMES:phase_3:CEU'; my $population_adaptor = $registry->get_adaptor( 'human', 'variation', 'population' ); my $population = $population_adaptor->fetch_by_name( $population_name ); # Get adaptor for LDFeatureContainer object, and tell it to use the VCF file to get the 1000 Genomes genotype data my $ldFeatureContainerAdaptor = $registry->get_adaptor( 'human', 'variation', 'ldfeaturecontainer' ); $ldFeatureContainerAdaptor->db->use_vcf(1); # Retrieve all LD values in the region my $ldFeatureContainer = $ldFeatureContainerAdaptor->fetch_by_Slice( $slice, $population ); # Get all the r squared values for each pair of variants in the region and move through them foreach my $r_square (@{$ldFeatureContainer->get_all_r_square_values}){ # only print high LD, where high is defined as r2 > 0.8 if ($r_square->{r2} > 0.8){ print "High LD between variations ", $r_square->{variation1}->variation_name, "-", $r_square->{variation2}->variation_name, "\n"; } }
- Slice (Core API)
- LDFeatureContainer
- Population
- VariationFeature
Please note the following warning MSG: variation features have no pairwise ld information
is returned when there is no data in the LDFeatureContainer structure for the specified pair of VariationFeature objects. Here are some reasons why this may happen:
- the LDFeatureContainer does not span the region contained by one or both VariationFeature objects specified
- there are no genotypes for one or both variants, meaning no LD can be calculated
- one or both variants is non-variant in the specified population (i.e. has a minor allele frequency of 0)
- the estimated r2 value is lower than the cut-off threshold in the calculation software (the threshold is 0.05)
Looking at other associated data will help to narrow down the reason. If there are variants in the region spanned by the LDFeatureContainer with genotypes with a minor allele frequency greater than zero, then assume the r2 is < 0.05, which essentially means no correlation.
Specific strain information
The Ensembl Variation API has is the possibility to work with your specific strain as if it was the reference one, and compare it against others. In the example, we create a StrainSlice object for a region in the mouse strain A/J and compare it against the reference sequence.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $reg = 'Bio::EnsEMBL::Registry'; my $host= 'ensembldb.ensembl.org'; my $user= 'anonymous'; $reg->load_registry_from_db( -host => $host, -user => $user ); # Get the Slice my $slice_adaptor = $reg->get_adaptor('mouse', 'core', 'Slice'); my $slice = $slice_adaptor->fetch_by_region('chromosome', 19, 20380186, 20384187); # Get the StrainSliceAdaptor, and tell it to use the VCFs my $strain_slice_adaptor = $reg->get_adaptor('mouse', 'variation', 'StrainSlice'); $strain_slice_adaptor->db->use_vcf(1); # get strainSlice from the slice my $a_j = $strain_slice_adaptor->get_by_strain_Slice("MGP:A/J", $slice); my @differences = @{$a_j->get_all_AlleleFeatures_Slice()}; foreach my $diff (@differences){ print "Locus ", $diff->seq_region_start, "-", $diff->seq_region_end, ", A/J's alleles: ", $diff->allele_string, "\n"; }
- SliceAdaptor (Core API)
- StrainSliceAdaptor
- Slice (Core API)
- AlleleFeature
Structural variants
The Ensembl Variation database also stores information about structural variants. These data are imported from the DGVa (Database of Genomic Variants archive). Structural variants are remapped to the current genome assembly using the $slice->project() method from the Ensembl Core API, allowing for one gap in the resultant mapping.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $reg = 'Bio::EnsEMBL::Registry'; my $host= 'ensembldb.ensembl.org'; my $user= 'anonymous'; $reg->load_registry_from_db( -host => $host, -user => $user ); # Get adaptor to StructuralVariation object my $sva = $reg->get_adaptor("human", "variation", "structuralvariation"); # Get the StructuralVariation object my $sv = $sva->fetch_by_name('esv2663683'); print $sv->variation_name, " ", $sv->var_class, " (SO term: ", $sv->class_SO_term, ")\n"; # Get Study information if ($sv->study) { print "This structural variant comes from the study ", $sv->source_name, "-", $sv->study->name, ": ", $sv->study->description, "\n"; }
Like short variants, you can fetch the structural variants using a Slice.
# Get adaptor to Slice object from core my $sa = $reg->get_adaptor("human", "core", "slice"); my $slice = $sa->fetch_by_region('chromosome', 5, 1, 1000000); # Get adaptor to StructuralVariationFeature my $svfa = $reg->get_adaptor("human","variation","structuralvariationfeature"); # Get all structural variation features on the slice my $svfs = $svfa->fetch_all_by_Slice($slice); # StructuralVariationFeature objects foreach my $svf (@$svfs) { print $svf->variation_name, " ", $svf->seq_region_name, ":", $svf->seq_region_start, "-", $svf->seq_region_end, " ", $svf->var_class, " (SO term: ", $svf->class_SO_term, ")\n"; }
You can also retrieve all the supporting evidence associated with a structural variant.
# Get adaptor to StructuralVariation object from the Ensembl Variation API my $sva = $reg->get_adaptor("human", "variation", "structuralvariation"); # Get the StructuralVariation object my $sv = $sva->fetch_by_name('esv2663683'); # Get the supporting evidences associated to the structural variation my $ssvs = $sv->get_all_SupportingStructuralVariants(); foreach my $ssv (@$ssvs) { print $ssv->variation_name, " ", $ssv->var_class, " (SO term: ", $ssv->class_SO_term, ")\n"; }
Others
Fetching Iterators
Sometimes an adaptor method may return more objects than can fit into memory at once, and for such cases the Ensembl Variation API provides methods to fetch Iterator objects instead of the usual array reference. An Iterator is an object which allows you to iterate over the entire set of objects fetched by the adaptor without loading them all into memory at once, by calling the next() method to fetch each object in turn. You can tell when you have finished iterating over the set of objects by using the has_next() method which returns true when the Iterator still has objects to fetch and false when it is exhausted. These are the most commonly used Iterator methods, but there are also some other useful methods which allow you to transform and filter the set of objects in an analogous way to using map and grep on standard perl arrays, using the methods called (imaginatively) map() and grep(). Various other methods are also supported, please see the Iterator API documentation for full details.
Some example code using Iterators to fetch somatic mutations from the VariationAdaptor is listed below. After the first fetch_Iterator_somatic() call we just iterate over the first 10 somatic mutations found in the Ensembl Variation database, using next() to fetch each object, and has_next() or count to check when we are done.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $reg = 'Bio::EnsEMBL::Registry'; $reg->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous', ); my $va = $reg->get_adaptor('human', 'variation', 'variation') or die; my $somatic_iter = $va->fetch_Iterator_somatic(); my $limit = 10; my $fetched = 0; while ($fetched < $limit && $somatic_iter->has_next()) { my $mutation = $somatic_iter->next(); print $mutation->name(), "\n"; $fetched ++; }
In the example below only the variations from the 'COSMIC' set are retrieved. This is can be done either using grep() and a subroutine to search for 'COSMIC' as a source or using the VariationSet parameter retrieved via the corresponding adaptor and variation set, as in the Variant sets examples above.
my $va = $reg->get_adaptor('human', 'variation', 'variation') or die; my $cosmic_iter = $va->fetch_Iterator_somatic()->grep(sub {$_->source_name eq 'COSMIC'}); # OR using the VariationSet parameter: #my $vs_adaptor = $registry->get_adaptor('human','variation','variationset'); #my $vs = $vs_adaptor->fetch_by_short_name('ph_cosmic'); #my $cosmic_iter = $va->fetch_Iterator_by_VariationSet($vs); my $limit = 10; my $fetched = 0; while ($fetched < $limit && $cosmic_iter->has_next()) { print $cosmic_iter->next()->name(), "\n"; $fetched ++; }
In the third call to fetch_Iterator_somatic we use the map() method to return the name of each Variation rather than the object itself using an anonymous subroutine just as you would use a block for perl's map function. In the final call we combine the grep() and map() methods to return an Iterator over all the names of somatic mutations from COSMIC. The map() and grep() methods return a new Iterator and only do their transforming and filtering on each object in turn, without needing all objects in memory, they are therefore much more memory-efficient than mapping and grepping a native perl array.
my $name_iter = $va->fetch_Iterator_somatic()->map(sub {$_->name}); my $limit = 10; my $fetched = 0; while ($fetched < $limit && $name_iter->has_next()) { print $name_iter->next(), "\n"; $fetched ++; } my $cosmic_name_iter = $va->fetch_Iterator_somatic()->grep(sub {$_->source_name eq 'COSMIC'})->map(sub {$_->name}); $fetched = 0; while ($fetched < $limit && $cosmic_name_iter->has_next()) { print $cosmic_name_iter->next(), "\n"; $fetched ++; }
Further help
For additional information or help mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.