Variation API Tutorial

This is a tutorial about getting the Ensembl Variation data programmatically, using a Perl API. To get a description about the Ensembl Variation data, please go to this page.

Introduction

This tutorial is an introduction to the Ensembl Variation API. Knowledge of the Ensembl Core API and of the coding conventions used in the Ensembl APIs is assumed.

Documentation about the Variation database schema is available here, and while not necessary for this tutorial, an understanding of the database tables may help as many of the adaptor modules are table-specific.

Below is a non exhaustive list of Ensembl Variation adaptors that are most often used:

Only some of these adaptors will be used for illustration as part of this tutorial through commented perl scripts code.

Variation and VariationFeature

Variants occur as two different objects in the database: Variation and VariationFeature. Variation represents the variant itself, whereas VariationFeature represents the position of the variant on the genome. This is to allow for variants that map to more than one genomic position, such as those that fall within assembly exceptions, and means that one Variation may have multiple VariationFeatures attached to it.

The methods in the objects and adaptors for Variation and VariationFeature reflect this difference. The VariationAdaptor can fetch variants by their identifier, source and similar features, and the Variation object has methods to retrieve things like alleles, phenotypes, population frequencies and genotypes and source. The VariationFeatureAdaptor can fetch by genomic locus, and the VariationFeature object has methods to get locus. Both objects have methods to easily convert between them.

Getting variants by genomic region

One of the most important uses for the Variation database is to be able to get all variants in a certain region in the genome. Since you are working with a location, you need to get a VariationFeature. Below is a simple commented perl script to illustrate how to get all variants in chromosome 25 in zebrafish.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

#get the database adaptor for Slice objects
my $slice_adaptor = $registry->get_adaptor('danio_rerio', 'core', 'slice'); 

#get chromosome 25 in zebrafish
my $slice = $slice_adaptor->fetch_by_region('chromosome',25); 

#get adaptor to VariationFeature object
my $vf_adaptor = $registry->get_adaptor('danio_rerio', 'variation', 'variationfeature'); 

#return ALL variations defined in $slice
my $vfs = $vf_adaptor->fetch_all_by_Slice($slice); 

foreach my $vf (@{$vfs}){
  print "Variation: ", $vf->variation_name, " with alleles ", $vf->allele_string, 
        " in chromosome ", $slice->seq_region_name, " and position ", $vf->start, "-", $vf->end, "\n";
}
Used objects:
Adaptor objects
Main objects

Registry

What is this object $registry? Make sure you have defined it in all your scripts. Learn more in the general instructions page.

Getting variants by ID

Below is a complete example on how to use the Variation API to retrieve different data from the database. In this particular example, we want to get, for a list of variant names, information about alleles and flanking sequences.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# get the different adaptors for the different objects needed
my $va_adaptor = $registry->get_adaptor('human', 'variation', 'variation');
my $vf_adaptor = $registry->get_adaptor('human', 'variation', 'variationfeature');

# define variants and move through the list
my @rsIds = qw(rs1367827 rs1367830);
foreach my $id (@rsIds) {

  # get Variation object from the database using the name
  my $var = $va_adaptor->fetch_by_name($id);
  
  # Get the VariationFeatures and move through the list. There can be more than one VariationFeature for a variant.  
  foreach my $vf (@{$vf_adaptor->fetch_all_by_Variation($var)}){
    
    # print name and alleles
    print $vf->variation_name(),",", $vf->allele_string(),"\n";
    # print sequence with the alleles in the middle
    print substr($var->five_prime_flanking_seq,-10) , "[",$vf->allele_string,"]", substr($var->three_prime_flanking_seq,0,10), "\n\n";
  }
}

Used objects:

Consequences of variants

Another common use of the Variation database is to retrieve the effects that variations have on a transcript. In human, Ensembl also provides SIFT and PolyPhen predictions of the effects of non-synonymous protein changes. For a complete list of the consequence types predicted by Ensembl, click here.

Consequences are represented by a hierarchy of objects as shown below:

  • TranscriptVariation - represents the overlap of a variation feature and a transcript
    • TranscriptVariationAllele - the component represented by a particular allele of a variation
      • OverlapConsequence - represents the consequence itself

In the example below, it is explained how to get all variants in a particular human transcript and see what is the effect of that variant in the transcript, including the PolyPhen and SIFT predictions. It is also shown how to retrieve the Sequence Ontology terms for the consequences.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous',
);

# get the TranscriptAdaptor and use it to get the transcript object
my $stable_id = 'ENST00000393489';
my $transcript_adaptor = $registry->get_adaptor('homo_sapiens', 'core', 'transcript');
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);

# get the TranscriptVariationAdaptor and use it to getch all variants in the transcript
my $trv_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'transcriptvariation');
my @trvs = @{ $trv_adaptor->fetch_all_by_Transcripts([$transcript]) };

# Move through the TranscriptVariation objects and get the TranscriptVariationAlleles for each
while (my $tv = shift @trvs) {
  my $tvas = $tv->get_all_alternate_TranscriptVariationAlleles();

  foreach my $tva (@{$tvas}) {
	
	# Get all the consequences
    my @ocs = @{ $tva->get_all_OverlapConsequences() };

	# Create an array of consequences and add the terms to the array
	my @consequences;
    foreach my $oc (@ocs) {
		push @consequences, $oc->SO_term;
    }

	# print the consequence and SIFT/PolyPhen predictions for the variants
    print "Variant ", $tv->variation_feature->variation_name,
      " allele ", $tva->variation_feature_seq,
      " has consequence ", join(",", @consequences);

    if (defined($tva->sift_prediction)) {
      print " (SIFT=", $tva->sift_prediction;
    }
    if (defined($tva->polyphen_prediction)) {
      print ", PolyPhen=", $tva->polyphen_prediction, ")";
    }

    print "\n";
  }
}

HGVS notation

HGVS notation is a popular way to express variants in terms of the transcripts and proteins they fall within. The TranscriptVariationAllele module can display the genomic, transcript or protein HGVS; the notation appears as strings that can be printed:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# Get the variant rs699
my $rsid = 'rs699';
my $va = $registry->get_adaptor('human','variation','variation');
my $v = $va->fetch_by_name($rsid);

# Get the VariationFeatures, loop through getting the TranscriptVariations, then loop through these getting the TranscriptVariationAlleles
my @vfs = @{ $v->get_all_VariationFeatures };
foreach my $vf (@vfs) {
  my @tvs = @{ $vf->get_all_TranscriptVariations };
  foreach my $tv (@tvs) {  	
    my $tvas = $tv->get_all_alternate_TranscriptVariationAlleles();

    # Loop through the alleles getting the HGVS notation
    foreach my $tva (@{$tvas}) {
      my $genomichgvs = $tva->hgvs_genomic;
      my $cdshgvs = $tva->hgvs_transcript;
      my $pephgvs = $tva->hgvs_protein;
      print $genomichgvs, ", ", $cdshgvs, ", ", $pephgvs, "\n";
    }
  }
}

Ensembl VEP

It is also possible to calculate consequence types for variants not currently in the Ensembl database. In the example below, we create a VariationFeature object from scratch, given a slice and VariationFeatureAdaptor object, and use this to calculate the consequence of our new SNP. Here we use the aggregation methods in the TranscriptVariation object instead of retrieving each TranscriptVariationAllele and OverlapConsequence object.

We have developed a tool called the Ensembl Variant Effect Predictor (VEP) which uses this API. It is optimised to calculate consequences for large numbers of variants and has a large number of additional analysis options.

use strict;
use Bio::EnsEMBL::Registry;

# get registry
my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(
  -host => 'ensembldb.ensembl.org', 
  -user => 'anonymous'
);

my $vfa = $reg->get_adaptor('human', 'variation', 'variationfeature');
my $sa = $reg->get_adaptor('human', 'core', 'slice');

# get a slice for the new feature to be attached to
my $slice = $sa->fetch_by_region('chromosome', 13);

# create a new VariationFeature object
my $new_vf = Bio::EnsEMBL::Variation::VariationFeature->new(
  -start => 114268626,
  -end => 114268626,
  -slice => $slice,           # the variation must be attached to a slice
  -allele_string => 'A/C',    # the first allele should be the reference allele
  -strand => 1,
  -map_weight => 1,
  -adaptor => $vfa,           # we must attach a variation feature adaptor
  -variation_name => 'newSNP',
);

# get the consequence types
my $cons = $new_vf->get_all_TranscriptVariations();
print "Number of TranscriptVariations: " . scalar(@{$cons}) . "\n";

foreach my $con (@{$cons}) {
  foreach my $string (@{$con->consequence_type}) {
    print
      "Variation ", $new_vf->variation_name,
      " at position ", $new_vf->seq_region_start,
      " on chromosome ", $new_vf->seq_region_name,
      " has consequence ", $string,
      " in transcript ", $con->transcript->stable_id, "\n";
  }
}
Used objects:
Adaptor objects
Main objects

Failed variants

In the Variation pipeline, quality checks are performed in order to identify variants that contain errors or inconsistencies. These variants are flagged as being failed but all related data is kept in the database and can be retrieved via the API or the web interface.

The default API behaviour when fetching multiple objects is not to return data for variants that have been flagged as failed but this can be modified in the Bio::EnsEMBL::Variation::DBSQL::DBAdaptor module by setting the include_failed_variations flag. This will affect all adaptors that have been created (and will be created) using the connection until the behaviour is explicitly changed again (or the connection to the database is closed). It is also good to clear the cache if one was used and the same slice is fetched.

Below is a simple commented perl script to illustrate how to get data related to variants that have been flagged as failed.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# Get the database adaptor for Slice objects for human
my $slice_adaptor = $registry->get_adaptor('human', 'core', 'slice');
# Get a slice from chromosome 6
my $slice = $slice_adaptor->fetch_by_region('chromosome','6',133017695,133161157); 

# Get adaptor to VariationFeature object
my $vf_adaptor = $registry->get_adaptor('human', 'variation', 'variationfeature');

# First, get all variations on the slice. The default behaviour is to exclude variants that have been flagged as failed
my $vfs = $vf_adaptor->fetch_all_by_Slice($slice); 

# Count the number of variants returned
print "There are " . scalar(@{$vfs}) . " variants in the " . $slice->display_id() . " region.\n";

# Now, tell the DBAdaptor that we want the failed variations as well. We get the DBAdaptor via the db() method in our adaptor
$vf_adaptor->db->include_failed_variations(1);
$vf_adaptor->clear_cache();

# Again, fetch all variants on the slice but this time include the failed variants
$vfs = $vf_adaptor->fetch_all_by_Slice($slice); 

# Count the number of variants returned, this number is likely to be higher than when failed variants were omitted
print "There are " . scalar(@{$vfs}) . " variants in the " . $slice->display_id() . " region.\n";

There are a few methods to query the failed status of a variant. In the code snippet below we loop over the variants we got and find out how many are flagged as failed and why.

while (my $vf = shift @{$vfs}) {
  # Check whether the variation for this variation feature is failed
  if ($vf->variation->is_failed()) {
  	print $vf->display_id, " failed because ", join(", ",  @{ $vf->variation->get_all_failed_descriptions() }), ".\n"
  }
}

NOTE: some fetch methods are unaffected by the include_failed_variations flag in DBAdaptor. These are methods that return data related to specific variants, rather than a general batch fetch. For example, the fetch_by_dbID and fetch_all_by_VariationFeatures methods in TranscriptVariationAdaptor will not be affected by the flag whereas the fetch_all_by_Transcripts method will be.

Failed alleles

Even though a variant has not been failed in the Variation pipeline, some of the alleles associated with it may have been. Variants having failed alleles are not themselves flagged as failed but we can query them to see if any associated alleles have been flagged as failed.

foreach my $vf (@{$vfs}) {
  # Check whether the variation for this variation feature has failed alleles
  if ($vf->variation->has_failed_alleles()) {
    print "Variation " . $vf->variation_name() . " has the following failed alleles:\n";
    # Get all the alleles associated with the variation and check which ones have failed and why

    foreach my $allele (@{$vf->variation->get_all_Alleles()}) {
      # Check if the allele is flagged as failed
      if ($allele->is_failed()) {
        # Get the reason why
        print $allele->allele, " has been flagged as failed because ", $allele->failed_description(), "\n";
      }
    } 
  }
}
Used objects:
Adaptor objects
Main objects

Phenotype annotations

The Ensembl Variation API provides some methods to retrieve phenotype data associated with various Ensembl object types (variations, structural variations, genes, and QTLs). These data, stored into the Ensembl Variation database, are imported from a variety of sources (e.g. NHGRI-EBI GWAS catalog, OMIM, UniProt, EGA, ...).

There are two phenotype objects: Phenotype and PhenotypeFeature. The Phenotype object represents the observable phenotype, and contains the name and description of the phenotype, along with phenotype ontology terms. The PhenotypeFeature object represents the link between the Phenotype, and genomic features such as variants, genes and QTLs. The PhenotypeFeature links to the Phenotype, links to the variants/genes/QTLs and inherits location methods from the Feature object in the Core API. Since the PhenotypeFeature object has the name and description, you only need to use the Phenotype object when you're looking to use phenotype ontologies.

The following script starts with a variant name and uses the PhenotypeFeatures to get associated phenotypes:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# Fetch a variation object
my $var_adaptor = $registry->get_adaptor('human', 'variation', 'variation');
my $var = $var_adaptor->fetch_by_name('rs1421085');

# Fetch all the phenotype features associated with the variation
my $pf_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'phenotypefeature');
foreach my $pf (@{$pf_adaptor->fetch_all_by_Variation($var)}) {
  print "Variation ", $pf->variation_names, " is associated with the phenotype '", $pf->phenotype->description,
    "' in source ", $pf->source_name;
  if (defined($pf->p_value)) {
    print " with a p-value of ",$pf->p_value ;
  }
  print ".\n";
  if (defined($pf->risk_allele)){
    print "The risk allele is ", $pf->risk_allele, ".\n" ;
  }
}

It is also possible to retrieve all features associated with a phenotype description and an optional source name.

my $source_name = 'NHGRI-EBI GWAS catalog';
my $phenotype   = 'Cardiac hypertrophy';

my $pf_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'phenotypefeature');
 
foreach my $pf (@{$pf_adaptor->fetch_all_by_phenotype_description_source_name($phenotype, $source_name)}) {
  my $external_reference = $pf->external_reference;
  $external_reference =~ s/\// ID: /;

  print $pf->type, " ", $pf->object_id, " is associated with the phenotype '", $phenotype,
    "' in source ", $source_name," (", $external_reference,")\n";
}		
Used objects:

Variant sets

Variant sets group variants that share some properties eg. from the same sequencing project, data source or with assays on the same genotyping chip.
For example, all the variants identified in the 1000 Genomes - phase 3 study have been grouped into one variation set ('1000 Genomes 3 - All').

Sets can be part of supersets or divided into subsets depending on the hierarchical relationships between them.
> For example, the set representing variations identified in the 1000 Genomes - phase 3 study is named '1000 Genomes 3 - All' and has five subsets based on population: '1000 Genomes 3 - EUR', '1000 Genomes 3 - EAS', '1000 Genomes 3 - SAS', '1000 Genomes 3 - AMR', '1000 Genomes 3 - AFR'.

A list of all the variant sets with their description is available on the Variation Data Description page.

NOTE: The variant sets and the populations are different! e.g. The variant set associated with the European 1000 Genomes data is "1000 Genomes 3 - EUR", whereas the associated population is "1000GENOMES:phase_3:EUR".

In the Ensembl Variation API, variant sets are handled by the module VariationSet, and the accompanying adaptor module, VariationSetAdaptor, that interfaces with the underlying database. A VariationSet has a name attribute, a brief description and also a short name attribute. In the example below, we will connect to a Human Variation database, get a VariationSetAdaptor, fetch all available variant sets and print out their names and hierarchical relationships.

First, we get a connection and an adaptor:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

# Load the registry from db
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# Get a VariationSetAdaptor on the Human Variation database
my $vs_adaptor = $registry->get_adaptor('human','variation','variationset');

# Get all top-level variation sets
my $top_vss = $vs_adaptor->fetch_all_top_VariationSets();

# Loop over the top level variant sets and recursively print the subsets
foreach my $top_vs (@{$top_vss}) {
  print_set($top_vs);
}

# We define a function that will help us recurse over the set hierarchy and print the data   
sub print_set {
  my $set = shift;
  my $indent = shift || "";

  # Print the set attributes
  printf("\%s\%s (\%s)\n",$indent,$set->name(),$set->short_name());

  # Get the subsets that have the current set as immediate parent
  my $subsets = $set->get_all_sub_VariationSets(1);

  # Call the print subroutine for each of the subsets with an increased indentation
  foreach my $subset (@{$subsets}) {
    print_set($subset,$indent . "  ");
  }
}
Used objects:
Adaptor objects
Main objects

Running the code in the example above on the Human Variation database produces a table similar to the one on the Variation Data Description page.

Using the Iterator method to handle larger datasets

In order to get objects for the variants assigned to a variation set, we use the get_Variation_Iterator method. Note that rather than returning a reference to a list of all variation objects, this returns an Iterator object. The Iterator allows you to iterate over large numbers of objects without trying to fit them all into memory at once which could otherwise easily cause your script to crash. The main methods we will use on the iterator are the has_next() and next() methods. In the example below, we fetch a variant set that contains variants that have been linked to phenotypes and print the first 10 examples. Note that getting variants from a variant set that has subsets below it automatically returns the variants from the subsets.

# Get the variant set for the phenotype-associated variants.
my $vs = $vs_adaptor->fetch_by_short_name('ph_variants');

printf("\%s (\%s):\n\t\%s\n",$vs->name(),$vs->short_name(),$vs->description());

my $limit = 10;
my $fetched = 0;
my $it = $vs->get_Variation_Iterator();

# Get the first 10 examples and print some data from them
while ($fetched < $limit && $it->has_next()) {
  
  # Print the name of the variant
  my $var = $it->next();
  printf("\t\%s has been found to be associated with:\n",$var->name());
  
  # Get the PhenotypeFeature objects for the variant
  my $pfs = $var->get_all_PhenotypeFeatures();
  
  # Loop over the annotations and print the phenotypes
  foreach my $pf (@{$pfs}) {
    my @output = ();
    push @output, 'Desc: ' . $pf->phenotype->description;
    push @output, 'p-value: ' . $pf->p_value if (defined $pf->p_value);
    push @output, 'study_type: ' . $pf->study->type if (defined $pf->study && defined $pf->study->type);
    push @output, 'study_name: ' . $pf->study->name if (defined $pf->study && defined $pf->study->name);
    print join(' ', @output), "\n";
  }
  $fetched++;
}
Used objects:
Adaptor objects

Get variant sets by Slice

You can also use Slices to retrieve VariationFeatures belonging to a desired variant set in a particular genomic region. In the example below, we fetch all VariationFeatures discovered by the 1000 Genomes Project phase 3 data on the RIC8A gene.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

# Load the registry from db
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

# Get a VariationSetAdaptor on the Human Variation database
my $vs_adaptor = $registry->get_adaptor('human','variation','variationset');

# Get the variation set object
my $vs = $vs_adaptor->fetch_by_short_name('1kg_3');

# Get a gene adaptor and the gene object for the RIC8A gene
my $gene_adaptor = $registry->get_adaptor('human','core','gene');
my $gene = $gene_adaptor->fetch_by_display_label('RIC8A');
my $transcript = $gene->canonical_transcript;

# Get the variation features on the slice belonging to the variation set
my $vfs = $vs->get_all_VariationFeatures_by_Slice($transcript->feature_Slice());

# Loop over the variation features and print the variation using HGVS notation relative to the canonical transcript
foreach my $vf (@{$vfs}) {
  my @tvs = @{ $vf->get_all_TranscriptVariations([$transcript]) };

  my $tv = $tvs[0];
  print $tv->transcript_stable_id;
  	my $tvas = $tv->get_all_alternate_TranscriptVariationAlleles();

	# Loop through the alleles getting the HGVS notation
  	foreach my $tva (@{$tvas}) {
  		my $genomichgvs = $tva->hgvs_genomic;
		my $cdshgvs = $tva->hgvs_transcript;
		my $pephgvs;
		if ($tva->hgvs_protein) {
			$pephgvs = $tva->hgvs_protein;
		} else {
			$pephgvs = "No protein HGVS";
		}
		print $genomichgvs, ", ", $cdshgvs, ", ", $pephgvs, "\n";
	}
}
Used objects:
Adaptor objects
Main objects

Structural variant variation sets

The variant sets can also be retrieved from the structural variants.

# Get the StructuralVariation object
my $sv_adaptor  = $registry->get_adaptor( 'human', 'variation', 'structuralvariation' );
my $sv = $sv_adaptor->fetch_by_name('esv3626563');

# Get the structural variation features
my $svfs = $sv->get_all_StructuralVariationFeatures();

# Loop over the structural variation features and print the corresponding variant set(s)
foreach my $svf (@{$svfs}) {
  my $vss = $svf->get_all_VariationSets();
  foreach my $vs (@{$vss}) {
    print "The structural variation belongs to the set ", $vs->name, "\n";
  }
}

Alleles and Genotypes

Requirement for 1000 Genomes data and other large genotyping projects

Genotype and frequency data for a number of large scale projects, including the 1000 Genomes Project, can be accessed direct VCF files stored on Ensembl's FTP server The API makes this access seamless, though it requires you install some extra software dependencies detailed on the API installation page. See also this blog post.

Alleles and frequencies

Many variants available through the Ensembl Variation API have allele frequency information. Each allele object associated with a variant represents an observation of a variant allele in a given population, and may have an associated frequency. Populations are also represented as objects.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $variation_adaptor = $registry->get_adaptor('mus_musculus', 'variation', 'variation');

my $variation = $variation_adaptor->fetch_by_name('rs4224826');
my $alleles = $variation->get_all_Alleles();

foreach my $allele (@{$alleles}) {
  next unless (defined $allele->population);
  my $allele_string   = $allele->allele;
  my $frequency       = $allele->frequency || 'NA';
  my $population_name = $allele->population->name;
  printf("Allele %s has frequency: %s in population %s.\n", $allele_string, $frequency, $population_name);
}
Used objects:
Adaptor objects
Main objects

Retrieve external data (e.g. 1000 Genomes phase 3)

Additional frequency data is available via our VCF API extensions. Add the line "use_vcf" to 1 before making a call to retrieve alleles, genotypes or LD data. See an example in the following code:

my $variation_adaptor = $registry->get_adaptor("human", "variation", "variation");

# Set a flag on the variation adaptor to use VCF files for genotype extraction as well as the MySQL database
$variation_adaptor->db->use_vcf(1);

This instructs the API to integrate data fetched from VCF files alongside data from the Ensembl MySQL databases. No further configuration is required; the API is pre-configured to access indexed VCF files on the Ensembl FTP server.


Sample genotypes

Many variants in the Ensembl Variation database, especially those genotyped in the 1000 Genomes projects, have associated genotypes. Genotypes are stored per sample as SampleGenotype objects, and are associated with a Variation and Sample object. They can be retrieved from a SampleGenotypeAdaptor, or directly from a Variation object.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');

# OPTIONAL: uncomment the following line to retrieve 1000 genomes phase 3 data also
# $variation_adaptor->db->use_vcf(1);

my $variation = $variation_adaptor->fetch_by_name('rs1333049');

my $genotypes = $variation->get_all_SampleGenotypes();

foreach my $genotype (@{$genotypes}) {
  print "Sample ", $genotype->sample->name, " has genotype ", $genotype->genotype_string, "\n";
}
Used objects:
Adaptor objects

Sample genotype features

The SampleGenotypeFeatureAdaptor provides methods for retrieving SampleGenotypeFeature objects in a genomic region. A SampleGenotypeFeature stores information about the SampleGenotype and the genomic location of the SampleGenotype which can be accessed through the VariationFeature object that is stored in a SampleGenotypeFeature object. The SampleGenotypeFeatureAdaptor provides methods for retrieving only unique genotypes for a given Sample and Population and a method that returns all SampleGenotypeFeatures in a region and also stores genotypes for samples that differ from the given sample's genotype and are also different from the reference.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $sgfa = $registry->get_adaptor('mouse', 'variation', 'samplegenotypefeature');
$sgfa->db->use_vcf(1);

my $slice_adaptor = $registry->get_adaptor('mouse', 'core', 'slice');
my $chr = 18;
my $start = 68_260_187;
my $end = 68_300_333;
my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $start, $end);

my $pa = $registry->get_adaptor('mouse', 'variation', 'population');
my $sa = $registry->get_adaptor('mouse', 'variation', 'sample');

my $population = $pa->fetch_by_name('Mouse Genomes Project');
my $samples = $sa->fetch_all_by_name('MGP:CAST/EiJ');
my $sample = $samples->[0];

my $unique_sgfs = $sgfa->fetch_all_unique_by_Slice($slice, $sample, $population);
foreach my $sgf (@$unique_sgfs) {
  my $genotype_string = $sgf->genotype_string;
  my $vf = $sgf->variation_feature;
  my $variation_name = $vf->variation_name;
  my $seq_region_name = $vf->seq_region_name;
  my $seq_region_start = $vf->seq_region_start;
  my $seq_region_end = $vf->seq_region_end;
  my $allele_string = $vf->allele_string;
  print "$variation_name $genotype_string $allele_string $seq_region_name $seq_region_start $seq_region_end\n";
}

my $differences_sgfs = $sgfa->fetch_all_differences_by_Slice($slice, $sample, $population);
foreach my $sgf (@$differences_sgfs) {
  my $genotype_string = $sgf->genotype_string;
  my $vf = $sgf->variation_feature;
  my $variation_name = $vf->variation_name;
  my $seq_region_name = $vf->seq_region_name;
  my $seq_region_start = $vf->seq_region_start;
  my $seq_region_end = $vf->seq_region_end;
  my $allele_string = $vf->allele_string;
  print "$variation_name $genotype_string $allele_string $seq_region_name $seq_region_start $seq_region_end\n";

  my $differences = $sgf->differences;
  foreach my $sample_name (keys %$differences) {
    my $genotype = $differences->{$sample_name};
    print "   $sample_name $genotype\n";
  }
}

Population genotypes

Many variants in the Ensembl Variation database, especially those genotyped in the 1000 Genomes project, have associated genotypes at the population level. Population genotypes are stored as PopulationGenotype objects, and are associated with a Variation and a Population object. They can be retrieved from a PopulationGenotypeAdaptor, or directly from a Variation object.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');

# OPTIONAL: uncomment the following line to retrieve 1000 genomes phase 3 data also
# $variation_adaptor->db->use_vcf(1);

my $variation = $variation_adaptor->fetch_by_name('rs1333049');
my $genotypes = $variation->get_all_PopulationGenotypes();

foreach my $genotype (@{$genotypes}) {
  print "Population ", $genotype->population->name, " has genotype ", $genotype->genotype_string, " with the frequency ", $genotype->frequency, "\n";
}
Used objects:
Adaptor objects

LD calculation

Requirement

In order to be able to use the LD calculation, you need to compile the C source code and install a Perl module, called IPC::Run.
There is more information on how to do this in here.

At present, Linkage Disequilibrium (LD) is only available for the 1000 Genomes populations in human. To calculate LD, you need to specify the population you want data for. You can fetch LD using the

  • LDFeatureContainer
  • object, which is fetched using the
  • LDFeatureContainerAdaptor
  • . The LDFeatureContainer represents LD across a genomic region, and can be fetched from a Slice, Variation or VariationFeature. Methods in the LDFeatureContainer allow you to get LD scores (r2 and D') across all pairs of variants in a region, or for specific pairs of variants.

    If you get all r2 or D' across a region using the LDFeatureContainer, they will return a hash with the keys: variation1, variation2 (both VariationFeatures), r2 or d_prime and population_id. If you get them for a pair of variants, they will return just the score.

    In the example below, it calculates the LD in a region in human chromosome 6 for a 1000 Genomes project population, but only prints when there is a high LD

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;
    
    my $registry = 'Bio::EnsEMBL::Registry';
    
    $registry->load_registry_from_db(
      -host => 'ensembldb.ensembl.org',
      -user => 'anonymous'
    );
    
    # Define the region in chromosome 6 and get the Slice
    my $chr = 6;  
    my $start = 25_834_000;
    my $end = 25_854_000;
    my $slice_adaptor = $registry->get_adaptor( 'human', 'core', 'slice' );
    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $chr, $start, $end );
    
    # Define the Population and fetch it using the PopulationAdaptor
    my $population_name = '1000GENOMES:phase_3:CEU';
    my $population_adaptor = $registry->get_adaptor( 'human', 'variation', 'population' );
    my $population = $population_adaptor->fetch_by_name( $population_name );
    
    # Get adaptor for LDFeatureContainer object, and tell it to use the VCF file to get the 1000 Genomes genotype data
    my $ldFeatureContainerAdaptor = $registry->get_adaptor( 'human', 'variation', 'ldfeaturecontainer' ); 
    $ldFeatureContainerAdaptor->db->use_vcf(1);
    
    # Retrieve all LD values in the region
    my $ldFeatureContainer = $ldFeatureContainerAdaptor->fetch_by_Slice( $slice, $population ); 
    
    # Get all the r squared values for each pair of variants in the region and move through them
    foreach my $r_square (@{$ldFeatureContainer->get_all_r_square_values}){
      
      # only print high LD, where high is defined as r2 > 0.8
      if ($r_square->{r2} > 0.8){ 
        print "High LD between variations ", $r_square->{variation1}->variation_name, "-", $r_square->{variation2}->variation_name, "\n";
      }
    }
    
    Used objects:

    Please note the following warning MSG: variation features have no pairwise ld information is returned when there is no data in the LDFeatureContainer structure for the specified pair of VariationFeature objects. Here are some reasons why this may happen:

    • the LDFeatureContainer does not span the region contained by one or both VariationFeature objects specified
    • there are no genotypes for one or both variants, meaning no LD can be calculated
    • one or both variants is non-variant in the specified population (i.e. has a minor allele frequency of 0)
    • the estimated r2 value is lower than the cut-off threshold in the calculation software (the threshold is 0.05)

    Looking at other associated data will help to narrow down the reason. If there are variants in the region spanned by the LDFeatureContainer with genotypes with a minor allele frequency greater than zero, then assume the r2 is < 0.05, which essentially means no correlation.

    Specific strain information

    The Ensembl Variation API has is the possibility to work with your specific strain as if it was the reference one, and compare it against others. In the example, we create a StrainSlice object for a region in the mouse strain A/J and compare it against the reference sequence.

    use strict;
    use warnings;
    
    use Bio::EnsEMBL::Registry;
    
    my $reg = 'Bio::EnsEMBL::Registry';
    my $host= 'ensembldb.ensembl.org';
    my $user= 'anonymous';
    
    $reg->load_registry_from_db(
      -host => $host,
      -user => $user
    );
    
    # Get the Slice
    my $slice_adaptor = $reg->get_adaptor('mouse', 'core', 'Slice');
    my $slice = $slice_adaptor->fetch_by_region('chromosome', 19, 20380186, 20384187);
    
    # Get the StrainSliceAdaptor, and tell it to use the VCFs
    my $strain_slice_adaptor = $reg->get_adaptor('mouse', 'variation', 'StrainSlice');
    $strain_slice_adaptor->db->use_vcf(1);
    
    # get strainSlice from the slice
    my $a_j = $strain_slice_adaptor->get_by_strain_Slice("MGP:A/J", $slice);
    
    my @differences = @{$a_j->get_all_AlleleFeatures_Slice()};
    
    foreach my $diff (@differences){
      print "Locus ", $diff->seq_region_start, "-", $diff->seq_region_end, ", A/J's alleles: ", $diff->allele_string, "\n";
    }
    
    Used objects:
    Adaptor objects
    Main objects

    Structural variants

    The Ensembl Variation database also stores information about structural variants. These data are imported from the DGVa (Database of Genomic Variants archive). Structural variants are remapped to the current genome assembly using the $slice->project() method from the Ensembl Core API, allowing for one gap in the resultant mapping.

    use strict;
    use warnings;
    
    use Bio::EnsEMBL::Registry;
    
    my $reg = 'Bio::EnsEMBL::Registry';
    my $host= 'ensembldb.ensembl.org';
    my $user= 'anonymous';
    
    $reg->load_registry_from_db(
      -host => $host,
      -user => $user
    );
    
    # Get adaptor to StructuralVariation object
    my $sva = $reg->get_adaptor("human", "variation", "structuralvariation");
    
    # Get the StructuralVariation object
    my $sv = $sva->fetch_by_name('esv2663683');
    
    print $sv->variation_name, " ",  $sv->var_class, " (SO term: ", $sv->class_SO_term, ")\n";
    
    # Get Study information
    if ($sv->study) {
      print "This structural variant comes from the study ", $sv->source_name, "-", $sv->study->name, ": ", $sv->study->description, "\n";
    }
    

    Like short variants, you can fetch the structural variants using a Slice.

    #  Get adaptor to Slice object from core
    my $sa = $reg->get_adaptor("human", "core", "slice");
    
    my $slice = $sa->fetch_by_region('chromosome', 5, 1, 1000000);
    
    # Get adaptor to StructuralVariationFeature
    my $svfa = $reg->get_adaptor("human","variation","structuralvariationfeature");
    
    # Get all structural variation features on the slice
    my $svfs = $svfa->fetch_all_by_Slice($slice);
    
    # StructuralVariationFeature objects
    foreach my $svf (@$svfs) {
      print $svf->variation_name, " ", $svf->seq_region_name, ":", $svf->seq_region_start, "-", $svf->seq_region_end, " ", 
    	$svf->var_class, " (SO term: ", $svf->class_SO_term, ")\n";
    }
    

    You can also retrieve all the supporting evidence associated with a structural variant.

    # Get adaptor to StructuralVariation object from the Ensembl Variation API
    my $sva = $reg->get_adaptor("human", "variation", "structuralvariation");
    
    # Get the StructuralVariation object
    my $sv = $sva->fetch_by_name('esv2663683');
    
    # Get the supporting evidences associated to the structural variation
    my $ssvs = $sv->get_all_SupportingStructuralVariants();
    
    foreach my $ssv (@$ssvs) {
      print $ssv->variation_name, " ",  $ssv->var_class, " (SO term: ", $ssv->class_SO_term, ")\n";
    }
    


    Others

    Fetching Iterators

    Sometimes an adaptor method may return more objects than can fit into memory at once, and for such cases the Ensembl Variation API provides methods to fetch Iterator objects instead of the usual array reference. An Iterator is an object which allows you to iterate over the entire set of objects fetched by the adaptor without loading them all into memory at once, by calling the next() method to fetch each object in turn. You can tell when you have finished iterating over the set of objects by using the has_next() method which returns true when the Iterator still has objects to fetch and false when it is exhausted. These are the most commonly used Iterator methods, but there are also some other useful methods which allow you to transform and filter the set of objects in an analogous way to using map and grep on standard perl arrays, using the methods called (imaginatively) map() and grep(). Various other methods are also supported, please see the Iterator API documentation for full details.

    Some example code using Iterators to fetch somatic mutations from the VariationAdaptor is listed below. After the first fetch_Iterator_somatic() call we just iterate over the first 10 somatic mutations found in the Ensembl Variation database, using next() to fetch each object, and has_next() or count to check when we are done.

    use strict;
    use warnings;
    
    use Bio::EnsEMBL::Registry;
    
    my $reg = 'Bio::EnsEMBL::Registry';
    $reg->load_registry_from_db(
      -host => 'ensembldb.ensembl.org',
      -user => 'anonymous',
    );
    
    my $va = $reg->get_adaptor('human', 'variation', 'variation') or die;
    
    my $somatic_iter = $va->fetch_Iterator_somatic();
    my $limit = 10;
    my $fetched = 0;
    while ($fetched < $limit && $somatic_iter->has_next()) {
      my $mutation = $somatic_iter->next();
      print $mutation->name(), "\n";
      $fetched ++;
    }
    

    In the example below only the variations from the 'COSMIC' set are retrieved. This is can be done either using grep() and a subroutine to search for 'COSMIC' as a source or using the VariationSet parameter retrieved via the corresponding adaptor and variation set, as in the Variant sets examples above.

    my $va = $reg->get_adaptor('human', 'variation', 'variation') or die;
    
    my $cosmic_iter = $va->fetch_Iterator_somatic()->grep(sub {$_->source_name eq 'COSMIC'});
    # OR using the VariationSet parameter:
    #my $vs_adaptor = $registry->get_adaptor('human','variation','variationset');
    #my $vs = $vs_adaptor->fetch_by_short_name('ph_cosmic');
    #my $cosmic_iter = $va->fetch_Iterator_by_VariationSet($vs);
    
    my $limit = 10;
    my $fetched = 0;
    while ($fetched < $limit && $cosmic_iter->has_next()) {
      print $cosmic_iter->next()->name(), "\n";
      $fetched ++;
    }
    

    In the third call to fetch_Iterator_somatic we use the map() method to return the name of each Variation rather than the object itself using an anonymous subroutine just as you would use a block for perl's map function. In the final call we combine the grep() and map() methods to return an Iterator over all the names of somatic mutations from COSMIC. The map() and grep() methods return a new Iterator and only do their transforming and filtering on each object in turn, without needing all objects in memory, they are therefore much more memory-efficient than mapping and grepping a native perl array.

    my $name_iter = $va->fetch_Iterator_somatic()->map(sub {$_->name});
    my $limit = 10;
    my $fetched = 0;
    while ($fetched < $limit && $name_iter->has_next()) {
      print $name_iter->next(), "\n";
      $fetched ++;
    }
    
    my $cosmic_name_iter = $va->fetch_Iterator_somatic()->grep(sub {$_->source_name eq 'COSMIC'})->map(sub {$_->name});
    $fetched = 0;
    while ($fetched < $limit && $cosmic_name_iter->has_next()) {
      print $cosmic_name_iter->next(), "\n";
      $fetched ++;
    }
    
    Used objects:
    Adaptor objects
    Main objects

    Further help

    For additional information or help mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.