EnsemblEnsembl Home

Ensembl Regulation (funcgen) API Tutorial

Introduction

The Ensembl Regulation team deals with functional genomics data. The API and databases for Ensembl Regulation are called Funcgen.

This tutorial is an introduction to the Funcgen API. Knowledge of the Ensembl Core API and of the concepts and conventions in the Ensembl Core API tutorial is assumed. Please note that the Ensembl Core API must also be installed to use the Funcgen API.

Connecting to the funcgen Database using the Registry

Connecting to any Ensembl database is made simple by using the Bio::EnsEMBL::Registry module:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

The use of the registry ensures you will load the correct versions of the Ensembl databases for the software release it can find on a database instance. Using the registry object, you can then create any of number of database adaptors. Each of these adaptors is responsible for generating an object of one type. The above code will be omitted from the following examples for brevity.

Regulatory Features

Regulatory Features are features like involved with regulatory aspects like:

  • Predicted promoters,
  • Predicted promoter flanking regions,
  • Predicted enhancer regions,
  • CTCF Binding Sites,
  • Transcription factor binding sites or
  • Open chromatin regions.

The are generated by the Ensembl Regulatory Build.

To fetch Regulatory Features from the funcgen database, you need to use the corresponding adaptor. To obtain all the regulatory features present in a given region of the genome, use the adaptor method fetch_all_by_Slice:


my $slice_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Human', 'Core', 'Slice');
my $slice = $slice_adaptor->fetch_by_region('chromosome', 1, 54_960_000, 54_980_000);

my $regulatory_feature_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Human', 'Funcgen', 'RegulatoryFeature');
my @regulatory_features = @{$regulatory_feature_adaptor->fetch_all_by_Slice($slice)};

foreach my $current_regulatory_feature (@regulatory_features) {
  print $current_regulatory_feature->stable_id.": ";
  print_feature($current_regulatory_feature);
  print "\tFeature Type: ".$current_regulatory_feature->feature_type->name."\n";
}

Regulatory Activities

For every regulatory feature the Ensembl Regulatory Build predicts the regulatory activity of the regulatory feature in each of the epigenomes of the regulatory build. For every epigenome there are 5 possible activities:

  1. Inactive,
  2. Repressed,
  3. Poised (Has both active and repressive marks, "ready to go"),
  4. Active or
  5. NA (Not part of the regulatory build)

The regulatory activities have their own object and can be queried like this:

  my $regulatory_feature_demo_stable_id = 'ENSR00000165384';

  my $regulatory_feature = $regulatory_feature_adaptor->fetch_by_stable_id($regulatory_feature_demo_stable_id);
  print "The regulatory feature with stable id: "  . $regulatory_feature->stable_id . " has the following activities: \n";

  my $regulatory_activity_adaptor = Bio::EnsEMBL::Registry->get_adaptor('homo_sapiens', 'funcgen', 'RegulatoryActivity');
  my $regulatory_activity_list    = $regulatory_activity_adaptor->fetch_all_by_RegulatoryFeature($regulatory_feature);

  foreach my $current_regulatory_activity (@$regulatory_activity_list) {
    print "\tIn the epigenome "  
      . $current_regulatory_activity->get_Epigenome->display_label 
      . ' it is: ' 
      . $current_regulatory_activity->activity 
      . "\n";

  }

Peaks: Enriched regions from ChIP-seq and other high throughput experiments

Regulatory Features are built based on results from experiments like Dnase1 sensitivity assays (Dnase-Seq) to detect regions of open chromatin, or transcription factor binding assays, like Chromatin immunoprecipitation (ChIP) coupled with high throughput sequencing (ChIP-Seq). Results from these experiments are stored as Peaks.

ChIP-Seq studies are also used to detect histone modifications (eg. H3K36 trimethylation) and Polymerase binding sites.

Peaks have these properties:

  • Feature Type usually representing antibody used in ChIP experiment.
  • Feature Set for this Peak (see further down in Tutorial).
  • Score. An analysis-dependent value (eg. peak-caller score)
  • The peak Summit. Precise 1bp position within the peak with the highest read density in a ChIP experiment. It is dependent on the analysis and sometimes it may not be present.

Fetch all Peaks on a Slice

Here is an example how peaks can be fetched from a Slice and printed with their properties:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use List::Util qw( min );

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $peak_adaptor  = Bio::EnsEMBL::Registry->get_adaptor('homo_sapiens', 'funcgen', 'Peak');
my $slice_adaptor = Bio::EnsEMBL::Registry->get_adaptor('homo_sapiens', 'core',    'Slice');

# Fetch a slice
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '17', 63_992_802, 64_038_237);

# Fetch all peaks on the slice
my $peaks_on_slice = $peak_adaptor->fetch_all_by_Slice($slice);

my $number_of_peaks_on_slice = @$peaks_on_slice;
print "There are $number_of_peaks_on_slice peaks on the slice.\n";

# Print the first ten
my $max_features_to_print = 10;

# This prints:
# 
# There are 697 peaks on the slice.
# H3K36me3 - Fetal Stomach Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
# H3K27me3 - EM CD8+ ab T cell (VB) Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
# H3K9me3 - CD14+CD16- monocyte (VB) Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
# H3K36me3 - H1-trophoblast Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
# H3K36me3 - naive B cell (To) Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
# H3K36me3 - CD14+CD16- monocyte (VB) Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
# H3K36me3 - naive B cell (VB) Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
# H3K27me3 - neutrophil (VB) Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
# H3K27me3 - M2 macrophage (VB) Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
# H3K36me3 - Fetal Stomach Enriched Site (chromosome:GRCh37:17:63992802:64038237:1)
#
for my $i ( 1.. min($max_features_to_print, $number_of_peaks_on_slice) ) {
  print_feature($peaks_on_slice->[$i]);
}

sub print_feature {
  my $feature = shift;
  print 
    $feature->display_label
    . " (" 
    . $feature->slice->name 
    . ")\n";
}

Motif Features: Transcription factor binding sites

Motif Features represent short genomic regions where a Transcription Factor is thought to be directly interacting with the DNA. These regions are called Transcription Factor binding sites. More information on how these sites are found in Ensembl is on the RegulatoryBuild page.

To obtain the motif features present in a given Regulatory Feature, you can do this:

  my $motif_features = $regulatory_activity->get_RegulatoryEvidence_by_type('motif');
or
  my $slice = $slice_adaptor->fetch_by_region('chromosome', 1, 54_960_000, 54_980_000);

  my $motif_feature_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Human', 'Funcgen', 'MotifFeature');
  my $motif_features = $motif_feature_adaptor->fetch_all_by_Slice($slice);

Motif Feature objects have these properties:

  • Binding Matrix: Position Weight Matrix used to define the site. These are currently from Jaspar and their name is the Jaspar Identifier.
  • Score. analysis-dependent value indicating degree of similarity to the binding matrix

Here is an example how properties from motif features can be printed:

foreach my $current_motif_feature (@$motif_features) {
  print_feature($current_motif_feature);
  print $current_motif_feature->binding_matrix->name . "\n";
  print $current_motif_feature->seq                  . "\n";
  print $current_motif_feature->score                . "\n";
  
}

External Features: Externally curated data

There are some Feature Sets that are either entirely or partially curated by external groups. These are stored as External Features and can be accessed as follows:

  # Fetch all external feature sets
  my $external_feature_sets = $feature_set_adaptor->fetch_all_by_feature_class('external');

  foreach my $current_external_feature_set (@$external_feature_sets) {
    print "External FeatureSet: " . $current_external_feature_set->name . "\n";
  }

If you know the name of a feature set, you can use the name to fetch the data. For example, we store data from the Vista Enhancer Browser.

  my $vista_feature_set = $feature_set_adaptor->fetch_by_name('VISTA enhancer set');

  # Now you can get all the features (in this case external features) 
  # You can also get features by Slice using get_Features_by_Slice: 
  #
  foreach my $current_vista_feature (@{$vista_feature_set->get_all_Features}){
      print_feature($current_vista_feature);
      
      # There is no epigenome for these features
      # Feature type indicates vista annotation (eg. active enhancer)
      #
      print $current_vista_feature->feature_type->name."\n";
  }

Feature Types

Feature Types provide a biological annotation for features. They are divided in classes forming biologically coherent groups (eg. Transcription Factors). This is different from the Feature Set class, which just states the origin of the data. Feature Types can be accessed using a specific adaptor:

  my $feature_type_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Human', 'Funcgen', 'FeatureType');

External Feature Types

Feature Types for External Features have a meaning that is specific to the Feature Set. For example, for features of the Vista Feature Set, the feature type indicates if the feature was active or inactive in an experiment.

Microarrays and associated Information

This code demonstrates the use of adaptors, specifically the ArrayAdaptor. In this example, we export all microarray platforms and associated information.

  # Grab the adaptor
  my $array_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Human','Funcgen','Array');

  my $array = $array_adaptor->fetch_all;

  foreach my $current_array (@$array) {
  
    # Print some array info
    print "Array:  " . $current_array->name   ."\n";
    print "Type:   " . $current_array->type   ."\n";
    print "Vendor: " . $current_array->vendor ."\n";
    
    my $array_chips   = $current_array->get_ArrayChips;

    #Print some ArrayChip info
    foreach my $current_array_chip (@$array_chips) {
      print "ArrayChip: " 
        . $current_array_chip->name 
        . " DesignID: " 
        . $current_array_chip->design_id . "\n";
    }
    print "\n";
  }

Fetch all ProbeSets from a specific Array

In this example, probesets from the WholeGenome_4x44k_v1 array are obtained.

  my $probe_adaptor         = Bio::EnsEMBL::Registry->get_adaptor('Human', 'Funcgen', 'Probe');
  my $probe_feature_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Human', 'Funcgen', 'ProbeFeature'); 

  # Fetch a probeset from the WholeGenome_4x44k_v1 array
  my $probe = $probe_adaptor->fetch_by_array_probe_probeset_name('WholeGenome_4x44k_v1', 'A_23_P18656');
  print "Got " . $probe->class . " probe " . $probe->get_probename . "\n";

  # Fetch the feature associated with this probe
  my @pfeatures = @{$probe_feature_adaptor->fetch_all_by_Probe($probe)};
  print "\nFound ".scalar(@pfeatures)." ProbeFeatures\n";

  #Print some info about the features
  foreach my $pfeature ( @pfeatures ){
    print "\nProbeFeature found at:\t".$pfeature->feature_Slice->name."\n";
  }

Microarray Annotations

In this example, the FOXP2 transcript is fetched by it's stable_id. Then all ProbeSets that have been mapped to this transcript are fetched and printed.

  #Grab the relevant adaptors
  my $transcript_adaptor = Bio::EnsEMBL::Registry->get_adaptor("human", "Core",    "Transcript");
  my $probe_set_adaptor  = Bio::EnsEMBL::Registry->get_adaptor("human", "Funcgen", "ProbeSet");

  #Fetch the transcript and associated ProbeSets
  my $transcript = $transcript_adaptor->fetch_by_stable_id('ENST00000393489');#Foxp2
  my $probesets  = $probe_set_adaptor->fetch_all_by_external_name($transcript->stable_id);

  foreach my $probeset (@$probesets) {

    my $arrays_string = join(', ', (map { $_->name } @{$probeset->get_all_Arrays}));

    # Now get linkage annotation
    #
    my $database_entry_info;
    foreach my $current_database_entry (@{$probeset->get_all_Transcript_DBEntries($transcript)}) {

      # This will return all ProbeSet DBEntries for this transcript
      # There should really be one max per transcript per probeset/probe
      #
      $database_entry_info = $current_database_entry->linkage_annotation;
    }
    print $probeset->name 
      . " on arrays " . $arrays_string 
      . " with Probe hits $database_entry_info\n";
  }

Documentation on the current array mapping strategy can be found here. More examples for accessing microarray data can be found in the funcgen API package: ensembl-funcgen/scripts/examples/microarray_annotation_example.pl

Further help

Complete example scripts for the funcgen API can be found in the funcgen API package: ensembl-funcgen/scripts/examples/ For additional information or help mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.