Compara API Tutorial
Introduction
This tutorial is an introduction to the Ensembl Compara API. Knowledge of the Ensembl Core API and of the coding conventions used in the Ensembl APIs is assumed.
Documentation about the Compara database schema is available here, and while is not necessary for this tutorial, an understanding of the database tables may help as many of the adaptor modules are table-specific.
Species and databases in Ensembl Compara
The compara API can be used to access the main Ensembl database as well as the non-vertebrate Ensembl Genomes databases. Details of how to access these databases with your registry are found in the general instructions for the Ensembl APIs. There is no comparative genomics data in the Ensembl GRCh37 database.
When using Compara adaptors, we specify the species as 'Multi' for vertebrates; for non vertebrates, we name the division, such as 'Plants' or 'Fungi'. The pan-taxonomic comparative analysis can be accessed by specifying as 'pan_homology'.
Main Ensembl Compara objects
The Ensembl Compara databases uses a number of objects that share some similarities to objects in other Ensembl APIs, but are not exactly the same. Here are some examples of object types and how they differ from their counterparts in other APIs:
- GenomeDB objects represent a genome in the database, usually a single species but it may also be a strain of a species. It contains information about the taxonomy and is usually fetched with the GenomeDBAdaptor and the species name. Unlike the Core, Variation and Regulation databases, where each genome has its own database and this is specified in the adaptor, the compara databases are multi-genome, which means each genome must be represented as an object within the database.
- DnaFrag objects are similar to the seq_regions attached to Features in other databases, and represent chromosomes, contigs or scaffolds.
- SpeciesSets are groups of species.
- A Method is a type of analysis that can be done on a SpeciesSet.
- A MethodLinkSpeciesSet combines a Method with a SpeciesSet to specify a particular analysis.
MethodLinkSpeciesSets
The compara API works by specifying which type of analysis and set of species you're working with using the MethodLinkSpeciesSet object. This is fetched using the MethodLinkSpeciesSetAdaptor.
A MethodLinkSpeciesSet is obtained using its species and its Method. The Method represents the type of analysis, and includes gene homology methods and whole genome alignment methods. Below is a non-exhaustive list of methods available:
Method name | Description |
---|---|
LASTZ_NET | Pairwise whole genome alignment using LastZ |
EPO | Multiple whole genome alignment (WGA), with ancestral inference, using Enredo, Pecan and Ortheus |
EPO_EXTENDED | Multiple WGA for lower quality genomes, extending EPO using LASTZ |
PECAN | Multiple WGA using Mercator and Pecan |
CACTUS_HAL | Multiple WGA using progressiveCactus (* additional setup required - see here for details) |
ENSEMBL_ORTHOLOGUES | Orthologue calls between a pair of species |
ENSEMBL_PARALOGUES | Paralogue calls for a species |
PROTEIN_TREES | Gene trees for protein coding genes |
NC_TREES | Gene trees for non-coding RNAs |
For pairwise MethodLinkSpeciesSets, such as pairwise orthologues or LASTZ_NET pairwise whole genome alignments, you can get the MethodLinkSpeciesSet using the species name (known as the registry_alias), which can be the common name or the latin name, or by the GenomeDB.
For multiple alignment MethodLinkSpeciesSets, such as EPO, PECAN or CACTUS_HAL, these can be fetched using the taxonomy-based name of the group, such as "mammals", "amniotes" or "murinae". The documentation for the alignments includes the terms you need to fetch these alignments with the MethodLinkSpeciesSetAdaptor.
It is possible to find the MethodLinkSpeciesSet IDs, and fetch using these, but we do not recommend this, particularly for the multiple alignments, as when new species are added to these alignments, new IDs are assigned, so your scripts will not work in new releases.
The following script fetches the LastZ-net alignment between human and mouse, using the MethodLinkSpeciesSet and GenomeDBs:
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # Get the GenomeDB adaptor my $genome_db_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomeDB' ); # Fetch GenomeDB objects for human and mouse: my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens'); my $mouse_genome_db = $genome_db_adaptor->fetch_by_name_assembly('mus_musculus'); # Get the MethodLinkSpeciesSet adaptor my $method_link_species_set_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'MethodLinkSpeciesSet'); # Fetch the MethodLinkSpeciesSet object corresponding to LASTZ_NET alignments between human and mouse genomic sequences my $human_mouse_lastz_net_mlss = $method_link_species_set_adaptor->fetch_by_method_link_type_GenomeDBs( "LASTZ_NET", [$human_genome_db, $mouse_genome_db] );
Registry
Ensembl Compara Adaptors
Below is a non-exhaustive list of Ensembl Compara adaptors that are most often used:
- GenomeDBAdaptor to fetch Bio::EnsEMBL::Compara::GenomeDB objects
- DnaFragAdaptor to fetch Bio::EnsEMBL::Compara::DnaFrag objects
- GenomicAlignBlockAdaptor to fetch Bio::EnsEMBL::Compara::GenomicAlignBlock objects
- SyntenyRegionAdaptor to fetch Bio::EnsEMBL::Compara::SyntenyRegion objects
- GeneMemberAdaptor to fetch Bio::EnsEMBL::Compara::GeneMember objects
- GeneTreeAdaptor to fetch Bio::EnsEMBL::Compara::GeneTree objects
- HomologyAdaptor to fetch Bio::EnsEMBL::Compara::Homology objects
Only some of these adaptors will be used for illustration as part of this tutorial through commented perl scripts code.
Whole Genome Alignments
The Compara database contains a number of different types of whole genome alignments, including multiple alignments and pairwise alignments.
GenomicAlignBlock objects
GenomicAlignBlocks are the preferred way to store and fetch genomic alignments. A GenomicAlignBlock contains several GenomicAlign objects. Every GenomicAlign object corresponds to a piece of genomic sequence from one genome aligned with another GenomicAlign from another genome in the same GenomicAlignBlock. A GenomicAlign object is always in relation with other GenomicAlign objects and this relation is defined through the GenomicAlignBlock object. Therefore the usual way to fetch genomic alignments is by fetching GenomicAlignBlock objects.
You can fetch GenomicAlignBlocks using Slice objects. When you fetch GenomicAlignBlocks this way, the blocks will not just represent your Slice, but the whole block of alignment. This means that you will need to restrict the block once you have fetched it. You will always get an array of blocks.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # Define the query species and the coordinates of the Slice my $query_species = 'human'; my $seq_region = '14'; my $seq_region_start = 75000000; my $seq_region_end = 75010000; # Get the SliceAdaptor and fetch a slice my $slice_adaptor = $registry->get_adaptor( $query_species, 'core', 'Slice' ); my $query_slice = $slice_adaptor->fetch_by_region( 'toplevel', $seq_region, $seq_region_start, $seq_region_end ); # Get the GenomeDB adaptor my $genome_db_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomeDB' ); # Fetch GenomeDB objects for human and mouse: my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens'); my $mouse_genome_db = $genome_db_adaptor->fetch_by_name_assembly('mus_musculus'); # Get the MethodLinkSpeciesSetAdaptor my $method_link_species_set_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'MethodLinkSpeciesSet'); # Fetch the MethodLinkSpeciesSet object corresponding to LASTZ_NET alignments between human and mouse genomic sequences my $human_mouse_lastz_net_mlss = $method_link_species_set_adaptor->fetch_by_method_link_type_GenomeDBs( "LASTZ_NET", [$human_genome_db, $mouse_genome_db] ); # Get the GenomicAlignBlockAdaptor my $genomic_align_block_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomicAlignBlock' ); # Fetch all the GenomicAlignBlocks corresponding to this Slice from the pairwise alignments (LASTZ_NET) between human and mouse my @genomic_align_blocks = @{ $genomic_align_block_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice( $human_mouse_lastz_net_mlss, $query_slice ) }; # We will then (usually) need to restrict the blocks to the required positions in the reference sequence foreach my $genomic_align_block( @genomic_align_blocks ) { my $restricted_gab = $genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end); }
- GenomeDB
- MethodLinkSpeciesSet
- Slice (Core API)
- GenomicAlignBlock
Print alignment
Once you've retrieved you alignment, you can use modules in BioPerl to print your alignment, such as Bio::SimpleAlign, from the GenomicAlignBlocks. The following code can be added to the end of the previous script to get SimpleAligns and print them to a file:
my $all_aligns; # Get a Bio::SimpleAlign object from every GenomicAlignBlock foreach my $this_genomic_align_block (@genomic_align_blocks) { my $restricted_gab = $this_genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end); my $simple_align = $restricted_gab->get_SimpleAlign; push(@$all_aligns, $simple_align); } # print all the genomic alignments using a Bio::AlignIO object my $output_format = "clustalw"; my $alignIO = Bio::AlignIO->newFh( -interleaved => 0, -fh => \*STDOUT, -format => $output_format, -idlength => 10 ); foreach my $this_align (@$all_aligns) { print $alignIO $this_align; }
Print coordinates
From a GenomicAlign object, you can create a Slice for that species, which you can then use to fetch coordinates or sequence, or perform any other Slice functions, like find overlapping features. Taking again the script where we fetched the GenomicAlignBlocks, we can add the following code to get coordinates:
foreach my $genomic_align_block( @genomic_align_blocks ) { my $restricted_gab = $genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end); # fetch the GenomicAligns and move through my @genomic_aligns = @ { $restricted_gab->get_all_GenomicAligns }; foreach my $genomic_align (@genomic_aligns) { my $species = $genomic_align->genome_db->get_scientific_name; my $slice = $genomic_align->get_Slice; print $species, "\t", $slice->seq_region_name, ":", $slice->seq_region_start, "-", $slice->seq_region_end, "\t"; } print "\n"; }
Gene trees, Homologies and Protein clusters
All the gene trees and homologies refer to GeneMembers and SeqMembers. Homology objects store orthologous and paralogous relationships between members, GeneTree objects represent the evolutionary history of a set of members.
*Member objects
A member represent either a gene (GeneMember) or a sequence-bearing locus, e.g. a protein or a transcript (SeqMember). Most of them are defined in the corresponding Ensembl core database. For instance, the sequence for the human gene ENSG00000004059 is stored in the human core database.
The fetch_by_stable_id_GenomeDB method of the corresponding *MemberAdaptor returns Members by their stable_id and genome. Here is a simple example:
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # get the GenomeDB adaptor my $genome_db_adaptor = $registry->get_adaptor('Multi','compara','GenomeDB'); # fetch GenomeDB object for human my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens'); # get the MemberAdaptor my $genemember_adaptor = $registry->get_adaptor('Multi','compara','GeneMember'); # fetch a human Member my $member = $genemember_adaptor->fetch_by_stable_id_GenomeDB('ENSG00000004059', $human_genome_db); # print out some information about the Member print $member->source_name, " (", $member->dnafrag->name, ":", $member->dnafrag_start, "-", $member->dnafrag_end, "): ", $member->description, "\n";
You can fetch the corresponding Ensembl Core API objects from all of the *Member objects, as well as coordinates. The taxon method returns an NCBITaxon object, which contains information about the species and taxonomy.
*Member objects have a source_name, which describes where the member comes from:
- for GeneMember
-
- ENSEMBLGENE, derived from an Ensembl gene
- EXTERNALGENE, loaded from an external source (currently unused in the live databases)
- for SeqMember
-
- ENSEMBLPEP, derived from an Ensembl translation
- ENSEMBLTRANS, derived from an Ensembl transcript
- Uniprot/SWISSPROT, derived from a Uniprot/Swissprot entry
- Uniprot/SPTREMBL, derived from a Uniprot/SP-TrEMBL entry
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # get the GenomeDB adaptor my $genome_db_adaptor = $registry->get_adaptor('Multi','compara','GenomeDB'); # fetch GenomeDB object for human my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens'); # get the MemberAdaptor my $genemember_adaptor = $registry->get_adaptor('Multi','compara','GeneMember'); # fetch a human Member my $member = $genemember_adaptor->fetch_by_stable_id_GenomeDB('ENSG00000004059', $human_genome_db); my $taxon = $member->taxon; print "common_name ", $taxon->get_common_name,"\ngenus ", $taxon->genus,"\nspecies ", $taxon->species, "\nbinomial ", $taxon->scientific_name, "\nclassification ", $taxon->classification,"\n";
GeneTree Objects
GeneTree objects give us the phylogenetic context for a set of genes, as well as their alignment.
In general, you would want to fetch the gene tree for a given gene of interest. The GeneTreeAdaptor has a fetching method called fetch_all_by_Member(). You will need the GeneMember object for your query gene, therefore you will fetch the GeneMember first like in this example:
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # first, let's use a GenomeDBAdaptor to get the GenomeDB of interest my $genome_db_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'GenomeDB' ); my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens'); # next, get our human GeneMember of interest from a GeneMemberAdaptor my $genemem_adapt = $registry->get_adaptor( 'Multi', 'compara', 'GeneMember' ); my $genemem = $genemem_adapt->fetch_by_stable_id_GenomeDB('ENSG00000238344', $human_genome_db); # then, set up a GeneTreeAdaptor and fetch the default tree for our GeneMember my $genetree_adapt = $registry->get_adaptor( 'Multi', 'compara', 'GeneTree' ); my $genetree = $genetree_adapt->fetch_default_for_Member($genemem); # look at all members of the tree print "Members of tree:\n"; my @members = @{ $genetree->get_all_Members }; foreach my $m ( @members ) { print $m->name, " (", $m->genome_db->name, ")\n"; } # print the full tree in Newick format print $genetree->newick_format() . "\n";
GeneTree objects not only hold the structure of the phylogeny, they also hold the gene alignment upon which the tree was based. This can be printed out using the following:
$genetree->print_alignment_to_file('/path/to/file', -format=>'clustalw');
Homology Objects
An Homology object represents either an orthologous or paralogous relationships between two members.
Typically you want to get homologies for a given gene. As with the GeneTreeAdaptor, the HomologyAdaptor has a fetching method called fetch_all_by_Member(). This has optional arguments to only fetch by a particular Method, MethodLinkSpeciesSet and target species.
When you get all Homologies, you will get an array where each item is a Homology representing the relationship between exactly two genes. One is the query gene, the gene you used as input, and the other is the target gene, the homologue. Even if you specify a single species, if there is a one-to-many or many-to-many relationship, each of these will be one homology.
use strict; use warnings; use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); # get human GenomeDB object my $genome_db_adaptor = $registry->get_adaptor('Multi', 'compara', 'GenomeDB'); my $human_genome_db = $genome_db_adaptor->fetch_by_name_assembly('homo_sapiens'); # get a GeneMember object my $gene_member_adaptor = $registry->get_adaptor('Multi', 'compara', 'GeneMember'); my $gene_member = $gene_member_adaptor->fetch_by_stable_id_GenomeDB('ENSG00000004059', $human_genome_db); # get the homologies where the member is involved my $homology_adaptor = $registry->get_adaptor('Multi', 'compara', 'Homology'); my @homologies = @ { $homology_adaptor->fetch_all_by_Member($gene_member, -TARGET_SPECIES=>"mus_musculus") }; # the homology_adaptor will always return an array, even if it only has one homology in it, so we move through the array foreach my $homology (@homologies) { # Get the GeneMembers and print their species and ID my @homologous_genes = @ { $homology->get_all_GeneMembers }; foreach my $gene (@homologous_genes){ print $gene->taxon->get_common_name, ": ", $gene->stable_id, "\n"; } # Print the homology relationship print $homology->description," ", $homology->taxonomy_level,"\n"; }
You can get the original alignment used to define an homology:
$homology->print_alignment_to_file('/path/to/file', -format=>'fasta');
Memory management and code efficiency
There are a myriad of ways to ensure optimal performance of your code. Here, we provide just a few tricks that pertain specifically to Ensembl APIs.
Fetching from adaptors
In all examples above, we've used a foreach to loop through the results of a fetch_all_by_YY call. This is fine for small datasets. However, adaptors can also support a while + shift combo for memory efficiency:
my $members = $genetree->get_all_Members; while ( my $m = shift @$members ) { # do something with member }
Releasing trees from memory
When working with many GeneTree objects, memory can quickly get out of hand. This is because our current object model uses a cyclic graph of Perl references. As a consequence, the usual garbage-collector is not able to release the memory used by a gene tree when you lose its reference (unlike most of the Ensembl objects). This means that you will have to call release_tree() on each tree after using it.
Preloading data
Most of the objects do lazy-loading of related objects via queries to the database. This system is sub-optimal when there are a lot of objects to fetch or if the server is distant. Our Bio::EnsEMBL::Compara::Utils::Preloader module provides several methods to do a bulk-loading of objects in a minimum number of queries. This will result in a higher memory usage, but faster processing of data.
Further help
For additional information or help mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.