Gene Families in Compara
Ensembl families are determined through classification of all Ensembl proteins, including multiple isoforms of the same gene, along with metazoan sequences from UniProt. They therefore provide a way of exploring orthologues and closely related homologues across a range of animal species.
The pipeline consists of the following steps:
- Load proteins from Ensembl and UniProt.
- Run an HMM search against the TreeFam HMM library to classify the sequences into families.
- Align the members of each family with MAFFT.
- Annotate each family with a consensus description, based on its members' descriptions.
The families have been assigned the stable ID of their corresponding HMM.
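The classification step can be pictured with a minimal sketch. It assumes that each sequence is placed in the family of its best-scoring HMM, subject to an E-value cutoff. Neither the best-hit rule nor the cutoff value is stated in this page, and the `Hit` record and `classify_by_best_hit` function are hypothetical names used only for illustration, not part of the Compara pipeline.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One HMM search result (hypothetical record layout)."""
    sequence_id: str  # protein identifier, from Ensembl or UniProt
    hmm_id: str       # identifier of the matching HMM
    evalue: float     # E-value of the match; lower is better


def classify_by_best_hit(hits: Iterable[Hit],
                         max_evalue: float = 1e-10) -> Dict[str, List[str]]:
    """Group sequences into families keyed by the ID of their best HMM.

    Assumption: a sequence joins the family of its lowest E-value hit,
    and hits above max_evalue are ignored. The real pipeline may differ.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # too weak to be considered
        current = best.get(hit.sequence_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.sequence_id] = hit

    # The family takes the ID of the HMM, mirroring the stable ID rule.
    families: Dict[str, List[str]] = {}
    for sequence_id, hit in best.items():
        families.setdefault(hit.hmm_id, []).append(sequence_id)
    return families
```

Keying the result on the HMM identifier reflects the point made above: a family and its HMM share the same stable ID, so no separate family identifier needs to be generated.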
For each cluster obtained, a consensus annotation is automatically generated from the UniProt description lines using the following approach:
- If the description covers less than 40% of the UniProt members in the cluster, the family is assigned the description 'AMBIGUOUS'.
- If the annotation confidence score, described below, is zero, the family is assigned the description 'UNKNOWN'.
Be aware that 'UNCHARACTERIZED' is a description that UniProt itself gives to a protein, and does not reflect the score.
The annotation confidence score is the percentage of UniProt family members with this description, or part of it. Note that only family members with 'informative' UniProt descriptions are taken into account.
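The rules above can be sketched in a few lines of Python. This is a simplified reading rather than the pipeline's actual code: it only counts exact matches of a description (the real score also counts partial matches), it measures the 40% coverage against all UniProt members and the score against informative members only, and the `UNINFORMATIVE` set is a hypothetical stand-in for whatever the pipeline treats as non-informative.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical examples of descriptions treated as non-informative.
UNINFORMATIVE = {"", "unknown", "hypothetical protein", "predicted protein"}


def is_informative(description: str) -> bool:
    """Return True if the description is not in the non-informative set."""
    return description.strip().lower() not in UNINFORMATIVE


def consensus_annotation(descriptions: List[str],
                         min_coverage: float = 40.0) -> Tuple[str, int]:
    """Return (family description, confidence score in percent).

    descriptions holds one UniProt description line per family member.
    """
    informative = [d.strip() for d in descriptions if is_informative(d)]
    if not informative:
        # No informative description at all: the score is zero.
        return "UNKNOWN", 0

    # Candidate consensus: the most frequent informative description.
    candidate, count = Counter(informative).most_common(1)[0]

    # Score: share of informative members carrying the candidate.
    score = round(100.0 * count / len(informative))

    # Coverage: share of all UniProt members carrying the candidate.
    coverage = 100.0 * count / len(descriptions)
    if coverage < min_coverage:
        return "AMBIGUOUS", score
    return candidate, score
```

For instance, a family whose four members are described as 'Kinase A', 'Kinase A', 'Kinase A' and 'unknown' would keep the description 'Kinase A', whereas a family with three different informative descriptions would fall below the 40% threshold and be labelled 'AMBIGUOUS'.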
Ensembl provides pre-calculated multiple sequence alignments of all members for each cluster. We provide a Wasabi viewer in the browser for viewing the alignment of just the Ensembl proteins, or of the Ensembl and UniProt proteins together. You can also export a text file with the alignment of all the family members; a wide range of formats is available from the control panel.
Alternatively, export alignments using the Compara Perl API.
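Once an alignment has been exported, it can be reduced to a subset of members, for example only the Ensembl proteins. The sketch below assumes the alignment has already been read into a dictionary of equal-length gapped sequences with '-' as the gap character. It drops the rows that are not wanted, removes columns that are left containing only gaps, and writes the result as FASTA. The function names and the member identifiers in the usage note are hypothetical, and whether the browser removes gap-only columns in the same way is not stated here.

```python
from typing import Dict, Iterable


def subset_alignment(alignment: Dict[str, str],
                     keep_ids: Iterable[str]) -> Dict[str, str]:
    """Keep only the requested rows and drop columns that become all gaps."""
    wanted = set(keep_ids)
    rows = {k: v for k, v in alignment.items() if k in wanted}
    if not rows:
        return {}

    length = len(next(iter(rows.values())))
    if any(len(seq) != length for seq in rows.values()):
        raise ValueError("aligned sequences must all have the same length")

    # A column is kept if at least one remaining row has a residue there.
    keep_cols = [i for i in range(length)
                 if any(seq[i] != "-" for seq in rows.values())]
    return {k: "".join(seq[i] for i in keep_cols) for k, seq in rows.items()}


def to_fasta(alignment: Dict[str, str], width: int = 60) -> str:
    """Format an alignment as FASTA text, wrapping sequences at width."""
    lines = []
    for member_id, seq in alignment.items():
        lines.append(f">{member_id}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)
```

With an alignment such as `{"ENSP1": "MK-LV", "ENSP2": "MR-LV", "P12345": "MKALV"}`, keeping only `ENSP1` and `ENSP2` removes the third column, because the residue that justified it belonged to the UniProt member.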