1-to-1 orthologues | Orthologues | A type of orthologue assigned for a pair of species where only one copy is found in each species. | |
1-to-many orthologues | Orthologues | A type of orthologue assigned for a pair of species where one gene in one species is orthologous to multiple genes in the other species, due to (a) duplication event(s) in the second species. | |
1000 Genomes project | Variation source database | The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the human populations studied. Ensembl display sample genotypes and population frequencies from the 1000 Genomes project. http://www.internationalgenome.org/ | |
3 prime UTR variant | Variant consequence | A UTR variant of the 3' UTR | |
3' incomplete | Transcript | A protein-coding transcript which is missing the stop codon due to incomplete evidence. | |
3' overlapping ncRNA | Long non-coding RNA (lncRNA) | Transcripts where ditag and/or published experimental data strongly supports the existence of long (>200bp) non-coding transcripts that overlap the 3'UTR of a protein-coding locus on the same strand. | |
3' UTR | Transcript | The region of a coding cDNA downstream of the stop codon which is not translated. | |
5 prime UTR variant | Variant consequence | A UTR variant of the 5' UTR | |
5' incomplete | Transcript | A protein-coding transcript which is missing the start codon due to incomplete evidence. | |
5' UTR | Transcript | The region of a coding cDNA upstream of the start codon which is not translated. | |
Active | Regulatory activity | When a regulatory feature displays an epigenetic signature which is consistent with it carrying out its named function, for example an active Promoter has an epigenetic signature consistent with initiating transcription, while an active CTCF binding site will bind CTCF. It is analogous to a sprinter running. | |
AGP | File formats | A golden path. A file provided to Ensembl that describes how the longer sequences in the genome assembly were assembled from shorter sequences. For example, an AGP file can describe how a chromosome is assembled from a collection of scaffolds or a collection of contigs. For an AGP file that describes how a scaffold is assembled from a collection of contigs, each contig will be listed on a separate line in the AGP file and the line will include information about where the contig lies within the scaffold and the orientation of the contig. | |
Algorithm | | A sequence of computational tasks or actions that carry out a specific function. | |
Alignments | Genome annotation | A comparison between two or more sequences by matching identical and/or similar residues/nucleotides and assigning a score to the match. | |
Allele (gene) | Gene | Different versions of a gene found between the primary assembly and a patch or genome haplotype. | |
Allele (variant) | Variant | One of a number of alternative forms of the same genetic locus/variant. | |
Alternative allele | Allele (variant) | Any allele of a variant which is not in the reference genome currently being studied. The alternative allele is not necessarily the minor allele. | |
Alternative sequence | Genome assembly | Genomic sequence that differs from the genomic DNA on the primary assembly. These are represented as sequence on top of the primary assembly. Provided by the GRC for human and mouse. | |
Alu insertion | Repeat | A dispersed intermediately repetitive DNA sequence found in the human genome in about one million copies. The sequence is about 300 bp long and is found commonly in introns, 3' untranslated regions of genes, and intergenic genomic regions. The name Alu comes from the recognition site for the AluI endonuclease that cleaves it. | |
Ambiguity code | Variant | A single letter code that represents two or more possible nucleotides at a single base locus. | |
Ancestral allele | Allele (variant) | The allele which occurs at this locus in closely related species and is thought to reflect the allele present at the time of speciation. The ancestral allele may be the reference or the alternative allele, and the major or minor allele. | |
Animal QTLdb | Phenotype source database | Project aiming to house all publicly available QTL and association data on livestock animal species. Ensembl display phenotypes from the Animal QTLdb. https://www.animalgenome.org/cgi-bin/QTLdb/index | |
Antisense | Long non-coding RNA (lncRNA) | Transcripts that overlap the genomic span (i.e. exon or introns) of a protein-coding locus on the opposite strand. | |
APPRIS | Transcript | APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods to identify the most functionally important transcript(s) of a gene. | |
APPRIS ALT1 | APPRIS | For genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT1 marks the candidate transcript model(s) that are conserved in at least three tested species. | |
APPRIS ALT2 | APPRIS | For genes in which the APPRIS core modules are unable to choose a clear principal isoform, ALT2 marks the candidate transcript model(s) that appear to be conserved in fewer than three tested species. | |
APPRIS P1 | APPRIS | Transcript(s) expected to code for the main functional isoform based solely on the core modules in the APPRIS. | |
APPRIS P2 | APPRIS | Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant. | |
APPRIS P3 | APPRIS | Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated. | |
APPRIS P4 | APPRIS | Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant. | |
APPRIS P5 | APPRIS | Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant. | |
BAC | Clone | A vector used to clone DNA fragments (100 to 300-kb insert size; average, 150 kb) from another species so that it can be replicated in bacteria. Many genomes (such as human) were sequenced by cloning segments into BACs, amplifying and sequencing the clones. | |
BAM/CRAM | File formats | BAM and CRAM store alignments of NGS data to the genome. Ensembl allow attachment of BAM and CRAM files to view against the genome, and store RNA-seq, ChIP-seq and DNase-seq in BAM. | |
Base pairs (genome size) | Genome assembly | The actual number of bases of sequence we have for a full genome assembly, including alternative sequences and PARs, excluding gaps. | |
BED | File formats | BED is a simple format for listing genomic loci. It can be used to upload data to view in Ensembl, as a custom file for additional VEP annotation and is used to store and download constrained elements in Ensembl. | |
BedGraph | File formats | BedGraph allows you to store scores for loci in BED format, the loci can be of varying size. It can be uploaded to view in Ensembl. | |
Between species paralogues | Paralogues | Members of the same gene family in different species that are not direct orthologues. In a gene tree, these genes are separated by a duplication node. | |
BigBed | File formats | BigBed is an indexed form of BED, which can be used to store larger scale data. Ensembl allow attachment of BigBed files to view against the genome and store peaks of regulatory evidence as BigBed. | |
BigWig | File formats | BigWig is an indexed form of wiggle and can be used to store larger scale data. Ensembl simplify NGS data, such as ChIP-seq and RNA-seq into BigWig to view in the browser. It can also be used to attach your own data to Ensembl. | |
Biotype | Gene | A gene or transcript classification. | |
Bisulfite sequencing | Epigenome evidence | A method to determine the methylation of genomic cytosines. | |
BLAST | Algorithm | A sequence comparison algorithm optimised for speed which is used to search sequence databases for optimal local alignments to a query. | |
BlastZ | Pairwise whole genome alignment | BlastZ is a program for aligning DNA sequences in a pairwise manner. It has been replaced by LASTZ. | |
BLAT | Algorithm | An mRNA/DNA and cross-species protein sequence analysis tool to quickly find sequences of 95% and greater similarity of length 40 bases or more. | |
BLOSUM 62 | Algorithm | A matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties, and observed substitution frequencies. The BLOSUM 62 matrix is tailored using sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment, to avoid bias from using related family members). | |
Blueprint Epigenomes | Epigenome source database | Project aiming to apply functional genomics analysis on primary cells of the haematopoietic cell lineage from healthy and diseased individuals, to produce lineage-specific epigenomes. Used as a source for the Ensembl regulatory build. http://www.blueprint-epigenome.eu/ | |
CADD | Algorithm | A tool that integrates multiple annotations into one metric for scoring the deleteriousness of single nucleotide variants. | |
CCDS | Transcript | A coding sequence in the Consensus Coding Sequence Set is consistently annotated between Ensembl, MGI, HGNC and NCBI. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. | |
cDNA | Transcript | The sequence of the spliced exons of a transcript expressed in DNA notation (T rather than U), representing the coding or sense strand. The cDNA contains the whole sequence of the RNA, including coding and untranslated sequence. | |
CDS | Transcript | CoDing Sequence. The region of a cDNA which is translated. In Ensembl displays, the stop codon is included as part of the CDS sequence. | |
Centromere | Repeat | The region of the chromosome at which the two sister chromatids are joined during mitosis and meiosis, mostly composed of satellite DNA. | |
chain | File formats | Chain files describe the mapping between different genome assemblies. Ensembl store these on the FTP site. | |
ChIP-seq | Epigenome evidence | A method to determine the genomic regions that proteins bind to. | |
CIGAR | Alignments | The cigar line defines the sequence of matches/mismatches and deletions (or gaps) in an alignment | |
Clinical significance | Variant | A classification of a variant's impact on disease, taken from ClinVar. | |
ClinVar | Variation source database | NCBI resource that aggregates information about genomic variation and its relationship to human health. Ensembl display clinical significance and phenotypes from ClinVar. https://www.ncbi.nlm.nih.gov/clinvar/ | |
Clone | Genome assembly | A segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies. | |
CNV | Structural variant | Copy Number Variation: increases or decreases the copy number of a given locus. Subcategorised into Loss and Gain compared to the reference. | |
Coding sequence variant | Variant consequence | A sequence variant that changes the coding sequence | |
Codon | CDS | Three base pairs in either DNA or RNA that code for an amino acid (or stop translation). | |
Complex structural alteration | Structural variant | A structural sequence alteration or rearrangement encompassing one or more genome fragments, with four or more breakpoints. | |
Complex substitution | Structural variant | When no simple or well defined DNA mutation event describes the observed DNA change, the keyword "complex" should be used. Usually there are multiple equally plausible explanations for the change. | |
Constitutive exon | Exon | Exons that are not spliced out, therefore present in all transcripts of a given gene. | |
Contig | Genome assembly | A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information. | |
Coordinate system | Genome assembly | Which level of the assembly we are working on. | |
COSMIC | Variation source database | Database of somatic variants found in cancer. COSMIC licensing does not permit redistribution of the full dataset, but mutation identifiers, locations and tumour types are available in Ensembl. http://cancer.sanger.ac.uk/cosmic | |
Cosmid | Clone | DNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced. | |
Coverage | Genome assembly | Refers to the number of overlapping sequences used to build a region of the assembly. High coverage indicates a good amount of sequence information while low coverage reflects a low amount of sequence information. | |
CTCF binding sites | Regulatory features | Regions that bind CTCF, the insulator protein that demarcates open and closed chromatin. | |
Cytogenetic band | Genome assembly | A banding pattern on a chromosome resulting from staining and examination by microscopy. These are named in terms of the chromosome arm they are found on, and are often used as a shorthand for describing the location of genomic features. | |
D' | Linkage disequilibrium | The difference between the observed and the expected frequency of a given haplotype. If two loci are independent (i.e. in linkage equilibrium and therefore not coinherited at all), the D' value will be 0. | |
dbSNP | Variation source database | The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple (short) genetic polymorphisms in human, maintained by NCBI. https://www.ncbi.nlm.nih.gov/projects/SNP/ | |
dbVar | Variation source database | dbVar is NCBI's database of human genomic structural variation — insertions, deletions, duplications, inversions, mobile elements, and translocations. https://www.ncbi.nlm.nih.gov/dbvar/ | |
DDBJ | INSDC | The Asian branch of INSDC. http://www.ddbj.nig.ac.jp/ | |
Deletion | Sequence variant | Deletion of one or more nucleotides | |
DGVa | Variation source database | The Database of Genomic Variants archive (DGVa) is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species. https://www.ebi.ac.uk/dgva | |
DNA methylation | Epigenome evidence | Modification of cytosines in CpGs with methyl groups, which is known to repress gene expression. | |
DNase sensitivity | Epigenome evidence | A method to determine regions of open and closed chromatin. | |
Downstream gene variant | Variant consequence | A sequence variant located 3' of a gene | |
DUST | Algorithm | A standalone application that looks for low complexity sequences. | |
EMBL (file format) | File formats | EMBL files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome. | |
EMF Alignment format | File formats | Ensembl Multi Format (EMF) stores genomic alignments in Ensembl. | |
ENA | INSDC | Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications. https://www.ebi.ac.uk/ena | |
ENCODE | Epigenome source database | Project aiming to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active, by large scale functional analyses of laboratory cell lines. Used as a source for the Ensembl regulatory build. https://www.encodeproject.org/ | |
Enhancers | Regulatory features | Regions that bind transcription factors and interact with promoters to stimulate transcription of distant genes. | |
Ensembl canonical | Transcript | A single transcript chosen for a gene which is the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt. This is defined in detail on http://www.ensembl.org/info/genome/genebuild/canonical.html | |
Ensembl default (VEP) | File formats | Ensembl default is an input format for the VEP, used to describe the position and alleles of a variant. | |
Ensembl gene tree pipeline | Algorithm | The process by which Ensembl compare gene sequences in order to construct gene trees and predict homologues. | |
Ensembl Genebuild | Algorithm | The automatic process by which Ensembl plot known RNA and protein sequence onto the genome, using sequence similarity. | |
Ensembl Havana | Ensembl Genebuild | Human And Vertebrate ANalysis and Annotation. The team within Ensembl who manually annotate genes and transcripts for a subset of species. | |
Ensembl Regulatory Build | Algorithm | The process by which Ensembl predict the location of regions that regulate gene expression using epigenomic evidence. | |
Ensembl sources | | Publicly available database that Ensembl imports data from. | |
Epigenome | Regulatory activity | A cell type, such as a primary tissue or lab cell line, for which we have epigenome evidence and can predict regulatory features. | |
Epigenome evidence | Regulatory features | Experimental data that is used to construct and determine activity of regulatory features. | |
Epigenome source database | Ensembl sources | Database from which Ensembl imports ChIP-seq, DNase-seq and other related datasets, which are used in the Ensembl regulatory build. | |
EPO | Multiple whole genome alignment | The EPO (Enredo, Pecan, Ortheus) pipeline is a three-step pipeline for whole-genome multiple alignments: it defines segments with Enredo, aligns them with Pecan and constructs ancestral sequences with Ortheus. | |
Eponine | Algorithm | Eponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognising specific sequence motifs. Each of these is associated with a position distribution relative to the TSS. http://www.sanger.ac.uk/science/tools/eponine | |
eQTL | QTL | Genetic loci where allelic variation is associated with expression levels of other genes. | |
EST | Genome annotation | Expressed Sequence Tag. Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs. Usually short, single reads from a tissue or stage in development. | |
EVA | Variation source database | The European Variation Archive is an open-access database of all types of genetic variation data from all species. https://www.ebi.ac.uk/eva/ | |
Evidence status | Variant | Codes that reflect the amount and type of evidence that supports the existence of a variant. | |
Exon | Transcript | Transcribed genomic region that remains in the RNA after splicing, includes both the CDS and the UTRs. | |
Exonerate | Algorithm | A fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate | |
External reference | Genome annotation | Mapping between Ensembl genes, transcripts and proteins to the same features in other databases. | |
FASTA | File formats | FASTA is used to store finished nucleotide and peptide sequences. The Ensembl FTP site has genome, cDNA, CDS and peptide sequences in FASTA, and you can export FASTA from various webpages in Ensembl. | |
Feature elongation | Variant consequence | A sequence variant that causes the extension of a genomic feature, with regard to the reference sequence | |
Feature truncation | Variant consequence | A sequence variant that causes the reduction of a genomic feature, with regard to the reference sequence | |
Fix patch | Patch | Fix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. | |
Flagged variant | Variant | Variants that failed our quality control analyses, therefore they are flagged as suspicious. | |
Flanking sequence | Genome annotation | Sequence 5' or 3' to a DNA or RNA sequence of interest (for example gene, transcript, SNP or repeat). | |
Forward strand | Genome annotation | DNA strand arbitrarily defined as the strand with its 5' end at the tip of the short chromosome arm (p). If a gene is forward-stranded, its sense (sequence matching cDNA) is on the forward strand. The forward strand is the reverse complement of the reverse strand. | |
Frameshift variant | Variant consequence | A sequence variant which causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three | |
GenBank (database) | INSDC | The US branch of INSDC. https://www.ncbi.nlm.nih.gov/genbank/ | |
GenBank (file format) | File formats | GenBank files store sequence and accompanying annotation for features across a genomic region. They can be exported from various webpages in Ensembl and are stored for 1Mb regions across the genome. | |
GENCODE | Gene source database | The aim of GENCODE as a sub-project of the ENCODE scale-up project is to annotate all evidence-based gene features in the entire human and mouse genomes at a high accuracy. The GENCODE gene set is the default geneset in Ensembl and is equivalent to the Ensembl/HAVANA merged genes. https://www.gencodegenes.org/ | |
GENCODE Basic | Transcript | A subset of the GENCODE transcript set, containing only 5' and 3' complete transcripts. | |
GENCODE Comprehensive | Transcript | The full GENCODE transcript set, containing both complete transcripts and 5' and 3' incomplete transcripts. | |
Gene | Genome annotation | Genomic locus where transcription occurs. A gene may have one or more transcripts, which may or may not encode proteins. | |
Gene Ontology | Gene source database | An organised hierarchy of terms produced by the Gene Ontology Consortium, used to describe the function of proteins. GO terms are split into three subcategories: biological processes (what the protein does), cellular component (where in the cell the protein is found), and molecular function (how the protein acts). http://www.geneontology.org/ | |
Gene source database | Ensembl sources | Database from which Ensembl imports cDNA or protein sequence for gene annotation, or gene names. | |
Gene split | Paralogues | Pairs of genes in a species that occur together in the same tree, but are actually two halves of the same gene split partway along. | |
Gene tree | Homologues | A representation of the evolutionary relationship between homologues, constructed using the Ensembl gene tree pipeline. | |
Genetic marker | Variant | A measurable locus that varies within a population. | |
GeneWise | Algorithm | GeneWise is a sequence analysis tool for comparing proteins to DNA sequences allowing for introns and frameshifts. It is used in the Targetted stage of the Ensembl GeneBuild. https://www.ebi.ac.uk/Tools/psa/genewise/ | |
Genome | | The complete set of DNA found in each cell. | |
Genome annotation | | A genomic locus that has been annotated. | |
Genome assembly | Genome | A computational representation of the sequence of a haploid genome, representative of a species or strain. | |
Genotype | Allele (variant) | The specific alleles that are present in an individual's genome. In diploid organisms two alleles make up the genotype (except for the sex chromosomes). | |
GENSCAN | Algorithm | An HMM-based ab initio gene prediction method, used to create a track of ab initio genes in Ensembl. http://genes.mit.edu/GENSCAN.html | |
GFF | File formats | GFF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GFF, allow attachment of GFF files to view against the genome and allow custom annotation with the VEP using GFF files. | |
Global MAF | Minor allele frequency | The frequency of the second most common allele in the global population, defined in human by the 1000 Genomes Project phase 3. | |
gnomAD | Variation source database | An aggregation of publicly available whole genome and whole exome variant calling experiments in human. gnomAD was previously known as ExAC, when it contained only exome data. Ensembl display population frequencies from gnomAD. http://gnomad.broadinstitute.org/ | |
Golden path (genome size) | Genome assembly | The golden path is the length of the non-redundant reference assembly. It excludes alternative sequences and PARs, but includes the estimated size of the gaps. | |
GTF | File formats | GTF is a tab-delimited format that describes genomic features, such as genes and transcripts, and allows hierarchical linking of gene features. Ensembl store gene files as GTF, allow attachment of GTF files to view against the genome and allow custom annotation with the VEP using GTF files. | |
GVF | File formats | Genome Variation Format (GVF) is used to store variation data. It can be found on the Ensembl FTP site. | |
GWAS catalog | Phenotype source database | A curated database that extracts associations between variants and genes from published genome-wide association studies in human. Ensembl display phenotypes from the GWAS catalog. https://www.ebi.ac.uk/gwas/ | |
Haplotype (genome) | Alternative sequence | Known variations to the primary assembly, due to variability in the human genome sequence (eg. the highly variable MHC locus). These were included as part of the genome assembly when it was first produced. | |
Haplotype (variation) | Linkage disequilibrium | A set of variant alleles in a contiguous genomic region. A haplotype block describes a set of alleles which tend to be inherited together. | |
HapMap | Variation source database | An international collaboration formed to develop a haplotype map of the human genome and thus describe the common patterns of human DNA sequence variation using genotyping. Ensembl display sample genotypes and population frequencies from the HapMap project. https://www.genome.gov/10001688/international-hapmap-project/ | |
Hard masked | Repeat masking | Hard masked sequence is repeat masked with the repeat sequences replaced by Ns. Hard masked sequence files on the Ensembl FTP site have "rm" in their file name. | |
HGMD | Variation source database | Project aiming to collate all known (published) gene lesions responsible for human inherited disease. Full HGMD access is restricted to license holders so Ensembl supports the minimal public data release which consists of variant/mutation names and locations. http://www.hgmd.cf.ac.uk/ac/index.php | |
HGNC | Gene source database | HGNC is responsible for approving unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication. HGNC gene names are used for Ensembl human genes, where available, and for orthologous genes in other species. https://www.genenames.org/ | |
HGVS nomenclature | File formats | A set of recommendations for variant naming. The nomenclature describes the change a variant allele has on a named (genomic, transcript or protein) sequence. Can be used as an input for the VEP and displayed for known variants. http://varnomen.hgvs.org/ | |
High impact variant consequence | Variant impact | The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay. | |
Highest population MAF | Minor allele frequency | The highest minor allele frequency observed in any population typed for this variant. For human this includes the 1000 Genomes Project, gnomAD and UK10K. | |
Histone modification | Epigenome evidence | Covalent modifications to the histone proteins that make up the nucleosome, which are known to regulate gene expression. | |
Homoeologues | Homologues | Pairs of genes in a polyploid genome that underwent (a) hybridisation event(s). The original genes were orthologues in the two (or more) species that hybridised, and now occur in the same species. Since they did not arise through a duplication event, they are not paralogues. | |
Homologues | Gene | Specific genes that are descended from the same common sequence in an ancestor. | |
Identity | Alignments | A measure of how similar two alignment sequences are, specifically, what percentage of amino acids or nucleotides are the same in type and position between the two sequences. The value is dependent on which sequence is used as the reference, since it is a percentage of that reference. | |
IG C gene | IG gene | Constant chain immunoglobulin gene that undergoes somatic recombination before transcription | |
IG D gene | IG gene | Diversity chain immunoglobulin gene that undergoes somatic recombination before transcription | |
IG gene | Biotype | Immunoglobulin gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/. | |
IG J gene | IG gene | Joining chain immunoglobulin gene that undergoes somatic recombination before transcription | |
IG pseudogene | Pseudogene | Inactivated immunoglobulin gene. | |
IG V gene | IG gene | Variable chain immunoglobulin gene that undergoes somatic recombination before transcription | |
IMGT | Gene source database | International ImMunoGeneTics information system. Database of immunoglobulin and T-cell receptor annotation. We collaborate with IMGT on manual annotation of somatically recombined genes. http://www.imgt.org/ | |
IMPC | Phenotype source database | An international scientific endeavour to create and characterise the phenotype of 20,000 knockout mouse strains. Ensembl display phenotypes from the IMPC. http://www.mousephenotype.org/ | |
Inactive | Regulatory activity | When a regulatory feature bears no epigenetic modifications from the ones included in the Regulatory Build. | |
Incomplete terminal codon variant | Variant consequence | A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed | |
Indel | Sequence variant | An insertion and a deletion, affecting two or more nucleotides | |
Inframe deletion | Variant consequence | An inframe non synonymous variant that deletes bases from the coding sequence | |
Inframe insertion | Variant consequence | An inframe non synonymous variant that inserts bases into the coding sequence | |
INSDC | Gene source database | An international consortium between the ENA, GenBank and DDBJ to share submissions of nucleotide sequence. These sequences are used as evidence for annotating Ensembl genes. http://www.insdc.org/ | |
Insertion | Sequence variant | Insertion of one or more nucleotides | |
Interchromosomal breakpoint | Translocation | A rearrangement breakpoint between two different chromosomes. | |
Interchromosomal translocation | Translocation | A translocation where the regions involved are from different chromosomes. | |
Intergenic variant | Variant consequence | A sequence variant located in the intergenic region, between genes | |
InterProScan | Algorithm | InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases, including PROSITE, PRINTS, Pfam, Seg, SignalP, Gene3D, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY. Ensembl run InterProScan on all protein sequences, which uses these protein signatures to identify domains. https://www.ebi.ac.uk/interpro/ | |
Intrachromosomal breakpoint | Translocation | A rearrangement breakpoint within the same chromosome. | |
Intrachromosomal translocation | Translocation | A translocation where the regions involved are from the same chromosome. | |
Intron | Transcript | A transcribed genomic region that is removed from the RNA by splicing. | |
Intron variant | Variant consequence | A transcript variant occurring within an intron | |
Inversion | Structural variant | A continuous nucleotide sequence is inverted in the same position | |
Karyotype | Genome assembly | The number and appearance of the chromosomes of a genome. | |
LastZ | Pairwise whole genome alignment | LASTZ is a program for aligning DNA sequences in a pairwise manner. Its predecessor is BlastZ. | |
lincRNA (long intergenic ncRNA) | Long non-coding RNA (lncRNA) | Transcripts from long intergenic non-coding RNA loci with a length >200bp. Requires lack of coding potential and may not be conserved between species. | |
Linkage disequilibrium | Variant | A measure of how often two variants or specific sequences are inherited together. | |
Long non-coding RNA (lncRNA) | Processed transcript | A non-coding gene/transcript >200bp in length | |
Loss of heterozygosity | Structural variant | A functional variant whereby the sequence alteration causes a loss of function of one allele of a gene. | |
Low complexity regions | Repeat | Poly-purine or poly-pyrimidine stretches, or regions of extremely high AT or GC content. | |
Low impact variant consequence | Variant impact | A variant that is assumed to be mostly harmless or unlikely to change protein behaviour. | |
LTRs | Repeat | Long terminal repeats. | |
Macro lncRNA | Long non-coding RNA (lncRNA) | Unspliced lncRNAs that are several kb in size. | |
MAF | File formats | Multiple alignment format (MAF) stores genomic alignments. | |
Major allele | Allele (variant) | The allele which is most frequent in the global population, defined in human by the 1000 Genomes Project. The major allele may be the reference or the alternative allele, and may or may not be the ancestral allele. | |
MANE | Transcript | The Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq to identify transcripts that match GRCh38 and are 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR. | |
MANE Plus Clinical | MANE | Transcripts in the MANE Plus Clinical set are additional transcripts per locus necessary to support clinical variant reporting, for example transcripts containing known Pathogenic or Likely Pathogenic clinical variants not reportable using the MANE Select set. Note there may be additional clinically relevant transcripts in the wider RefSeq and Ensembl/GENCODE sets but not yet in MANE. | |
MANE Select | MANE | The Matched Annotation from NCBI and EMBL-EBI is a collaboration between Ensembl/GENCODE and RefSeq. The MANE Select is a default transcript per human gene that is representative of biology, well-supported, expressed and highly-conserved. This transcript set matches GRCh38 and is 100% identical between RefSeq and Ensembl/GENCODE for 5' UTR, CDS, splicing and 3'UTR. | |
Many-to-many orthologues | Orthologues | A type of orthologue assigned for a pair of species where multiple orthologues are found in both species, where the duplication events in both species occurred after the speciation event. | |
Marker | Genome annotation | A short sequence whose placement on the genome is known. | |
Mature miRNA variant | Variant consequence | A transcript variant located within the sequence of the mature miRNA | |
MetaLR | Algorithm | A tool for predicting the pathogenicity of single nucleotide variants using a logistic regression based ensemble method. | |
MGI | Gene source database | MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. MGI gene names are used for Ensembl mouse genes, where available. http://www.informatics.jax.org/ | |
Microsatellite | Repeat | A region in the genomic sequence containing short tandem repeats of 2-10bp. | |
Minor allele | Allele (variant) | The allele which is the second most frequent in the global population, defined in human by the 1000 Genomes Project. The minor allele may be the reference or the alternative allele, and may or may not be the ancestral allele. | |
Minor allele frequency | Minor allele | The frequency of the second most common allele in the specified population. | |
miRBase | Gene source database | The miRBase database is a searchable database of published miRNA sequences and annotation. These sequences are used as evidence for annotating Ensembl miRNA genes. http://www.mirbase.org/ | |
miRNA | ncRNA | A small RNA (~22bp) that silences the expression of target mRNA. | |
miscRNA | ncRNA | Miscellaneous RNA. A non-coding RNA that cannot be classified. | |
Missense variant | Variant consequence | A sequence variant that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved | |
Mobile element deletion | Structural variant | A deletion of a mobile element when comparing a reference sequence (has mobile element) to an individual sequence (does not have mobile element). | |
Mobile element insertion | Structural variant | A kind of insertion where the inserted sequence is a mobile element. | |
Moderate impact variant consequence | Variant impact | A non-disruptive variant that might change protein effectiveness. | |
Modifier impact variant consequence | Variant impact | Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact. | |
Multiple whole genome alignment | Whole genome alignment | An alignment between more than two whole genomes of a selected taxon. | |
MutationAssessor | Algorithm | A tool for assessing the functional impact of single nucleotide variants based on evolutionary conservation of the affected amino acid in protein homologues. | |
MySQL | File formats | MySQL is a database. All Ensembl data is stored in MySQL relational tables, which can be found on the FTP site and accessed directly by MySQL queries. | |
NA | Regulatory activity | When there is no available data in the cell type for this regulatory feature. | |
ncRNA | Processed transcript | A non-coding gene. | |
Newick | File formats | Newick is a tree format. Ensembl gene trees can be downloaded in Newick and it is used to store Ensembl species trees. | |
NMD transcript variant | Variant consequence | A variant in a transcript that is the target of NMD | |
Non coding | Long non-coding RNA (lncRNA) | Transcripts which are known from the literature to not be protein coding. | |
Non coding transcript exon variant | Variant consequence | A sequence variant that changes non-coding exon sequence in a non-coding transcript | |
Non coding transcript variant | Variant consequence | A transcript variant of a non coding RNA gene | |
Non-ATG start | Transcript | A transcript with a non-ATG start codon but which still encodes a methionine since the ribosomal machinery allows non-AUG to translate as methionine in specific cases. | |
Nonsense Mediated Decay | Biotype | A transcript with a premature stop codon considered likely to be subjected to targeted degradation. Nonsense-Mediated Decay is predicted to be triggered where the in-frame termination codon is found more than 50bp upstream of the final splice junction. | |
Novel patch | Patch | Novel patches represent new allelic loci. They can usually be considered as similar to haplotypes and are likely to be reclassified as such in the next genome assembly, but not necessarily. | |
Novel sequence insertion | Structural variant | An insertion the sequence of which cannot be mapped to the reference genome. | |
OMIA | Phenotype source database | An online database that describes the function and phenotypes associated with animal genes. Ensembl display phenotypes from OMIA. https://www.omia.org/ | |
OMIM | Phenotype source database | An online database that describes the function and phenotypes associated with human genes. Ensembl display phenotypes from OMIM and MIM morbid. https://www.omim.org/ | |
Open chromatin regions | Regulatory features | Regions of spaced out histones, making them accessible to protein interactions. | |
Orphanet | Phenotype source database | A catalogue of rare disease associations. Ensembl display phenotypes from Orphanet. http://www.orpha.net/ | |
Orthologues | Homologues | Orthologues are genes derived from a common ancestor through vertical descent (or speciation) and can be thought of as the direct evolutionary counterpart. | |
OrthoXML | File formats | OrthoXML is an XML format to allow the storage and comparison of orthology data. It is used to store Ensembl homologues. | |
Other paralogues | Paralogues | Paralogues which are very far away from the other members of a paralogue family. They are part of the same super-family, but the precise taxonomic relationship to other members is undefined, as the trees are too large to compute. | |
Pairwise interactions (WashU) | File formats | Pairwise interactions, such as those derived from Hi-C, can be stored in the WashU format and viewed in Ensembl. | |
Pairwise whole genome alignment | Whole genome alignment | An alignment between two whole genomes. | |
PAR | Genome assembly | Small regions of sequence identity located at the tips of the short and the long arms of the X and Y chromosomes where recombination and genetic exchange take place. Genes within the pseudoautosomal region are not sex linked. | |
Paralogues | Homologues | Genes (homologues) that have evolved by duplication. | |
Patch | Alternative sequence | New sequences that have been added to the genome assembly since its release. There are two types: fix and novel patches. | |
PDB | Protein source database | A repository for 3D biological macromolecular structure data. Ensembl provide links out to the PDB, and use structures to display the locations of variants in proteins. http://www.ebi.ac.uk/pdbe/ | |
Peak | Epigenome evidence | Locus identified from epigenome signal as having high signal, shown as a BigBed across the genome. | |
Pecan | Multiple whole genome alignment | Pecan is a global multiple sequence alignment program that makes practical the probabilistic consistency methodology for significant numbers of sequences of practically arbitrary length. | |
Peptide | Transcript | A sequence of amino acids, translated from a CDS. | |
Phase | Exon | The position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base and two means between the second and third base. Exons have a start and end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1. | |
Phenotype source database | Ensembl sources | Database from which Ensembl imports phenotype associations with genes and/or variants. | |
PhyloXML | File formats | PhyloXML is an XML language for the analysis, exchange, and storage of phylogenetic trees (or networks) and associated data. It is used to store Ensembl phylogenetic trees. | |
piRNA | ncRNA | An RNA that interacts with piwi proteins involved in genetic silencing. | |
Placed scaffold | Scaffold | A scaffold that can be positioned on a chromosome based on genetic mapping information. | |
Poised | Regulatory activity | When a regulatory feature displays an epigenetic signature with the potential to be activated. It is analogous to a sprinter in the blocks. | |
Polymorphic pseudogene | Pseudogene | Pseudogene owing to a SNP/indel but in other individuals/haplotypes/strains the gene is translated. | |
PolyPhen | Algorithm | A tool which predicts if missense variants are likely to affect protein function based on physical and comparative considerations. http://genetics.bwh.harvard.edu/pph2/ | |
Primary assembly | Genome assembly | The underlying genome sequence, without alternative sequence included. | |
Private allele | Allele (variant) | An allele which has only been identified in one individual or one family. A private allele may be the reference or the alternative allele, and may or may not be the ancestral allele. | |
Probe | Structural variant | A DNA sequence used experimentally to detect the presence or absence of a complementary nucleic acid. | |
Processed pseudogene | Pseudogene | Pseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome. | |
Processed transcript | Biotype | Gene/transcript that doesn't contain an open reading frame (ORF). | |
Progressive cactus | Multiple whole genome alignment | Progressive-Cactus is a next-generation aligner that stores whole-genome alignments in a graph structure. | |
Projection build | Ensembl Genebuild | A gene build method used by Ensembl for low coverage genomes, allowing genes to be annotated that span two scaffolds by mapping to the human gene. | |
Promoter flanking regions | Regulatory features | Transcription factor binding regions that flank promoters. | |
Promoters | Regulatory features | Regions at the 5' end of genes where transcription factors and RNA polymerase bind to initiate transcription. | |
Protein altering variant | Variant consequence | A sequence_variant which is predicted to change the protein encoded in the coding sequence | |
Protein coding | Biotype | Gene/transcript that contains an open reading frame (ORF). | |
Protein coding CDS not defined | Biotype | Alternatively spliced transcript of a protein coding gene for which we cannot define a CDS. | |
Protein coding LOF | Biotype | Not translated in the reference genome owing to a SNP/DIP but in other individuals/haplotypes/strains the transcript is translated. Replaces the polymorphic_pseudogene transcript biotype. | |
Protein domain | Peptide | A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics. | |
Pseudogene | Biotype | A gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) which disrupt the ORF. Thought to have arisen through duplication followed by loss of function. | |
PSL | File formats | PSL represents alignments and can be viewed in Ensembl. | |
QTL | Variant | Genetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure). | |
r2 | Linkage disequilibrium | The correlation between a pair of loci. It varies from 0 (loci are in complete linkage equilibrium) to 1 (loci are in complete linkage disequilibrium and coinherited). | |
RDF | File formats | Resource Description Framework (RDF) is used as a metadata data model. Ensembl use it to describe links from Ensembl annotations to those annotations in other databases. | |
Readthrough | Biotype | A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs). | |
Reference allele | Allele (variant) | The allele of a variant found in the reference genome currently being studied. The reference allele is not necessarily the major or ancestral allele. | |
RefSeq | Gene source database | NCBI's Reference Sequences (RefSeq) database is a curated database of GenBank's genomes, mRNAs and proteins. RefSeq attempts to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcripts (RNA), and protein products. https://www.ncbi.nlm.nih.gov/refseq/ | |
RefSeq Match | MANE Select | RefSeq transcripts that match 100% across the sequence, exon/intron structure and UTRs as part of the MANE project | |
Regulatory activity | Regulatory features | The activity state of a regulatory feature in a specific epigenome. | |
Regulatory features | Genome annotation | Regions that are predicted to regulate the expression of genes, based on the Ensembl regulatory build. | |
Regulatory region ablation | Variant consequence | A feature ablation whereby the deleted region includes a regulatory region | |
Regulatory region amplification | Variant consequence | A feature amplification of a region containing a regulatory region | |
Regulatory region variant | Variant consequence | A sequence variant located within a regulatory region | |
Repeat masking | Repeat | The method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs. | |
RepeatMasker | Algorithm | A program that screens DNA sequences for interspersed repeats and low-complexity regions and masks them, usually before the sequences are used in searches by alignment and homology-searching programs. http://www.repeatmasker.org/ | |
Repressed | Regulatory activity | When a regulatory feature is epigenetically repressed, having an epigenetic signature that prevents it from being active. | |
Retained intron | Long non-coding RNA (lncRNA) | An alternatively spliced transcript believed to contain intronic sequence relative to other, coding, transcripts of the same gene. | |
REVEL | Algorithm | A tool for predicting the pathogenicity of single nucleotide variants using an ensemble method. | |
Reverse strand | Genome annotation | DNA strand arbitrarily defined as the strand with its 5' end at the tip of the long chromosome arm (q). If a gene is reverse-stranded, its sense (sequence matching cDNA) is on the reverse strand. The reverse strand is the reverse complement of the forward strand. | |
Rfam | Gene source database | The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs). These sequences are used as evidence for annotating Ensembl non-coding genes. http://rfam.xfam.org/ | |
RNA repeats | Repeat | Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase. | |
Roadmap Epigenomics | Epigenome source database | Project aiming to develop publicly available reference epigenome maps from a variety of cell types. http://www.roadmapepigenomics.org/ | |
rRNA | ncRNA | The RNA component of a ribosome. | |
Satellite repeats | Repeat | Multiple copies of the same base sequence on a DNA sequence. The repeated pattern can vary in length from a single base to several thousand bases long. | |
Scaffold | Genome assembly | Scaffolds are sets of ordered, oriented contigs, assembled by sequence overlap. They are longer sequences than contigs, but shorter than full chromosomes. | |
Sense intronic | Long non-coding RNA (lncRNA) | A long non-coding transcript in introns of a coding gene that does not overlap any exons. | |
Sense overlapping | Long non-coding RNA (lncRNA) | A long non-coding transcript that contains a coding gene in its intron on the same strand. | |
Sequence variant | Variant | Variant that only affects a small locus | |
SGD | Gene source database | Canonical database for the molecular biology and genetics of Saccharomyces cerevisiae, source of the annotation seen in Ensembl. https://www.yeastgenome.org/ | |
Short tandem repeat variant | Variant | A variation that expands or contracts a tandem repeat with regard to a reference. | |
SIFT | Algorithm | A tool which predicts if missense variants are likely to affect protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids. http://sift.bii.a-star.edu.sg/ | |
Signal | Epigenome evidence | A count of the number of NGS reads from an epigenome experiment aligned to a locus, shown as a BigWig across the genome. | |
Similarity | Alignments | How well one sequence matches another, as calculated by an alignment program from the identical and conserved residues/nucleotides. | |
Simple repeats | Repeat | Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc. | |
siRNA | ncRNA | A small RNA (20-25bp) that silences the expression of target mRNA through the RNAi pathway. | |
Slice | Genome assembly | The term "slice" in Ensembl refers to a length of DNA sequence. A slice can be any length, from one base long to the entire length of a chromosome. | |
snoRNA | ncRNA | Small RNA molecules that are found in the cell nucleolus and are involved in the post-transcriptional modification of other RNAs. | |
SNP | Sequence variant | Single Nucleotide Polymorphism, substitution of a single nucleotide for another nucleotide | |
snRNA | ncRNA | Small RNA molecules that are found in the cell nucleus and are involved in the processing of pre messenger RNAs | |
Soft masked | Repeat masking | Soft masked sequence is repeat masked with the repeat sequences in lower case. Soft masked sequence files on the Ensembl FTP site have "sm" in their file name. | |
Splice acceptor variant | Variant consequence | A splice variant that changes the 2 base region at the 3' end of an intron | |
Splice donor variant | Variant consequence | A splice variant that changes the 2 base region at the 5' end of an intron | |
Splice region variant | Variant consequence | A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron | |
Start lost | Variant consequence | A codon variant that changes at least one base of the canonical start codon | |
Stop codon readthrough | Biotype | The coding sequence contains a stop codon that is translated (as supported by experimental evidence), and termination occurs instead at a canonical stop codon further downstream. It is currently unknown which codon is used to replace the translated stop codon, hence it is represented by 'X' in the protein sequence | |
Stop gained | Variant consequence | A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript | |
Stop lost | Variant consequence | A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript | |
Stop retained variant | Variant consequence | A sequence variant where at least one base in the terminator codon is changed, but the terminator remains | |
Structural variant | Variant | Variant that affects a large locus | |
Substitution | Sequence variant | A sequence alteration where the length of the deleted sequence is the same as the length of the inserted sequence. | |
SwissProt | UniProt | UniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. These sequences are used as evidence for annotating Ensembl genes. | |
Synonymous variant | Variant consequence | A sequence variant where there is no resulting change to the encoded amino acid | |
Synteny | Whole genome alignment | In a genomic context we refer to syntenic regions if the sequence is globally conserved between two species. | |
TAGENE | Transcript | Long-read sequence data is computationally processed into non-redundant transcript models which are manually appraised by the Ensembl-Havana annotation team. | |
Tandem duplication | Structural variant | A duplication consisting of 2 identical adjacent regions. | |
Tandem repeat | Variant | Two or more adjacent copies of a region (of length greater than 1). | |
Tandem repeats | Repeat | Typically found at the centromeres and telomeres of chromosomes, these are duplications of more complex 100-200 base sequences. | |
TEC (To be Experimentally Confirmed) | Biotype | Regions with EST clusters that have polyA features that could indicate the presence of protein coding genes. These require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies. | |
TF binding site variant | Variant consequence | A sequence variant located within a transcription factor binding site | |
TFBS ablation | Variant consequence | A feature ablation whereby the deleted region includes a transcription factor binding site | |
TFBS amplification | Variant consequence | A feature amplification of a region containing a transcription factor binding site | |
Toplevel | Genome assembly | The largest continuous sequence for an organism. The official technical definition for toplevel sequences is 'sequence regions in the genome assembly that are not a component of another sequence region'. For example, when a genome is assembled into chromosomes, toplevel sequences will be chromosomes and unplaced scaffolds. If a genome has only been assembled into scaffolds, then toplevel sequences are scaffolds and unplaced contigs. | |
TOPMed | Variation source database | Whole genome variant calling data from humans worldwide with heart, lung, blood, and sleep disorders. Ensembl display population frequencies from TOPMed. https://www.nhlbi.nih.gov/science/trans-omics-precision-medicine-topmed-program | |
TR C gene | TR gene | Constant chain T cell receptor gene that undergoes somatic recombination before transcription | |
TR D gene | TR gene | Diversity chain T cell receptor gene that undergoes somatic recombination before transcription | |
TR gene | Biotype | T cell receptor gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/. | |
TR J gene | TR gene | Joining chain T cell receptor gene that undergoes somatic recombination before transcription | |
TR V gene | TR gene | Variable chain T cell receptor gene that undergoes somatic recombination before transcription | |
Transcribed pseudogene | Pseudogene | Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'. | |
Transcript | Gene | A transcript is the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons being separated by introns. The exons/introns are transcribed and then the introns spliced out. Transcripts may or may not encode a protein | |
Transcript ablation | Variant consequence | A feature ablation whereby the deleted region includes a transcript feature | |
Transcript amplification | Variant consequence | A feature amplification of a region containing a transcript | |
Transcript haplotype | Haplotype (variation) | The transcript sequence derived from one copy of a gene in an individual, based on the phased 1000 Genomes genotype data. CDS and protein sequences are derived from this. | |
Transcript support level | Transcript | The Transcript Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users, based on the type and quality of the alignments used to annotate the transcript. | |
Transcription factor | Epigenome evidence | A protein that binds to DNA and controls the rate of transcription. | |
Transcription factor binding motif | Regulatory features | Short genomic sequence that is known to bind to a particular transcription factor. | |
Transcription factor binding sites | Regulatory features | Sites which bind transcription factors, for which no other role can be determined as yet. | |
Translated Blat | Pairwise whole genome alignment | Translated Blat can be used for alignment of the coding regions of genomes only in a pairwise manner. | |
Translated pseudogene | Pseudogene | Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed' and 'Unprocessed'. | |
Translocation | Structural variant | A region of nucleotide sequence that has translocated to a new position | |
TrEMBL | UniProt | UniProt/TrEMBL (Translated EMBL database) is the section of UniProt containing the computer-annotated protein translations of all coding sequences (CDS) present in the ENA (formerly EMBL-bank) that are not yet incorporated into the UniProt/SwissProt database. These sequences are used as evidence for annotating Ensembl genes. | |
tRNA | ncRNA | A transfer RNA, which acts as an adaptor molecule for translation of mRNA. | |
TSL 1 | Transcript support level | A transcript where all splice junctions are supported by at least one non-suspect mRNA. | |
TSL 2 | Transcript support level | A transcript where the best supporting mRNA is flagged as suspect or the support is from multiple ESTs | |
TSL 3 | Transcript support level | A transcript where the only support is from a single EST | |
TSL 4 | Transcript support level | A transcript where the best supporting EST is flagged as suspect | |
TSL 5 | Transcript support level | A transcript where no single transcript supports the model structure. | |
TSL NA | Transcript support level | A transcript that was not analysed for TSL. | |
Type I Transposons/LINE | Repeat | Long Interspersed Elements. Retrotransposed elements in the genome containing open reading frames encoding (often inactive) reverse transcription machinery. | |
Type I Transposons/SINE | Repeat | Short Interspersed Elements. Retrotransposed elements less than 500 bp that contain tRNA, snRNA and rRNA, which require other mobile elements to be transposed. Alu elements are a type of SINE. | |
Type II Transposons | Repeat | Elements that have been transposed and duplicated around the genome by excision and ligation. | |
UCSC Genome Browser | Gene source database | A genome browser hosted at the University of California Santa Cruz. Ensembl collaborates with UCSC in projects such as GENCODE, CCDS and TSL. https://genome.ucsc.edu/ | |
UK10K | Variation source database | Study comparing exomes of 6000 diseased individuals with 4000 healthy individuals in the UK in order to identify disease-causing variants. Ensembl display population frequencies from the control group. https://www.uk10k.org/ | |
UniProt | Gene source database | Database of protein sequence and functional information, based at European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). These sequences are used as evidence for annotating Ensembl genes. http://www.uniprot.org/ | |
UniProt Match | UniProt | The UniProt identifier that matches to the Ensembl transcript. This may be a UniProt protein isoform and will have a number suffix, or may just refer to a UniProt entry. | |
UniSTS | Marker | UniSTS is an NCBI resource for non-redundant Sequence Tagged Sites (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences determined by ePCR. | |
Unitary pseudogene | Pseudogene | A species-specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species. | |
Unknown repeat | Repeat | Repeats that cannot be classified. | |
Unplaced scaffold | Scaffold | A scaffold that cannot be positioned on a chromosome. | |
Unprocessed pseudogene | Pseudogene | Pseudogene that can contain introns, since it is produced by gene duplication. | |
Untranslated region | Transcript | The region of a coding cDNA which is not translated. | |
Upstream gene variant | Variant consequence | A sequence variant located 5' of a gene | |
Variant | Genome annotation | Locus where the sequence differs between individuals of the same species | |
Variant consequence | Variant | The effect that the variant has on each feature that it overlaps. A variant will have a consequence for each feature that it overlaps. | |
Variant impact | Variant consequence | A subjective classification of the severity of the variant consequence, based on agreement with SnpEff. | |
Variation source database | Ensembl sources | Database from which Ensembl imports variation data, including loci, sample genotypes, population frequencies and phenotype associations. | |
vaultRNA | ncRNA | Short non-coding RNA genes that form part of the vault ribonucleoprotein complex. | |
VCF | File formats | VCF is a standard format for listing genetic variation, and is the output format of many variant callers. It can be used as an input for the Ensembl VEP and is used to store and download variation data in Ensembl. | |
VEP | Algorithm | The Variant Effect Predictor (VEP) is an Ensembl tool that predicts the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. | |
VEP cache | File formats | A VEP cache contains all the gene and variant data needed to run a VEP query, and can be used to run large queries quickly on your own machine. Caches can be installed as part of your VEP installation, or downloaded from the FTP site. | |
Wasabi | Alignments | An application for displaying sequence alignments with custom colour-annotation, which Ensembl uses to display gene tree and family alignments. http://wasabiapp.org/ | |
Whole genome alignment | Alignments | An alignment carried out using the whole genome sequence. | |
Wiggle | File formats | Wiggle format expresses scores across genomic loci, requiring fixed size bins for the scores. It can be uploaded to view in Ensembl. | |
Within species paralogues | Paralogues | Two or more versions of a duplicated gene in a single species. In a gene tree, the genes are separated by a duplication node. | |
YAC | Clone | Yeast artificial chromosome. Derived from a bacterial plasmid, a YAC contains a yeast centromeric region, a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell. | |
zFIN | Gene source database | An online biological database of information about the zebrafish (Danio rerio). zFIN gene names are used for Ensembl zebrafish genes, where available. https://zfin.org/ | |
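
As an illustration of the VCF entry above, the following is a minimal Python sketch of how one VCF data line could be split into its eight fixed, tab-separated columns. The function name `parse_vcf_line` and the example record (including the identifier `rs0000`) are made up for illustration and are not part of Ensembl or VEP. Real files also carry header lines and optional genotype columns, which this sketch ignores.

```python
# The eight fixed columns of a VCF data line, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dictionary (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))

    # POS is a 1-based integer position on the reference sequence.
    record["POS"] = int(record["POS"])

    # ALT may list several alternate alleles separated by commas.
    record["ALT"] = record["ALT"].split(",")

    # INFO holds semicolon-separated key=value pairs or bare flags.
    info = {}
    for item in record["INFO"].split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    record["INFO"] = info
    return record


# Made-up example record, for illustration only.
example = "1\t12345\trs0000\tA\tG,T\t50\tPASS\tDP=14;DB"
parsed = parse_vcf_line(example)
print(parsed["CHROM"], parsed["POS"], parsed["REF"], parsed["ALT"])
```

For real data, an established VCF parsing library is a safer choice than hand-rolled splitting, since it also handles headers, escaping and sample columns.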