Ensembl Variation - Schema documentation

This document gives a high-level description of the tables that make up the Ensembl variation schema. Tables are listed in alphabetical order within each group, and the purpose of each table is explained. It is intended to help people familiarise themselves with the schema when encountering it for the first time, or when they need to use tables that they have not used before.

This document refers to version 115 of the Ensembl variation schema.

The variation database schema diagram (PDF format) is available here:

List of the tables:

Failed tables

failed_allele
failed_description
failed_structural_variation
failed_variation
failed_variation_feature

Genotype tables

compressed_genotype_region
compressed_genotype_var
population_genotype
read_coverage
sample_genotype_multiple_bp
tmp_sample_genotype_single_bp

Metadata tables

meta
meta_coord

Other tables

allele_code
coord_system
genotype_code
seq_region
submitter_handle
subsnp_handle

Phenotype tables

phenotype
phenotype_feature
phenotype_feature_attrib
phenotype_ontology_accession

Protein tables

protein_function_predictions
protein_function_predictions_attrib
translation_md5

Sample tables

display_group
individual
individual_synonym
individual_type
population
population_structure
population_synonym
sample
sample_population
sample_synonym

Source/study tables

associate_study
publication
source
study
submitter
variation_citation

Structural variation tables

structural_variation
structural_variation_association
structural_variation_feature
structural_variation_sample

Variation effect tables

motif_feature_variation
regulatory_feature_variation
transcript_variation
variation_genename
variation_hgvs

Variation set tables

variation_set
variation_set_structural_variation
variation_set_structure
variation_set_variation

Variation tables

allele
allele_synonym
variation
variation_attrib
variation_feature
variation_synonym

Attributes tables

These tables define the variation attributes data.

attrib

Column	Type	Default value	Description	Index
attrib_id	INT(11)	-	Primary key	primary key
attrib_type_id	SMALLINT(5)	0	Key into the attrib_type table, identifies the type of this attribute	unique key: type_val_idx
value	TEXT	-	The value of this attribute	unique key: type_val_idx

attrib_id	attrib_type_id	value
1	469	SO:0001483
2	470	SNV
3	471	SNP
4	469	SO:1000002
5	470	substitution
6	469	SO:0001019
7	470	copy_number_variation
8	471	CNV
9	469	SO:0000667
10	470	insertion
11	469	SO:0000159
12	470	deletion
13	469	SO:1000032
14	470	indel
15	469	SO:0000705
16	470	tandem_repeat
17	469	SO:0001059
18	470	sequence_alteration
19	469	SO:0001628
20	470	intergenic_variant
21	471	INTERGENIC

Column	Type	Default value	Description	Index
attrib_set_id	INT(11)	0	Primary key	unique key: set_idx
attrib_id	INT(11)	0	Key of an attribute in this set	unique key: set_idx key: attrib_idx

attrib_type_id	code	name	description
469	SO_accession	SO accession	Sequence Ontology accession
470	SO_term	SO term	Sequence Ontology term
471	display_term	display term	Ensembl display term
472	NCBI_term	NCBI term	NCBI term
473	feature_SO_term	feature SO term	Sequence Ontology term for the associated feature
474	rank	rank	Relative severity of this variation consequence
475	polyphen_prediction	polyphen prediction	PolyPhen-2 prediction
476	sift_prediction	sift prediction	SIFT prediction
477	short_name	Short name	A shorter name for an instance, e.g. a VariationSet
478	dbsnp_clin_sig	dbSNP/ClinVar clinical significance	The clinical significance of a variant as reported by ClinVar and dbSNP
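
The attrib and attrib_type tables above are linked through attrib_type_id. The following is a minimal sketch of that lookup, using an in-memory SQLite database as a stand-in for MySQL and a few of the example rows shown above. The helper name values_for_code and the simplified column types are assumptions made for illustration only.

```python
import sqlite3

# In-memory stand-in for the MySQL tables; column types are simplified for SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attrib_type (
    attrib_type_id INTEGER PRIMARY KEY,
    code TEXT, name TEXT, description TEXT
);
CREATE TABLE attrib (
    attrib_id INTEGER PRIMARY KEY,
    attrib_type_id INTEGER,
    value TEXT
);
""")

# A subset of the example rows listed in the documentation.
conn.executemany("INSERT INTO attrib_type VALUES (?, ?, ?, ?)", [
    (469, "SO_accession", "SO accession", "Sequence Ontology accession"),
    (470, "SO_term", "SO term", "Sequence Ontology term"),
    (471, "display_term", "display term", "Ensembl display term"),
])
conn.executemany("INSERT INTO attrib VALUES (?, ?, ?)", [
    (1, 469, "SO:0001483"), (2, 470, "SNV"), (3, 471, "SNP"),
    (4, 469, "SO:1000002"), (5, 470, "substitution"),
])


def values_for_code(conn, code):
    """Return all attrib values whose attrib_type has the given code (hypothetical helper)."""
    rows = conn.execute(
        "SELECT a.value FROM attrib a "
        "JOIN attrib_type t ON a.attrib_type_id = t.attrib_type_id "
        "WHERE t.code = ? ORDER BY a.attrib_id",
        (code,),
    )
    return [row[0] for row in rows]
```

For example, values_for_code(conn, "SO_term") returns the Sequence Ontology terms stored in attrib for the rows loaded here.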

Column	Type	Default value	Description	Index
failed_allele_id	INT(11)	-	Primary key, internal identifier.	primary key
allele_id	INT(10)	-	Foreign key references to the allele table.	unique key: allele_idx
failed_description_id	INT(10)	-	Foreign key references to the failed_description table.	unique key: allele_idx

failed_description_id	description
1	Variant maps to more than 3 different locations
2	None of the variant alleles match the reference allele
3	Variant has more than 3 different alleles
4	Loci with no observed variant alleles in dbSNP
5	Variant does not map to the genome
6	Variant has no genotypes
7	Genotype frequencies do not add up to 1
8	Variant has no associated sequence
9	Variant submission has been withdrawn by the 1000 genomes project due to high false positive rate
11	Additional submitted allele data from dbSNP does not agree with the dbSNP refSNP alleles
12	Variant has more than 3 different submitted alleles
13	Alleles contain non-nucleotide characters
14	Alleles contain ambiguity codes
15	Mapped position is not compatible with reported alleles
16	Flagged as suspect by dbSNP
17	Variant can not be re-mapped to the current assembly
18	Supporting evidence can not be re-mapped to the current assembly
19	Variant maps to more than one genomic location
20	Variant at first base in sequence
21	Reference allele does not match the bases at this genome location
22	Alleles cannot be resolved

Column	Type	Default value	Description	Index
failed_structural_variation_id	INT(11)	-	Primary key, internal identifier.	primary key
structural_variation_id	INT(10)	-	Foreign key references to the structural_variation table.	unique key: structural_variation_idx
failed_description_id	INT(10)	-	Foreign key references to the failed_description table.	unique key: structural_variation_idx

Column	Type	Default value	Description	Index
failed_variation_id	INT(11)	-	Primary key, internal identifier.	primary key
variation_id	INT(10)	-	Foreign key references to the variation table.	unique key: variation_idx
failed_description_id	INT(10)	-	Foreign key references to the failed_description table.	unique key: variation_idx
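
Each failed_* table pairs an object with a failed_description_id, so the reasons for a failure are obtained by joining to failed_description. The sketch below shows this for failed_variation with an in-memory SQLite database. The variation_id values are invented for the example, and the helper name failure_reasons is hypothetical.

```python
import sqlite3

# In-memory stand-in for the MySQL tables; types are simplified for SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE failed_description (
    failed_description_id INTEGER PRIMARY KEY,
    description TEXT
);
CREATE TABLE failed_variation (
    failed_variation_id INTEGER PRIMARY KEY,
    variation_id INTEGER,
    failed_description_id INTEGER,
    UNIQUE (variation_id, failed_description_id)
);
""")

# Two descriptions taken from the documented list.
conn.executemany("INSERT INTO failed_description VALUES (?, ?)", [
    (5, "Variant does not map to the genome"),
    (6, "Variant has no genotypes"),
])
# Illustrative rows only: variation_id 101 and 202 are made up.
conn.executemany("INSERT INTO failed_variation VALUES (?, ?, ?)", [
    (1, 101, 5), (2, 101, 6), (3, 202, 5),
])


def failure_reasons(conn, variation_id):
    """List the failure descriptions recorded for one variation (hypothetical helper)."""
    rows = conn.execute(
        "SELECT d.description FROM failed_variation f "
        "JOIN failed_description d "
        "ON f.failed_description_id = d.failed_description_id "
        "WHERE f.variation_id = ? ORDER BY d.failed_description_id",
        (variation_id,),
    )
    return [row[0] for row in rows]
```

The same join pattern should apply to failed_allele, failed_structural_variation and failed_variation_feature, with the corresponding foreign key column swapped in.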

Column	Type	Default value	Description	Index
failed_variation_feature_id	INT	-	Primary key, internal identifier.	primary key
variation_feature_id	INT	-	Foreign key references to the variation_feature table.	unique key: variation_feature_idx
failed_description_id	INT	-	Foreign key references to the failed_description table.	unique key: variation_feature_idx

Column	Type	Default value	Description	Index
sample_id	INT(10)	-	Foreign key references to the sample table.	key: sample_idx
seq_region_id	INT(10)	-	Foreign key references seq_region in core db. Refers to the seq_region which this variant is on, which may be a chromosome, a clone, etc...	key: pos_idx
seq_region_start	INT(11)	-	The start position of the variation on the seq_region.	key: pos_idx
seq_region_end	INT(11)	-	The end position of the variation on the seq_region.
seq_region_strand	TINYINT(4)	-	The orientation of the variation on the seq_region.
genotypes	BLOB	NULL	Encoded representation of the genotype data. Each row in the compressed table stores genotypes from one individual/sample in one fixed-size region of the genome (arbitrarily defined as 100 Kb). The compressed string (built with Perl's pack function) consists of a repeating triplet of elements: a distance in base pairs from the previous genotype; a variation dbID; a genotype_code_id identifier. For example, a given row may have a start position of 1000, indicating the chromosomal position of the first genotype in this row. The unpacked genotypes field may then contain the following elements: 0, 1, 1, 20, 2, 5, 35, 3, 3, ... The first genotype ("0,1,1") has a position of 1000 + 0 = 1000, and corresponds to the variation with the internal identifier 1 and the genotype_code_id corresponding to the genotype A|G (internal ID 1). The second genotype ("20,2,5") has a position of 1000 + 20 = 1020, internal variation_id 2 and the genotype_code_id corresponding to the genotype C|C (internal ID 5). The third genotype similarly has a position of 1020 + 35 = 1055, and so on.
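
The triplet layout of the genotypes column can be illustrated with a short sketch. It starts from the already unpacked list of integers; turning the raw BLOB into that list depends on the Perl pack template, which this document does not specify, so that step is left out. The function name decode_triplets is hypothetical.

```python
from typing import Iterable, List, Tuple


def decode_triplets(region_start: int, elements: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Turn unpacked (distance, variation_id, genotype_code_id) triplets
    into (position, variation_id, genotype_code_id) tuples.

    Each distance is relative to the previous genotype, so positions
    are accumulated starting from the row's seq_region_start.
    """
    values = list(elements)
    if len(values) % 3 != 0:
        raise ValueError("expected a whole number of triplets")

    position = region_start
    decoded = []
    for i in range(0, len(values), 3):
        distance, variation_id, genotype_code_id = values[i:i + 3]
        position += distance  # distance from the previous genotype
        decoded.append((position, variation_id, genotype_code_id))
    return decoded
```

Applied to the worked example from the column description, decode_triplets(1000, [0, 1, 1, 20, 2, 5, 35, 3, 3]) yields positions 1000, 1020 and 1055 for variations 1, 2 and 3.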

Column	Type	Default value	Description	Index
coord_system_id	INT(10)	-	Primary key, internal identifier.	primary key
species_id	INT(10)	1	Identifies the species for multi-species databases.	unique key: rank_idx unique key: name_idx key: species_idx
name	VARCHAR(40)	-	Coordinate system name, e.g. 'chromosome', 'contig', 'scaffold' etc.	unique key: name_idx
version	VARCHAR(255)	NULL	Assembly.	unique key: name_idx
rank	INT	-	Coordinate system rank.	unique key: rank_idx
attrib	SET: default_version sequence_level	NULL	Coordinate system attrib (e.g. "top_level", "sequence_level").

Column	Type	Default value	Description	Index
population_genotype_id	INT(10)	-	Primary key, internal identifier.	primary key
variation_id	INT(11)	-	Foreign key references to the variation table.	key: variation_idx
subsnp_id	INT(11)	NULL	Foreign key references to the subsnp_handle table.	key: subsnp_idx
genotype_code_id	INT(11)	NULL	Foreign key reference to the genotype_code table.
frequency	FLOAT	NULL	Frequency of the genotype in the population.
population_id	INT(10)	NULL	Foreign key references to the population table.	key: population_idx
count	INT(10)	NULL	Number of individuals/samples who have this genotype, in this population.
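
The frequency and count columns above are related: within one population, a genotype's frequency can be derived from the per-genotype counts. The sketch below assumes that frequency is simply count divided by the total count for that variation and population; the document does not state how the stored values are actually computed, so treat this as an illustration rather than the loading pipeline's method. The function name is hypothetical.

```python
import math
from typing import Dict


def genotype_frequencies(counts: Dict[int, int]) -> Dict[int, float]:
    """Map genotype_code_id -> frequency, given genotype_code_id -> count.

    Assumes frequency = count / sum of counts within one population.
    Returns an empty dict when there are no observations.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: count / total for code, count in counts.items()}


# Illustrative counts for three genotype codes in one population.
example = genotype_frequencies({1: 50, 2: 30, 3: 20})
# Frequencies derived this way sum to 1, up to floating-point error.
assert math.isclose(sum(example.values()), 1.0)
```

A check of this kind is related to the failed_description entry "Genotype frequencies do not add up to 1".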

Column	Type	Default value	Description	Index
variation_id	INT(10)	-	Primary key. Foreign key references to the variation table.	key: variation_idx
subsnp_id	INT(15)	NULL	Foreign key references to the subsnp_handle table.	key: subsnp_idx
allele_1	VARCHAR(25000)	NULL	One of the alleles of the genotype, e.g. "TAG".
allele_2	VARCHAR(25000)	NULL	The other allele of the genotype.
sample_id	INT(10)	NULL	Foreign key references to the sample table.	key: sample_idx

Column	Type	Default value	Description	Index
meta_id	INT(10)	-	Primary key, internal identifier.	primary key
species_id	INT	1	...	unique key: species_key_value_idx key: species_value_idx
meta_key	VARCHAR( 64 )	-	Name of the meta entry, e.g. "schema_version".	unique key: species_key_value_idx
meta_value	VARCHAR( 255 )	-	Corresponding value of the key, e.g. "61".	unique key: species_key_value_idx key: species_value_idx
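
The meta table is a key/value store, and a key may carry several values because the unique index spans species_id, meta_key and meta_value together. A minimal lookup sketch follows, using in-memory SQLite as a stand-in. The rows, including the "example_key" entries, and the helper name meta_values are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for the MySQL meta table; types are simplified for SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE meta (
        meta_id INTEGER PRIMARY KEY,
        species_id INTEGER DEFAULT 1,
        meta_key TEXT NOT NULL,
        meta_value TEXT NOT NULL,
        UNIQUE (species_id, meta_key, meta_value)
    )
""")
# Illustrative rows only.
conn.executemany(
    "INSERT INTO meta (species_id, meta_key, meta_value) VALUES (?, ?, ?)",
    [(1, "schema_version", "115"), (1, "example_key", "a"), (1, "example_key", "b")],
)


def meta_values(conn, key, species_id=1):
    """Return every value stored under a meta_key for one species (hypothetical helper)."""
    rows = conn.execute(
        "SELECT meta_value FROM meta "
        "WHERE species_id = ? AND meta_key = ? ORDER BY meta_id",
        (species_id, key),
    )
    return [row[0] for row in rows]
```

Returning a list rather than a single value reflects that the same key can legitimately appear more than once.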

Column	Type	Default value	Description	Index
table_name	VARCHAR(40)	-	Name of the feature table, e.g. "variation_feature".	unique: key
coord_system_id	INT(10)	-	Foreign key to the core database coord_system table. Refers to the coordinate system in which features from this table can be found.	unique: key
max_length	INT	NULL	Maximum length of the feature.

Column	Type	Default value	Description	Index
allele_code_id	INT(11)	-	Primary key, internal identifier.	primary key
allele	VARCHAR(60000)	NULL	String representing the allele. Has a unique constraint on the first 1000 characters (max allowed by MySQL).	unique key: allele_idx

Column	Type	Default value	Description	Index
genotype_code_id	INT(11)	-	Internal identifier.	key: genotype_code_id
allele_code_id	INT(11)	-	Foreign key reference to allele_code table.	key: allele_code_id
haplotype_id	TINYINT(2)	-	Sorting order of the genotype's alleles.
phased	TINYINT(2)	NULL	Indicates if this genotype is phased
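
A genotype is stored as several genotype_code rows sharing one genotype_code_id, one row per allele, with haplotype_id giving the order. The sketch below rebuilds a display string from such rows. The separator convention ("|" when phased, "/" otherwise) is an assumption, as are the example rows and the function name genotype_string.

```python
from typing import Dict, List, Optional, Tuple

# (genotype_code_id, allele_code_id, haplotype_id, phased)
GenotypeCodeRow = Tuple[int, int, int, Optional[int]]


def genotype_string(genotype_code_id: int,
                    genotype_code_rows: List[GenotypeCodeRow],
                    allele_codes: Dict[int, str]) -> str:
    """Join the alleles of one genotype in haplotype_id order."""
    rows = sorted(
        (row for row in genotype_code_rows if row[0] == genotype_code_id),
        key=lambda row: row[2],  # haplotype_id is the sorting order
    )
    if not rows:
        raise KeyError(f"unknown genotype_code_id {genotype_code_id}")
    # Assumed convention: "|" for phased genotypes, "/" for unphased or NULL.
    separator = "|" if all(row[3] for row in rows) else "/"
    return separator.join(allele_codes[row[1]] for row in rows)


# Illustrative data: allele_code_id -> allele string.
allele_codes = {1: "A", 2: "G", 3: "C"}
genotype_code_rows = [
    (1, 1, 1, 1), (1, 2, 2, 1),        # phased A, G
    (5, 3, 1, 1), (5, 3, 2, 1),        # phased C, C
    (9, 2, 2, None), (9, 1, 1, None),  # unphased, rows stored out of order
]
```

With this data, genotype 1 is rendered as "A|G" and genotype 5 as "C|C", matching the genotypes used in the compressed_genotype_region example.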

Column	Type	Default value	Description	Index
seq_region_id	INT(10)	-	Primary key. Foreign key references seq_region in core db. Refers to the seq_region which this variant is on, which may be a chromosome, a clone, etc...	primary key
name	VARCHAR(255)	-	The name of this sequence region.	unique key: name_cs_idx
coord_system_id	INT(10)	-	Foreign key references to the coord_system table.	unique key: name_cs_idx key: cs_idx

Column	Type	Default value	Description	Index
handle_id	INT(10)	-	Primary key, internal identifier.	primary key
handle	VARCHAR(25)	NULL	Short string assigned to the data submitter.	unique: key

Column	Type	Default value	Description	Index
subsnp_id	INT(11)	-	Primary key. It corresponds to the subsnp identifier (ssID) from dbSNP. This ssID is stored in this table without the "ss" prefix. e.g. "120258606" instead of "ss120258606".	primary key
handle	VARCHAR(20)	NULL	The name of the dbSNP handler who submitted the ssID. Name of the synonym (a different sample_id).

Column	Type	Default value	Description	Index
phenotype_id	INT(10)	-	Primary key, internal identifier.	primary key
stable_id	VARCHAR(255)	NULL	Ensembl stable identifier for the phenotype	key: stable_idx
name	VARCHAR(50)	NULL	Phenotype short name. e.g. "CAD".	key: name_idx
description	VARCHAR(255)	NULL	Phenotype long name. e.g. "Coronary Artery Disease".	unique key: desc_idx
class_attrib_id	INT	NULL	Class of phenotype entry, e.g. trait, non_specified, tumour. Used for filtering.

Column	Type	Default value	Description	Index
phenotype_feature_id	INT(11)	-	Primary key, internal identifier.	primary key
phenotype_id	INT(11)	NULL	Foreign key references to the phenotype table.	key: phenotype_idx
source_id	INT(11)	NULL	Foreign key references to the source table.	key: source_idx
study_id	INT(11)	NULL	Foreign key references to the study table.
type	ENUM: Gene Variation StructuralVariation SupportingStructuralVariation QTL RegulatoryFeature	NULL	Type of object associated.	key: object_idx key: type_idx
object_id	VARCHAR(255)	NULL	Stable identifier for associated object.	key: object_idx
is_significant	TINYINT(1)	'1'	Flag indicating if the association is statistically significant in the given study.
seq_region_id	INT(11)	NULL	Foreign key references seq_region in core db. Refers to the seq_region which this feature is on, which may be a chromosome, a clone, etc...	key: pos_idx
seq_region_start	INT(11)	NULL	The start position of the feature on the seq_region.	key: pos_idx
seq_region_end	INT(11)	NULL	The end position of the feature on the seq_region.	key: pos_idx
seq_region_strand	TINYINT(4)	NULL	The orientation of the feature on the seq_region.
DNA_type	ENUM: Germline Somatic	NULL	The type of DNA, 'Germline' or 'Somatic'.	key: dna_type_idx
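
Features located with seq_region_id, seq_region_start and seq_region_end are typically fetched by region overlap. The sketch below shows the standard overlap condition (the feature starts at or before the region end and ends at or after the region start) against a reduced copy of this table in in-memory SQLite. The rows, object identifiers and the helper name features_overlapping are invented for illustration.

```python
import sqlite3

# Reduced stand-in for the table: only the columns needed for the query.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE phenotype_feature (
        phenotype_feature_id INTEGER PRIMARY KEY,
        object_id TEXT,
        seq_region_id INTEGER,
        seq_region_start INTEGER,
        seq_region_end INTEGER
    )
""")
# Illustrative rows only.
conn.executemany("INSERT INTO phenotype_feature VALUES (?, ?, ?, ?, ?)", [
    (1, "example_object_1", 7, 100, 200),
    (2, "example_object_2", 7, 500, 500),
    (3, "example_object_3", 8, 150, 160),
])


def features_overlapping(conn, seq_region_id, start, end):
    """Return object_ids of features overlapping [start, end] on one seq_region."""
    rows = conn.execute(
        "SELECT object_id FROM phenotype_feature "
        "WHERE seq_region_id = ? "
        "AND seq_region_start <= ? AND seq_region_end >= ? "
        "ORDER BY phenotype_feature_id",
        (seq_region_id, end, start),
    )
    return [row[0] for row in rows]
```

Filtering on seq_region_id first keeps the query aligned with the column order of pos_idx listed in the table above.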

Column	Type	Default value	Description	Index
phenotype_feature_id	INT(11)	-	Foreign key, references to the phenotype_feature table.	key: phenotype_feature_idx
attrib_type_id	INT(11)	NULL	Foreign key references to the attrib_type table.	key: type_value_idx
value	VARCHAR(255)	NULL	The value of the attribute.	key: type_value_idx

Column	Type	Default value	Description	Index
translation_md5_id	INT(11)	-	Identifies the MD5 hash corresponding to the protein sequence to which these predictions apply	primary key
analysis_attrib_id	INT(11)	-	Identifies the analysis (sift, polyphen etc.) that produced these predictions	primary key
prediction_matrix	MEDIUMBLOB	NULL	A compressed binary string containing the predictions for all possible amino acid substitutions in this protein. See the explanation here
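
Predictions are keyed by the MD5 hash of the protein sequence rather than by a translation identifier. The sketch below computes such a hash with Python's hashlib. It assumes the hash is the hexadecimal MD5 digest of the plain amino acid string; the exact normalisation used when the database is built is not described in this document. The function name translation_md5 and the example peptide are made up.

```python
import hashlib


def translation_md5(peptide: str) -> str:
    """Hex MD5 digest of a protein sequence.

    Assumption: the sequence is hashed as plain ASCII text with no
    extra normalisation (case, whitespace or stop symbols untouched).
    """
    return hashlib.md5(peptide.encode("ascii")).hexdigest()


# Illustrative peptide, not taken from any real translation.
digest = translation_md5("MKTAYIAKQR")
```

Keying on the sequence hash means that identical protein sequences can share one prediction_matrix entry per analysis.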

Column	Type	Default value	Description	Index
display_group_id	INT(10)	-	Primary key, internal identifier.	primary key
display_priority	INT(10)	-	Priority level for group (smallest number is highest on page)	unique: key
display_name	VARCHAR(255)	-	Name of the group to be displayed as the table header.	unique: key

Column	Type	Default value	Description	Index
individual_id	INT(10)	-	Primary key, internal identifier.	primary key
name	VARCHAR(255)	NULL	Name of the individual.
description	TEXT	NULL	Description of the individual.
gender	ENUM: Male Female Unknown	'Unknown'	The sex of this individual.
father_individual_id	INT(10)	NULL	Self referential ID, the father of this individual if known.	key: father_individual_idx
mother_individual_id	INT(10)	NULL	Self referential ID, the mother of this individual if known.	key: mother_individual_idx
individual_type_id	INT(10)	0	Foreign key references to the individual_type table.

Column	Type	Default value	Description	Index
synonym_id	INT(10)	-	Primary key, internal identifier.	primary key
individual_id	INT(10)	-	Foreign key references to the individual table.	key: individual_idx
source_id	INT(10)	-	Foreign key references to the source table.	key: name, source_id
name	VARCHAR(255)	NULL	Name of the synonym.	key: name, source_id

Column	Type	Default value	Description	Index
individual_type_id	INT(0)	-	Primary key, internal identifier.	primary key
name	VARCHAR(255)	-	Short name of the individual type. e.g. "fully_inbred","mutant".
description	TEXT	NULL	Long name of the individual type.

individual_type_id	name	description
1	fully_inbred	multiple organisms have the same genome sequence
2	partly_inbred	single organisms have reduced genome variability due to human intervention
3	outbred	a single organism which breeds freely
4	mutant	a single or multiple organisms with the same genome sequence that have a natural or experimentally induced mutation

Column	Type	Default value	Description	Index
super_population_id	INT(10)	-	Foreign key references to the population table.	unique key: super_population_idx
sub_population_id	INT(10)	-	Foreign key references to the population table.	unique key: super_population_idx key: sub_population_idx
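
The super_population_id/sub_population_id pairs describe a hierarchy, so finding every population nested under a given one means following the pairs transitively. A minimal in-memory sketch, with invented population identifiers and a hypothetical function name:

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def all_sub_populations(structure: Iterable[Tuple[int, int]], super_id: int) -> List[int]:
    """Return every population nested under super_id, at any depth.

    structure is an iterable of (super_population_id, sub_population_id)
    pairs. A visited set guards against cycles in malformed data.
    """
    children = defaultdict(list)
    for sup, sub in structure:
        children[sup].append(sub)

    visited = set()
    stack = [super_id]
    while stack:
        current = stack.pop()
        for sub in children.get(current, []):
            if sub not in visited:
                visited.add(sub)
                stack.append(sub)
    return sorted(visited)


# Illustrative pairs: population 1 contains 2 and 3, and 2 contains 4.
pairs = [(1, 2), (1, 3), (2, 4)]
```

Here all_sub_populations(pairs, 1) collects populations 2, 3 and 4, while a leaf population such as 4 yields an empty list.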

Column	Type	Default value	Description	Index
study1_id	INT(10)	-	Primary key. Foreign key references to the study table.	primary key
study2_id	INT(10)	-	Primary key. Foreign key references to the study table.	primary key