Human assembly and gene annotation

Name: Ensembl Human Regulatory features
Creator: Ensembl
License: https://www.apache.org/licenses/LICENSE-2.0
Keywords: expression, epigenomics, enhancer, promoter

Assembly

This site provides a data set based on the February 2009 Homo sapiens high-coverage assembly GRCh37 from the Genome Reference Consortium (GRC). This assembly was used by UCSC to create their hg19 database. The data set consists of gene models built from the genewise alignments of the human proteome as well as from alignments of human cDNAs using the cDNA2genome model of exonerate.

This release of the assembly has the following properties:

27478 contigs.
contig length total 3.2 Gb.
chromosome length total 3.1 Gb.

It also includes nine haplotypic regions, mainly in the MHC region of chromosome 6.

Patches

As the GRC maintains and improves the assembly, patches are being introduced. Currently, assembly patches are of two types:

Novel patch: new sequences that add alternative sequence at a locus and will remain as haplotypes in the next major assembly release by the GRC
Fix patch: sequences that correct the reference sequence and will replace the given region of the reference assembly at the next major assembly release by the GRC

Gene annotation

The Ensembl human gene annotations have been updated using Ensembl's automatic annotation pipeline. The updated annotation incorporates new protein and cDNA sequences which have become publicly available since the last GRCh37 genebuild (March 2009).

This archive displays a joint gene set based on the merge between the automatic annotation from Ensembl and a freeze of the manual annotation from Havana (first published in Vega Release 55). Transcripts from the two annotation sources are merged if they share the same internal exon-intron boundaries (i.e. have an identical splicing pattern), with slight differences in the terminal exons allowed. Importantly, all Havana transcripts are included in the final Ensembl/Havana merged (GENCODE) gene set. See the summary table under Statistics for the corresponding GENCODE version number. The Consensus Coding Sequence (CCDS) identifiers have also been mapped to the annotations. More information about the CCDS project.
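
The merge criterion described above (identical internal exon-intron boundaries, with terminal exon ends free to differ) can be illustrated with a minimal sketch. This is not the Ensembl/Havana merge code: the function names, the (start, end) exon representation, and the decision to treat single-exon transcripts as non-mergeable are assumptions made purely for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: an exon is (start, end), 1-based inclusive,
# and both transcripts are assumed to lie on the same chromosome and strand.
Exon = Tuple[int, int]


def intron_chain(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the introns implied by consecutive exons, as (start, end) pairs."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def same_splicing_pattern(a: List[Exon], b: List[Exon]) -> bool:
    """True if both transcripts have the same, non-empty intron chain.

    Only internal boundaries are compared, so the outer ends of the first
    and last exons may differ. Single-exon transcripts have no introns and
    are treated as non-matching here (an assumption, not a documented rule).
    """
    chain_a = intron_chain(a)
    chain_b = intron_chain(b)
    return bool(chain_a) and chain_a == chain_b


# Made-up coordinates: terminal exon ends differ, internal boundaries agree.
automatic_model = [(100, 200), (300, 400), (500, 650)]
manual_model = [(80, 200), (300, 400), (500, 700)]
shifted_model = [(100, 200), (310, 400), (500, 650)]  # internal boundary moved

print(same_splicing_pattern(automatic_model, manual_model))
print(same_splicing_pattern(automatic_model, shifted_model))
```

Comparing intron chains rather than exon lists is what lets the terminal exons vary: the outer ends of the first and last exon never enter the comparison.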

Detailed information on genebuild (PDF)

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly	GRCh37.p13 (Genome Reference Consortium Human Reference 37), INSDC Assembly GCA_000001405.14, Feb 2009
Base Pairs	3,098,825,702
Golden Path Length	3,098,825,702
Annotation provider	Ensembl
Annotation method	Full genebuild
Genebuild started	Jul 2010
Genebuild released	Apr 2011
Genebuild last updated/patched	Sep 2013
Database version	115.37
Gencode version	GENCODE 19

Gene counts (Primary assembly)

Coding genes	20,805 (excl 463 readthrough)
Non coding genes	22,966
Small non coding genes	7,057
Long non coding genes	13,870 (excl 184 readthrough)
Misc non coding genes	2,039
Pseudogenes	14,181 (excl 4 readthrough)
Gene transcripts	196,668

Coding gene: a gene/transcript that contains an open reading frame (ORF).
Readthrough: a readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).
Pseudogene: a gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) that disrupt the ORF. Thought to have arisen through duplication followed by loss of function.
Transcript: the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons being separated by introns. The exons and introns are transcribed and then the introns are spliced out. Transcripts may or may not encode a protein.

Gene counts (Alternative sequence)

Coding genes	2,606 (excl 37 readthrough)
Non coding genes	1,436
Small non coding genes	517
Long non coding genes	783 (excl 24 readthrough)
Misc non coding genes	136
Pseudogenes	1,730
Gene transcripts	18,303
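
As a quick consistency check on the two gene-count tables above, the small, long and misc non-coding counts should add up to the non-coding total in each table. The sketch below copies the figures from this page; the dictionary keys and the function name are illustrative choices, not Ensembl identifiers.

```python
# Counts copied from the gene-count tables on this page.
PRIMARY = {"non_coding": 22_966, "small": 7_057, "long": 13_870, "misc": 2_039}
ALTERNATIVE = {"non_coding": 1_436, "small": 517, "long": 783, "misc": 136}


def subtotal_matches(counts: dict) -> bool:
    """Check that small + long + misc non-coding genes equal the non-coding total."""
    return counts["small"] + counts["long"] + counts["misc"] == counts["non_coding"]


for label, counts in (("Primary assembly", PRIMARY), ("Alternative sequence", ALTERNATIVE)):
    print(label, subtotal_matches(counts))
```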

Other

Genscan gene predictions	48,597
Short Variants	1,087,806,087
Structural variants	7,608,658
