Gorilla assembly and gene annotation

Gorillas are ground-dwelling, predominantly herbivorous apes that inhabit the forests of central Africa. The genus Gorilla is divided into two species, the eastern gorillas and the western gorillas (both critically endangered), and either four or five subspecies. They are the largest living primates. Gorilla DNA is between 95% and 99% similar to human DNA, depending on what is counted, and gorillas are the closest living relatives of humans after chimpanzees and bonobos.


This is the third release of the draft assembly of the Western lowland gorilla (Gorilla gorilla gorilla). The DNA sample came from a 30-year-old female, Kamilah, owned by the San Diego Wild Animal Park. Sequencing and assembly were provided by the Wellcome Trust Sanger Institute.

Sequencing was undertaken using two separate methods: traditional capillary whole-genome shotgun (WGS) sequencing and Solexa new-technology sequencing. Results from the two methods were used in the first, second and third draft gorilla assemblies.

The first draft assembly (gorGor1), released in September 2008, had 2.1x coverage. It was created from WGS capillary reads using the Phusion assembler. Sequencing errors in the capillary reads were corrected by taking the consensus of the Solexa data aligned to them.
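The consensus-based correction described above can be illustrated with a minimal sketch. This is not the pipeline that was actually used. The function name, the pileup representation and the `min_depth` and `min_fraction` thresholds are all assumptions made for illustration. The idea is simply that a capillary base is replaced only when enough aligned short reads agree on a different base.

```python
from collections import Counter


def correct_with_consensus(draft, pileups, min_depth=3, min_fraction=0.6):
    """Return `draft` with bases replaced by the short-read consensus.

    draft        -- draft sequence (string), e.g. from capillary reads
    pileups      -- one list of aligned short-read bases per draft position
    min_depth    -- hypothetical minimum coverage needed to trust a column
    min_fraction -- hypothetical minimum share of reads backing the consensus
    """
    if len(draft) != len(pileups):
        raise ValueError("need exactly one pileup column per draft base")

    corrected = []
    for base, column in zip(draft, pileups):
        if len(column) >= min_depth:
            consensus, count = Counter(column).most_common(1)[0]
            if count / len(column) >= min_fraction:
                base = consensus  # enough reads agree: use their base
        corrected.append(base)    # otherwise keep the draft base
    return "".join(corrected)


draft = "ACGT"
pileups = [
    ["A", "A", "A"],       # agrees with the draft
    ["C", "G", "G", "G"],  # majority disagrees: corrected to G
    [],                    # no coverage: draft base kept
    ["T", "T"],            # below min_depth: draft base kept
]
print(correct_with_consensus(draft, pileups))
```

Positions with little or no short-read coverage are left unchanged, so low-coverage regions cannot introduce new errors.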

To create the second draft assembly, Solexa data, sequenced at roughly 35x, were assembled into contigs using ABySS. The resulting contigs of length 50 bp or longer were then assembled along with the WGS capillary data using the Phusion assembler. Next, the Solexa read pairs were aligned to the human reference genome using Maq to identify syntenic regions and breakpoints between human and gorilla. Using human-gorilla synteny as a guide, longer gorilla supercontigs were constructed using Velvet and other assembly tools.

In the third draft assembly (gorGor3) for the current release, gorilla supercontigs which could be ordered with respect to the human reference genome were assembled into simulated chromosomes, while incorporating the chromosome 2 split (as in chimpanzee) and the reciprocal translocation between chromosomes 5 and 17.

The total length of the gorGor3 assembly is 3.04Gb. The N50 size for contigs is 11657 bp and the N50 size for supercontigs is 913458 bp.
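For readers unfamiliar with the N50 statistic quoted above: it is the length L such that sequences of length L or longer together contain at least half of the total assembled bases. A minimal sketch of the calculation follows. The function name and the toy lengths are illustrative only and are not taken from the gorGor3 data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or supercontig lengths."""
    if not lengths:
        raise ValueError("at least one length is required")

    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: total is 54, and 10 + 9 + 8 = 27 reaches half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Applied to the full list of contig lengths this gives the contig N50, and applied to the supercontig lengths it gives the supercontig N50.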

The genome assembly represented here corresponds to GenBank Assembly ID GCA_000151905.1.

Gene annotation

Gene annotation in gorilla has been generated by projection of genes from the human reference genome as well as alignment of proteins from three major sources (in descending order of their contribution to the final gene set):

  1. Ensembl human translations from Ensembl release 56;
  2. UniProt mammalian and vertebrate proteins with evidence for their existence at either the protein or transcript level; and
  3. Gorilla gorilla proteins obtained from UniProtKB.

Projection of human genes to gorilla began with the alignment of the gorilla genome to the latest human reference genome (GRCh37 assembly) using BLASTz. These alignments were used to project human Ensembl gene structures (Ensembl version 56) to the corresponding location in gorilla. About 60% of human protein-coding genes were projected onto the gorilla genome. Small insertions or deletions that disrupt the reading frame of the resulting projected transcripts are corrected by inserting "frame-shift" introns into the structure. For some human exons and parts of exons, the corresponding gorilla sequence is missing from the assembly. In most of these cases, the missing exon is omitted from the gorilla gene model. In a small number of cases, however, where BLASTz has aligned the human sequence to a gap in the gorilla sequence, the exon is placed in the gap, resulting in a run of X's of the correct length in the translation.
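The effect of placing an exon in an assembly gap can be shown with a small sketch. It assumes that gap bases are represented as N and that any codon containing an unknown base is translated as X. The function and the toy sequences are hypothetical and are not part of the Ensembl pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence; unknown codons (e.g. NNN) become X."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Toy gene model: a sequenced exon, an exon placed in a 6-base gap,
# and a final sequenced exon ending in a stop codon.
exon_1 = "ATGGCT"
exon_in_gap = "N" * 6
exon_3 = "GGATAA"
print(translate(exon_1 + exon_in_gap + exon_3))
```

Because the gapped exon keeps its projected length, the reading frame of the downstream exon is preserved and the translation has the correct overall length.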

Ensembl human translations were also aligned to the gorilla genome using Exonerate. The alignment of mammalian/vertebrate proteins and gorilla-specific proteins followed procedures in the standard Ensembl genebuild pipeline using Genewise.

The gene-building procedure on the gorGor3 assembly identified 20803 protein-coding genes and 1553 pseudogenes.

Additional manual annotation of this genome can be found in Vega.

More information

General information about this species can be found in Wikipedia.



Assembly: gorGor3.1, INSDC Assembly GCA_000151905.1, Dec 2009
Base Pairs: 2,828,888,833
Golden Path Length: 3,040,677,044
Annotation method: Full genebuild
Genebuild started: Aug 2009
Genebuild released: Mar 2010
Genebuild last updated/patched: Jul 2011
Database version: 95.31

Gene counts

Coding genes: 20,962
Gene transcripts: 35,727

Genscan gene predictions: 50,831

About this species