ncRNA trees are generated by a pipeline that follows a strategy similar to the one used for protein trees, adapted to the specific characteristics of ncRNAs. The adaptation matters because ncRNA genes are well known to fold into secondary structures in which pairs of residues base-pair to form stem-loops and other structures, so paired sites do not evolve independently of each other. Substitution models that consider pairs of sites have been proposed and implemented in several packages, such as PHASE and RAxML.
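To illustrate what "pairs of sites" means, here is a minimal sketch that recovers the paired positions from a secondary structure written in simple dot-bracket notation. It is not part of the Ensembl pipeline: the function name `paired_sites` is hypothetical, and the plain `(`, `)`, `.` alphabet is an assumption made for the example (structure annotations produced by real tools may use a richer notation).

```python
from typing import List, Tuple


def paired_sites(structure: str) -> List[Tuple[int, int]]:
    """Return 0-based (i, j) index pairs of matched positions.

    Assumes simple dot-bracket notation: '(' opens a pair, ')' closes
    the most recently opened pair, and any other character is unpaired.
    """
    open_positions: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(index)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((open_positions.pop(), index))
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    # A small hairpin: two stacked base pairs enclosing a 3-residue loop.
    print(paired_sites("((...))"))
```

A paired-site substitution model would treat each `(i, j)` pair returned here as a single evolving unit, while unpaired positions would still be modelled one site at a time.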
Details on tree building
The ncRNA tree pipeline consists of the following steps:
- Get and store ncRNA family models from RFAM.
- Load and identify all the ncRNA members annotated in all the Ensembl genomes.
- Filter out extra copies in low-coverage assemblies using our EPO multiple alignments.
- Build secondary structure alignments using INFERNAL, with refinement of the covariance model.
- Build ncRNA trees with RAxML using 16 different secondary structure models.
- In parallel with the secondary structure alignments and trees, build multiple alignments of the genomic sequences of the ncRNAs with PRANK. These alignments include the flanking regions of the genes (twice the length of the gene on each side).
- With the genomic alignments, build a neighbour-joining (NJ) and a maximum-likelihood (ML) tree using TreeBeST.
- For very large families, build trees quickly and efficiently using FastTree and RAxML-Light.
- For each family, add the species tree to the set of trees already obtained and reconcile them all using TreeBeST, obtaining one final tree for each family.
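The flanking rule used for the genomic alignments (twice the length of the gene on each side) can be sketched as a small coordinate calculation. This is an illustrative sketch only, not the pipeline's code: the function name `flanked_region` is hypothetical, coordinates are assumed to be 1-based and inclusive, and clamping to the sequence boundaries is an assumption made for the example.

```python
from typing import Tuple


def flanked_region(start: int, end: int, seq_length: int) -> Tuple[int, int]:
    """Extend a gene by twice its own length on each side.

    Coordinates are assumed to be 1-based and inclusive. The result is
    clamped to [1, seq_length]; this clamping is an assumption of the
    sketch, not documented pipeline behaviour.
    """
    if not 1 <= start <= end <= seq_length:
        raise ValueError("expected 1 <= start <= end <= seq_length")
    gene_length = end - start + 1
    flank = 2 * gene_length
    return max(1, start - flank), min(seq_length, end + flank)


if __name__ == "__main__":
    # A 100 bp gene gains 200 bp on each side, giving a 500 bp region.
    print(flanked_region(1000, 1099, 10000))
```

Under this rule the aligned region is up to five times the length of the gene itself, and shorter when the gene lies close to the end of its sequence.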
ncRNA orthologies in the vertebrate lineage. Miguel Pignatelli, Albert J. Vilella, Matthieu Muffato, Leo Gordon, Simon White, Paul Flicek, Javier Herrero. Database (Oxford) 2016 pii:bav127.