Performing Phylostratum
and
Divergence Stratum
Enrichment Analyses
Phylostratum
and Divergence Stratum
enrichment analyses have been performed by several studies to correlate
organ or metabolic pathway evolution with the origin of organ or pathway
specific genes (Sestak and Domazet-Loso, 2015).
In detail, Phylostratum
and
Divergence Stratum
enrichment analyses can be performed
analogously to Gene
Ontology and Kegg
enrichment analyses to study the enrichment of evolutionary age or
sequence divergence in a set of selected genes against the entire
genome/transcriptome. In case specific age categories are significantly
over- or underrepresented in the selected gene set, assumptions or
potential correlations between the evolutionary origin of a particular
organ or metabolic pathway can be implied.
In this vignette we will use the data set published by Sestak and
Domazet-Loso, 2015 to demonstrate how to perform enrichment analyses
using myTAI
.
Enrichment Analyses using PlotEnrichment()
The PlotEnrichment()
function implemented in
myTAI
computes and visualizes the significance of enriched
(over- or underrepresented) Phylostrata
or
Divergence Strata
within an input set of process/tissue
specific genes. In detail this function takes the
Phylostratum
or Divergence Stratum
distribution of all genes stored in the input ExpressionSet
as background set and the Phylostratum
or
Divergence Stratum
distribution of the specific gene set
and performs a Fisher’s exact test for each Phylostratum
or
Divergence Stratum
to quantify the statistical significance
of over- or under-represented Phylostrata
or
Divergence Strata
within the set of selected genes. In
other words, the frequency distribution of Phylostrata
or
Divergence Strata
in the complete sample is compared with
the frequency distribution of Phylostrata
or
Divergence Strata
in the set of selected genes and over- or
under-representation is visualized by log-odds (or odds), where a
log-odd of zero means that both frequency distributions are equal (see
also Sestak and Domazet-Loso, 2015).
Example Data Set Retrieval
Before using the PlotEnrichment()
function, we need to
download the example data set from Sestak and Domazet-Loso, 2015.
Download the Phylostratigraphic Map
of D.
rerio:
# download the Phylostratigraphic Map of Danio rerio
# from Sestak and Domazet-Loso, 2015
download.file( url = "http://mbe.oxfordjournals.org/content/suppl/2014/11/17/msu319.DC1/TableS3-2.xlsx",
destfile = "MBE_2015a_Drerio_PhyloMap.xlsx" )
Read the *.xlsx
file storing the
Phylostratigraphic Map
of D. rerio and format it
for the use with myTAI
:
# install the readxl package
install.packages("readxl")
# load package readxl
library(readxl)
# read the excel file
DrerioPhyloMap.MBEa <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 1, skip = 4)
# format Phylostratigraphic Map for use with myTAI
Drerio.PhyloMap <- DrerioPhyloMap.MBEa[ , 1:2]
# have a look at the final format
head(Drerio.PhyloMap)
Phylostrata ZFIN_ID
1 1 ZDB-GENE-000208-13
2 1 ZDB-GENE-000208-17
3 1 ZDB-GENE-000208-18
4 1 ZDB-GENE-000208-23
5 1 ZDB-GENE-000209-3
6 1 ZDB-GENE-000209-4
Now, Drerio.PhyloMap
stores the
Phylostratigraphic Map
of D. rerio which is used
as background set to perform enrichment analyses with
PlotEnrichment()
.
Enrichment Analyses
Now, the PlotEnrichment()
function visualizes the over-
and underrepresented Phylostrata
of brain specific genes
when compared with the total number of genes stored in the
Phylostratigraphic Map
of D. rerio.
# read expression data (organ specific genes) from Sestak and Domazet-Loso, 2015
Drerio.OrganSpecificExpression <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 2, skip = 3)
# select only brain specific genes
Drerio.Brain.Genes <- unique(na.omit(Drerio.OrganSpecificExpression[ , "brain"]))
# visualize enriched Phylostrata of genes annotated as brain specific
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
use.only.map = TRUE,
legendName = "PS")
Here, the first argument is either a standard
ExpressionSet
object (in case
use.only.map = FALSE
: default) or a
Phylostratigraphic Map
or Divergence Map
(in
case use.only.map = TRUE
; see Introduction for details). The second
argument test.set
specifies the gene ids also stored in the
corresponding ExpressionSet
or
Phylostratigraphic Map/ Divergence Map
for which enrichment
shall be quantified and visualized.
To visualize the odds or log-odds of over- or underrepresented genes
within the test.set
the following procedure is
performed:
denotes the number of genes in group j and deriving from PS , with and where denotes the background set and denotes the
test.set
denotes the total number of genes within PS
denotes the total number of genes within group
is the total number of genes within all groups and all PS
= / and = / denote relative frequencies between groups
denotes the between group sum of
The result is the fold-change value (odds;
measure = "foldchange"
) denoted as
which is visualized above and below zero or the log
fold-change value (log-odds;
measure = "log-foldchange"
), where
(C)
=
()
-
()
which is visualized symmetrically above and below zero by
PlotEnrichment()
. Analogously,
but is not visualized by this function.
Internally, PlotEnrichment()
performs a Fisher’s exact
test for each Phylostratum
or
Divergence Stratum
separately, to quantify the significance
of over- or under-representation of corresponding
Phylostrata
or Divergence Strata
within the
test.set
when compared with the entire
ExpressionSet
. PlotEnrichment()
visualizes
significantly enriched (over- or underrepresented)
Phylostrata
or Divergence Strata
with
asterisks ’*’.
Notation:
- ’*’ = P-Value 0.05
- ’**’ = P-Value 0.005
- ’***’ = P-Value 0.0005
Users will notice that when performing the
PlotEnrichment()
function, the p-values and the enrichment
matrix (storing
and
)
will be returned.
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
use.only.map = TRUE,
legendName = "PS")
$p.values
PS1 PS2 PS3 PS4 PS5 PS6
8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01
PS7 PS8 PS9 PS10 PS11 PS12
2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01
PS13 PS14
3.929949e-03 1.593834e-05
$enrichment.matrix
BG_Set Test_Set
PS1 -0.001132832 0.007668216
PS2 0.023733936 -0.172380714
PS3 -0.040879607 0.250587496
PS4 -0.048920465 0.294399729
PS5 -0.114888949 0.603817643
PS6 0.008678915 -0.060350168
PS7 -0.062948352 0.367240944
PS8 0.115630474 -1.206210187
PS9 -0.007353969 0.048964218
PS10 -0.031971192 0.200141519
PS11 0.039742253 -0.303363314
PS12 -0.002418079 0.016311853
PS13 0.101449988 -0.984621732
PS14 0.098211044 -0.938724783
In case users are only interested in the p-values of the Fisher test
and the enrichment matrix without illustrating the bar plot, they can
specify the plot.bars = FALSE
argument to only retrieve the
numeric results.
# specify plot.bars = FALSE to retrieve only numeric results
EnrichmentResult <- PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
use.only.map = TRUE,
legendName = "PS",
plot.bars = FALSE)
# access p-values quantifying the enrichment for each Phylostratum
EnrichmentResult$p.values
PS1 PS2 PS3 PS4 PS5 PS6
8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01
PS7 PS8 PS9 PS10 PS11 PS12
2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01
PS13 PS14
3.929949e-03 1.593834e-05
# access enrichment matrix storing C_1 and C_2
EnrichmentResult$enrichment.matrix
BG_Set Test_Set
PS1 -0.001132832 0.007668216
PS2 0.023733936 -0.172380714
PS3 -0.040879607 0.250587496
PS4 -0.048920465 0.294399729
PS5 -0.114888949 0.603817643
PS6 0.008678915 -0.060350168
PS7 -0.062948352 0.367240944
PS8 0.115630474 -1.206210187
PS9 -0.007353969 0.048964218
PS10 -0.031971192 0.200141519
PS11 0.039742253 -0.303363314
PS12 -0.002418079 0.016311853
PS13 0.101449988 -0.984621732
PS14 0.098211044 -0.938724783
Defining the Background Set
The Fisher test which is performed inside
PlotEnrichment()
assumes that all genes stored in the input
ExpressionSet
or
Phylostratigraphic Map
/Divergence Map
are used
to define the background set for constructing the test statistic.
However, since in most cases the test.set
is an subset of
the input ExpressionSet
or
Phylostratigraphic Map
/Divergence Map
one
could also specify the complete.bg
argument to remove all
test.set
genes from the background set when performing the
Fisher test and visualization.
The following two examples allow users to compare the results when
retaining all genes as background set compared with the option to remove
test.set
genes from the background set.
# complete.bg = TRUE (default) -> retain test.set genes in background set
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
complete.bg = TRUE,
use.only.map = TRUE,
legendName = "PS")
# complete.bg = FALSE -> remove test.set genes from background set
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
complete.bg = FALSE,
use.only.map = TRUE,
legendName = "PS")
Users will notice that although some p-values change, the qualitative
result did not change. In border line cases however, the results might
influence whether or not some Phylostrata
or
Divergence Strata
are denoted as significantly enriched or
not. So always be aware of the interpretation when retaining or removing
the test.set
from the background set, because both options
are valid and have advantages and disadvantages and depend on a valid
interpretation.
Interpretation of Enrichment Results
For the D. rerio brain genes example you can see that PS4, PS5, and PS7 are significantly over-represented in the set of brain specific genes.
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "foldchange",
complete.bg = TRUE,
use.only.map = TRUE,
legendName = "PS")
Again, we retrieve the D. rerio specific taxonomy
represented by PS1-14 using the myTAI::taxonomy()
function
(see Introduction and
Taxonomy
for details).
# retrieve the taxonomy of D. rerio
myTAI::taxonomy(organism = "Danio rerio")
id name rank
1 cellular organisms no rank 131567
2 Eukaryota superkingdom 2759
3 Opisthokonta no rank 33154
4 Metazoa kingdom 33208
5 Eumetazoa no rank 6072
6 Bilateria no rank 33213
7 Deuterostomia no rank 33511
8 Chordata phylum 7711
9 Craniata subphylum 89593
10 Vertebrata no rank 7742
11 Gnathostomata no rank 7776
12 Teleostomi no rank 117570
13 Euteleostomi no rank 117571
14 Actinopterygii superclass 7898
15 Actinopteri class 186623
16 Neopterygii subclass 41665
17 Teleostei infraclass 32443
18 Osteoglossocephalai no rank 1489341
19 Clupeocephala no rank 186625
20 Otomorpha no rank 186634
21 Ostariophysi no rank 32519
22 Otophysa no rank 186626
23 Cypriniphysae superorder 186627
24 Cypriniformes order 7952
25 Cyprinoidea superfamily 30727
26 Cyprinidae family 7953
27 Danio genus 7954
28 Danio rerio species 7955
Sestak and Domazet-Loso, 2015 collapsed these 28 taxonomic nodes into
14 taxonomic nodes (see Figure 2
in Sestak and
Domazet-Loso, 2015) and labelled them as phylostrata 1 to phylostrata
14, where PS1 represents cellular organisms
and PS14
represents D. rerio
specific genes. Based on the
phylostratum categorization of Sestak and Domazet-Loso, 2015, PS4
represents Holozoa (= Metazoa + allies)
, PS5 represents
Metazoa
, and PS7 represents Bilateria
.
Now, the over-representation results of brain specific genes returned
by PlotEnrichment()
provide evidence, that brain specific
genes might indeed have originated during the emergence of the nervous
system at the metazoan-eumetazoan transition leading to the
interpretation that the vertebrate brain has a step wise adaptive
history where most of its extant organization was already present in the
chordate ancestor as argued by Sestak and Domazet-Loso, 2015.
This example shall illustrate how the PlotEnrichment()
function can be used to trace the evolutionary origin of tissue or
process specific genes by investigating their age enrichment.
In case users have an ExpressionSet
storing the
Phylostratigraphic Map
of D. rerio as well as an
expression set, they can furthermore use the PlotGeneSet()
function implemented in myTAI
to visualize the expression
levels of brain specific genes which have been shown to be significantly
enriched in Metazoa
specific phylostrata.
Example:
# the best parameter setting to visualize this plot:
# png("DrerioBrainSpecificGeneExpression.png",700,400)
PlotGeneSet(ExpressionSet = DrerioPhyloExpressionSet,
gene.set = Drerio.Brain.Genes,
plot.legend = FALSE,
type = "l",
lty = 1,
lwd = 4,
xlab = "Ontogeny",
ylab = "Expression Level")
# dev.off()
Here DrerioPhyloExpressionSet
denotes a hypothetical
ExpressionSet
of D. rerio development.
Additionally, the SelectGeneSet()
function allows users
to obtain the ExpresisonSet
subset of selected genes
(gene.set
) for subsequent analyses.
# select the ExpressionSet subset of Brain specific genes
Brain.PhyloExpressionSet <- SelectGeneSet( ExpressionSet = DrerioPhyloExpressionSet,
gene.set = Drerio.Brain.Genes )
head(Brain.PhyloExpressionSet)
Adjust P-values for Multiple Comparisons
In case a large number of Phylostrata or Divergence Strata is
included in the input ExpressionSet
, p-values returned by
PlotEnrichment()
should be adjusted for multiple
comparisons. For this purpose PlotEnrichment()
includes the
argument p.adjust.method
. Here, all methods implemented in
?p.adjust
can be specified:
# adjust p-values for multiple comparisons with Benjamini & Hochberg (1995)
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
complete.bg = FALSE,
use.only.map = TRUE,
legendName = "PS",
p.adjust.method = "BH")
Please consult these reviews Gelman et al., 2008, and Slides) to decide whether or not to apply p-value adjustment to your own dataset.