Introduction
In the Introduction vignette we introduced and discussed how phylotranscriptomics can be applied to capture evolutionary signals in (developmental) transcriptomes. Furthermore, in the Enrichment Analyses vignette we provide a use case to correlate specific groups or sets of genes with their predicted evolutionary origin. Here, we aim to combine previously introduced techniques with classic gene expression analyses to detect possible functional causes for the observed transcriptome conservation.
In other words, phylotranscriptomics allows us to detect stages or periods of evolutionary conservation and is able to predict the evolutionary origin of process or trait specific genes based on enrichment analyses. By combining evolutionary enrichment analyses with the functional annotation of process or trait specific genes (see Functional Annotation for details) the detection of evolutionary signals can be correlated with functional processes. Then, performing gene expression analyses on corresponding process or trait specific genes allows users to detect potential causes of stage/period specific evolutionary transcriptome conservation.
The following sections introduce main gene expression data analysis
techniques implemented in myTAI
:
Detection of Differentially Expressed Genes (DEGs)
Fold-Change
Welch t-test
Wilcoxon Rank Sum Test (Mann-Whitney U test)
Negative Binomial (Exact Tests)
Collapsing Replicate Samples
Filter for Expressed Genes
Compute the Statistical Significance of Each Replicate Combination
Detection of Differenentially Expressed Genes (DEGs)
A variety of methods have been published to detect differentially
expressed genes. Some methods are based on non-statistical
quantification of expression differences (e.g. fold-change and
log-fold-change), but most methods are based on statistical tests to
quantify the significance of differences in gene expression between
samples. These statistical methods can furthermore be divided into two
methodological categories: parametric tests and non-parametric tests.
The DiffGenes()
function available in myTAI
implements the most popular and useful methods to detect differentially
expressed genes. In the literature, different methods have been
introduced and discussed for microarray technologies versus RNA-Seq
technologies.
In this section we will introduce all methods implemented in
DiffGenes()
using small examples and will furthermore,
discuss published advantages and disadvantages of each method and each
mRNA quantification technology.
Note that when using DiffGenes()
it is assumed
that your input dataset has been normalized before passing it to
DiffGenes()
. For RNA-Seq data DiffGenes()
assumes that the libraries have been normalized to have the same size,
i.e., to have the same expected column sum under the null hypothesis (or
the lib.size argument in DiffGenes()
is specified
accordingly).
Fold-Changes
A fold change in gene expression is simply the ratio of the gene expression level of one sample against a second sample: , where is the expression level of gene in sample one and is the expression level of gene in sample two. In case replicate expression levels are present for each sample the ratio of means of the corresponding replicates is computed: , where is the mean of replicate expression levels of gene in sample one and is the mean of replicate expression levels of gene in sample two.
Advantages: Given a small number of replicate values the statistical evaluation of differentially expressed genes might be biased (depending on the statistical test chosen) by underlying sample distributions which are not fulfilled or because a small number of replicate values is not sufficient enough to perform non-parametric tests. Here, fold-changes provide a simple way to quantify gene expression differences between samples by -fold enrichment. In our opinion, although the process of choosing a threshold for defining genes as being differentially expressed or not based on fold-change values is purely subjective and relies on common sense, in some cases this procedure will be more suitable than defining differentially expressed genes based on p-values obtained from a test statistic with violated test assumptions.
Disadvantages: If used appropriately, statistical tests not only systematically quantify the significance of the observed gene-by-gene differences of expression, but furthermore, accounts the variance of replicate expression levels when comparing the mean difference of replicate expression levels between samples. Hence, the gene specific variance between replicates is also quantified by the p-value returned by the sufficient test statistic which is not quantified by a simple fold-change measure.
Example: Fold-Change
For the following example we assume that
PhyloExpressionSetExample[1:5,1:8]
stores 5 genes and 3
developmental stages with 2 replicate expression levels per stage.
data("PhyloExpressionSetExample")
# Detection of DEGs using the fold-change measure
DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "foldchange",
stage.names = c("S1","S2","S3"))
head(DEGs)
Phylostratum GeneID S1->S2 S1->S3 S2->S1 S2->S3 S3->S1 S3->S2
1 1 at1g01040.2 1.6713881 2.0806706 0.5983051 1.2448758 0.4806143 0.8032930
2 1 at1g01050.1 1.0273222 1.2709185 0.9734045 1.2371177 0.7868325 0.8083305
3 1 at1g01070.1 1.3087379 1.4044799 0.7640949 1.0731560 0.7120073 0.9318310
4 1 at1g01080.2 0.7779572 0.7286769 1.2854177 0.9366542 1.3723503 1.0676299
5 1 at1g01090.1 0.3803866 0.2288961 2.6289042 0.6017460 4.3687939 1.6618307
The resulting output shows all combinations of fold-changes between
samples (developmental stages). Here, S1->S2
denotes
that the fold-change was computed for expression levels of stage
S1
against stage S2
.
Example: Log-Fold-Change
When selecting method = "log-foldchange"
it is
assumed that the input ExpressionSet
stores
log2
expression levels. Here, we transform absolute
expression levels stored in PhyloExpressionSetExample
to
log2
expression levels using the tf()
function
before log-fold-changes are computed.
data("PhyloExpressionSetExample")
# Detection of DEGs using the logfold-change measure
log.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
nrep = 2,
method = "log-foldchange",
stage.names = c("S1","S2","S3"))
head(log.DEGs)
Phylostratum GeneID S1->S2 S1->S3 S2->S1 S2->S3 S3->S1 S3->S2
1 1 at1g01040.2 0.74104679 1.0570486 -0.74104679 0.31600182 -1.0570486 -0.31600182
2 1 at1g01050.1 0.03888868 0.3458715 -0.03888868 0.30698280 -0.3458715 -0.30698280
3 1 at1g01070.1 0.38817621 0.4900360 -0.38817621 0.10185975 -0.4900360 -0.10185975
4 1 at1g01080.2 -0.36223724 -0.4566488 0.36223724 -0.09441158 0.4566488 0.09441158
5 1 at1g01090.1 -1.39446159 -2.1272350 1.39446159 -0.73277345 2.1272350 0.73277345
The resulting output stores all combinations of log fold-changes between samples (developmental stages).
Welch t-test
The Welch t-test
is a parametric test to statistically
quantify the difference of sample means in cases where the assumption of
homogeneity of variance (equal variances in the two populations) is
violated (Boslaugh, 2013). The Welch t-test
is a sufficient
parameter test for small sample sizes and thus, has been used to detect
differentially expressed genes based on p-values returned by the test
statistic (Hahne et al., 2008).
In detail, the test statistic is computed as follows:
where and are sample means, and are the sample variances, and and are the sample sizes.
The degrees of freedom for Welch’s t-test are then computed as follows:
To perform a sufficient Welch t-test
the following
assumptions about the input data need to be fulfilled to test whether
two samples come from populations with equal means:
Assumptions about input data
- independent samples
- continuous data
- (approximate) normality
Nevertheless, although in most cases log2
expression
levels are used to perform the Welch t-test
assuming that
expression levels are log-normal distributed which approximates a normal
distribution in infinity, in most cases the small number of replicates
is not sufficient enough to fulfill the (approximate) normality
assumption of the Welch t-test
.
Due to this fact, non-parametric, sampling based, or generalized
linear model based methods have been proposed to quantify p-values of
differential expression. Nevertheless, the DiffGenes()
function implements the Welch t-test
for the detection of
differentially expressed genes, allowing users to compare the results
with more recent DEG detection methods/methodologies also implemented in
DiffGenes()
.
-
Advantages:
- DEG detection based on statistical quantification
- Parametric test resulting in a strong test statistic
- Can handle small sample sizes
-
Disadvantages:
- Test assumptions must be fulfilled to return sufficient
p-values
- Can hardly assure normality with very sample sizes of (replicates)
- Pairwise comparisons between different stages or experiments
- Test assumptions must be fulfilled to return sufficient
p-values
Example: Welch t-test
Performing Welch t-test
with DiffGenes()
can be done by specifying method = "t.test"
. Internally
DiffGenes()
performs a two sided Welch t-test
.
This means that the Welch t-test
quantifies only whether or
not a gene is significantly differentially expressed, but not the
direction of enrichment (over-expressed or under-expressed).
The PhyloExpressionSetExample
we use in the following
example stores absolute expression levels. In case your
ExpressionSet
also stores absolute expression levels (which
is likely due to the ExpressionSet
standard for
Phylotranscriptomics analyses), you can use the tf()
function implemented in myTAI
to transform absolute
expression levels to log2
expression levels before
performing DiffGenes()
with a Welch t-test
,
e.g. tf(PhyloExpressionSetExample[1:5,1:8],log2)
. In
general, using log2
transformed expression levels as input
ExpressionSet
of DiffGenes()
allows us to (at
least) assume that samples (replicate expression levels) used to perform
the Welch t-test
are log-normal distributed and therefore,
somewhat approximate normal distributed.
Please notice however, that RNA-Seq data can include count values of
0. So when transforming absolute counts to log2
counts
infinity values of log2(0) = -Inf
will be produced and
therefore, p-value computations will not be possible. To avoid this case
you could either remove RNA-Seq count values of 0 from the input dataset
using the Expressed()
function (see section Filter for
Expressed Genes), e.g. pass
tf(Expressed(PhyloExpressionSetExample[1:5,1:8], cut.off = 1),log2)
as ExpressionSet
argument to DiffGenes()
or
shift all count values by a constant value, e.g. pass
tf(PhyloExpressionSetExample[1:5,1:8], function(x) log2(x + 1))
as ExpressionSet
argument to DiffGenes()
.
Internally, DiffGenes()
will also check for 0 values in
input data and will automatically shift all expression levels by
+1
in case 0 values are included.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by a Welch t-test
ttest.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
nrep = 2,
method = "t.test",
stage.names = c("S1","S2","S3"))
# look at the results
ttest.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.027832572 0.04020203 0.13481563
2 1 at1g01050.1 0.852379466 0.31471871 0.36326955
3 1 at1g01070.1 0.003200692 0.00113536 0.02236621
4 1 at1g01080.2 0.086426813 0.03092924 0.45999438
5 1 at1g01090.1 0.090387087 0.04638872 0.04978092
The resulting data.frame
stores the p-values of
stage-wise comparisons for each gene. To adjust p-values for multiple
testing of stage-wise comparisons you can specify the
p.adjust.method
argument with one of the p-value adjustment
methods implemented in DiffGenes()
.
In detail, correcting for multiple testing allows to appropriately choose selection cut-offs for p-values fulfilling the differential expression criteria. Hahne et al., 2008 (p. 87) give a nice example of correcting for multiple testing to determine appropriate selection cut-offs.
Please consult the documentation of ?p.adjust
to see
which p-value adjustment methods are implemented in
DiffGenes()
.
Please also consult these reviews Gelman et al., 2008, and Slides) to decide whether or not to apply p-value adjustment to your own dataset.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
ttest.DEGs.p_adjust <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
nrep = 2,
method = "t.test",
p.adjust.method = "BH",
stage.names = c("S1","S2","S3"))
ttest.DEGs.p_adjust
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.06958143 0.0579859 0.2246927
2 1 at1g01050.1 0.85237947 0.3147187 0.4540869
3 1 at1g01070.1 0.01600346 0.0056768 0.1118311
4 1 at1g01080.2 0.11298386 0.0579859 0.4599944
5 1 at1g01090.1 0.11298386 0.0579859 0.1244523
The resulting p-value adjusted data.frame
can be used to
filter for differentially expressed genes. Here, specifying the
arguments: comparison
, alpha
, and
filter.method
in DiffGenes()
allows users to
obtain only significant differentially expressed genes.
# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
ttest.DEGs.p_adjust.filtered <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:10 ,1:8],log2),
nrep = 2,
method = "t.test",
p.adjust.method = "BH",
stage.names = c("S1","S2","S3"),
comparison = "above",
alpha = 0.05,
filter.method = "n-set",
n = 1)
# look at the genes fulfilling the filter criteria
ttest.DEGs.p_adjust.filtered
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
3 1 at1g01070.1 0.03200692 0.0113536 0.2192432
In this example, only 1 out of 10 genes fulfills the p-value criteria
(alpha = 0.05
) in at least one stage comparison.
Rank top p-values
Finally, users can rank genes in increasing p-value order for each stage comparison by typing:
ttest.DEGs.p_adjust <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:500,1:8],log2),
nrep = 2,
method = "t.test",
p.adjust.method = "BH",
stage.names = c("S1","S2","S3"))
head(ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3])
Phylostratum GeneID S1<->S2
54 1 at1g02400.1 0.151388
119 1 at1g03870.1 0.151388
137 1 at1g04380.1 0.151388
289 1 at1g08110.4 0.151388
383 1 at1g10360.1 0.151388
413 1 at1g11040.1 0.151388
Here the line
ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3]
will sort p-values of stage comparison "S1<->S2"
in
increasing order.
Wilcoxon-Mann-Whitney test (Mann-Whitney U test)
The Wilcoxon-Mann-Whitney test is a nonparametric test to quantify the shift in empirical distribution parameters. Nonparametric tests are useful when sample populations do not meet the test assumptions of parametric tests.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
Wilcox.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "wilcox.test",
stage.names = c("S1","S2","S3"))
# look at the results
Wilcox.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.3333333 0.3333333 0.3333333
2 1 at1g01050.1 1.0000000 0.3333333 0.3333333
3 1 at1g01070.1 0.3333333 0.3333333 0.3333333
4 1 at1g01080.2 0.3333333 0.3333333 0.6666667
5 1 at1g01090.1 0.3333333 0.3333333 0.3333333
Again, users can adjust p-values by specifying the
p.adjust.method
argument.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
Wilcox.DEGs.adj <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "wilcox.test",
stage.names = c("S1","S2","S3"),
p.adjust.method = "BH")
# look at the results
Wilcox.DEGs.adj
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.4166667 0.3333333 0.4166667
2 1 at1g01050.1 1.0000000 0.3333333 0.4166667
3 1 at1g01070.1 0.4166667 0.3333333 0.4166667
4 1 at1g01080.2 0.4166667 0.3333333 0.6666667
5 1 at1g01090.1 0.4166667 0.3333333 0.4166667
Negative Binomial (Exact Tests)
Exact Tests for Differences between two groups of negative-binomial
counts implemented in DiffGenes()
are based on the
edgeR
function exactTest()
. Please consult the
edgeR
Users Guide for mathematical details.
Install edgeR Package
The detection of DEGs using negative binomial models is based on the
powerful implementations provided by the edgeR
package. Hence, before using the negative binomial models in
DiffGenes()
users need to install the edgeR package.
# install edgeR
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
Double Tail Method
This method computes two-sided p-values by doubling the smaller tail
probability (see ?exactTestByDeviance
for details). To
compute p-values for stagewise comparisons based on negative binomial
models, the DiffGenes()
argument
method = "doubletail"
, the number of replicates per stage
nrep
, and lib.size
quantifying the library
size to equalize sample library sizes by quantile-to-quantile
normalization need to be specified (see also
?equalizeLibSizes
).
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Double Tail Method
DoubleTail.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "doubletail",
lib.size = 1000,
stage.names = c("S1","S2","S3"))
# look at the results
DoubleTail.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.26026604 0.110233012 0.6304508
2 1 at1g01050.1 0.95314428 0.598102712 0.6398757
3 1 at1g01070.1 0.55461941 0.456018563 0.8774231
4 1 at1g01080.2 0.58130025 0.487028051 0.8860005
5 1 at1g01090.1 0.03615134 0.001773543 0.2645537
Again, users can adjust p-values by specifying the
p.adjust.method
argument.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Double Tail Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
DoubleTail.DEGs.adj <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "doubletail",
lib.size = 1000,
stage.names = c("S1","S2","S3"),
p.adjust.method = "BH")
# look at the results
DoubleTail.DEGs.adj
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.6506651 0.275582530 0.8860005
2 1 at1g01050.1 0.9531443 0.598102712 0.8860005
3 1 at1g01070.1 0.7266253 0.598102712 0.8860005
4 1 at1g01080.2 0.7266253 0.598102712 0.8860005
5 1 at1g01090.1 0.1807567 0.008867715 0.8860005
Small-P Method
This method performs the method of small probabilities as proposed by
Robinson and Smyth (2008) (see exactTestBySmallP
for
details). To compute p-values for stagewise comparisons based on
negative binomial models, the DiffGenes()
argument
method = "doubletail"
, the number of replicates per stage
nrep
, and lib.size
quantifying the library
size to equalize sample library sizes by quantile-to-quantile
normalization need to be specified (see also
?equalizeLibSizes
).
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Small-P Method
SmallP.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "smallp",
lib.size = 1000,
stage.names = c("S1","S2","S3"))
# look at the results
SmallP.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.26026604 0.110233012 0.6304508
2 1 at1g01050.1 0.95314428 0.598102712 0.6398757
3 1 at1g01070.1 0.55461941 0.456018563 0.8774231
4 1 at1g01080.2 0.58130025 0.487028051 0.8860005
5 1 at1g01090.1 0.03615134 0.001773543 0.2645537
Again, users can adjust p-values by specifying the
p.adjust.method
argument.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Small-P Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
SmallP.DEGs.adj <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "smallp",
lib.size = 1000,
stage.names = c("S1","S2","S3"),
p.adjust.method = "BH")
# look at the results
SmallP.DEGs.adj
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.6506651 0.275582530 0.8860005
2 1 at1g01050.1 0.9531443 0.598102712 0.8860005
3 1 at1g01070.1 0.7266253 0.598102712 0.8860005
4 1 at1g01080.2 0.7266253 0.598102712 0.8860005
5 1 at1g01090.1 0.1807567 0.008867715 0.8860005
Deviance Method
This method uses the deviance goodness of fit statistics to define
the rejection region, and is therefore equivalent to a conditional
likelihood ratio test (see exactTestByDeviance
for
details). To compute p-values for stagewise comparisons based on
negative binomial models, the DiffGenes()
argument
method = "doubletail"
, the number of replicates per stage
nrep
, and lib.size
quantifying the library
size to equalize sample library sizes by quantile-to-quantile
normalization need to be specified (see also
?equalizeLibSizes
).
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Deviance
Deviance.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "deviance",
lib.size = 1000,
stage.names = c("S1","S2","S3"))
# look at the results
Deviance.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.26026604 0.110233012 0.6304508
2 1 at1g01050.1 0.95314428 0.598102712 0.6398757
3 1 at1g01070.1 0.55461941 0.456018563 0.8774231
4 1 at1g01080.2 0.58130025 0.487028051 0.8860005
5 1 at1g01090.1 0.03615134 0.001773543 0.2645537
Again, users can adjust p-values by specifying the
p.adjust.method
argument.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Deviance Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
Deviance.DEGs.adj <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "deviance",
lib.size = 1000,
stage.names = c("S1","S2","S3"),
p.adjust.method = "BH")
# look at the results
Deviance.DEGs.adj
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.6506651 0.275582530 0.8860005
2 1 at1g01050.1 0.9531443 0.598102712 0.8860005
3 1 at1g01070.1 0.7266253 0.598102712 0.8860005
4 1 at1g01080.2 0.7266253 0.598102712 0.8860005
5 1 at1g01090.1 0.1807567 0.008867715 0.8860005
Replicate Quality Check
Users can also perform replicate quality checks to quantify the variability between replicate expression levels fo each stage separately.
The PlotReplicateQuality()
is designed to perform
customized replicate variability checks for any
ExpressionSet
object storing replicates.
data(PhyloExpressionSetExample)
# visualize the sd() between replicates
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
nrep = 2,
legend.pos = "topright",
ylim = c(0,0.2),
lwd = 6)
The resulting plot visualizes the kernel density estimates for the
variance (log variance) between replicates. Each curve represents the
density function for the replicate variation within one stage or
experiment. In this case the variance between replicates of
Stage 1
to Stage 3
(each including 2
replicates) seem to deviate from each other allowing the conclusion that
each stage has a different expression level variability between
replicates.
The FUN
argument implemented in
PlotReplicateQuality()
allows users to furthermore, specify
customized criteria quantifying replicate variability. Please notice
that the function specified in FUN
will be performed
separately on each gene and stage.
In the following example the median absolute deviation function
mad()
is used to quantify replicate variability.
data(PhyloExpressionSetExample)
# visualize the mad() between replicates
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
nrep = 2,
FUN = mad,
legend.pos = "topright",
ylim = c(0,0.015),
lwd = 6)
In general, users are not limited to specific functions implemented
in R. By writing customized functions such as
FUN = function(x) return((x - mean(x))^2)
users can define
their own criteria to quantify replicate variability and can then apply
this criteria to PlotReplicateQuality()
by specifying the
FUN
argument.
Collapsing Replicate Samples
After performing differential gene expression analyses, replicate
expression levels are collapsed to a single stage specific expression
level. For this purpose, myTAI
implements the
CollapseReplicates()
function, allowing users to combine
replicate expression levels stored in a standard
PhyloExpressionSet
or DivergenceExpressionSet
object to a stage specific expression level using a specified window
function.
library(myTAI)
# load example data
data(PhyloExpressionSetExample)
# generate an example PhyloExpressionSet with replicates
ExampleReplicateExpressionSet <- PhyloExpressionSetExample[ ,1:8]
# rename stages
names(ExampleReplicateExpressionSet)[3:8] <- c("Stage_1_Repl_1","Stage_1_Repl_2",
"Stage_2_Repl_1","Stage_2_Repl_2",
"Stage_3_Repl_1","Stage_3_Repl_2")
# have a look at the example dataset
head(ExampleReplicateExpressionSet, 5)
Phylostratum GeneID Stage_1_Repl_1 Stage_1_Repl_2 Stage_2_Repl_1 Stage_2_Repl_2 Stage_3_Repl_1 Stage_3_Repl_2
1 1 at1g01040.2 2173.635 1911.2001 1152.555 1291.4224 1000.253 962.9772
2 1 at1g01050.1 1501.014 1817.3086 1665.309 1564.7612 1496.321 1114.6435
3 1 at1g01070.1 1212.793 1233.0023 939.200 929.6195 864.218 877.2060
4 1 at1g01080.2 1016.920 936.3837 1181.338 1329.4734 1392.643 1287.9746
5 1 at1g01090.1 11424.567 16778.1685 34366.649 39775.6405 56231.569 66980.3673
Now, assume that this example PhyloExpressionSet
stores
three developmental stages and 2 biological replicates for each
developmental stage. Of course, we could now compute and visualize the
TAI profile by typing:
# visualize the TAI profile over 3 stages of development
# and 2 replicates per stage
PlotPattern(ExpressionSet = ExampleReplicateExpressionSet,
type = "l",
lwd = 6)
Usually, one would expect that variations in replicate values are smaller than variations between developmental stages. In this example however, we constructed replicate values that vary larger than expression levels between developmental stages. For many applications it might be useful to visualize TAI/TDI values of replicates as well, but normally replicate values are collapsed to one gene and stage specific value after differential gene expression analyses and replicate quality control have been performed.
The following example illustrates how to collapse replicates with
CollapseReplicates()
:
# combine the expression levels of the 2 replicates (const) per stage
# using geom.mean as window function and rename new stages to: "S1","S2","S3"
CollapssedPhyloExpressionSet <- CollapseReplicates(
ExpressionSet = ExampleReplicateExpressionSet,
nrep = 2,
FUN = geom.mean,
stage.names = c("S1","S2","S3"))
# have a look at the collapsed PhyloExpressionSet
head(CollapssedPhyloExpressionSet)
Phylostratum GeneID S1 S2 S3
1 1 at1g01040.2 2038.1982 1220.0147 981.4381
2 1 at1g01050.1 1651.6070 1614.2524 1291.4582
3 1 at1g01070.1 1222.8557 934.3975 870.6878
4 1 at1g01080.2 975.8215 1253.2189 1339.2866
5 1 at1g01090.1 13844.9740 36972.3612 61371.0937
6 1 at1g01120.1 815.3288 894.8987 905.8272
The nrep
argument specifies either a constant number of
replicates per stage or a numeric vector storing variable numbers of
replicates for each developmental stage. In our example, each
developmental stage had a constant (equal) number of replicates per
developmental stage (nrep = 2
). In case a variable stage
specific number of replicates is present, one could specify
nrep = c(2,3,2)
defining the case that developmental stage
1 stores 2 replicates, stage 2 stores 3 replicates, and stage 3 again,
stores 2 replicates.
The argument FUN
specifies the window function to
collapse replicate expression levels to a single stage specific value.
In this example, we chose the geom.mean()
(geometric mean)
function implemented in myTAI
, because our example
PhyloExpressionSet
stores absolute expression levels.
Notice that the mathematical equivalent of performing arithmetic mean
(mean()
) computations on log
expression levels
is to perform the geometric mean (geom.mean()
) on absolute
expression levels.
The stage.names
argument then specifies the new names of
collapsed stages.
Filter for Expressed Genes
After differential gene expression analyses and replicate aggregation have been performed, some studies filter gene expression levels in RNA-Seq count tables or microarray expression matrices for non-expressed or outlier genes. For example, in most studies performing RNA-Seq experiments FPKM/RPKM values < 1 are remove from the processed (final) count table.
For this purpose myTAI
implements the
Expressed()
function to filter (remove) expression levels
in RNA-Seq count tables or microarray expression matrices which do not
pass a defined expression threshold.
The Expressed()
function takes a standard
PhyloExpressionSet
or DivergenceExpressionSet
object storing a RNA-Seq count table (CT) or microarray gene expression
matrix and removes genes from this count table or gene expression matrix
that have an expression level below a defined cut.off
value.
Expressed()
allows users to choose from several gene
extraction methods (see ?Expressed
for details):
const
: all genes that have at least one stage that undercuts or exceeds the expressioncut.off
will be excluded from theExpressionSet
. Hence, for a 7 stageExpressionSet
genes passing the expression levelcut.off
in 6 stages will be retained in theExpressionSet
.min-set
: genes passing the expression levelcut.off
inceiling(n/2)
stages will be retained in theExpressionSet
, wheren
is the number of stages in theExpressionSet
.n-set
: genes passing the expression levelcut.off
inn
stages will be retained in theExpressionSet
. Here, the argumentn
is defining the number of stages for which the threshold criteria should be fulfilled.
# check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260
# remove genes that have an expression level below 8000
# in at least one developmental stage
FilterConst <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off = 8000,
comparison = "below",
method = "const")
nrow(FilterConst) # check number of retained genes
#> [1] 449
Users will observe that only 449 out of 25260 genes in
PhyloExpressionSetExample
have an absolute expression level
above 8000
when omitting genes using
method = 'const'
. The argument comparison
specifies whether genes having expression levels below, above, or below
AND above (both) the cut.off
value should be removed from
the dataset.
The following comparison methods can be selected:
-
comparison = "below"
: define genes as not expressed which undercut thecut-off
threshold. -
comparison = "above"
: define genes as outliers which exceed thecut-off
threshold. -
comparison = "both"
: remove genes fulfilling thecomparison = "below"
ANDcomparison = "above"
criteria.
# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260
# remove genes that have an expression level above 12000
# in at least one developmental stage (outlier removal)
FilterConst.above <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off = 12000,
comparison = "above",
method = "const")
nrow(FilterConst.above) # check number of retained genes
#> [1] 23547
For this example 25260 - 23547 = 1713 have been classified as outliers (expression levels above 12000) and were removed from the dataset.
# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260
# remove genes that have an expression level below 8000 AND above 12000
# in at least one developmental stage (non-expressed genes AND outlier removal)
FilterConst.both <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off = c(8000,12000),
comparison = "both",
method = "const")
nrow(FilterConst.both) # check number of retained genes
#> [1] 2
When selecting comparison = 'both'
, the
cut.off
argument receives 2 threshold values: the
below cut.off
as first element and the
above cut.off
as second element. In this case
cut.off = c(8000,12000)
. Here, only 2 genes fulfill these
criteria.
Analogously, users can specify the number of stages that should
fulfill the threshold criteria using the n-set
method.
# remove genes that have an expression level below 8000
# in at least 5 developmental stages (in this case: n = 2 stages fulfilling the criteria)
FilterNSet <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off = 8000,
method = "n-set",
comparison = "below",
n = 2)
nrow(FilterMinSet) # check number of retained genes
#> [1] 20
Here, 20 genes are fulfilling these criteria.
Compute the Statistical Significance of Each Replicate Combination
In some cases (high variability of replicates) it might be useful to
verify that there is no sequence of replicates (for all possible
combination of replicates) that results in a non-significant
TAI
or TDI
pattern, when the initial pattern
with combined replicates was shown to be significant.
The CombinatorialSignificance()
function implemented in
myTAI
allows users to compute the p-values quantifying the
statistical significance of the underlying pattern for all combinations
of replicates.
A small Example:
Assume a PhyloExpressionSet
stores 3 developmental
stages with 3 replicates measured for each stage. The 9 replicates in
total are denoted as:
.
Now the function computes the statistical significance of each pattern
derived by the corresponding combination of replicates, e.g.
1.1, 2.1, 3.1 : p-value for combination 1
1.1, 2.2, 3.1 : p-value for combination 2
1.1, 2.3, 3.1 : p-value for combination 3
1.2, 2.1, 3.1 : p-value for combination 4
1.2, 2.1, 3.1 : p-value for combination 5
1.2, 2.1, 3.1 : p-value for combination 6
1.3, 2.1, 3.1 : p-value for combination 7
1.3, 2.2, 3.1 : p-value for combination 8
1.3, 2.3, 3.1 : p-value for combination 9
…
This procedure yields 27 p-values for the () replicate combinations, where denotes the number of developmental stages and denotes the number of replicates per stage.
Note that in case users have a large amount of stages/experiments and a large amount of replicates the computation time will increase by . For 11 stages and 4 replicates, = 4194304 p-values have to be computed. Each p-value computation itself is based on a permutation test running with or more permutations. Be aware that this might take some time.
The p-value vector returned by this function can then be used to plot the p-values to see whether an critical value is exceeded or not (e.g. ).
# load a standard PhyloExpressionSet
data(PhyloExpressionSetExample)
# we assume that the PhyloExpressionSetExample
# consists of 3 developmental stages
# and 2 replicates for stage 1, 3 replicates for stage 2,
# and 2 replicates for stage 3
# FOR REAL ANALYSES PLEASE USE: permutations = 1000 or 10000
# BUT NOTE THAT THIS TAKES MUCH MORE COMPUTATION TIME
p.vector <- CombinatorialSignificance(ExpressionSet = PhyloExpressionSetExample,
replicates = c(2,3,2),
TestStatistic = "FlatLineTest",
permutations = 100,
parallel = FALSE)
[1] 2.436296e-03 2.288593e-02 1.608399e-03 1.185615e-02 1.835306e-06 1.077012e-05
[7] 2.025515e-07 5.148342e-07 1.654885e-07 6.251145e-06 9.265520e-10 1.047479e-06
Users will observe that none of the replicate combinations resulted in p-values > 0.05 and thus we can assume that the phylotranscriptomic pattern computed based on collapsed replicates is not biased by insignificant replicate combinations.
any(p.vector > 0.05)
#> FALSE
CombinatorialSignificance()
can perform all significance
tests introduced in the Introduction and
Intermediate vignettes.
Furthermore, the parallel
argument allows users to
perform significance computations in parallel on a multicore machine.
This will speed up p-value computations for a large number of
combinations.