In the Introduction vignette we introduced and discussed how phylotranscriptomics can be applied to capture evolutionary signals in (developmental) transcriptomes. Furthermore, in the Enrichment Analyses vignette we provide a use case to correlate specific groups or sets of genes with their predicted evolutionary origin. Here, we aim to combine previously introduced techniques with classic gene expression analyses to detect possible functional causes for the observed transcriptome conservation.
In other words, phylotranscriptomics allows us to detect stages or periods of evolutionary conservation and is able to predict the evolutionary origin of process or trait specific genes based on enrichment analyses. By combining evolutionary enrichment analyses with the functional annotation of process or trait specific genes (see Functional Annotation for details) the detection of evolutionary signals can be correlated with functional processes. Then, performing gene expression analyses on corresponding process or trait specific genes allows users to detect potential causes of stage/period specific evolutionary transcriptome conservation.
The following sections introduce main gene expression data analysis techniques implemented in myTAI
:
Detection of Differentially Expressed Genes (DEGs)
Fold-Change
Welch t-test
Wilcoxon Rank Sum Test (Mann-Whitney U test)
Negative Binomial (Exact Tests)
Collapsing Replicate Samples
Filter for Expressed Genes
Compute the Statistical Significance of Each Replicate Combination
A variety of methods have been published to detect differentially expressed genes. Some methods are based on non-statistical quantification of expression differences (e.g. fold-change and log-fold-change), but most methods are based on statistical tests to quantify the significance of differences in gene expression between samples. These statistical methods can furthermore be divided into two methodological categories: parametric tests and non-parametric tests. The DiffGenes()
function available in myTAI
implements the most popular and useful methods to detect differentially expressed genes. In the literature, different methods have been introduced and discussed for microarray technologies versus RNA-Seq technologies.
In this section we will introduce all methods implemented in DiffGenes()
using small examples and will furthermore, discuss published advantages and disadvantages of each method and each mRNA quantification technology.
Note that when using DiffGenes()
it is assumed that your input dataset has been normalized before passing it to DiffGenes()
. For RNA-Seq data DiffGenes()
assumes that the libraries have been normalized to have the same size, i.e., to have the same expected column sum under the null hypothesis (or the lib.size argument in DiffGenes()
is specified accordingly).
A fold change in gene expression is simply the ratio of the gene expression level of one sample against a second sample: \(\frac{e_{i1}}{e_{i2}}\), where \(e_{i1}\) is the expression level of gene \(i\) in sample one and \(e_{i2}\) is the expression level of gene \(i\) in sample two. In case replicate expression levels are present for each sample the ratio of means of the corresponding replicates is computed: \(\frac{\bar{e}_{i1}}{\bar{e}_{i2}}\), where \(\bar{e}_{i1}\) is the mean of replicate expression levels of gene \(i\) in sample one and \(\bar{e}_{i2}\) is the mean of replicate expression levels of gene \(i\) in sample two.
Advantages: Given a small number of replicate values the statistical evaluation of differentially expressed genes might be biased (depending on the statistical test chosen) by underlying sample distributions which are not fulfilled or because a small number of replicate values is not sufficient enough to perform non-parametric tests. Here, fold-changes provide a simple way to quantify gene expression differences between samples by \(n\)-fold enrichment. In our opinion, although the process of choosing a threshold for defining genes as being differentially expressed or not based on fold-change values is purely subjective and relies on common sense, in some cases this procedure will be more suitable than defining differentially expressed genes based on p-values obtained from a test statistic with violated test assumptions.
Disadvantages: If used appropriately, statistical tests not only systematically quantify the significance of the observed gene-by-gene differences of expression, but furthermore, accounts the variance of replicate expression levels when comparing the mean difference of replicate expression levels between samples. Hence, the gene specific variance between replicates is also quantified by the p-value returned by the sufficient test statistic which is not quantified by a simple fold-change measure.
For the following example we assume that PhyloExpressionSetExample[1:5,1:8]
stores 5 genes and 3 developmental stages with 2 replicate expression levels per stage.
data("PhyloExpressionSetExample")
# Detection of DEGs using the fold-change measure
DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "foldchange",
stage.names = c("S1","S2","S3"))
head(DEGs)
Phylostratum GeneID S1->S2 S1->S3 S2->S1 S2->S3 S3->S1 S3->S2
1 1 at1g01040.2 1.6713881 2.0806706 0.5983051 1.2448758 0.4806143 0.8032930
2 1 at1g01050.1 1.0273222 1.2709185 0.9734045 1.2371177 0.7868325 0.8083305
3 1 at1g01070.1 1.3087379 1.4044799 0.7640949 1.0731560 0.7120073 0.9318310
4 1 at1g01080.2 0.7779572 0.7286769 1.2854177 0.9366542 1.3723503 1.0676299
5 1 at1g01090.1 0.3803866 0.2288961 2.6289042 0.6017460 4.3687939 1.6618307
The resulting output shows all combinations of fold-changes between samples (developmental stages). Here, S1->S2
denotes that the fold-change was computed for expression levels of stage S1
against stage S2
.
When selecting method = "log-foldchange"
it is assumed that the input ExpressionSet
stores log2
expression levels. Here, we transform absolute expression levels stored in PhyloExpressionSetExample
to log2
expression levels using the tf()
function before log-fold-changes are computed.
data("PhyloExpressionSetExample")
# Detection of DEGs using the logfold-change measure
log.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
nrep = 2,
method = "log-foldchange",
stage.names = c("S1","S2","S3"))
head(log.DEGs)
Phylostratum GeneID S1->S2 S1->S3 S2->S1 S2->S3 S3->S1 S3->S2
1 1 at1g01040.2 0.74104679 1.0570486 -0.74104679 0.31600182 -1.0570486 -0.31600182
2 1 at1g01050.1 0.03888868 0.3458715 -0.03888868 0.30698280 -0.3458715 -0.30698280
3 1 at1g01070.1 0.38817621 0.4900360 -0.38817621 0.10185975 -0.4900360 -0.10185975
4 1 at1g01080.2 -0.36223724 -0.4566488 0.36223724 -0.09441158 0.4566488 0.09441158
5 1 at1g01090.1 -1.39446159 -2.1272350 1.39446159 -0.73277345 2.1272350 0.73277345
The resulting output stores all combinations of log fold-changes between samples (developmental stages).
The Welch t-test
is a parametric test to statistically quantify the difference of sample means in cases where the assumption of homogeneity of variance (equal variances in the two populations) is violated (Boslaugh, 2013). The Welch t-test
is a sufficient parameter test for small sample sizes and thus, has been used to detect differentially expressed genes based on p-values returned by the test statistic (Hahne et al., 2008).
In detail, the test statistic is computed as follows:
\(t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\)
where \(\bar{x}_1\) and \(\bar{x}_2\) are sample means, \(s_1^2\) and \(s_2^2\) are the sample variances, and \(n_1\) and \(n_2\) are the sample sizes.
The degrees of freedom for Welch’s t-test are then computed as follows:
\(df = \frac{\big(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\big)^2}{\frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)}}\)
To perform a sufficient Welch t-test
the following assumptions about the input data need to be fulfilled to test whether two samples come from populations with equal means:
Assumptions about input data
Nevertheless, although in most cases log2
expression levels are used to perform the Welch t-test
assuming that expression levels are log-normal distributed which approximates a normal distribution in infinity, in most cases the small number of replicates is not sufficient enough to fulfill the (approximate) normality assumption of the Welch t-test
.
Due to this fact, non-parametric, sampling based, or generalized linear model based methods have been proposed to quantify p-values of differential expression. Nevertheless, the DiffGenes()
function implements the Welch t-test
for the detection of differentially expressed genes, allowing users to compare the results with more recent DEG detection methods/methodologies also implemented in DiffGenes()
.
Performing Welch t-test
with DiffGenes()
can be done by specifying method = "t.test"
. Internally DiffGenes()
performs a two sided Welch t-test
. This means that the Welch t-test
quantifies only whether or not a gene is significantly differentially expressed, but not the direction of enrichment (over-expressed or under-expressed).
The PhyloExpressionSetExample
we use in the following example stores absolute expression levels. In case your ExpressionSet
also stores absolute expression levels (which is likely due to the ExpressionSet
standard for Phylotranscriptomics analyses), you can use the tf()
function implemented in myTAI
to transform absolute expression levels to log2
expression levels before performing DiffGenes()
with a Welch t-test
, e.g. tf(PhyloExpressionSetExample[1:5,1:8],log2)
. In general, using log2
transformed expression levels as input ExpressionSet
of DiffGenes()
allows us to (at least) assume that samples (replicate expression levels) used to perform the Welch t-test
are log-normal distributed and therefore, somewhat approximate normal distributed.
Please notice however, that RNA-Seq data can include count values of 0. So when transforming absolute counts to log2
counts infinity values of log2(0) = -Inf
will be produced and therefore, p-value computations will not be possible. To avoid this case you could either remove RNA-Seq count values of 0 from the input dataset using the Expressed()
function (see section Filter for Expressed Genes), e.g. pass tf(Expressed(PhyloExpressionSetExample[1:5,1:8], cut.off = 1),log2)
as ExpressionSet
argument to DiffGenes()
or shift all count values by a constant value, e.g. pass tf(PhyloExpressionSetExample[1:5,1:8], function(x) log2(x + 1))
as ExpressionSet
argument to DiffGenes()
.
Internally, DiffGenes()
will also check for 0 values in input data and will automatically shift all expression levels by +1
in case 0 values are included.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by a Welch t-test
ttest.DEGs <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
nrep = 2,
method = "t.test",
stage.names = c("S1","S2","S3"))
# look at the results
ttest.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.027832572 0.04020203 0.13481563
2 1 at1g01050.1 0.852379466 0.31471871 0.36326955
3 1 at1g01070.1 0.003200692 0.00113536 0.02236621
4 1 at1g01080.2 0.086426813 0.03092924 0.45999438
5 1 at1g01090.1 0.090387087 0.04638872 0.04978092
The resulting data.frame
stores the p-values of stage-wise comparisons for each gene. To adjust p-values for multiple testing of stage-wise comparisons you can specify the p.adjust.method
argument with one of the p-value adjustment methods implemented in DiffGenes()
.
In detail, correcting for multiple testing allows to appropriately choose selection cut-offs for p-values fulfilling the differential expression criteria. Hahne et al., 2008 (p. 87) give a nice example of correcting for multiple testing to determine appropriate selection cut-offs.
Please consult the documentation of ?p.adjust
to see which p-value adjustment methods are implemented in DiffGenes()
.
Please also consult these reviews (Biostatistics Handbook, Gelman et al., 2008, and Slides) to decide whether or not to apply p-value adjustment to your own dataset.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
ttest.DEGs.p_adjust <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:5,1:8],log2),
nrep = 2,
method = "t.test",
p.adjust.method = "BH",
stage.names = c("S1","S2","S3"))
ttest.DEGs.p_adjust
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.06958143 0.0579859 0.2246927
2 1 at1g01050.1 0.85237947 0.3147187 0.4540869
3 1 at1g01070.1 0.01600346 0.0056768 0.1118311
4 1 at1g01080.2 0.11298386 0.0579859 0.4599944
5 1 at1g01090.1 0.11298386 0.0579859 0.1244523
The resulting p-value adjusted data.frame
can be used to filter for differentially expressed genes. Here, specifying the arguments: comparison
, alpha
, and filter.method
in DiffGenes()
allows users to obtain only significant differentially expressed genes.
# Detection of DEGs using the p-value returned by a Welch t-test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
ttest.DEGs.p_adjust.filtered <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:10 ,1:8],log2),
nrep = 2,
method = "t.test",
p.adjust.method = "BH",
stage.names = c("S1","S2","S3"),
comparison = "above",
alpha = 0.05,
filter.method = "n-set",
n = 1)
# look at the genes fulfilling the filter criteria
ttest.DEGs.p_adjust.filtered
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
3 1 at1g01070.1 0.03200692 0.0113536 0.2192432
In this example, only 1 out of 10 genes fulfills the p-value criteria (alpha = 0.05
) in at least one stage comparison.
Finally, users can rank genes in increasing p-value order for each stage comparison by typing:
ttest.DEGs.p_adjust <- DiffGenes(ExpressionSet = tf(PhyloExpressionSetExample[1:500,1:8],log2),
nrep = 2,
method = "t.test",
p.adjust.method = "BH",
stage.names = c("S1","S2","S3"))
head(ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3])
Phylostratum GeneID S1<->S2
54 1 at1g02400.1 0.151388
119 1 at1g03870.1 0.151388
137 1 at1g04380.1 0.151388
289 1 at1g08110.4 0.151388
383 1 at1g10360.1 0.151388
413 1 at1g11040.1 0.151388
Here the line ttest.DEGs.p_adjust[order(ttest.DEGs.p_adjust[ , "S1<->S2"], decreasing = FALSE) , 1:3]
will sort p-values of stage comparison "S1<->S2"
in increasing order.
The Wilcoxon-Mann-Whitney test is a nonparametric test to quantify the shift in empirical distribution parameters. Nonparametric tests are useful when sample populations do not meet the test assumptions of parametric tests.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
Wilcox.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "wilcox.test",
stage.names = c("S1","S2","S3"))
# look at the results
Wilcox.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.3333333 0.3333333 0.3333333
2 1 at1g01050.1 1.0000000 0.3333333 0.3333333
3 1 at1g01070.1 0.3333333 0.3333333 0.3333333
4 1 at1g01080.2 0.3333333 0.3333333 0.6666667
5 1 at1g01090.1 0.3333333 0.3333333 0.3333333
Again, users can adjust p-values by specifying the p.adjust.method
argument.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by a Wilcoxon-Mann-Whitney test
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
Wilcox.DEGs.adj <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "wilcox.test",
stage.names = c("S1","S2","S3"),
p.adjust.method = "BH")
# look at the results
Wilcox.DEGs.adj
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.4166667 0.3333333 0.4166667
2 1 at1g01050.1 1.0000000 0.3333333 0.4166667
3 1 at1g01070.1 0.4166667 0.3333333 0.4166667
4 1 at1g01080.2 0.4166667 0.3333333 0.6666667
5 1 at1g01090.1 0.4166667 0.3333333 0.4166667
Exact Tests for Differences between two groups of negative-binomial counts implemented in DiffGenes()
are based on the edgeR
function exactTest()
. Please consult the edgeR Users Guide for mathematical details.
The detection of DEGs using negative binomial models is based on the powerful implementations provided by the edgeR package. Hence, before using the negative binomial models in DiffGenes()
users need to install the edgeR package.
# install edgeR
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
This method computes two-sided p-values by doubling the smaller tail probability (see ?exactTestByDeviance
for details). To compute p-values for stagewise comparisons based on negative binomial models, the DiffGenes()
argument method = "doubletail"
, the number of replicates per stage nrep
, and lib.size
quantifying the library size to equalize sample library sizes by quantile-to-quantile normalization need to be specified (see also ?equalizeLibSizes
).
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Double Tail Method
DoubleTail.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "doubletail",
lib.size = 1000,
stage.names = c("S1","S2","S3"))
# look at the results
DoubleTail.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.26026604 0.110233012 0.6304508
2 1 at1g01050.1 0.95314428 0.598102712 0.6398757
3 1 at1g01070.1 0.55461941 0.456018563 0.8774231
4 1 at1g01080.2 0.58130025 0.487028051 0.8860005
5 1 at1g01090.1 0.03615134 0.001773543 0.2645537
Again, users can adjust p-values by specifying the p.adjust.method
argument.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Double Tail Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
DoubleTail.DEGs.adj <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "doubletail",
lib.size = 1000,
stage.names = c("S1","S2","S3"),
p.adjust.method = "BH")
# look at the results
DoubleTail.DEGs.adj
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.6506651 0.275582530 0.8860005
2 1 at1g01050.1 0.9531443 0.598102712 0.8860005
3 1 at1g01070.1 0.7266253 0.598102712 0.8860005
4 1 at1g01080.2 0.7266253 0.598102712 0.8860005
5 1 at1g01090.1 0.1807567 0.008867715 0.8860005
This method performs the method of small probabilities as proposed by Robinson and Smyth (2008) (see exactTestBySmallP
for details). To compute p-values for stagewise comparisons based on negative binomial models, the DiffGenes()
argument method = "doubletail"
, the number of replicates per stage nrep
, and lib.size
quantifying the library size to equalize sample library sizes by quantile-to-quantile normalization need to be specified (see also ?equalizeLibSizes
).
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Small-P Method
SmallP.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "smallp",
lib.size = 1000,
stage.names = c("S1","S2","S3"))
# look at the results
SmallP.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.26026604 0.110233012 0.6304508
2 1 at1g01050.1 0.95314428 0.598102712 0.6398757
3 1 at1g01070.1 0.55461941 0.456018563 0.8774231
4 1 at1g01080.2 0.58130025 0.487028051 0.8860005
5 1 at1g01090.1 0.03615134 0.001773543 0.2645537
Again, users can adjust p-values by specifying the p.adjust.method
argument.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Small-P Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
SmallP.DEGs.adj <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "smallp",
lib.size = 1000,
stage.names = c("S1","S2","S3"),
p.adjust.method = "BH")
# look at the results
SmallP.DEGs.adj
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.6506651 0.275582530 0.8860005
2 1 at1g01050.1 0.9531443 0.598102712 0.8860005
3 1 at1g01070.1 0.7266253 0.598102712 0.8860005
4 1 at1g01080.2 0.7266253 0.598102712 0.8860005
5 1 at1g01090.1 0.1807567 0.008867715 0.8860005
This method uses the deviance goodness of fit statistics to define the rejection region, and is therefore equivalent to a conditional likelihood ratio test (see exactTestByDeviance
for details). To compute p-values for stagewise comparisons based on negative binomial models, the DiffGenes()
argument method = "doubletail"
, the number of replicates per stage nrep
, and lib.size
quantifying the library size to equalize sample library sizes by quantile-to-quantile normalization need to be specified (see also ?equalizeLibSizes
).
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Deviance
Deviance.DEGs <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "deviance",
lib.size = 1000,
stage.names = c("S1","S2","S3"))
# look at the results
Deviance.DEGs
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.26026604 0.110233012 0.6304508
2 1 at1g01050.1 0.95314428 0.598102712 0.6398757
3 1 at1g01070.1 0.55461941 0.456018563 0.8774231
4 1 at1g01080.2 0.58130025 0.487028051 0.8860005
5 1 at1g01090.1 0.03615134 0.001773543 0.2645537
Again, users can adjust p-values by specifying the p.adjust.method
argument.
data("PhyloExpressionSetExample")
# Detection of DEGs using the p-value returned by the Deviance Method
# and furthermore, adjust p-values for multiple comparison
# using the Benjamini & Hochberg (1995) method: method = "BH"
# and filter for significantly differentially expressed genes (alpha = 0.05)
Deviance.DEGs.adj <- DiffGenes(ExpressionSet = PhyloExpressionSetExample[1:5,1:8],
nrep = 2,
method = "deviance",
lib.size = 1000,
stage.names = c("S1","S2","S3"),
p.adjust.method = "BH")
# look at the results
Deviance.DEGs.adj
Phylostratum GeneID S1<->S2 S1<->S3 S2<->S3
1 1 at1g01040.2 0.6506651 0.275582530 0.8860005
2 1 at1g01050.1 0.9531443 0.598102712 0.8860005
3 1 at1g01070.1 0.7266253 0.598102712 0.8860005
4 1 at1g01080.2 0.7266253 0.598102712 0.8860005
5 1 at1g01090.1 0.1807567 0.008867715 0.8860005
Users can also perform replicate quality checks to quantify the variability between replicate expression levels fo each stage separately.
The PlotReplicateQuality()
is designed to perform customized replicate variablity checks for any ExpressionSet
object storing replicates.
data(PhyloExpressionSetExample)
# visualize the sd() between replicates
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
nrep = 2,
legend.pos = "topright",
ylim = c(0,0.2),
lwd = 6)
The resulting plot visualizes the kernel density estimates for the variance (log variance) between replicates. Each curve represents the density function for the replicate variation within one stage or experiment. In this case the variance between replicates of Stage 1
to Stage 3
(each including 2 replicates) seem to deviate from each other allowing the conclusion that each stage has a different expression level variability between replicates.
The FUN
argument implemented in PlotReplicateQuality()
allows users to furthermore, specify customized criteria quantifying replicate varibility. Please notice that the function specified in FUN
will be performed separately on each gene and stage.
In the following example the median bbsolute deviation function mad()
is used to quantify replicate variability.
data(PhyloExpressionSetExample)
# visualize the mad() between replicates
PlotReplicateQuality(ExpressionSet = PhyloExpressionSetExample[ , 1:8],
nrep = 2,
FUN = mad,
legend.pos = "topright",
ylim = c(0,0.015),
lwd = 6)
In general, users are not limited to specific functions implememnted in R. By writing customized functions such as FUN = function(x) return((x - mean(x))^2)
users can define their own criteria to quantify replicate variability and can then apply this criteria to PlotReplicateQuality()
by specifying the FUN
argument.
After performing differential gene expression analyses, replicate expression levels are collapsed to a single stage specific expression level. For this purpose, myTAI
implements the CollapseReplicates()
function, allowing users to combine replicate expression levels stored in a standard PhyloExpressionSet
or DivergenceExpressionSet
object to a stage specific expression level using a specified window function.
library(myTAI)
# load example data
data(PhyloExpressionSetExample)
# genrate an example PhyloExpressionSet with replicates
ExampleReplicateExpressionSet <- PhyloExpressionSetExample[ ,1:8]
# rename stages
names(ExampleReplicateExpressionSet)[3:8] <- c("Stage_1_Repl_1","Stage_1_Repl_2",
"Stage_2_Repl_1","Stage_2_Repl_2",
"Stage_3_Repl_1","Stage_3_Repl_2")
# have a look at the example dataset
head(ExampleReplicateExpressionSet, 5)
Phylostratum GeneID Stage_1_Repl_1 Stage_1_Repl_2 Stage_2_Repl_1 Stage_2_Repl_2 Stage_3_Repl_1 Stage_3_Repl_2
1 1 at1g01040.2 2173.635 1911.2001 1152.555 1291.4224 1000.253 962.9772
2 1 at1g01050.1 1501.014 1817.3086 1665.309 1564.7612 1496.321 1114.6435
3 1 at1g01070.1 1212.793 1233.0023 939.200 929.6195 864.218 877.2060
4 1 at1g01080.2 1016.920 936.3837 1181.338 1329.4734 1392.643 1287.9746
5 1 at1g01090.1 11424.567 16778.1685 34366.649 39775.6405 56231.569 66980.3673
Now, assume that this example PhyloExpressionSet
stores three developmental stages and 2 biological replicates for each developmental stage. Of course, we could now compute and visualize the TAI profile by typing:
# visualize the TAI profile over 3 stages of development
# and 2 replicates per stage
PlotPattern(ExpressionSet = ExampleReplicateExpressionSet,
type = "l",
lwd = 6)
Usually, one would expect that variations in replicate values are smaller than variations between developmental stages. In this example however, we constructed replicate values that vary larger than expression levels between developmental stages. For many applications it might be useful to visualize TAI/TDI values of replicates as well, but normally replicate values are collapsed to one gene and stage specific value after differential gene expression analyses and replicate quality control have been performed.
The following example illustrates how to collapse replicates with CollapseReplicates()
:
# combine the expression levels of the 2 replicates (const) per stage
# using geom.mean as window function and rename new stages to: "S1","S2","S3"
CollapssedPhyloExpressionSet <- CollapseReplicates(
ExpressionSet = ExampleReplicateExpressionSet,
nrep = 2,
FUN = geom.mean,
stage.names = c("S1","S2","S3"))
# have a look at the collpased PhyloExpressionSet
head(CollapssedPhyloExpressionSet)
Phylostratum GeneID S1 S2 S3
1 1 at1g01040.2 2038.1982 1220.0147 981.4381
2 1 at1g01050.1 1651.6070 1614.2524 1291.4582
3 1 at1g01070.1 1222.8557 934.3975 870.6878
4 1 at1g01080.2 975.8215 1253.2189 1339.2866
5 1 at1g01090.1 13844.9740 36972.3612 61371.0937
6 1 at1g01120.1 815.3288 894.8987 905.8272
The nrep
argument specifies either a constant number of replicates per stage or a numeric vector storing variable numbers of replicates for each developmental stage. In our example, each developmental stage had a constant (equal) number of replicates per developmental stage (nrep = 2
). In case a variable stage specific number of replicates is present, one could specify nrep = c(2,3,2)
defining the case that developmental stage 1 stores 2 replicates, stage 2 stores 3 replicates, and stage 3 again, stores 2 replicates.
The argument FUN
specifies the window function to collapse replicate expression levels to a single stage specific value. In this example, we chose the geom.mean()
(geometric mean) function implemented in myTAI
, because our example PhyloExpressionSet
stores absolute expression levels. Notice that the mathematical equivalent of performing arithmetic mean (mean()
) computations on log
expression levels is to perform the geometric mean (geom.mean()
) on absolute expression levels.
The stage.names
argument then specifies the new names of collapsed stages.
After differential gene expression analyses and replicate aggregation have been performed, some studies filter gene expression levels in RNA-Seq count tables or microarray expression matrices for non-expressed or outlier genes. For example, in most studies performing RNA-Seq experiments FPKM/RPKM values < 1 are remove from the processed (final) count table.
For this purpose myTAI
implements the Expressed()
function to filter (remove) expression levels in RNA-Seq count tables or microarray expression matrices which do not pass a defined expression threshold.
The Expressed()
function takes a standard PhyloExpressionSet
or DivergenceExpressionSet
object storing a RNA-Seq count table (CT) or microarray gene expression matrix and removes genes from this count table or gene expression matrix that have an expression level below a defined cut.off
value.
Expressed()
allows users to choose from several gene extraction methods (see ?Expressed
for details):
const
: all genes that have at least one stage that undercuts or exceeds the expression cut.off
will be excluded from the ExpressionSet
. Hence, for a 7 stage ExpressionSet
genes passing the expression level cut.off
in 6 stages will be retained in the ExpressionSet
.
min-set
: genes passing the expression level cut.off
in ceiling(n/2)
stages will be retained in the ExpressionSet
, where n
is the number of stages in the ExpressionSet
.
n-set
: genes passing the expression level cut.off
in n
stages will be retained in the ExpressionSet
. Here, the argument n
is defining the number of stages for which the threshold criteria should be fulfilled.
# check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260
# remove genes that have an expression level below 8000
# in at least one developmental stage
FilterConst <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off = 8000,
comparison = "below",
method = "const")
nrow(FilterConst) # check number of retained genes
#> [1] 449
Users will observe that only 449 out of 25260 genes in PhyloExpressionSetExample
have an absolute expression level above 8000
when omitting genes using method = 'const'
. The argument comparison
specifies whether genes having expression levels below, above, or below AND above (both) the cut.off
value should be removed from the dataset.
The following comparison methods can be selected:
comparison = "below"
: define genes as not expressed which undercut the cut-off
threshold.comparison = "above"
: define genes as outliers which exceed the cut-off
threshold.comparison = "both"
: remove genes fulfilling the comparison = "below"
AND comparison = "above"
criteria.# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260
# remove genes that have an expression level above 12000
# in at least one developmental stage (outlier removal)
FilterConst.above <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off = 12000,
comparison = "above",
method = "const")
nrow(FilterConst.above) # check number of retained genes
#> [1] 23547
For this example 25260 - 23547 = 1713 have been classified as outliers (expression levels above 12000) and were removed from the dataset.
# again: check number of genes in PhyloExpressionSetExample
nrow(PhyloExpressionSetExample)
#> [1] 25260
# remove genes that have an expression level below 8000 AND above 12000
# in at least one developmental stage (non-expressed genes AND outlier removal)
FilterConst.both <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off = c(8000,12000),
comparison = "both",
method = "const")
nrow(FilterConst.both) # check number of retained genes
#> [1] 2
When selecting comparison = 'both'
, the cut.off
argument receives 2 threshold values: the below cut.off
as first element and the above cut.off
as second element. In this case cut.off = c(8000,12000)
. Here, only 2 genes fulfill these criteria.
Analogously, users can specify the number of stages that should fulfill the threshold criteria using the n-set
method.
# remove genes that have an expression level below 8000
# in at least 5 developmental stages (in this case: n = 2 stages fulfilling the criteria)
FilterNSet <- Expressed(ExpressionSet = PhyloExpressionSetExample,
cut.off = 8000,
method = "n-set",
comparison = "below",
n = 2)
nrow(FilterMinSet) # check number of retained genes
#> [1] 20
Here, 20 genes are fulfilling these criteria.
In some cases (high variability of replicates) it might be useful to verify that there is no sequence of replicates (for all possible combination of replicates) that results in a non-significant TAI
or TDI
pattern, when the initial pattern with combined replicates was shown to be significant.
The CombinatorialSignificance()
function implemented in myTAI
allows users to compute the p-values quantifying the statistical significance of the underlying pattern for all combinations of replicates.
Assume a PhyloExpressionSet
stores 3 developmental stages with 3 replicates measured for each stage. The 9 replicates in total are denoted as: \(1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3\). Now the function computes the statistical significance of each pattern derived by the corresponding combination of replicates, e.g.
1.1, 2.1, 3.1 : p-value for combination 1
1.1, 2.2, 3.1 : p-value for combination 2
1.1, 2.3, 3.1 : p-value for combination 3
1.2, 2.1, 3.1 : p-value for combination 4
1.2, 2.1, 3.1 : p-value for combination 5
1.2, 2.1, 3.1 : p-value for combination 6
1.3, 2.1, 3.1 : p-value for combination 7
1.3, 2.2, 3.1 : p-value for combination 8
1.3, 2.3, 3.1 : p-value for combination 9
…
This procedure yields 27 p-values for the \(3^3\) (\(n^m\)) replicate combinations, where \(n\) denotes the number of developmental stages and \(m\) denotes the number of replicates per stage.
Note that in case users have a large amount of stages/experiments and a large amount of replicates the computation time will increase by \(n^m\). For 11 stages and 4 replicates, \(4^{11}\) = 4194304 p-values have to be computed. Each p-value computation itself is based on a permutation test running with \(1,000, 10,000, ...\) or more permutations. Be aware that this might take some time.
The p-value vector returned by this function can then be used to plot the p-values to see whether an critical value \(\alpha\) is exceeded or not (e.g. \(\alpha = 0.05\)).
# load a standard PhyloExpressionSet
data(PhyloExpressionSetExample)
# we assume that the PhyloExpressionSetExample
# consists of 3 developmental stages
# and 2 replicates for stage 1, 3 replicates for stage 2,
# and 2 replicates for stage 3
# FOR REAL ANALYSES PLEASE USE: permutations = 1000 or 10000
# BUT NOTE THAT THIS TAKES MUCH MORE COMPUTATION TIME
p.vector <- CombinatorialSignificance(ExpressionSet = PhyloExpressionSetExample,
replicates = c(2,3,2),
TestStatistic = "FlatLineTest",
permutations = 100,
parallel = FALSE)
[1] 2.436296e-03 2.288593e-02 1.608399e-03 1.185615e-02 1.835306e-06 1.077012e-05
[7] 2.025515e-07 5.148342e-07 1.654885e-07 6.251145e-06 9.265520e-10 1.047479e-06
Users will observe that none of the replicate combinations resulted in p-values > 0.05 and thus we can assume that the phylotranscriptomic pattern computed based on collapsed replicates is not biased by insignificant replicate combinations.
any(p.vector > 0.05)
#> FALSE
CombinatorialSignificance()
can perform all significance tests introduced in the Introduction and Intermediate vignettes.
Furthermore, the parallel
argument allows users to perform significance computations in parallel on a multicore machine. This will speed up p-value computations for a large number of combinations.