Skip to contents

The motivation for transformating of RNA-seq data

As noted in the vignette Statistical testing with myTAI, RNA-seq data transformation (e.g. log2 transformation) can affect the results of the tests and the overall TAI (as well as TDI, TSI and analogous) profile. There has been a lively debate around this topic (see Piasecka et al. (2013) and Liu & Robinson-Rechavi (2018) for more information).

How can we use this to our advantage?

The myTAI package provides a flexible framework to apply different transformations to the expression data in a PhyloExpressionSet object. This allows you to evaluate the robustness of the results, i.e. whether the observed patterns are driven primarily by variation in highly expressed genes (e.g. raw TPM or counts), by variation in lowly expressed genes (e.g. under log2 or vst transformation), or somewhere in between (e.g. sqrt transformation).

Transforming RNA-seq data for myTAI

The myTAI package provides a function myTAI::tf() that allows you to apply any transformation function to the expression data in a PhyloExpressionSet object. This function is flexible and can be used with any transformation function that takes a matrix as input.

Here is a more extreme example of changes to the overall TAI pattern due to RNA-seq transformations.

data("example_phyex_set")
# no transformation
myTAI::plot_signature(example_phyex_set)
## Computing: [====================                    ] 49%  (~1s remaining)       Computing: [======================================= ] 96%  (~0s remaining)       Computing: [========================================] 100% (done)

plot_signature function output no transform

# rank transformation
myTAI::plot_signature(example_phyex_set |> myTAI::tf(FUN = function(x) apply(x, 2, base::rank)))
## Computing: [===================                     ] 46%  (~1s remaining)       Computing: [======================================  ] 93%  (~0s remaining)       Computing: [========================================] 100% (done)

plot_signature function output rank transform

Indeed, myTAIv2 is fitted with a single plotting function (myTAI::plot_signature_transformed()) to evaluate the robustness of the patterns against multiple transformations, i.e.

myTAI::plot_signature_transformed(example_phyex_set)
## Computing: [====================                    ] 49%  (~1s remaining)       Computing: [========================================] 98%  (~0s remaining)       Computing: [========================================] 100% (done)                         
## Computing: [====================                    ] 49%  (~1s remaining)       Computing: [======================================= ] 97%  (~0s remaining)       Computing: [========================================] 100% (done)                         
## Computing: [====================                    ] 49%  (~1s remaining)       Computing: [======================================= ] 96%  (~0s remaining)       Computing: [========================================] 100% (done)                         
## Computing: [=================                       ] 42%  (~1s remaining)       Computing: [==================================      ] 84%  (~0s remaining)       Computing: [========================================] 100% (done)

plot_signature_transformed function output transform all

See how different some of the transformations are! In general, though, the earlier stages are younger compared to the mid-stage. Across most transformations, the late-stage is younger compared to the mid-stage. This is consistent with the molecular hourglass model.

In addition, the p-values for various tests (default = flat line test) can be assessed through the myTAI::tf_stability() function. The test can be changed with the conservation_test argument, e.g. conservation_test = stat_flatline_test, conservation_test = stat_flatline_test and so on.

tf_stability_res <- myTAI::tf_stability(example_phyex_set)
tf_stability_res
##         none         sqrt         log2         rank          vst         rlog 
## 4.882605e-01 7.333446e-06 1.441248e-36 1.799337e-67 4.203084e-34 1.056360e-28

Here, we can see that the p-values for the flat line test are all below 0.05 for all transformations. This is reassuring as to the stability of the TAI profile across transformations.

Indeed, within the myTAI::plot_signature_transformed() function, we can also set the conservation_test parameter (e.g. conservation_test = stat_flatline_test) alongside the display for the p-values (show_p_val = TRUE), to get an all-in-one plot!

If you are curious about the exact nature of the transformations used by myTAI::plot_signature_transformed() and myTAI::tf_stability() you can check the list of functions that were applied via print(myTAI::COUNT_TRANSFORMS). Note that DESeq2-based transforms (‘vst’, ‘rlog’) are only available if DESeq2 is installed.

Removing or adding transformations

To remove transformations, e.g. “rlog”, “rank”, “vst” from the list, e.g.

COUNT_TRANSFORMS_REMOVED <- COUNT_TRANSFORMS[!names(COUNT_TRANSFORMS) %in% c("rlog", "rank", "vst")]

To add transformations, e.g.

COUNT_TRANSFORMS_ADDED <- c(COUNT_TRANSFORMS, list(new_transform = function(x) x + 1))

Then apply to tf_stability or plot_signature functions, e.g.

myTAI::tf_stability(example_phyex_set, transformations = COUNT_TRANSFORMS_REMOVED)
myTAI::plot_signature_transformed(example_phyex_set, transformations = COUNT_TRANSFORMS_REMOVED)

For ScPhyloExpressionSet, it is important to remove certain transformations such as “rlog”, “rank”, “vst” and “log2” as they are not suitable for single cell data.

Other robustness checks

Alternatively (or in conjunction) to the transformations, myTAIv2 allows us to check how stable the TAI profile is after progressively removing the top expressed genes. This can be done using the myTAI::plot_signature_gene_quantiles() function. By default, the quantiles are set to 1, 0.99, 0.95, 0.9 and 0.8 (i.e. from removing no genes quantile = 1 to removing the top 20% of expressed genes quantile = 0.8).

myTAI::plot_signature_gene_quantiles(example_phyex_set)

plot_signature_gene_quantiles function output

From this, we can see that the most of the ‘bulge’ on the Bent Cotyledon stage is driven by a very few highly expressed genes. Otherwise, the TAI profile is very stable.

We could also filter genes based on other categories using:
myTAI::genes_top_variance() → get high variance genes. myTAI::genes_top_mean() and myTAI::genes_top_expr() → get highly expressed genes, like in in myTAI::plot_signature_gene_quantiles()
myTAI::genes_lowly_expressed() → get lowly expressed genes

See filtering workflow
# get top 1% variance genes (default parameters)
genes.top_var <- myTAI::genes_top_variance(example_phyex_set)
# get top 1% expressed genes (default parameters)
genes.top_mean <- myTAI::genes_top_mean(example_phyex_set)
# get lowly expressed genes with mean < 1 (default parameters)
genes.low_expr <- myTAI::genes_lowly_expressed(example_phyex_set)

Once you have a gene set, we can use myTAI::remove_genes() to filter out genes in a given set:

# removing top 1% variance genes
example_phyex_set.rm_top_var <-
  myTAI::remove_genes(example_phyex_set, genes.top_var)
# plot
myTAI::plot_signature(example_phyex_set.rm_top_var)

plot_signature function output removing top variance

We could also use myTAI::select_genes() to keep the genes in a given set:

# select top 1% expressed genes
example_phyex_set.select_top_expr <- 
  myTAI::select_genes(example_phyex_set, genes.top_mean)
# plot
myTAI::plot_signature(example_phyex_set.select_top_expr)
## Computing: [========================================] 100% (done)

plot_signature function output for top expressed

If you have certain genes of interest, we can also select or remove individual genes with myTAI::select_genes() and myTAI::remove_genes(), respectively.

# remove two genes (AT1G02780 & AT1G03880)
example_phyex_set.select_goi <- 
  myTAI::remove_genes(example_phyex_set, c("AT1G02780", "AT1G03880"))
# plot
myTAI::plot_signature(example_phyex_set.select_goi)

plot_signature function output for gene of interest

For a more sophisticated workflow to extract genes that drive the TAI profile, check out this vignette
Break TAI patterns using gaTAI

Transforming gene age information for myTAI

Another aspect of the myTAI package is the ability to transform gene age information itself using the myTAI::tf_PS() function. This workflow is still being tested and will be documented in due time.

Conclusion

The choice of RNA-seq data transformation (including no transformation) is an important and often overlooked aspect of gene expression data analysis. These choices may affect the results of the TAI (and related indices) profiles, as well as the statistical tests and the interpretation of the data.

myTAIv2 provides a powerful framework to apply various RNA-seq data transformations to BulkPhyloExpressionSet and ScPhyloExpressionSet objects. With this, we can use the transformed patterns to our advantage, namely, to evaluate the overall robustness of our patterns. We also explored the stability of the TAI profile after progressively removing the top expressed genes. Taken together, these analyses allow us to better understand the expression levels of genes that drive the TAI profile.

While we can perturb the TAI patterns in silico, nothing beats repeating experiments in the lab, with multiple biological replicates and gold-standard protocols for RNA-seq data generation and processing.