Skip to contents

This vignette will introduce users to the retrieval of taxonomic information with myTAI. The taxonomy() function (formerly) implemented in myTAI relies on the powerful package taxize. More specifically, the taxonomic information retrieval has been customized for the myTAI standard and for organism specific information retrieval.

While the previous taxonomy() function has been deprecated since taxize() was pulled from CRAN, users can nevertheless follow the taxonomy pipeline by installing the taxize package and copy the old taxonomy function.

# install taxize from CRAN
install.packages("taxize")

# if taxize is not available again
install.packages("remotes")
remotes::install_github("ropensci/taxize")

Copy the taxonomy function:

open for the taxonomy function

Click on the copy icon to copy the function.

#' @title Retrieving Taxonomic Information of a Query Organism
#' @description This function takes the scientific name of a query organism
#' and returns selected output formats of taxonomic information for the corresponding organism.
#' @param organism a character string specifying the scientific name of a query organism.
#' @param db a character string specifying the database to query, e.g. \code{db} = \code{"itis"} or \code{"ncbi"}.
#' @param output a character string specifying the taxonomic information that shall be returned. 
#' Implemented are: \code{output} = \code{"classification"}, \code{"taxid"}, or \code{"children"}.
#' @details This function is based on the powerful package \pkg{taxize} and implements
#' the customized retrieval of taxonomic information for a query organism. 
#' 
#' The following data bases can be selected to retrieve taxonomic information:
#' 
#' \itemize{
#' \item \code{db = "itis"} : Integrated Taxonomic Information Service
#' \item \code{db = "ncbi"} : National Center for Biotechnology Information
#' }
#' 
#' 
#' 
#' @author Hajk-Georg Drost
#' @examples
#' \dontrun{
#' # retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
#' # from NCBI Taxonomy
#' taxonomy("Arabidopsis thaliana",db = "ncbi")
#' 
#' # the same can be applied to database : "itis"
#'  taxonomy("Arabidopsis thaliana",db = "itis")
#' 
#' # retrieving the taxonomic hierarchy of "Arabidopsis"
#'  taxonomy("Arabidopsis",db = "ncbi") # analogous : db = "ncbi" or "itis"
#' 
#' # or just "Arabidopsis"
#'  taxonomy("Arabidopsis",db = "ncbi")
#' 
#' # retrieving the taxonomy id of the query organism and in the correspondning database
#' # taxonomy("Arabidopsis thaliana",db = "ncbi", output = "taxid")
#' 
#' # the same can be applied to databases : "ncbi" and "itis"
#'  taxonomy("Arabidopsis thaliana",db = "ncbi", output = "taxid")
#'  taxonomy("Arabidopsis thaliana",db = "itis", output = "taxid")
#' 
#' 
#' # retrieve children taxa of the query organism stored in the correspondning database
#'  taxonomy("Arabidopsis",db = "ncbi", output = "children")
#' 
#' # the same can be applied to databases : "ncbi" and "itis"
#'  taxonomy("Arabidopsis thaliana",db = "ncbi", output = "children")
#'  taxonomy("Arabidopsis thaliana",db = "itis", output = "children")
#'  
#' }
#' @references
#' 
#' Scott Chamberlain and Eduard Szocs (2013). taxize - taxonomic search and retrieval in R. F1000Research,
#' 2:191. URL: http://f1000research.com/articles/2-191/v2.
#' 
#' Scott Chamberlain, Eduard Szocs, Carl Boettiger, Karthik Ram, Ignasi Bartomeus, and John Baumgartner
#' (2014) taxize: Taxonomic information from around the web. R package version 0.3.0.
#' https://github.com/ropensci/taxize
#' @export

taxonomy <- function(organism, db = "ncbi", output = "classification"){
        
        if (!is.element(output,c("classification","taxid","children")))
                stop ("The output '",output,"' is not supported by this function.")
        
        if (!is.element(db,c("ncbi","itis")))
                stop ("Database '",db,"' is not supported by this function.")
        
        name <- id <- NULL

        if (db == "ncbi")
                tax_hierarchy <- as.data.frame(taxize::classification(taxize::get_uid(organism), db = "ncbi")[[1]])
        
        else if (db == "itis")    
                tax_hierarchy <- as.data.frame(taxize::classification(taxize::get_tsn(organism), db = "itis")[[1]])
        
        # tryCatch({colnames(tax_hierarchy) <- c("name","rank","id")},stop("The connection to ",db," did not work properly. Please check your internet connection or maybe the API did change.", call. = FALSE))
        
        if(output == "classification"){
                
                return(tax_hierarchy)
        }
        
        if(output == "taxid"){
                
                        return(dplyr::select(dplyr::filter(tax_hierarchy, name == organism),id))
        }
        
        if(output == "children"){
                
                return(as.data.frame(taxize::children(organism, db = db)[[1]]))
        } 
}

The taxonomy() function implemented in myTAI can be used to classify genomes according to phylogenetic classification into Phylostrata (Phylostratigraphy) or to retrieve species specific taxonomic information when performing Divergence Stratigraphy (see Introduction for details).

For larger taxonomy queries it may be useful to create an NCBI Account and set up an ENTREZ API KEY.

# install.packages(c("taxize", "usethis"))
taxize::use_entrez()
# Create your key from your (brand-new) account's. 
# After generating your key set it as ENTREZ_KEY in .Renviron.
# ENTREZ_KEY='youractualkeynotthisstring'
# For that, use usethis::edit_r_environ()
usethis::edit_r_environ()

Taxonomic Information Retrieval

The taxonomy() function to retrieve taxonomic information.

retrieve taxonomy hierarchy

In the following example we will obtain the taxonomic hierarchy of Arabidopsis thaliana from NCBI Taxonomy.

# retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
# from NCBI Taxonomy
taxonomy( organism = "Arabidopsis thaliana", 
          db       = "ncbi",
          output   = "classification" )
                   name         rank      id
1    cellular organisms      no rank  131567
2             Eukaryota superkingdom    2759
3         Viridiplantae      kingdom   33090
4          Streptophyta       phylum   35493
5        Streptophytina      no rank  131221
6           Embryophyta      no rank    3193
7          Tracheophyta      no rank   58023
8         Euphyllophyta      no rank   78536
9         Spermatophyta      no rank   58024
10        Magnoliophyta      no rank    3398
11      Mesangiospermae      no rank 1437183
12       eudicotyledons      no rank   71240
13           Gunneridae      no rank   91827
14         Pentapetalae      no rank 1437201
15               rosids     subclass   71275
16              malvids      no rank   91836
17          Brassicales        order    3699
18         Brassicaceae       family    3700
19           Camelineae        tribe  980083
20          Arabidopsis        genus    3701
21 Arabidopsis thaliana      species    3702

The organism argument takes the scientific name of a query organism, the db argument specifies that database from which the corresponding taxonomic information shall be retrieved, e.g. ncbi (NCBI Taxonomy) and itis (Integrated Taxonomic Information System) and the output argument specifies the type of taxonomic information that shall be returned for the query organism, e.g. classification, taxid, or children.

The output of classification is a data.frame storing the taxonomic hierarchy of Arabidopsis thaliana starting with cellular organisms up to Arabidopsis thaliana. The first column stores the taxonomic name, the second column the taxonomic rank, and the third column the NCBI Taxonomy id for corresponding taxa.

Analogous classification information can be obtained from different databases.

# retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
# from the Integrated Taxonomic Information System
taxonomy( organism = "Arabidopsis thaliana", 
          db       = "itis",
          output   = "classification" )
              name          rank     id
1          Plantae       Kingdom 202422
2    Viridiplantae    Subkingdom 954898
3     Streptophyta  Infrakingdom 846494
4      Embryophyta Superdivision 954900
5     Tracheophyta      Division 846496
6  Spermatophytina   Subdivision 846504
7    Magnoliopsida         Class  18063
8          Rosanae    Superorder 846548
9      Brassicales         Order 822943
10    Brassicaceae        Family  22669
11     Arabidopsis         Genus  23040

The output argument allows you to directly access taxonomy ids for a query organism or species.

retrieve taxonomy ID from ncbi
# retrieving the taxonomy id of the query organism from NCBI Taxonomy
taxonomy( organism = "Arabidopsis thaliana", 
          db       = "ncbi", 
          output   = "taxid" )
    id
1 3702
retrieve taxonomy ID from itis
# retrieving the taxonomy id of the query organism from Integrated Taxonomic Information Service
taxonomy( organism = "Arabidopsis", 
          db       = "itis", 
          output   = "taxid" )
    id
1 23040

So far, the following data bases can be accesses to retrieve taxonomic information:

  • db = "itis" : Integrated Taxonomic Information Service
  • db = "ncbi" : National Center for Biotechnology Information

Retrieve Children Nodes

Another output supported by taxonomy() is children that returns the immediate children taxa for a query organism. This feature is useful to determine species relationships for quantifying recent evolutionary conservation with Divergence Stratigraphy.

retrieve children nodes from ncbi
# retrieve children taxa of the query organism stored in the correspondning database
taxonomy( organism = "Arabidopsis", 
          db       = "ncbi", 
          output   = "children" )
   childtaxa_id                                                     childtaxa_name childtaxa_rank
1       1547872                                              Arabidopsis umezawana        species
2       1328956 (Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica        species
3       1240361                         Arabidopsis thaliana x Arabidopsis arenosa        species
4        869750                          Arabidopsis thaliana x Arabidopsis lyrata        species
5        412662                                            Arabidopsis pedemontana        species
6        378006                         Arabidopsis arenosa x Arabidopsis thaliana        species
7        347883                                              Arabidopsis arenicola        species
8        302551                                              Arabidopsis petrogena        species
9         97980                                               Arabidopsis croatica        species
10        97979                                            Arabidopsis cebennensis        species
11        81970                                                Arabidopsis halleri        species
12        59690                                             Arabidopsis kamchatica        species
13        59689                                                 Arabidopsis lyrata        species
14        45251                                               Arabidopsis neglecta        species
15        45249                                                Arabidopsis suecica        species
16        38785                                                Arabidopsis arenosa        species
17         3702                                               Arabidopsis thaliana        species
retrieve children nodes from itis
# retrieve children taxa of the query organism stored in the correspondning database
taxonomy( organism = "Arabidopsis", 
          db       = "itis", 
          output   = "children" )
   parentname parenttsn rankname             taxonname    tsn
1 Arabidopsis     23040  Species  Arabidopsis thaliana  23041
2 Arabidopsis     23040  Species Arabidopsis arenicola 823113
3 Arabidopsis     23040  Species   Arabidopsis arenosa 823130
4 Arabidopsis     23040  Species    Arabidopsis lyrata 823171

These results allow us to choose subject organisms for Divergence Stratigraphy.