The taxonomy() function formerly implemented in myTAI relies on the taxize package. Its taxonomic information retrieval was customized for the myTAI standard and for organism-specific queries.

Although taxonomy() has been deprecated since taxize was pulled from CRAN, users can still follow the taxonomy pipeline by installing the taxize package and copying the old taxonomy() function.

# install taxize from CRAN
install.packages("taxize")

# if taxize is not available on CRAN, install it from GitHub
install.packages("remotes")
remotes::install_github("ropensci/taxize")
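
Before copying the function, it can help to confirm that the installation succeeded. The following is a minimal sketch using base R only; the variable name taxize_available is introduced here for illustration.

```r
# Check whether the taxize namespace can be loaded, without attaching it.
taxize_available <- requireNamespace("taxize", quietly = TRUE)

if (!taxize_available) {
  message("taxize is not installed; install it from CRAN or GitHub first.")
}
```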

Copy the taxonomy function:

#' @title Retrieving Taxonomic Information of a Query Organism
#' @description This function takes the scientific name of a query organism
#' and returns selected output formats of taxonomic information for the corresponding organism.
#' @param organism a character string specifying the scientific name of a query organism.
#' @param db a character string specifying the database to query, e.g. \code{db} = \code{"itis"} or \code{"ncbi"}.
#' @param output a character string specifying the taxonomic information that shall be returned. 
#' Implemented are: \code{output} = \code{"classification"}, \code{"taxid"}, or \code{"children"}.
#' @details This function is based on the powerful package \pkg{taxize} and implements
#' the customized retrieval of taxonomic information for a query organism. 
#' 
#' The following data bases can be selected to retrieve taxonomic information:
#' 
#' \itemize{
#' \item \code{db = "itis"} : Integrated Taxonomic Information System
#' \item \code{db = "ncbi"} : National Center for Biotechnology Information
#' }
#' 
#' 
#' 
#' @author Hajk-Georg Drost
#' @examples
#' \dontrun{
#' # retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
#' # from NCBI Taxonomy
#' taxonomy("Arabidopsis thaliana",db = "ncbi")
#' 
#' # the same can be applied to database : "itis"
#'  taxonomy("Arabidopsis thaliana",db = "itis")
#' 
#' # retrieving the taxonomic hierarchy of the genus "Arabidopsis"
#'  taxonomy("Arabidopsis",db = "ncbi") # analogous : db = "itis"
#' 
#' # retrieving the taxonomy id of the query organism from the corresponding database
#' # (works for both databases : "ncbi" and "itis")
#'  taxonomy("Arabidopsis thaliana",db = "ncbi", output = "taxid")
#'  taxonomy("Arabidopsis thaliana",db = "itis", output = "taxid")
#' 
#' 
#' # retrieve children taxa of the query organism stored in the corresponding database
#'  taxonomy("Arabidopsis",db = "ncbi", output = "children")
#' 
#' # the same can be applied to databases : "ncbi" and "itis"
#'  taxonomy("Arabidopsis thaliana",db = "ncbi", output = "children")
#'  taxonomy("Arabidopsis thaliana",db = "itis", output = "children")
#'  
#' }
#' @references
#' 
#' Scott Chamberlain and Eduard Szocs (2013). taxize - taxonomic search and retrieval in R. F1000Research,
#' 2:191. URL: http://f1000research.com/articles/2-191/v2.
#' 
#' Scott Chamberlain, Eduard Szocs, Carl Boettiger, Karthik Ram, Ignasi Bartomeus, and John Baumgartner
#' (2014) taxize: Taxonomic information from around the web. R package version 0.3.0.
#' https://github.com/ropensci/taxize
#' @export

taxonomy <- function(organism, db = "ncbi", output = "classification"){
        
        if (!is.element(output,c("classification","taxid","children")))
                stop ("The output '",output,"' is not supported by this function.")
        
        if (!is.element(db,c("ncbi","itis")))
                stop ("Database '",db,"' is not supported by this function.")
        
        name <- id <- NULL

        if (db == "ncbi")
                tax_hierarchy <- as.data.frame(taxize::classification(taxize::get_uid(organism), db = "ncbi")[[1]])
        
        else if (db == "itis")    
                tax_hierarchy <- as.data.frame(taxize::classification(taxize::get_tsn(organism), db = "itis")[[1]])
        
        # tryCatch({colnames(tax_hierarchy) <- c("name","rank","id")},stop("The connection to ",db," did not work properly. Please check your internet connection or maybe the API did change.", call. = FALSE))
        
        if(output == "classification"){
                
                return(tax_hierarchy)
        }
        
        if(output == "taxid"){
                
                        return(dplyr::select(dplyr::filter(tax_hierarchy, name == organism),id))
        }
        
        if(output == "children"){
                
                return(as.data.frame(taxize::children(organism, db = db)[[1]]))
        } 
}
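
Because taxonomy() depends on remote services, a lookup can fail when the connection or the API is unavailable. The following is a minimal sketch of a wrapper that turns such an error into a warning and a NULL result. The name safe_taxonomy() and its fetch argument are hypothetical and are not part of myTAI or taxize.

```r
# Hypothetical helper (not part of myTAI or taxize): call a lookup function
# and return NULL with a warning instead of stopping on an error.
# `fetch` defaults to the taxonomy() function defined above.
safe_taxonomy <- function(organism, db = "ncbi", output = "classification",
                          fetch = taxonomy) {
  tryCatch(
    fetch(organism, db = db, output = output),
    error = function(e) {
      warning("Lookup of '", organism, "' in '", db, "' failed: ",
              conditionMessage(e), call. = FALSE)
      NULL
    }
  )
}

# Usage (requires taxize and an internet connection):
# safe_taxonomy("Arabidopsis thaliana", db = "ncbi")
```

Passing the lookup function as an argument keeps the wrapper independent of taxize, so it can be tried out with a stand-in function when no connection is available.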

The taxonomy() function can be used to classify genomes into Phylostrata according to their phylogenetic classification (Phylostratigraphy), or to retrieve species-specific taxonomic information when performing Divergence Stratigraphy (see Introduction for details).

For larger taxonomy queries it may be useful to create an NCBI Account and set up an ENTREZ API KEY.

# install.packages(c("taxize", "usethis"))
taxize::use_entrez()
# Create your key from your (brand-new) NCBI account.
# After generating your key, set it as ENTREZ_KEY in .Renviron:
# ENTREZ_KEY='youractualkeynotthisstring'
# To open .Renviron for editing, use usethis::edit_r_environ()
usethis::edit_r_environ()
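
After editing .Renviron and restarting R, it can be worth checking that the key is visible to the session. The following is a minimal sketch using base R only; the helper name has_entrez_key() is hypothetical.

```r
# Hypothetical helper: TRUE if the ENTREZ_KEY environment variable is set
# to a non-empty value in the current R session.
has_entrez_key <- function() {
  nzchar(Sys.getenv("ENTREZ_KEY"))
}

if (!has_entrez_key()) {
  message("ENTREZ_KEY is not set; NCBI queries will run without an API key.")
}
```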

Taxonomic Information Retrieval

The taxonomy() function retrieves taxonomic information for a query organism.

retrieve taxonomy hierarchy

In the following example we will obtain the taxonomic hierarchy of Arabidopsis thaliana from NCBI Taxonomy.

# retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
# from NCBI Taxonomy
taxonomy( organism = "Arabidopsis thaliana", 
          db       = "ncbi",
          output   = "classification" )
## ══  1 queries  ═══════════════
## ✔  Found:  Arabidopsis+thaliana
## ══  Results  ═════════════════
## 
## • Total: 1 
## • Found: 1 
## • Not Found: 0
##                    name          rank      id
## 1    cellular organisms cellular root  131567
## 2             Eukaryota        domain    2759
## 3         Viridiplantae       kingdom   33090
## 4          Streptophyta        phylum   35493
## 5        Streptophytina     subphylum  131221
## 6           Embryophyta         clade    3193
## 7          Tracheophyta         clade   58023
## 8         Euphyllophyta         clade   78536
## 9         Spermatophyta         clade   58024
## 10        Magnoliopsida         class    3398
## 11      Mesangiospermae         clade 1437183
## 12       eudicotyledons         clade   71240
## 13           Gunneridae         clade   91827
## 14         Pentapetalae         clade 1437201
## 15               rosids         clade   71275
## 16              malvids         clade   91836
## 17          Brassicales         order    3699
## 18         Brassicaceae        family    3700
## 19           Camelineae         tribe  980083
## 20          Arabidopsis         genus    3701
## 21 Arabidopsis thaliana       species    3702

The organism argument takes the scientific name of a query organism. The db argument specifies the database from which the corresponding taxonomic information is retrieved: ncbi (NCBI Taxonomy) or itis (Integrated Taxonomic Information System). The output argument specifies the type of taxonomic information returned for the query organism: classification, taxid, or children.

The output of classification is a data.frame storing the taxonomic hierarchy of Arabidopsis thaliana, from cellular organisms down to Arabidopsis thaliana itself. The first column stores the taxon name, the second column the taxonomic rank, and the third column the NCBI Taxonomy id of the corresponding taxon.
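
Since the result is an ordinary data.frame, individual ranks can be pulled out with standard subsetting. The sketch below rebuilds a few rows of the NCBI output by hand so that it runs without a network connection. Storing the ids as character strings is an assumption made for illustration.

```r
# A few rows copied from the NCBI output above; in practice this would be
# the data.frame returned by taxonomy(..., output = "classification").
# Assumption: ids are stored as character strings.
tax_hierarchy <- data.frame(
  name = c("Brassicales", "Brassicaceae", "Camelineae",
           "Arabidopsis", "Arabidopsis thaliana"),
  rank = c("order", "family", "tribe", "genus", "species"),
  id   = c("3699", "3700", "980083", "3701", "3702"),
  stringsAsFactors = FALSE
)

# Name of the family the query organism belongs to
family_name <- tax_hierarchy$name[tax_hierarchy$rank == "family"]

# Taxonomy id of the query organism (last row of the hierarchy)
species_id <- tax_hierarchy$id[nrow(tax_hierarchy)]
```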

Analogous classification information can be obtained from different databases.

# retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
# from the Integrated Taxonomic Information System
taxonomy( organism = "Arabidopsis thaliana", 
          db       = "itis",
          output   = "classification" )
## ══  1 queries  ═══════════════
## ✔  Found:  Arabidopsis thaliana
## ══  Results  ═════════════════
## 
## • Total: 1 
## • Found: 1 
## • Not Found: 0
##                    name          rank     id
## 1               Plantae       kingdom 202422
## 2         Viridiplantae    subkingdom 954898
## 3          Streptophyta  infrakingdom 846494
## 4           Embryophyta superdivision 954900
## 5          Tracheophyta      division 846496
## 6       Spermatophytina   subdivision 846504
## 7         Magnoliopsida         class  18063
## 8               Rosanae    superorder 846548
## 9           Brassicales         order 822943
## 10         Brassicaceae        family  22669
## 11          Arabidopsis         genus  23040
## 12 Arabidopsis thaliana       species  23041
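
The two databases use different ids and partly different intermediate taxa, so it can be useful to line the hierarchies up by name. The following sketch does this with base R merge() on hand-copied rows from the two outputs above. The object names ncbi, itis and shared are introduced here for illustration, and the ids are stored as character strings by assumption.

```r
# Rows copied from the NCBI and ITIS outputs above (class level and below).
ncbi <- data.frame(
  name = c("Magnoliopsida", "Brassicales", "Brassicaceae",
           "Camelineae", "Arabidopsis", "Arabidopsis thaliana"),
  rank = c("class", "order", "family", "tribe", "genus", "species"),
  id   = c("3398", "3699", "3700", "980083", "3701", "3702"),
  stringsAsFactors = FALSE
)
itis <- data.frame(
  name = c("Magnoliopsida", "Rosanae", "Brassicales",
           "Brassicaceae", "Arabidopsis", "Arabidopsis thaliana"),
  rank = c("class", "superorder", "order", "family", "genus", "species"),
  id   = c("18063", "846548", "822943", "22669", "23040", "23041"),
  stringsAsFactors = FALSE
)

# Taxa present in both hierarchies, with the id from each database
shared <- merge(ncbi, itis, by = "name", suffixes = c("_ncbi", "_itis"))

# Taxa that appear in only one of the two hierarchies
only_ncbi <- setdiff(ncbi$name, itis$name)
only_itis <- setdiff(itis$name, ncbi$name)
```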

The output argument allows you to directly access taxonomy ids for a query organism or species.

retrieve taxonomy ID from ncbi
# retrieving the taxonomy id of the query organism from NCBI Taxonomy
taxonomy( organism = "Arabidopsis thaliana", 
          db       = "ncbi", 
          output   = "taxid" )
## ══  1 queries  ═══════════════
## ✔  Found:  Arabidopsis+thaliana
## ══  Results  ═════════════════
## 
## • Total: 1 
## • Found: 1 
## • Not Found: 0
##     id
## 1 3702
retrieve taxonomy ID from itis
# retrieving the taxonomy id of the query organism from the Integrated Taxonomic Information System
taxonomy( organism = "Arabidopsis", 
          db       = "itis", 
          output   = "taxid" )
## ══  1 queries  ═══════════════
## ✔  Found:  Arabidopsis
## ══  Results  ═════════════════
## 
## • Total: 1 
## • Found: 1 
## • Not Found: 0
##      id
## 1 23040

So far, the following databases can be accessed to retrieve taxonomic information:

  • db = "itis" : Integrated Taxonomic Information System
  • db = "ncbi" : National Center for Biotechnology Information

How does the taxonomy(db = "ncbi") output differ from GenEra?

The taxonomic classifications returned by taxonomy(..., db = "ncbi") should match those in the GenEra output, since GenEra uses the NCBI taxdump as input. Note, however, that recent updates to NCBI Taxonomy mean that the highest-order ranks (cellular root, domain, kingdom, etc.) may differ.

Retrieve Children Nodes

Another output supported by taxonomy() is children, which returns the immediate children taxa of a query organism. This feature is useful for determining species relationships when quantifying recent evolutionary conservation with Divergence Stratigraphy.

retrieve children nodes from ncbi
# retrieve children taxa of the query organism stored in the corresponding database
taxonomy( organism = "Arabidopsis", 
          db       = "ncbi", 
          output   = "children" )
## ══  1 queries  ═══════════════
## ✔  Found:  Arabidopsis
## ══  Results  ═════════════════
## 
## • Total: 1 
## • Found: 1 
## • Not Found: 0
##    childtaxa_id
## 1       2608267
## 2       2486701
## 3       1837063
## 4       1547872
## 5       1328956
## 6       1240361
## 7        869751
## 8        869750
## 9        864766
## 10       412662
## 11       378006
## 12       347883
## 13       302551
## 14        97980
## 15        97979
## 16        81970
## 17        59690
## 18        59689
## 19        45251
## 20        45249
## 21        38785
## 22         3702
##                                                        childtaxa_name
## 1                            Arabidopsis arenosa x Arabidopsis lyrata
## 2                            Arabidopsis lyrata x Arabidopsis halleri
## 3                          Arabidopsis thaliana x Arabidopsis halleri
## 4                                               Arabidopsis umezawana
## 5  (Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica
## 6                          Arabidopsis thaliana x Arabidopsis arenosa
## 7         Arabidopsis thaliana x Arabidopsis halleri subsp. gemmifera
## 8                           Arabidopsis thaliana x Arabidopsis lyrata
## 9                                         Arabidopsis septentrionalis
## 10                                            Arabidopsis pedemontana
## 11                         Arabidopsis arenosa x Arabidopsis thaliana
## 12                                              Arabidopsis arenicola
## 13                                              Arabidopsis petrogena
## 14                                               Arabidopsis croatica
## 15                                            Arabidopsis cebennensis
## 16                                                Arabidopsis halleri
## 17                                             Arabidopsis kamchatica
## 18                                                 Arabidopsis lyrata
## 19                                               Arabidopsis neglecta
## 20                                                Arabidopsis suecica
## 21                                                Arabidopsis arenosa
## 22                                               Arabidopsis thaliana
##    childtaxa_rank
## 1         species
## 2         species
## 3         species
## 4         species
## 5         species
## 6         species
## 7         species
## 8         species
## 9         species
## 10        species
## 11        species
## 12        species
## 13        species
## 14        species
## 15        species
## 16        species
## 17        species
## 18        species
## 19        species
## 20        species
## 21        species
## 22        species
retrieve children nodes from itis
# retrieve children taxa of the query organism stored in the corresponding database
taxonomy( organism = "Arabidopsis", 
          db       = "itis", 
          output   = "children" )
## ══  1 queries  ═══════════════
## ✔  Found:  Arabidopsis
## ══  Results  ═════════════════
## 
## • Total: 1 
## • Found: 1 
## • Not Found: 0
## ══  1 queries  ═══════════════
## ✔  Found:  Arabidopsis
## ══  Results  ═════════════════
## 
## • Total: 1 
## • Found: 1 
## • Not Found: 0
##    parentname parenttsn rankname             taxonname    tsn
## 1 Arabidopsis     23040  Species Arabidopsis arenicola 823113
## 2 Arabidopsis     23040  Species   Arabidopsis arenosa 823130
## 3 Arabidopsis     23040  Species    Arabidopsis lyrata 823171
## 4 Arabidopsis     23040  Species  Arabidopsis thaliana  23041

These results allow us to choose subject organisms for Divergence Stratigraphy.
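
The NCBI children list also contains hybrid taxa, which are usually not wanted as subject organisms. The following is a minimal sketch of one way to narrow the list down, using a few hand-copied rows from the NCBI output above. It assumes that hybrids can be recognised by " x " in their name, and the object names children and candidates are introduced here for illustration.

```r
# A few rows copied from the NCBI children output above; in practice this
# would be the data.frame returned by taxonomy(..., output = "children").
children <- data.frame(
  childtaxa_id   = c("1837063", "1240361", "81970", "59689", "38785", "3702"),
  childtaxa_name = c("Arabidopsis thaliana x Arabidopsis halleri",
                     "Arabidopsis thaliana x Arabidopsis arenosa",
                     "Arabidopsis halleri",
                     "Arabidopsis lyrata",
                     "Arabidopsis arenosa",
                     "Arabidopsis thaliana"),
  childtaxa_rank = "species",
  stringsAsFactors = FALSE
)

# Assumption: hybrid taxa are marked by " x " in their name.
is_hybrid <- grepl(" x ", children$childtaxa_name, fixed = TRUE)

# Keep non-hybrid species other than the query organism itself.
query      <- "Arabidopsis thaliana"
candidates <- children[!is_hybrid & children$childtaxa_name != query, ]
```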