Genome-wide pairwise inference of synonymous versus non-synonymous substitution rates (dnds)

This function takes the CDS files of two organisms of interest (query and subject) and computes the dNdS estimation values for orthologous gene pairs between these organisms.

dnds(
  query,
  subject,
  format = "fasta",
  ortho_detection = "DIAMOND-RBH",
  delete_corrupt_cds = FALSE,
  store_locally = FALSE,
  sensitivity_mode = "ultra-sensitive",
  out_format = "csv",
  evalue = "1E-5",
  max_target_seqs = 5000,
  cores = 1,
  hard_mask = TRUE,
  diamond_exec_path = NULL,
  add_makedb_options = NULL,
  add_diamond_options = NULL,
  aa_aln_type = "pairwise",
  aa_aln_tool = "NW",
  codon_aln_tool = "pal2nal",
  dnds_estimation = "Li",
  output_path = NULL,
  quiet = TRUE,
  clean_folders = FALSE,
  print_citation = TRUE
)

Arguments

query	a character string specifying the path to the CDS file of interest (query organism).
subject	a character string specifying the path to the CDS file of interest (subject organism).
format	a character string specifying the file format of the sequence file, e.g. `format = "fasta"`, `format = "gbk"`. See `import_proteome` for more details.
ortho_detection	a character string specifying the orthology inference method that shall be performed to detect orthologous genes. Options are: `ortho_detection ="DIAMOND-BH"`: DIAMOND unidirectional best hit. `ortho_detection = "DIAMOND-RBH"`: DIAMOND reciprocal/bidirectional best hit (Default).
delete_corrupt_cds	a logical value indicating whether sequences with corrupt base triplets should be removed from the input `file`. This is the case when the length of coding sequences cannot be divided by 3 and thus the coding sequence contains at least one corrupt base triplet.
store_locally	a logical value indicating whether or not alignment files shall be stored locally rather than in `tempdir()`.
sensitivity_mode	specify the level of alignment sensitivity. The higher the sensitivity level, the more deep homologs can be found, but at the cost of reduced computational speed. `sensitivity_mode = "fast"` : fastest alignment mode, but least sensitive (default). Designed for finding hits of >70 `sensitivity_mode = "mid-sensitive"` : fast alignments between the `fast` mode and the sensitive mode in sensitivity. `sensitivity_mode = "sensitive"` : fast alignments, but full sensitivity for hits >40 `sensitivity_mode = "more-sensitive"` : more sensitive than the `sensitive` mode. `sensitivity_mode = "very-sensitive"` : sensitive alignment mode. `sensitivity_mode = "ultra-sensitive"` : most sensitive alignment mode (sensitivity as high as BLASTP).
out_format	a character string specifying the format of the file in which the DIAMOND results shall be stored. Available options are: `out_format = "pair"` : Pairwise `out_format = "xml"` : XML `out_format = "csv"` : Comma-separated file
evalue	Expectation value (E) threshold for saving hits (default: `evalue = 0.001`).
max_target_seqs	maximum number of aligned sequences that shall be retained. Please be aware that `max_target_seqs` selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
cores	number of cores for parallel DIAMOND searches.
hard_mask	shall low complexity regions be hard masked with TANTAN? Default is `db_hard_mask = TRUE`.
diamond_exec_path	a path to the DIAMOND executable or `conda/miniconda` folder.
add_makedb_options	a character string specifying additional makedb options that shall be passed on to the diamond makedb command line call, e.g. `add_make_options = "--taxonnames"` (Default is `add_diamond_options = NULL`).
add_diamond_options	a character string specifying additional diamond options that shall be passed on to the diamond command line call, e.g. `add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits"` (Default is `add_diamond_options = NULL`).
aa_aln_type	a character string specifying the amino acid alignement type: `aa_aln_type` = "pairwise" (default)
aa_aln_tool	a character string specifying the program that should be used e.g. "NW" (Needleman-Wunsch Global Alignments).
codon_aln_tool	a character string specifying the codon alignment tool that shall be used. Default is `codon_aln_tool = "pal2nal"`. Right now only "pal2nal" can be selected as codon alignment tool.
dnds_estimation	the dNdS estimation method that shall be used. Options are: `dnds_estimation = "Li"` (Default): Li's method
output_path	a path to the location were the DIAMOND best hit output shall be stored. E.g. `output_path` = `getwd()` to store it in the current working directory, or `output_path` = `file.path("put", "your", "path", "here")`.
quiet	a logical value specifying whether the output of the corresponding alignment tool shall be printed out to the console. Default is `quiet` = `FALSE`.
clean_folders	a logical value specifying whether all internal folders storing the output of used programs shall be removed. Default is `clean_folders` = `FALSE`.
print_citation	a logical value indicating whether or not the citation message shall be printed.

Value

A tibble storing the dnds values of orthologous genes.

Details

The dN/dS ratio quantifies the mode and strength of selection acting on a pair of orthologous genes. This selection pressure can be quantified by comparing synonymous substitution rates (dS) that are assumed to be neutral with nonsynonymous substitution rates (dN), which are exposed to selection as they change the amino acid composition of a protein (Mugal et al., 2013 http://mbe.oxfordjournals.org/content/31/1/212).

The homologr package provides the dnds function to perform dNdS estimation on pairs of orthologous genes. This function takes the CDS files of two organisms of interest (query and subject) and computes the dN/dS estimation values for orthologous gene pairs between these organisms.

The following pipeline resembles the dN/dS estimation process:

1) Orthology Inference: e.g. DIAMOND reciprocal best hit (DIAMOND-RBH)
2) Pairwise sequence alignment: e.g. global pairwise alignments with Needleman-Wunsch
3) Codon Alignment: e.g. pal2nal program
4) dNdS estimation: e.g. Li's method

Note: it is assumed that when using dnds() all corresponding multiple sequence alignment programs you want to use are already installed on your machine and are executable via either the default execution PATH or you specifically define the location of the executable file via the aa_aln_path or blast_path argument that can be passed to dnds().

The dnds() function can be used choosing the following options:

ortho_detection :
- "DIAMOND-RBH" (DIAMOND best reciprocal hit)
- "DIAMOND-BH" (DIAMOND best hit)
aa_aln_type :
- "pairwise"
aa_aln_tool :
"NW" (in case aa_aln_type = "pairwise")
codon_aln_tool :
- "pal2nal"
dnds_estimation :
- "Li" : Li's method

References

seqinr: http://seqinr.r-forge.r-project.org/

Li, W.-H., Wu, C.-I., Luo, C.-C. (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol, 2:150-174

Li, W.-H. (1993) Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol., 36:96-99.

Pamilo, P., Bianchi, N.O. (1993) Evolution of the Zfx and Zfy genes: Rates and interdependence between genes. Mol. Biol. Evol, 10:271-281

Author

Hajk-Georg Drost

Examples

if (FALSE) {

# get a dnds table using:
# 1) reciprocal best hit for orthology inference (DIAMOND-RBH)
# 2) Needleman-Wunsch for pairwise amino acid alignments
# 3) pal2nal for codon alignments
# 4) Li for dNdS estimation
# 5) single core processing 'cores = 1'
result <- dnds(quer      = system.file('seqs/ortho_thal_cds.fasta', package = 'homologr'),
     subject    = system.file('seqs/ortho_lyra_cds.fasta', package = 'homologr'),
     ortho_detection = "DIAMOND-RBH",
     aa_aln_type     = "pairwise",
     aa_aln_tool     = "NW",
     codon_aln_tool  = "pal2nal",
     dnds_estimation = "Li",
     cores      = 1 )
# look at output
result

# specify the path to the DIAMOND executable if it is not in the system path
# using the 'diamond_exec_path' argument, e.g. diamond_exec_path = "/opt/miniconda3/bin/"
result <- dnds(query      = system.file('seqs/ortho_thal_cds.fasta', package = 'homologr'),
     subject    = system.file('seqs/ortho_lyra_cds.fasta', package = 'homologr'),
     diamond_exec_path = "/opt/miniconda3/bin/",
     ortho_detection = "DIAMOND-RBH",
     aa_aln_type     = "pairwise",
     aa_aln_tool     = "NW",
     codon_aln_tool  = "pal2nal",
     dnds_estimation = "Li",
     cores      = 1 )
# look at output
result
}