vignettes/dNdS_estimation.Rmd
dNdS_estimation.Rmd
The dN/dS ratio quantifies the mode and strength of selection acting on a pair of orthologous genes. This selection pressure can be quantified by comparing synonymous substitution rates (dS) that are assumed to be neutral with nonsynonymous substitution rates (dN), which are exposed to selection as they change the amino acid composition of a protein (Mugal et al., 2013).
The orthologr
package provides a function named
dNdS()
to perform dNdS estimation on pairs of orthologous
genes. The dNdS()
function takes the CDS files of two
organisms of interest (query_file
and
subject_file
) and computes the dNdS estimation values for
orthologous gene pairs between these organisms.
Note: the following dNdS estimation methods are based on KaKs_Calculator 2:
“NG”: Nei, M. and Gojobori, T. (1986)
“LWL”: Li, W.H., et al. (1985)
“LPB”: Li, W.H. (1993) and Pamilo, P. and Bianchi, N.O. (1993)
“MLWL” (Modified LWL), MLPB (Modified LPB): Tzeng, Y.H., et al. (2004)
“YN”: Yang, Z. and Nielsen, R. (2000)
“MYN” (Modified YN): Zhang, Z., et al. (2006)
“GMYN”: Wang, D.P., et al. Biology Direct. (2009)
“GY”: Goldman, N. and Yang, Z. (1994)
“MS”: (Model Selection): based on a set of candidate models, Posada, D. (2003)
“MA” (Model Averaging): based on a set of candidate models, Posada, D. (2003)
“ALL”: All models toghether
It is assumed that when you choose one of these dNdS estimation
methods you have KaKs_Calculator
2 installed on your machine and it can be executed from the default
execution PATH
.
The following pipeline resembles an example dNdS estimation procedure:
Orthology Inference: e.g. BLAST reciprocal best hit (RBH)
Pairwise sequence alignment: e.g. clustalw for pairwise amino acid sequence alignments
Codon Alignment: e.g. pal2nal program
dNdS estimation: e.g. Yang, Z. and Nielsen, R. (2000) (YN)
Note: it is assumed that when using
dNdS()
all corresponding multiple sequence alignment
programs you want to use are already installed on your machine and are
executable via either the default execution PATH or you specifically
define the location of the executable program via the
aa_aln_path
or blast_path
argument that can be
passed to dNdS()
. See the Sequence
Alignments vignette for details.
The following example shall illustrate a dNdS estimation process.
library(orthologr)
# get a dNdS table using:
# 1) reciprocal best hit for orthology inference (RBH)
# 2) Needleman-Wunsch for pairwise amino acid alignments
# 3) pal2nal for codon alignments
# 4) Comeron for dNdS estimation
# 5) single core processing 'comp_cores = 1'
dNdS(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
ortho_detection = "RBH",
aa_aln_type = "pairwise",
aa_aln_tool = "NW",
codon_aln_tool = "pal2nal",
dnds_est.method = "Comeron",
comp_cores = 1)
A tibble: 20 x 15
query_id subject_id dN dS dNdS perc_identity alig_length
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AT1G010... 333554|PA... 0.106 0.254 0.420 74.0 469
2 AT1G010... 470181|PA... 0.0402 0.104 0.388 91.1 246
3 AT1G010... 470180|PA... 0.0150 0.126 0.118 95.5 359
4 AT1G010... 333551|PA... 0.0135 0.116 0.116 92.0 1970
5 AT1G010... 909874|PA... 0 0.175 0 100 213
6 AT1G010... 470177|PA... 0.0449 0.113 0.397 89.5 648
7 AT1G010... 918864|PA... 0.0183 0.106 0.173 95.1 366
8 AT1G010... 909871|PA... 0.0340 0.106 0.322 90.3 300
9 AT1G010... 470171|PA... 0.00910 0.218 0.0417 96.8 434
10 AT1G011... 333544|PA... 0.0325 0.122 0.266 93.6 528
11 AT1G011... 918858|PA... 0.00307 0.133 0.0232 99.2 529
12 AT1G011... 470161|PA... 0.00567 0.131 0.0432 98.5 453
13 AT1G011... 918855|PA... 0.13 0.203 0.641 72.6 285
14 AT1G011... 918854|PA... 0.105 0.280 0.373 84.9 179
15 AT1G011... 311317|PA... 0 0.306 0 85.6 97
16 AT1G011... 909860|PA... 0.0297 0.176 0.168 92.6 310
17 AT1G011... 311315|PA... 0.0287 0.162 0.177 94.2 533
18 AT1G012... 470156|PA... 0.0190 0.168 0.114 95.8 238
19 AT1G012... 311313|PA... 0.0207 0.154 0.134 95.3 107
20 AT1G012... 470155|PA... 0.0157 0.153 0.102 96.7 1056
... with 8 more variables: mismatches <dbl>, gap_openings <dbl>,
q_start <dbl>, q_end <dbl>, s_start <dbl>, s_end <dbl>, evalue <dbl>,
bit_score <dbl>
Some outputs include NA
values. To filter for
NA
values or a specific dnds.threshold
, you
can use the filter_dNdS()
function. The
filter_dNdS()
function takes the output data.table returned
by dNdS()
and filters the output by the following
criteria:
all dN values having an NA value are omitted
all dS values having an NA value are omitted
all dNdS values >= the specified dnds.threshold
are omitted
library(orthologr)
# get dNdS estimated for orthologous genes between A. thaliana and A. lyrata
Ath_Aly_dnds <- dNdS(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
ortho_detection = "RBH",
aa_aln_type = "pairwise",
aa_aln_tool = "NW",
codon_aln_tool = "pal2nal",
dnds_est.method = "Comeron",
comp_cores = 1)
# filter for:
# 1) all dN values having an NA value are omitted
# 2) all dS values having an NA value are omitted
# 3) all dNdS values >= 2 are omitted
filter_dNdS(Ath_Aly_dnds, dnds.threshold = 2)
Filtering out NA values in dN or dS and all values with dNdS > 2 ...
Initial input contains 20 rows.
Filtering done. New output table contains 20 rows.
A tibble: 20 x 15
query_id subject_id dN dS dNdS perc_identity alig_length
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AT1G010... 333554|PA... 0.106 0.254 0.420 74.0 469
2 AT1G010... 470181|PA... 0.0402 0.104 0.388 91.1 246
3 AT1G010... 470180|PA... 0.0150 0.126 0.118 95.5 359
4 AT1G010... 333551|PA... 0.0135 0.116 0.116 92.0 1970
5 AT1G010... 909874|PA... 0 0.175 0 100 213
6 AT1G010... 470177|PA... 0.0449 0.113 0.397 89.5 648
7 AT1G010... 918864|PA... 0.0183 0.106 0.173 95.1 366
8 AT1G010... 909871|PA... 0.0340 0.106 0.322 90.3 300
9 AT1G010... 470171|PA... 0.00910 0.218 0.0417 96.8 434
10 AT1G011... 333544|PA... 0.0325 0.122 0.266 93.6 528
11 AT1G011... 918858|PA... 0.00307 0.133 0.0232 99.2 529
12 AT1G011... 470161|PA... 0.00567 0.131 0.0432 98.5 453
13 AT1G011... 918855|PA... 0.13 0.203 0.641 72.6 285
14 AT1G011... 918854|PA... 0.105 0.280 0.373 84.9 179
15 AT1G011... 311317|PA... 0 0.306 0 85.6 97
16 AT1G011... 909860|PA... 0.0297 0.176 0.168 92.6 310
17 AT1G011... 311315|PA... 0.0287 0.162 0.177 94.2 533
18 AT1G012... 470156|PA... 0.0190 0.168 0.114 95.8 238
19 AT1G012... 311313|PA... 0.0207 0.154 0.134 95.3 107
20 AT1G012... 470155|PA... 0.0157 0.153 0.102 96.7 1056
... with 8 more variables: mismatches <dbl>, gap_openings <dbl>,
q_start <dbl>, q_end <dbl>, s_start <dbl>, s_end <dbl>, evalue <dbl>,
bit_score <dbl>
The dNdS()
function can be used choosing the following
options:
ortho_detection
: RBH
(BLAST best
reciprocal hit), BH
(BLAST best reciprocal hit),
PO
(ProteinOrtho), and OrthoMCL
(OrthoMCL)aa_aln_type
: multiple
or
pairwise
aa_aln_tool
: clustalw
,
t_coffee
, muscle
, clustalo
,
mafft
, and NW
(in case
aa_aln_type = "pairwise"
)codon_aln_tool
: pal2nal
dnds_est.method
: Li
,
Comeron
, NG
, LWL
,
LPB
, MLWL
, YN
, and
MYN
Please see ?dNdS
for details.
In case your BLAST program, or multiple alignment program can not be
executed from the default execution PATH
you can specify
the aa_aln_path
or blast_path
arguments.
library(orthologr)
# using the `aa_aln_path` or `blast_path` arguments
dNdS( query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
ortho_detection = "RBH",
blast_path = "here/path/to/blastp",
aa_aln_type = "multiple",
aa_aln_tool = "clustalw",
aa_aln_path = "here/path/to/clustalw",
codon_aln_tool = "pal2nal",
dnds_est.method = "Comeron",
comp_cores = 1,
clean_folders = TRUE)
Additional arguments can be passed to dNdS()
. This
allows you to use more advanced options of several interface
programs.
To pass additional parameters to the interface programs, you can use
the blast_params
and aa_aln_params
arguments.
The aa_aln_params
argument assumes that when you chose
e.g. aa_aln_tool = "mafft"
you will pass the corresponding
additional parameters in MAFFT notation.
library(orthologr)
# get dNdS estimated for orthologous genes between A. thaliana and A. lyrata
# using additional parameters:
# get a dNdS table using:
# 1) reciprocal best hit for orthology inference (RBH)
# 2) multiple amino acid alignments using MAFFT
# 3) pal2nal for codon alignments
# 4) Comeron (1995) for dNdS estimation
# 5) single core processing 'comp_cores = 1'
Ath_Aly_dnds <- dNdS( query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
ortho_detection = "RBH",
blast_params = "-matrix BLOSUM80",
aa_aln_type = "multiple",
aa_aln_tool = "mafft",
aa_aln_params = "--maxiterate 1 --clustalout",
dnds_est.method = "Comeron",
comp_cores = 1,
clean_folders = TRUE,
quiet = TRUE )
# filter for:
# 1) all dN values having an NA value are omitted
# 2) all dS values having an NA value are omitted
# 3) all dNdS values >= 0.1 are omitted
filter_dNdS(Ath_Aly_dnds, dnds.threshold = 0.1)
query_id subject_id dN dS dNdS
1 AT1G01050.1 909874|PACid:16 0.000000 0.1750 0.00000
2 AT1G01090.1 470171|PACid:16 0.009843 0.2150 0.04579
3 AT1G01120.1 918858|PACid:16 0.003072 0.1326 0.02317
4 AT1G01140.3 470161|PACid:16 0.005672 0.1312 0.04324
5 AT1G01170.2 311317|PACid:16 0.008750 0.2827 0.03095
6 AT1G01220.1 470155|PACid:16 0.015210 0.1533 0.09919
Here blast_params
and aa_aln_params
take an
character string specifying the parameters that shall be passed to BLAST
and MAFFT. The notation of these parameters must follow the command line
call of the stand alone versions of BLAST and MAFFT:
e.g. blast_params = "blast_params = -matrix BLOSUM80"
and
aa_aln_params = "--maxiterate 1 --clustalout"
.