The orthologr
package provides several interface
functions to perform BLAST searches.
First, users need to make sure that they have BLAST installed on their machine. Please follow these instructions to install BLAST on your manine.
The orthologr
package stores 20 example genes
(orthologs) between Arabidopsis thaliana and Arabidopsis
lyrata. The following example BLAST search shall illustrate a
simple search with standard parameters provided by the
blast()
function.
When running the subsequent functions please make sure you can call
BLAST+ from your console either in the standard PATH
or in
case you have BLAST+ installed in a separate folder, please specify the
path
argument that can be passed to
blast()
.
To check whether BLAST+ can be executed from the default
PATH
(usr/bin/local
on UNIX systems), you can
run:
# check if blast is installed and if yes, what version of blast
system("blastp -version")
This should return something like this:
blastp: 2.8.1+
Package: blast 2.8.1, build Nov 26 2018 12:45:20
If everything works properly, you can get started with your first BLAST+ search.
The orthologr
packages allows users to perform fast and
easy-to-use BLAST searches between fasta
files. These
searches can even be scaled towards genome and proteome comparisons.
The blast()
function provides the easiest way to perform
a BLAST search.
library(dplyr)
# performing a BLAST search using blastp (default)
hit_tbl <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'))
# look at results
glimpse(hit_tbl)
Variables:
$ query_id (chr) "AT1G01010.1", "AT1G01020.1", "AT1G01030.1",...
$ subject_id (chr) "333554|PACid:16033839", "470181|PACid:16064...
$ perc_identity (dbl) 73.99, 91.06, 95.54, 91.98, 100.00, 89.51, 9...
$ alig_length (dbl) 469, 246, 359, 1970, 213, 648, 366, 300, 434...
$ mismatches (dbl) 80, 22, 12, 85, 0, 58, 14, 22, 8, 34, 4, 6, ...
$ gap_openings (dbl) 8, 0, 2, 10, 0, 5, 2, 2, 3, 0, 0, 1, 3, 2, 1...
$ q_start (dbl) 1, 1, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 4, 2,...
$ q_end (dbl) 430, 246, 359, 1910, 213, 646, 366, 294, 429...
$ s_start (dbl) 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 16, 2, 4...
$ s_end (dbl) 466, 246, 355, 1963, 213, 640, 362, 299, 433...
$ evalue (dbl) 0e+00, 7e-166, 0e+00, 0e+00, 2e-160, 0e+00, ...
$ bit_score (dbl) 627, 454, 698, 3704, 437, 1037, 696, 491, 85...
As you can see, the hit table shows the output of the BLAST+ search.
The blast()
function runs blastp
as default
BLAST+ algorithm. Different BLAST+ algorithms can be selected by
specifying the blast_algorithm
argument,
e.g. blast_algorithm = "tblastn"
. See ?blast
for further details. The blast()
function returns the BLAST
arguments: query_id
, subject_id
,
perc_identity
, alig_length
,
mismatches
, gap_openings
,
q_start
, q_end
, s_start
,
s_end
, evalue
, and bit_score
.
Since blast()
stores the hit table returned by BLAST in
a data.table object, you can access each column, using the data.table
notation.
In case you need to specify the PATH
to BLAST+ please
use the path
argument:
# performing a BLAST search using blastp (default)
hit_tbl <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
path = "/path/to/blastp")
hit_tbl
# access columns: query_id, subject_id, evalue, and bit_score
dplyr::select(hit_tbl, query_id, subject_id, evalue, bit_score)
query_id subject_id evalue bit_score
1: AT1G01010.1 333554|PACid:16033839 0e+00 627
2: AT1G01020.1 470181|PACid:16064328 7e-166 454
3: AT1G01030.1 470180|PACid:16054974 0e+00 698
4: AT1G01040.1 333551|PACid:16057793 0e+00 3704
5: AT1G01050.1 909874|PACid:16064489 2e-160 437
6: AT1G01060.3 470177|PACid:16043374 0e+00 1037
7: AT1G01070.1 918864|PACid:16052578 0e+00 696
8: AT1G01080.1 909871|PACid:16053217 1e-178 491
9: AT1G01090.1 470171|PACid:16052860 0e+00 859
10: AT1G01110.2 333544|PACid:16034284 0e+00 972
11: AT1G01120.1 918858|PACid:16049140 0e+00 1092
12: AT1G01140.3 470161|PACid:16036015 0e+00 918
13: AT1G01150.1 918855|PACid:16037307 3e-150 421
14: AT1G01160.1 918854|PACid:16044153 1e-93 268
15: AT1G01170.2 311317|PACid:16052302 3e-54 158
16: AT1G01180.1 909860|PACid:16056125 0e+00 576
17: AT1G01190.1 311315|PACid:16059488 0e+00 1036
18: AT1G01200.1 470156|PACid:16041002 3e-172 470
19: AT1G01210.1 311313|PACid:16057125 7e-76 215
20: AT1G01220.1 470155|PACid:16047984 0e+00 2106
The blast()
function also allows you to pass additional
parameters to the BLAST+ search using the blast_params
argument. In the following example, a remote BLAST+ search is
performed.
hit_tbl <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
blast_params = "-qcov_hsp_perc 0.9")
glimpse(hit_tbl)
A tibble: 78 x 12
query_id subject_id perc_identity alig_length mismatches gap_openings
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AT1G010... 311334|PA... 43.6 165 81 4
2 AT1G010... 333554|PA... 74.0 469 80 8
3 AT1G010... 909883|PA... 34.5 354 162 12
4 AT1G010... 918785|PA... 33.8 160 80 6
5 AT1G010... 470181|PA... 91.1 246 22 0
6 AT1G010... 470180|PA... 95.5 359 12 2
7 AT1G010... 333551|PA... 92.0 1970 85 10
8 AT1G010... 909874|PA... 100 213 0 0
9 AT1G010... 311286|PA... 64.5 62 22 0
10 AT1G010... 470177|PA... 89.5 648 58 5
with 68 more rows, and 6 more variables: q_start <dbl>, q_end <dbl>,
s_start <dbl>, s_end <dbl>, evalue <dbl>, bit_score <dbl>
In all cases the default e-value
BLAST+ searches is
1E-5
and the default blast_algorithm
is
blastp
.
Since BLAST+ searches can be computationally expensive, it is
possible to specify the comp_cores
argument when working
with an multicore machine.
# BLAST computations using the comp_cores parameter: here with 2 cores
blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
comp_cores = 2)
The query_file
and subject_file
arguments
specify the path to the corresponding fasta files storing the
CDS
files, amino acid
files, or
genome
files of the query organism and subject organism of
interest. Make sure that when using CDS
files,
amino acid
files, or genome
files the
corresponding argument seq_type
must be adapted according
to the input data format.
Use :
CDS
files -> seq_type = "cds"
amino acid
files ->
seq_type = "protein"
genome
files -> seq_type = "dna"
The format
argument specifies the input file format,
e.g. “fasta” or “gbk”. The blast_algorithm
argument
specifies the BLAST program (algorithm) that shall be used to perform
BLAST searches, e.g. “blastp”,“blastn”,“tblastn”,etc. Again, the
eval
argument defines the default e-value that shall be
chosen as best hit threshold.
All blast
functions implemented in
orthologr
can easily be processed using the split-apply-combine
strategy to detect for example one-to-one
,
one-to-many
, and many-to-many
gene homology
relationships.
Here a simple example:
library(dplyr)
# perform a blastp search of 20 A. thaliana genes against 1000 A. lyrata genes
hit_tbl <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds_1000.fasta', package = 'orthologr'))
# determine 'one-to-many' and 'one-to-one' gene relationships
rel_hit_tbl <- summarize(group_by(hit_tbl, query_id), n_genes = n())
# look at results
rel_hit_tbl
A tibble: 20 x 2
query_id n_genes
<chr> <int>
1 AT1G01010.1 4
2 AT1G01020.1 1
3 AT1G01030.1 1
4 AT1G01040.1 1
5 AT1G01050.1 1
6 AT1G01060.3 2
7 AT1G01070.1 3
8 AT1G01080.1 5
9 AT1G01090.1 1
10 AT1G01110.2 1
11 AT1G01120.1 3
12 AT1G01140.3 36
13 AT1G01150.1 1
14 AT1G01160.1 1
15 AT1G01170.2 1
16 AT1G01180.1 1
17 AT1G01190.1 6
18 AT1G01200.1 7
19 AT1G01210.1 1
20 AT1G01220.1 1
Now you can sort genes into classes: one-to-one
and
one-to-many
.
# classify into 'one-to-one' relationships
one_to_one <- filter(rel_hit_tbl, n_genes == 1)
# classify into 'one-to-many' relationships
one_to_many <- filter(rel_hit_tbl, n_genes > 1)
# look at one_to_one
one_to_one
A tibble: 12 x 2
query_id n_genes
<chr> <int>
1 AT1G01020.1 1
2 AT1G01030.1 1
3 AT1G01040.1 1
4 AT1G01050.1 1
5 AT1G01090.1 1
6 AT1G01110.2 1
7 AT1G01150.1 1
8 AT1G01160.1 1
9 AT1G01170.2 1
10 AT1G01180.1 1
11 AT1G01210.1 1
12 AT1G01220.1 1
# look at one_to_many
one_to_many
A tibble: 8 x 2
query_id n_genes
<chr> <int>
1 AT1G01010.1 4
2 AT1G01060.3 2
3 AT1G01070.1 3
4 AT1G01080.1 5
5 AT1G01120.1 3
6 AT1G01140.3 36
7 AT1G01190.1 6
8 AT1G01200.1 7
Now we can treat classes: one_to_one
and
one_to_many
differently:
We can now retrieve the blast hit entries for the one-to-one hits by
joining the tables one_to_one
and hit_tbl
.
# look at the evalue, perc_identity, and alig_length of one_to_one genes
oo_genes <- dplyr::inner_join(one_to_one, hit_tbl , by = "query_id")
oo_genes
A tibble: 12 x 13
query_id n_genes subject_id perc_identity alig_length mismatches
<chr> <int> <chr> <dbl> <dbl> <dbl>
1 AT1G010... 1 470181|PA... 91.1 246 22
2 AT1G010... 1 470180|PA... 95.5 359 12
3 AT1G010... 1 333551|PA... 92.0 1970 85
4 AT1G010... 1 909874|PA... 100 213 0
5 AT1G010... 1 470171|PA... 96.8 434 8
6 AT1G011... 1 333544|PA... 93.6 528 34
7 AT1G011... 1 918855|PA... 72.6 285 68
8 AT1G011... 1 918854|PA... 84.9 179 19
9 AT1G011... 1 311317|PA... 85.6 97 0
10 AT1G011... 1 909860|PA... 92.6 310 20
11 AT1G012... 1 311313|PA... 95.3 107 5
12 AT1G012... 1 470155|PA... 96.7 1056 35
with 7 more variables: gap_openings <dbl>, q_start <dbl>, q_end <dbl>,
s_start <dbl>, s_end <dbl>, evalue <dbl>, bit_score <dbl>
This way, users will have all blast information of
one_to_one
hits available.
Using the oo_genes
dataset above, users can use this
one_to_one
hits strategy to detect homologous genes between
species.
For example, we could restrict one_to_one
hits to
fulfill certain (stringent) criteria to identify as a homologous hit.
Here we choose the following parameter constellation to achieve this
goal: one_to_one
genes must have a minimum
alignment length of 300
, a
perc_identity of > 80 percent
and an
e-value < 1E-5
.
# filter for 'homologous' hits
true_orthologs <- dplyr::filter(oo_genes, evalue < 1e-5, perc_identity > 80, alig_length > 300)
# look at results
dplyr::select(true_orthologs, query_id, subject_id, evalue, perc_identity, alig_length)
query_id subject_id evalue perc_identity alig_length
1: AT1G01030.1 470180|PACid:16054974 0 95.54 359
2: AT1G01040.1 333551|PACid:16057793 0 91.98 1970
3: AT1G01090.1 470171|PACid:16052860 0 96.77 434
4: AT1G01110.2 333544|PACid:16034284 0 93.56 528
5: AT1G01180.1 909860|PACid:16056125 0 92.58 310
6: AT1G01220.1 470155|PACid:16047984 0 96.69 1056
This way we could filter out a high confidence set of homologous
genes from the one_to_one
class of genes.
In reality most homology inference programs and methods perform way more complicated and sophisticated analyses to distinguish true orthologs from true paralogs (in-paralogs, out-paralogs, etc.). These subsequent analyses can also be performed using the above introduced split-apply-combine strategy.
Note, that you can perform self-BLAST searches
blast(query,query)
and blast(subject,subject)
to distinguish between orthologous and paralogous genes.
Now we continue with the one_to_many
class of genes.
Here we want to address the question how to deal with multiple hits
returned by BLAST+
.
Again we investigate all one_to_many
genes:
# look at results
one_to_many
A tibble: 8 x 2
query_id n_genes
<chr> <int>
1 AT1G01010.1 4
2 AT1G01060.3 2
3 AT1G01070.1 3
4 AT1G01080.1 5
5 AT1G01120.1 3
6 AT1G01140.3 36
7 AT1G01190.1 6
8 AT1G01200.1 7
When looking at gene_id AT1G01200.1
we see that it was
found 7 times in the corresponding subject set of A.
lyrata.
# look at all 7 hits found
dplyr::select(dplyr::filter(hit_tbl, query_id == "AT1G01200.1"), query_id, subject_id, evalue, perc_identity, alig_length)
A tibble: 7 x 5
query_id subject_id evalue perc_identity alig_length
<chr> <chr> <dbl> <dbl> <dbl>
1 AT1G01200.1 470156|PACid:16041002 1.87e-172 95.8 238
2 AT1G01200.1 910431|PACid:16035207 2.63e- 75 53.0 219
3 AT1G01200.1 918732|PACid:16054958 1.07e- 51 44.6 193
4 AT1G01200.1 919287|PACid:16060536 4.42e- 70 58.1 179
5 AT1G01200.1 919355|PACid:16050170 2.40e- 73 53.3 212
6 AT1G01200.1 919721|PACid:16036935 2.63e- 81 59.3 204
7 AT1G01200.1 919852|PACid:16055066 4.02e- 7 24.0 154
Now we have to decide which hit shall be considered as potential homolog.
In this example subject_id
470156|PACid:16041002
has the highest
perc_identity
as well as the lowest e-value
1.87e-172
. So a straightforward approach would be to choose
subject gene 470156|PACid:16041002
as potential ortholog of
query gene AT1G01200.1
.
We can validate this approach by running a reciprocal best hit search
with blast_rec()
and compare the output of gene
AT1G01200.1
with our choice
470156|PACid:16041002
.
A reciprocal best hit blast approach denotes the search strategy in which query sequences are blasted in one direction to detect matches in subject sequences and then inversely subject sequences are blasted in the other direction to detect matches in the query sequences. Only when both blast seaches in both directions result in the same hit pair: BLAST(A,B) = BLAST(B,A) the hit will be considered as true ortholog relationship.
# run blast best reciprocal hit function
rbh_hit_tbl <- blast_rec(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds_1000.fasta', package = 'orthologr'))
# look at results
dplyr::select(rbh_hit_tbl, query_id, subject_id, evalue, perc_identity, alig_length)
A tibble: 20 x 5
Groups: query_id [20]
query_id subject_id evalue perc_identity alig_length
<chr> <chr> <dbl> <dbl> <dbl>
1 AT1G01010.1 333554|PACid:16033839 0. 74.0 469
2 AT1G01020.1 470181|PACid:16064328 4.33e-166 91.1 246
3 AT1G01030.1 470180|PACid:16054974 0. 95.5 359
4 AT1G01040.1 333551|PACid:16057793 0. 92.0 1970
5 AT1G01050.1 909874|PACid:16064489 1.59e-160 100 213
6 AT1G01060.3 470177|PACid:16043374 0. 89.5 648
7 AT1G01070.1 918864|PACid:16052578 0. 95.1 366
8 AT1G01080.1 909871|PACid:16053217 5.90e-179 90.3 300
9 AT1G01090.1 470171|PACid:16052860 0. 96.8 434
10 AT1G01110.2 333544|PACid:16034284 0. 93.6 528
11 AT1G01120.1 918858|PACid:16049140 0. 99.2 529
12 AT1G01140.3 470161|PACid:16036015 0. 98.5 453
13 AT1G01150.1 918855|PACid:16037307 2.01e-150 72.6 285
14 AT1G01160.1 918854|PACid:16044153 1.16e- 93 84.9 179
15 AT1G01170.2 311317|PACid:16052302 5.75e- 54 85.6 97
16 AT1G01180.1 909860|PACid:16056125 0. 92.6 310
17 AT1G01190.1 311315|PACid:16059488 0. 94.2 533
18 AT1G01200.1 470156|PACid:16041002 1.87e-172 95.8 238
19 AT1G01210.1 311313|PACid:16057125 8.80e- 76 95.3 107
20 AT1G01220.1 470155|PACid:16047984 0. 96.7 1056
When we now look at gene AT1G01200.1
we find that after
an reciprocal blast approach subject gene
470181|PACid:16064328
rather than the subject gene
470156|PACid:16041002
(= result of unidirectional blast)
has been detected as potential ortholog. The example illustrates the
importance of using bidirectional blast to determine orthology
relationships.
An alternative analysis that can be performed with these three candidate subject genes is the following:
# read CDS sequences of the 20 example query genes of A. thaliana
Ath.cds <- orthologr::read.cds(file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), format = "fasta")
# read CDS sequences of the 1000 example subject genes of A. lyrata
Aly.cds <- orthologr::read.cds(file = system.file('seqs/ortho_lyra_cds_1000.fasta', package = 'orthologr'), format = "fasta")
# show the sequence of gene AT1G01070.1
Ath.cds["AT1G01070.1" , seqs]
[1] "atggctggagatatgcaaggagtgagagtagtagaaaaatattcaccggtcatagtgatggtgatgtcaaatgta
gcgatgggttcggtgaatgcacttgtgaagaaagctcttgatgttggtgtgaaccatatggtcattggtgcttatcgaat
ggctatttccgctttaattttggttccctttgcctatgtcttggaaaggaaaacaagaccacaaataacgtttaggctaa
tggtcgatcatttcgtcagtggccttctcggggcgagtttgatgcagtttttctttttgcttggtctgtcgtacacgtca
gcaactgtttcgtgtgctttggtaagcatgttgcctgcaatcaccttcgctttggcccttattttcaggactgaaaatgt
gaagattctaaagaccaaagcaggaatgttgaaggtgattggaactttgatctgtataagtggagctttgttcttaacat
tttacaaaggcccacaaatatcaaactctcactctcactctcacggtggggcttcccacaacaacaacgatcaagacaag
gccaataattggcttcttggatgtctttatttaaccataggaacagtgttgctatctctatggatgttgtttcaagggac
tttaagtattaagtacccttgcaaatactcgagcacttgtcttatgtcaattttcgcggcatttcaatgtgctctcttga
gcctttacaagagcagagacgttaatgattggatcatagatgatagattcgttatcaccgtcatcatatacgctggagtg
gtaggacaagcaatgacgacggttgcaacaacatgggggattaaaaaattaggagctgtgttcgcatcggcgtttttccc
acttactctcatttcggctactctatttgatttcctcattttacacactcctttataccttggaagtgtgattggatcac
tagtgaccataacgggtctctacatgttcttgtgggggaagaacaaagaaacggaatcatcaactgcattgtcttcagga
atggataacgaagctcaatatactactcctaataaggataacgactctaagtcgcccgtttaa"
Now users can perform a global alignment between the CDS sequences of
AT1G01070.1
and the three subject genes as follows:
library(Biostrings)
# perform 3 global alignments between: AT1G01070.1 and 918864|PACid:16052578,
# 919693|PACid:16048878, 919961|PACid:16062329
sapply(Aly.cds[ unlist(dplyr::select(dplyr::filter(hit_tbl, query_id == "AT1G01070.1"), subject_id)), seqs ], pairwiseAlignment,
pattern = Ath.cds["AT1G01070.1" , seqs], type = "global" )
$...
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] atggctggagatatgcaaggagtgagagta...aaggataacgactctaagtcgcccgtttaa
subject: [1] atgggtgaaggtatgattggagtgagagta...aaggataacgactctaagtcgcccgtttaa
score: 1768.965
$...
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] atggctgga---gatatgcaaggagtgaga...----cgac----tctaagtcgcccgtttaa
subject: [1] atggctaaatcagatatgc------tg---...ggttccacaaggtctatatcgcc---ttaa
score: -2318.726
$...
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] atggctggagatatgcaaggagtgagagta...aaggataacgactctaagtcgcccgtttaa
subject: [1] atgagtgaggatatgggaggagtgaaagta...----------------------------aa
score: 486.462
Note: To obtain the score value, you need to specify
the scoreOnly = TRUE
in the pairwiseAlignment
function.
As you can see, subject gene 918864|PACid:16052578
also
has the highest global alignment score 1768.965
based on
the Needleman-Wunsch algorithm. This strategy might help you to
differentiate between border line cases.
The examples shown above shall demonstrate the use cases that can be
performed using the blast
functions implemented in
orthologr
.
Another useful analysis can be to take the length of the initial
query genes into account using the nchar()
function:
# show the length distribution of all genes
# stored in "Ath.cds"
Ath.cds[ , nchar(seqs)]
[1] 1290 738 1077 5730 639 1938 1098 882 1287 1584 1587 1356 1038 588 252 1437 1608 714 321 3168
Or the length of a specific gene:
Ath.cds["AT1G01070.1" , nchar(seqs)]
[1] 1098
This way you can easily visualize the length distribution of genes stored in your query organism file.
Ath.cds <- read.cds(system.file('seqs/ortho_thal_cds_1000.fasta', package = 'orthologr'),
format = "fasta")
# look at sequence length distributions
hist(Ath.cds[ , nchar(seqs)], breaks = 100)
For some analyses it is sufficient to perform BLAST+ best hit
searches. The blast_best()
function is optimized to perform
BLAST+ best hit searches (only based on the minimum e-value) and returns
the best hit when performing a BLAST+ search of a query organisms (or
set of query genes) against a subject organism (or set of subject
genes).
# performing gene orthology inference using the best hit (BH) method
blast_best(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
clean_folders = TRUE)
A tibble: 20 x 12
Groups: query_id [20]
query_id subject_id perc_identity alig_length mismatches gap_openings
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AT1G010... 333554|PA... 74.0 469 80 8
2 AT1G010... 470181|PA... 91.1 246 22 0
3 AT1G010... 470180|PA... 95.5 359 12 2
4 AT1G010... 333551|PA... 92.0 1970 85 10
5 AT1G010... 909874|PA... 100 213 0 0
6 AT1G010... 470177|PA... 89.5 648 58 5
7 AT1G010... 918864|PA... 95.1 366 14 2
8 AT1G010... 909871|PA... 90.3 300 22 2
9 AT1G010... 470171|PA... 96.8 434 8 3
10 AT1G011... 333544|PA... 93.6 528 34 0
11 AT1G011... 918858|PA... 99.2 529 4 0
12 AT1G011... 470161|PA... 98.5 453 6 1
13 AT1G011... 918855|PA... 72.6 285 68 3
14 AT1G011... 918854|PA... 84.9 179 19 2
15 AT1G011... 311317|PA... 85.6 97 0 1
16 AT1G011... 909860|PA... 92.6 310 20 1
17 AT1G011... 311315|PA... 94.2 533 30 1
18 AT1G012... 470156|PA... 95.8 238 10 0
19 AT1G012... 311313|PA... 95.3 107 5 0
20 AT1G012... 470155|PA... 96.7 1056 35 0
with 6 more variables: q_start <dbl>, q_end <dbl>, s_start <dbl>,
s_end <dbl>, evalue <dbl>, bit_score <dbl>
The blast_best()
function returns:
query_id
, subject_id
, and
eval
.
In case you need more parameters returned by a BLAST+ best hit
search, you can specify the detailed_output
argument
(detailed_output = TRUE
).
# BLAST+ best hit search
best_hit_tbl <- blast_best(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'))
# look at results
dplyr::glimpse(best_hit_tbl)
Observations: 20
Variables: 12
Groups: query_id [20]
$ query_id <chr> "AT1G01010.1", "AT1G01020.1", "AT1G01030.1", "AT1G010...
$ subject_id <chr> "333554|PACid:16033839", "470181|PACid:16064328", "47...
$ perc_identity <dbl> 73.987, 91.057, 95.543, 91.980, 100.000, 89.506, 95.0...
$ alig_length <dbl> 469, 246, 359, 1970, 213, 648, 366, 300, 434, 528, 52...
$ mismatches <dbl> 80, 22, 12, 85, 0, 58, 14, 22, 8, 34, 4, 6, 68, 19, 0...
$ gap_openings <dbl> 8, 0, 2, 10, 0, 5, 2, 2, 3, 0, 0, 1, 3, 2, 1, 1, 1, 0...
$ q_start <dbl> 1, 1, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 4, 2, 1, 1, 1,...
$ q_end <dbl> 430, 246, 359, 1910, 213, 646, 366, 294, 429, 528, 52...
$ s_start <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 16, 2, 4, 1, 6, 1...
$ s_end <dbl> 466, 246, 355, 1963, 213, 640, 362, 299, 433, 528, 52...
$ evalue <dbl> 0.00e+00, 8.91e-168, 0.00e+00, 0.00e+00, 3.27e-162, 0...
$ bit_score <dbl> 627, 454, 698, 3704, 437, 1037, 696, 491, 859, 972, 1...
The blast_rec()
function was implemented to optimize
BLAST+ reciprocal best hit searches (only based on the minimum e-value).
BLAST+ reciprocal best hit searches are used to perform orthology
inference.
Running blast_rec()
using default parameter
settings:
# performing gene orthology inference using the reciprocal best hit (RBH) method
blast_rec(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'))
A tibble: 20 x 12
Groups: query_id
query_id subject_id perc_identity alig_length mismatches gap_openings
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AT1G010... 333554|PA... 74.0 469 80 8
2 AT1G010... 470181|PA... 91.1 246 22 0
3 AT1G010... 470180|PA... 95.5 359 12 2
4 AT1G010... 333551|PA... 92.0 1970 85 10
5 AT1G010... 909874|PA... 100 213 0 0
6 AT1G010... 470177|PA... 89.5 648 58 5
7 AT1G010... 918864|PA... 95.1 366 14 2
8 AT1G010... 909871|PA... 90.3 300 22 2
9 AT1G010... 470171|PA... 96.8 434 8 3
10 AT1G011... 333544|PA... 93.6 528 34 0
11 AT1G011... 918858|PA... 99.2 529 4 0
12 AT1G011... 470161|PA... 98.5 453 6 1
13 AT1G011... 918855|PA... 72.6 285 68 3
14 AT1G011... 918854|PA... 84.9 179 19 2
15 AT1G011... 311317|PA... 85.6 97 0 1
16 AT1G011... 909860|PA... 92.6 310 20 1
17 AT1G011... 311315|PA... 94.2 533 30 1
18 AT1G012... 470156|PA... 95.8 238 10 0
19 AT1G012... 311313|PA... 95.3 107 5 0
20 AT1G012... 470155|PA... 96.7 1056 35 0
with 6 more variables: q_start <dbl>, q_end <dbl>, s_start <dbl>,
s_end <dbl>, evalue <dbl>, bit_score <dbl>
The set_blast()
function reads a file storing a specific
sequence type, such as “cds”, “protein”, or “dna” in a standard sequence
file format such as “fasta”, etc. and depending of the makedb parameter
either creates a blast-able database, or returns the corresponding
protein sequences as data.table object for further BLAST searches.
# using set_blast() to generate a blastable sequence database
head(set_blast(file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'))[[1]] , 2)
geneids
1: AT1G01010.1
2: AT1G01020.1
seqs
1: atggaggatcaagttgggtttgggttccgtccgaacgacgaggagctcgttggtcactatctccgtaacaaaatcgaaggaaacact
agccgcgacgttgaagtagccatcagcgaggtcaacatctgtagctacgatccttggaacttgcgcttccagtcaaagtacaaatcgaga
gatgctatgtggtacttcttctctcgtagagaaaacaacaaagggaatcgacagagcaggacaacggtttctggtaaatggaagcttacc
ggagaatctgttgaggtcaaggaccagtggggattttgtagtgagggctttcgtggtaagattggtcataaaagggttttggtgttcctc
gatggaagataccctgacaaaaccaaatctgattgggttatccacgagttccactacgacctcttaccagaacatcagaggacatatgtc
atctgcagacttgagtacaagggtgatgatgcggacattctatctgcttatgcaatagatcccactcccgcttttgtccccaatatgact
agtagtgcaggttctgtggtcaaccaatcacgtcaacgaaattcaggatcttacaacacttactctgagtatgattcagcaaatcatggc
cagcagtttaatgaaaactctaacattatgcagcagcaaccacttcaaggatcattcaaccctctccttgagtatgattttgcaaatcac
ggcggtcagtggctgagtgactatatcgacctgcaacagcaagttccttacttggcaccttatgaaaatgagtcggagatgatttggaag
catgtgattgaagaaaattttgagtttttggtagatgaaaggacatctatgcaacagcattacagtgatcaccggcccaaaaaacctgtg
tctggggttttgcctgatgatagcagtgatactgaaactggatcaatgattttcgaagacacttcgagctccactgatagtgttggtagt
tcagatgaaccgggccatactcgtatagatgatattccatcattgaacattattgagcctttgcacaattataaggcacaagagcaacca
aagcagcagagcaaagaaaaggtgataagttcgcagaaaagcgaatgcgagtggaaaatggctgaagactcgatcaagatacctccatcc
accaacacggtgaagcagagctggattgttttggagaatgcacagtggaactatctcaagaacatgatcattggtgtcttgttgttcatc
tccgtcattagttggatcattcttgttggttaa
2:
atggcggcgagtgaacacagatgcgtgggatgtggttttagggtaaagtcattgttcattcaatactctccgggtaacattcgtctcatg
aaatgcggaaattgcaaggaagtagcagatgagtacatcgagtgtgaacgcatgattattttcatcgatttaatccttcacagaccaaag
gtatatagacacgtcctctacaatgcaattaatccagcaactgtcaatattcagcatctgttgtggaagttggtcttcgcctatcttctt
ctagactgttatagaagcttgctactgagaaaaagtgatgaagaatcgagcttttctgatagccctgttcttctatctataaaggttctg
attggtgtcttatctgcaaacgctgcatttatcatctcttttgccattgcgactaagggtttgctaaatgaagtttccagaagaagagag
attatgttggggatattcatctctagttacttcaagatatttctgcttgcgatgttggtatgggaattcccaatgtcagtgatttttttt
gtcgatatacttctcttaacatcaaactccatggctcttaaagtgatgactgaatcaacaatgaccagatgcatagccgtatgcttaatc
gcgcacttgattagattcttggtgggtcagatttttgagccgacaatatttttgatacaaattggatctctgttgcaatatatgtcttat
tttttcagaatcgtatga
aa
1: MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRENNKGNRQSRTTVSGKWK
LTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDLLPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPN
MTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQQFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMI
WKHVIEENFEFLVDERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPLHNYKAQE
QPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG*
2:
MAASEHRCVGCGFRVKSLFIQYSPGNIRLMKCGNCKEVADEYIECERMIIFIDLILHRPKVYRHVLYNAINPATVNIQHLLWKLVFAYLL
LDCYRSLLLRKSDEESSFSDSPVLLSIKVLIGVLSANAAFIISFAIATKGLLNEVSRRREIMLGIFISSYFKIFLLAMLVWEFPMSVIFF
VDILLLTSNSMALKVMTESTMTRCIAVCLIAHLIRFLVGQIFEPTIFLIQIGSLLQYMSYFFRIV*
For large-scale applications of orthologr
, users may
wish to perform BLAST+ searches between entire genomes. For very
large-scale BLAST+ searches between genomes we recommend the user to use
the metablastr package
which aims to provide an easy-to-use BLAST search framework for massive
genome comparisons.
For pairwise genome comparisons however, users can use the following functions to BLAST one genome against another.