BLAST searches using orthologr

1. Getting started with BLAST
2. Core BLAST+ Functionality
3. Perform BLAST+ searches between genomes
- 3.1 Genome and Proteome Retrieval

Getting Started

The orthologr package provides several interface functions to perform BLAST searches.

First, users need to make sure that they have BLAST installed on their machine. Please follow these instructions to install BLAST on your manine.

Performing BLAST Searches

The orthologr package stores 20 example genes (orthologs) between Arabidopsis thaliana and Arabidopsis lyrata. The following example BLAST search shall illustrate a simple search with standard parameters provided by the blast() function.

When running the subsequent functions please make sure you can call BLAST+ from your console either in the standard PATH or in case you have BLAST+ installed in a separate folder, please specify the path argument that can be passed to blast().

To check whether BLAST+ can be executed from the default PATH (usr/bin/local on UNIX systems), you can run:

# check if blast is installed and if yes, what version of blast
system("blastp -version")

This should return something like this:

blastp: 2.8.1+
 Package: blast 2.8.1, build Nov 26 2018 12:45:20

If everything works properly, you can get started with your first BLAST+ search.

Using orthologr to perform BLAST searches

The orthologr packages allows users to perform fast and easy-to-use BLAST searches between fasta files. These searches can even be scaled towards genome and proteome comparisons.

The blast() function

The blast() function provides the easiest way to perform a BLAST search.

library(dplyr)
# performing a BLAST search using blastp (default)
hit_tbl <- blast(query_file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
                 subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'))
# look at results
glimpse(hit_tbl)

Variables:
$ query_id      (chr) "AT1G01010.1", "AT1G01020.1", "AT1G01030.1",...
$ subject_id    (chr) "333554|PACid:16033839", "470181|PACid:16064...
$ perc_identity (dbl) 73.99, 91.06, 95.54, 91.98, 100.00, 89.51, 9...
$ alig_length   (dbl) 469, 246, 359, 1970, 213, 648, 366, 300, 434...
$ mismatches    (dbl) 80, 22, 12, 85, 0, 58, 14, 22, 8, 34, 4, 6, ...
$ gap_openings  (dbl) 8, 0, 2, 10, 0, 5, 2, 2, 3, 0, 0, 1, 3, 2, 1...
$ q_start       (dbl) 1, 1, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 4, 2,...
$ q_end         (dbl) 430, 246, 359, 1910, 213, 646, 366, 294, 429...
$ s_start       (dbl) 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 16, 2, 4...
$ s_end         (dbl) 466, 246, 355, 1963, 213, 640, 362, 299, 433...
$ evalue        (dbl) 0e+00, 7e-166, 0e+00, 0e+00, 2e-160, 0e+00, ...
$ bit_score     (dbl) 627, 454, 698, 3704, 437, 1037, 696, 491, 85...

As you can see, the hit table shows the output of the BLAST+ search. The blast() function runs blastp as default BLAST+ algorithm. Different BLAST+ algorithms can be selected by specifying the blast_algorithm argument, e.g. blast_algorithm = "tblastn". See ?blast for further details. The blast() function returns the BLAST arguments: query_id, subject_id, perc_identity, alig_length, mismatches, gap_openings, q_start, q_end, s_start, s_end, evalue, and bit_score.

Since blast() stores the hit table returned by BLAST in a data.table object, you can access each column, using the data.table notation.

In case you need to specify the PATH to BLAST+ please use the path argument:

# performing a BLAST search using blastp (default)
hit_tbl <- blast(query_file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
                 subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
                 path         = "/path/to/blastp")


hit_tbl

# access columns: query_id, subject_id, evalue, and bit_score
dplyr::select(hit_tbl, query_id, subject_id, evalue, bit_score)

       query_id            subject_id evalue bit_score
 1: AT1G01010.1 333554|PACid:16033839  0e+00       627
 2: AT1G01020.1 470181|PACid:16064328 7e-166       454
 3: AT1G01030.1 470180|PACid:16054974  0e+00       698
 4: AT1G01040.1 333551|PACid:16057793  0e+00      3704
 5: AT1G01050.1 909874|PACid:16064489 2e-160       437
 6: AT1G01060.3 470177|PACid:16043374  0e+00      1037
 7: AT1G01070.1 918864|PACid:16052578  0e+00       696
 8: AT1G01080.1 909871|PACid:16053217 1e-178       491
 9: AT1G01090.1 470171|PACid:16052860  0e+00       859
10: AT1G01110.2 333544|PACid:16034284  0e+00       972
11: AT1G01120.1 918858|PACid:16049140  0e+00      1092
12: AT1G01140.3 470161|PACid:16036015  0e+00       918
13: AT1G01150.1 918855|PACid:16037307 3e-150       421
14: AT1G01160.1 918854|PACid:16044153  1e-93       268
15: AT1G01170.2 311317|PACid:16052302  3e-54       158
16: AT1G01180.1 909860|PACid:16056125  0e+00       576
17: AT1G01190.1 311315|PACid:16059488  0e+00      1036
18: AT1G01200.1 470156|PACid:16041002 3e-172       470
19: AT1G01210.1 311313|PACid:16057125  7e-76       215
20: AT1G01220.1 470155|PACid:16047984  0e+00      2106

The blast() function also allows you to pass additional parameters to the BLAST+ search using the blast_params argument. In the following example, a remote BLAST+ search is performed.

hit_tbl <- blast(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
                 subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
                 blast_params = "-qcov_hsp_perc 0.9")

glimpse(hit_tbl)

A tibble: 78 x 12
   query_id subject_id perc_identity alig_length mismatches gap_openings
   <chr>    <chr>              <dbl>       <dbl>      <dbl>        <dbl>
 1 AT1G010... 311334|PA...          43.6         165         81            4
 2 AT1G010... 333554|PA...          74.0         469         80            8
 3 AT1G010... 909883|PA...          34.5         354        162           12
 4 AT1G010... 918785|PA...          33.8         160         80            6
 5 AT1G010... 470181|PA...          91.1         246         22            0
 6 AT1G010... 470180|PA...          95.5         359         12            2
 7 AT1G010... 333551|PA...          92.0        1970         85           10
 8 AT1G010... 909874|PA...         100           213          0            0
 9 AT1G010... 311286|PA...          64.5          62         22            0
10 AT1G010... 470177|PA...          89.5         648         58            5
 with 68 more rows, and 6 more variables: q_start <dbl>, q_end <dbl>,
 s_start <dbl>, s_end <dbl>, evalue <dbl>, bit_score <dbl>

In all cases the default e-value BLAST+ searches is 1E-5 and the default blast_algorithm is blastp.

Since BLAST+ searches can be computationally expensive, it is possible to specify the comp_cores argument when working with an multicore machine.

# BLAST computations using the comp_cores parameter: here with 2 cores
blast(query_file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
      subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
      comp_cores   = 2)

The query_file and subject_file arguments specify the path to the corresponding fasta files storing the CDS files, amino acid files, or genome files of the query organism and subject organism of interest. Make sure that when using CDSfiles, amino acid files, or genome files the corresponding argument seq_type must be adapted according to the input data format.

Use :

CDS files -> seq_type = "cds"
amino acid files -> seq_type = "protein"
genome files -> seq_type = "dna"

The format argument specifies the input file format, e.g. “fasta” or “gbk”. The blast_algorithm argument specifies the BLAST program (algorithm) that shall be used to perform BLAST searches, e.g. “blastp”,“blastn”,“tblastn”,etc. Again, the eval argument defines the default e-value that shall be chosen as best hit threshold.

Using the split-apply-combine strategy for a BLAST hit table

All blast functions implemented in orthologr can easily be processed using the split-apply-combine strategy to detect for example one-to-one, one-to-many, and many-to-many gene homology relationships.

Here a simple example:

library(dplyr)
# perform a blastp search of 20 A. thaliana genes against 1000 A. lyrata genes
hit_tbl <- blast(query_file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
                 subject_file = system.file('seqs/ortho_lyra_cds_1000.fasta', package = 'orthologr'))

# determine 'one-to-many' and 'one-to-one' gene relationships
rel_hit_tbl <- summarize(group_by(hit_tbl, query_id), n_genes = n())
# look at results
rel_hit_tbl

A tibble: 20 x 2
   query_id    n_genes
   <chr>         <int>
 1 AT1G01010.1       4
 2 AT1G01020.1       1
 3 AT1G01030.1       1
 4 AT1G01040.1       1
 5 AT1G01050.1       1
 6 AT1G01060.3       2
 7 AT1G01070.1       3
 8 AT1G01080.1       5
 9 AT1G01090.1       1
10 AT1G01110.2       1
11 AT1G01120.1       3
12 AT1G01140.3      36
13 AT1G01150.1       1
14 AT1G01160.1       1
15 AT1G01170.2       1
16 AT1G01180.1       1
17 AT1G01190.1       6
18 AT1G01200.1       7
19 AT1G01210.1       1
20 AT1G01220.1       1

Now you can sort genes into classes: one-to-one and one-to-many.

# classify into 'one-to-one' relationships
one_to_one <- filter(rel_hit_tbl, n_genes == 1)
# classify into 'one-to-many' relationships
one_to_many <- filter(rel_hit_tbl, n_genes > 1)

# look at one_to_one
one_to_one

A tibble: 12 x 2
   query_id    n_genes
   <chr>         <int>
 1 AT1G01020.1       1
 2 AT1G01030.1       1
 3 AT1G01040.1       1
 4 AT1G01050.1       1
 5 AT1G01090.1       1
 6 AT1G01110.2       1
 7 AT1G01150.1       1
 8 AT1G01160.1       1
 9 AT1G01170.2       1
10 AT1G01180.1       1
11 AT1G01210.1       1
12 AT1G01220.1       1

# look at one_to_many
one_to_many

A tibble: 8 x 2
  query_id    n_genes
  <chr>         <int>
1 AT1G01010.1       4
2 AT1G01060.3       2
3 AT1G01070.1       3
4 AT1G01080.1       5
5 AT1G01120.1       3
6 AT1G01140.3      36
7 AT1G01190.1       6
8 AT1G01200.1       7

Now we can treat classes: one_to_one and one_to_many differently:

Working with one-to-one hits

We can now retrieve the blast hit entries for the one-to-one hits by joining the tables one_to_one and hit_tbl.

# look at the evalue, perc_identity, and alig_length of one_to_one genes
oo_genes <- dplyr::inner_join(one_to_one, hit_tbl , by = "query_id")
oo_genes

A tibble: 12 x 13
   query_id n_genes subject_id perc_identity alig_length mismatches
   <chr>      <int> <chr>              <dbl>       <dbl>      <dbl>
 1 AT1G010...       1 470181|PA...          91.1         246         22
 2 AT1G010...       1 470180|PA...          95.5         359         12
 3 AT1G010...       1 333551|PA...          92.0        1970         85
 4 AT1G010...       1 909874|PA...         100           213          0
 5 AT1G010...       1 470171|PA...          96.8         434          8
 6 AT1G011...       1 333544|PA...          93.6         528         34
 7 AT1G011...       1 918855|PA...          72.6         285         68
 8 AT1G011...       1 918854|PA...          84.9         179         19
 9 AT1G011...       1 311317|PA...          85.6          97          0
10 AT1G011...       1 909860|PA...          92.6         310         20
11 AT1G012...       1 311313|PA...          95.3         107          5
12 AT1G012...       1 470155|PA...          96.7        1056         35
 with 7 more variables: gap_openings <dbl>, q_start <dbl>, q_end <dbl>,
   s_start <dbl>, s_end <dbl>, evalue <dbl>, bit_score <dbl>

This way, users will have all blast information of one_to_one hits available.

A real world application: Homology Inference

Using the oo_genes dataset above, users can use this one_to_one hits strategy to detect homologous genes between species.

For example, we could restrict one_to_one hits to fulfill certain (stringent) criteria to identify as a homologous hit. Here we choose the following parameter constellation to achieve this goal: one_to_one genes must have a minimum alignment length of 300, a perc_identity of > 80 percent and an e-value < 1E-5.

# filter for 'homologous' hits
true_orthologs <- dplyr::filter(oo_genes, evalue < 1e-5, perc_identity > 80, alig_length > 300)
# look at results
dplyr::select(true_orthologs, query_id, subject_id, evalue, perc_identity, alig_length)

      query_id            subject_id evalue perc_identity alig_length
1: AT1G01030.1 470180|PACid:16054974      0         95.54         359
2: AT1G01040.1 333551|PACid:16057793      0         91.98        1970
3: AT1G01090.1 470171|PACid:16052860      0         96.77         434
4: AT1G01110.2 333544|PACid:16034284      0         93.56         528
5: AT1G01180.1 909860|PACid:16056125      0         92.58         310
6: AT1G01220.1 470155|PACid:16047984      0         96.69        1056

This way we could filter out a high confidence set of homologous genes from the one_to_one class of genes.

In reality most homology inference programs and methods perform way more complicated and sophisticated analyses to distinguish true orthologs from true paralogs (in-paralogs, out-paralogs, etc.). These subsequent analyses can also be performed using the above introduced split-apply-combine strategy.

Note, that you can perform self-BLAST searches blast(query,query) and blast(subject,subject) to distinguish between orthologous and paralogous genes.

Now we continue with the one_to_many class of genes.

Working with one-to-many hits

Here we want to address the question how to deal with multiple hits returned by BLAST+.

Again we investigate all one_to_many genes:

# look at results 
one_to_many

 A tibble: 8 x 2
  query_id    n_genes
  <chr>         <int>
1 AT1G01010.1       4
2 AT1G01060.3       2
3 AT1G01070.1       3
4 AT1G01080.1       5
5 AT1G01120.1       3
6 AT1G01140.3      36
7 AT1G01190.1       6
8 AT1G01200.1       7

When looking at gene_id AT1G01200.1 we see that it was found 7 times in the corresponding subject set of A. lyrata.

# look at all 7 hits found
dplyr::select(dplyr::filter(hit_tbl, query_id == "AT1G01200.1"), query_id, subject_id, evalue, perc_identity, alig_length)

 A tibble: 7 x 5
  query_id    subject_id               evalue perc_identity alig_length
  <chr>       <chr>                     <dbl>         <dbl>       <dbl>
1 AT1G01200.1 470156|PACid:16041002 1.87e-172          95.8         238
2 AT1G01200.1 910431|PACid:16035207 2.63e- 75          53.0         219
3 AT1G01200.1 918732|PACid:16054958 1.07e- 51          44.6         193
4 AT1G01200.1 919287|PACid:16060536 4.42e- 70          58.1         179
5 AT1G01200.1 919355|PACid:16050170 2.40e- 73          53.3         212
6 AT1G01200.1 919721|PACid:16036935 2.63e- 81          59.3         204
7 AT1G01200.1 919852|PACid:16055066 4.02e-  7          24.0         154

Now we have to decide which hit shall be considered as potential homolog.

In this example subject_id 470156|PACid:16041002 has the highest perc_identity as well as the lowest e-value 1.87e-172. So a straightforward approach would be to choose subject gene 470156|PACid:16041002 as potential ortholog of query gene AT1G01200.1.

We can validate this approach by running a reciprocal best hit search with blast_rec()and compare the output of gene AT1G01200.1 with our choice 470156|PACid:16041002.

A reciprocal best hit blast approach denotes the search strategy in which query sequences are blasted in one direction to detect matches in subject sequences and then inversely subject sequences are blasted in the other direction to detect matches in the query sequences. Only when both blast seaches in both directions result in the same hit pair: BLAST(A,B) = BLAST(B,A) the hit will be considered as true ortholog relationship.

# run blast best reciprocal hit function
rbh_hit_tbl <- blast_rec(query_file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
                         subject_file = system.file('seqs/ortho_lyra_cds_1000.fasta', package = 'orthologr'))
# look at results
dplyr::select(rbh_hit_tbl, query_id, subject_id, evalue, perc_identity, alig_length)

 A tibble: 20 x 5
 Groups:   query_id [20]
   query_id    subject_id               evalue perc_identity alig_length
   <chr>       <chr>                     <dbl>         <dbl>       <dbl>
 1 AT1G01010.1 333554|PACid:16033839 0.                 74.0         469
 2 AT1G01020.1 470181|PACid:16064328 4.33e-166          91.1         246
 3 AT1G01030.1 470180|PACid:16054974 0.                 95.5         359
 4 AT1G01040.1 333551|PACid:16057793 0.                 92.0        1970
 5 AT1G01050.1 909874|PACid:16064489 1.59e-160         100           213
 6 AT1G01060.3 470177|PACid:16043374 0.                 89.5         648
 7 AT1G01070.1 918864|PACid:16052578 0.                 95.1         366
 8 AT1G01080.1 909871|PACid:16053217 5.90e-179          90.3         300
 9 AT1G01090.1 470171|PACid:16052860 0.                 96.8         434
10 AT1G01110.2 333544|PACid:16034284 0.                 93.6         528
11 AT1G01120.1 918858|PACid:16049140 0.                 99.2         529
12 AT1G01140.3 470161|PACid:16036015 0.                 98.5         453
13 AT1G01150.1 918855|PACid:16037307 2.01e-150          72.6         285
14 AT1G01160.1 918854|PACid:16044153 1.16e- 93          84.9         179
15 AT1G01170.2 311317|PACid:16052302 5.75e- 54          85.6          97
16 AT1G01180.1 909860|PACid:16056125 0.                 92.6         310
17 AT1G01190.1 311315|PACid:16059488 0.                 94.2         533
18 AT1G01200.1 470156|PACid:16041002 1.87e-172          95.8         238
19 AT1G01210.1 311313|PACid:16057125 8.80e- 76          95.3         107
20 AT1G01220.1 470155|PACid:16047984 0.                 96.7        1056

When we now look at gene AT1G01200.1 we find that after an reciprocal blast approach subject gene 470181|PACid:16064328 rather than the subject gene 470156|PACid:16041002 (= result of unidirectional blast) has been detected as potential ortholog. The example illustrates the importance of using bidirectional blast to determine orthology relationships.

An alternative analysis that can be performed with these three candidate subject genes is the following:

# read CDS sequences of the 20 example query genes of A. thaliana
Ath.cds <- orthologr::read.cds(file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), format = "fasta")
# read CDS sequences of the 1000 example subject genes of A. lyrata
Aly.cds <- orthologr::read.cds(file   = system.file('seqs/ortho_lyra_cds_1000.fasta', package = 'orthologr'), format = "fasta")
# show the sequence of gene AT1G01070.1
Ath.cds["AT1G01070.1" , seqs]

[1] "atggctggagatatgcaaggagtgagagtagtagaaaaatattcaccggtcatagtgatggtgatgtcaaatgta
gcgatgggttcggtgaatgcacttgtgaagaaagctcttgatgttggtgtgaaccatatggtcattggtgcttatcgaat
ggctatttccgctttaattttggttccctttgcctatgtcttggaaaggaaaacaagaccacaaataacgtttaggctaa
tggtcgatcatttcgtcagtggccttctcggggcgagtttgatgcagtttttctttttgcttggtctgtcgtacacgtca
gcaactgtttcgtgtgctttggtaagcatgttgcctgcaatcaccttcgctttggcccttattttcaggactgaaaatgt
gaagattctaaagaccaaagcaggaatgttgaaggtgattggaactttgatctgtataagtggagctttgttcttaacat
tttacaaaggcccacaaatatcaaactctcactctcactctcacggtggggcttcccacaacaacaacgatcaagacaag
gccaataattggcttcttggatgtctttatttaaccataggaacagtgttgctatctctatggatgttgtttcaagggac
tttaagtattaagtacccttgcaaatactcgagcacttgtcttatgtcaattttcgcggcatttcaatgtgctctcttga
gcctttacaagagcagagacgttaatgattggatcatagatgatagattcgttatcaccgtcatcatatacgctggagtg
gtaggacaagcaatgacgacggttgcaacaacatgggggattaaaaaattaggagctgtgttcgcatcggcgtttttccc
acttactctcatttcggctactctatttgatttcctcattttacacactcctttataccttggaagtgtgattggatcac
tagtgaccataacgggtctctacatgttcttgtgggggaagaacaaagaaacggaatcatcaactgcattgtcttcagga
atggataacgaagctcaatatactactcctaataaggataacgactctaagtcgcccgtttaa"

Now users can perform a global alignment between the CDS sequences of AT1G01070.1 and the three subject genes as follows:

library(Biostrings)
# perform 3 global alignments between:  AT1G01070.1 and 918864|PACid:16052578, 
# 919693|PACid:16048878, 919961|PACid:16062329
sapply(Aly.cds[ unlist(dplyr::select(dplyr::filter(hit_tbl, query_id == "AT1G01070.1"), subject_id)), seqs ], pairwiseAlignment, 
       pattern = Ath.cds["AT1G01070.1" , seqs], type    = "global" )

$...
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] atggctggagatatgcaaggagtgagagta...aaggataacgactctaagtcgcccgtttaa 
subject: [1] atgggtgaaggtatgattggagtgagagta...aaggataacgactctaagtcgcccgtttaa 
score: 1768.965 

$...
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] atggctgga---gatatgcaaggagtgaga...----cgac----tctaagtcgcccgtttaa 
subject: [1] atggctaaatcagatatgc------tg---...ggttccacaaggtctatatcgcc---ttaa 
score: -2318.726 

$...
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] atggctggagatatgcaaggagtgagagta...aaggataacgactctaagtcgcccgtttaa 
subject: [1] atgagtgaggatatgggaggagtgaaagta...----------------------------aa 
score: 486.462

Note: To obtain the score value, you need to specify the scoreOnly = TRUE in the pairwiseAlignment function.

As you can see, subject gene 918864|PACid:16052578 also has the highest global alignment score 1768.965 based on the Needleman-Wunsch algorithm. This strategy might help you to differentiate between border line cases.

The examples shown above shall demonstrate the use cases that can be performed using the blast functions implemented in orthologr.

Another useful analysis can be to take the length of the initial query genes into account using the nchar() function:

# show the length distribution of all genes
# stored in "Ath.cds"
Ath.cds[ , nchar(seqs)]


[1] 1290  738 1077 5730  639 1938 1098  882 1287 1584 1587 1356 1038  588  252 1437 1608  714  321 3168

Or the length of a specific gene:

Ath.cds["AT1G01070.1" , nchar(seqs)]

[1] 1098

This way you can easily visualize the length distribution of genes stored in your query organism file.

Ath.cds <- read.cds(system.file('seqs/ortho_thal_cds_1000.fasta', package = 'orthologr'),
                    format = "fasta")
                    
# look at sequence length distributions
hist(Ath.cds[ , nchar(seqs)], breaks = 100)

The blast_best() function

For some analyses it is sufficient to perform BLAST+ best hit searches. The blast_best() function is optimized to perform BLAST+ best hit searches (only based on the minimum e-value) and returns the best hit when performing a BLAST+ search of a query organisms (or set of query genes) against a subject organism (or set of subject genes).

# performing gene orthology inference using the best hit (BH) method
blast_best(query_file    = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
           subject_file  = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
           clean_folders = TRUE)

A tibble: 20 x 12
Groups:   query_id [20]
   query_id subject_id perc_identity alig_length mismatches gap_openings
   <chr>    <chr>              <dbl>       <dbl>      <dbl>        <dbl>
 1 AT1G010... 333554|PA...          74.0         469         80            8
 2 AT1G010... 470181|PA...          91.1         246         22            0
 3 AT1G010... 470180|PA...          95.5         359         12            2
 4 AT1G010... 333551|PA...          92.0        1970         85           10
 5 AT1G010... 909874|PA...         100           213          0            0
 6 AT1G010... 470177|PA...          89.5         648         58            5
 7 AT1G010... 918864|PA...          95.1         366         14            2
 8 AT1G010... 909871|PA...          90.3         300         22            2
 9 AT1G010... 470171|PA...          96.8         434          8            3
10 AT1G011... 333544|PA...          93.6         528         34            0
11 AT1G011... 918858|PA...          99.2         529          4            0
12 AT1G011... 470161|PA...          98.5         453          6            1
13 AT1G011... 918855|PA...          72.6         285         68            3
14 AT1G011... 918854|PA...          84.9         179         19            2
15 AT1G011... 311317|PA...          85.6          97          0            1
16 AT1G011... 909860|PA...          92.6         310         20            1
17 AT1G011... 311315|PA...          94.2         533         30            1
18 AT1G012... 470156|PA...          95.8         238         10            0
19 AT1G012... 311313|PA...          95.3         107          5            0
20 AT1G012... 470155|PA...          96.7        1056         35            0
 with 6 more variables: q_start <dbl>, q_end <dbl>, s_start <dbl>,
   s_end <dbl>, evalue <dbl>, bit_score <dbl>

The blast_best() function returns: query_id, subject_id, and eval.

In case you need more parameters returned by a BLAST+ best hit search, you can specify the detailed_output argument (detailed_output = TRUE).

# BLAST+ best hit search
best_hit_tbl <- blast_best(query_file      = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
                           subject_file    = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'))
# look at results
dplyr::glimpse(best_hit_tbl)

Observations: 20
Variables: 12
Groups: query_id [20]
$ query_id      <chr> "AT1G01010.1", "AT1G01020.1", "AT1G01030.1", "AT1G010...
$ subject_id    <chr> "333554|PACid:16033839", "470181|PACid:16064328", "47...
$ perc_identity <dbl> 73.987, 91.057, 95.543, 91.980, 100.000, 89.506, 95.0...
$ alig_length   <dbl> 469, 246, 359, 1970, 213, 648, 366, 300, 434, 528, 52...
$ mismatches    <dbl> 80, 22, 12, 85, 0, 58, 14, 22, 8, 34, 4, 6, 68, 19, 0...
$ gap_openings  <dbl> 8, 0, 2, 10, 0, 5, 2, 2, 3, 0, 0, 1, 3, 2, 1, 1, 1, 0...
$ q_start       <dbl> 1, 1, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 5, 4, 2, 1, 1, 1,...
$ q_end         <dbl> 430, 246, 359, 1910, 213, 646, 366, 294, 429, 528, 52...
$ s_start       <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 16, 2, 4, 1, 6, 1...
$ s_end         <dbl> 466, 246, 355, 1963, 213, 640, 362, 299, 433, 528, 52...
$ evalue        <dbl> 0.00e+00, 8.91e-168, 0.00e+00, 0.00e+00, 3.27e-162, 0...
$ bit_score     <dbl> 627, 454, 698, 3704, 437, 1037, 696, 491, 859, 972, 1...

The blast_rec() function

The blast_rec() function was implemented to optimize BLAST+ reciprocal best hit searches (only based on the minimum e-value). BLAST+ reciprocal best hit searches are used to perform orthology inference.

Running blast_rec() using default parameter settings:

# performing gene orthology inference using the reciprocal best hit (RBH) method
blast_rec(query_file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
          subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'))

 A tibble: 20 x 12
 Groups:   query_id 
   query_id subject_id perc_identity alig_length mismatches gap_openings
   <chr>    <chr>              <dbl>       <dbl>      <dbl>        <dbl>
 1 AT1G010... 333554|PA...          74.0         469         80            8
 2 AT1G010... 470181|PA...          91.1         246         22            0
 3 AT1G010... 470180|PA...          95.5         359         12            2
 4 AT1G010... 333551|PA...          92.0        1970         85           10
 5 AT1G010... 909874|PA...         100           213          0            0
 6 AT1G010... 470177|PA...          89.5         648         58            5
 7 AT1G010... 918864|PA...          95.1         366         14            2
 8 AT1G010... 909871|PA...          90.3         300         22            2
 9 AT1G010... 470171|PA...          96.8         434          8            3
10 AT1G011... 333544|PA...          93.6         528         34            0
11 AT1G011... 918858|PA...          99.2         529          4            0
12 AT1G011... 470161|PA...          98.5         453          6            1
13 AT1G011... 918855|PA...          72.6         285         68            3
14 AT1G011... 918854|PA...          84.9         179         19            2
15 AT1G011... 311317|PA...          85.6          97          0            1
16 AT1G011... 909860|PA...          92.6         310         20            1
17 AT1G011... 311315|PA...          94.2         533         30            1
18 AT1G012... 470156|PA...          95.8         238         10            0
19 AT1G012... 311313|PA...          95.3         107          5            0
20 AT1G012... 470155|PA...          96.7        1056         35            0
 with 6 more variables: q_start <dbl>, q_end <dbl>, s_start <dbl>,
   s_end <dbl>, evalue <dbl>, bit_score <dbl>

The set_blast() function

The set_blast()function reads a file storing a specific sequence type, such as “cds”, “protein”, or “dna” in a standard sequence file format such as “fasta”, etc. and depending of the makedb parameter either creates a blast-able database, or returns the corresponding protein sequences as data.table object for further BLAST searches.

# using set_blast() to generate a blastable sequence database
head(set_blast(file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'))[[1]] , 2)


       geneids
1: AT1G01010.1
2: AT1G01020.1

seqs
1: atggaggatcaagttgggtttgggttccgtccgaacgacgaggagctcgttggtcactatctccgtaacaaaatcgaaggaaacact
agccgcgacgttgaagtagccatcagcgaggtcaacatctgtagctacgatccttggaacttgcgcttccagtcaaagtacaaatcgaga
gatgctatgtggtacttcttctctcgtagagaaaacaacaaagggaatcgacagagcaggacaacggtttctggtaaatggaagcttacc
ggagaatctgttgaggtcaaggaccagtggggattttgtagtgagggctttcgtggtaagattggtcataaaagggttttggtgttcctc
gatggaagataccctgacaaaaccaaatctgattgggttatccacgagttccactacgacctcttaccagaacatcagaggacatatgtc
atctgcagacttgagtacaagggtgatgatgcggacattctatctgcttatgcaatagatcccactcccgcttttgtccccaatatgact
agtagtgcaggttctgtggtcaaccaatcacgtcaacgaaattcaggatcttacaacacttactctgagtatgattcagcaaatcatggc
cagcagtttaatgaaaactctaacattatgcagcagcaaccacttcaaggatcattcaaccctctccttgagtatgattttgcaaatcac
ggcggtcagtggctgagtgactatatcgacctgcaacagcaagttccttacttggcaccttatgaaaatgagtcggagatgatttggaag
catgtgattgaagaaaattttgagtttttggtagatgaaaggacatctatgcaacagcattacagtgatcaccggcccaaaaaacctgtg
tctggggttttgcctgatgatagcagtgatactgaaactggatcaatgattttcgaagacacttcgagctccactgatagtgttggtagt
tcagatgaaccgggccatactcgtatagatgatattccatcattgaacattattgagcctttgcacaattataaggcacaagagcaacca
aagcagcagagcaaagaaaaggtgataagttcgcagaaaagcgaatgcgagtggaaaatggctgaagactcgatcaagatacctccatcc
accaacacggtgaagcagagctggattgttttggagaatgcacagtggaactatctcaagaacatgatcattggtgtcttgttgttcatc
tccgtcattagttggatcattcttgttggttaa
2:
atggcggcgagtgaacacagatgcgtgggatgtggttttagggtaaagtcattgttcattcaatactctccgggtaacattcgtctcatg
aaatgcggaaattgcaaggaagtagcagatgagtacatcgagtgtgaacgcatgattattttcatcgatttaatccttcacagaccaaag
gtatatagacacgtcctctacaatgcaattaatccagcaactgtcaatattcagcatctgttgtggaagttggtcttcgcctatcttctt
ctagactgttatagaagcttgctactgagaaaaagtgatgaagaatcgagcttttctgatagccctgttcttctatctataaaggttctg
attggtgtcttatctgcaaacgctgcatttatcatctcttttgccattgcgactaagggtttgctaaatgaagtttccagaagaagagag
attatgttggggatattcatctctagttacttcaagatatttctgcttgcgatgttggtatgggaattcccaatgtcagtgatttttttt
gtcgatatacttctcttaacatcaaactccatggctcttaaagtgatgactgaatcaacaatgaccagatgcatagccgtatgcttaatc
gcgcacttgattagattcttggtgggtcagatttttgagccgacaatatttttgatacaaattggatctctgttgcaatatatgtcttat
tttttcagaatcgtatga

aa
1: MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRENNKGNRQSRTTVSGKWK
LTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDLLPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPN
MTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQQFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMI
WKHVIEENFEFLVDERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPLHNYKAQE
QPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG*
2:
MAASEHRCVGCGFRVKSLFIQYSPGNIRLMKCGNCKEVADEYIECERMIIFIDLILHRPKVYRHVLYNAINPATVNIQHLLWKLVFAYLL
LDCYRSLLLRKSDEESSFSDSPVLLSIKVLIGVLSANAAFIISFAIATKGLLNEVSRRREIMLGIFISSYFKIFLLAMLVWEFPMSVIFF
VDILLLTSNSMALKVMTESTMTRCIAVCLIAHLIRFLVGQIFEPTIFLIQIGSLLQYMSYFFRIV*

Perform BLAST Searches between Genomes or Proteomes

For large-scale applications of orthologr, users may wish to perform BLAST+ searches between entire genomes. For very large-scale BLAST+ searches between genomes we recommend the user to use the metablastr package which aims to provide an easy-to-use BLAST search framework for massive genome comparisons.

For pairwise genome comparisons however, users can use the following functions to BLAST one genome against another.

2023-10-10