This function performs a blast+ search (reciprocal best hit) of a given set of protein sequences against a second set of protein sequences and vice versa.

blast_rec(
  query_file,
  subject_file,
  seq_type = "cds",
  format = "fasta",
  blast_algorithm = "blastp",
  delete_corrupt_cds = TRUE,
  eval = "1E-5",
  max.target.seqs = 10000,
  path = NULL,
  comp_cores = 1,
  blast_params = NULL,
  clean_folders = FALSE,
  save.output = NULL
)

Arguments

query_file

a character string specifying the path to the sequence file of interest (query organism).

subject_file

a character string specifying the path to the sequence file of interest (subject organism).

seq_type

a character string specifying the sequence type stored in the input file. Options are are: "cds", "protein", or "dna". In case of "cds", sequence are translated to protein sequences, in case of "dna", cds prediction is performed on the corresponding sequences which subsequently are translated to protein sequences. Default is seq_type = "cds".

format

a character string specifying the file format of the sequence file, e.g. "fasta", "gbk". Default is "fasta".

blast_algorithm

a character string specifying the BLAST algorithm that shall be used, e.g. "blastp","blastn","tblastn",... .

delete_corrupt_cds

a logical value indicating whether sequences with corrupt base triplets should be removed from the input file. This is the case when the length of coding sequences cannot be divided by 3 and thus the coding sequence contains at least one corrupt base triplet.

eval

a numeric value specifying the E-Value cutoff for BLAST hit detection.

max.target.seqs

a numeric value specifying the number of aligned sequences to keep. Please be aware that max.target.seqs selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .

path

a character string specifying the path to the BLAST program (in case you don't use the default path).

comp_cores

a numeric value specifying the number of cores to be used for multicore BLAST computations.

blast_params

a character string listing the input paramters that shall be passed to the executing BLAST program. Default is NULL, implicating that a set of default parameters is used when running BLAST.

clean_folders

a boolean value spefiying whether all internall folders storing the output of used programs shall be removed. Default is clean_folders = FALSE.

save.output

a path to the location were the BLAST output shall be stored. E.g. save.output = getwd() to store it in the current working directory, or save.output = file.path(put,your,path,here).

Value

A data.table as returned by the blast function, storing the geneids of orthologous genes (reciprocal best hit) in the first column and the amino acid sequences in the second column.

Details

Given a set of protein sequences A and a different set of protein sequences B, first a best hit blast search is being performed from A to B: blast(A,B) and afterwards a best hit blast search is being performed from B to A: blast(B,A). Only protein sequences that were found to be best hits in both directions are retained and returned.

This function can be used to perform orthology inference using BLAST+ best reciprocal hit methodology.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

Gish, W. & States, D.J. (1993) "Identification of protein coding regions by database similarity search." Nature Genet. 3:266-272.

Madden, T.L., Tatusov, R.L. & Zhang, J. (1996) "Applications of network BLAST server" Meth. Enzymol. 266:131-141.

Altschul, S.F., Madden, T.L., Schaeffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.

Zhang Z., Schwartz S., Wagner L., & Miller W. (2000), "A greedy algorithm for aligning DNA sequences" J Comput Biol 2000; 7(1-2):203-14.

Zhang, J. & Madden, T.L. (1997) "PowerBLAST: A new network BLAST application for interactive or automated sequence analysis and annotation." Genome Res. 7:649-656.

Morgulis A., Coulouris G., Raytselis Y., Madden T.L., Agarwala R., & Schaeffer A.A. (2008) "Database indexing for production MegaBLAST searches." Bioinformatics 15:1757-1764.

Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., & Madden T.L. (2008) "BLAST+: architecture and applications." BMC Bioinformatics 10:421.

See also

Examples

if (FALSE) { # performing gene orthology inference using the reciprocal best hit (RBH) method blast_rec(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr')) # performing gene orthology inference using the reciprocal best hit (RBH) method # starting with protein sequences blast_rec(query_file = system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr'), seq_type = "protein") # save the BLAST output file to the current working directory blast_rec(query_file = system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr'), seq_type = "protein", save.output = getwd()) # use multicore processing blast_rec(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'), comp_cores = 2) # performing gene orthology inference using the reciprocal best hit (RBH) method # and external path to blastp blast_rec(query_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'), path = "path/to/blastp") }