This function performs a BLAST+ search (best hit) of a given set of protein sequences against a given database.

blast_best(
  query_file,
  subject_file,
  seq_type = "cds",
  format = "fasta",
  blast_algorithm = "blastp",
  delete_corrupt_cds = TRUE,
  eval = "1E-5",
  max.target.seqs = 10000,
  path = NULL,
  comp_cores = 1,
  blast_params = NULL,
  clean_folders = FALSE,
  save.output = NULL
)

Arguments

query_file

a character string specifying the path to the CDS file of interest (query organism).

subject_file

a character string specifying the path to the CDS file of interest (subject organism).

seq_type

a character string specifying the sequence type stored in the input file. Options are are: "cds", "protein", or "dna". In case of "cds", sequence are translated to protein sequences, in case of "dna", cds prediction is performed on the corresponding sequences which subsequently are translated to protein sequences. Default is seq_type = "cds".

format

a character string specifying the file format of the sequence file, e.g. format = "fasta". Default is format = "fasta".

blast_algorithm

a character string specifying the BLAST algorithm that shall be used, e.g. blast_algorithm = "blastp", blast_algorithm = "blastn", blast_algorithm = "tblastn" .

delete_corrupt_cds

a logical value indicating whether sequences with corrupt base triplets should be removed from the input file. This is the case when the length of coding sequences cannot be divided by 3 and thus the coding sequence contains at least one corrupt base triplet.

eval

a numeric value specifying the E-Value cutoff for BLAST hit detection.

max.target.seqs

a numeric value specifying the number of aligned sequences to keep. Please be aware that max.target.seqs selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .

path

a character string specifying the path to the BLAST program (in case you don't use the default path).

comp_cores

a numeric value specifying the number of cores to be used for multicore BLAST computations.

blast_params

a character string listing the input paramters that shall be passed to the executing BLAST program. Default is NULL, implicating that a set of default parameters is used when running BLAST.

clean_folders

a boolean value spefiying whether all internall folders storing the output of used programs shall be removed. Default is clean_folders = FALSE.

save.output

a path to the location were the BLAST output shall be stored. E.g. save.output = getwd() to store it in the current working directory, or save.output = file.path(put,your,path,here).

Value

A data.table as returned by the blast function, storing the geneids of orthologous genes (best hit) in the first column and the amino acid sequences in the second column.

Details

Given a set of protein sequences (query sequences), a best hit blast search (BH BLAST) is being performed.

Internally to perform best hit searches, the BLAST+ parameter settings:

"-best_hit_score_edge 0.05 -best_hit_overhang 0.25 -max_target_seqs 1"

are used to speed up best hit computations.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

Gish, W. & States, D.J. (1993) "Identification of protein coding regions by database similarity search." Nature Genet. 3:266-272.

Madden, T.L., Tatusov, R.L. & Zhang, J. (1996) "Applications of network BLAST server" Meth. Enzymol. 266:131-141.

Altschul, S.F., Madden, T.L., Schaeffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.

Zhang Z., Schwartz S., Wagner L., & Miller W. (2000), "A greedy algorithm for aligning DNA sequences" J Comput Biol 2000; 7(1-2):203-14.

Zhang, J. & Madden, T.L. (1997) "PowerBLAST: A new network BLAST application for interactive or automated sequence analysis and annotation." Genome Res. 7:649-656.

Morgulis A., Coulouris G., Raytselis Y., Madden T.L., Agarwala R., & Schaeffer A.A. (2008) "Database indexing for production MegaBLAST searches." Bioinformatics 15:1757-1764.

Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., & Madden T.L. (2008) "BLAST+: architecture and applications." BMC Bioinformatics 10:421.

Author

Hajk-Georg Drost

Examples

if (FALSE) {

# performing gene orthology inference using the best hit (BH) method
blast_best(query_file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
           subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'))
           
           
           
# performing gene orthology inference using the best hit (BH) method starting with protein sequences
blast_best(query_file   = system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
           subject_file = system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr'),
           seq_type     = "protein")



# save the BLAST output file to the current working directory
blast_best(query_file   = system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
           subject_file = system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr'),
           seq_type     = "protein",
           save.output  = getwd())





# use multicore processing
blast_best(query_file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'), 
           subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
           comp_cores   = 2)



# performing gene orthology inference using the best hit (BH) method and external
# blastp path
blast_best(query_file   = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
           subject_file = system.file('seqs/ortho_lyra_cds.fasta', package = 'orthologr'),
           path         = "path/to/blastp/")
           
           
}