This function performs a DIAMOND search (best reciprocal hit) of a given set of protein sequences against a given database.

diamond_protein_to_protein_best_reciprocal_hits(
  query,
  subject,
  is_subject_db = FALSE,
  format = "fasta",
  sensitivity_mode = "ultra-sensitive",
  out_format = "csv",
  evalue = "1E-5",
  max_target_seqs = 5000,
  cores = 1,
  hard_mask = TRUE,
  diamond_exec_path = NULL,
  add_makedb_options = NULL,
  add_diamond_options = NULL,
  output_path = NULL
)

Arguments

query

a character string specifying the path to the protein sequence file of interest (query organism).

subject

a character string specifying the path to the protein sequence file of interest (subject organism).

is_subject_db

logical specifying whether or not the subject file is a file in fasta format (is_subject_db = FALSE; default) or a fasta file that was previously converted into a blast-able database using diamond makedb (is_subject_db = TRUE).

format

a character string specifying the file format of the sequence file, e.g. format = "fasta". Default is format = "fasta".

sensitivity_mode

specify the level of alignment sensitivity. The higher the sensitivity level, the more deep homologs can be found, but at the cost of reduced computational speed.

  • sensitivity_mode = "faster" : fastest alignment mode, but least sensitive (default). Designed for finding hits of >70

  • sensitivity_mode = "default" : Default mode. Designed for finding hits of >70

  • sensitivity_mode = "fast" : fastest alignment mode, but least sensitive (default). Designed for finding hits of >70

  • sensitivity_mode = "mid-sensitive" : fast alignments between the fast mode and the sensitive mode in sensitivity.

  • sensitivity_mode = "sensitive" : fast alignments, but full sensitivity for hits >40

  • sensitivity_mode = "more-sensitive" : more sensitive than the sensitive mode.

  • sensitivity_mode = "very-sensitive" : sensitive alignment mode.

  • sensitivity_mode = "ultra-sensitive" : most sensitive alignment mode (sensitivity as high as BLASTP).

out_format

a character string specifying the format of the file in which the DIAMOND results shall be stored. Available options are:

  • out_format = "pair" : Pairwise

  • out_format = "xml" : XML

  • out_format = "csv" : Comma-separated file

evalue

Expectation value (E) threshold for saving hits (default: evalue = 0.001).

max_target_seqs

maximum number of aligned sequences that shall be retained. Please be aware that max_target_seqs selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .

cores

number of cores for parallel DIAMOND searches.

hard_mask

shall low complexity regions be hard masked with TANTAN? Default is db_hard_mask = TRUE.

diamond_exec_path

a path to the DIAMOND executable or conda/miniconda folder.

add_makedb_options

a character string specifying additional makedb options that shall be passed on to the diamond makedb command line call, e.g. add_make_options = "--taxonnames" (Default is add_diamond_options = NULL).

add_diamond_options

a character string specifying additional diamond options that shall be passed on to the diamond command line call, e.g. add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits" (Default is add_diamond_options = NULL).

output_path

a path to the location were the DIAMOND best hit output shall be stored. E.g. output_path = getwd() to store it in the current working directory, or output_path = file.path("put", "your", "path", "here").

Value

A tibble as returned by the diamond_protein_to_protein_best_reciprocal_hits function, storing the query_ids

in the first column and the subject_ids (reciprocal best hit homologs) in the second column.

Details

Given a set of protein sequences (query sequences), a best hit diamond search (DRBH) is being performed.

Author

Hajk-Georg Drost

Examples

if (FALSE) {
# performing homology inference using the diamond best reciprocal hit (DRBH) method using protein sequences
best_rec_hits <- diamond_protein_to_protein_best_reciprocal_hits(
query   = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'))
# look at results
best_rec_hits

# store the DIAMOND output file to the current working directory
best_rec_hits <- diamond_protein_to_protein_best_reciprocal_hits(
query  = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'),
output_path  = getwd())
# look at results
best_rec_hits

# run diamond_protein_to_protein_best_reciprocal_hits() with multiple cores
best_rec_hits <- diamond_protein_to_protein_best_reciprocal_hits(
query   = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'),
cores   = 2)
# look at results
best_rec_hits

# performing homology inference using the diamond best hit (DRBH) method and
# specifying the path to the DIAMOND executable (here miniconda path)
best_rec_hits <- diamond_protein_to_protein_best_reciprocal_hits(
query  = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'),
diamond_exec_path = "/opt/miniconda3/bin/")
# look at results
best_rec_hits
}