Perform a DIAMOND best reciprocal hit search

This function performs a DIAMOND search (best reciprocal hit) of a given set of protein sequences against a given database.

diamond_reciprocal_best_hits(
  query,
  subject,
  is_subject_db = FALSE,
  format = "fasta",
  sensitivity_mode = "ultra-sensitive",
  out_format = "csv",
  evalue = "1E-5",
  max_target_seqs = 5000,
  cores = 1,
  hard_mask = TRUE,
  diamond_exec_path = NULL,
  add_makedb_options = NULL,
  add_diamond_options = NULL,
  output_path = NULL
)

Arguments

query	a character string specifying the path to the protein sequence file of interest (query organism).
subject	a character string specifying the path to the protein sequence file of interest (subject organism).
is_subject_db	logical specifying whether or not the `subject` file is a file in `fasta` format (`is_subject_db = FALSE`; default) or a `fasta` file that was previously converted into a blast-able database using `diamond makedb` (`is_subject_db = TRUE`).
format	a character string specifying the file format of the sequence file, e.g. `format` = `"fasta"`. Default is `format` = `"fasta"`.
sensitivity_mode	specify the level of alignment sensitivity. The higher the sensitivity level, the more deep homologs can be found, but at the cost of reduced computational speed. `sensitivity_mode = "fast"` : fastest alignment mode, but least sensitive (default). Designed for finding hits of >70 `sensitivity_mode = "mid-sensitive"` : fast alignments between the `fast` mode and the sensitive mode in sensitivity. `sensitivity_mode = "sensitive"` : fast alignments, but full sensitivity for hits >40 `sensitivity_mode = "more-sensitive"` : more sensitive than the `sensitive` mode. `sensitivity_mode = "very-sensitive"` : sensitive alignment mode. `sensitivity_mode = "ultra-sensitive"` : most sensitive alignment mode (sensitivity as high as BLASTP).
out_format	a character string specifying the format of the file in which the DIAMOND results shall be stored. Available options are: `out_format = "pair"` : Pairwise `out_format = "xml"` : XML `out_format = "csv"` : Comma-separated file
evalue	Expectation value (E) threshold for saving hits (default: `evalue = 0.001`).
max_target_seqs	maximum number of aligned sequences that shall be retained. Please be aware that `max_target_seqs` selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
cores	number of cores for parallel DIAMOND searches.
hard_mask	shall low complexity regions be hard masked with TANTAN? Default is `db_hard_mask = TRUE`.
diamond_exec_path	a path to the DIAMOND executable or `conda/miniconda` folder.
add_makedb_options	a character string specifying additional makedb options that shall be passed on to the diamond makedb command line call, e.g. `add_make_options = "--taxonnames"` (Default is `add_diamond_options = NULL`).
add_diamond_options	a character string specifying additional diamond options that shall be passed on to the diamond command line call, e.g. `add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits"` (Default is `add_diamond_options = NULL`).
output_path	a path to the location were the DIAMOND best hit output shall be stored. E.g. `output_path` = `getwd()` to store it in the current working directory, or `output_path` = `file.path("put", "your", "path", "here")`.

Value

A tibble as returned by the diamond_reciprocal_best_hits function, storing the query_ids in the first column and the subject_ids (reciprocal best hit homologs) in the second column.

Details

Given a set of protein sequences (query sequences), a best hit diamond search (DRBH) is being performed.

Author

Hajk-Georg Drost

Examples

if (FALSE) {
# performing homology inference using the diamond best reciprocal hit (DRBH) method using protein sequences
rec_best_hits <- diamond_reciprocal_best_hits(
query   = system.file('seqs/ortho_thal_aa.fasta', package = 'homologr'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'homologr'),
seq_type     = "protein")
# look at results
rec_best_hits

# store the DIAMOND output file to the current working directory
rec_best_hits <- diamond_reciprocal_best_hits(
query  = system.file('seqs/ortho_thal_aa.fasta', package = 'homologr'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'homologr'),
seq_type     = "protein",
output_path  = getwd())
# look at results
rec_best_hits

# run diamond_reciprocal_best_hits() with multiple cores
rec_best_hits <- diamond_reciprocal_best_hits(
query   = system.file('seqs/ortho_thal_aa.fasta', package = 'homologr'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'homologr'),
cores   = 2)
# look at results
rec_best_hits

# performing homology inference using the diamond best hit (DRBH) method and
# specifying the path to the DIAMOND executable (here miniconda path)
rec_best_hits <- diamond_reciprocal_best_hits(
query  = system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr'),
diamond_exec_path = "/opt/miniconda3/bin/")
# look at results
rec_best_hits
}

Arguments

Value

Details

See also

Author

Examples