Run a protein-to-protein DIAMOND2 search of reference sequences against a blast-able database or fasta file.

diamond_protein_to_protein(
  query,
  subject,
  output_path = NULL,
  is_subject_db = FALSE,
  task = "blastp",
  sensitivity_mode = "ultra-sensitive",
  use_arrow_duckdb_connection = FALSE,
  evalue = 0.001,
  out_format = "csv",
  cores = 1,
  max_target_seqs = 500,
  hard_mask = TRUE,
  diamond_exec_path = NULL,
  add_makedb_options = NULL,
  add_diamond_options = NULL
)

Arguments

query

path to input file in fasta format.

subject

path to subject file in fasta format or blast-able database.

output_path

path to the folder in which the DIAMOND2 output table shall be stored. Default is output_path = NULL, in which case getwd() is used.

is_subject_db

logical specifying whether or not the subject file is a file in fasta format (is_subject_db = FALSE; default) or a fasta file that was previously converted into a blast-able database using diamond makedb (is_subject_db = TRUE).
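
When the same subject file is searched repeatedly, building the database once and passing it with is_subject_db = TRUE avoids rebuilding it on every run. The following is a minimal sketch, assuming the diamond executable is available on the system path. The helper build_makedb_args and the file names are hypothetical and not part of rdiamond, and whether the subject path must include the .dmnd extension is an assumption to check against your installation.

```r
# Hypothetical helper: assemble the arguments for 'diamond makedb'.
# '--in' names the input fasta file and '--db' the output database.
build_makedb_args <- function(fasta_file, db_name) {
  c("makedb", "--in", fasta_file, "--db", db_name)
}

makedb_args <- build_makedb_args("sbj_aa.fa", "sbj_aa_db")
print(makedb_args)

# Only attempt the external call when diamond is installed and the
# (placeholder) fasta file exists.
if (nzchar(Sys.which("diamond")) && file.exists("sbj_aa.fa")) {
  system2("diamond", args = makedb_args)
  # The resulting database could then be passed on as (assumed usage):
  # diamond_protein_to_protein(query = "qry_aa.fa",
  #                            subject = "sbj_aa_db.dmnd",
  #                            is_subject_db = TRUE)
}
```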

task

protein search task option. Options are:

  • task = "blastp" : Standard protein-protein comparisons (default).

sensitivity_mode

specifies the level of alignment sensitivity. The higher the sensitivity level, the more distant homologs can be found, but at the cost of reduced computational speed.

  • sensitivity_mode = "faster" : fastest alignment mode, but least sensitive. Designed for finding hits of >70% identity.

  • sensitivity_mode = "default" : DIAMOND's default mode. Designed for finding hits of >70% identity.

  • sensitivity_mode = "fast" : fast alignment mode, less sensitive than the default mode. Designed for finding hits of >70% identity.

  • sensitivity_mode = "mid-sensitive" : fast alignments between the fast mode and the sensitive mode in sensitivity.

  • sensitivity_mode = "sensitive" : fast alignments, but full sensitivity for hits of >40% identity.

  • sensitivity_mode = "more-sensitive" : more sensitive than the sensitive mode.

  • sensitivity_mode = "very-sensitive" : more sensitive than the more-sensitive mode.

  • sensitivity_mode = "ultra-sensitive" : most sensitive alignment mode, with sensitivity as high as BLASTP (default).
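
The modes listed above correspond to DIAMOND command line flags of the same name, while the default mode is run without any sensitivity flag. The following sketch shows one way such a value could be validated and translated. The helper sensitivity_flag is hypothetical and not part of rdiamond; the package's internal handling may differ.

```r
# Modes listed above, roughly ordered from fastest to most sensitive.
sensitivity_modes <- c("faster", "fast", "default", "mid-sensitive",
                       "sensitive", "more-sensitive", "very-sensitive",
                       "ultra-sensitive")

# Hypothetical helper: validate a mode and return the matching diamond
# command line flag. The default mode needs no flag, so "" is returned.
sensitivity_flag <- function(sensitivity_mode = "ultra-sensitive") {
  mode <- match.arg(sensitivity_mode, choices = sensitivity_modes)
  if (mode == "default") "" else paste0("--", mode)
}

print(sensitivity_flag("sensitive"))
print(sensitivity_flag("default"))
```

An unknown value such as "slow" raises an error from match.arg() before any external call is made, which fails early rather than handing an invalid flag to diamond.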

use_arrow_duckdb_connection

shall the DIAMOND2 hit output table be transformed into an in-process (big data disk-processing) arrow connection to DuckDB? This is useful when the DIAMOND2 output table is too large to fit into memory. Default is use_arrow_duckdb_connection = FALSE. Please consult the Installation Vignette for details.

evalue

Expectation value (E) threshold for saving hits (default: evalue = 0.001).

out_format

a character string specifying the format of the file in which the DIAMOND results shall be stored. Available options are:

  • out_format = "pair" : Pairwise

  • out_format = "xml" : XML

  • out_format = "csv" : Comma-separated file

cores

number of cores for parallel DIAMOND searches (default: cores = 1).

max_target_seqs

maximum number of aligned sequences that shall be retained. Please be aware that max_target_seqs selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
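
Given the caveat above, one option is to keep max_target_seqs generous and select the best hit per query afterwards. The following is a minimal sketch in base R. The toy table and the column names query_id, subject_id and evalue are assumptions for illustration and may not match the columns of the actual output table.

```r
# Toy hit table standing in for a DIAMOND result (assumed column names).
hits <- data.frame(
  query_id   = c("q1", "q1", "q2", "q2", "q2"),
  subject_id = c("s1", "s2", "s3", "s4", "s5"),
  evalue     = c(1e-10, 1e-30, 1e-5, 1e-50, 1e-20),
  stringsAsFactors = FALSE
)

# Sort by query and ascending e-value, then keep the first row per query.
hits_sorted <- hits[order(hits$query_id, hits$evalue), ]
best_hits <- hits_sorted[!duplicated(hits_sorted$query_id), ]
rownames(best_hits) <- NULL

print(best_hits)
```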

hard_mask

shall low-complexity regions be hard masked with TANTAN? Default is hard_mask = TRUE.

diamond_exec_path

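a path to the DIAMOND executable or conda/miniconda folder. Default is diamond_exec_path = NULL, in which case the executable is assumed to be available via the system path.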

add_makedb_options

a character string specifying additional makedb options that shall be passed on to the diamond makedb command line call, e.g. add_makedb_options = "--taxonnames" (Default is add_makedb_options = NULL).

add_diamond_options

a character string specifying additional diamond options that shall be passed on to the diamond command line call, e.g. add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits" (Default is add_diamond_options = NULL).

Author

Hajk-Georg Drost

Examples

if (FALSE) {
# run diamond assuming that the diamond executable is available
# via the system path ('diamond_exec_path = NULL') and using
# sensitivity_mode = "ultra-sensitive"
diamond_example <- diamond_protein_to_protein(
              query   = system.file('seqs/qry_aa.fa', package = 'rdiamond'),
              subject = system.file('seqs/sbj_aa.fa', package = 'rdiamond'),
              sensitivity_mode = "ultra-sensitive",
              output_path = tempdir(),
              use_arrow_duckdb_connection  = FALSE)

# look at DIAMOND results
diamond_example

# run diamond assuming that the diamond executable is available
# via the miniconda path ('diamond_exec_path = "/opt/miniconda3/bin/"')
# and using 2 cores as well as sensitivity_mode = "ultra-sensitive"
diamond_example_conda <- diamond_protein_to_protein(
              query   = system.file('seqs/qry_aa.fa', package = 'rdiamond'),
              subject = system.file('seqs/sbj_aa.fa', package = 'rdiamond'),
              sensitivity_mode = "ultra-sensitive",
              diamond_exec_path = "/opt/miniconda3/bin/",
              output_path = tempdir(),
              use_arrow_duckdb_connection  = FALSE,
              cores = 2)

# look at DIAMOND results
diamond_example_conda

# run diamond assuming that the diamond executable is available
# via the system path ('diamond_exec_path = NULL') and using
# sensitivity_mode = "ultra-sensitive" and adding command line options:
# "--block-size 4.0 --compress 1 --no-self-hits"
diamond_example_ultra_sensitive_add_diamond_options <- diamond_protein_to_protein(
              query   = system.file('seqs/qry_aa.fa', package = 'rdiamond'),
              subject = system.file('seqs/sbj_aa.fa', package = 'rdiamond'),
              sensitivity_mode = "ultra-sensitive",
              max_target_seqs = 500,
              output_path = tempdir(),
              use_arrow_duckdb_connection  = FALSE,
              add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits",
              cores = 1)

# look at DIAMOND results
diamond_example_ultra_sensitive_add_diamond_options

# run diamond assuming that the diamond executable is available
# via the system path ('diamond_exec_path = NULL') and using
# sensitivity_mode = "ultra-sensitive" and adding makedb command line options:
# "--taxonnames"
diamond_example_ultra_sensitive_add_makedb_options <- diamond_protein_to_protein(
              query   = system.file('seqs/qry_aa.fa', package = 'rdiamond'),
              subject = system.file('seqs/sbj_aa.fa', package = 'rdiamond'),
              sensitivity_mode = "ultra-sensitive",
              max_target_seqs = 500,
              output_path = tempdir(),
              use_arrow_duckdb_connection  = FALSE,
              add_makedb_options = "--taxonnames",
              cores = 1)

# look at DIAMOND results
diamond_example_ultra_sensitive_add_makedb_options
}