diamond_protein_to_protein_best_hits.RdThis function performs a DIAMOND search (best hit) of a given set of protein sequences against a given database.
diamond_protein_to_protein_best_hits(
query,
subject,
is_subject_db = FALSE,
format = "fasta",
sensitivity_mode = "ultra-sensitive",
out_format = "csv",
evalue = "1E-5",
max_target_seqs = 5000,
cores = 1,
hard_mask = TRUE,
diamond_exec_path = NULL,
add_makedb_options = NULL,
add_diamond_options = NULL,
output_path = NULL
)a character string specifying the path to the protein sequence file of interest (query organism).
a character string specifying the path to the protein sequence file of interest (subject organism).
logical specifying whether or not the subject file is a file in fasta format (is_subject_db = FALSE; default)
or a fasta file that was previously converted into a blast-able database using diamond makedb (is_subject_db = TRUE).
a character string specifying the file format of the sequence file, e.g. format = "fasta".
Default is format = "fasta".
specify the level of alignment sensitivity. The higher the sensitivity level, the more deep homologs can be found, but at the cost of reduced computational speed.
sensitivity_mode = "faster" : fastest alignment mode, but least sensitive (default). Designed for finding hits of >70
sensitivity_mode = "default" : Default mode. Designed for finding hits of >70
sensitivity_mode = "fast" : fastest alignment mode, but least sensitive (default). Designed for finding hits of >70
sensitivity_mode = "mid-sensitive" : fast alignments between the fast mode and the sensitive mode in sensitivity.
sensitivity_mode = "sensitive" : fast alignments, but full sensitivity for hits >40
sensitivity_mode = "more-sensitive" : more sensitive than the sensitive mode.
sensitivity_mode = "very-sensitive" : sensitive alignment mode.
sensitivity_mode = "ultra-sensitive" : most sensitive alignment mode (sensitivity as high as BLASTP).
a character string specifying the format of the file in which the DIAMOND results shall be stored. Available options are:
out_format = "pair" : Pairwise
out_format = "xml" : XML
out_format = "csv" : Comma-separated file
Expectation value (E) threshold for saving hits (default: evalue = 0.001).
maximum number of aligned sequences that shall be retained. Please be aware that max_target_seqs selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
number of cores for parallel DIAMOND searches.
shall low complexity regions be hard masked with TANTAN? Default is db_hard_mask = TRUE.
a path to the DIAMOND executable or conda/miniconda folder.
a character string specifying additional makedb options that shall be passed on to the diamond makedb command line call, e.g. add_make_options = "--taxonnames" (Default is add_diamond_options = NULL).
a character string specifying additional diamond options that shall be passed on to the diamond command line call, e.g. add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits" (Default is add_diamond_options = NULL).
a path to the location were the DIAMOND best hit output shall be stored. E.g. output_path = getwd()
to store it in the current working directory, or output_path = file.path("put", "your", "path", "here").
A tibble as returned by the diamond_best_hits function, storing the query_ids
in the first column and the subject_ids (best hit homologs) in the second column.
Given a set of protein sequences (query sequences), a best hit diamond search (DBH) is being performed.
if (FALSE) {
# performing homology inference using the diamond best hit (DBH) method using protein sequences
best_hits <- diamond_protein_to_protein_best_hits(
query = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'))
# look at results
best_hits
# store the DIAMOND output file to the current working directory
best_hits <- diamond_protein_to_protein_best_hits(
query = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'),
output_path = getwd())
# look at results
best_hits
# run diamond_best_hits() with multiple cores
best_hits <- diamond_protein_to_protein_best_hits(
query = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'),
cores = 2)
# look at results
best_hits
# performing homology inference using the diamond best hit (DBH) method and
# specifying the path to the DIAMOND executable (here miniconda path)
best_hits <- diamond_protein_to_protein_best_hits(
query = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'),
diamond_exec_path = "/opt/miniconda3/bin/")
# look at results
best_hits
}