diamond_protein_to_protein_best_hits.Rd
This function performs a DIAMOND search (best hit) of a given set of protein sequences against a given database.
diamond_protein_to_protein_best_hits(
query,
subject,
is_subject_db = FALSE,
format = "fasta",
sensitivity_mode = "ultra-sensitive",
out_format = "csv",
evalue = "1E-5",
max_target_seqs = 5000,
cores = 1,
hard_mask = TRUE,
diamond_exec_path = NULL,
add_makedb_options = NULL,
add_diamond_options = NULL,
output_path = NULL
)
a character string specifying the path to the protein sequence file of interest (query organism).
a character string specifying the path to the protein sequence file of interest (subject organism).
logical specifying whether or not the subject
file is a file in fasta
format (is_subject_db = FALSE
; default)
or a fasta
file that was previously converted into a blast-able database using diamond makedb
(is_subject_db = TRUE
).
a character string specifying the file format of the sequence file, e.g. format
= "fasta"
.
Default is format
= "fasta"
.
specify the level of alignment sensitivity. The higher the sensitivity level, the more deep homologs can be found, but at the cost of reduced computational speed.
sensitivity_mode = "faster"
: fastest alignment mode, but least sensitive (default). Designed for finding hits of >70
sensitivity_mode = "default"
: Default mode. Designed for finding hits of >70
sensitivity_mode = "fast"
: fastest alignment mode, but least sensitive (default). Designed for finding hits of >70
sensitivity_mode = "mid-sensitive"
: fast alignments between the fast
mode and the sensitive mode in sensitivity.
sensitivity_mode = "sensitive"
: fast alignments, but full sensitivity for hits >40
sensitivity_mode = "more-sensitive"
: more sensitive than the sensitive
mode.
sensitivity_mode = "very-sensitive"
: sensitive alignment mode.
sensitivity_mode = "ultra-sensitive"
: most sensitive alignment mode (sensitivity as high as BLASTP).
a character string specifying the format of the file in which the DIAMOND results shall be stored. Available options are:
out_format = "pair"
: Pairwise
out_format = "xml"
: XML
out_format = "csv"
: Comma-separated file
Expectation value (E) threshold for saving hits (default: evalue = 0.001
).
maximum number of aligned sequences that shall be retained. Please be aware that max_target_seqs
selects best hits based on the database entry and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
number of cores for parallel DIAMOND searches.
shall low complexity regions be hard masked with TANTAN? Default is db_hard_mask = TRUE
.
a path to the DIAMOND executable or conda/miniconda
folder.
a character string specifying additional makedb options that shall be passed on to the diamond makedb command line call, e.g. add_make_options = "--taxonnames"
(Default is add_diamond_options = NULL
).
a character string specifying additional diamond options that shall be passed on to the diamond command line call, e.g. add_diamond_options = "--block-size 4.0 --compress 1 --no-self-hits"
(Default is add_diamond_options = NULL
).
a path to the location were the DIAMOND best hit output shall be stored. E.g. output_path
= getwd()
to store it in the current working directory, or output_path
= file.path("put", "your", "path", "here")
.
A tibble as returned by the diamond_best_hits
function, storing the query_ids
in the first column and the subject_ids
(best hit homologs) in the second column.
Given a set of protein sequences (query sequences), a best hit diamond search (DBH) is being performed.
if (FALSE) {
# performing homology inference using the diamond best hit (DBH) method using protein sequences
best_hits <- diamond_protein_to_protein_best_hits(
query = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'))
# look at results
best_hits
# store the DIAMOND output file to the current working directory
best_hits <- diamond_protein_to_protein_best_hits(
query = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'),
output_path = getwd())
# look at results
best_hits
# run diamond_best_hits() with multiple cores
best_hits <- diamond_protein_to_protein_best_hits(
query = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'),
cores = 2)
# look at results
best_hits
# performing homology inference using the diamond best hit (DBH) method and
# specifying the path to the DIAMOND executable (here miniconda path)
best_hits <- diamond_protein_to_protein_best_hits(
query = system.file('seqs/ortho_thal_aa.fasta', package = 'rdiamond'),
subject = system.file('seqs/ortho_lyra_aa.fasta', package = 'rdiamond'),
diamond_exec_path = "/opt/miniconda3/bin/")
# look at results
best_hits
}