This function clusters a set of protein (or CDS) sequences using the
DIAMOND2 deepclust algorithm. It prepares a DIAMOND2 database via
set_diamond, runs diamond deepclust, and returns a
two-column tibble mapping each representative sequence to its cluster members.
The result is also written to a TSV file.
Note, we recommend users to run deepclust_annotate() for more information
about the cluster members and their representative sequences.
Usage
deepclust(
input_file,
seq_type = "protein",
format = "fasta",
delete_corrupt_cds = TRUE,
path = NULL,
comp_cores = 1,
mutual_cover = NULL,
approx_id = NULL,
eval = NULL,
deepclust_params = NULL,
save.output = NULL,
quiet = TRUE
)Arguments
- input_file
a character string, or a character vector of file paths, specifying the input sequence file(s). When multiple paths are provided the sequences are merged into a single temporary FASTA before clustering, allowing proteomes from different organisms to be clustered together. Both plain and gzip-compressed (
.gz) files are supported and may be mixed freely. Supported sequence types are controlled byseq_type.- seq_type
a character string specifying the sequence type stored in the input file(s). Options are:
"cds","protein", or"dna". In case of"cds", sequences are translated to protein sequences before clustering. Default isseq_type = "protein".- format
a character string specifying the file format of the sequence file. Default is
format = "fasta".- delete_corrupt_cds
a logical value indicating whether sequences with corrupt base triplets should be removed from the input file. Only relevant when
seq_type = "cds". Default isdelete_corrupt_cds = TRUE.- path
a character string specifying the path to the DIAMOND2 executable (in case it is not on the system
PATH).- comp_cores
a numeric value specifying the number of CPU threads to use. Passed to
--threads. Default iscomp_cores = 1.- mutual_cover
a numeric value (percentage) specifying the minimum mutual coverage of the cluster member and representative sequence. Passed to
--mutual-cover. Default isNULL(DIAMOND2 default applies).- approx_id
a numeric value (percentage) specifying the minimum approximate identity to report an alignment or to cluster sequences. Passed to
--approx-id. Default isNULL(DIAMOND2 default applies).- eval
a numeric value specifying the maximum E-value to report alignments. Passed to
--evalue. Default isNULL(DIAMOND2 default of 0.001 applies).- deepclust_params
a character string of additional DIAMOND2 deepclust parameters to append to the
deepclustcall. Default isNULL.- save.output
a character string specifying the path where the output TSV file should be saved. E.g.
save.output = getwd()to save in the current working directory. Default isNULL(file is only written to a temporary directory).- quiet
a logical value indicating whether DIAMOND2 should run in quiet mode. Default is
quiet = TRUE.
Value
A tibble with two columns:
representative_id— accession of the cluster representative sequence.member_id— accession of the cluster member sequence.
Details
This function wraps diamond deepclust, the graph-based sequence clustering
algorithm available in DIAMOND2 v2.1.0 and later. The output is a two-column
tab-separated file where the first column contains the representative (centroid)
sequence accession and the second column contains the cluster member accession.
A sequence that is its own representative will appear with itself in both columns.
When multiple files are supplied via input_file, they are first merged
into a single temporary FASTA file (using seqinr::read.fasta and
seqinr::write.fasta) before the DIAMOND2 database is built. This allows
cross-species or multi-proteome clustering in a single run. The merged file is
written to the session temporary directory and is not retained after the R
session ends unless save.output is specified.
The function uses set_diamond internally to handle CDS translation
and DIAMOND2 database creation.
References
Buchfink, B., Reuter, K., & Drost, H. G. (2021) "Sensitive protein alignments at tree-of-life scale using DIAMOND." Nature methods, 18(4), 366-368.
https://github.com/bbuchfink/diamond/wiki
Examples
if (FALSE) { # \dontrun{
# Cluster a single proteome (protein sequences, the default)
deepclust(
input_file = system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr')
)
# Cluster two proteomes together in a single run
deepclust(
input_file = c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
)
)
# Apply identity and coverage thresholds across multiple proteomes
deepclust(
input_file = c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
),
approx_id = 50,
mutual_cover = 80,
eval = 1e-5,
comp_cores = 4
)
# Save the cluster TSV to the current working directory
deepclust(
input_file = c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
),
save.output = getwd()
)
# Cluster CDS sequences (translated to protein internally)
deepclust(
input_file = system.file('seqs/ortho_thal_cds.fasta', package = 'orthologr'),
seq_type = "cds"
)
} # }
