
Annotate deepclust output with source file metadata
Source: R/deepclust_annotate.R
deepclust_annotate.RdRuns deepclust across one or more sequence files,
then annotates each cluster member with the source file it originated from.
The result is a long-format tibble linking every
(representative_id, member_id) pair to its source file_name,
making it straightforward to compare cluster membership across multiple
proteomes or organisms. Optionally, per-pair alignment statistics from
deepclust_realign can be incorporated without re-running
diamond deepclust.
Usage
deepclust_annotate(
input_file,
seq_type = "protein",
format = "fasta",
delete_corrupt_cds = TRUE,
path = NULL,
comp_cores = 1,
mutual_cover = 80,
approx_id = NULL,
eval = NULL,
deepclust_params = NULL,
quiet = TRUE,
cluster_table = NULL,
realign = NULL
)Arguments
- input_file
a character string, a character vector of file paths, or a named list mapping custom labels to file paths. When a named list is supplied (e.g.
list("thal" = "/path/to/thal.fasta", "lyra" = "/path/to/lyra.fasta")), the list names are used as thefile_namecolumn in the output instead of the base names of the files. This is useful when the file names are long or uninformative and you want cleaner labels in downstream analyses. Files are passed directly todeepclustfor clustering, and are also scanned individually to build a sequence-ID-to-file mapping (the "sequence library"). Ignored whencluster_tableis supplied (the sequence library is still built from these paths in that case).- seq_type
a character string specifying the sequence type stored in the input file(s). Options are:
"cds","protein", or"dna". Default isseq_type = "protein".- format
a character string specifying the file format of the sequence file. Default is
format = "fasta".- delete_corrupt_cds
a logical value indicating whether sequences with corrupt base triplets should be removed from the input file. Only relevant when
seq_type = "cds". Default isdelete_corrupt_cds = TRUE.- path
a character string specifying the path to the DIAMOND2 executable (in case it is not on the system
PATH).- comp_cores
a numeric value specifying the number of CPU threads to use. Passed to
--threads. Default iscomp_cores = 1.- mutual_cover
a numeric value (percentage) specifying the minimum mutual coverage between a cluster member and its representative. Passed to
--mutual-cover. Default ismutual_cover = 80.- approx_id
a numeric value (percentage) specifying the minimum approximate identity for clustering. Passed to
--approx-id. Default isNULL(DIAMOND2 default applies).- eval
a numeric value specifying the maximum E-value. Passed to
--evalue. Default isNULL(DIAMOND2 default of 0.001 applies).- deepclust_params
a character string of additional DIAMOND2 deepclust parameters to append to the
deepclustcall. Default isNULL.- quiet
a logical value indicating whether DIAMOND2 should run in quiet mode. Default is
quiet = TRUE.- cluster_table
an optional
tibble(ordata.frame) with columnsrepresentative_idandmember_id, as returned by a previous call todeepclustordeepclust_annotate. When supplied,diamond deepclustis not re-run; the provided table is used directly as the clustering result. This is particularly useful when combined withrealignto annotate an existing clustering with alignment statistics without incurring the cost of a second deepclust run. Default iscluster_table = NULL.- realign
an optional
tibblereturned bydeepclust_realign. When supplied, the alignment statistics (approx_pident,evalue,bitscore,qcovhsp,scovhsp) are left-joined onto the cluster profile by(representative_id, member_id), enriching each row with per-pair alignment information. Combine withcluster_tableto annotate a previously computed clustering without re-runningdiamond deepclust. Default isrealign = NULL.
Value
A tibble with three columns (plus five
additional alignment columns when realign is supplied):
representative_id— accession of the cluster representative.member_id— accession of the cluster member.file_name— base name of the source file from which the member sequence originated.approx_pident— (only withrealign) approximate percentage of identical matches between representative and member.evalue— (only withrealign) expect value.bitscore— (only withrealign) bit score.qcovhsp— (only withrealign) query (representative) coverage per HSP.scovhsp— (only withrealign) subject (member) coverage per HSP.
Details
The function performs up to four steps:
Build sequence library. Each input file is read with
seqinr::read.fastato extract the sequence IDs (FASTA header names, i.e. themember_idvalues that DIAMOND2 will produce). These are collected into a two-column tibble ofmember_idandfile_name(the base name of the source file).Run deepclust (skipped when
cluster_tableis supplied).deepclustis called with all suppliedinput_filepaths and the specified clustering parameters.Join. The sequence library is left-joined onto the deepclust output on
member_id, annotating every cluster member with its source file.Join realign (only when
realignis supplied). The alignment statistics fromdeepclust_realignare left-joined onto the profile by(representative_id, member_id).
The resulting long-format tibble can be pivoted to a wide presence/absence
matrix (one row per cluster, one column per organism/file) using
tidyr::pivot_wider.
References
Buchfink, B., Reuter, K., & Drost, H. G. (2021) "Sensitive protein alignments at tree-of-life scale using DIAMOND." Nature methods, 18(4), 366-368.
https://github.com/bbuchfink/diamond/wiki
Examples
if (FALSE) { # \dontrun{
# Profile two proteomes with default settings
profile <- deepclust_annotate(
input_file = c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
)
)
# Use a named list to control the file_name column labels
profile <- deepclust_annotate(
input_file = list(
"thal" = system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
"lyra" = system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
)
)
profile
# Pivot to a presence/absence matrix
library(tidyr)
library(dplyr)
profile |>
distinct(representative_id, file_name) |>
mutate(present = 1L) |>
pivot_wider(
names_from = file_name,
values_from = present,
values_fill = 0L
)
# Stricter clustering with custom thresholds
deepclust_annotate(
input_file = c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
),
approx_id = 50,
mutual_cover = 90,
comp_cores = 4
)
# Enrich an existing cluster profile with alignment stats from
# deepclust_realign, without re-running diamond deepclust
input_files <- c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
)
clusters <- deepclust(input_file = input_files)
realign_stats <- deepclust_realign(input_file = input_files,
clusters = clusters)
profile_enriched <- deepclust_annotate(
input_file = input_files,
cluster_table = clusters,
realign = realign_stats
)
} # }