Runs diamond realign on an existing set of
deepclust clusters to obtain per-pair alignment statistics
(approximate identity, e-value, bitscore, and query/subject coverages) for
every (member, representative) pair inside each cluster. The result
can be passed directly to deepclust_annotate via its
realign argument to enrich a cluster profile without re-running
diamond deepclust.
Usage
deepclust_realign(
input_file,
clusters,
seq_type = "protein",
format = "fasta",
delete_corrupt_cds = TRUE,
path = NULL,
comp_cores = 1,
realign_params = NULL,
save.output = NULL,
quiet = TRUE
)Arguments
- input_file
a character string, or a character vector of file paths, specifying the input sequence file(s) that were used for the original
deepclustrun. When multiple paths are provided they are merged into a single temporary FASTA before the DIAMOND2 database is built (same behaviour asdeepclust). Both plain and gzip-compressed (.gz) files are supported.- clusters
either
a
characterstring — path to a tab-separateddeepclustoutput file whose first column isrepresentative_idand second column ismember_id, ora
data.frame/tibblewith columnsrepresentative_idandmember_id(i.e. the direct return value ofdeepclustordeepclust_annotate).
- seq_type
a character string specifying the sequence type stored in the input file(s). Options are:
"cds","protein", or"dna". In case of"cds", sequences are translated to protein sequences before building the database. Default isseq_type = "protein".- format
a character string specifying the file format of the sequence file. Default is
format = "fasta".- delete_corrupt_cds
a logical value indicating whether sequences with corrupt base triplets should be removed from the input file. Only relevant when
seq_type = "cds". Default isdelete_corrupt_cds = TRUE.- path
a character string specifying the path to the DIAMOND2 executable (in case it is not on the system
PATH).- comp_cores
a numeric value specifying the number of CPU threads to use. Passed to
--threads. Default iscomp_cores = 1.- realign_params
a character string of additional
diamond realignparameters to append to the command. Default isNULL.- save.output
a character string specifying the path where the output TSV file should be saved. E.g.
save.output = getwd()to save in the current working directory. Default isNULL(file is only written to a temporary directory).- quiet
a logical value indicating whether DIAMOND2 should run in quiet mode. Default is
quiet = TRUE.
Value
A tibble with seven columns:
representative_id— accession of the cluster representative (query in the realignment).member_id— accession of the cluster member (subject in the realignment).approx_pident— approximate percentage of identical matches.evalue— expect value.bitscore— bit score.qcovhsp— query coverage per HSP.scovhsp— subject (representative) coverage per HSP.
Details
This function wraps diamond realign, which re-aligns every cluster
member against its assigned representative using the cluster mapping produced
by deepclust. The output format is BLAST tabular
(--outfmt 6) with the fields:
qseqid sseqid approx_pident evalue bitscore qcovhsp scovhsp.
The --clusters argument consumed by diamond realign expects a
two-column tab-separated file (representative_id TAB member_id).
When a tibble is supplied to clusters, it is written to a temporary
file automatically.
Columns in the returned tibble are named to match those used by
deepclust (representative_id, member_id) so that
the result joins cleanly onto deepclust_annotate output.
Note that diamond realign uses the representative as the
alignment query (qseqid) and the member as the subject
(sseqid); the returned tibble is named accordingly.
References
Buchfink, B., Reuter, K., & Drost, H. G. (2021) "Sensitive protein alignments at tree-of-life scale using DIAMOND." Nature methods, 18(4), 366-368.
https://github.com/bbuchfink/diamond/wiki
Examples
if (FALSE) { # \dontrun{
# 1. Run deepclust first
clusters <- deepclust(
input_file = c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
)
)
# 2. Realign each representative against its cluster members to get alignment stats
realign_result <- deepclust_realign(
input_file = c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
),
clusters = clusters
)
# 3. Alternatively, supply the path to the deepclust TSV directly
realign_result <- deepclust_realign(
input_file = c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
),
clusters = "/path/to/deepclust_output.tsv"
)
# 4. Incorporate into deepclust_annotate without re-running deepclust
profile <- deepclust_annotate(
input_file = c(
system.file('seqs/ortho_thal_aa.fasta', package = 'orthologr'),
system.file('seqs/ortho_lyra_aa.fasta', package = 'orthologr')
),
realign = realign_result
)
} # }
