This function performs a BLAST search between query and subject sequences and returns only the best hit based on the following criteria.
A best BLAST hit is defined as:
the hit with the smallest e-value;
if e-values are identical, the hit with the longest alignment length is chosen.
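The two criteria above can be illustrated with a minimal base R sketch. It is not the package's internal implementation. The data frame `hits`, its column names (`query_id`, `subject_id`, `evalue`, `alig_length`) and the helper `select_best_hit()` are assumptions made for illustration only.

```r
# Hypothetical BLAST hit table; column names are assumptions for this sketch.
hits <- data.frame(
  query_id    = c("q1", "q1", "q1", "q2", "q2"),
  subject_id  = c("s1", "s2", "s3", "s4", "s5"),
  evalue      = c(1e-50, 1e-50, 1e-10, 1e-5, 1e-20),
  alig_length = c(120, 300, 500, 80, 90),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one best hit per query.
select_best_hit <- function(hits) {
  # Sort by query, then ascending e-value, then descending alignment length,
  # so that ties in e-value are broken by the longest alignment.
  ord <- order(hits$query_id, hits$evalue, -hits$alig_length)
  sorted <- hits[ord, , drop = FALSE]
  # The first row of each query is now its best hit.
  best <- sorted[!duplicated(sorted$query_id), , drop = FALSE]
  rownames(best) <- NULL
  best
}

best <- select_best_hit(hits)
print(best)
```

In this toy table, the two hits of `q1` with identical e-values are separated by alignment length, so the longer alignment is kept.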
blast_best_hit(
  query,
  subject,
  search_type = "nucleotide_to_nucleotide",
  strand = "both",
  output.path = NULL,
  is.subject.db = FALSE,
  task = "blastn",
  db.import = FALSE,
  postgres.user = NULL,
  evalue = 0.001,
  out.format = "csv",
  cores = 1,
  max.target.seqs = 10000,
  db.soft.mask = FALSE,
  db.hard.mask = FALSE,
  blast.path = NULL
)
path to input file in fasta format.
path to subject file in fasta format or blast-able database.
type of query and subject sequences that will be compared via BLAST search. Options are:
search_type = "nucleotide_to_nucleotide"
search_type = "nucleotide_to_protein"
search_type = "protein_to_nucleotide"
search_type = "protein_to_protein"
Query DNA strand(s) to search against database/subject. Options are:
strand = "both" (default): query against both DNA strands.
strand = "minus": query against minus DNA strand.
strand = "plus": query against plus DNA strand.
path to the folder in which the BLAST output table shall be stored. Default is output.path = NULL (hence getwd() is used).
logical specifying whether the subject file is a file in fasta format (is.subject.db = FALSE; default) or a blast-able database that was formatted with makeblastdb (is.subject.db = TRUE).
BLAST search task option (depending on the selected search_type). Options are:
search_type = "nucleotide_to_nucleotide"
task = "blastn": Standard nucleotide-nucleotide comparisons (default) - Traditional BLASTN requiring an exact match of 11.
task = "blastn-short": Optimized nucleotide-nucleotide comparisons for query sequences shorter than 50 nucleotides.
task = "dc-megablast": Discontiguous megablast used to find somewhat distant sequences.
task = "megablast": Traditional megablast used to find very similar (e.g., intraspecies or closely related species) sequences.
task = "rmblastn"
search_type = "nucleotide_to_protein"
task = "blastx": Standard nucleotide-protein comparisons (default).
task = "blastx-fast": Optimized nucleotide-protein comparisons.
search_type = "protein_to_nucleotide"
task = "tblastn": Standard protein-nucleotide comparisons (default).
task = "tblastn-fast": Optimized protein-nucleotide comparisons.
search_type = "protein_to_protein"
task = "blastp": Standard protein-protein comparisons (default).
task = "blastp-fast": Improved BLAST searches using longer words for protein seeding.
task = "blastp-short": Optimized protein-protein comparisons for query sequences shorter than 30 residues.
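Because the valid task values depend on search_type, it can help to check the combination before starting a long search. The following is a minimal sketch, not part of the package: the lookup list `valid_tasks` and the helper `is_valid_task()` are hypothetical names, and the protein-protein fast task is written with its BLAST+ name `blastp-fast`.

```r
# Hypothetical lookup of the task values documented for each search_type.
valid_tasks <- list(
  nucleotide_to_nucleotide = c("blastn", "blastn-short", "dc-megablast",
                               "megablast", "rmblastn"),
  nucleotide_to_protein    = c("blastx", "blastx-fast"),
  protein_to_nucleotide    = c("tblastn", "tblastn-fast"),
  # BLAST+ names the fast protein-protein task "blastp-fast".
  protein_to_protein       = c("blastp", "blastp-fast", "blastp-short")
)

# Hypothetical helper: TRUE if `task` is listed for `search_type`.
is_valid_task <- function(search_type, task) {
  if (!search_type %in% names(valid_tasks)) {
    stop("Unknown search_type: ", search_type)
  }
  task %in% valid_tasks[[search_type]]
}

is_valid_task("nucleotide_to_nucleotide", "megablast")
is_valid_task("protein_to_protein", "blastn")
```

The first call returns TRUE and the second returns FALSE, since blastn is a nucleotide-nucleotide task.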
shall the BLAST output be stored in a PostgreSQL database and shall a connection be established to this database? Default is db.import = FALSE.
If users wish to only generate a BLAST output file without importing it into the current R session, they can specify db.import = NULL.
when db.import = TRUE and out.format = "postgres" is selected, the BLAST output is imported and stored in a PostgreSQL database. In that case, users need to have PostgreSQL installed and initialized on their system.
Please consult the Installation Vignette for details.
Expectation value (E) threshold for saving hits (default: evalue = 0.001).
a character string specifying the format of the file in which the BLAST results shall be stored. Available options are:
out.format = "pair": Pairwise
out.format = "qa.ident": Query-anchored showing identities
out.format = "qa.nonident": Query-anchored no identities
out.format = "fq.ident": Flat query-anchored showing identities
out.format = "fq.nonident": Flat query-anchored no identities
out.format = "xml": XML
out.format = "tab": Tabular separated file
out.format = "tab.comment": Tabular separated file with comment lines
out.format = "ASN.1.text": Seqalign (Text ASN.1)
out.format = "ASN.1.binary": Seqalign (Binary ASN.1)
out.format = "csv": Comma-separated values
out.format = "ASN.1": BLAST archive (ASN.1)
out.format = "json.seq.aln": Seqalign (JSON)
out.format = "json.blast.multi": Multiple-file BLAST JSON
out.format = "xml2.blast.multi": Multiple-file BLAST XML2
out.format = "json.blast.single": Single-file BLAST JSON
out.format = "xml2.blast.single": Single-file BLAST XML2
out.format = "SAM": Sequence Alignment/Map (SAM)
out.format = "report": Organism Report
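The options above are listed in the same order as the numeric `-outfmt` codes of the BLAST+ command line tools (0 to 18). The sketch below builds that lookup in base R. It assumes the names map one-to-one onto these codes, which is an inference from the list order and not a documented guarantee of this function. The object names `out_formats` and `outfmt_code` are hypothetical.

```r
# out.format names in the order documented above.
out_formats <- c(
  "pair", "qa.ident", "qa.nonident", "fq.ident", "fq.nonident",
  "xml", "tab", "tab.comment", "ASN.1.text", "ASN.1.binary",
  "csv", "ASN.1", "json.seq.aln", "json.blast.multi", "xml2.blast.multi",
  "json.blast.single", "xml2.blast.single", "SAM", "report"
)

# Assumed mapping to BLAST+ -outfmt codes, which start at 0.
outfmt_code <- setNames(seq_along(out_formats) - 1L, out_formats)

outfmt_code[["csv"]]  # code assumed for the default out.format = "csv"
outfmt_code[["tab"]]
```

Under this assumption the default "csv" corresponds to `-outfmt 10` and "tab" to `-outfmt 6`.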
number of cores for parallel BLAST searches.
maximum number of aligned sequences that shall be retained. Please be aware that max.target.seqs selects best hits based on the order of the database entries and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
shall low complexity regions be soft masked? Default is db.soft.mask = FALSE.
shall low complexity regions be hard masked? Default is db.hard.mask = FALSE.
path to BLAST executables.
if (FALSE) {
  # run a nucleotide-to-nucleotide best hit search on the example files
  blast_best_test <- blast_best_hit(
    query = system.file('seqs/qry_nn.fa', package = 'metablastr'),
    subject = system.file('seqs/sbj_nn_best_hit.fa', package = 'metablastr'),
    search_type = "nucleotide_to_nucleotide",
    output.path = tempdir(),
    db.import = FALSE)
  # look at results
  blast_best_test
}