Run a protein-to-protein BLAST search of reference sequences against the NCBI non-redundant (NR) protein sequence database.

blast_protein_to_nr_database(
  query,
  nr.database,
  nr.needs.formatting = FALSE,
  output.path = NULL,
  task = "blastp",
  db.import = FALSE,
  postgres.user = NULL,
  evalue = 1e-05,
  out.format = "csv",
  cores = 1,
  max.target.seqs = 65200,
  db.soft.mask = TRUE,
  db.hard.mask = TRUE,
  blast.path = NULL
)

Arguments

query

path to input file in fasta format.

nr.database

path to local NCBI NR database.

nr.needs.formatting

a logical value indicating whether the local database specified in nr.database is already formatted (nr.needs.formatting = FALSE; default) or whether the local NCBI NR database still requires formatting (nr.needs.formatting = TRUE).

output.path

path to the folder in which the BLAST output table shall be stored. Default is output.path = NULL, in which case getwd() is used.

task

protein search task option. Options are:

  • task = "blastp" : Standard protein-protein comparisons (default).

  • task = "blastp-fast" : Improved BLAST searches using longer words for protein seeding.

  • task = "blastp-short" : Optimized protein-protein comparisons for query sequences shorter than 30 residues.
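
As a rough illustration of choosing between these options, the sketch below suggests a task value from the sequence lengths in a FASTA file, using the 30-residue threshold stated above. The helper suggest_blastp_task() and its short_cutoff argument are hypothetical and not part of metablastr; the function only returns a string that could then be passed as task.

```r
# Hypothetical helper (not part of metablastr): suggest a `task` value
# based on the lengths of the sequences in a FASTA file.
suggest_blastp_task <- function(fasta_file, short_cutoff = 30) {
  lines <- trimws(readLines(fasta_file))
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  # Assign every sequence line to the most recent header line.
  record_id <- cumsum(is_header)
  seq_lengths <- tapply(nchar(lines[!is_header]), record_id[!is_header], sum)
  # Only suggest "blastp-short" when every query is below the cutoff.
  if (length(seq_lengths) > 0 && all(seq_lengths < short_cutoff)) {
    "blastp-short"
  } else {
    "blastp"
  }
}

# Toy input: two short peptides, the second one wrapped over two lines.
toy_fasta <- tempfile(fileext = ".fa")
writeLines(c(">pep1", "MKTAYIAKQR", ">pep2", "GSHMLEDPVAGT", "KLMN"), toy_fasta)
suggested_task <- suggest_blastp_task(toy_fasta)
```

A file that mixes short and long queries falls back to "blastp" here; splitting such a file by length before searching would be an alternative design.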

db.import

shall the BLAST output be stored in a PostgreSQL database and shall a connection be established to this database? Default is db.import = FALSE. Users who only wish to generate a BLAST output file without importing it into the current R session can specify db.import = NULL.

postgres.user

when db.import = TRUE and out.format = "postgres" is selected, the BLAST output is imported and stored in a PostgreSQL database. In that case, users need to have PostgreSQL installed and initialized on their system. Please consult the Installation Vignette for details.

evalue

Expectation value (E) threshold for saving hits (default: evalue = 1e-05).

out.format

a character string specifying the format of the file in which the BLAST results shall be stored. Available options are:

  • out.format = "pair" : Pairwise

  • out.format = "qa.ident" : Query-anchored showing identities

  • out.format = "qa.nonident" : Query-anchored no identities

  • out.format = "fq.ident" : Flat query-anchored showing identities

  • out.format = "fq.nonident" : Flat query-anchored no identities

  • out.format = "xml" : XML

  • out.format = "tab" : Tab-separated file

  • out.format = "tab.comment" : Tab-separated file with comment lines

  • out.format = "ASN.1.text" : Seqalign (Text ASN.1)

  • out.format = "ASN.1.binary" : Seqalign (Binary ASN.1)

  • out.format = "csv" : Comma-separated values

  • out.format = "ASN.1" : BLAST archive (ASN.1)

  • out.format = "json.seq.aln" : Seqalign (JSON)

  • out.format = "json.blast.multi" : Multiple-file BLAST JSON

  • out.format = "xml2.blast.multi" : Multiple-file BLAST XML2

  • out.format = "json.blast.single" : Single-file BLAST JSON

  • out.format = "xml2.blast.single" : Single-file BLAST XML2

  • out.format = "report" : Organism Report
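
The options above follow the same order as the numeric codes of the BLAST+ -outfmt flag (0 to 16 and 18; code 17, SAM, is not among the listed options). The lookup below is a sketch of that correspondence. The names outfmt_codes and blast_outfmt_code() are hypothetical, and whether metablastr translates the option names in exactly this way internally is an assumption.

```r
# Assumed correspondence between the out.format names listed above and
# the numeric codes accepted by the BLAST+ -outfmt flag.
outfmt_codes <- c(
  "pair"              = 0L,
  "qa.ident"          = 1L,
  "qa.nonident"       = 2L,
  "fq.ident"          = 3L,
  "fq.nonident"       = 4L,
  "xml"               = 5L,
  "tab"               = 6L,
  "tab.comment"       = 7L,
  "ASN.1.text"        = 8L,
  "ASN.1.binary"      = 9L,
  "csv"               = 10L,
  "ASN.1"             = 11L,
  "json.seq.aln"      = 12L,
  "json.blast.multi"  = 13L,
  "xml2.blast.multi"  = 14L,
  "json.blast.single" = 15L,
  "xml2.blast.single" = 16L,
  "report"            = 18L
)

# Hypothetical helper: validate an out.format name and return its code.
blast_outfmt_code <- function(out.format) {
  if (!out.format %in% names(outfmt_codes)) {
    stop("Unknown out.format: ", out.format)
  }
  outfmt_codes[[out.format]]
}

csv_code <- blast_outfmt_code("csv")
```

Failing early on an unknown name avoids starting a long NR search that would only break once BLAST rejects the format.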

cores

number of cores for parallel BLAST searches.

max.target.seqs

maximum number of aligned sequences that shall be retained. Please be aware that max.target.seqs selects hits based on the order of the database entries and not by the best e-value. See details here: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166 .
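
Because of this behaviour, one possible approach is to keep max.target.seqs generous and to select the best hit per query afterwards. The sketch below does this on a toy hit table in base R. The column names query_id, subject_id and evalue are assumptions made for illustration and may differ from the columns of the actual BLAST output table.

```r
# Toy hit table; the column names are assumed for illustration only.
hits <- data.frame(
  query_id   = c("q1", "q1", "q1", "q2", "q2"),
  subject_id = c("s1", "s2", "s3", "s4", "s5"),
  evalue     = c(1e-20, 1e-50, 1e-5, 1e-8, 1e-30),
  stringsAsFactors = FALSE
)

# Sort by query and ascending e-value, then keep the first row per query,
# i.e. the hit with the smallest e-value.
ordered_hits <- hits[order(hits$query_id, hits$evalue), ]
best_hits <- ordered_hits[!duplicated(ordered_hits$query_id), ]
```

Ties in e-value are resolved by the original row order in this sketch; a bit score column, where available, could serve as a secondary sort key.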

db.soft.mask

shall low complexity regions be soft masked? Default is db.soft.mask = TRUE.

db.hard.mask

shall low complexity regions be hard masked? Default is db.hard.mask = TRUE.

blast.path

path to BLAST executables.
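
Before starting a long search it can help to confirm that the blastp executable is reachable. The sketch below assumes that blast.path = NULL means the executable is looked up on the system PATH, which this page does not state explicitly. The helper find_blastp() is hypothetical and not part of metablastr; on Windows the executable name would likely need the .exe suffix.

```r
# Hypothetical helper (not part of metablastr): return the full path of
# the blastp executable, or NA if it cannot be found.
find_blastp <- function(blast.path = NULL) {
  candidate <- if (is.null(blast.path)) {
    # Assumption: with blast.path = NULL the executable is found via PATH.
    unname(Sys.which("blastp"))
  } else {
    file.path(blast.path, "blastp")
  }
  if (nzchar(candidate) && file.exists(candidate)) candidate else NA_character_
}

# A folder that does not exist cannot contain blastp, so this yields NA.
missing_blastp <- find_blastp(file.path(tempdir(), "no_such_blast_folder"))
```

Calling find_blastp() without arguments reports whether blastp is visible on the PATH of the current R session.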

Author

Hajk-Georg Drost

Examples

if (FALSE) {
blast_example <- blast_protein_to_nr_database(
  query       = system.file('seqs/qry_aa.fa', package = 'metablastr'),
  nr.database = "nr",
  output.path = tempdir(),
  db.import   = FALSE
)

# look at BLAST results
blast_example
}