In some cases, users may wish to extract sequences of a particular length from randomly sampled loci in a set of genomes. This function lets users specify how many sequences of a given length shall be randomly sampled from the genome. The sampling rule is as follows. For each locus, independently:

  • 1) choose randomly (equal probability: see sample.int for details) from which of the given chromosomes the locus shall be sampled (replace = TRUE).

  • 2) choose randomly (equal probability: see sample.int for details) from which strand (plus or minus) the locus shall be sampled (replace = TRUE).

  • 3) choose randomly (equal probability: see sample.int for details) the starting position of the locus in the sampled chromosome and strand (replace = TRUE).
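The three steps above can be sketched in base R as follows. This is a minimal illustration and not the package's internal code: the helper name sample_random_loci and the named vector chr_lengths are hypothetical. It also assumes that a start position is drawn so that the whole locus fits inside the chosen chromosome, and that coordinates are reported in the same way for both strands.

```r
# Hypothetical helper illustrating the sampling rule; not the package's own code.
# chr_lengths: named numeric vector of chromosome lengths (assumed input format).
sample_random_loci <- function(chr_lengths, sample_size, interval_width) {
  stopifnot(
    !is.null(names(chr_lengths)),
    sample_size >= 0,
    interval_width >= 1,
    all(chr_lengths >= interval_width)
  )

  # 1) chromosome for each locus: equal probability, with replacement
  chr_idx <- sample.int(length(chr_lengths), size = sample_size, replace = TRUE)

  # 2) strand for each locus: equal probability, with replacement
  strand <- c("plus", "minus")[sample.int(2, size = sample_size, replace = TRUE)]

  # 3) start position: uniform over all starts that keep the locus
  #    inside the chosen chromosome (an assumption of this sketch)
  max_start <- unname(chr_lengths[chr_idx]) - interval_width + 1
  start <- vapply(max_start, function(m) sample.int(m, size = 1), integer(1))

  data.frame(
    chr = names(chr_lengths)[chr_idx],
    strand = strand,
    start = start,
    end = start + interval_width - 1,
    stringsAsFactors = FALSE
  )
}

# Example with two made-up chromosomes
set.seed(42)
loci <- sample_random_loci(c(chr1 = 1000, chr2 = 500),
                           sample_size = 5, interval_width = 100)
print(loci)
```

Because step 1 gives every chromosome the same weight, short chromosomes are sampled as often as long ones under this rule; the draw is not proportional to chromosome length.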

extract_random_seqs_from_multiple_genomes(
  sample_size,
  replace = TRUE,
  prob = NULL,
  interval_width,
  subject_genomes,
  file_name = NULL,
  separated_by_genome = FALSE,
  update = TRUE,
  path = NULL
)

Arguments

sample_size

a non-negative integer giving the number of loci that shall be sampled.

replace

logical value indicating whether sampling should be with replacement. Default: replace = TRUE.

prob

a vector of probability weights for obtaining the elements of the vector being sampled. Default is prob = NULL.

interval_width

the length of the locus that shall be sampled.

subject_genomes

a vector containing file paths to the reference genomes that shall be queried (e.g. file paths returned by meta.retrieval).

file_name

name of the fasta file that stores the randomly sampled sequences. This name will only be used when separated_by_genome = FALSE.

separated_by_genome

a logical value indicating whether sampled sequences from different genomes should be stored in the same output fasta file (separated_by_genome = FALSE; default) or in separate fasta files (separated_by_genome = TRUE).

update

a logical value indicating whether an existing file_name file shall be overwritten (update = TRUE; default) or whether sampled sequences shall be appended to the existing file (update = FALSE).

path

a folder path in which corresponding fasta output files shall be stored.
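A call using the arguments described above might look like the following sketch. The genome file paths, the output file name, and the output folder are hypothetical placeholders, and the example assumes the function is exported by the metablastr package. The call is guarded so that it only runs when the package is installed and the input files exist.

```r
# Hypothetical genome files; replace with real paths
# (e.g. file paths returned by meta.retrieval).
genomes <- c("genomes/species_A.fa", "genomes/species_B.fa")

# Assumption: the function is provided by the 'metablastr' package.
can_run <- requireNamespace("metablastr", quietly = TRUE) &&
  all(file.exists(genomes))

if (can_run) {
  metablastr::extract_random_seqs_from_multiple_genomes(
    sample_size         = 100,               # number of loci to sample
    replace             = TRUE,
    prob                = NULL,
    interval_width      = 1000,              # length of each sampled locus
    subject_genomes     = genomes,
    file_name           = "random_loci.fa",  # hypothetical output file name
    separated_by_genome = FALSE,             # one combined fasta file
    update              = TRUE,              # overwrite an existing file
    path                = "random_loci_output"  # hypothetical output folder
  )
}
```

With separated_by_genome = FALSE, file_name is used for the single combined output file; with update = FALSE, sequences would be appended to an existing file instead of overwriting it.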

Author

Hajk-Georg Drost