Given a genome assembly file and a corresponding annotation file, users can retrieve the upstream promoter sequences of all genes in a genome.

extract_upstream_promotor_seqs(
  organism,
  genome_file,
  annotation_file,
  annotation_format,
  file_name = NULL,
  promotor_width,
  replaceUnstranded = "+"
)

Arguments

organism

a character string specifying the scientific name of the organism.

genome_file

file path to the genome assembly file.

annotation_file

file path to the annotation file of the genome assembly, in the format specified by annotation_format.

annotation_format

format of the annotation file. Options are:

  • annotation_format = "gtf"

  • annotation_format = "gff"

  • annotation_format = "gff3"

file_name

file path to the output file storing the promoter sequences. Default is file_name = NULL.

promotor_width

width of the upstream promoter region in bp. The extracted region covers the promotor_width bp immediately upstream of the transcription start site (TSS) of each gene.

replaceUnstranded

strand assigned to unstranded sequences so that they receive a default strand. Default is replaceUnstranded = "+".

Details

This function extracts genomic sequences of the specified promotor_width upstream of the transcription start sites of all genes annotated in the corresponding annotation_file. The promoter sequences are then

Author

Hajk-Georg Drost

Examples

if (FALSE) {
# download genome assembly of Arabidopsis lyrata
Aly_genome <- biomartr::getGenome(db = "refseq", 
                                 organism = "Arabidopsis lyrata",
                                 path = file.path("refseq", "genome"),
                                 gunzip = TRUE)
# download annotation file of genome assembly of Arabidopsis lyrata
Aly_gff <- biomartr::getGFF(db = "refseq", 
                           organism = "Arabidopsis lyrata",
                           path = file.path("refseq", "annotation"),
                           gunzip = TRUE)

# retrieve upstream promoter sequences of length 1000 bp
promotor_seqs <- extract_upstream_promotor_seqs(
                               organism = "Arabidopsis lyrata",
                               genome_file = Aly_genome,
                               annotation_file = Aly_gff,
                               annotation_format = "gff",
                               promotor_width = 1000)

}
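
To illustrate what "promotor_width bp upstream of the TSS" means on each strand, here is a minimal base R sketch. It is not the package's implementation: the helper upstream_window() and the toy coordinates are hypothetical, and it assumes 1-based inclusive coordinates, with the TSS taken as the gene start on the "+" strand and the gene end on the "-" strand.

```r
# Hypothetical helper (not part of the package): compute the window covering
# `width` bp upstream of each gene's TSS, in 1-based inclusive coordinates.
upstream_window <- function(start, end, strand, width, unstranded = "+") {
  strand <- as.character(strand)
  # features without a usable strand ("*", ".") receive the default strand
  strand <- ifelse(strand %in% c("+", "-"), strand, unstranded)
  plus <- strand == "+"
  data.frame(
    # "+" strand: the window ends just before the gene start (clipped at 1);
    # "-" strand: the window begins just after the gene end
    # (not clipped at the chromosome end in this sketch)
    start  = ifelse(plus, pmax(start - width, 1L), end + 1L),
    end    = ifelse(plus, start - 1L, end + width),
    strand = strand,
    stringsAsFactors = FALSE
  )
}

# toy gene coordinates, invented for illustration only
genes <- data.frame(
  start  = c(5001L, 200L, 9000L),
  end    = c(7000L, 1500L, 9500L),
  strand = c("+", "-", "*"),
  stringsAsFactors = FALSE
)

windows <- upstream_window(genes$start, genes$end, genes$strand, width = 1000L)
print(windows)
```

On the "-" strand the extracted sequence would additionally need to be reverse-complemented to read in the direction of transcription; the sketch only computes coordinates.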