R/extract_upstream_promotor_seqs.R
extract_upstream_promotor_seqs.Rd
Given a genome assembly file and a corresponding annotation file, users can retrieve the upstream promoter sequences of all genes in a genome.
extract_upstream_promotor_seqs(
  organism,
  genome_file,
  annotation_file,
  annotation_format,
  file_name = NULL,
  promotor_width,
  replaceUnstranded = "+"
)
organism: a character string specifying the scientific name of the organism.
genome_file: file path to the genome assembly file.
annotation_file: file path to the annotation file of the genome assembly in gtf format.
annotation_format: format of the annotation file. Options are:
annotation_format = "gtf"
annotation_format = "gff"
annotation_format = "gff3"
file_name: file path to the output file storing the promoter sequences.
promotor_width: width of the upstream promoters in bp. The extracted region reaches back to -promotor_width bp from the transcription start site (TSS) of the gene.
replaceUnstranded: the strand assigned to unstranded sequences so that they receive a default strand. Default is replaceUnstranded = "+".
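The coordinate arithmetic implied by promotor_width and replaceUnstranded can be sketched in base R. This is only an illustration under stated assumptions, not necessarily how the function is implemented: upstream_window is a hypothetical helper, coordinates are assumed to be 1-based and inclusive, the TSS is assumed to be the gene start on the "+" strand and the gene end on the "-" strand, and unstranded features are assumed to be marked "*" or ".".

```r
# Hypothetical helper (not part of the package): compute the upstream window
# of each gene, assuming 1-based, inclusive coordinates.
upstream_window <- function(start, end, strand, promotor_width,
                            replaceUnstranded = "+") {
  # Assumption: features without strand information are marked "*" or ".";
  # they receive the default strand.
  strand <- ifelse(strand %in% c("*", "."), replaceUnstranded, strand)
  plus <- strand == "+"
  # Assumption: on "+", the TSS is the gene start; on "-", it is the gene end.
  win_start <- ifelse(plus, start - promotor_width, end + 1)
  win_end   <- ifelse(plus, start - 1, end + promotor_width)
  # Clip windows that would run past the beginning of the sequence.
  data.frame(strand = strand,
             start = pmax(win_start, 1),
             end = win_end,
             stringsAsFactors = FALSE)
}

# Toy annotation: one "+" gene, one "-" gene, one unstranded gene near the
# beginning of the sequence.
genes <- data.frame(start = c(5001, 8001, 300),
                    end = c(6000, 9000, 900),
                    strand = c("+", "-", "."),
                    stringsAsFactors = FALSE)

win <- upstream_window(genes$start, genes$end, genes$strand,
                       promotor_width = 1000)
print(win)
```

The sketch does not clip windows at the far end of a chromosome; a real implementation would also need the sequence lengths from the genome file.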
This function extracts genomic sequences of a specified promotor_width upstream of the transcription start sites of all genes annotated in the corresponding annotation_file. The promoter sequences are then
if (FALSE) {
# download the genome assembly of Arabidopsis lyrata
Aly_genome <- biomartr::getGenome(db = "refseq",
                                  organism = "Arabidopsis lyrata",
                                  path = file.path("refseq", "genome"),
                                  gunzip = TRUE)
# download the annotation file for the genome assembly of Arabidopsis lyrata
Aly_gff <- biomartr::getGFF(db = "refseq",
                            organism = "Arabidopsis lyrata",
                            path = file.path("refseq", "annotation"),
                            gunzip = TRUE)
# retrieve upstream promoter sequences of length 1000 bp
promotor_seqs <- extract_upstream_promotor_seqs(
  organism = "Arabidopsis lyrata",
  genome_file = Aly_genome,
  annotation_file = Aly_gff,
  annotation_format = "gff",
  promotor_width = 1000)
}
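To make "sequence upstream of the TSS" concrete without downloading a genome, the following base R sketch cuts upstream windows out of a toy sequence. It is an illustration only: chr, rev_comp and extract_window are hypothetical names, the window coordinates are chosen by hand, and returning the reverse complement for genes on the "-" strand is an assumption about orientation rather than documented behaviour of extract_upstream_promotor_seqs.

```r
# Toy chromosome; a real genome would be read from genome_file.
chr <- "AACCGGTTACGTTGCAAGTC"

# Reverse complement of a DNA string using base R only.
rev_comp <- function(x) {
  complemented <- chartr("ACGT", "TGCA", x)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# Hypothetical helper: cut out a 1-based, inclusive window and orient it
# 5' to 3' relative to the gene (assumption: "-" strand genes are reported
# as the reverse complement of the reference sequence).
extract_window <- function(seq, start, end, strand) {
  s <- substr(seq, start, end)
  if (strand == "-") rev_comp(s) else s
}

# A "+" gene starting at position 9 with a 4 bp upstream window (5..8).
plus_promoter <- extract_window(chr, 5, 8, "+")

# A "-" gene ending at position 14 with a 4 bp upstream window (15..18).
minus_promoter <- extract_window(chr, 15, 18, "-")

print(c(plus = plus_promoter, minus = minus_promoter))
```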