This section introduces the functions implemented in seqstats
that aim to quantify the statistical significance of observed patterns obtained by biological sequence analyses.
The randomSeqs()
function computes random sequences based on the alphabet and word length of an input sequence based on a multinomial model. These random sequences can furthermore be used for constructing sequence based test statistics.
[1] "AGTCAGTTGAC" "CCAGTTCGCTC" "ATCACTTTGCA" "AGGGTTCCCAA" "CTTACTGGACA"
[6] "AGAAAACCTCG" "CCGCACTTCAA" "TGGCTAGGTTC" "AATTGCGTAAA" "AATGATGGATT"
[1] "APMMMN" "AMPPPN" "AMAAAA" "PPMMPA" "MPPAPA" "NPPPAA" "PNNMPN"
[8] "PPNMAM" "AAANNM" "PANAMP"
Pairwise sequence alignments are a commom method to determine the degree of sequence conservation and therefore potential functional relationships between two sequences. Several pairwise sequence alignment algorithms have been developed, but sometimes it is useful to quantify the statistical significance of a pairwise sequence alignment.
For this purpose, the evalAlignment()
function was implemented which allows to quantify the statistical significance of a given pairwise alignment based on a sampled score distribution returned by randSeqDistr()
.
seq_example <- "MEDQVGFGF"
subject_example <- "AYAIDPTPAF"
p_val_align <- evalAlignment(seq_example, subject_example,sample_size = 10,
Biostrings::pairwiseAlignment, scoreOnly = TRUE,
fit_distr = "norm", comp_cores = 1)
p_val_align
p.value
0.1146395
Here, the p-value of the cooresponding global pairwise alignment between MEDQVGFGF
and AYAIDPTPAF
is statistically not significant. In this simple example, this fact is also reflected by the global alignment score -59.93476
returned by Biostrings::pairwiseAlignment()
.
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] MEDQVGFG-F
subject: [1] AYAIDPTPAF
score: -59.93476
Nevertheless, there are cases in which alignment scores are not sufficient enough to capture the significance of an observed alignment.