Biological Sequence Statistics

This section introduces the functions implemented in seqstats that aim to quantify the statistical significance of observed patterns obtained by biological sequence analyses.

Random Sequence Generator

The randomSeqs() function computes random sequences based on the alphabet and word length of an input sequence based on a multinomial model. These random sequences can furthermore be used for constructing sequence based test statistics.

 [1] "AGTCAGTTGAC" "CCAGTTCGCTC" "ATCACTTTGCA" "AGGGTTCCCAA" "CTTACTGGACA"
 [6] "AGAAAACCTCG" "CCGCACTTCAA" "TGGCTAGGTTC" "AATTGCGTAAA" "AATGATGGATT"
# a protein example
seq_example <- "NPPAAM"

randomSeqs(seq = seq_example, sample_size = 10)
 [1] "APMMMN" "AMPPPN" "AMAAAA" "PPMMPA" "MPPAPA" "NPPPAA" "PNNMPN"
 [8] "PPNMAM" "AAANNM" "PANAMP"

Quantification of Pairwise Sequence Alignments

Pairwise sequence alignments are a commom method to determine the degree of sequence conservation and therefore potential functional relationships between two sequences. Several pairwise sequence alignment algorithms have been developed, but sometimes it is useful to quantify the statistical significance of a pairwise sequence alignment.

For this purpose, the evalAlignment() function was implemented which allows to quantify the statistical significance of a given pairwise alignment based on a sampled score distribution returned by randSeqDistr().

  p.value 
0.1146395

Here, the p-value of the cooresponding global pairwise alignment between MEDQVGFGF and AYAIDPTPAF is statistically not significant. In this simple example, this fact is also reflected by the global alignment score -59.93476 returned by Biostrings::pairwiseAlignment().

Biostrings::pairwiseAlignment(seq_example, subject_example)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] MEDQVGFG-F 
subject: [1] AYAIDPTPAF 
score: -59.93476 

Nevertheless, there are cases in which alignment scores are not sufficient enough to capture the significance of an observed alignment.