vignettes/sequence_alignments.Rmd
sequence_alignments.Rmd
The orthologr
package provides multiple functions to
perform pairwise and multiple sequence alignments. The following
functions are implemented in orthologr
:
multi_aln()
: Perform Multiple Sequence Alignmentspairwise_aln()
: Perform Pairwise Sequence
Alignmentscodon_aln()
: Perform Codon AlignmentsPrior to be able to use all sequence alignment functions implemented
in orthologr
you need to install corresponding alignment
tools of interest. The above mentioned functions provide interfaces to
the following alignment programs:
multi_aln()
function
ClustalW : Advanced multiple alignment tool of nucleic acid and protein sequences
T_Coffee : A collection of tools for processing multiple sequence alignments of nucleic acids and proteins as well as their 3D structure
MUSCLE : Fast and accurate multiple alignment tool of nucleic acid and protein sequences
ClustalO : Fast and scalable multiple alignment tool for nucleic acid and protein sequences that is also capable of performing HMM alignments
MAFFT : A tool for multiple sequence alignment and phylogeny
The easiest way to use the multi_aln()
function is to
store the corresponding multiple sequence alignment tools in the default
execution PATH
of you system
(e.g. /usr/local/bin
on UNIX machines).
You can test whether the corresponding multiple sequence alignment
tool can be executed from the default PATH
by running:
MacOS: system("clustalw2 -help")
Linux: system("clustalw -help")
Windows: system("clustalw2.exe -help")
In case everything is installed appropriately, you should see:
CLUSTAL 2.1 Multiple Sequence Alignments
DATA (sequences)
-INFILE=file.ext :input sequences.
-PROFILE1=file.ext and -PROFILE2=file.ext :profiles (old alignment).
VERBS (do things)
-OPTIONS :list the command line parameters
-HELP or -CHECK :outline the command line params.
-FULLHELP :output full help content.
-ALIGN :do full multiple alignment.
The multi_aln()
function takes a fasta file storing the
genes (proteins) that shall be aligned. The tool
argument
specifies the alignment tool that shall be used to perform a multiple
sequence alignment (in this case tool = clustalw
). The
get_aln
argument specifies whether or not the alignment
shall be printed out to the console. In case
get_aln = FALSE
, the corresponding alignment file is stored
in the file.path(tempdir(),_alignment,multi_aln)
directory.
# in case the default execution path of clustalw runs properly on your system
orthologr::multi_aln(file = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
tool = "clustalw",
get_aln = TRUE)
CLUSTAL 2.1 Multiple Sequence Alignments
[1] "Multiple Alignment successfully written in /var/folders/tb/
v6lh8_g505bfjytt3lvdmvw00000gn/T//RtmpOXdtWL/_alignment/multi_aln/clustalw.aln."
AAMultipleAlignment with 2 rows and 429 columns
aln names
[1] MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYD...IKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG AT1G01010.1
[2] --------MAASEHRCVGCGFR---------------VKSLFIQYS...IFEP-------TIFLTQFGSLMQYLSYLFRTV-------------- 333554|PACid_1603...
It is also possible to pass additional parameters to the ClustalW call:
# running clustalw using additional parameters
# details: system("clustalw2 -help")
orthologr::multi_aln(file = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
tool = "clustalw",
get_aln = TRUE,
params = "-PWMATRIX=BLOSUM -TYPE=PROTEIN")
CLUSTAL 2.1 Multiple Sequence Alignments
Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: AT1G01010.1 429 aa
Sequence 2: 333554|PACid_16033839 245 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 2
Guide tree file created: [/Library/Frameworks/R.framework/Versions/3.1/Resources/library/orthologr/seqs/aa_seqs.dnd]
There are 1 groups
Start of Multiple Alignment
Aligning...
Group 1: Delayed
Alignment Score 29
CLUSTAL-Alignment file created [/var/folders/tb/v6lh8_g505bfjytt3lvdmvw00000gn/T//RtmpOXdtWL/_alignment/multi_aln/clustalw.aln]
[1] "Multiple Alignment successfully written in /var/folders/tb/v6lh8_g505bfjytt3lvdmvw00000gn/T//RtmpOXdtWL/_alignment/multi_aln/clustalw.aln."
AAMultipleAlignment with 2 rows and 429 columns
aln names
[1] MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYD...IKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG AT1G01010.1
[2] --------MAASEHRCVGCGFR---------------VKSLFIQYS...IFEP-------TIFLTQFGSLMQYLSYLFRTV-------------- 333554|PACid_1603...
system("t_coffee -version")
In case everything is installed appropriately, you should see:
PROGRAM: T-COFFEE Version_11.00.8cbe486 (2014-08-12 21:55:14 - Revision 8cbe486 - Build 470)
orthologr::multi_aln(file = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
tool = "t_coffee",
get_aln = TRUE)
system("muscle -help")
In case everything is installed appropriately, you should see:
MUSCLE v3.8.31 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
Basic usage
muscle -in <inputfile> -out <outputfile>
Common options (for a complete list please see the User Guide):
-in <inputfile> Input file in FASTA format (default stdin)
-out <outputfile> Output alignment in FASTA format (default stdout)
-diags Find diagonals (faster for similar sequences)
-maxiters <n> Maximum number of iterations (integer, default 16)
-maxhours <h> Maximum time to iterate in hours (default no limit)
-html Write output in HTML format (default FASTA)
-msf Write output in GCG MSF format (default FASTA)
-clw Write output in CLUSTALW format (default FASTA)
-clwstrict As -clw, with 'CLUSTAL W (1.81)' header
-log[a] <logfile> Log to file (append if -loga, overwrite if -log)
-quiet Do not write progress messages to stderr
-version Display version information and exit
Without refinement (very fast, avg accuracy similar to T-Coffee): -maxiters 2
Fastest possible (amino acids): -maxiters 1 -diags -sv -distance1 kbit20_3
Fastest possible (nucleotides): -maxiters 1 -diags
# in case the default execution path of muscle runs properly on your system
multi_aln(file = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
tool = "muscle",
get_aln = TRUE)
system("clustalo --help")
# in case the default execution path of muscle runs properly on your system
multi_aln(file = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
tool = "clustalo",
get_aln = TRUE)
system("mafft -help")
In case everything is installed appropriately, you should see:
------------------------------------------------------------------------------
MAFFT v7.187 (2014/10/02)
http://mafft.cbrc.jp/alignment/software/
MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)
------------------------------------------------------------------------------
High speed:
% mafft in > out
% mafft --retree 1 in > out (fast)
High accuracy (for <~200 sequences x <~2,000 aa/nt):
% mafft --maxiterate 1000 --localpair in > out (% linsi in > out is also ok)
% mafft --maxiterate 1000 --genafpair in > out (% einsi in > out)
% mafft --maxiterate 1000 --globalpair in > out (% ginsi in > out)
If unsure which option to use:
% mafft --auto in > out
--op # : Gap opening penalty, default: 1.53
--ep # : Offset (works like gap extension penalty), default: 0.0
--maxiterate # : Maximum number of iterative refinement, default: 0
--clustalout : Output: clustal format, default: fasta
--reorder : Outorder: aligned, default: input order
--quiet : Do not report progress
--thread # : Number of threads (if unsure, --thread -1)
multi_aln()
function
The multi_aln()
function is an interface function
between R and common multiple sequence alignment tools. When working
with this function a new folder named _alignment
is being
created and stores the multiple alignment returned by the corresponding
alignment tool. The argument get_aln = TRUE
allows to work
with the multiple alignment generated by the corresponding alignment
tool within the current R session.
This small pairwise alignment example shall illustrate how the
multi_aln()
output can be used:
multi_aln(system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
tool = "clustalw", get_aln = TRUE)
$nb
[1] 2
$nam
[1] "AT1G01010.1" "333554|PACid_16033839"
$seq
$seq[[1]]
[1] "medqvgfgfrpndeelvghylrnkiegntsrdvevaisevnicsydpwnlrfqskyksrdamwyffsrrennk
gnrqsrttvsgkwkltgesvevkdqwgfcsegfrgkighkrvlvfldgrypdktksdwvihefhydllpehqrtyvic
rleykgddadilsayaidptpafvpnmtssagsvvnqsrqrnsgsyntyseydsanhgqqfnensnimqqqplqgsfn
plleydfanhggqwlsdyidlqqqvpylapyenesemiwkhvieenfeflvdertsmqqhysdhrpkkpvsgvlpdds
sdtetgsmifedtssstdsvgssdepghtriddipslniieplhnykaqeqpkqqskekvissqksecewkmaedsik
ippstntvkqswivlenaqwnylknmiigvllfisviswiilvg"
$seq[[2]]
[1] "--------maasehrcvgcgfr---------------vkslfiqyspgnirlmk-------------------
---------------cgnckevadey----------iecermiifid---------------------lilhrpkvyr
hvlynainpetvniqhllwklvfvyllldsyrslllrrtdeess----------------fshssvlisikvligvls
anaafifs----------------------------------------faiaakgllnevs---rgreimlgicissy
fkifllamlvwefp----------------msvifivdilvltsnsmalkvmtestmtrciavcliahlvrfsvgqif
ep-------tifltqfgslmqylsylfrtv--------------"
$com
[1] NA
attr(,"class")
[1] "alignment"
Furthermore, multiple alignments are returned as follows:
multi_aln(system.file('seqs/multi_aln_example.fasta', package = 'orthologr'),
tool = "clustalw", get_aln = TRUE)
CLUSTAL 2.1 Multiple Sequence Alignments
$nb
[1] 4
$nam
[1] "AT1G01010.1|PACid_19656964" "Thhalv10006531m|PACid_20187082"
[3] "Bra032623|PACid_22715924" "311315|PACid_16059488"
$seq
$seq[[1]]
[1] "-----------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
-----------------------------------------medqvgfg------------------frp
ndeelvghylrnkiegntsr----------dvevaisevnicsydp---------wnlrfqskyksrd--
-------amwyffsrre------------nnkgnrqsrttvsgkwkltgesvevkdqwgfcseg------
--frgkig-------------------------------------------hkrvlvfldgrypdktksd
wvihefhydllpehqr-----------------tyvicr--leykgddadilsayaidptpafvpnmtss
agsvvnqsrqrnsgsyntyseydsanhgqqfnensn-imqqqplqgsfnplleydfanhg-------gqw
l----------------------------sdyidlqqqvpylapye------------------------
-----------------------------------nesemiwkhvieenfeflvdertsmqqhysdhrpk
kpvsgvlp----------------ddssdtetgsmifedtssstds------------------------
---------------------------------vgssdepghtriddipslniieplhnykaqeqpkqqs
kekvissqksecewkmaeds---------ikippstntvkqswivlenaqwnylknmiigvllfisvisw
iilvg-------------------------------"
$seq[[2]]
[1] "mvmederrgdikppsywldacediscdliddlvsdfdpssvavaesvdengvnndffggidhild
siknggglpnrahingvsetnsqringnsevseaaqliagettvsvkgnvlqkcggkrdevskeegeknr
krarvcsyqrersnlsgrgqansregdrfmnrkrtrnwdeaghnkrrdgynyrrdgrdreargywerdkv
gsnelvyrsgtweadherdlkkesgrnresdekaeenkskpeehkekvveeqarryqldvleqakaknti
afletgagktliailliksihkdltsqnrkmlsvflvpkvplvyqqaevirnqtcfqvghycgemgqdfw
darrwqrefeskqvlvmtaqillnilrhsiirmeainllildechhavkkhpyslvmsefyhttpkdkrp
aifgmtaspvnlkgvssqvdcaikirnletkldstvctikdrkelekhvpmpseivveydkaatmwslhe
kikqmiaaveeaaqassrkskwqfmgardagakdelrqvygvsertesdgaanlihklrainytlaelgq
wcaykvaqsfltalqsdervnfqvdvkfqesylsevvsllqcellegaaaekavaelskpengnandeie
egelpddhvvsggehvdkvigaavadgkvtpkvqsliklllkyqhtadfraivfvervvaalvlpkvfae
lpslgfircasmighnnsqemkssqmqdtiskfrdgqvtllvatsvaeegldirqcnvvmrfdlaktvla
yiqsrgrarkpgsdyilmverenvshaaflrnarnseetlrkeaiertdlshlkdssrlisidavpgtvy
kveatgamvslnsavglihfycsqlpgdryailrpefsmvkhekpgghteyscrlqlpcnapfeilegpv
cssmrlaqqavclagckklhemgaftdmllpdkgsgqdaekadqddegepipgtarhrefypegvadvlk
gewilsgkeicessklfhlymysvrcvdsgvskdpfltevsefavlfgneldaevlsmsmdlyvaramit
kaslafrgslditesqlssikkfhvrlmsivldvdvepsttpwdpakaylfvpvadnssaepikginwel
vekitkttvwdnplqrarpdvylgtnertlggdrreygfgklrhnigfgqkshptygirgavasfdvvra
sgllpvrdalekevegdlsqgklmmadgcmvaenllgkivtaahsgkrfyvdsicydmsaetsfprkegy
lgpleyntyadyykqkygvdlsckqqplikgrgvsycknllsprfeqsgesetildktyyvflppelcvv
hplsgslvrgaqrlpsimrrvesmllavqlknlisypiptskilealtaascqetfcyeraellgdaylk
wivsrflflkypqkhegqltrmrqqmvsnlvlyqyalvkglqsyiqadrfapsrwsapgvppvydedtkd
ggssffdeeekpegnkdvfedgemedgelegdlssyrvlssktladvvealigvyyveggktaanhlmkw
igihveddpeetegsvkpvynvpesvlksidfvgleralkyeftekgllveaithasrpssgvscyqrle
fvgdavldhlitrhlfftytslppgrltdlraaavnnenfarvavkhklhlylrhgssalekqirdfvke
vltesskpgfnsfglgdckapkvlgdivesiagaifldsgkdttaawkvfqpllqpmvtpetlpmhpvre
lqercqqqaegleykasrsgntatvevfidgvqvgaaqnpqkkmaqklaarnalaalkekeaeeskkkqa
ngnaagenqddnengnkkngnqtftrqtlndiclrknwpmpsyrcvkeggpahakrftfgvrvntsdrgw
tdecigepmpsvkkakdsaailllellnktys----"
$seq[[3]]
[1] "-----------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
-----------------------------------------------mqqppqmfpmapsmpptnitteq
iqkyleenkklimaimenqnlgklaecaqyqallqknlmylaaiadaqpppstagatppp----------
---------------------------------amasqmgaphpg-------------------------
---------------------------------------------mqppsyfmqhp------qasgmaqq
appagifp----------------------prgplqfgsphqlqdp------------------------
-----------------------------------qqqhmhqqamqghmgmrpmginnnngmqhqmqqqp
etslggsaanvgirggkqdg-------------------------adgqgkddgk---------------
------------------------------------"
$seq[[4]]
[1] "-----------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
--------------------------------------------------------mlslnmrteienlw
vfalaskfnifmqehfaslllaiaitwctltivfwstpggpawg--------------------------
--------------------------kyfftrrfsslghnrksknlipgprgfplvgsmslrsshvahqr
iasvaemsnakrlmafslgdtkvvvtchpdvakeilnssvfadrpvdetayg---lmfnramgfapngty
wrtlrrlssnhlfnpkqikrseeqrrviatqmvnafarnaksafavrdllktaslcnmmglvfg------
-----------------------------------------------------reyelesnnnveseclk
glveegydllg-----------------------tlnwtdhlpwlagldfqqirfrcsqlvpkvnlllsr
iihehyatgnfldvllslqrseklsdsdivavlwemifrgtdtvavliewvlarialhpk----------
----------------------------------vqstvhdeldr-------------------------
---------------------------------------------vvgrsrtvdesdlpsltyltamike
vlr--lhp-----------------------pgpllswarlsitdt------------------------
------------------------------------tvdgyhvpagttamvnmwaiardphvwedplefk
perfvakdgeaefsvfgsdlr---------------lapfgsgkrvcpgknlglttvsfwvatllhefew
lpsveanppdlsevlrlscemacplivnvsprrksv"
$com
[1] NA
attr(,"class")
[1] "alignment"