The orthologr package provides multiple functions to perform pairwise and multiple sequence alignments. The following functions are implemented in orthologr:

Getting Started

Prior to be able to use all sequence alignment functions implemented in orthologr you need to install corresponding alignment tools of interest. The above mentioned functions provide interfaces to the following alignment programs:

The multi_aln() function

  • ClustalW : Advanced multiple alignment tool of nucleic acid and protein sequences

  • T_Coffee : A collection of tools for processing multiple sequence alignments of nucleic acids and proteins as well as their 3D structure

  • MUSCLE : Fast and accurate multiple alignment tool of nucleic acid and protein sequences

  • ClustalO : Fast and scalable multiple alignment tool for nucleic acid and protein sequences that is also capable of performing HMM alignments

  • MAFFT : A tool for multiple sequence alignment and phylogeny

The easiest way to use the multi_aln() function is to store the corresponding multiple sequence alignment tools in the default execution PATH of you system (e.g. /usr/local/bin on UNIX machines).

You can test whether the corresponding multiple sequence alignment tool can be executed from the default PATH by running:

ClustalW

  • MacOS: system("clustalw2 -help")

  • Linux: system("clustalw -help")

  • Windows: system("clustalw2.exe -help")

In case everything is installed appropriately, you should see:

CLUSTAL 2.1 Multiple Sequence Alignments


                DATA (sequences)

-INFILE=file.ext                             :input sequences.
-PROFILE1=file.ext  and  -PROFILE2=file.ext  :profiles (old alignment).


                VERBS (do things)

-OPTIONS            :list the command line parameters
-HELP  or -CHECK    :outline the command line params.
-FULLHELP           :output full help content.
-ALIGN              :do full multiple alignment.

Perform A Multiple Alignment Using ClustalW

The multi_aln() function takes a fasta file storing the genes (proteins) that shall be aligned. The tool argument specifies the alignment tool that shall be used to perform a multiple sequence alignment (in this case tool = clustalw). The get_aln argument specifies whether or not the alignment shall be printed out to the console. In case get_aln = FALSE, the corresponding alignment file is stored in the file.path(tempdir(),_alignment,multi_aln) directory.

# in case the default execution path of clustalw runs properly on your system
orthologr::multi_aln(file    = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
          tool    = "clustalw",
          get_aln = TRUE)
 CLUSTAL 2.1 Multiple Sequence Alignments



[1] "Multiple Alignment successfully written in /var/folders/tb/
v6lh8_g505bfjytt3lvdmvw00000gn/T//RtmpOXdtWL/_alignment/multi_aln/clustalw.aln."

AAMultipleAlignment with 2 rows and 429 columns
     aln                                                                                           names               
[1] MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYD...IKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG AT1G01010.1
[2] --------MAASEHRCVGCGFR---------------VKSLFIQYS...IFEP-------TIFLTQFGSLMQYLSYLFRTV-------------- 333554|PACid_1603...

It is also possible to pass additional parameters to the ClustalW call:

# running clustalw using additional parameters
# details: system("clustalw2 -help")
orthologr::multi_aln(file    = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
          tool    = "clustalw",
          get_aln = TRUE,
          params  = "-PWMATRIX=BLOSUM -TYPE=PROTEIN")
CLUSTAL 2.1 Multiple Sequence Alignments


Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: AT1G01010.1             429 aa
Sequence 2: 333554|PACid_16033839   245 aa
Start of Pairwise alignments
Aligning...

Sequences (1:2) Aligned. Score:  2
Guide tree file created:   [/Library/Frameworks/R.framework/Versions/3.1/Resources/library/orthologr/seqs/aa_seqs.dnd]

There are 1 groups
Start of Multiple Alignment

Aligning...
Group 1:                     Delayed
Alignment Score 29

CLUSTAL-Alignment file created  [/var/folders/tb/v6lh8_g505bfjytt3lvdmvw00000gn/T//RtmpOXdtWL/_alignment/multi_aln/clustalw.aln]

[1] "Multiple Alignment successfully written in /var/folders/tb/v6lh8_g505bfjytt3lvdmvw00000gn/T//RtmpOXdtWL/_alignment/multi_aln/clustalw.aln."
AAMultipleAlignment with 2 rows and 429 columns
     aln                                                                                             names               
[1] MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYD...IKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG AT1G01010.1
[2] --------MAASEHRCVGCGFR---------------VKSLFIQYS...IFEP-------TIFLTQFGSLMQYLSYLFRTV-------------- 333554|PACid_1603...

T_COFFEE

system("t_coffee -version")

In case everything is installed appropriately, you should see:

PROGRAM: T-COFFEE Version_11.00.8cbe486 (2014-08-12 21:55:14 - Revision 8cbe486 - Build 470)
orthologr::multi_aln(file    = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
          tool    = "t_coffee",
          get_aln = TRUE)

MUSCLE

system("muscle -help")

In case everything is installed appropriately, you should see:

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.


Basic usage

    muscle -in <inputfile> -out <outputfile>

Common options (for a complete list please see the User Guide):

    -in <inputfile>    Input file in FASTA format (default stdin)
    -out <outputfile>  Output alignment in FASTA format (default stdout)
    -diags             Find diagonals (faster for similar sequences)
    -maxiters <n>      Maximum number of iterations (integer, default 16)
    -maxhours <h>      Maximum time to iterate in hours (default no limit)
    -html              Write output in HTML format (default FASTA)
    -msf               Write output in GCG MSF format (default FASTA)
    -clw               Write output in CLUSTALW format (default FASTA)
    -clwstrict         As -clw, with 'CLUSTAL W (1.81)' header
    -log[a] <logfile>  Log to file (append if -loga, overwrite if -log)
    -quiet             Do not write progress messages to stderr
    -version           Display version information and exit

Without refinement (very fast, avg accuracy similar to T-Coffee): -maxiters 2
Fastest possible (amino acids): -maxiters 1 -diags -sv -distance1 kbit20_3
Fastest possible (nucleotides): -maxiters 1 -diags
# in case the default execution path of muscle runs properly on your system
multi_aln(file    = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
          tool    = "muscle",
          get_aln = TRUE)

ClustalO

system("clustalo --help")
# in case the default execution path of muscle runs properly on your system
multi_aln(file    = system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
          tool    = "clustalo",
          get_aln = TRUE)

MAFFT

system("mafft -help")

In case everything is installed appropriately, you should see:

------------------------------------------------------------------------------
  MAFFT v7.187 (2014/10/02)
  http://mafft.cbrc.jp/alignment/software/
  MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)
------------------------------------------------------------------------------
High speed:
  % mafft in > out
  % mafft --retree 1 in > out (fast)

High accuracy (for <~200 sequences x <~2,000 aa/nt):
  % mafft --maxiterate 1000 --localpair  in > out (% linsi in > out is also ok)
  % mafft --maxiterate 1000 --genafpair  in > out (% einsi in > out)
  % mafft --maxiterate 1000 --globalpair in > out (% ginsi in > out)

If unsure which option to use:
  % mafft --auto in > out

--op # :         Gap opening penalty, default: 1.53
--ep # :         Offset (works like gap extension penalty), default: 0.0
--maxiterate # : Maximum number of iterative refinement, default: 0
--clustalout :   Output: clustal format, default: fasta
--reorder :      Outorder: aligned, default: input order
--quiet :        Do not report progress
--thread # :     Number of threads (if unsure, --thread -1)

The multi_aln() function

The multi_aln() function is an interface function between R and common multiple sequence alignment tools. When working with this function a new folder named _alignment is being created and stores the multiple alignment returned by the corresponding alignment tool. The argument get_aln = TRUE allows to work with the multiple alignment generated by the corresponding alignment tool within the current R session.

This small pairwise alignment example shall illustrate how the multi_aln() output can be used:

multi_aln(system.file('seqs/aa_seqs.fasta', package = 'orthologr'),
          tool = "clustalw", get_aln = TRUE)
$nb
[1] 2

$nam
[1] "AT1G01010.1"           "333554|PACid_16033839"

$seq
$seq[[1]]
[1] "medqvgfgfrpndeelvghylrnkiegntsrdvevaisevnicsydpwnlrfqskyksrdamwyffsrrennk
gnrqsrttvsgkwkltgesvevkdqwgfcsegfrgkighkrvlvfldgrypdktksdwvihefhydllpehqrtyvic
rleykgddadilsayaidptpafvpnmtssagsvvnqsrqrnsgsyntyseydsanhgqqfnensnimqqqplqgsfn
plleydfanhggqwlsdyidlqqqvpylapyenesemiwkhvieenfeflvdertsmqqhysdhrpkkpvsgvlpdds
sdtetgsmifedtssstdsvgssdepghtriddipslniieplhnykaqeqpkqqskekvissqksecewkmaedsik
ippstntvkqswivlenaqwnylknmiigvllfisviswiilvg"

$seq[[2]]
[1] "--------maasehrcvgcgfr---------------vkslfiqyspgnirlmk-------------------
---------------cgnckevadey----------iecermiifid---------------------lilhrpkvyr
hvlynainpetvniqhllwklvfvyllldsyrslllrrtdeess----------------fshssvlisikvligvls
anaafifs----------------------------------------faiaakgllnevs---rgreimlgicissy
fkifllamlvwefp----------------msvifivdilvltsnsmalkvmtestmtrciavcliahlvrfsvgqif
ep-------tifltqfgslmqylsylfrtv--------------"


$com
[1] NA

attr(,"class")
[1] "alignment"

Furthermore, multiple alignments are returned as follows:

multi_aln(system.file('seqs/multi_aln_example.fasta', package = 'orthologr'),
          tool = "clustalw", get_aln = TRUE)


 CLUSTAL 2.1 Multiple Sequence Alignments



$nb
[1] 4

$nam
[1] "AT1G01010.1|PACid_19656964"     "Thhalv10006531m|PACid_20187082"
[3] "Bra032623|PACid_22715924"       "311315|PACid_16059488"         

$seq
$seq[[1]]
[1] "-----------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
-----------------------------------------medqvgfg------------------frp
ndeelvghylrnkiegntsr----------dvevaisevnicsydp---------wnlrfqskyksrd--
-------amwyffsrre------------nnkgnrqsrttvsgkwkltgesvevkdqwgfcseg------
--frgkig-------------------------------------------hkrvlvfldgrypdktksd
wvihefhydllpehqr-----------------tyvicr--leykgddadilsayaidptpafvpnmtss
agsvvnqsrqrnsgsyntyseydsanhgqqfnensn-imqqqplqgsfnplleydfanhg-------gqw
l----------------------------sdyidlqqqvpylapye------------------------
-----------------------------------nesemiwkhvieenfeflvdertsmqqhysdhrpk
kpvsgvlp----------------ddssdtetgsmifedtssstds------------------------
---------------------------------vgssdepghtriddipslniieplhnykaqeqpkqqs
kekvissqksecewkmaeds---------ikippstntvkqswivlenaqwnylknmiigvllfisvisw
iilvg-------------------------------"

$seq[[2]]
[1] "mvmederrgdikppsywldacediscdliddlvsdfdpssvavaesvdengvnndffggidhild
siknggglpnrahingvsetnsqringnsevseaaqliagettvsvkgnvlqkcggkrdevskeegeknr
krarvcsyqrersnlsgrgqansregdrfmnrkrtrnwdeaghnkrrdgynyrrdgrdreargywerdkv
gsnelvyrsgtweadherdlkkesgrnresdekaeenkskpeehkekvveeqarryqldvleqakaknti
afletgagktliailliksihkdltsqnrkmlsvflvpkvplvyqqaevirnqtcfqvghycgemgqdfw
darrwqrefeskqvlvmtaqillnilrhsiirmeainllildechhavkkhpyslvmsefyhttpkdkrp
aifgmtaspvnlkgvssqvdcaikirnletkldstvctikdrkelekhvpmpseivveydkaatmwslhe
kikqmiaaveeaaqassrkskwqfmgardagakdelrqvygvsertesdgaanlihklrainytlaelgq
wcaykvaqsfltalqsdervnfqvdvkfqesylsevvsllqcellegaaaekavaelskpengnandeie
egelpddhvvsggehvdkvigaavadgkvtpkvqsliklllkyqhtadfraivfvervvaalvlpkvfae
lpslgfircasmighnnsqemkssqmqdtiskfrdgqvtllvatsvaeegldirqcnvvmrfdlaktvla
yiqsrgrarkpgsdyilmverenvshaaflrnarnseetlrkeaiertdlshlkdssrlisidavpgtvy
kveatgamvslnsavglihfycsqlpgdryailrpefsmvkhekpgghteyscrlqlpcnapfeilegpv
cssmrlaqqavclagckklhemgaftdmllpdkgsgqdaekadqddegepipgtarhrefypegvadvlk
gewilsgkeicessklfhlymysvrcvdsgvskdpfltevsefavlfgneldaevlsmsmdlyvaramit
kaslafrgslditesqlssikkfhvrlmsivldvdvepsttpwdpakaylfvpvadnssaepikginwel
vekitkttvwdnplqrarpdvylgtnertlggdrreygfgklrhnigfgqkshptygirgavasfdvvra
sgllpvrdalekevegdlsqgklmmadgcmvaenllgkivtaahsgkrfyvdsicydmsaetsfprkegy
lgpleyntyadyykqkygvdlsckqqplikgrgvsycknllsprfeqsgesetildktyyvflppelcvv
hplsgslvrgaqrlpsimrrvesmllavqlknlisypiptskilealtaascqetfcyeraellgdaylk
wivsrflflkypqkhegqltrmrqqmvsnlvlyqyalvkglqsyiqadrfapsrwsapgvppvydedtkd
ggssffdeeekpegnkdvfedgemedgelegdlssyrvlssktladvvealigvyyveggktaanhlmkw
igihveddpeetegsvkpvynvpesvlksidfvgleralkyeftekgllveaithasrpssgvscyqrle
fvgdavldhlitrhlfftytslppgrltdlraaavnnenfarvavkhklhlylrhgssalekqirdfvke
vltesskpgfnsfglgdckapkvlgdivesiagaifldsgkdttaawkvfqpllqpmvtpetlpmhpvre
lqercqqqaegleykasrsgntatvevfidgvqvgaaqnpqkkmaqklaarnalaalkekeaeeskkkqa
ngnaagenqddnengnkkngnqtftrqtlndiclrknwpmpsyrcvkeggpahakrftfgvrvntsdrgw
tdecigepmpsvkkakdsaailllellnktys----"

$seq[[3]]
[1] "-----------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
-----------------------------------------------mqqppqmfpmapsmpptnitteq
iqkyleenkklimaimenqnlgklaecaqyqallqknlmylaaiadaqpppstagatppp----------
---------------------------------amasqmgaphpg-------------------------
---------------------------------------------mqppsyfmqhp------qasgmaqq
appagifp----------------------prgplqfgsphqlqdp------------------------
-----------------------------------qqqhmhqqamqghmgmrpmginnnngmqhqmqqqp
etslggsaanvgirggkqdg-------------------------adgqgkddgk---------------
------------------------------------"

$seq[[4]]
[1] "-----------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
----------------------------------------------------------------------
--------------------------------------------------------mlslnmrteienlw
vfalaskfnifmqehfaslllaiaitwctltivfwstpggpawg--------------------------
--------------------------kyfftrrfsslghnrksknlipgprgfplvgsmslrsshvahqr
iasvaemsnakrlmafslgdtkvvvtchpdvakeilnssvfadrpvdetayg---lmfnramgfapngty
wrtlrrlssnhlfnpkqikrseeqrrviatqmvnafarnaksafavrdllktaslcnmmglvfg------
-----------------------------------------------------reyelesnnnveseclk
glveegydllg-----------------------tlnwtdhlpwlagldfqqirfrcsqlvpkvnlllsr
iihehyatgnfldvllslqrseklsdsdivavlwemifrgtdtvavliewvlarialhpk----------
----------------------------------vqstvhdeldr-------------------------
---------------------------------------------vvgrsrtvdesdlpsltyltamike
vlr--lhp-----------------------pgpllswarlsitdt------------------------
------------------------------------tvdgyhvpagttamvnmwaiardphvwedplefk
perfvakdgeaefsvfgsdlr---------------lapfgsgkrvcpgknlglttvsfwvatllhefew
lpsveanppdlsevlrlscemacplivnvsprrksv"


$com
[1] NA

attr(,"class")
[1] "alignment"