Marine Genomics 2013 - Phylogenomics exercise

Introduction

In this exercise you will practice your newly acquired skills in phylogenetic inference and "tree thinking" by analysing the evolutionary history of one of the gene families in the list below. You will use a reference sequence from Arabidopsis thaliana, a well-studied model organism in which the functions of these genes are reasonably well known. In the first part, you will use the techniques shown in the Tuesday BLAST exercise to find homologous protein sequences from different species. The second part involves running the actual phylogenetic analysis, and the final part is the interpretation of the results. The emphasis of the exercise is on the latter.

Two programs have to be downloaded and installed on your local computer before the exercise: the alignment editing program SeaView and the program FigTree for viewing and manipulating trees.

Suggestions for gene families to analyse in this exercise

  • Toc75
  • Toc159
  • Toc132
  • Toc120
  • Toc34
  • Tic20
  • Tic22
  • POR A
  • SAM50
  • Alb3
  • ... or your own favorite gene family.

Instructions

Part I

1). Gene family: Select one of the gene families from the list above for your analysis.

2). Reference sequence: Download a protein reference sequence from Arabidopsis thaliana at the NCBI site and save it in FASTA format in a text file using your favorite text editor. From the Tuesday exercise you probably remember how to do a species-specific search using a gene name (Hint: use a search string such as "Arabidopsis thaliana[orgn] Tic22").

3). Species tree: Navigate your web browser to www.phytozome.net/ and have a look at the species tree presented there. At the bottom of the figure you see a clade of green algae species (of which only two are marine!). The sister clade only contains land plants. In addition to Arabidopsis thaliana, select three more species from the land plant clade and three from the green algal clade. Make the selection in such a way that you include species from different parts of the tree. Then draw a species tree of the relationship between your selected species on a piece of paper and save that for later.

4). Gathering data: Return to the NCBI site to do a BLAST search for putative homologous protein sequences. On the BLAST page it is possible to do species-specific searches by entering an organism name in the field named "Organism". Use the reference sequence obtained in step 2 and search for similar sequences in each of the species in the tree you just drew, one species at a time. Analyse the BLAST results and save the sequences you want to include in the analysis in the same file as the reference sequence [Hint: at this stage it is better to save too many, rather than too few sequences].
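Once the file starts to fill up, it is easy to lose track of which species are already covered. The following is a minimal Python sketch for counting the saved sequences per species. It assumes that each FASTA header ends with the organism name in square brackets, as NCBI protein records normally do; the function name count_by_species and the example records are made up for illustration.

```python
import re
from collections import Counter

def count_by_species(fasta_text):
    """Count FASTA records per organism, read from a trailing [Genus species] tag."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = re.search(r"\[([^\[\]]+)\]\s*$", line)
        species = match.group(1) if match else "unknown"
        counts[species] += 1
    return counts

# Placeholder records: the accessions and residues are invented for illustration.
example = """>XX_000001.1 hypothetical protein A [Arabidopsis thaliana]
MKTAYIAK
>XX_000002.1 hypothetical protein B [Arabidopsis thaliana]
MKTAYLAK
>XX_000003.1 hypothetical protein C [Physcomitrella patens]
MRTAYIAR
"""

for species, n in sorted(count_by_species(example).items()):
    print(species, n)
```

Headers without a bracketed organism name are counted under "unknown", which is a cue to check those records by hand.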

5). Multiple sequence alignment: Load your FASTA file of sequences into SeaView, and under the dropdown menu "Align" select "Align all". A new window will open and show the progress of the alignment. Once it has finished, click the "OK" button and examine the result. Look for sequences that are poorly aligned to the rest, and exclude them if you suspect that they are not homologous to your reference sequence. Keep aligning, analysing and excluding sequences until you are happy with the alignment. Save the resulting alignment to your computer.

6). Phylogenetic analysis: Now point your web browser to www.phylogeny.fr/. Select the "One Click" function, upload your data, and run the analysis using the default settings.

7). Analysing the result: After the analysis has finished you will be presented with a phylogenetic tree. Download this tree by clicking "Newick" below the figure and copying the text to a new text file (Hint: Use the file extension "tree" for this file). Then open your newly created tree file in FigTree and view the result. Play around with the program for a while to familiarise yourself with the settings. Click on branches and then the "Reroot" or "Rotate" buttons to make the tree easier to compare with the species tree you drew earlier. Change the font size under "Tip Labels", add colors to branches or clades, and try the other options to make the tree look its best. Try to explain the result in relation to your species tree by identifying gene duplications and speciation events. Does it look like you have a good sample from the gene family, or do you need to do further BLAST searches to find sequences that seem to be "missing"?
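To check which sequences actually ended up in the tree, the tip labels can be pulled out of the Newick text. This is only a rough Python sketch: it assumes plain, unquoted labels without spaces or bracketed comments, and the tree string below is a made-up placeholder, not real output from the web server.

```python
import re

def tip_labels(newick):
    """Return tip labels in the order they appear in a Newick string."""
    # A tip label directly follows "(" or ","; text after ")" is an
    # internal-node label or support value and is skipped.
    return re.findall(r"(?<=[(,])\s*([^(),:;\s]+)", newick)

# Made-up tree with placeholder names, branch lengths and one support value.
tree = "((geneA_sp1:0.10,geneA_sp2:0.20)0.95:0.30,geneB_sp1:0.40);"
print(tip_labels(tree))
```

Comparing this list with the species you selected in step 3 shows at a glance whether any species is missing from the gene tree.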



Part II

In the previous part of the exercise you used the "One Click" function on www.phylogeny.fr and let the web server select the method for the phylogenetic inference. In this part of the exercise you will practice sending your sequence file to Albiorix and running a phylogenetic analysis there using the program MrBayes.

1). Multiple sequence alignment: Open your file containing the multiple sequence alignment in SeaView and examine it again. The ends of the sequences are generally more difficult to align than the often more conserved middle parts. Including such regions in the analysis may violate one of the fundamental assumptions of phylogenetic analysis, namely that the inference is based on homologous characters. If the alignment contains non-homologous sites, they may introduce conflicting signals into the dataset. At this stage you therefore have the opportunity to exclude parts of the alignment that you don't "trust". To do this, first select "Allow seq. editing" under "Props" and then use "Select All" under the dropdown menu "Edit" to highlight all sequences. Then use the mouse to select a position to the right of the region you want to exclude, and use the backspace key to remove parts of the aligned matrix (Hint: SeaView does not have an "Undo" function, so be careful not to remove too much). When you are done, save the file in NEXUS format and give it the file extension "nex". Open your newly created NEXUS file in your favorite text editor and make sure the sequence names do not contain the character | . Furthermore, at the end of the file include the following four lines:


BEGIN MRBAYES;
prset aamodelpr = mixed;
mcmc ngen=1000000 printfreq=1000 samplefreq=1000;
END;

The command "prset aamodelpr = mixed;" indicates that we want to sample across all fixed amino acid rate matrices (models for amino acid evolution) implemented in the program. The next line runs the analysis for 1,000,000 generations (ngen), printing progress to the screen (printfreq) and sampling trees (samplefreq) every 1,000 generations.
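If you prefer not to edit the file by hand, the two changes to the NEXUS file described above can be sketched in Python. This is a minimal sketch under the assumption that "|" occurs only in sequence names (it is not an amino acid symbol), so a plain text replacement is safe; the function name prepare_nexus is made up for illustration.

```python
MRBAYES_BLOCK = """BEGIN MRBAYES;
prset aamodelpr = mixed;
mcmc ngen=1000000 printfreq=1000 samplefreq=1000;
END;
"""

def prepare_nexus(nexus_text):
    """Replace "|" with "_" and append the MrBayes block if it is not there yet."""
    cleaned = nexus_text.replace("|", "_")
    if "BEGIN MRBAYES;" not in cleaned.upper():
        cleaned = cleaned.rstrip("\n") + "\n\n" + MRBAYES_BLOCK
    return cleaned

# Hypothetical usage; "in.nex" is the example file name used in step 3.
# with open("in.nex") as handle:
#     text = prepare_nexus(handle.read())
# with open("in.nex", "w") as handle:
#     handle.write(text)
```

Running the function twice on the same text does not add the block a second time. Still, open the result in a text editor afterwards and check that the sequence names remain unique.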

2). Sending sequences to Albiorix: Use the command scp (the command "man scp" will tell you how to do this) to send the NEXUS file to the Albiorix computer cluster. Ask for help if needed.
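As a reminder of the general shape of an scp call (local file first, then user@host:destination), here is a small Python sketch that only builds and prints the command. The user name and host name below are placeholders, not the real details of the cluster; use the login information you were given for the course.

```python
import subprocess

def build_scp_command(local_file, user, host, remote_dir):
    """Return the argument list for: scp LOCAL_FILE USER@HOST:REMOTE_DIR"""
    return ["scp", local_file, f"{user}@{host}:{remote_dir}"]

# The user and host values are placeholders, not the real cluster details.
command = build_scp_command("in.nex", "your_username", "albiorix.example.org", ".")
print(" ".join(command))

# Uncomment to run the copy for real once the placeholders are replaced:
# subprocess.run(command, check=True)
```

The printed line can also be typed directly in a terminal; the remote path "." refers to your home directory on the remote machine.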

3). Running MrBayes: Start the analysis by running the command "mb2-3.2.2.sh" followed by the option "-i" and the name of your input file, like this (in this example the input file is called "in.nex"):


mb2-3.2.2.sh -i in.nex

The output to your screen will look something like this:


0 -- [-12514.137] (-12346.638) (-13103.600) (-12097.902) [...4 remote chains...]
1000 -- (-6906.692) (-6859.085) [-6796.584] (-6831.543) [...4 remote chains...] -- 1:56:33
2000 -- (-6727.185) [-6728.602] (-6729.919) (-6722.131) [...4 remote chains...] -- 1:48:07
3000 -- (-6732.595) (-6715.177) (-6722.709) [-6721.656] [...4 remote chains...] -- 1:45:14
4000 -- (-6733.782) (-6730.317) (-6731.606) [-6710.378] [...4 remote chains...] -- 1:43:45
5000 -- (-6729.122) (-6717.373) (-6728.679) [-6716.584] [...4 remote chains...] -- 1:42:49

Average standard deviation of split frequencies: 0.102305

6000 -- (-6713.920) (-6710.291) (-6726.924) [-6707.084] [...4 remote chains...] -- 1:42:09
7000 -- (-6715.550) (-6727.507) (-6730.474) [-6715.136] [...4 remote chains...] -- 1:41:39
8000 -- (-6719.724) (-6727.341) (-6735.373) [-6722.418] [...4 remote chains...] -- 1:43:20
9000 -- [-6721.249] (-6722.367) (-6728.845) (-6715.845) [...4 remote chains...] -- 1:42:46
10000 -- (-6722.432) [-6719.681] (-6722.254) (-6722.804) [...4 remote chains...] -- 1:42:18

The first column indicates the number of generations the analysis has run, and the last column shows the estimated time remaining until the analysis finishes. After the stipulated 1,000,000 generations the program will stop and prompt you with the following:


Continue with analysis? (yes/no):

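Answer "no" if the reported "Average standard deviation of split frequencies:" is below 0.01. If it is still above 0.01, answer "yes" and enter the number of additional generations to run when asked. Once you have answered "no", type the commands "sump" (which summarises the sampled parameter values) and "sumt" (which summarises the sampled trees and writes the consensus tree) at the next prompt ("..." below indicates that the output from the program has been excluded from this example):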


MrBayes > sump
...
MrBayes > sumt
...

You can now quit the program by typing "quit".
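The stopping rule used at the "Continue with analysis?" prompt can also be checked against saved screen output. The Python sketch below assumes that the output has been captured as text in some way (how to capture it is left open here); the function names are made up, and the excerpt is taken from the example output shown earlier.

```python
import re

PATTERN = re.compile(r"Average standard deviation of split frequencies:\s*([0-9.]+)")

def last_split_frequency_sd(log_text):
    """Return the most recent reported value, or None if none was printed."""
    values = PATTERN.findall(log_text)
    return float(values[-1]) if values else None

def should_stop(log_text, threshold=0.01):
    """True when the latest reported value is below the threshold."""
    value = last_split_frequency_sd(log_text)
    return value is not None and value < threshold

# Excerpt in the format of the screen output shown earlier in this exercise.
log = """5000 -- (-6729.122) (-6717.373) (-6728.679) [-6716.584] [...4 remote chains...] -- 1:42:49

Average standard deviation of split frequencies: 0.102305

6000 -- (-6713.920) (-6710.291) (-6726.924) [-6707.084] [...4 remote chains...] -- 1:42:09
"""

print(last_split_frequency_sd(log), should_stop(log))
```

In this excerpt the value is 0.102305, which is well above 0.01, so should_stop returns False. That is expected this early in a run.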

4). Examine the result: Download the resulting tree file, which has the file extension "con.tre", to your local computer and analyse it in FigTree as you did in Part I of the exercise.