Gene family evolution - Exercise #2
In this exercise you will redo the analysis from Exercise #1, but this time you will analyse the fungal genes you worked with in other parts of this course. Another difference from the last exercise is that you will do the alignment, alignment editing and tree manipulation on your local computer. For the latter you have to download and install the program FigTree.
1). Reference sequence: Use the protein sequence from the fungal species Candida albicans, obtained in the assembly exercise, as your reference sequence.
2). Species tree: Navigate your web browser to this site and have a look at the species tree presented there. Select four of the species from the "red group" as your ingroup, and two from the "blue group" as your outgroup. Make the selection so that you include species from different parts of the red and blue clades of the tree. Then draw a species tree of the relationships between your selected species on a piece of paper and save it for later.
3). Gathering data: Once again return to the NCBI site to do a BLAST search for putative homologous protein sequences. On this page it is possible to do species-specific BLAST searches by entering an organism name in the field named "Organism". Use the reference sequence obtained in step 1, and search for similar sequences in all of the species in the tree you just drew, one species at a time. Analyse the BLAST results and save the sequences you want to include in the analysis in the same file as the reference sequence [Hint: at this stage it is better to save too many sequences than too few].
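Before moving on to the alignment, it can help to check what actually ended up in the FASTA file. The following is a minimal Python sketch, assuming the file is plain FASTA text. The function names (read_fasta, duplicated_sequences) and the example records are hypothetical and only serve as illustration; they are not part of any program used in this exercise.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def duplicated_sequences(records):
    """Return headers whose sequence is identical to an earlier record."""
    seen = set()
    duplicates = []
    for header, seq in records:
        if seq in seen:
            duplicates.append(header)
        else:
            seen.add(seq)
    return duplicates


# Hypothetical example data; replace with the contents of your own file.
example = """>ref_Calbicans hypothetical reference
MSTNPKPQRK
TKRNTNRRPQ
>hit1 hypothetical BLAST hit
MSTNPKPQRKAKRNT
>hit2 hypothetical copy of hit1
MSTNPKPQRKAKRNT
"""

records = read_fasta(example)
for header, seq in records:
    # Print the first word of the header and the sequence length.
    print(header.split()[0], len(seq))
print("duplicates:", duplicated_sequences(records))
```

Identical sequences saved twice (for example from overlapping BLAST searches) add nothing to the analysis, so they are natural candidates for removal.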
5). Multiple sequence alignment: Load your FASTA file of sequences into SeaView, and under the dropdown menu "Align" select "Align all". A new window will open and show the progress of the alignment process. Once it has finished, click the "OK" button and examine the result. Look for sequences that are poorly aligned to the rest, and exclude them if you suspect that they are not homologous to your reference sequence. Keep aligning, analysing and excluding sequences until you are happy with the alignment.
6). Phylogenetic analysis: Once again point your web browser to a new web site, this time www.phylogeny.fr/. Select the "One Click" function, upload your data and run the analysis using the default settings.
7). Analysing the result: After the analysis has finished you will be presented with a phylogenetic tree. Download this tree by clicking "Newick" below the figure, and then copy the text to a new text file (Hint: Use the file extension "tree" for this file). Then open your newly created tree file in FigTree and view the result. Play around with the program for a while to familiarise yourself with all the settings. Click on branches and then the "Reroot" or "Rotate" buttons to ease the comparison with the species tree you drew earlier. Change the font size under "Tip Labels", add colours to branches or clades, and try the other options to make the tree look its best. Try to explain the result in relation to your species tree by identifying gene duplications and speciation events. Does it look like you have a good sample from the gene family, or do you need to do further BLAST searches to find sequences that seem to be "missing"? If so, return to the NCBI site and extend your search.
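When looking for gene duplications it can be useful to count how many gene copies each species contributes to the tree. The following Python sketch does this directly from the Newick text. It rests on two assumptions that may not hold for your data: tip labels are unquoted, and each label starts with a species abbreviation followed by an underscore (the labels such as "Calb_g1" are hypothetical). The function names are likewise only illustrative.

```python
import re
from collections import Counter

# A tip label follows "(" or "," and runs until the next structural character.
# Support values on internal nodes follow ")" and are therefore not matched.
TIP_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def tip_labels(newick):
    """Return the tip labels of an unquoted Newick string, in order."""
    return TIP_PATTERN.findall(newick)


def copies_per_species(labels):
    """Count labels per species, taking the text before the first '_'."""
    return Counter(label.split("_", 1)[0] for label in labels)


# Hypothetical tree with branch lengths and support values.
newick = ("((Calb_g1:0.10,Ctro_g1:0.12)0.95:0.05,"
          "(Calb_g2:0.30,Scer_g1:0.25)0.80:0.02,Klac_g1:0.40);")

labels = tip_labels(newick)
print(labels)
print(copies_per_species(labels))
```

A species that appears more than once (here the hypothetical "Calb") may hint at a duplication, while a species from your species tree that is absent may indicate a sequence that is still "missing" from the sample.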
In the previous part of the exercise you used the "One Click" function on www.phylogeny.fr and let the web server select the method for the phylogenetic inference. In this part of the exercise you will practise sending your sequence file to Albiorix and running a phylogenetic analysis using the program MrBayes.
1). Multiple sequence alignment: Open the file containing your multiple sequence alignment in SeaView and examine it again. The ends of the sequences are generally more difficult to align than the often more conserved middle parts. Including such regions in the analysis may violate one of the fundamental assumptions of phylogenetic analysis, namely that the inference is based on homologous characters. If non-homologous sites are present in the alignment, they may introduce conflicting signals into the dataset. At this stage you therefore have the opportunity to exclude parts of the alignment that you don't "trust". To do this, first select "Allow seq. editing" under "Props" and then use "Select All" under the dropdown menu "Edit" to highlight all sequences. Then use the mouse to select a position to the right of the region you want to exclude, and use the backspace key to remove parts of the aligned matrix (Hint: SeaView does not have an "Undo" function, so be careful not to remove too much). When you are done, save the file in NEXUS format and give it the file extension "nex". Open your newly created NEXUS file in your favourite text editor and make sure the sequence names do not contain the character | . Furthermore, at the end of the file include the following six lines:
prset aamodelpr = mixed;
mcmc ngen=1000000 printfreq=1000 samplefreq=1000;
The command "prset aamodelpr = mixed;" indicates that we want to sample across all fixed amino acid rate matrices (models of amino acid evolution) implemented in the program. The next line runs the analysis for 1'000'000 generations, printing to the screen and sampling trees every 1'000 generations. The commands "sump" and "sumt" summarise the sampled parameters and the sampled trees, respectively; "sumt" produces a consensus tree.
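The check for the character | in the sequence names, described in step 1, can also be done with a few lines of Python instead of by hand. This is only a sketch under stated assumptions: it assumes that | occurs nowhere in the file except in sequence names (it is not an amino acid symbol), and the NEXUS fragment and sequence names below are hypothetical, not what SeaView necessarily writes.

```python
def replace_pipes(nexus_text, replacement="_"):
    """Replace every '|' in the text and report how many were replaced.

    Assumption: '|' only occurs in sequence names, never in the
    alignment itself or in any command block.
    """
    n_replaced = nexus_text.count("|")
    return nexus_text.replace("|", replacement), n_replaced


# Hypothetical NEXUS fragment with NCBI-style names containing '|'.
nexus_text = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=2 NCHAR=10;
  FORMAT DATATYPE=PROTEIN MISSING=? GAP=-;
MATRIX
gi|111|ref|XP_000001.1|  MSTNPKPQRK
gi|222|ref|XP_000002.1|  MSTNPKAQRK
;
END;
"""

cleaned, n_replaced = replace_pipes(nexus_text)
print(n_replaced, "characters replaced")
print(cleaned)
```

To use this on a real file, read its contents into a string, pass the string to replace_pipes and write the result back. Inspect the cleaned file in a text editor afterwards to confirm that the names are still unique.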
2). Sending sequences to Albiorix: Use WinSCP to create a directory with your name on Albiorix, and then put the NEXUS file in that directory. Ask for help if needed.
3). Log on to Albiorix: Start the program PuTTY and log on to Albiorix over SSH using the username, password and address written on the whiteboard. Once logged on, move into the directory you created earlier. Ask for help if you run into problems.
4). Running MrBayes: Start the analysis by running the command "mb2-3.2.2.sh" followed by the name of your input file, like this (in this example the input file is called "in.nex"):
mb2-3.2.2.sh in.nex
The output to your screen will look something like this:
0 -- [-12514.137] (-12346.638) (-13103.600) (-12097.902) [...4 remote chains...]
1000 -- (-6906.692) (-6859.085) [-6796.584] (-6831.543) [...4 remote chains...] -- 1:56:33
2000 -- (-6727.185) [-6728.602] (-6729.919) (-6722.131) [...4 remote chains...] -- 1:48:07
3000 -- (-6732.595) (-6715.177) (-6722.709) [-6721.656] [...4 remote chains...] -- 1:45:14
4000 -- (-6733.782) (-6730.317) (-6731.606) [-6710.378] [...4 remote chains...] -- 1:43:45
5000 -- (-6729.122) (-6717.373) (-6728.679) [-6716.584] [...4 remote chains...] -- 1:42:49
Average standard deviation of split frequencies: 0.102305
6000 -- (-6713.920) (-6710.291) (-6726.924) [-6707.084] [...4 remote chains...] -- 1:42:09
7000 -- (-6715.550) (-6727.507) (-6730.474) [-6715.136] [...4 remote chains...] -- 1:41:39
8000 -- (-6719.724) (-6727.341) (-6735.373) [-6722.418] [...4 remote chains...] -- 1:43:20
9000 -- [-6721.249] (-6722.367) (-6728.845) (-6715.845) [...4 remote chains...] -- 1:42:46
10000 -- (-6722.432) [-6719.681] (-6722.254) (-6722.804) [...4 remote chains...] -- 1:42:18
The first column indicates the number of generations the analysis has run, and the last column shows the estimated time it will take for the analysis to finish. After the stipulated 1'000'000 generations the program will stop and prompt you with the following:
Continue with analysis? (yes/no):
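Answer "no" if the reported "Average standard deviation of split frequencies:" is below 0.01. If it is higher, answer "yes" and let the analysis run for additional generations before checking again. Once you have answered "no", type the commands "sump" and "sumt" at the next prompt ("..." below indicates that the output from the program has been excluded from this example):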
MrBayes > sump
...
MrBayes > sumt
...
You can now quit the program by typing "quit".
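The decision at the "Continue with analysis?" prompt depends on the last reported average standard deviation of split frequencies. If you copy the screen output into a text file, the value can be picked out with a short Python sketch like the one below. The function names are hypothetical, the 0.01 threshold is the one used in this exercise, and the example text is taken from the screen output shown in step 4.

```python
import re

# Matches the diagnostic line that MrBayes prints during the run.
ASDSF_PATTERN = re.compile(
    r"Average standard deviation of split frequencies:\s*([0-9]+\.[0-9]+)"
)


def last_asdsf(log_text):
    """Return the most recently reported value, or None if there is none."""
    values = [float(v) for v in ASDSF_PATTERN.findall(log_text)]
    return values[-1] if values else None


def below_threshold(log_text, threshold=0.01):
    """True if the last reported value is below the chosen threshold."""
    value = last_asdsf(log_text)
    return value is not None and value < threshold


# Excerpt from the example screen output in this exercise.
log_text = """
5000 -- (-6729.122) (-6717.373) (-6728.679) [-6716.584] [...4 remote chains...] -- 1:42:49
Average standard deviation of split frequencies: 0.102305
6000 -- (-6713.920) (-6710.291) (-6726.924) [-6707.084] [...4 remote chains...] -- 1:42:09
"""

print(last_asdsf(log_text))
print(below_threshold(log_text))
```

In this excerpt the value is still well above 0.01, which is expected so early in a run; what matters is the value reported at the end of the analysis.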
5). Examine the result: Download the resulting tree file, which has the file extension "con.tre", to your local computer and analyse it in FigTree as you did in part I of the exercise.