Lab Notebook - Toc75

[Tue Aug 16 13:59:43 CEST 2011]
Query sequences atToc75-III, atToc75-IV and atOEP80/atToc75-V (Inoue & Potter 2004) were used to search the Phytozome v7.0 repository.

NOTES

  • The atToc75-IV and -V sequences are updated versions of the sequences used by Inoue & Potter (2004).
  • My local copy of Phytozome v6.0 was first updated to v7.0 by downloading the new or updated genomes described here.
  • Checked the release notes and the "Sequence use restrictions" for each genome project in Phytozome v7.0, to see if I could use them for this analysis.

"Sequence use restrictions"

The genomes of Aquilegia coerulea and Setaria italica are released with a "Sequence use restriction" clause including this sentence: "Scientific users are free to publish papers dealing with specific genes or small sets of genes using the sequence data", and can hence be included in this analysis. The release notes for Zea mays are less clear; I'll have to check them again. The rest of the genomes seem to be free to use.

Preparation of blast database

cat *.fa > ~/db/all/phytozome_7_cds.fst
makeblastdb -in phytozome_7_cds.fst -out Phytozome_7 -dbtype nucl

[Wed Aug 17 10:51:18 CEST 2011]

BLAST analyses - DNA

  • Updated BLAST+ to version 2.2.25
  • blastn -db /Users/mats/db/all/Phytozome_7 -query atToc75-III.fst -outfmt 11 -out atToc75-III_Phytozome_7.out -task dc-megablast
  • blast_formatter -archive atToc75-III_Phytozome_7.out > atToc75-III_Phytozome_7.formated.txt

Found 52 sequences in 24 species. No matches in Chlamydomonas reinhardtii or Volvox carteri. Some sequences have uninformative names that don't explicitly state which species they come from. Names in the *.fa files have been changed, and a new BLAST database generated.

  • %s/>AT/>Arabidopsis_thaliana_AT/g
  • %s/>Cre/>Chlamydomonas_reinhardtii_/g
  • %s/>orange1./>Citrus_sinensis_/g
  • %s/>Si/>Setaria_italica_/g
  • %s/>clementine0.9_/>Citrus_clementina_0.9_/g
  • %s/>Egrandis/>Eucalyptus_grandis/g
  • %s/>/>Zea_mays_/g
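The header renaming above can also be scripted. This is a minimal Python sketch, assuming each *.fa file holds sequences from a single species; the function name prefix_fasta_headers and the example header are made up for illustration. It only touches lines that start with ">", so sequence lines are never altered.

```python
def prefix_fasta_headers(lines, species):
    """Yield FASTA lines with '<species>_' inserted right after '>' on header lines.

    `species` may contain spaces; they are turned into underscores.
    Hypothetical helper, not part of the original workflow.
    """
    prefix = species.strip().replace(" ", "_") + "_"
    for line in lines:
        if line.startswith(">"):
            # Header line: keep the original identifier after the species prefix.
            yield ">" + prefix + line[1:]
        else:
            # Sequence line: pass through unchanged.
            yield line


if __name__ == "__main__":
    # Made-up record, only to show the call.
    example = [">gene1\n", "ATGGCGTTA\n"]
    print("".join(prefix_fasta_headers(example, "Zea mays")), end="")
```

Used on a real file, the lines would come from open("some_species.fa") and the result would be written to a new file before running makeblastdb again.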

### makeblastdb -in phytozome_7_cds.fst -out Phytozome_7 -dbtype nucl ###
Building a new DB, current time: 08/17/2011 11:17:37
New DB name: Phytozome_7
New DB title: phytozome_7_cds.fst
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 947845 sequences in 88.8843 seconds.
#############################################################

Giving "yaf_current" a try. Some of the sequences are consecutive in the database, so some manual parsing is required to get all the results.

NOTE: The "yaf_current" is not working very well. Will use Geneious instead!
NOTE: When using the web interface to Phytozome, I only get 16 hits (instead of 52) with 'blastp' or 'blastx'.

BLAST analyses - AA

Added species names to sequences in the database

  • %s/>AcoGoldSmith/>Aquilegia_coerulea/g
  • %s/>/>Arabidopsis_lyrata_/g
  • %s/>/>Arabidopsis_thaliana_/g
  • %s/>/>Brachypodium_distachyon_/g
  • %s/>clementine/>Citrus_clementina_/g
  • %s/>/>Carica_papaya_/g
  • %s/>/>Chlamydomonas_reinhardtii_/g
  • %s/>/>Cucumis_sativus_/g
  • %s/>/>Citrus_sinensis_/g
  • %s/>/>Eucalyptus_grandis_/g
  • %s/>/>Glycine_max_/g
  • %s/>/>Manihot_esculenta_/g
  • %s/>/>Mimulus_guttatus_/g
  • %s/>/>Medicago_truncatula_/g
  • %s/>/>Oryza_sativa_/g
  • %s/>/>Physcomitrella_patens_/g
  • %s/>/>Prunus_persica_/g
  • %s/>/>Populus_trichocarpa_/g
  • %s/>/>Ricinus_communis_/g
  • %s/>/>Sorghum_bicolor_/g
  • %s/>/>Setaria_italica_/g
  • %s/>/>Selaginella_moellendorffii_/g
  • %s/>/>Volvox_carteri_/g
  • %s/>/>Vitis_vinifera_/g
  • %s/>/>Zea_mays_/g

cat *.fa > Phytozome_7_peptide.fst
makeblastdb -in Phytozome_7_peptide.fst -out Phytozome_7_aa -dbtype prot

Building a new DB, current time: 08/17/2011 14:56:06
New DB name: Phytozome_7_aa
New DB title: Phytozome_7_peptide.fst
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 947845 sequences in 59.5057 seconds.

atToc75-III

blastp -db /Users/mats/db/all/aa/Phytozome_7_aa -query atToc75-III_aa.fst -outfmt 5 -out atToc75-III_aa_Phytozome_7.xml -task blastp

Output imported to Geneious and 107 sequences exported in fasta format (atToc75-III_aa_Phytozome_7.fasta) after short sequences (less than 100 aa long) and similar/identical sequences had been removed.
linsi --reorder atToc75-III_aa_Phytozome_7.fasta > atToc75-III_aa_linsi.fst
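The filtering step before the alignment was done by hand in Geneious. A rough Python equivalent is sketched here, assuming that "identical" means an exact match of the whole sequence; removal of merely "similar" sequences was a manual judgement and is not covered. The function names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_records(records, min_length=100):
    """Drop sequences shorter than min_length and exact duplicate sequences.

    The first record carrying a given sequence is kept; later copies are dropped.
    """
    seen = set()
    kept = []
    for header, seq in records:
        key = seq.upper()
        if len(seq) < min_length or key in seen:
            continue
        seen.add(key)
        kept.append((header, seq))
    return kept
```

The default min_length=100 mirrors the 100 aa cut-off mentioned above.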

NOTE: Geneious will add a number to the fasta header!

atToc75-III_aa_linsi.fst imported to SplitsTree.

NOTE: Geneious will add a space in the fasta header. SplitsTree will therefore only display the initial digits that were automatically added by Geneious, instead of the full fasta header that includes species names. This will fix that:

  • %s/ /_/g
  • %s/|/_/g
  • %s/:/_/g
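The three substitutions above can be collapsed into one step. This is a small Python sketch, with the assumption that only header lines need cleaning; the Vim commands act on the whole file, but sequence lines should not contain these characters anyway. The function names are made up.

```python
import re


def sanitize_header(header, bad_chars=" |:"):
    """Replace every character listed in bad_chars with an underscore."""
    pattern = "[" + re.escape(bad_chars) + "]"
    return re.sub(pattern, "_", header)


def sanitize_fasta(text, bad_chars=" |:"):
    """Apply sanitize_header to each FASTA header line, leaving sequences as-is."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            line = ">" + sanitize_header(line[1:], bad_chars)
        out.append(line)
    return "\n".join(out) + "\n"
```

Further characters that upset downstream programs (parentheses, commas) can be added to bad_chars.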

atToc75-V

blastp -db /Users/mats/db/all/aa/Phytozome_7_aa -query atToc75-V_aa.fst -outfmt 5 -out atToc75-V_aa_Phytozome_7.xml -task blastp

101 sequences remained after the results were examined in Geneious.

atToc75-IV

blastp -db /Users/mats/db/all/aa/Phytozome_7_aa -query atToc75-IV_aa.fst -outfmt 5 -out atToc75-IV_aa_Phytozome_7.xml -task blastp

101 sequences remained after the results were examined in Geneious.

Combined dataset

Combining the three datasets resulted in 128 "unique" sequences. Exported "atToc75_combi_aa_Phytozome_7.fasta" from Geneious.

linsi --reorder atToc75_combi_aa_Phytozome_7.fasta > atToc75_combi_aa_linsi.fasta

[Thu Aug 18 16:07:04 CEST 2011]
A manually curated selection of sequences is found in atToc75_linsi.fasta

Cyanidioschyzon merolae

Blasted "atToc75-III-IV-V_aa.fst" against the Cyanidioschyzon merolae (taxid:45157) genome at NCBI, which only resulted in short fragments (less than 15% coverage). Blasting the same three sequences against the Cm genome at http://merolae.biol.s.u-tokyo.ac.jp/blast/blast_cs.cgi resulted in ***** No hits found ******!

blastp -db /Users/mats/db/_Cyanidioschyzon_merolae/cds -query atToc75-III-IV-V_aa.fst -outfmt 5 -out atToc75-III-IV-V_aa_Cyanidioschyzon_merolae.xml -task blastp

[Wed Aug 31 13:05:53 CEST 2011]

This resulted in 11 hits, of which only three are longer than 113 aa's (exported to /Users/mats/project/toc75/aa/Cyanidioschyzon_merolae/Cyanidioschyzon_merolae_blast_result.fst).

In Cyanidioschyzon_merolae_blast_result.fst:

  • :%s/>/>Cyanidioschyzon_merolae/g
  • :%s/ /_/g
  • :%s/|/_/g

cp ../atToc75_combi_aa_Phytozome_7.fasta atToc75_combi_aa_Phytozome_7_Cmerolae.fasta
cat Cyanidioschyzon_merolae_blast_result.fst >> atToc75_combi_aa_Phytozome_7_Cmerolae.fasta
linsi --reorder atToc75_combi_aa_Phytozome_7_Cmerolae.fasta > atToc75_combi_aa_Phytozome_7_Cmerolae_linsi.fasta

Of the 11 sequences retrieved from the BLAST database, only three (called "Cyanidioschyzon_merolae1577_gnl_CMER_CMJ202C_similar_to_chloroplast_import-associated_channel_Toc75", "Cyanidioschyzon_merolae3745_gnl_CMER_CMR288C_probable_molybdopterin_synthase_Cnx2" and "Cyanidioschyzon_merolae1108_gnl_CMER_CMH185C_hypothetical_protein") seem to be possible to align to the sequences found in Phytozome. These three full-length sequences were then manually retrieved from "/Users/mats/db/_Cyanidioschyzon_merolae/cds.fasta" and stored in "/Users/mats/project/toc75/aa/Cyanidioschyzon_merolae/CM_blast_result_full-length.fst". When aligning the full-length sequences to "atToc75_combi_aa_Phytozome_7_Cmerolae.fasta" it looks like only "CMJ202C" is a Toc75 orthologue. This is the same result as Kalanon & McFadden (2008) reported. Will use this sequence in the following analyses.

Cyanobacteria

BLAST'ed "CMJ202C" against the "nr" database restricted to cyanobacteria (taxid:1117) on the NCBI website (using the "blastp" algorithm), and downloaded the 100 best matches to /Users/mats/project/toc75/aa/cyanobacteria/sequence_CM.fasta. Aligned the retrieved sequences to atToc75-III, -IV and -V, as well as "CMJ202C" (linsi --reorder asdf > atToc75-III-IV-V_aa_Cyanobacteria.fst). Interestingly, judging from the alignment, the cyanobacteria sequences can be divided into two sets, one large (containing 81 sequences) and one smaller (20 sequences). The Arabidopsis sequences align "best" with the larger set, and the C. merolae sequence with the smaller set. What does that mean?

BLAST'ed each of the four A. thaliana sequences the same way as the C. merolae sequence. Results are stored in "sequence_toc75_I.fasta", "sequence_toc75_III.fasta", "sequence_toc75_IV.fasta" and "sequence_toc75_V.fasta". All files contain 100 sequences except the "-III" file, which holds 93 sequences. The 493 sequences (including the 100 C. merolae matches) were manually investigated and duplicates removed, resulting in 144 unique sequences ranging in length from 93-2941 aa, and exported to "cyanobacteria_Toc75.fst". Aligned the 144 cyanobacteria proteins together with the four A. thaliana and C. merolae sequences. A few very long cyanobacteria sequences were then removed after inspection in SeaView, and the result saved in "cyanobacteria_Toc75_AT_CM.fst".

Combined the 104 sequences in "atToc75_linsi.fasta" with the 144 sequences in "cyanobacteria_Toc75_AT_CM.fst" in the file "toc75_phy_CM_cyan.fst".
mafft --reorder toc75_phy_CM_cyan.fst > toc75_phy_CM_cyan_mafft.fst

  • :%s/|/_/g
  • :%s/ /_/g
  • :%s/(/_/g
  • :%s/)/_/g
  • :let i=1 | g/>/s//\=">".i/ | let i=i+1
  • :%s/,/_/g

The above was done in order to import the file into SplitsTree. DID NOT HELP!!! Instead I saved the file in nexus format, imported it into PAUP* and ran the analysis for ~45 min. It was not possible to reroot the resulting trees in FigTree.
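The Vim counter idiom in the list above (let i=1 | g/>/s//\=">".i/ | let i=i+1) puts a running number directly after each ">". A Python sketch of the same idea follows, with one addition that is not in the original workflow: it also returns a lookup table, so the numbers can be mapped back to the full headers later. The function name is hypothetical, and unlike the Vim command it only treats lines that start with ">" as headers.

```python
def number_headers(lines, start=1):
    """Insert a running number after '>' on header lines.

    Returns (new_lines, mapping), where mapping[number] is the original header
    text without the leading '>'.
    """
    counter = start
    new_lines = []
    mapping = {}
    for line in lines:
        if line.startswith(">"):
            mapping[counter] = line[1:]
            new_lines.append(">" + str(counter) + line[1:])
            counter += 1
        else:
            new_lines.append(line)
    return new_lines, mapping
```

Keeping the mapping means the numeric prefixes never have to be stripped by pattern matching afterwards.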

Intron/exon structure

To incorporate information on the intron and exon structure from the four A. thaliana sequences in the alignment, I downloaded the four sequences (AT3G46740, from signal.salk.edu/atg1001/3.0/gebrowser.php (ecotype Col-0.MPI). Removed the intron characters (that are displayed in lowercase letters) using VIM and:

  • :%s/[a-z]/o/gc
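For comparison, here is a Python sketch of the intron removal, assuming (as described above) that introns are lowercase and exons uppercase in the downloaded sequence, and that nothing else, such as UTRs, is shown in lowercase. The function names are made up. Counting the uppercase blocks gives a quick check of the number of exons.

```python
import re


def strip_introns(seq):
    """Remove all lowercase (intron) characters, keeping everything else."""
    return "".join(ch for ch in seq if not ch.islower())


def exon_blocks(seq):
    """Return the list of consecutive uppercase (exon) blocks in the sequence."""
    return re.findall(r"[A-Z]+", seq)
```

Both functions expect sequence lines only; FASTA header lines should be kept out of the input.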

The AT3G46740 sequence is "reversed" compared to how the chromosome three sequence is displayed. Therefore I created the python script "reverse.pl" to be able to flip the sequence around, and align it to the rest.
The alignment of the four A. thaliana sequences shows that atToc75-I (8 exons), atToc75-III (7 exons) and atToc75-IV (6 exons) are possible to align along their full length. atToc75-V, on the other hand, has 16 exons, but only exons 13-16 are "possible to align" to the other three sequences.
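The contents of "reverse.pl" are not recorded in this notebook. One way to flip a sequence is sketched here in Python, assuming that "flipping" means taking the reverse complement (the usual operation for a gene on the opposite strand) rather than a plain reversal. Case is preserved, so lowercase intron characters stay lowercase.

```python
# Translation table for the four bases in both cases; other characters
# (for example N or gaps) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, preserving case."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which makes a handy sanity check.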

[Fri Sep 2 09:48:15 CEST 2011]

Copied "toc75_phy_CM_cyan_mafft.fst" (NB: will use this alignment later for an analysis of all taxa) to "toc75_phy_CM_cyan_mafft_exon.fst" (containing 205 sequences) and started removing duplicate and short sequences (i.e. sequences that don't have a region that is homologous (?) to exons 13-16 in atToc75-V). The fasta headers now start with a number, which makes it hard to see the taxon name when viewing the alignment in SeaView. To fix this I used VIM and this:

  • :%s/>[0-9]*_/>/g
  • :%s/>[0-9]*gi/>gi/gc

Several Selaginella moellendorffii sequences are missing the part corresponding to exons 13-16 in atToc75-V. Same thing with one sequence from Zea mays, Medicago truncatula. A number of taxa have sequences that are only short in the same section. This may cause problems in the phylogenetic analysis!
"Randomly" selected a number of cyanobacterial sequences and ended up with 111 sequences. Removed the section of the alignment preceding exons 13-16 in atToc75-V.

  • linsi --reorder toc75_phy_CM_cyan_mafft_exon.fst > toc75_phy_CM_cyan_linsi_exon.fst

Gloeobacter violaceus

Went back to "toc75_phy_CM_cyan_mafft.fst" and extracted the 12 cyanobacterial sequences with "OMP85" in the fasta header, and added them to "toc75_phy_CM_cyan_linsi_exon.fst", removed duplicates, realigned and removed positions outside exons 13-16 again. Also blasted these sequences on the NCBI site, against Gloeobacter violaceus (taxid:33072). Fasta headers for query sequences in "list_of_cyano.txt". Best BLAST match from all 12 searches stored in "gloeobacter_violaceus.fst". All sequences had the best BLAST match to "NP_924809.1". Added this sequence to "toc75_phy_CM_cyan_linsi_exon.fst", that now contains 122 sequences.

Phylogenetic analysis

Analysed the 122-sequence alignment with MrBayes 3.2 [mb8-3.2.sh toc75_phy_CM_cyan_linsi_exon.nex]:

  • charset protein = 1 - 347;
  • prset aamodelpr = mixed;
  • mcmc ngen=1000000 printfreq=1000 samplefreq=1000;

The analysis crashes with the following error messages:

1 -- [-42701.218] [...7 remote chains...]
[compute-0-0:26588] *** Process received signal ***
[compute-0-0:26588] Signal: Segmentation fault (11)
[compute-0-0:26588] Signal code: (128)
[compute-0-0:26588] Failing at address: (nil)
[compute-0-0:26588] [ 0] /lib64/libpthread.so.0 [0x366860e7c0]
[compute-0-0:26588] [ 1] /lib64/libc.so.6(vsnprintf+0x84) [0x3667e69874]
[compute-0-0:26588] [ 2] /usr/local/bin/mb-3.2(SafeSprintf+0xba) [0x43dc7a]
[compute-0-0:26588] [ 3] /usr/local/bin/mb-3.2(PrintTree+0x430) [0x443700]
[compute-0-0:26588] [ 4] /usr/local/bin/mb-3.2(PrintStatesToFiles+0x720) [0x47bf30]
[compute-0-0:26588] [ 5] /usr/local/bin/mb-3.2(RunChain+0x1382) [0x482622]
[compute-0-0:26588] [ 6] /usr/local/bin/mb-3.2(DoMcmc+0x731) [0x484661]
[compute-0-0:26588] [ 7] /usr/local/bin/mb-3.2(ParseCommand+0x285) [0x425d05]
[compute-0-0:26588] [ 8] /usr/local/bin/mb-3.2(DoExecute+0x620) [0x426bd0]
[compute-0-0:26588] [ 9] /usr/local/bin/mb-3.2(ParseCommand+0x285) [0x425d05]
[compute-0-0:26588] [10] /usr/local/bin/mb-3.2(CommandLine+0x161) [0x410621]
[compute-0-0:26588] [11] /usr/local/bin/mb-3.2(main+0x7c) [0x4108ec]
[compute-0-0:26588] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3667e1d994]
[compute-0-0:26588] [13] /usr/local/bin/mb-3.2 [0x40e6b9]
[compute-0-0:26588] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 26588 on node compute-0-0.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Also tried mb8-3.1.2.sh, with the same result! Then shortened the taxa names from 126 characters (the longest) to no more than 76 characters. That worked!!!
1000000 generations will take ~3h 10 min. to run.
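Shortening the taxa names by hand risks creating two identical names. This is a Python sketch of a truncation that keeps the names unique, assuming the 76-character limit found above by trial; the actual limit in MrBayes is not documented here. The function name and the "_1", "_2" suffix scheme are made up.

```python
def shorten_names(names, max_length=76):
    """Map each name to a version of at most max_length characters.

    Names that collide after truncation get a numeric suffix ("_1", "_2", ...)
    while still staying within max_length.
    """
    used = set()
    mapping = {}
    for name in names:
        short = name[:max_length]
        n = 1
        while short in used:
            suffix = "_" + str(n)
            short = name[:max_length - len(suffix)] + suffix
            n += 1
        used.add(short)
        mapping[name] = short
    return mapping
```

The returned dictionary doubles as a record of which long header each short name stands for.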

[Mon Sep 5 10:16:38 CEST 2011]
Extended the number of generations for the analysis to 10'000'000. The resulting tree is found in "toc75_phy_CM_cyan_linsi_exon.nex.con". Resolution in parts of the tree is low, and the number of taxa in the analysis makes it difficult to display the tree in a good way. I will remove some taxa, first from the land plant part of the dataset (resulted in 81 sequences), and then from cyanobacteria.

The 81 sequence matrix was saved to "toc75_81_seq_exon.nex" and analysed with MrBayes for 3'000'000 generations (ETA 6h).

[Tue Sep 6 12:28:07 CEST 2011]

Extended the analysis to 10'000'000 generations. STDV was ~0.06 at the end of the analysis. ESS values low in the combined dataset.
There are too many sequences in the phylogeny. Excluded Mimulus guttatus, Carica papaya, Setaria italica and all cyanobacteria sequences that don't have OMP or OMP_85 in the header. Ended up with 52 sequences, which were analysed with MrBayes like the previous datasets.

[Thu Sep 8 14:36:34 CEST 2011]
A few branches in the 52 taxa phylogeny have low posterior probability support. This could be due to problems with the alignments. Also noticed that I have an extra copy of one atToc75 protein in the dataset. Removed "Arabidopsis_thaliana_AT3G48620.1_PACid_19663143".
Removed a part of the matrix (corresponding to position 41-97 in "toc75_51_taxa_exon.nex"), and re-analysed the dataset. The new matrix is called "toc75_51_taxa_exon-2.nex".

Result: When looking at the unrooted tree, it looks like there are two eukaryote clades, two cyanobacteria clades and one clade with the C. merolae sequence together with two cyanobacteria sequences. Midpoint rooting places the root between the C. merolae + cyanobact. clade and the rest.

Blasted "atToc75-I.fst" against "Gloeobacter violaceus (taxid:33072)" at NCBI. Best hit was "gi|37521432|ref|NP_924809.1", same as before!

Removed seven cyanobacteria sequences ("toc75_44_taxa_exon.nex").

[Fri Sep 9 11:48:37 CEST 2011]

Removed two more cyanobacteria sequences ("toc75_42_taxa_exon.nex").

New approach

Could it be that the cyanobacteria sequences (that don't form a monophyletic group in the trees) have a MRCA with some other eukaryote protein, rather than the Toc75's I have found? Will BLAST them at NCBI against eukaryotes (taxid:2759).

The following proteins were used as query sequences. Best hit against any eukaryote organism, as well as best hit against A. thaliana, was saved. This resulted in six unique sequences ("/Users/mats/project/toc75/aa/cyanobacteria/reciprocal_blast/eukaryotes.mafft.fst"). Aligned these new sequences to the 51-taxa dataset that still has the part of the matrix that was removed earlier (corresponding to position 41-97 in "toc75_51_taxa_exon.nex"). The 57 sequences were then aligned and saved in "toc75_57_taxa_exon.linsi.nex".

Result: STDV was 0.008 after ~833000 generations, and the analysis was terminated after 879000 generations. Adding more sequences dramatically improved pp support for some branches! Rerunning the analysis to see if I can reproduce the results using a different random seed.

OMP85

[Mon Sep 12 20:20:00 CEST 2011]

BLAST'ed AEE78438 against Phytozome_7 (blastp -db /Users/mats/db/all/aa/Phytozome_7_aa -query AEE78438.fst -outfmt 5 -out AEE78438_aa_Phytozome_7.xml -task blastp). Manually removed duplicates, and then aligned the resulting 40 sequences to "toc75_57_taxa_exon.linsi.fst". Result saved in "toc75_OEP85_exon.fst".
Found out why one of the A. thaliana sequences was missing from the previous analyses.
"gi|332644917|gb|AEE78438.1| Outer membrane OMP85 family protein [Arabidopsis thaliana]" (missing from previous analyses) is quite similar to ">gi|60543353|gb|AAX22274.1| At3g44160 [Arabidopsis thaliana]" and must have been removed due to that. Hence, there is not a whole eukaryote clade missing from the previous analyses, just this one sequence. The "lost" A. thaliana sequence has been included in an alignment called "toc75_46_taxa_exon.linsi.fst".

Lateral gene transfer to Cyanidioschyzon merolae

[Fri Sep 16 14:23:36 CEST 2011]

Rooting the tree with Gloeobacter violaceus makes sense for many reasons. First, it has been identified as belonging to the sister group of Cyanobacteria (see Larsson et al. 2011 and references therein). Moreover, in the sequence alignment it is "similar" to the eukaryote sequences and some of the cyanobacteria sequences. A "problem" with rooting the tree like this is that the C. merolae sequence will have a MRCA with cyanobacteria, instead of with other eukaryotes, as expected! The C. merolae sequence, and the cyanobacteria sequences it forms a clade with, are very different from the rest of the analysed sequences. I think that this strange pattern could be explained by lateral gene transfer (LGT) (like a second endosymbiosis event) to C. merolae. Will look into this some more over the weekend.

[Tue Sep 20 13:36:42 CEST 2011]
BLAST'ed the Cyanidioschyzon merolae sequence "CMJ202C" against the NCBI "nr" database using the blastp algorithm. Saved the result in "CM_NCBI_BLAST.fst". Aligned the sequences using mafft and truncated the matrix by removing the N-terminal parts of the matrix preceding position 597 in the "CMJ202C" protein. The original file contains 108 sequences. The C. merolae sequence is still very different from the rest of the sequences.
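Truncating the matrix at a given residue of one protein means translating that residue number into an alignment column first, since gaps shift the coordinates. A minimal Python sketch is given here, assuming the alignment is held as a dictionary of equal-length gapped strings and that "preceding position 597" means residue 597 itself is kept. The function names are hypothetical.

```python
def column_of_residue(aligned_seq, residue_number, gap_chars="-."):
    """Return the 0-based alignment column holding the given 1-based residue."""
    count = 0
    for column, char in enumerate(aligned_seq):
        if char not in gap_chars:
            count += 1
            if count == residue_number:
                return column
    raise ValueError("sequence has fewer residues than requested")


def trim_before(alignment, reference_id, residue_number):
    """Drop all alignment columns that precede the given residue of the reference."""
    start = column_of_residue(alignment[reference_id], residue_number)
    return {name: seq[start:] for name, seq in alignment.items()}
```

For the case above the call would look like trim_before(alignment, "CMJ202C", 597), with the reference key matching whatever header the C. merolae sequence carries in the file.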

BLAST'ed the four Synechococcus sp. PCC 7335 sequences against Gloeobacteria (taxid:307596). Best hit for all four sequences was Gloeobacter_violaceus_gi_37521432_ref_NP_924809.1.

BLAST'ed the full-length C. merolae sequence against the environmental sample database (env_nr) at NCBI, which resulted in many hits on hypothetical proteins in "marine metagenome". Most matches were for the C-terminal part of the sequence that has been used in the phylogenetic analyses.

I want to see what effect the C. merolae sequence has on the phylogenetic analysis and have therefore removed it from the file called "toc75_44-2_taxa_exon.linsi.nex". Will run the analysis like before, but WITHOUT changing the "temp" settings.

Result: The position of the cyanobacteria sequences is still not certain. Instead they form a polytomy with the eukaryotes. The latter still group in one clade.

Galdieria sulphuraria

Found another red alga genome project that has made sequence data from the species Galdieria sulphuraria available. Downloaded the two fasta files called "Galdieria sulphuraria ESTs (FASTA)" and "Galdieria sulphuraria StackPack consensus sequences (FASTA)" from here, and formatted them using:

  • makeblastdb -in est.fst -out Galdieria_sulphuraria_EST -dbtype nucl
  • makeblastdb -in StackPack_consensus_sequences.fst -out Galdieria_sulphuraria_consensus -dbtype nucl

Then, BLAST'ed the C. merolae nucleotide sequence against both databases:

  • blastn -db /Users/mats/db/Galdieria_sulphuraria/Galdieria_sulphuraria_EST -query c.merolae_nt.fst -outfmt 5 -out CM_GS.xml -task dc-megablast
  • Resulted in no hits, and ...

  • blastn -db /Users/mats/db/Galdieria_sulphuraria/Galdieria_sulphuraria_EST -query c.merolae_nt.fst -outfmt 5 -out CM_GS.xml -task blastn
  • ...resulted in 25 hits, and...

  • blastn -db /Users/mats/db/Galdieria_sulphuraria/Galdieria_sulphuraria_consensus -query c.merolae_nt.fst -outfmt 5 -out CM_GS_cons.xml -task blastn
  • ...resulted in 24 hits.

All matches are really short (~20 nucleotides).
When BLAST'ing the C. merolae amino acid sequence against the "Peptide prediction from build 3.0 (Aug 2007)" on their BLAST server, I get a much better result. Best match is against "stig_59;Gs56400.1". The result can be downloaded in several formats, but unfortunately Fasta is not one of them: only the BLAST alignment is available for download, not the full-length sequence. Hence, downloaded the pairwise alignment of the best hit to the query sequence, and manually converted it to a Fasta file. Aligned this sequence to the 45_taxa dataset and saved in "toc75_GS.fst".
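The manual conversion from pairwise alignment to Fasta could be scripted along these lines. This is a Python sketch that assumes the server returns the classic BLAST text layout, with lines of the form "Sbjct <start> <residues> <end>" (optionally "Sbjct:"); the layout used by the Galdieria server has not been checked against this. It concatenates every Sbjct line it finds, so it is only meaningful for a single hit with a single HSP, and it recovers only the aligned part of the hit.

```python
import re

# "Sbjct" or "Sbjct:", start coordinate, aligned residues (may contain gaps),
# end coordinate.
SBJCT_LINE = re.compile(r"^Sbjct:?\s+\d+\s+([A-Za-z*\-]+)\s+\d+\s*$")


def subject_from_pairwise(text, name):
    """Build a FASTA record from the Sbjct lines of a pairwise BLAST alignment."""
    parts = []
    for line in text.splitlines():
        match = SBJCT_LINE.match(line.strip())
        if match:
            parts.append(match.group(1))
    # Alignment gaps are removed so the record holds plain residues.
    seq = "".join(parts).replace("-", "")
    return ">" + name + "\n" + seq + "\n"
```

The name argument becomes the Fasta header, for example the "stig_59_Gs56400.1" label used in the notebook.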

Reciprocal BLAST

[Fri Sep 23 10:05:17 CEST 2011]
Will do reciprocal BLAST against "cyanobacteria (taxid:1117)" in order to get a sample of cyanobacteria sequences for the analysis.

[Query sequence] : [Best match]

Arabidopsis thaliana

  • [atToc75-I, GI:332193711] : [surface antigen (D15) [Nostoc punctiforme PCC 73102], GI:186681903]
  • [surface antigen (D15) [Nostoc punctiforme PCC 73102], GI:186681903] : [outer envelope protein [Arabidopsis thaliana], GI:18419973]
  • [atToc75-III, GI:15232625] : [outer membrane protein, OMP85 family, putative [Microcoleus chthonoplastes PCC 7420], GI:254414950]
  • [outer membrane protein, OMP85 family, putative [Microcoleus chthonoplastes PCC 7420], GI:254414950] : [outer envelope protein [Arabidopsis thaliana], GI:18419973]
  • [atToc75-IV, GI:79466902] : [OMP85 family outer membrane protein [Synechococcus sp. PCC 7002], GI:170076946]
  • [OMP85 family outer membrane protein [Synechococcus sp. PCC 7002], GI:170076946] : [outer envelope protein [Arabidopsis thaliana], GI:18419973]
  • [atOEP80/atToc75-V, GI:18419973] : [surface antigen (D15) [Microcoleus vaginatus FGP-2], GI:334117378]
  • [surface antigen (D15) [Microcoleus vaginatus FGP-2], GI:334117378] : [outer envelope protein [Arabidopsis thaliana], GI:18419973]

Volvox carteri

  • [hypothetical protein VOLCADRAFT_103183 [Volvox carteri f. nagariensis], GI:302830744] : [OMP85 family membrane protein [Synechococcus sp. JA-3-3Ab], GI:86605270]
  • [OMP85 family membrane protein [Synechococcus sp. JA-3-3Ab], GI:86605270] : [hypothetical protein VOLCADRAFT_88723 [Volvox carteri f. nagariensis], GI:302833940]
  • [hypothetical protein VOLCADRAFT_88723 [Volvox carteri f. nagariensis], GI:302833940] : [surface antigen variable number [Oscillatoria sp. PCC 6506], GI:300867385]
  • [surface antigen variable number [Oscillatoria sp. PCC 6506], GI:300867385] : [hypothetical protein VOLCADRAFT_88723 [Volvox carteri f. nagariensis], GI:302833940]

Cyanidioschyzon merolae

  • [Cyanidioschyzon merolae, CMJ202C] : [surface antigen D15-like protein [Cyanothece sp. ATCC 51142], GI:172037293]
  • [surface antigen D15-like protein [Cyanothece sp. ATCC 51142], GI:172037293] : [Cyanidioschyzon merolae, CMJ202C]

Galdieria sulphuraria

  • [Galdieria sulphuraria, stig_59_Gs56400.1] : [chloroplastic outer envelope membrane protein [Thermosynechococcus elongatus BP-1], GI:22299332]
  • [chloroplastic outer envelope membrane protein [Thermosynechococcus elongatus BP-1], GI:22299332] : [Galdieria sulphuraria, stig_59_Gs56400.1]
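Whether a query and its best match point back at each other can be checked mechanically. A small Python sketch follows, using a subset of the pairs listed above as example data; the dictionary layout and function name are made up, and only the single best hit per query is considered.

```python
# Best hit per query, copied from the lists above (subset).
EXAMPLE_BEST_HITS = {
    "GI:332193711": "GI:186681903",  # atToc75-I -> Nostoc punctiforme
    "GI:186681903": "GI:18419973",   # Nostoc punctiforme -> atOEP80/atToc75-V
    "GI:18419973": "GI:334117378",   # atOEP80/atToc75-V -> Microcoleus vaginatus
    "GI:334117378": "GI:18419973",   # Microcoleus vaginatus -> atOEP80/atToc75-V
    "CMJ202C": "GI:172037293",       # C. merolae -> Cyanothece sp.
    "GI:172037293": "CMJ202C",       # Cyanothece sp. -> C. merolae
}


def reciprocal_pairs(best_hit):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    pairs = []
    for query, hit in best_hit.items():
        if best_hit.get(hit) == query and (hit, query) not in pairs:
            pairs.append((query, hit))
    return pairs
```

With this subset, atOEP80/atToc75-V and CMJ202C come out as reciprocal best hits with their cyanobacterial matches, while atToc75-I does not, in line with the lists above.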

Exchanged the existing cyanobacteria sequences from "toc75_GS.fst" (Note: containing the two red alga sequences) with [surface antigen variable number [Oscillatoria sp. PCC 6506], GI:300867385], [surface antigen (D15) [Microcoleus vaginatus FGP-2], GI:334117378], [surface antigen D15-like protein [Cyanothece sp. ATCC 51142], GI:172037293] and [chloroplastic outer envelope membrane protein [Thermosynechococcus elongatus BP-1], GI:22299332] (saved in "toc75_GS_res.fst"), and analysed in MrBayes.

[Wed Oct 12 10:11:51 CEST 2011]

I finally got around to fixing a problem I discovered some time ago. It turns out that the sequence "gi|332644917|gb|AEE78438.1| Outer membrane OMP85 family protein [Arabidopsis thaliana]" isn't a duplicate of "At3g44160" (the two genes are discussed in Moslavac et al. 2005). Hence, I have aligned the sequence to the "toc75_GS_res.fst" matrix, and called it "Toc75_final.fst/nex". Will analyse this dataset with the following MrBayes block:


BEGIN MRBAYES;

charset protein = 1 - 310;
prset aamodelpr = mixed;

mcmc ngen=1000000 printfreq=1000 samplefreq=1000;

END;

Wrapping things up

[Fri Oct 21 11:08:12 CEST 2011]

One important sequence, namely psToc75 (gi|576507|gb|AAA53275.1| outer membrane protein) from Pisum sativum, is still missing from the analysis. Added it to the Toc75_final.fst file, realigned using "linsi" and saved the result in Toc75_final_2.fst. Analysed this dataset using the same settings as above.

TODO

  • Check the release notes for the Zea mays genome again
  • fix "yaf_current" (or maybe not! Geneious seems to be quite useful)
  • Search the Cyanidioschyzon merolae genome using nucleotide query sequences, or HMM
  • Get information about the intron/exon structure of the A. thaliana sequences from here, and use that info to select how much of the N-terminal part of the aligned sequences that should be removed before the phylogenetic analysis.
  • Check how many "real" copies exist in Populus trichocarpa

References

Inoue, K. and Potter, D. (2004) The chloroplastic protein translocation channel Toc75 and its paralog OEP80 represent two distinct protein families and are targeted to the chloroplastic outer envelope by different mechanisms. The Plant Journal 39, 354-365.
Kalanon, M. and McFadden, G.I. (2008) The chloroplast protein translocation complexes of Chlamydomonas reinhardtii: a bioinformatic comparison of Toc and Tic components in plants, green algae and red algae. Genetics 179, 95-112.
Larsson, J., Nylander, J.A.A. and Bergman, B. (2011) Genome fluctuations in cyanobacteria reflect evolutionary, developmental and adaptive traits. BMC Evolutionary Biology 11, 187.
Moslavac, S. et al. (2005) Conserved pore-forming regions in polypeptide-transporting proteins. FEBS Journal 272, 1367-1378.