Ortologous sequences from WGS data

Introduction

Lately I have been working on a method for extracting gene family sequences from whole genome sequences datasets. This type of data is in a way already available from repositories like Phytozome, where you can download gene family datasets containing both orthologous sequences (separated by a speciation event) and paralogous sequences (separated by a gene duplication). However, for the downstream analyses I'm planing, I'm only interested in using one sequence per species.

The method is implemented in a python script called "reciprocal_blast.py", and uses BLAST+ to find potentially orthologous sequences and then stores the result in a MySQL database. The code for this can be downloaded from github.

Method

I use this method to find reciprocal blast matches in a reference genome (Arabidopsis thaliana in my case) and a number of plant and bacterial genomes.

Reciprocal BLAST is done by taking a sequence from one dataset (the Arabidopsis thaliana genome in this case; here called the primary dataset) and BLASTing it to a different datase (e.g. a bacterial whole genome sequences; here called the secondary dataset). The best matching sequence from the bacterial genome is then BLASTed to the Arabidopsis thaliana genome. If the second BLAST analysis returns the query sequence used in the first analysis, that is called a reciprocal (best) match and the two sequences are considered to be putative orthologues. For my analysis I use amino acid sequences, but nucleotide sequences should work as well.

The next step is to blast all the sequences in the primary dataset against a different secondary dataset (then against a third dataset and so on). The results from all these analyses are stored in a MySQL database. You can think of this database as a big table, where each row is identified by the name of a sequence in the primary dataset. Information about the different reciprocal BLAST analyses this sequence been used in is also stored in the table. The database can then be queried to adress questions like "Which datasets have been searched using this sequence?", "Which sequence was the best match?" and "Was this match a reciprocal best match?" and so on. It is also easy to extract information about "Which sequences in the primary dataset had only reciprocal best BLAST matches against all the other genomes?". Sequences fulfilling this criteria can then be extracted from the database to be prepared for further analyses. It is these datasets I'm looking for in this project.

Preliminary results: Using the Arabidopsis thaliana gnome (~38 000 gene models) as primary dataset and ~35 plant genomes and ~65 cyanobacterial genomes results in ~50 (putative gene family) datasets fulfilling this strict criteria.

Only using WGS datasets with many sequences will presumably increase this figure. Also, by excluding datasets from the analysis (i.e. require reciprocal best BLAST matches against fewer genomes) will also increase the resulting number of putative gene family datasets found.

Installing reciprocal_blast.py

NOTE: Have a look at this page before you formate the BLAST databases you will use.

For the impatient

  • Download the code from github
  • Create a database table layout using "prepare_db.py"
  • Create a database, user etc. and import the database layout file
  • Edit the "reciprocal_blast.cfg" file, and place it in the directory where you will run your analysis
  • In the same directory, split the multi-sequence fasta file you want to analyse into separate files using "one_fasta_to_many.py"
  • Run the analysis with the command "reciprocal_blast.py -f *.fst" to analyse all files in the directory

Long version

A MySQL database first has to be created, and then populated with a table for storing the BLAST results. Open the file prepare_db.py in a text editor and change the value of the variables "table_name" and "blast_databases". The first variable holds the name of the MySQL table you want to store the results in, and the second one lists the names of the blast databases you want to use for the analysis.


#!/usr/bin/env python

table_name = 'cyanobacteria'
blast_databases = ('Synechocystis_sp_PCC_6803', 'Raphidiopsis_brookii_D9', 'Lyngbya_sp_CCY_8106')

output_file = open('%s.sql' % table_name, 'w')
output_file.write('DROP TABLE %s;\n' % table_name)
output_file.write('CREATE TABLE IF NOT EXISTS `%s` (\n' % table_name)
output_file.write('     `query_seq` varchar(12) NOT NULL PRIMARY KEY\n')

for name in blast_databases:
   output_file.write('     ,`%s_id` varchar(100)\n' % name)
   output_file.write('     ,`%s_def` varchar(100)\n' % name)
#   output_file.write('     ,`%s_eval` DOUBLE(34,30) NOT NULL\n' % name)
   output_file.write('     ,`%s_seq` text\n' % name)
    output_file.write('     ,`%s` varchar(5)\n' % name)

output_file.write('     );')
output_file.close()

After that, you just run the program to generate a file that describes the layout of the MySQL table.


$ ./prepare_db.py

The output will be a file with the name of the database table and the file extension ".sql". Next step is to login to mysql, and create the database, a table and a user with privileges to use the database.


$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE orthology;
mysql> USE orthology_db;
mysql> SOURCE /home/mats/dev/cyanobacteria.sql;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE USER 'me'@'localhost' IDENTIFIED BY 'secret_word';
mysql> GRANT ALL ON orthology_db.* TO 'me'@'localhost' IDENTIFIED BY 'secret_word';
mysql> quit;

Obviously, you should change 'me' and 'secret_word' to something more appropriate. Next you have to modify the configuration file "reciprocal_blast.cfg" and provide some information about where you store your blast databases, which databases to use etc. Open the file in a text editor and change the values of the variables to something appropriate for your system:


$ vim reciprocal_blast.cfg

[Misc]
primary_taxon = 'Athaliana_167_peptide.fa'
secondary_taxon = ''

host = 'localhost'
user_name = 'me'
password = 'secret_word'
database_name = 'orthology'
db_table_name = 'bacteria'
db_dir = '/home/mats/project/chloroplast/orthology_project/db/'

...

  • primary_taxon - Name of the BLAST database with your reference genome
  • secondary_taxon - Name of the BLAST database with the genome you want to query for reciprocal blast matches
  • host - Computer where your MySQL database is found (usually 'localhost')
  • user_name - MySQL database user name
  • password - MySQL database password
  • database_name - Name of the database to use
  • db_table_name - Database table to store the results in
  • db_dir - Directory where you store your BLAST formated databases (don't forget the trailing "/")

Now it is almost time to get started on the analysis. But before that you have to split the fasta file, containing the sequences you want to analyse, into several files:

  • Create a directory for the analysis and copy the multi-sequence fasta file and your modified configuration file to that directory.
  • Split the multi-sequence file into several single sequence fasta files using the script "one_fasta_to_many.py" like this:


$ ./one_fasta_to_many.py 

This will create a number of new fasta files (one for each sequence in the multi-sequence fasta file) in your current directory. The name of a file will be the first part of the fasta header. You may therefore want to modify the fasta headers before you run "one_fasta_to_many.py", in order to generate files with sensible names.

Finally, it's time to run the actual analysis.


$ ./reciprocal_blast_CURRENT.py -f *.fst

The output from the program looks like this:


[--] Running initial blast analyses.
[--] Using ATMG01360.1.fst to search genome 'Athaliana_167_peptide.fa'
[--] Using ATMG01360.1.fst to search genome 'Mdomestica_196_peptide'
[--] Reciprocal BLAST using Mdomestica_196_peptide.fasta to search genome 'Athaliana_167_peptide.fa'
[--] Reciprocal BLAST using Mdomestica_196_peptide.fasta to search genome 'Mdomestica_196_peptide'
[--] Reciprocal BLAST using Athaliana_167_peptide.fa.fasta to search genome 'Athaliana_167_peptide.fa'
[--] Reciprocal BLAST using Athaliana_167_peptide.fa.fasta to search genome 'Mdomestica_196_peptide'

###    MATCH    ###

Number of rows updated: 1
[--] Extracting data from Mdomestica_196_peptide.xml
Number of rows updated: 1
Number of rows updated: 1
Number of rows updated: 1

...

Note: In the current version of the code, only one genome dataset can be searched for reciprocal blast matches (contrary to some of the comments in the code claiming something else). Analysin several data set in one sweep was possible in a previous reincarnation of the code, and will possibly be implemented in later versions. This is also the reason why the program (in the current version) does two extra/redundant blast analyses.