Reciprocal BLAST S. schafta - S. uralensis - S. vulgaris

Introduction

In order to compare the two transcriptome datasets from Silene schafta (78895 sequences) and S. uralensis (80151 sequences) to each other, we are going to do a reciprocal BLAST analyses. This analysis will also include a dataset from S. vulgaris (37874 sequences) (Yann please send me details about the reference).

The analysis will be preformed like this: I'll first BLAST all sequences in the Silene vulgaris (primary-) dataset agains the S. uralensis (secondary-) dataset. Since this is a reciprocal BLAST analysis, each best match in the secondary dataset will then be BLAST'ed against the primary dataset. Information of a reciprocal BLAST match is then recorded in a database if this second BLAST analysis finds the original query sequens. That way, all possible reciprocal BLAST matches between S. vulgaris and S. uralensis will be identified. In the second analysis I will again use the S. vulgaris dataset as primary dataset but instead use S. schafta as secondary dataset. The third analysis will be S. schafta vs. S. uralensis.

As you can see, the Silene vulgaris datasets will be BLAST'ed agains the two others (i.e be the primary dataset) in two analyses. The S. schafta datasets will be the primary dataset once, and the S. uralensis dataset will participate as secondary dataset twice. All reciprocal BLAST matches between the three datasets will still be found as it does not matter which of the two datasets that acts as primary- or secondary dataset in the respective analysis.

Format the BLAST databases

The names of the datasets are "S_schafta_isotigs_singletons_12-08-23", "S_uralensis_isotigs_singletons_12-08-23" and "S_vulgaris_clean". The sufix "_12-08-23" will be removed from the names of the two first datasets to prevent problems with the MySQL query strings.

mt258@host-155-225:~/db/silene$ makeblastdb -in S_schafta_isotigs_singletons -dbtype nucl
Building a new DB, current time: 09/06/2012 22:22:18
New DB name:   S_schafta_isotigs_singletons
New DB title:  S_schafta_isotigs_singletons
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 78895 sequences in 2.51826 seconds.


mt258@host-155-225:~/db/silene$ makeblastdb -in S_uralensis_isotigs_singletons -dbtype nucl
Building a new DB, current time: 09/06/2012 22:22:37
New DB name:   S_uralensis_isotigs_singletons
New DB title:  S_uralensis_isotigs_singletons
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 80151 sequences in 2.53736 seconds.


mt258@host-155-225:~/db/silene$ makeblastdb -in S_vulgaris_clean -dbtype nucl
Building a new DB, current time: 09/06/2012 22:22:46
New DB name:   S_vulgaris_clean
New DB title:  S_vulgaris_clean
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 37874 sequences in 1.819 seconds.

Prepare a MySQL database table for the first two analyses

The MySQL database table that will store the results for the first two analyses (with S. vulgaris as primary dataset) will be setup like this:

#!/usr/bin/env python

table_name = 'silene_vulgaris_vs_uralensis_schafta'
blast_databases = ('S_uralensis_isotigs_singletons', 'S_schafta_isotigs_singletons')

output_file = open('%s.sql' % table_name, 'w')

output_file.write('DROP TABLE IF EXISTS `%s`;\n' % table_name)
output_file.write('CREATE TABLE `%s` (\n' % table_name)
output_file.write('     `query_seq` varchar(12) PRIMARY KEY\n')

for name in blast_databases:
    output_file.write('     ,`%s_id` varchar(100)\n' % name)
    output_file.write('     ,`%s_def` varchar(500)\n' % name)
    output_file.write('     ,`%s_eval` DOUBLE(34,30)\n' % name)
    output_file.write('     ,`%s_seq` text\n' % name)
    output_file.write('     ,`%s` varchar(5)\n' % name)

output_file.write('     );\n')
output_file.close()

Note: The current version of reciprocal_blast.py will only work with amino acid sequences. I have therefore modified the original code so that "reciprocal_blast-nt.py" can be used for the nucleotide sequences in this analysis. Also, I've implemented a solution to the problem caused by the 32000 subdirectory limit on a ext3 filesystem. "reciprocal_blast-nt.py" was pushed to github. The two version should be merged in the next commit.

Prepare a MySQL database table for the third analysis

#!/usr/bin/env python

table_name = 'silene_schafta_vs_uralensis'
blast_databases = ['S_uralensis_isotigs_singletons']

output_file = open('%s.sql' % table_name, 'w')

output_file.write('DROP TABLE IF EXISTS `%s`;\n' % table_name)
output_file.write('CREATE TABLE `%s` (\n' % table_name)
output_file.write('		`query_seq` varchar(12) PRIMARY KEY\n')

for name in blast_databases:
	output_file.write('		,`%s_id` varchar(100)\n' % name)
	output_file.write('		,`%s_def` varchar(500)\n' % name)
	output_file.write('		,`%s_eval` DOUBLE(34,30)\n' % name)
	output_file.write('		,`%s_seq` text\n' % name)
	output_file.write('     ,`%s` varchar(5)\n' % name)

output_file.write('		);\n')
output_file.close()

Note: "[ ]" around the name of the single BLAST database to search.

Bug in the code?

All three analyses are now running, but I have discovered that very few records are added to the MySQL database. After 786 analyses have run (shafta_vs_uralensis), only 36 new rows have been created in the database table. The expected outcome would be one new row created per reciprocal BLAST analysis. It looks like the vulgaris_vs_uralensis analysis is doing that, but the other two analyses are not.
This bug has to do with one row in the "prepare_db.py" script. The length of the identifier for the query sequence is only 12 characters long. This works fine for sequences from the Arabidopsis thaliana genome. However, in this analysis, the identifiers are ~14 characters long. Hence, the reciprocal BLAST analysis that is identified by the 14 character long filename (with ".fst" removed) will not match the row i the database identified by a 12 character long string. I have fixed this bug in the database tables, and restarted the tree analyses.

Results from the first analysis

The analysis of the S. vulgaris and S. uralensis datasets found 14298 reciprocal BLAST matches. The analysis of S. vulgaris and S. schafta datasets resulted in 14399 matches, and the S. schafta and S. uralensis analysis resulted in 26628 matches.

In other words, ~38% of the 37874 S. vulgaris sequences have a reciprocal BLAST match to ~18% of the 80151 S. uralensis sequences, and ~18% of the 78895 S. schafta sequences. ~34% of the S. schafta sequences have a reciprocal BLAST match to ~33% of the S. uralensis sequences.

1167 sequences have a reciprocal BLAST match in all datasets.

Note:


mysql> select count(query_seq) from silene_vulgaris_vs_uralensis_schafta where S_uralensis_isotigs_singletons='TRUE' and S_schafta_isotigs_singletons='TRUE' in (select query_seq from silene_schafta_vs_uralensis where S_uralensis_isotigs_singletons='TRUE'); 
+------------------+
| count(query_seq) |
+------------------+
|             1167 |
+------------------+
1 row in set (0.05 sec)

Reciprocal BLAST of singleton sequences

We will try to deal with the problem of having splice variants in the datasets by splitting all sequences into singletons instead (Note: Ask Yann how this was done).

The datasets used are called "S_schafta_contigs_singletons", "S_uralensis_contigs_singletons", "S_vulgaris_contigs_singletons" and contains the following number of sequences:

  • S. schafta - 79810 sequences
  • S. uralensis - 50955 sequences
  • S. vulgaris - 69306 sequences

The S. uralensis contains the fewest sequences and will therefore be the primary dataset in two of the reciprocal BLAST analyses. The S. vulgaris dataset will then be BLAST'ed to the S. schafta dataset. Results are saved to tables "silene_uralensis_vs_schafta_vulgaris_singletons" and "silene_vulgaris_vs_schafta_singletons" in database "reciprocal_blast"

Result: The number of matches between the S_uralensis and S_schafta datasets are 20214.


mysql> select count(query_seq) from silene_uralensis_vs_schafta_vulgaris_singletons where S_schafta_contigs_singletons='TRUE';
+------------------+
| count(query_seq) |
+------------------+
|            20214 |
+------------------+
1 row in set (0.00 sec)

The number of matches between the S_uralensis and S_vulgaris datasets are 19155.


mysql> select count(query_seq) from silene_uralensis_vs_schafta_vulgaris_singletons where S_vulgaris_contigs_singletons='TRUE';
+------------------+
| count(query_seq) |
+------------------+
|            19155 |
+------------------+
1 row in set (0.00 sec)

The number of matches between the S_schafta and S_vulgaris datasets are 22607


mysql> select count(query_seq) from silene_vulgaris_vs_schafta_singletons where S_schafta_contigs_singletons='TRUE';
+------------------+
| count(query_seq) |
+------------------+
|            22607 |
+------------------+
1 row in set (0.00 sec)

The number of matches between all three datasets are 3294

select count(query_seq) from silene_uralensis_vs_schafta_vulgaris_singletons where S_schafta_contigs_singletons='TRUE' and S_vulgaris_contigs_singletons='TRUE' and query_seq in (select query_seq from silene_vulgaris_vs_schafta_singletons where S_schafta_contigs_singletons='TRUE');
+------------------+
| count(query_seq) |
+------------------+
|             3294 |
+------------------+
1 row in set (0.13 sec)

Reciprocal BLAST of clustered sequences

Yann has prepared a third round of datasets for the reciprocal BLAST analysis. This time the sequences have been clustered (see the GoogleDoc for more details).

These three datasets will be analysed:

  • S_schafta_contigs_singletons_usearch.fasta - 58432 sequences
  • S_uralensis_contigs_singletons_usearch.fasta - 42139 sequences
  • S_vulgaris_contigs_singletons_usearch.fasta - 60645 sequences

The names of the original files have been changed to "S_schafta_usearch", "S_uralensis_usearch" and "S_vulgaris_usearch". Will BLAST the uralensis dataset to the two others in the first two analyses, and then BLAST the sequences in the schafta dataset to the vulgaris dataset.

Note: Some fasta headers looks like this "centroid=F67U7BG01BMMHQ;seqs=1;". This causes problem in the blast step, and the whole thing chrashes. I have therefore replaced the ";" characters with "_" in all fasta headers, recreated the blast databases, and restarted the analyses.

Results:

mysql> select count(*) from silene_uralensis_vs_schafta_vulgaris_usearch where S_schafta_usearch='True';
+----------+
| count(*) |
+----------+
|    19251 |
+----------+
1 row in set (0.00 sec)

The S. uralensis dataset has 19251 reciprocal BLAST matches to the S. schafta dataset.

mysql> select count(*) from silene_uralensis_vs_schafta_vulgaris_usearch where S_vulgaris_usearch='True';
+----------+
| count(*) |
+----------+
|    18733 |
+----------+
1 row in set (0.05 sec)

The S. uralensis dataset has 18733 reciprocal BLAST matches to the S. vulgaris dataset.

mysql> select count(*) from silene_schafta_vs_vulgaris_usearch where S_vulgaris_usearch='True';
+----------+
| count(*) |
+----------+
|    21871 |
+----------+
1 row in set (0.02 sec)

The S. schafta dataset has 21871 reciprocal BLAST matches to the S. vulgaris dataset.

mysql> select count(query_seq) from silene_uralensis_vs_schafta_vulgaris_usearch where S_schafta_usearch='True' and query_seq in (select query_seq from silene_schafta_vs_vulgaris_usearch where S_vulgaris_usearch='True') and S_vulgaris_usearch='True';
+------------------+
| count(query_seq) |
+------------------+
|             4271 |
+------------------+
1 row in set (0.12 sec)

4271 sequences in each dataset have a reciprocal blast match to a sequence in all other datasets!

The last SQL string may need some explanation. It can be dividen into three parts:

select count(query_seq) from silene_uralensis_vs_schafta_vulgaris_usearch where S_schafta_usearch='True' and query_seq in

(select query_seq from silene_schafta_vs_vulgaris_usearch where S_vulgaris_usearch='True') and S_vulgaris_usearch='True';

The translation is as follows:

Count the number of query sequences from S. uralensis with a reciprocal BLAST match in S. schafta that are also found in the next query of matches between S. schafta and S. vulgaris and that have a reciprocal match in S. vulgaris.