Identify single copy loci for sequence capture in the Orchidaceae family

Introduction

We (Filipe, Alex, me ...) have set up four criteria for the EST sequences that will be used to design the sequencing probes:

  • The Phalaenopsis to Oncidium EST's BLAT match should have an alignment length >120, and a similarity score of at lest 80%
  • Selected sequences should have a best BLAST match to a Oryza sativa low copy gene sequence with > 3 exons, that sould be longer that 120 bp each.

1. Identify low copy genes in a reference genome.

In this case, the reference genome is Oryza sativa that was downloaded from phytozome.net. Downloaded protein sequences rather than nucleotide sequences, to facilitate the comparison and more easily identify sequences from multi-copy gene families.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Osativa/annotation/Osat...
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ gunzip Osativa_204_protein_primaryTranscriptOnly.fa.gz
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ grep -c ">" Osativa_204_protein_primaryTranscriptOnly.fa
39049
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ makeblastdb -in Osativa_204_protein_primaryTranscriptOnly.fa -dbtype prot


Building a new DB, current time: 03/13/2014 14:11:33
New DB name: Osativa_204_protein_primaryTranscriptOnly.fa
New DB title: Osativa_204_protein_primaryTranscriptOnly.fa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 39049 sequences in 1.58495 seconds.

Blasted this database to itself, and saved the two best matches. First match is expected to be the query sequence. Multi-copy genes are expected to have a second best match with low e-value, long alignment length and high identity to the query sequence.


time blastp -query ../data/Osativa_204_protein_primaryTranscriptOnly.fa -task blastp -max_target_seqs 2 -db ../data/Osativa_204_protein_primaryTranscriptOnly.fa -out Os_to_Os.BLAST.table -outfmt '7 std qlen slen' -num_threads 3
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ grep -c "# Query:" Os_to_Os.BLAST.table
39049

39049 BLAST searches where run as expected. Next, I used the scrip get_low_copy.py and tested different cutoff values for the e-value. A relatively high value (e.g. 0.1) is safer to use, as this gives a better guarantee that the sequences from the reference genome actually are low copy genes. However, a lower cutoff will yield more sequence.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ ~/git/sequence_capture/get_low_copy.py Os_to_Os.BLAST.table 0.1 | wc -l
6183
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ ~/git/sequence_capture/get_low_copy.py Os_to_Os.BLAST.table 0.01 | wc -l
7330
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ ~/git/sequence_capture/get_low_copy.py Os_to_Os.BLAST.table 0.001 | wc -l
7974
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ ~/git/sequence_capture/get_low_copy.py Os_to_Os.BLAST.table 0.0001 | wc -l
8468
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ ~/git/sequence_capture/get_low_copy.py Os_to_Os.BLAST.table 0.00001 | wc -l
8967

The file "Os_to_Os.BLAST.table" is the output from the BLAST analysis of the O. sativa amino acid sequences to itself, and "0.1, 0,01 ..." is the minimum e-value for a BLAST match. Matches with these high e-values are expected to be non-homologous, and there query sequence is therefore expected to be a low copy sequence."

Test-selection O. sativa

Extracted the gene id's using e-value cutoff value 0.1, sorted the output and stored in file Os_to_Os-e0.1.txt

topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ ~/git/sequence_capture/get_low_copy.py Os_to_Os.BLAST.table 0.1 | sort > Os_to_Os-e0.1.txt

(Sequence ids where then extracted using vim and ":%s/|.*//g")

Downloaded the nucleotide sequences, and converted them into a blast database.

topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ wget ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Osativa/annotation/Osat...
--2014-03-25 15:28:08-- ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Osativa/annotation/Osat...
=> `Osativa_204_cds_primaryTranscriptOnly.fa.gz'
Resolving ftp.jgi-psf.org... 128.3.96.38
Connecting to ftp.jgi-psf.org|128.3.96.38|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/compgen/phytozome/v9.0/Osativa/annotation ... done.
==> SIZE Osativa_204_cds_primaryTranscriptOnly.fa.gz ... 13154870
==> PASV ... done. ==> RETR Osativa_204_cds_primaryTranscriptOnly.fa.gz ... done.
Length: 13154870 (13M) (unauthoritative)

100%[=================================================================================================================================>] 13,154,870 2.20M/s in 8.8s

2014-03-25 15:28:21 (1.42 MB/s) - `Osativa_204_cds_primaryTranscriptOnly.fa.gz' saved [13154870]

topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ makeblastdb -in Osativa_204_cds_primaryTranscriptOnly.fa -dbtype nucl

Building a new DB, current time: 04/09/2014 13:56:52
New DB name: Osativa_204_cds_primaryTranscriptOnly.fa
New DB title: Osativa_204_cds_primaryTranscriptOnly.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 39049 sequences in 1.733 seconds.

Identifying genes with >3 CDS's of minimum length 120bp

I started by counting the number of O. sativa genes with >3 CDS's and where each CDS is >120bp long using the script parsGff3.py. Then sorted and redirected the output to the file "Os_3-cds_length-120.txt".


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ ~/git/sequence_capture/parsGff3.py --gff3 Osativa_204_gene_exons.gff3 --min_cds_length 120 --cds 3 | wc -l
3965
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ ~/git/sequence_capture/parsGff3.py --gff3 Osativa_204_gene_exons.gff3 --min_cds_length 120 --cds 3 | sort > Os_3-cds_length-120.txt

Note: The file contains a number of gene identifiers that only differ by their suffix (e.g. *.1, *.2 and *.3). Hence, the meaning of "primary transcript only" is not what I expected.

Final selection of O. sativa reference genes

I then extracted and counted the number of O. sativa sequence id's for genes fulfilling all the above criteria:


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ comm -12 Os_3-cds_length-120.txt ../blast/Os_to_Os-e0.1.txt | wc -l
172

Result: Only 172 genes in the reference geome fullfils all criteria. This is an unexpectedly low result!

I will try to improve this number by instead using a higher e-value when selecting low copy genes after the first BLAST analysis.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ ~/git/sequence_capture/get_low_copy.py Os_to_Os.BLAST.table 0.00001 | sort > Os_to_Os-e0.00001.txt
comm -12 Os_3-cds_length-120.txt ../blast/Os_to_Os-e0.00001.txt | wc -l
272

Result: A little better then the last attempt.

Next, I'll try to improve the figure by lowering the cutoff value for the CDS length.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ ~/git/sequence_capture/parsGff3.py --gff3 Osativa_204_gene_exons.gff3 --min_cds_length 100 --cds 3 | wc -l
5477
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ ~/git/sequence_capture/parsGff3.py --gff3 Osativa_204_gene_exons.gff3 --min_cds_length 100 --cds 3 | sort > Os_3-cds_length-100.txt
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ comm -12 Os_3-cds_length-100.txt ../blast/Os_to_Os-e0.00001.txt | wc -l
462
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ comm -12 Os_3-cds_length-100.txt ../blast/Os_to_Os-e0.00001.txt > Os_3-cds_length-100_e0.00001.txt
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ wc -l Os_3-cds_length-100_e0.00001.txt
462 Os_3-cds_length-100_e0.00001.txt

Result: Better then the last attempt. Let's try with this selection of genes.

Creating a BLAST database of the selected O. sativa sequences.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ cp ../blast/Os_to_Os-e0.00001.txt .
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ fasta2tab Osativa_204_cds_primaryTranscriptOnly.fa | grep -f Os_to_Os-e0.00001.txt | tab2fasta > Os_to_Os-e0.00001.fst
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ makeblastdb -in Os_3-cds_length-100_e0.00001.fst -dbtype nucl

Building a new DB, current time: 04/09/2014 16:00:43
New DB name: Os_3-cds_length-100_e0.00001.fst
New DB title: Os_3-cds_length-100_e0.00001.fst
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 462 sequences in 0.037442 seconds.

2. Identify homologous sequences in the Oncidium and Phalaenopsis EST data

All available EST data on NCBI for "Oncidium" and "Phalaenopsis" was downloaded.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data/ncbi$ makeblastdb -in Oncidium_hybrid_cultivar_est.fasta -dbtype nucl -parse_seqids

Building a new DB, current time: 03/25/2014 14:49:09
New DB name: Oncidium_hybrid_cultivar_est.fasta
New DB title: Oncidium_hybrid_cultivar_est.fasta
Sequence type: Nucleotide
Deleted existing BLAST database with identical name.
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3183 sequences in 0.180591 seconds.
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data/ncbi$ makeblastdb -in Phalaenopsis_equestris_est.fasta -dbtype nucl -parse_seqids

Building a new DB, current time: 03/25/2014 14:58:08
New DB name: Phalaenopsis_equestris_est.fasta
New DB title: Phalaenopsis_equestris_est.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 48% ambiguous nucleotides (shouldn't be over 40%)
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 47% ambiguous nucleotides (shouldn't be over 40%)
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 50% ambiguous nucleotides (shouldn't be over 40%)
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: First data line in seq is about 48% ambiguous nucleotides (shouldn't be over 40%)
Adding sequences from FASTA; added 5604 sequences in 0.348965 seconds.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data/ncbi$ grep -c ">" Oncidium_hybrid_cultivar_est.fasta Phalaenopsis_equestris_est.fasta
Oncidium_hybrid_cultivar_est.fasta:3183
Phalaenopsis_equestris_est.fasta:5604

BLAST'ing the Phalaenopsis sequences to the Oncidium database


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ time blastn -db ../data/ncbi/Oncidium_hybrid_cultivar_est.fasta -query ../data/ncbi/Phalaenopsis_equestris_est.fasta -out Phalaenopsis_2_Oncidium.nt.BLAST.table -outfmt '7 std qlen slen' -num_threads 3

real 0m2.468s
user 0m2.322s
sys 0m0.209s
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ parseBLASTtable.py -i Phalaenopsis_2_Oncidium.nt.BLAST.table -a 120 -% 80 | wc -l
450
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ parseBLASTtable.py -i Phalaenopsis_2_Oncidium.nt.BLAST.table -a 120 -% 80 > P_2_O.a120_id80.txt

Extracted the sequences id's using vim (:%s/\s.*//g), and then extracted the actual sequences from the BLAST database.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ blastdbcmd -db ../data/ncbi/Phalaenopsis_equestris_est.fasta -entry_batch P_2_O.a120_id80.txt | grep -c ">"
450
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ blastdbcmd -db ../data/ncbi/Phalaenopsis_equestris_est.fasta -entry_batch P_2_O.a120_id80.txt > P_2_O.a120_id80.fst

BLAST'ed the selection of Phalaenopsis EST's to the database of selected O. sativa reference sequences.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ blastn -query P_2_O.a120_id80.fst -db ../data/Os_3-cds_length-100_e0.00001.fst -out P_2_O.a120_id80-2-Os_3-cds_length-100_e0.00001.BLAST.table -outfmt '7 std qlen slen' -num_threads 3
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ parseBLASTtable.py -i P_2_O.a120_id80-2-Os_3-cds_length-100_e0.00001.BLAST.table | wc -l
0

Result:Not a single sequence found a match! I suspect that the stringent selection of O. sativa sequences is the main problem here. Will therefore do a search of the database of low copy sequences only (not considering number and length of CDS's). Note that I'll be using the "Os_to_Os-e0.1.txt" dataset here.


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ fasta2tab Osativa_204_cds_primaryTranscriptOnly.fa | grep -f Os_to_Os-e0.1.txt | tab2fasta > Os_to_Os-e0.1.fst
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ makeblastdb -in Os_to_Os-e0.1.fst -dbtype nucl

Building a new DB, current time: 04/09/2014 17:33:49
New DB name: Os_to_Os-e0.1.fst
New DB title: Os_to_Os-e0.1.fst
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 6186 sequences in 0.273256 seconds.

New aproach


topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/data$ makeblastdb -in Cymbidium_mosaic_virus.fst -dbtype nucl -parse_seqids

Building a new DB, current time: 04/09/2014 20:56:39
New DB name: Cymbidium_mosaic_virus.fst
New DB title: Cymbidium_mosaic_virus.fst
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 254 sequences in 0.0307701 seconds.

topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ blastn -query Pha_2_Onc_a240_id86.fst -db ../data/Cymbidium_mosaic_virus.fst -out Pha_2_Onc_a240_id86_2_Cymbidium_mosaic_virus.BLAST.table -outfmt '7 std qlen slen' -num_threads 3

topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ parseBLASTtable.py -i Pha_2_Onc_a240_id86_2_Cymbidium_mosaic_virus.BLAST.table -% 100 | wc -l
1
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ parseBLASTtable.py -i Pha_2_Onc_a240_id86_2_Cymbidium_mosaic_virus.BLAST.table -% 99 | wc -l
10
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ parseBLASTtable.py -i Pha_2_Onc_a240_id86_2_Cymbidium_mosaic_virus.BLAST.table -% 95 | wc -l
50
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ parseBLASTtable.py -i Pha_2_Onc_a240_id86_2_Cymbidium_mosaic_virus.BLAST.table -% 90 | wc -l
69
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ parseBLASTtable.py -i Pha_2_Onc_a240_id86_2_Cymbidium_mosaic_virus.BLAST.table -% 80 | wc -l
72

topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ parseBLASTtable.py -i Pha_2_Onc_a240_id86_2_Cymbidium_mosaic_virus.BLAST.table -% 90 > viruses.txt
topel@Slartibartfasts:~/project/BILS/orchids_seq_capture/blast$ grep -c ">" Pha_2_Onc_a240_id86.fst
175