Identifying bacterial sequences in diatom WGS data

Introduction

Our genome assemblies of the Skeletonema marinoi and Surirella brebissonii datasets contain many bacterial contigs, which have to be identified and removed before further analyses can be done. This section describes the method we will use to do so.

Material

Diatom databases

Thalassiosira pseudonana CCMP1335

11673 sequences. Database contains all NCBI RefSeq sequences for "Thalassiosira pseudonana CCMP1335[orgn]" available on 2013-05-08.

Thalassiosira oceanica

34808 sequences. Database contains all NCBI protein sequences for "Thalassiosira oceanica[orgn]" available on 2013-05-08.

Fragilariopsis cylindrus

27137 + 18077 sequences. Filtered ("Best") protein models were downloaded from JGI using "wget http://genome.jgi-psf.org/Fracy1/download/portalData/Fracy1_GeneModels_F..." and "wget http://genome.jgi-psf.org/Fracy1/download/portalData/Fracy1_GeneModels_F...".

Pseudo-nitzschia multiseries CLN-47

264136 sequences. Protein database was downloaded from the JGI site using "wget http://genome.jgi-psf.org/Psemu1/download/Psemu1_all_proteins_20111011.a...".

Phaeodactylum tricornutum

10573 sequences. Database contains all NCBI RefSeq sequences for "Phaeodactylum tricornutum[orgn]" available on 2013-05-08.
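
The sequence counts quoted for each of the databases listed above can be checked by counting the header lines (those starting with ">") in the corresponding FASTA file. The following is a minimal Python sketch of such a check. The function name count_fasta_records and the demo file are hypothetical and only serve as illustration; they are not part of the original workflow.

```python
import tempfile
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Tiny made-up file, used only to show the call (not real diatom data).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fasta"
    demo.write_text(">seq1\nMKT\n>seq2\nMAAL\nGG\n")
    demo_count = count_fasta_records(demo)
    print(demo_count)
```

Running the same function on a downloaded FASTA file should reproduce the number given for that database, provided the file has not changed since the download.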

BLAST databases were formatted using "formatdb -i FASTA_FILE -o T -p T" (-p T marks the input as protein sequences; -o T parses the sequence IDs and builds an index).
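
When several FASTA files have to be formatted, the same formatdb call can be generated for each of them. The sketch below builds the command shown above for a list of files and, by default, only prints it. The file names and the helper functions formatdb_command and format_databases are assumptions made for illustration; only the formatdb options themselves come from the command above.

```python
import shutil
import subprocess


def formatdb_command(fasta_file):
    """Return the formatdb call used above as an argument list."""
    return ["formatdb", "-i", str(fasta_file), "-o", "T", "-p", "T"]


def format_databases(fasta_files, dry_run=True):
    """Build one formatdb command per file; run it only if dry_run is False."""
    commands = [formatdb_command(f) for f in fasta_files]
    for cmd in commands:
        if dry_run or shutil.which("formatdb") is None:
            # Dry run, or formatdb is not installed: just show the command.
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical file names; replace with the actual FASTA files.
example_files = ["T_pseudonana.fasta", "P_tricornutum.fasta"]
planned = format_databases(example_files, dry_run=True)
```

Keeping the dry run as the default makes it possible to inspect the commands before any database files are written.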