The Surirella brebissonii genome project

Introduction

At CMB, University of Gothenburg, we are currently working on the de novo genome assembly of the diatom Surirella brebissonii. PI for the project is Anders Blomberg and main responsibility for the assembly work lies on myself and Magnus Alm Rosenblad.

This page will describe the work on assembling the S. brebissonii data. I will not go into detail about the pre-sequencing work as I had no part in that.

Material and methods

We will evaluate the different assemblies using the following criteria:

  • N50 score
  • Distribution of contig lengths (average, SD, Plot [Frequency of lengt classes])
  • GC by counts plot and Length by counts plot

Data

  • 1_110324_A81BF6ABXX_Alg150bp_1.fastq - 125'681'790 sequences (Data_1-1.fastq)
  • 1_110324_A81BF6ABXX_Alg150bp_2.fastq - 125'681'790 sequences (Data_1-2.fastq)
  • 1_120821_BC118PACXX_Alg1_index2_1.fastq - 215'093'134 sequences (Data_2-1.fastq)
  • 1_120821_BC118PACXX_Alg1_index2_2.fastq - 215'093'134 sequences (Data_2-2.fastq)

Code

The code I've written for this project is available at GitHub.

Before I arrived in the lab, a lot of work had been put into finding the right tools and settings for doing the assembly of the available datasets. I have wrapped these methods in the python program assemblyPipeline.py. This wrapper uses fastx_trimmer, cutadapt and fastq_quality_filter to prepare the data and the CLC Assembly Cell for the assembly.

The pipeline includes a step where the trimmed and filtered sequences have to be sorted in the two groups (1.) forward and reverse pairs and (2.) singlets comprised of either forward or reverse sequences where the mate in the pair has been removed in the filtering step. pairSeq.py was produced to do this in a memory efficient way. For example, sorting two files, each containing 53GB of sequence data (in totale 106GB of sequences to sort), only requires 26GB of RAM (Disclaimer: Memory usage was measured using the program "top" that is not very reliable for this type of measurements).

Assembly and filtering analyses

20130118

This filtering and assembly analysis was done using the older scripts called "Complete_Filtering_QXX.sh" and "CLC_Surirella.sh" and not assemblyPipeline.py.

Result:

I ran a scaffolding analysis using "SSPACE_v1-1.pl -l lib.txt -s a12s-20130118_novo.fna -x 0 -m 32 -o 20 -t 0 -k 3 -n 15 -p 1 -v 0 -b a12s_20130118_k3".


20130208

Started a new filtering analysis in /data4/surirella/surirella_mats/20130208 on sparc1 using assemblyPipeline.py. Will compare the output from this analysis to the output from the filtering steps in analysis 20130118 in order to verify that the new python code (assemblyPipeline.py) gives the same result as previous scripts did.

Data:

  • Data_1-1.fastq -> /data4/surirella/surirella_1/1_110324_A81BF6ABXX_Alg150bp_1.fastq
  • Data_1-2.fastq -> /data4/surirella/surirella_1/1_110324_A81BF6ABXX_Alg150bp_2.fastq
  • Data_2-1.fastq -> /data4/surirella/surirella_2/1_120821_BC118PACXX_Alg1_index2_1.fastq
  • Data_2-2.fastq -> /data4/surirella/surirella_2/1_120821_BC118PACXX_Alg1_index2_2.fastq

Result:

[2013-02-18] Cutadapt ran on the first infile, but was terminated prematurely (IOError: [Errno 28] No space left on device) when analysing the second file. Also the fastxtrimmer analysis of the second file was terminated. I have restart the analysis.

[2013-02-20] The pairSeq.py analysis of the second input file was terminated. Restarted the analysis of all files again.

[2013-03-01] Ran out of disk space again. Restarted the paitSeq.py analysis.

[2013-03-05] Started a "fastqc" analysis of the original four data files.

[2013-03-05] The resulting files of the "fastx_trimmer" analysis was removed previously in order to save disk space. However, these files are needed for the analyses named "20130304-1" and "20130304-2". I therefore rerun the "fastx_trimmer" analysis with the original settings.

[2013-03-16] I notis that the file " Data_2-2.FXT.CA.FQF.fastq.gz" is only 46 kb in size. Will therefore restart the "fastq_quality_filter" analysis of "Data_2-2.FXT.CA.fastq" after it has been decompressed.

Note: The "-a" flag is used with all thirteen adaptors for the cutadapt analysis.

Note: will also include "/data4/surirella/surirella_dna_3kb/qual_filt/output_fw_anyBothO10.fastq.t20.l25.trim.fastq.select.sorted.fastq" and "/data4/surirella/surirella_dna_3kb/qual_filt/output_rev_anyBothO10.fastq.t20.l25.trim.fastq.select.sorted.fastq" for scaffolding later on.


20130218

Started a new filtering analysis in /data4/surirella/surirella_mats/20130218 on sparc1 using assemblyPipeline.py. In this analysis, I'm testing different cutadapt options. Previous analyses have used the "-a" flag (remove adaptor sequences ligated to the 3' end of the sequence) with the 'TruSeq Universal Adapter', as well as with the twelve 'Index' adaptors. This time I'm using the "-g" flag (search for adaptors ligated to the 5' end of the sequence) with the Universal adaptor (to indicate that we are alos looking for adaptors in the 5' end) and then the "-b" flag (remove adapters that was ligated to the 5' or the 3' end of the sequence) with the universal adaptor and all twelve index adaptors. We have indications from SciLifeLab that it is a good approach to search for the 3' adaptors in both ends of our sequences.

Note: The "-g" flag did not work for some reason. Will therefore only use the "-b" flag.

Settings:

[fastx_trimmer]
f: 6
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450

Data:

  • *FXT.fastq files from the 20130208 analysis.

Result:

[2013-02-20] Ran out of disk space. Moved the directory /data4/surirella/surirella_mats/20130218 to /data7/surirella/surirella_mats/20130218, and restarted cutadapt analysis of the third input file.

[2013-03-01] The filtering analyses resulted in 125'681'790 paired sequences from the first dataset, and 215'093'134 from the second. In total 340'774'924 paired sequences. No non-paired sequences where produced!

[2013-03-11] Reran the "pairSeq.py" step with the new code that uses the correct input files and also produces "singlets" sequences.

[2013-03-12] Stated a "fastqc" analysis of "*.FXT.CA.FQF.fastq" files.

[2013-03-13] Started "fastqc" analyses of "*.FXT.CA.FQF.fastq", "*.FXT.CA.FQF.Pair.fastq" and "*.FXT.CA.FQF.Singles.fastq" files.

[2013-03-16] Transfered filtered files to Albiorix and started an assembly analysis.

[2013-03-17] The assembly analysis finished with an error message:

[mtop@compute-0-0 20130218]$ assemblyPipeline.py
[--] Running clc_assembler: ['clc_assembler', '--cpus', '8', '-o', '/state/partition4/mats/surirella/20130218/Surirella_20130218_novo.fa', '-p', 'fb', 'ss', '100', '450', '-q', '-i', 'Data_1-1.FXT.CA.FQF.Pair.fastq', 'Data_1-2.FXT.CA.FQF.Pair.fastq', 'Data_2-1.FXT.CA.FQF.Pair.fastq', 'Data_2-2.FXT.CA.FQF.Pair.fastq', '-p', 'no', '-q', 'Data_1-1.FXT.CA.FQF.Singles.fastq', 'Data_1-2.FXT.CA.FQF.Singles.fastq', 'Data_2-1.FXT.CA.FQF.Singles.fastq', 'Data_2-2.FXT.CA.FQF.Singles.fastq']
Error: odd number of sequences in paired file
[--] Running clc_mapper: ['clc_mapper', '--cpus', '8', '-o', u'/state/partition4/mats/surirella/20130218/Surirella_20130218_ref.cas', '-p', 'fb', 'ss', '100', '450', '-q', '-i', 'Data_1-1.FXT.CA.FQF.Pair.fastq', 'Data_1-2.FXT.CA.FQF.Pair.fastq', 'Data_2-1.FXT.CA.FQF.Pair.fastq', 'Data_2-2.FXT.CA.FQF.Pair.fastq', '-p', 'no', '-q', 'Data_1-1.FXT.CA.FQF.Singles.fastq', 'Data_1-2.FXT.CA.FQF.Singles.fastq', 'Data_2-1.FXT.CA.FQF.Singles.fastq', 'Data_2-2.FXT.CA.FQF.Singles.fastq', '-d', 'Surirella_20130218_novo.fa']
Problem opening database file: Surirella_20130218_novo.fa
[--] Running clc_mapping_info: ['clc_mapping_info', '-c', '-n', 'Surirella_20130218_ref.cas']
Error opening assembly file for reading
[mtop@compute-0-0 20130218]$

[2013-03-17] This is the same error message we have seen in the Littorina saxatilis analyses. There we suspected the problem could be due to name collisions between files from different sequencing analyses. I will test if there are any name collisions in the current Surirella brebissonii datasets. However, while I'm counting the unique names in the files, I'll also restart the analysis just to see if disk-space on the TMPDIR partition (580G free at the moment) could be the problem. The previous analysis was running at the same time as two other analyses ("20130304-1" and "-2") which could have filled up the tmp. directory.

[2013-03-17] The assembly analysis crashed again. The size of the tmp.-dir does not seem to be the problem.

[2013-03-19] Started an assembly analysis of the unfiltered files in "/state/partition4/mats/surirella/20130218/unfiltered_files". If this analysis works as expected, that would indicate that our filtering steps has introduced something in the files that causes CLC to crash.

[2013-03-19] Started an assembly of the filtered files, but excluded the singlets sequences, and ran the CLC command without the "assemblyPipeline.py" code. That did not work either.

[mtop@compute-0-0 20130218]$ clc_assembler --cpus 8 -o /state/partition4/mats/surirella/20130218/Surirella_20130218_novo.fa -p fb ss 100 450 -q -i Data_1-1.FXT.CA.FQF.Pair.fastq Data_1-2.FXT.CA.FQF.Pair.fastq Data_2-1.FXT.CA.FQF.Pair.fastq Data_2-2.FXT.CA.FQF.Pair.fastq                                                                       
Error: odd number of sequences in paired file
[mtop@compute-0-0 20130218]$

[2013-03-19] Analysed each dataset separately (Data_1-* files in one analysis and Data_2* files in another) to se if the problem lies in either of the two datasets. These analyses worked as expected. Hence, the combination of datasets seem to be part of the problem.

[2013-03-26] Stated a new assembly analysis after discussing the matter with Per Sikora who suggested I add the "-i" flag before EACH pair in the clc command.

clc_assembler --cpus 8 -o /state/partition4/mats/surirella/20130218/Surirella_20130218_novo.fa -p fb ss 100 450 -q -i Data_1-1.FXT.CA.FQF.Pair.fastq Data_1-2.FXT.CA.FQF.Pair.fastq -i Data_2-1.FXT.CA.FQF.Pair.fastq Data_2-2.FXT.CA.FQF.Pair.fastq -p no -q Data_1-1.FXT.CA.FQF.Singles.fastq Data_1-2.FXT.CA.FQF.Singles.fastq Data_2-1.FXT.CA.FQF.Singles.fastq Data_2-2.FXT.CA.FQF.Singles.fastq

[2013-03-27] Problem solved! The "-i" flag suggested by Per solved the problem. However, some analyses that worked earlier will have to be rerun, as there is a differens between the results of analyses using the "-i" flag for each pair, and those that don't.


20130304-1

Testing a new threshold value for the 3' trimming by cutadapt. We have changed the variable "q" from 15 to 20 in this analysis.

Settings:

[fastx_trimmer]
f: 6
[cutadapt]
q: 20
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450

Result:

[2013-03-06] Input files for the analysis was mistakenly removed earlier (see note from 2013-03-05 on analysis "20130208"). Restarted the analysis.

[2013-03-11] The "pairSeq.py" step crashed du to lack of disk space. Compressing files and have restarted the "pairSeq.py" step.

[2013-03-16] Transfered filtered files to Albiorix and started an assembly analysis.

[2013-03-17] The assembly analysis finished after approximately 13 hours (running on 8 cores). Removed the input files to save disk-space.


20130304-2

Testing the new threshold value for the 3' trimming (by cutadapt) together with a new value for the "p" variable used by "fastq_quality_filter". The "q" variable was changed from 15 to 20, and the "p" variable from 95 to 80. This variable indicates the percentage of the read (as an integer between 1 and 100) that must have a quality of at least k (20 in this analysis) in order for the sequence to be kept.

Settings:

[fastx_trimmer]
f: 6
[cutadapt]
q: 20
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 80
k: 20
[clc]
min_dist: 100
max_dist: 450

Result:

[2013-03-06] Input files for the analysis was mistakenly removed earlier (see note from 2013-03-05 on analysis "20130208"). Restarted the analysis.

[2013-03-12] Started the "pairSeq.py" step. Also ran "fastqc" on "*.FXT.CA.FQF.fastq" files. The "*.FXT.CA.fastq" files have already been gzipped due to disk space problems earlier.

[2013-03-13] Started "fastqc" analyses of "*.FXT.CA.FQF.Pair.fastq" and "*.FXT.CA.FQF.Singles.fastq" files.

[2013-03-16] Transfered filtered files to Albiorix and started an assembly analysis.

[2013-03-17] Assembly analysis finished after approximately 18 hours (running on 8 cores).


20130309

Started a new filtering/assembly analysis. This time I'm analysing the 300 library rna data in the files "1_110126_A816HRABXX_1.fastq" and "1_110126_A816HRABXX_2.fastq". Also started a "fastqc" analysis of the original files.

Settings:

[fastx_trimmer]
f: 6
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450

[2013-03-11] Started "fastqc" analyses of "*.FXT.fastq" and "*FXT.CA.FQF.fastq" files.

[2013-04-04] Ran the assembly analysis. Input files only contains 4'568'769 pairs, so the analysis finished in 30 min. when running on 32 cores.


20130315-1

'fastqc' analysis of the files "1_110126_A816HRABXX_1.fastq" and "1_110126_A816HRABXX_2.fastq" shows a strong sequence bias in the 5' ends of the sequences. In this analysis we are not doing any trimming of the 5' ends by not running the "fastx_trimmer" step.

Settings:

[fastx_trimmer]
f: 0
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450

[2013-04-04] Transfered the files to Albiorix and started an assembly analysis.


20130315-2

'fastqc' analysis of the files "1_110126_A816HRABXX_1.fastq" and "1_110126_A816HRABXX_2.fastq" shows a strong sequence bias in the 5' ends of the sequences. In this analysis we are therefore removing 13 nucleotides in the 5' in the "fastx_trimmer" step.

Settings:

[fastx_trimmer]
f: 13
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450

[2013-04-04] Transfered the files to Albiorix and started an assembly analysis.


20130402

Previous datasets have contained a lot of bacterial sequences. The latest batch is produced from cells grown with antibiotica in order to make the culture axenic.

Data:

  • /data4/surirella/surirella_3_ax/7_130308_AD1TAPACXX_P385_101F_dual10_1.fastq
  • /data4/surirella/surirella_3_ax/7_130308_AD1TAPACXX_P385_101F_dual10_2.fastq

Settings:

[fastx_trimmer]
f: 6
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450

[2013-04-07] Assembly analysis finished.