Littorina saxatilis de novo genome project

Introduction

Coming soon.

Data

  • 2_120125_BC0B5UACXX_Littorins-p136_7_1-index_4_1.fastq - 103'209'945 sequences (Data_120125-1)
  • 2_120125_BC0B5UACXX_Littorins-p136_7_1-index_4_2.fastq - 103'209'945 sequences (Data_120125-2)
  • 2_120313_AC0GTUACXX_Littorins-p136_7_1_index4_1.fastq - 70'341'110 sequences (Data_120313-1)
  • 2_120313_AC0GTUACXX_Littorins-p136_7_1_index4_2.fastq - 70'341'110 sequences (Data_120313-2)
  • 7_120404_BD0MGRACXX_Littorins-p136_7_1_index4_1.fastq - 150'691'477 sequences (Data_120404-1)
  • 7_120404_BD0MGRACXX_Littorins-p136_7_1_index4_2.fastq - 150'691'477 sequences (Data_120404-2)
  • 1_120426_BD0VGKACXX_Littorins-p136_7_1_index4_1.fastq - 47'419'311 sequences (Data_120426-1)
  • 1_120426_BD0VGKACXX_Littorins-p136_7_1_index4_2.fastq - 47'419'311 sequences (Data_120426-2)
  • 1_120426_BD0VGKACXX_Littorins-p136_7_1_noindex_1.fastq - 140'580'810 sequences (Data_120426-3)
  • 1_120426_BD0VGKACXX_Littorins-p136_7_1_noindex_2.fastq - 140'580'810 sequences (Data_120426-4)

Code

The code written for this project is available at GitHub.

Assembly and filtering analyses

20130218
Data included in this analysis:

  • Data_120125-1.fastq
  • Data_120125-2.fastq
  • Data_120313-1.fastq
  • Data_120313-2.fastq
  • Data_120404-1.fastq
  • Data_120404-2.fastq
  • Data_120426-1.fastq
  • Data_120426-2.fastq
  • Data_120426-3.fastq
  • Data_120426-4.fastq

Result:
[2013-03-01] The pairSeq.py analysis terminated prematurely as there was no space left on the disk. Restarted the pairSeq.py analysis of all sequences after removing *FXT.fastq files and gzipping *.FXT.CA.fastq files.
[2013-03-05] The filtering analyses did not break up any pairs. Hence, we only have mate pair sequences and no singlets for the assembly.

  • Data_120125-1.Pair.fastq - 103'209'945 sequences
  • Data_120125-2.Pair.fastq - 103'209'945 sequences
  • Data_120313-1.Pair.fastq - 70'341'110 sequences
  • Data_120313-2.Pair.fastq - 70'341'110 sequences
  • Data_120404-1.Pair.fastq - 150'691'477 sequences
  • Data_120404-2.Pair.fastq - 150'691'477 sequences
  • Data_120426-1.Pair.fastq - 47'419'311 sequences
  • Data_120426-2.Pair.fastq - 47'419'311 sequences
  • Data_120426-3.Pair.fastq - 140'580'810 sequences
  • Data_120426-4.Pair.fastq - 140'580'810 sequences

[2013-03-05] Started an assembly using 32 cores at 15:55:53 CET.

[2013-03-05] Found a bug in "assemblyPipeline.py" which causes pairSeq.py to use the wrong input files. That is why no "singles" sequences were created. Restarted the "pairSeq.py" step of the pipeline.
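
For reference, the pairing step can be illustrated with a minimal sketch. This is not the code of pairSeq.py, which is not shown here. It assumes four-line FASTQ records, headers ending in "/1" and "/2", and data small enough to hold in memory. The function names (read_fastq, pair_key, split_pairs) are hypothetical.

```python
def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def pair_key(header, delimiter="/"):
    """Return the part of the header shared by both mates (everything before '/1' or '/2')."""
    return header.split(delimiter, 1)[0]


def split_pairs(forward, reverse, delimiter="/"):
    """Split two iterables of records into (pairs, singles).

    Assumption: everything fits in memory, so this only suits small test files.
    """
    rev = {pair_key(rec[0], delimiter): rec for rec in reverse}
    pairs, singles, matched = [], [], set()
    for rec in forward:
        key = pair_key(rec[0], delimiter)
        mate = rev.get(key)
        if mate is not None:
            pairs.append((rec, mate))
            matched.add(key)
        else:
            singles.append(rec)
    # Reverse reads whose forward mate was filtered out are singles as well.
    singles.extend(rec for key, rec in rev.items() if key not in matched)
    return pairs, singles
```

If the pairing step is given files that were never filtered, every read still has its mate and no singles are produced, which matches what was seen before the bug was found.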

[2013-03-05] Ran "fastqc" on the original files.

[2013-03-06] Started a "fastqc" analysis of the *Pair* and *Singles* files.

[2013-03-06] The assembly from "2013-03-05" terminated prematurely as the names of the input files were not correct. Have made changes to the pipeline so that correct names are given to output files. Started an assembly of the unfiltered original files. The result from this assembly will be compared with the output from the assembly of filtered and trimmed sequences.

[2013-03-07] The "clc-assembler" keeps crashing with the error message "Error: odd number of sequences in paired file". We have changed TMPDIR to a 1T partition, and will now try the analysis on fewer datasets, to see if it is the amount of data in this analysis that causes the problem. Restarted the analysis and only included the following files:

  • Data_120125-1.fastq
  • Data_120125-2.fastq
  • Data_120313-1.fastq
  • Data_120313-2.fastq
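
A check that can be done independently of CLC is whether the two files of a pair hold the same number of records. The sketch below is only an illustration and is not part of the pipeline. It assumes four-line FASTQ records (no wrapped sequence lines), and the function names are hypothetical.

```python
def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming exactly four lines per record."""
    lines = sum(1 for _ in handle)
    if lines % 4:
        raise ValueError("line count %d is not a multiple of 4" % lines)
    return lines // 4


def check_pair(path_1, path_2):
    """Return (counts_match, n_forward, n_reverse) for two paired FASTQ files."""
    with open(path_1) as f1, open(path_2) as f2:
        n1 = count_fastq_records(f1)
        n2 = count_fastq_records(f2)
    return n1 == n2, n1, n2
```

For example, check_pair("Data_120125-1.fastq", "Data_120125-2.fastq") should report equal counts if the files are intact.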

[2013-03-08] The assembly analysis started on four input files yesterday finished successfully after ~18 hours. The result is found in "/state/partition4/mats/littorina/20130218/unfiltered_files/assembly_four_datasets" on Albiorix node0. Restarted the analysis and this time included the following three datasets:

  • Data_120125-1.fastq
  • Data_120125-2.fastq
  • Data_120313-1.fastq
  • Data_120313-2.fastq
  • Data_120404-1.fastq
  • Data_120404-2.fastq

By including more and more data in the analysis, we are testing whether specific files are causing the problem, or whether it is the total amount of data that is causing CLC to crash.

[2013-03-08] The analysis crashed with the following error message:

[mtop@compute-0-0 unfiltered_files]$ assemblyPipeline.py
[--] Running clc_assembler: ['clc_assembler', '--cpus', '32', '-o', '/state/partition4/mats/littorina/20130218/unfiltered_files/Littorina_20130218_novo.out', '-p', 'fb', 'ss', '100', '500', '-q', '-i', 'Data_120125-1.FXT.CA.FQF.Pair.fastq', 'Data_120125-2.FXT.CA.FQF.Pair.fastq', 'Data_120313-1.FXT.CA.FQF.Pair.fastq', 'Data_120313-2.FXT.CA.FQF.Pair.fastq', 'Data_120404-1.FXT.CA.FQF.Pair.fastq', 'Data_120404-2.FXT.CA.FQF.Pair.fastq', '-p', 'no', '-q', 'Data_120125-1.FXT.CA.FQF.Singles.fastq', 'Data_120125-2.FXT.CA.FQF.Singles.fastq', 'Data_120313-1.FXT.CA.FQF.Singles.fastq', 'Data_120313-2.FXT.CA.FQF.Singles.fastq', 'Data_120404-1.FXT.CA.FQF.Singles.fastq', 'Data_120404-2.FXT.CA.FQF.Singles.fastq']
Error: odd number of sequences in paired file
[--] Running clc_mapper: ['clc_mapper', '--cpus', '32', '-o', u'/state/partition4/mats/littorina/20130218/unfiltered_files/Littorina_20130218_ref.out', '-p', 'fb', 'ss', '100', '500', '-q', '-i', 'Data_120125-1.FXT.CA.FQF.Pair.fastq', 'Data_120125-2.FXT.CA.FQF.Pair.fastq', 'Data_120313-1.FXT.CA.FQF.Pair.fastq', 'Data_120313-2.FXT.CA.FQF.Pair.fastq', 'Data_120404-1.FXT.CA.FQF.Pair.fastq', 'Data_120404-2.FXT.CA.FQF.Pair.fastq', '-p', 'no', '-q', 'Data_120125-1.FXT.CA.FQF.Singles.fastq', 'Data_120125-2.FXT.CA.FQF.Singles.fastq', 'Data_120313-1.FXT.CA.FQF.Singles.fastq', 'Data_120313-2.FXT.CA.FQF.Singles.fastq', 'Data_120404-1.FXT.CA.FQF.Singles.fastq', 'Data_120404-2.FXT.CA.FQF.Singles.fastq', '-d', 'Littorina_20130218_novo.out']
Problem opening database file: Littorina_20130218_novo.out
[--] Running clc_mapping_info: ['clc_mapping_info', '-c', '-n', 'Littorina_20130218_ref.out']
Error opening assembly file for reading
[mtop@compute-0-0 unfiltered_files]$

I will therefore include two new files instead, and restart the analysis. The new files are:

  • Data_120125-1.fastq
  • Data_120125-2.fastq
  • Data_120313-1.fastq
  • Data_120313-2.fastq
  • Data_120426-1.fastq
  • Data_120426-2.fastq

This analysis also crashed with the same error message as the previous analyses. Will try again, this time adding the last dataset to the analysis.

[2013-03-10] The analysis is still running and is approaching 95% complete. Datasets included in the analysis are:

  • Data_120125-1.fastq
  • Data_120125-2.fastq
  • Data_120313-1.fastq
  • Data_120313-2.fastq
  • Data_120426-3.fastq
  • Data_120426-4.fastq

[2013-03-11] Assembly analyses not including the datasets "Data_120404-1.fastq", "Data_120404-2.fastq", "Data_120426-1.fastq" and "Data_120426-2.fastq" are working as expected. Started an analysis of these four files to see if the problem lies in one of them.

This analysis crashed with the following error message:

[--] Running clc_assembler: ['clc_assembler', '--cpus', '16', '-o', '/state/partition4/mats/littorina/20130218/unfiltered_files/Littorina_20130218_novo.out', '-p', 'fb', 'ss', '100', '500', '-q', '-i', 'Data_120404-1.FXT.CA.FQF.Pair.fastq', 'Data_120404-2.FXT.CA.FQF.Pair.fastq', 'Data_120426-1.FXT.CA.FQF.Pair.fastq', 'Data_120426-2.FXT.CA.FQF.Pair.fastq', '-p', 'no', '-q', 'Data_120404-1.FXT.CA.FQF.Singles.fastq', 'Data_120404-2.FXT.CA.FQF.Singles.fastq', 'Data_120426-1.FXT.CA.FQF.Singles.fastq', 'Data_120426-2.FXT.CA.FQF.Singles.fastq']
Error: odd number of sequences in paired file

Hence, the problem seems to be with one of the following files.

  • Data_120404-1.fastq
  • Data_120404-2.fastq
  • Data_120426-1.fastq
  • Data_120426-2.fastq

[2013-03-11] I have started an assembly of the reads in the files "Data_120404-1.fastq" and "Data_120404-2.fastq", to figure out if the individual datasets are causing the problem, or if it is the combination of datasets that makes CLC quit.

[2013-03-12] The assembly of the files "Data_120404-1.fastq" and "Data_120404-2.fastq" started yesterday finished without any problems. Hence, I am starting to suspect that the problem is with the names of the sequences, since it is only the assembly of certain combinations of datasets that fails. Could it be that some of the sequences in different datasets have the same name?

To test this, I have concatenated the pair information from all headers in the files "Data_120125-1.fastq", "Data_120313-1.fastq" and "Data_120404-1.fastq" into one file (i.e. the part that looks like @HWI-ST167:2:1101:1685:2192#0, excluding the "/1" that gives the sequence direction).

[mtop@compute-0-0 unfiltered_files]$ grep "@HWI" Data_120125-1.fastq > combi_header_120125-1_120313-1_120404-1.txt
[mtop@compute-0-0 unfiltered_files]$ grep "@HWI" Data_120313-1.fastq >> combi_header_120125-1_120313-1_120404-1.txt
[mtop@compute-0-0 unfiltered_files]$ grep "@HWI" Data_120404-1.fastq >> combi_header_120125-1_120313-1_120404-1.txt
[mtop@compute-0-0 unfiltered_files]$ cut -f1 -d'/' combi_header_120125-1_120313-1_120404-1.txt > combi_header_120125-1_120313-1_120404-1_key.txt
[mtop@compute-0-0 unfiltered_files]$ wc -l combi_header_120125-1_120313-1_120404-1_key.txt
324242532
[mtop@compute-0-0 unfiltered_files]$ sort -u combi_header_120125-1_120313-1_120404-1_key.txt | wc -l
324259856
[mtop@compute-0-0 unfiltered_files]$

There are in total 324'242'532 headers (and sequences) in the three files. However, only 324'259'856 of these sequence names are unique, which means that 17'324 name collisions occur when the three sequence files are combined. This is probably the reason why the CLC mapper crashes.
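
The same comparison can be sketched in Python for a small sample of headers. This is an illustration only: the function names are hypothetical, and holding every name in a Counter would need far too much memory for the full files, where the sort-based approach above is the practical choice. The unique count can never exceed the total, and total minus unique gives the number of surplus (colliding) names.

```python
from collections import Counter


def pair_key(header, delimiter="/"):
    """Strip the trailing '/1' or '/2' so both mates share one key."""
    return header.rstrip("\n").split(delimiter, 1)[0]


def count_collisions(header_iterables, delimiter="/"):
    """Return (total, unique, duplicated) over several iterables of header lines.

    'duplicated' maps each name seen more than once to its number of occurrences.
    """
    counts = Counter()
    for headers in header_iterables:
        counts.update(pair_key(h, delimiter) for h in headers)
    total = sum(counts.values())
    unique = len(counts)
    duplicated = {name: n for name, n in counts.items() if n > 1}
    return total, unique, duplicated
```

Called on the forward-read headers of several datasets, the returned dictionary lists exactly which read names occur in more than one place.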

[2013-04-01] The problem has been solved (see Surirella brebissonii assembly 20130218). I have started a new assembly of the filtered file using the "clc_assembler" flags "-i" for each pair and "-e" to estimate fragment size.


20130404

Filtering the new 150 bp data.

Data:

  • /data2/littorina/littorina_150/6_130308_AD1TAPACXX_P386_101_dual9_1.fastq (Data_1)
  • /data2/littorina/littorina_150/6_130308_AD1TAPACXX_P386_101_dual9_2.fastq (Data_2)

Settings:

[fastx_trimmer]
f: 6
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20

[rawfiles]
# Names of fastq format input files, followed by quality score format [Q33, q33, Q64 or q64],
# and delimiting character between unique (pair)sequence and forward-reverse indicator.
# The latter can be excluded if it is a space or be indicated with ' ' or " ".
# Example:
# 1: 100000_1.fastq Q33 ' '
1: Data_120125-1.fastq Q64 /
2: Data_120125-2.fastq Q64 /
3: Data_120313-1.fastq Q64 /
4: Data_120313-2.fastq Q64 /
5: Data_120404-1.fastq Q64 /
6: Data_120404-2.fastq Q64 /
7: Data_120426-1.fastq Q64 /
8: Data_120426-2.fastq Q64 /
9: Data_120426-3.fastq Q64 /
10: Data_120426-4.fastq Q64 /
11: Data_130308-1.fastq Q33 " "
12: Data_130308-2.fastq q33 ' '
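
How the pipeline itself reads this settings file is not shown here. As a minimal sketch, the [rawfiles] entries could be parsed with Python's standard configparser and shlex modules. It assumes the file is valid INI syntax as shown above. The function name parse_rawfiles and the shortened SETTINGS sample are hypothetical.

```python
import configparser
import shlex

# Shortened sample in the same layout as the settings above.
SETTINGS = """\
[fastx_trimmer]
f: 6
[rawfiles]
# 1: 100000_1.fastq Q33 ' '
1: Data_120125-1.fastq Q64 /
11: Data_130308-1.fastq Q33 " "
12: Data_130308-2.fastq q33 ' '
"""


def parse_rawfiles(text):
    """Return a list of (index, filename, quality_format, delimiter) tuples."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["rawfiles"]
    entries = []
    for key in sorted(section, key=int):
        # shlex.split honours the quotes, so ' ' and " " both become a single space.
        fields = shlex.split(section[key])
        name, quality = fields[0], fields[1]
        # Assumption: a missing third field means the delimiter is a space.
        delimiter = fields[2] if len(fields) > 2 else " "
        entries.append((int(key), name, quality, delimiter))
    return entries
```

With the sample above, entry 1 gets "/" as delimiter, while entries 11 and 12 get a single space.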

[2013-04-12] Sending *Pair* files to Albiorix. Compressing *Singles* files on SPARC1. Files have to be renamed before the assembly.

[2013-04-13] Renamed files like "Data_1.FXT.CA.FQF.Pair.fastq" -> "Data_130308-1.FXT.CA.FQF.Pair.fastq" etc., and started a new assembly including all six datasets.