Fucus vesiculosus de novo genome project

Introduction

Coming soon

Data

  • 2_120706_BC0YYNACXX_4_indexm2_1.fastq - 169'696'580 sequences (Data_1).
  • 2_120706_BC0YYNACXX_4_indexm2_2.fastq - 169'696'580 sequences (Data_2).
  • 2_130111_BD1HWHACXX_P388_101_indexm2_1.fastq - 155'901'827 sequences (Data_3).
  • 2_130111_BD1HWHACXX_P388_101_indexm2_2.fastq - 155'901'827 sequences (Data_4).
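
The sequence counts listed above can be re-checked at any time, and the two files of a pair should always give the same number. A minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); the function name is hypothetical and not part of assemblyPipeline.py:

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming exactly four lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A partial record usually means the file was truncated.
        raise ValueError("line count %d is not a multiple of 4" % n_lines)
    return n_lines // 4


# Small in-memory demonstration; for real data, pass open(path) instead.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
demo_count = count_fastq_records(demo)
```

Running this on both files of a pair and comparing the results is a cheap sanity check before starting the filtering steps.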

Analyses

20130305

Settings:

[fastx_trimmer]
f: 6
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450
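
The settings block uses a "key: value" layout under section headers, which Python's standard configparser module can read directly. The sketch below only illustrates that; how assemblyPipeline.py actually reads its configuration is not documented here, and the function name is an assumption.

```python
import configparser

# The settings exactly as listed for this analysis.
SETTINGS = """\
[fastx_trimmer]
f: 6
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450
"""


def load_settings(text):
    """Parse the settings text into a {section: {key: value}} dictionary of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


settings = load_settings(SETTINGS)
# Values are strings until converted explicitly.
min_dist = int(settings["clc"]["min_dist"])
max_dist = int(settings["clc"]["max_dist"])
```

The two [clc] values correspond to the '100' and '450' that appear after 'ss' in the clc_assembler command line logged further down.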

[2013-03-05] Running a "fastqc" analysis of the original data (four files).

[2013-03-05] Started a filtering analysis on sparc1.

[2013-03-08] The analysis crashed with the following error message:

matsto@sparc1:/data6/fucus_mats/20130305$ assemblyPipeline.py
fastx_trimmer: Invalid quality score value (char '#' ord 35 quality value -29) on line 4
fastx_trimmer: Invalid quality score value (char '#' ord 35 quality value -29) on line 4
fastq_quality_filter: Premature End-Of-File (filename ='Data_3.FXT.CA.fastq')
fastq_quality_filter: Premature End-Of-File (filename ='Data_4.FXT.CA.fastq')
[--] Building initial dictionary of sequence id's in first file.
[--] Attempting memory garbage collection
[--] Comparing id's in second file to the dictionary.
[--] Comparing id's in first file to the dictionary
[--] Check if sequences in 'pair' files are in the same order
[op] 63948843 sequnce pairs are in order
Traceback (most recent call last):
  File "/home/mastto/bin/pairSeq.py", line 226, in <module>
    main()
  File "/home/mastto/bin/pairSeq.py", line 192, in main
    inFile1 = fastqFile(f1, 'r')                # First sequence file to read from
  File "/home/mastto/bin/pairSeq.py", line 103, in __init__
    file.__init__(self, name, mode)
IOError: [Errno 2] No such file or directory: 'Data_3.FXT.CA.FQF.fastq'
matsto@sparc1:/data6/fucus_mats/20130305$

Same problem here as with the Amphiura analysis: the wrong quality encoding was specified in the configuration file for datasets "Data_3" and "Data_4". Restarted the analysis of these two files.
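
The numbers in the fastx_trimmer error fit an offset mix-up: '#' is ASCII 35, and 35 - 64 = -29, so the tool was subtracting 64 from data that most likely uses an offset of 33. A minimal sketch of a check that could be run before filtering, assuming four-line FASTQ records. The function name and the cut-off of 59 are my own choices for illustration (characters below ASCII 59 do not occur in the offset-64 encodings), and a result of 64 is only a guess, since high-quality offset-33 data can look the same.

```python
import io


def guess_phred_offset(handle, max_records=100000):
    """Guess the quality offset (33 or 64) from the lowest quality character seen."""
    lowest = None
    for i, line in enumerate(handle):
        if i // 4 >= max_records:
            break
        if i % 4 == 3:  # the fourth line of each record holds the qualities
            qual = line.rstrip("\r\n")
            if qual:
                low = min(map(ord, qual))
                lowest = low if lowest is None else min(lowest, low)
    if lowest is None:
        raise ValueError("no quality lines found")
    # Anything below ASCII 59 cannot come from an offset-64 encoding.
    return 33 if lowest < 59 else 64


# The character from the error message, read with the wrong offset:
wrong_value = ord("#") - 64

sample = io.StringIO("@r1\nACGT\n+\n#II5\n")
offset = guess_phred_offset(sample)
```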

[2013-03-10] Running "fastqc" on the "*.FXT.fastq", "*.FXT.CA.fastq" and "*.FXT.CA.FQF.fastq" files.

[2013-03-11] Started an assembly analysis.

[2013-03-13] The assembly analysis has finished, albeit with some form of error/warning message.

[mtop@compute-0-0 20130305]$ assemblyPipeline.py
[--] Running clc_assembler: ['clc_assembler', '--cpus', '8', '-o', '/state/partition4/mats/fucus/20130305/Fucus_20130305_novo.out', '-p', 'fb', 'ss', '100', '450', '-q', '-i', 'Data_1.FXT.CA.FQF.Pair.fastq', 'Data_2.FXT.CA.FQF.Pair.fastq', 'Data_3.FXT.CA.FQF.Pair.fastq', 'Data_4.FXT.CA.FQF.Pair.fastq', '-p', 'no', '-q', 'Data_1.FXT.CA.FQF.Singles.fastq', 'Data_2.FXT.CA.FQF.Singles.fastq', 'Data_3.FXT.CA.FQF.Singles.fastq', 'Data_4.FXT.CA.FQF.Singles.fastq']
Progress:  100.0 %
[--] Running clc_mapper: ['clc_mapper', '--cpus', '8', '-o', u'/state/partition4/mats/fucus/20130305/Fucus_20130305_ref.out', '-p', 'fb', 'ss', '100', '450', '-q', '-i', 'Data_1.FXT.CA.FQF.Pair.fastq', 'Data_2.FXT.CA.FQF.Pair.fastq', 'Data_3.FXT.CA.FQF.Pair.fastq', 'Data_4.FXT.CA.FQF.Pair.fastq', '-p', 'no', '-q', 'Data_1.FXT.CA.FQF.Singles.fastq', 'Data_2.FXT.CA.FQF.Singles.fastq', 'Data_3.FXT.CA.FQF.Singles.fastq', 'Data_4.FXT.CA.FQF.Singles.fastq', '-d', 'Fucus_20130305_novo.out']
Problem closing assembly file
[--] Running clc_mapping_info: ['clc_mapping_info', '-c', '-n', 'Fucus_20130305_ref.out']
Unsupported scoring scheme, type 8
[mtop@compute-0-0 20130305]$

[2013-03-11] This probably happened because the disks had filled up with all analyses running simultaneously. I compressed files and restarted the analysis, which then finished without problems.
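
To avoid a repeat, free disk space could be checked before an assembly is started. A minimal sketch using the standard library; the function names and the safety factor of 3 are hypothetical, and the real space needed by clc_assembler and clc_mapper is not known from these notes.

```python
import os
import shutil


def estimate_required_bytes(input_paths, factor=3.0):
    """Rough space estimate: total input size times an assumed safety factor."""
    total = sum(os.path.getsize(path) for path in input_paths)
    return int(total * factor)


def has_free_space(directory, required_bytes):
    """Return True if the filesystem holding 'directory' has enough free bytes."""
    return shutil.disk_usage(directory).free >= required_bytes


# Example: check the current working directory against a requirement of zero
# bytes, which is trivially satisfied.
ok = has_free_space(os.getcwd(), 0)
```

In practice the input list would be the *.Pair.fastq and *.Singles.fastq files, and the directory would be the output location given to the assembler.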

[2013-03-16] Removed the filtered files to save space on Albiorix.


20130331

Settings:

[fastx_trimmer]
f: 6
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450

The previous assembly was done using the wrong flags with clc_assembler (see Surirella brebissonii assembly 20130218). I'm therefore doing a new assembly (with its own id number) of the filtered and sorted sequences from the 20130305 analysis.
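
The filtered and sorted sequences reused here are the *.Pair.fastq and *.Singles.fastq files named in the 20130305 assembly command. The sketch below shows one way such a split can be made, following the steps printed in the pairSeq.py log (build a dictionary of ids from the first file, then compare the second file against it). It is not the actual pairSeq.py code: the function names are hypothetical, read ids are assumed to be the first whitespace-separated token of the header with any "/1" or "/2" suffix removed, and everything is held in memory, which would need rethinking for files of this size.

```python
import io


def read_fastq(handle):
    """Yield (read_id, record_text) from a four-line-per-record FASTQ stream."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0].strip():
            return  # end of file
        if not lines[3]:
            raise ValueError("truncated FASTQ record")
        read_id = lines[0].split()[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id, "".join(lines)


def split_pairs(handle1, handle2):
    """Return (pairs, singles); pairs are in the order of the second file."""
    records1 = dict(read_fastq(handle1))  # id -> record from the first file
    pairs, singles, seen = [], [], set()
    for read_id, record2 in read_fastq(handle2):
        if read_id in records1:
            pairs.append((records1[read_id], record2))
            seen.add(read_id)
        else:
            singles.append(record2)
    # Reads from the first file whose mate was filtered away.
    singles.extend(rec for rid, rec in records1.items() if rid not in seen)
    return pairs, singles


file1 = io.StringIO("@a/1\nAC\n+\nII\n@b/1\nGG\n+\nII\n")
file2 = io.StringIO("@b/2\nTT\n+\nII\n@c/2\nCC\n+\nII\n")
pairs, singles = split_pairs(file1, file2)
```

In this toy input only read "b" survives in both files, so it forms the single pair, while "a" and "c" end up as singles.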


20130401

Settings:

[fastx_trimmer]
f: 6
[cutadapt]
q: 15
o: 10
e: 0.1
n: 1
m: 50
[fastq_quality_filter]
p: 95
k: 20
[clc]
min_dist: 100
max_dist: 450

Same analysis as "20130331", only this time I'm estimating fragment size for clc_assembler using the "-e" flag.