SUPER-SMART - Manipulating the GBIF database

Introduction

We are using the GBIF database as the source of biogeographical date for the SUPER-SMART (a.k.a. "neotropics") pipeline. The latest release of the database (GBIFPortalDB-2012-07-13) is 37G in size when compressed (gzip compression) and ~226G when uncompressed. Handeling a file that is more that 200G large can be very tedious and requires lots of RAM and preferably also fast hard disks. After using the database for a while, I have reached the conclusion that we actually only need some parts of the database, namely the tables "occurrence_record" and "taxon_name". The "occurrence_record" part of the mysql-dump file is 71G in size, and still quite challenging to handle. The "taxon_name" part is only XXG in size and should not be problematic to handle. Below follows the commands used to manipulate the database to extract the parts we are going to use for downstream analyses. Please note that the execution of all commands have been timed. The database have been manipulated on node0 of the Albiorix cluster (96G RAM, 48 cores AMD Opteron(tm) Processor 6168 1.9 GHz, 600 G SAS 10K RPM Hard Drive).

Downloading the GBIF MySQL dump file


[root@compute-0-0 tmp]# time wget ftp://***********:**********@ftp.gbif.org/GBIFPortalDB-2012-07-13.dump.gz
--2012-09-03 08:26:29--  ftp://***********:*password*@ftp.gbif.org/GBIFPortalDB-2012-07-13.dump.gz
           => `GBIFPortalDB-2012-07-13.dump.gz'
Resolving ftp.gbif.org... 130.226.238.136
Connecting to ftp.gbif.org|130.226.238.136|:21... connected.
Logging in as *********** ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD not needed.
==> SIZE GBIFPortalDB-2012-07-13.dump.gz ... 38716808728
==> PASV ... done.    ==> RETR GBIFPortalDB-2012-07-13.dump.gz ... done.
Length: 38716808728 (36G)

100%[===================================================================================>] 38,716,808,728 11.2M/s   in 56m 13s 

2012-09-03 09:22:42 (10.9 MB/s) - `GBIFPortalDB-2012-07-13.dump.gz' saved [38716808728]


real    56m13.573s
user    2m8.519s
sys     12m20.133s
[root@compute-0-0 tmp]# ls -lh
total 37G
-rw-r--r-- 1 root root 37G Sep  3 09:22 GBIFPortalDB-2012-07-13.dump.gz
[root@compute-0-0 tmp]# 

Copy the dump file between two hard disks


[root@compute-0-0 tmp]# time cp GBIFPortalDB-2012-07-13.dump.gz /state/partition2/matstopel/GBIFPortalDB-2012-07-13/

real    5m34.775s
user    0m2.898s
sys     2m1.236s

Find the "occurrence_record" data block in the dump file

[root@compute-0-0 GBIFPortalDB-2012-07-13]# time zcat GBIFPortalDB-2012-07-13.dump.gz | grep -n "DROP TABLE IF EXISTS"
22:DROP TABLE IF EXISTS `QRTZ_BLOB_TRIGGERS`;
46:DROP TABLE IF EXISTS `QRTZ_CALENDARS`;
69:DROP TABLE IF EXISTS `QRTZ_CRON_TRIGGERS`;
94:DROP TABLE IF EXISTS `QRTZ_FIRED_TRIGGERS`;
128:DROP TABLE IF EXISTS `QRTZ_JOB_DETAILS`;
159:DROP TABLE IF EXISTS `QRTZ_JOB_LISTENERS`;
183:DROP TABLE IF EXISTS `QRTZ_LOCKS`;
206:DROP TABLE IF EXISTS `QRTZ_PAUSED_TRIGGER_GRPS`;
228:DROP TABLE IF EXISTS `QRTZ_SCHEDULER_STATE`;
253:DROP TABLE IF EXISTS `QRTZ_SIMPLE_TRIGGERS`;
294:DROP TABLE IF EXISTS `QRTZ_TRIGGERS`;
334:DROP TABLE IF EXISTS `QRTZ_TRIGGER_LISTENERS`;
358:DROP TABLE IF EXISTS `agent`;
388:DROP TABLE IF EXISTS `bi_relation_tag`;
1640:DROP TABLE IF EXISTS `boolean_tag`;
1667:DROP TABLE IF EXISTS `catalogue_number`;
9938:DROP TABLE IF EXISTS `cell_country`;
9964:DROP TABLE IF EXISTS `cell_density`;
10426:DROP TABLE IF EXISTS `centi_cell_density`;
12952:DROP TABLE IF EXISTS `collection_code`;
13017:DROP TABLE IF EXISTS `common_name`;
13077:DROP TABLE IF EXISTS `country`;
13114:DROP TABLE IF EXISTS `country_name`;
13147:DROP TABLE IF EXISTS `data_provider`;
13197:DROP TABLE IF EXISTS `data_provider_agent`;
13225:DROP TABLE IF EXISTS `data_resource`;
13290:DROP TABLE IF EXISTS `data_resource_agent`;
13318:DROP TABLE IF EXISTS `entity_type`;
13342:DROP TABLE IF EXISTS `gbif_log_message`;
32308:DROP TABLE IF EXISTS `gbif_user`;
32336:DROP TABLE IF EXISTS `geo_mapping`;
32859:DROP TABLE IF EXISTS `geo_region`;
32897:DROP TABLE IF EXISTS `geographical_coverage_tag`;
32929:DROP TABLE IF EXISTS `gmba`;
33542:DROP TABLE IF EXISTS `identifier_record`;
53350:DROP TABLE IF EXISTS `image_record`;
54295:DROP TABLE IF EXISTS `index_data`;
54489:DROP TABLE IF EXISTS `institution_code`;
54607:DROP TABLE IF EXISTS `ip_country`;
54640:DROP TABLE IF EXISTS `link_record`;
55748:DROP TABLE IF EXISTS `lookup_agent_type`;
55771:DROP TABLE IF EXISTS `lookup_basis_of_record`;
55794:DROP TABLE IF EXISTS `lookup_cell_density_type`;
55817:DROP TABLE IF EXISTS `lookup_identifier_type`;
55840:DROP TABLE IF EXISTS `lookup_image_type`;
55863:DROP TABLE IF EXISTS `lookup_link_type`;
55886:DROP TABLE IF EXISTS `lookup_log_event_type`;
55909:DROP TABLE IF EXISTS `lookup_quad_relation_entity`;
55932:DROP TABLE IF EXISTS `lookup_relationship_assertion_type`;
55955:DROP TABLE IF EXISTS `lookup_remote_concept_remote_id`;
55978:DROP TABLE IF EXISTS `namespace_mapping`;
56005:DROP TABLE IF EXISTS `network_membership`;
56033:DROP TABLE IF EXISTS `number_tag`;
56062:DROP TABLE IF EXISTS `occurrence_record`;
129093:DROP TABLE IF EXISTS `property_store_namespace`;
129117:DROP TABLE IF EXISTS `quad_relation_tag`;
129177:DROP TABLE IF EXISTS `rank`;
129202:DROP TABLE IF EXISTS `raw_occurrence_record`;
276457:DROP TABLE IF EXISTS `registration_login`;
276484:DROP TABLE IF EXISTS `relationship_assertion`;
276593:DROP TABLE IF EXISTS `remote_concept`;
277956:DROP TABLE IF EXISTS `resource_access_point`;
277996:DROP TABLE IF EXISTS `resource_country`;
278025:DROP TABLE IF EXISTS `resource_network`;
278065:DROP TABLE IF EXISTS `resource_rank`;
278096:DROP TABLE IF EXISTS `rollover`;
278120:DROP TABLE IF EXISTS `stats_country_contribution`;
278150:DROP TABLE IF EXISTS `stats_participant_contribution`;
278180:DROP TABLE IF EXISTS `stats_participant_data_usage`;
278209:DROP TABLE IF EXISTS `string_tag`;
278242:DROP TABLE IF EXISTS `tag`;
278269:DROP TABLE IF EXISTS `taxon_concept`;
281271:DROP TABLE IF EXISTS `taxon_country`;
281382:DROP TABLE IF EXISTS `taxon_name`;
282815:DROP TABLE IF EXISTS `temporal_coverage_tag`;
282844:DROP TABLE IF EXISTS `typification_record`;

real	44m59.206s
user	44m4.501s
sys	5m27.030s
[root@compute-0-0 GBIFPortalDB-2012-07-13]# 

Extract "occurrence_record" and "taxon_name" into separate dump files


[root@compute-0-0 GBIFPortalDB-2012-07-13]# time zcat GBIFPortalDB-2012-07-13.dump.gz | sed -n '56062,129092p' > occurrence_record.sql

real	161m20.079s
user	42m0.063s
sys	121m34.890s

[root@compute-0-0 GBIFPortalDB-2012-07-13]# time zcat GBIFPortalDB-2012-07-13.dump.gz | sed -n '281382,282814p' > taxon_name.sql

Read the "occurrence_record" dump file into MySQL

The dump file and the MySQL database are located on separate hard disks, and hopefully this will speed up this step a little.

[root@compute-0-0 GBIFPortalDB-2012-07-13]# time mysql -u root -p gbif 

This last step took almost sixteen days!