Description
Problem
The current choice of writing a big combined RDF file in RDF/XML format is incredibly slow in the case of large datasets like Co-ops UK.
This makes backlogs of runs more likely (where two runs collide), and can generate server load that interferes with the Property Boundaries Server's job (which is also memory- and CPU-heavy).
Suggested Resolution
Adjust the se_open_data library to write in NTriples or possibly TTL (Turtle) format. This isn't totally trivial because subsequent steps in the chain (deployment of the triplestores) need to be adjusted to consume that. It's possible downstream users might care about that change too, but my guess is we don't have any.
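As a rough illustration of what the change might look like, here is a minimal sketch assuming se_open_data continues to use the RDF.rb gems (`rdf`, `rdf-turtle`) to build an `RDF::Graph`; the file names, the example triple and the prefix mapping are placeholders, not the library's actual data:

```ruby
require 'rdf'
require 'rdf/ntriples'   # NTriples support ships with the core rdf gem
require 'rdf/turtle'     # Turtle support comes from the rdf-turtle gem

# Build (or obtain) the combined graph of all initiatives.
graph = RDF::Graph.new
graph << [RDF::URI("https://example.com/initiative/1"),
          RDF::URI("http://purl.org/dc/terms/title"),
          "Example initiative"]

# NTriples: one triple per line, no prefix handling, so it streams out quickly.
RDF::NTriples::Writer.open("all.nt") { |writer| writer << graph }

# Turtle: abbreviates URIs via prefixes, giving a smaller file at some CPU cost.
RDF::Turtle::Writer.open("all.ttl", prefixes: { ex: "https://example.com/" }) do |writer|
  writer << graph
end
```

The downstream triplestore deployment would then need to load `.nt` or `.ttl` files instead of `all.rdf`.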
Possible alternatives
Since the big datasets that experience this (dotcoop, coops-uk and, to a lesser extent, workers-coop) don't currently use SPARQL queries, perhaps the large RDF dump doesn't need to be generated at all.
The problems with that are:
- Within the `seod generate` step it is (or was until recently) not possible to pick and choose what data it generated.
  - This is now possible following the addition of the facility to support exporting murmurations data.
- However, the later `seod triplestore` step expects the large RDF dump to be there, so either some stub would need to be created to stop this later step failing (see the sketch after this list), or the process amended to make the later step optional, which also requires modification of se_open_data in its own right.
- It would mean that our static linked-data and HTML files wouldn't be generated, so the lod.coop links for these datasets would not resolve to anything.
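If the stub route were taken, the simplest form might be an empty but well-formed RDF/XML document dropped where `seod triplestore` expects the dump. This is only a sketch; the output path is a guess and would depend on how seod lays out its generated-data directory:

```ruby
# A syntactically valid but empty RDF/XML document, written to the path the
# deployment step expects, so that the triplestore step has something to load.
EMPTY_RDFXML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
XML

File.write("generated-data/all.rdf", EMPTY_RDFXML)  # placeholder path
```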
Additional context
Originally posted by @wu-lee in #109
The current choice of writing a big combined RDF file in RDF/XML format is incredibly slow in the case of large datasets like Co-ops UK. We need to adjust the se_open_data library to write in NTriples or possibly TTL format. This isn't totally trivial because subsequent steps in the chain (deployment of the triplestores) need to be adjusted to consume that. It's possible downstream users might care about that change too, but my guess is we don't have any.
In the CUK dataset there are about 7k initiatives, the RDF/XML output `all.rdf` is ~12MB, and it takes literally hours to write the file every time the dataset is deployed. Inserting some print statements suggests that it's quite slow to write all the individual initiative files out (order of minutes), and then another order of minutes to combine these in memory to make a unified set of triples... but then the write-out does something incredibly computing-intensive, and everything stops and waits in some internal part of the RDF/XML serialiser for the remaining tens to hundreds of minutes.
Inserting a `require "profile"` at the start of the `SeOpenData::Initiative::Collection::RDF#save_one_big_rdfxml` method which writes the data produces the attached [profiling table][1]. (Caveat: this was with a bunch of print traces still included and `SEA_LOG_LEVEL=debug` set.) The profile suggests a lot of URI comparison and escaping goes on, and that the serialiser is built on top of RDF::NTriples. Whereas hacking this method to write NTriples dumps the data in seconds (although admittedly the result is 20MB).
Writing in TTL mode, by comparison, is moderately slow: on the order of a few minutes for serialisation, following all the triple unification, but not glacially slow like RDF/XML. The output file size is 9MB (and could probably be smaller if it used more abbreviations).
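The relative costs described above can be reproduced with a rough benchmark along these lines, timing `RDF::Enumerable#dump` for each format on a loaded graph. This is a sketch assuming the `rdf`, `rdf-turtle` and `rdf-rdfxml` gems are available, and the input file name is just an example:

```ruby
require 'benchmark'
require 'rdf'
require 'rdf/ntriples'
require 'rdf/turtle'
require 'rdf/rdfxml'

# Load a previously generated dump to get a realistically sized graph.
graph = RDF::Graph.load("all.nt")

Benchmark.bm(10) do |bm|
  bm.report("ntriples:") { graph.dump(:ntriples) }
  bm.report("turtle:")   { graph.dump(:turtle) }
  bm.report("rdfxml:")   { graph.dump(:rdfxml) }
end
```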