Dump data for Virtuoso as ntriples instead of RDF/XML #3

@wu-lee

Description

Problem

The current choice of writing a big combined RDF file in RDF/XML format is incredibly slow in the case of large datasets like Co-ops UK.

This makes backlogs of runs more likely (where two runs collide), and can result in server load that could interfere with the Property Boundaries Server's job (which is also memory- and CPU-heavy).

Suggested Resolution

Adjust the se_open_data library to write in NTriples or possibly TTL format instead. This isn't totally trivial, because subsequent steps in the chain (deployment to the triplestores) need to be adjusted to consume that format. It's possible downstream users might care about the change too, but my guess is we don't have any.
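
For illustration, a minimal sketch of what the NTriples output step might look like, assuming the combined triples are already held in an RDF.rb `RDF::Graph` (the profiling below suggests se_open_data already sits on top of these gems). The method name, `graph` argument and output path here are placeholders, not se_open_data's actual API:

```ruby
require 'rdf'
require 'rdf/ntriples'

# Sketch only: serialise an in-memory RDF::Graph as N-Triples.
# N-Triples is line-based, so the writer can stream statements out
# without building a document tree first, unlike the RDF/XML writer.
def write_ntriples(graph, path = "all.nt")
  RDF::NTriples::Writer.open(path) do |writer|
    writer << graph
  end
end
```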

Possible alternatives

Since the big datasets that experience this (dotcoop, coops-uk and, to a lesser extent, workers-coop) don't currently use SPARQL queries, perhaps the large RDF dump doesn't need to be produced at all.

The problem with that is:

  • Within the seod generate step it is (or was, until recently) not possible to pick and choose which data it generated.
    • This is now possible, following the addition of the facility for exporting Murmurations data.
    • However, the later seod triplestore step expects the large RDF dump to be there, so either some stub would need to be created to stop that step failing, or the process amended to make that step optional, which in turn requires modifying se_open_data in its own right.
  • It would mean that our static linked-data and HTML files wouldn't be generated, so the lod.coop links for these datasets would not resolve to anything.

Additional context

Originally posted by @wu-lee in #109

The current choice of writing a big combined RDF file in RDF/XML format is incredibly slow in the case of large datasets like Co-ops UK. We need to adjust the se_open_data library to write in NTriples or possibly TTL format. This isn't totally trivial, because subsequent steps in the chain (deployment of the triplestores) need to be adjusted to consume that. It's possible downstream users might care about that change too, but my guess is we don't have any.

In the CUK dataset, there are about 7k initiatives, the RDF/XML output all.rdf is ~12MB, and it literally takes hours to write the file every time the dataset is deployed. Inserting some print statements suggests that writing out all the individual initiative files is quite slow (of the order of minutes), and that it takes another order of minutes to combine these in memory into a unified set of triples... but then writing out does something incredibly compute-intensive, and everything stops and waits in some internal part of the RDF/XML serialiser for the remaining tens to hundreds of minutes.

Inserting a require "profile" at the start of the SeOpenData::Initiative::Collection::RDF#save_one_big_rdfxml method which writes the data produces the attached [profiling table][1]. (Caveat: this was run with a bunch of print traces still included and SEA_LOG_LEVEL=debug set.) This suggests a lot of URI comparison and escaping goes on, and that the serialiser is built on top of RDF::NTriples.
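
For reference, a sketch of that profiling hack, using only Ruby's stdlib profiler: requiring `profile` switches the flat profiler on from that point onwards, and the table is printed to stderr when the process exits. The method signature shown is illustrative, not se_open_data's actual one:

```ruby
# Sketch only: enable the stdlib profiler just before the slow serialisation.
def save_one_big_rdfxml(outfile)
  require "profile"   # profiling starts here; report printed at process exit
  # ... existing RDF/XML serialisation code runs here, now under the profiler ...
end
```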

Whereas hacking this method to write NTriples dumps the data in seconds (although admittedly the result is 20MB).

Writing in TTL mode, by comparison, is moderately slow - of the order of a small number of minutes for serialisation, after all the triple unification. But not glacially slow like RDF/XML. The output file size is 9MB (and could probably be smaller if it used more abbreviations).
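
For comparison, a sketch of the Turtle variant, again assuming the same in-memory `RDF::Graph`. Passing a `:prefixes` map to RDF.rb's Turtle writer is what lets it abbreviate URIs to prefixed names, which is where most of the size saving over N-Triples comes from; the prefixes shown are placeholders rather than the vocabularies se_open_data actually uses:

```ruby
require 'rdf'
require 'rdf/turtle'

# Sketch only: serialise the graph as Turtle, abbreviating URIs via prefixes.
def write_turtle(graph, path = "all.ttl")
  RDF::Turtle::Writer.open(path, prefixes: {
    ex:   "http://example.com/",   # placeholder prefix
    rdfs: RDF::RDFS.to_uri         # built-in RDFS vocabulary from the rdf gem
  }) do |writer|
    writer << graph
  end
end
```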
