Dump data for Virtuoso as ntriples instead of RDF/XML #3

@wu-lee

Description

Problem

The current choice of writing a big combined RDF file in RDF/XML format is incredibly slow in the case of large datasets like Co-ops UK.

This makes backlogs of runs more likely (where two runs collide), and can result in server load that could interfere with the Property Boundaries Server's job (which is also memory- and CPU-heavy).

Suggested Resolution

Adjust the se_open_data library to write in NTriples or possibly TTL format instead. This isn't totally trivial, because subsequent steps in the chain (deployment to the triplestores) need to be adjusted to consume that format. It's possible downstream users might care about the change too, but my guess is we don't have any.
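
For illustration, a minimal sketch of what the NTriples output step might look like, assuming the combined triples are already held in an RDF.rb `RDF::Graph` (the profiling below suggests se_open_data already sits on top of these gems). The method name, `graph` argument and output path here are placeholders, not se_open_data's actual API:

```ruby
require 'rdf'
require 'rdf/ntriples'

# Sketch only: serialise an in-memory RDF::Graph as N-Triples.
# N-Triples is line-based, so the writer can stream statements out
# without building a document tree first, unlike the RDF/XML writer.
def write_ntriples(graph, path = "all.nt")
  RDF::NTriples::Writer.open(path) do |writer|
    writer << graph
  end
end
```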

Possible alternatives

Since the big datasets that experience this (dotcoop, coops-uk and, to a lesser extent, workers-coop) don't currently use SPARQL queries, perhaps the large RDF dump doesn't need to be produced at all.

The problem with that is:

  • Within the seod generate step it is (or was, until recently) not possible to pick and choose which data it generated.
    • This is now possible, following the addition of the facility for exporting Murmurations data.
    • However, the later seod triplestore step expects the large RDF dump to be there, so either some stub would need to be created to stop that step failing, or the process amended to make that step optional, which in turn requires modifying se_open_data in its own right.
  • It would mean that our static linked-data and HTML files wouldn't be generated, so the lod.coop links for these datasets would not resolve to anything.

Additional context

Originally posted by @wu-lee in #109

The current choice of writing a big combined RDF file in RDF/XML format is incredibly slow in the case of large datasets like Co-ops UK. We need to adjust the se_open_data library to write in NTriples or possibly TTL format. This isn't totally trivial, because subsequent steps in the chain (deployment of the triplestores) need to be adjusted to consume that. It's possible downstream users might care about that change too, but my guess is we don't have any.

In the CUK dataset, there are about 7k initiatives, the RDF/XML output all.rdf is ~12MB, and it literally takes hours to write the file every time the dataset is deployed. Inserting some print statements suggests that writing out all the individual initiative files is quite slow (of the order of minutes), and that it takes another order of minutes to combine these in memory into a unified set of triples... but then writing out does something incredibly compute-intensive, and everything stops and waits in some internal part of the RDF/XML serialiser for the remaining tens to hundreds of minutes.

Inserting a require "profile" at the start of the SeOpenData::Initiative::Collection::RDF#save_one_big_rdfxml method which writes the data produces the attached [profiling table][1]. (Caveat: this was run with a bunch of print traces still included and SEA_LOG_LEVEL=debug set.) This suggests a lot of URI comparison and escaping goes on, and that the serialiser is built on top of RDF::NTriples.
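
For reference, a sketch of that profiling hack, using only Ruby's stdlib profiler: requiring `profile` switches the flat profiler on from that point onwards, and the table is printed to stderr when the process exits. The method signature shown is illustrative, not se_open_data's actual one:

```ruby
# Sketch only: enable the stdlib profiler just before the slow serialisation.
def save_one_big_rdfxml(outfile)
  require "profile"   # profiling starts here; report printed at process exit
  # ... existing RDF/XML serialisation code runs here, now under the profiler ...
end
```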

Whereas hacking this method to write NTriples dumps the data in seconds (although admittedly the result is 20MB).

Writing in TTL mode, by comparison, is moderately slow - of the order of a small number of minutes for serialisation, after all the triple unification. But not glacially slow like RDF/XML. The output file size is 9MB (and could probably be smaller if it used more abbreviations).
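
For comparison, a sketch of the Turtle variant, again assuming the same in-memory `RDF::Graph`. Passing a `:prefixes` map to RDF.rb's Turtle writer is what lets it abbreviate URIs to prefixed names, which is where most of the size saving over N-Triples comes from; the prefixes shown are placeholders rather than the vocabularies se_open_data actually uses:

```ruby
require 'rdf'
require 'rdf/turtle'

# Sketch only: serialise the graph as Turtle, abbreviating URIs via prefixes.
def write_turtle(graph, path = "all.ttl")
  RDF::Turtle::Writer.open(path, prefixes: {
    ex:   "http://example.com/",   # placeholder prefix
    rdfs: RDF::RDFS.to_uri         # built-in RDFS vocabulary from the rdf gem
  }) do |writer|
    writer << graph
  end
end
```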
