Programmed in Java (version 20); see the pom file for dependencies.
Main class: GUI.DataCheck
The main idea of this tool is to execute predefined SPARQL queries (rules) against a SPARQL endpoint to identify data quality issues (inconsistencies within the data, missing values, and outliers). The results of these queries are compiled in an Excel file, with one spreadsheet per query plus an overview spreadsheet. In the resulting Excel file, domain experts can comment on the status of each issue found, for example to mark it as not an error or to record the reason for an inconsistency or for missing data. This commented Excel file can be used as input for the next run of the data quality check, and the tool will retain a) the date an issue was first reported and b) the comments made by the domain experts.
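The retention step described above can be pictured as a merge between the previous, commented run and the current run. The following is only a minimal sketch under stated assumptions, not the tool's actual code: the class `IssueCarryOver`, the record `Issue`, its fields, and the choice of rule id plus resource URI as the matching key are all hypothetical, and reading or writing the Excel file is left out entirely.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: carries first-reported dates and expert comments
// from a previous run over to the issues found in the current run.
public class IssueCarryOver {

    // One reported issue. All field names are assumptions made for this sketch.
    public record Issue(String ruleId, String resourceUri,
                        LocalDate firstReported, String comment) {
        // Assumed identity of an issue: the rule that found it plus the affected resource.
        String key() {
            return ruleId + "|" + resourceUri;
        }
    }

    public static List<Issue> carryOver(List<Issue> previous, List<Issue> current) {
        // Index the previous run by key so each current issue can be looked up directly.
        Map<String, Issue> previousByKey = new HashMap<>();
        for (Issue issue : previous) {
            previousByKey.put(issue.key(), issue);
        }

        List<Issue> merged = new ArrayList<>();
        for (Issue issue : current) {
            Issue old = previousByKey.get(issue.key());
            if (old == null) {
                // New issue: keep the date and (empty) comment of the current run.
                merged.add(issue);
            } else {
                // Known issue: keep the original report date and the expert's comment.
                merged.add(new Issue(issue.ruleId(), issue.resourceUri(),
                        old.firstReported(), old.comment()));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Issue> previous = List.of(new Issue("missing-weight",
                "http://example.org/coin/1", LocalDate.of(2023, 1, 10), "known gap"));
        List<Issue> current = List.of(
                new Issue("missing-weight", "http://example.org/coin/1", LocalDate.now(), ""),
                new Issue("missing-weight", "http://example.org/coin/2", LocalDate.now(), ""));
        carryOver(previous, current).forEach(System.out::println);
    }
}
```

In this sketch, issues that no longer appear in the current run are simply dropped; whether the tool itself keeps a record of resolved issues is not stated here.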
We implemented the tool on top of RDF/SPARQL to make it easier to reuse. The tool can be used by anyone who makes their data available via a SPARQL endpoint. Groups sharing the same model can reuse and share their SPARQL queries. The tool comes with the rules we have written for our Corpus Nummorum coin data, based on the Nomisma.org ontology.
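To illustrate why any SPARQL endpoint is sufficient, the sketch below builds a plain HTTP GET request as defined by the SPARQL 1.1 Protocol (query passed in the `query` parameter, JSON results requested via the `Accept` header). It is an illustration only: the class name `SparqlRequestSketch`, the endpoint URL, and the generic "resource without label" rule are assumptions, not one of the rules shipped with the tool, and the tool itself may use a different library to talk to the endpoint.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: wraps a rule (a SPARQL SELECT query) into an HTTP request.
public class SparqlRequestSketch {

    // Illustrative missing-value rule: typed resources that have no rdfs:label.
    static final String MISSING_LABEL_RULE = """
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?resource WHERE {
              ?resource a ?type .
              FILTER NOT EXISTS { ?resource rdfs:label ?label }
            }
            """;

    public static HttpRequest buildQueryRequest(String endpoint, String sparql) {
        // The query text must be URL-encoded before it is placed in the query string.
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(endpoint + "?query=" + encoded))
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // The endpoint URL is a placeholder; replace it with a real SPARQL endpoint.
        HttpRequest request = buildQueryRequest("http://example.org/sparql", MISSING_LABEL_RULE);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The request could then be sent with `java.net.http.HttpClient`; because the rule is just a query string, groups that share the same data model can exchange rules without changing any code.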
The implementation of this tool was initiated to improve the quality of the data within the Corpus Nummorum (CN) project (https://www.corpus-nummorum.eu/).
The tool was supported by a number of theses:
- Modernisierung eines Legacy-Systems zur Datenqualitätsprüfung und Entwicklung eines Testdatenmanagement-Tools im Kontext von Linked Open Data (Bachelor Thesis) – Anna-Lena Buccoli (https://zenodo.org/record/8403298)
- Benutzerschnittstelle für die Erstellung von Datenqualitätsabfragen in SPARQL (Bachelor Thesis) – Elif Tugba Dichter and Vladyslav Matsuyev
- Information Propagation across Versions in Context of a SPARQL-Rules System (Bachelor Thesis) – Kateryna Kvasnytsia