Dictionary Encoding Implementation for RDF Term Storage #5
Pull Request Overview
This PR adds dictionary-based encoding for RDF terms to reduce storage overhead. The dictionary maps URI strings and literal values to numeric IDs, allowing the system to store compact 8-byte IDs instead of full strings in log records.
- Implements a `Dictionary` struct with bidirectional URI↔ID mapping and file persistence
- Adds `ResolvedEvent` type and resolution methods to convert ID-based events back to URI strings
- Extends sparse index functionality to integrate with dictionary-based encoding
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/indexing/dictionary.rs | New module implementing bidirectional dictionary mapping with persistence |
| src/indexing/shared.rs | Adds ResolvedEvent struct and Event::resolve() method for ID-to-URI conversion |
| src/indexing/sparse.rs | Adds dictionary integration methods and comprehensive documentation |
| src/indexing/mod.rs | Exports the new dictionary module |
| tests/dictionary_encoding_test.rs | Comprehensive test suite covering dictionary operations, RDF encoding, and integration scenarios |
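To illustrate the ID-to-URI conversion described for `src/indexing/shared.rs`, here is a minimal sketch of how an `Event::resolve()` method could produce a `ResolvedEvent`. Only the names `Event`, `ResolvedEvent`, `Event::resolve()` and `Dictionary` come from this PR; the field names, the 8-byte ID type, and the `insert`/`decode` helpers are assumptions made for the example, not the actual code under review.

```rust
use std::collections::HashMap;

/// ID-based event as it might be stored in a log record (field names assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub timestamp: u64,
    pub subject: u64,
    pub predicate: u64,
    pub object: u64,
    pub graph: u64,
}

/// The same event with every ID replaced by its string form.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedEvent {
    pub timestamp: u64,
    pub subject: String,
    pub predicate: String,
    pub object: String,
    pub graph: String,
}

/// Hypothetical stand-in for the real dictionary: ID -> URI direction only.
pub struct Dictionary {
    id_to_uri: HashMap<u64, String>,
}

impl Dictionary {
    pub fn new() -> Self {
        Dictionary { id_to_uri: HashMap::new() }
    }

    /// Registers `uri` under a caller-chosen ID (assumed helper).
    pub fn insert(&mut self, id: u64, uri: &str) {
        self.id_to_uri.insert(id, uri.to_string());
    }

    /// Looks up the string for `id`, if it is known.
    pub fn decode(&self, id: u64) -> Option<&str> {
        self.id_to_uri.get(&id).map(String::as_str)
    }
}

impl Event {
    /// Converts IDs back to strings; returns `None` if any ID is unknown.
    pub fn resolve(&self, dict: &Dictionary) -> Option<ResolvedEvent> {
        Some(ResolvedEvent {
            timestamp: self.timestamp,
            subject: dict.decode(self.subject)?.to_string(),
            predicate: dict.decode(self.predicate)?.to_string(),
            object: dict.decode(self.object)?.to_string(),
            graph: dict.decode(self.graph)?.to_string(),
        })
    }
}
```

Returning `Option` is one possible design; the real method may instead return a `Result` or fall back to a placeholder string for unknown IDs.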
src/indexing/shared.rs
Outdated
| } | ||
|
|
||
| #[derive(Debug, Clone)] | ||
| pub struct ResolvedEvent{ |
Copilot
AI
Nov 5, 2025
Missing space before opening brace. Should be `pub struct ResolvedEvent {` to follow Rust style conventions.
| pub struct ResolvedEvent{ | |
| pub struct ResolvedEvent { |
src/indexing/sparse.rs
Outdated
| @@ -1,20 +1,45 @@ | |||
| use crate::indexing::shared::{decode_record, Event, RECORD_SIZE}; | |||
| use crate::indexing::dictionary::{self, Dictionary}; | |||
Copilot
AI
Nov 5, 2025
The `self` import (`dictionary::{self, Dictionary}`) is unused. Only `Dictionary` is imported and used in the code. Remove `self` from the import: `use crate::indexing::dictionary::Dictionary;`
| use crate::indexing::dictionary::{self, Dictionary}; | |
| use crate::indexing::dictionary::Dictionary; |
src/indexing/dictionary.rs
Outdated
| file.read_exact(&mut count_bytes)?; | ||
| let count = u64::from_be_bytes(count_bytes); | ||
|
|
||
| // Reading each IRI Entry |
Copilot
AI
Nov 5, 2025
Comment uses 'IRI' but the code and rest of the codebase consistently refer to 'URI'. Change to 'URI Entry' for consistency.
| // Reading each IRI Entry | |
| // Reading each URI Entry |
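For context on the persistence code this comment touches, the following is a rough sketch of reading and writing a count-prefixed list of URI entries. The big-endian `u64` entry count mirrors the reviewed snippet; everything else (the function names `write_entries` and `read_entries`, the 4-byte length prefix per entry, UTF-8 payloads) is an assumption and may differ from the actual file layout in `dictionary.rs`.

```rust
use std::io::{self, Read, Write};

/// Writes a big-endian u64 entry count, then each URI as
/// (big-endian u32 byte length, UTF-8 bytes). Layout is assumed.
pub fn write_entries<W: Write>(mut out: W, uris: &[String]) -> io::Result<()> {
    out.write_all(&(uris.len() as u64).to_be_bytes())?;
    for uri in uris {
        out.write_all(&(uri.len() as u32).to_be_bytes())?;
        out.write_all(uri.as_bytes())?;
    }
    Ok(())
}

/// Reads back what `write_entries` produced.
pub fn read_entries<R: Read>(mut input: R) -> io::Result<Vec<String>> {
    let mut count_bytes = [0u8; 8];
    input.read_exact(&mut count_bytes)?;
    let count = u64::from_be_bytes(count_bytes);

    let mut uris = Vec::new();
    // Reading each URI entry
    for _ in 0..count {
        let mut len_bytes = [0u8; 4];
        input.read_exact(&mut len_bytes)?;
        let len = u32::from_be_bytes(len_bytes) as usize;

        let mut raw = vec![0u8; len];
        input.read_exact(&mut raw)?;
        // Reject payloads that are not valid UTF-8 instead of panicking.
        let uri = String::from_utf8(raw)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        uris.push(uri);
    }
    Ok(uris)
}
```

Because both functions are generic over `Read` and `Write`, the same code can target a `File` in production and an in-memory buffer in tests.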
@copilot open a new pull request to apply changes based on the comments in this thread
@argahsuknesib I've opened a new pull request, #6, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: argahsuknesib <87450516+argahsuknesib@users.noreply.github.com>
Apply code review feedback: fix formatting and remove unused imports
…h additional fields
…indexing strategies
…r improved insights on indexing strategies
…tructures for efficient event handling and querying
- Introduced a new core module for data structures and types, including Event and RDFEvent.
- Updated the storage module to include a new indexing structure with dense and sparse indexing capabilities.
- Implemented a user-friendly API for writing and querying RDF events in StreamingSegmentedStorage.
- Added benchmarks for RDF segmented storage, including writing and reading performance tests.
- Created a dictionary for encoding and decoding RDF URIs to numeric IDs, improving storage efficiency.
- Enhanced the dense and sparse indexing mechanisms to support efficient querying of RDF events.
- Added comprehensive tests for the dictionary and encoding/decoding functionality.
…king for dense and sparse indexing
…event handling; update StreamingConfig for batch processing parameters
- Changed ID types in Dictionary from u64 to u32 for memory efficiency.
- Updated encode and decode methods to reflect the new ID type.
- Adjusted tests to use the new encoding/decoding methods.
- Modified memory_tracker to track memory usage with detailed statistics.
- Added MemoryTracker struct for monitoring memory usage during runtime.
- Implemented methods for recording, retrieving, and resetting memory measurements.
- Enhanced segmented storage to utilize Rc for dictionary management.
- Updated utility functions to include a new StorageComponentSizes struct for memory breakdown.
…xing for improved performance and clarity
- Deleted the basic example demonstrating Janus RDF Stream Processing Engine.
- Removed the comprehensive benchmark script for testing Dense vs Sparse indexing approaches.
- Refactored `main.rs` to clean up print statements and improve readability.
- Updated `dictionary.rs` to simplify logging in tests.
- Corrected file naming in `segmented_storage.rs` for index files.
- Enhanced the `load_index_directory_from_file` function to reconstruct index blocks accurately.
- Added new examples for point query and range query benchmarks, focusing on realistic IoT sensor data.
- Implemented a realistic RDF benchmark for IoT sensor observations, analyzing write and read performance.
… clean up whitespace and formatting
…itories; refactor linting warnings in lib.rs and remove unused import in memory_tracker.rs
…egex variable names and clean up test assertions
Overview
This PR introduces dictionary encoding for the Janus RDF stream processing engine, cutting the size of each RDF event by roughly 40% (from 40+ bytes to 24 bytes) through URI-to-ID mapping. Combined with streaming segmented storage and batch buffering, this enables high-throughput RDF stream processing with efficient memory usage.
Problem Statement
RDF streams consume significant memory when storing complete URI strings for subjects, predicates, objects, and graph identifiers. Each 4-tuple RDF event typically requires 40 bytes or more. For high-volume IoT sensor data streams, this becomes a critical bottleneck.
Solution Overview
Dictionary Encoding System
A dictionary-based compression scheme maps URIs to u32 identifiers:
This reduces each RDF event from 40+ bytes to 24 bytes (subject u32 + predicate u32 + object u32 + graph u32 + timestamp u64).
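As a rough illustration of the URI-to-`u32` mapping described above, the sketch below keeps a hash map for the URI-to-ID direction and a vector for the ID-to-URI direction. The `Dictionary` name and the `encode`/`decode` method names follow this PR; the internal layout, the sequential ID assignment starting at 0, and the `len` helper are assumptions, not the actual implementation.

```rust
use std::collections::HashMap;

/// Bidirectional URI <-> u32 mapping (internal layout is an assumption).
#[derive(Debug, Default)]
pub struct Dictionary {
    uri_to_id: HashMap<String, u32>,
    // The index in this vector is the ID, so decoding is a plain lookup.
    id_to_uri: Vec<String>,
}

impl Dictionary {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the existing ID for `uri`, or assigns the next free one.
    pub fn encode(&mut self, uri: &str) -> u32 {
        if let Some(&id) = self.uri_to_id.get(uri) {
            return id;
        }
        let next = self.id_to_uri.len();
        assert!(next <= u32::MAX as usize, "dictionary exceeds the u32 ID space");
        let id = next as u32;
        self.id_to_uri.push(uri.to_owned());
        self.uri_to_id.insert(uri.to_owned(), id);
        id
    }

    /// Returns the URI stored under `id`, if any.
    pub fn decode(&self, id: u32) -> Option<&str> {
        self.id_to_uri.get(id as usize).map(String::as_str)
    }

    /// Number of distinct URIs seen so far.
    pub fn len(&self) -> usize {
        self.id_to_uri.len()
    }
}
```

Each distinct URI is stored once in the dictionary, so the saving grows with how often the same subjects, predicates, and graphs recur in the stream, which is typical for sensor data.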
Fixed-Size Event Representation
The core Event struct:
Total: 24 bytes per event (vs 40+ bytes with full URI strings)
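The 24-byte total follows from one `u64` timestamp plus four `u32` IDs (8 + 4 × 4 = 24). The sketch below shows one way such a fixed-size record could be serialized and parsed. `Event`, `RECORD_SIZE` and `decode_record` are names that appear in this PR, but the field order, the big-endian layout, the function signatures, and `encode_record` are assumptions made for illustration.

```rust
/// Size of one serialized event: 8-byte timestamp + 4 IDs of 4 bytes each.
pub const RECORD_SIZE: usize = 24;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Event {
    pub timestamp: u64,
    pub subject: u32,
    pub predicate: u32,
    pub object: u32,
    pub graph: u32,
}

/// Serializes an event into a fixed-size big-endian record (assumed layout).
pub fn encode_record(e: &Event) -> [u8; RECORD_SIZE] {
    let mut buf = [0u8; RECORD_SIZE];
    buf[0..8].copy_from_slice(&e.timestamp.to_be_bytes());
    buf[8..12].copy_from_slice(&e.subject.to_be_bytes());
    buf[12..16].copy_from_slice(&e.predicate.to_be_bytes());
    buf[16..20].copy_from_slice(&e.object.to_be_bytes());
    buf[20..24].copy_from_slice(&e.graph.to_be_bytes());
    buf
}

/// Parses a record produced by `encode_record`.
pub fn decode_record(buf: &[u8; RECORD_SIZE]) -> Event {
    // Reads a big-endian u32 starting at byte offset `i`.
    fn u32_at(buf: &[u8], i: usize) -> u32 {
        let mut b = [0u8; 4];
        b.copy_from_slice(&buf[i..i + 4]);
        u32::from_be_bytes(b)
    }
    let mut ts = [0u8; 8];
    ts.copy_from_slice(&buf[0..8]);
    Event {
        timestamp: u64::from_be_bytes(ts),
        subject: u32_at(buf, 8),
        predicate: u32_at(buf, 12),
        object: u32_at(buf, 16),
        graph: u32_at(buf, 20),
    }
}
```

Because every record has the same length, the n-th event in a segment starts at byte offset `n * RECORD_SIZE`, which is what makes offset-based indexing straightforward.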
Streaming Segmented Storage
StreamingSegmentedStorage manages high-throughput RDF ingestion:
Key Features
Architecture Components
src/core/mod.rs
Core data structures and type definitions:
src/storage/indexing/dictionary.rs
Dictionary encoding implementation:
src/storage/segmented_storage.rs (717 lines)
Streaming storage with batch buffering:
src/storage/util.rs
Shared utilities:
tests/dictionary_encoding_test.rs (623 lines)
Comprehensive integration tests:
Benchmark Examples
Three production-ready benchmark examples:
Benchmarks use realistic IoT sensor datasets and measure:
Implementation Details
Event Encoding Pipeline
Query Processing
Range queries over timestamp range:
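A minimal sketch of such a range query, assuming events are held in timestamp order and a sparse index records the position of every n-th event. `SparseIndex`, `build`, `range_query` and the in-memory slice are illustrative names and simplifications; the storage in this PR works on segment and index files, and its real API may look quite different.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Event {
    pub timestamp: u64,
    pub subject: u32,
    pub predicate: u32,
    pub object: u32,
    pub graph: u32,
}

/// One entry per `interval` events: (timestamp of that event, its position).
pub struct SparseIndex {
    entries: Vec<(u64, usize)>,
}

impl SparseIndex {
    /// Builds the index over a log that is sorted by timestamp.
    pub fn build(log: &[Event], interval: usize) -> Self {
        assert!(interval > 0, "interval must be positive");
        let entries = log
            .iter()
            .enumerate()
            .step_by(interval)
            .map(|(pos, e)| (e.timestamp, pos))
            .collect();
        SparseIndex { entries }
    }

    /// Position from which a scan for timestamps >= `start` must begin.
    fn scan_start(&self, start: u64) -> usize {
        // Count the indexed timestamps strictly below `start` (binary search).
        let n = self.entries.partition_point(|&(ts, _)| ts < start);
        // Begin at the last indexed event below `start`: the first match
        // may lie between that event and the next indexed one.
        if n == 0 { 0 } else { self.entries[n - 1].1 }
    }
}

/// Returns all events with `start <= timestamp <= end`.
pub fn range_query(log: &[Event], index: &SparseIndex, start: u64, end: u64) -> Vec<Event> {
    log[index.scan_start(start)..]
        .iter()
        .skip_while(|e| e.timestamp < start)
        .take_while(|e| e.timestamp <= end)
        .copied()
        .collect()
}
```

The index only narrows down where the scan starts; the `skip_while`/`take_while` pair still filters the exact bounds, so a larger interval trades index size for a longer linear scan.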
Point queries for specific subject/predicate:
Memory Management
StreamingSegmentedStorage tracks:
Configuration allows tuning of:
Performance Characteristics
Space Optimization
Throughput
Latency
Testing
Unit Tests
5 unit tests covering:
Integration Tests
12 integration tests covering:
Test Coverage
All major code paths tested:
Validation Results
All 17 tests pass:
Code Quality
Clippy Compliance
All Clippy warnings addressed:
Formatting
Code follows Rust idioms:
Documentation
Comprehensive inline comments:
CI/CD Improvements
GitHub Actions Updates
Updated to latest versions:
Rustfmt Configuration
Simplified rustfmt.toml:
Dependency Review
Removed dependency review CI job for private repositories. Alternative: Use GitHub Advanced Security subscription if needed.
Files Changed
32 files modified:
Breaking Changes
None. This is a purely additive feature:
New Dependencies
Added bincode (v1.3+):
Migration Guide
For users upgrading to this version:
Commits in This PR
25+ commits including:
Validation Checklist
Recommendations for Review
Related Issues
Addresses dictionary encoding and RDF stream compression requirements.