DataGenFlow is a minimal tool that helps you generate and validate data from seeds or documents with full visibility.
- Easy to Extend: Add custom blocks in minutes with auto-discovery
- Faster Development: Visual pipeline builder eliminates boilerplate code
- Simple to Use: Intuitive drag-and-drop interface, no training required
- Full Transparency: Complete execution traces for debugging
Get started in under 2 minutes:
# Install dependencies
make setup
make dev
# Launch application (backend + frontend), make sure to have .env configured
make run-dev
# Open http://localhost:8000

That's it! No complex configuration, no external services required beyond your LLM endpoint.
Example of a JSON extraction pipeline from text:
┌──────────────────────────────────────────────────────────────────┐
│ 1. SEED DATA (JSON)                                              │
│ { "repetitions": 2, "metadata": {"content": "Python is a..."} }  │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 2. PIPELINE (Visual Drag & Drop)                                 │
│                                                                  │
│  ┌──────────────────────┐       ┌──────────────────────┐         │
│  │ Structured Generator │  ──▶  │    JSON Validator    │         │
│  └──────────────────────┘       └──────────────────────┘         │
│                                                                  │
│  Accumulated State Flow:                                         │
│  content ──▶ + generated (title, description) ──▶ + valid, parsed│
│                                                                  │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 3. GENERATION & REVIEW                                           │
│  + Execute pipeline for each seed × repetitions                  │
│  + Review results with keyboard shortcuts (A/R/E)                │
│  + View full execution trace for debugging                       │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 4. EXPORT                                                        │
│  Download as JSONL ──▶ Ready for training/integration            │
└──────────────────────────────────────────────────────────────────┘
Key Concept: Each block adds data to the accumulated state, so subsequent blocks automatically have access to all previous outputs; no manual wiring needed!
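As a rough mental model (a conceptual sketch, not DataGenFlow's actual internal representation), you can picture the accumulated state as a dictionary that grows after each block:

# conceptual sketch only; field names follow the example pipeline above
state = {"content": "Python is a..."}  # from the seed metadata

# after the Structured Generator runs
state.update({"generated": {"title": "...", "description": "..."}})

# after the JSON Validator runs
state.update({"valid": True, "parsed_json": {"title": "...", "description": "..."}})

# any later block can now read content, generated, valid and parsed_json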
Start by creating a JSON seed file with the variables your pipeline will use. Seeds define what data you want to generate.
Single seed:
{
"repetitions": 2,
"metadata": {
"topic": "Python programming",
"difficulty": "beginner"
}
}

Multiple seeds (generate different variations):
[
{
"repetitions": 1,
"metadata": {
"topic": "Python lists",
"difficulty": "beginner"
}
},
{
"repetitions": 1,
"metadata": {
"topic": "Python dictionaries",
"difficulty": "intermediate"
}
}
]

Fields:
- repetitions: How many times to run the pipeline with this seed
- metadata: Variables accessible in your blocks via {{ variable_name }} (see the sketch below)
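For illustration, here is roughly how a {{ variable_name }} placeholder in a block's prompt could resolve against seed metadata. This is a standalone sketch of the idea, not DataGenFlow's actual template engine:

import re

seed_metadata = {"topic": "Python lists", "difficulty": "beginner"}
prompt_template = "Write a {{ difficulty }}-level explanation of {{ topic }}."

# naive placeholder substitution, purely to show how metadata reaches prompts
rendered = re.sub(
    r"\{\{\s*(\w+)\s*\}\}",
    lambda m: str(seed_metadata.get(m.group(1), m.group(0))),
    prompt_template,
)
print(rendered)  # Write a beginner-level explanation of Python lists.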
Design your data generation workflow using drag-and-drop blocks. Each block processes data and passes it to the next. Currently there are three main types of blocks:
- Generators: Create new content
- Validators: Validate or parse existing content
- Metrics: Calculate quality metrics on content
Here are some example blocks available out of the box:
- [Generators] Text Generator: Generate text using an LLM with configurable parameters
- [Generators] Structured Generator: Generate structured JSON with schema validation
- [Validators] Validator: Validate text (length, forbidden words, patterns)
- [Validators] JSON Validator: Parse and validate JSON structures
- [Metrics] Coherence Score: Calculate text coherence metrics
- [Metrics] Diversity Score: Measure lexical diversity
- [Metrics] Rouge Score: Calculate ROUGE similarity scores
- [Seeders] Markdown Chunker: Split markdown documents into chunks for processing
- ... more blocks will be added over time, and you can contribute new ones too!
The real power of DataGenFlow is creating your own blocks. Add domain-specific logic in minutes with automatic discovery:
from typing import Any

from lib.blocks.base import BaseBlock
from lib.entities.block_execution_context import BlockExecutionContext


class SentimentAnalyzerBlock(BaseBlock):
    name = "Sentiment Analyzer"
    description = "Analyzes text sentiment"
    category = "validators"  # generators, validators, metrics, seeders, general
    inputs = ["text"]  # what this block needs from accumulated state
    outputs = ["sentiment", "confidence"]  # what it adds to accumulated state

    async def execute(self, context: BlockExecutionContext) -> dict[str, Any]:
        text = context.get_state("text", "")  # access from accumulated state
        sentiment = analyze_sentiment(text)
        # return values are added to accumulated state automatically
        return {
            "sentiment": sentiment.label,
            "confidence": sentiment.score,
        }

Drop your file in user_blocks/ and it's automatically discovered on restart; no configuration needed.
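The analyze_sentiment call above stands in for whatever library or model you prefer. If you just want to see the block run end to end, a throwaway stub like the following works; it is purely illustrative and not part of DataGenFlow:

from dataclasses import dataclass


@dataclass
class SentimentResult:
    label: str
    score: float


def analyze_sentiment(text: str) -> SentimentResult:
    # naive keyword heuristic, only a placeholder for a real model or library
    positive = {"good", "great", "love", "excellent"}
    hits = len(set(text.lower().split()) & positive)
    return SentimentResult("positive" if hits else "neutral", min(1.0, 0.5 + 0.25 * hits))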
Why this matters:
- Adapt to your specific domain or workflow instantly
- Integrate proprietary validation logic or data sources
- Build reusable components for your team
- Share blocks as Python files; as simple as copy/paste
Debugging Custom Blocks
Need to debug your custom block? Use the included debug_pipeline.py script with the VS Code debugger. See Developer Documentation for details.
Complete guide: Custom Block Development
Data flows automatically through your pipeline. Each block adds its outputs to an accumulated state that every subsequent block can access; no manual wiring:

┌───────────────────────┐
│ Structured Generator  │ → outputs: {"generated": {"title": "...", "description": "..."}}
└───────────────────────┘
            │
            ▼  (state: content, generated)
┌───────────────────────┐
│ JSON Validator        │ → outputs: {"valid": true, "parsed_json": {...}}
└───────────────────────┘
            │
            ▼  (state: content, generated, valid, parsed_json)

All subsequent blocks can access all fields.

This makes building complex pipelines incredibly simple: connect blocks and they automatically share data.
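For instance, a hypothetical downstream block (the name and fields here are made up for illustration) could read the outputs of both earlier blocks straight from the accumulated state:

from typing import Any

from lib.blocks.base import BaseBlock
from lib.entities.block_execution_context import BlockExecutionContext


class DescriptionLengthBlock(BaseBlock):
    name = "Description Length"
    description = "Example metric that reads outputs of two earlier blocks"
    category = "metrics"
    inputs = ["generated", "valid"]
    outputs = ["description_length"]

    async def execute(self, context: BlockExecutionContext) -> dict[str, Any]:
        generated = context.get_state("generated", {})
        is_valid = context.get_state("valid", False)
        # both fields were added by different upstream blocks; no wiring needed
        length = len(generated.get("description", "")) if is_valid else 0
        return {"description_length": length}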
Review your results with keyboard shortcuts (Accept: A, Reject: R, Edit: E) and full execution traces to see how each result was generated.
Export your data in JSONL format, filtered by status (accepted, rejected, pending).
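Once downloaded, the JSONL file is straightforward to consume. A minimal sketch, assuming a file named export.jsonl and a status field on each record (the actual export schema may differ):

import json

with open("export.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

accepted = [r for r in records if r.get("status") == "accepted"]
print(f"{len(accepted)} accepted records out of {len(records)}")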
Create a .env file (or copy from .env.example):
# LLM Configuration
LLM_ENDPOINT=http://localhost:11434/v1/chat/completions # Ollama, OpenAI, etc.
LLM_API_KEY= # Optional for some endpoints
LLM_MODEL=llama3.2
# Database
DATABASE_PATH=data/qa_records.db
# Server
HOST=0.0.0.0
PORT=8000
# Debug mode (optional)
DEBUG=false  # set to true for detailed logging

Comprehensive Guides
- How to Use DataGenFlow - Complete user guide
- Custom Block Development - Extend functionality
- Developer Documentation - Technical reference for developers
Contributions are welcome and appreciated. Before submitting a contribution, please review the guidelines below.
Prerequisites:
- Read the Contributing Guidelines thoroughly
- Check existing issues and pull requests to avoid duplication
- Follow the project's commit conventions and code style standards
Areas for Contribution:
- New processing blocks and pipeline templates
- Documentation improvements and examples
- Bug fixes and performance optimizations
- Test coverage expansion
- Integration examples and use cases
For detailed technical requirements and development setup, refer to the Developer Documentation.
Get Started • View Documentation
Happy Data Generating!
