nicofretti/DataGenFlow

Define seeds → Build pipeline → Review results → Export data

Watch full demo

Why DataGenFlow 🌱

DataGenFlow is a minimal tool that helps you generate and validate data from seeds/documents with full visibility.

Key Benefits

  • Easy to Extend: Add custom blocks in minutes with auto-discovery
  • Faster Development: Visual pipeline builder eliminates boilerplate code
  • Simple to Use: Intuitive drag-and-drop interface, no training required
  • Full Transparency: Complete execution traces for debugging

Quick Start

Get started in under 2 minutes:

# Install dependencies
make setup
make dev

# Launch application (backend + frontend), make sure to have .env configured
make run-dev

# Open http://localhost:8000

That's it! No complex configuration, no external services required beyond your LLM endpoint.

How It Works

TL;DR - Visual Overview

Example of a JSON extraction pipeline from text:

┌─────────────────────────────────────────────────────────────────────────┐
│ 1. SEED DATA (JSON)                                                     │
│    { "repetitions": 2, "metadata": {"content": "Python is a..."} }      │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 2. PIPELINE (Visual Drag & Drop)                                        │
│                                                                         │
│         ┌──────────────────┐           ┌──────────────────┐             │
│         │   Structured     │    ───►   │       JSON       │             │
│         │    Generator     │           │    Validator     │             │
│         └──────────────────┘           └──────────────────┘             │
│                                                                         │
│    Accumulated State Flow:                                              │
│    content  ─►  + generated (title, description)  ─►  + valid, parsed   │
│                                                                         │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 3. GENERATION & REVIEW                                                  │
│    + Execute pipeline for each seed × repetitions                       │
│    + Review results with keyboard shortcuts (A/R/E)                     │
│    + View full execution trace for debugging                            │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 4. EXPORT                                                               │
│    Download as JSONL ─► Ready for training/integration                  │
└─────────────────────────────────────────────────────────────────────────┘

Key Concept: Each block adds data to the accumulated state, so subsequent blocks automatically have access to all previous outputs. No manual wiring needed!


1. Define Your Seed Data

Start by creating a JSON seed file with the variables your pipeline will use. Seeds define what data you want to generate.

Single seed:

{
  "repetitions": 2,
  "metadata": {
    "topic": "Python programming",
    "difficulty": "beginner"
  }
}

Multiple seeds (generate different variations):

[
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python lists",
      "difficulty": "beginner"
    }
  },
  {
    "repetitions": 1,
    "metadata": {
      "topic": "Python dictionaries",
      "difficulty": "intermediate"
    }
  }
]

Fields:

  • repetitions: How many times to run the pipeline with this seed
  • metadata: Variables accessible in your blocks via {{ variable_name }}

2. Build Your Pipeline Visually

Design your data generation workflow using drag-and-drop blocks. Each block processes data and passes it to the next one. Currently there are four types of blocks:

  • Generators: Create new content
  • Validators: Validate or parse existing content
  • Metrics: Calculate quality metrics on content
  • Seeders: Split source documents into seed items

Here are some example blocks available out of the box:

  • [Generator] Text Generator: Generate text using LLM with configurable parameters
  • [Generator] Structured Generator: Generate structured JSON with schema validation
  • [Validators] Validator: Validate text (length, forbidden words, patterns)
  • [Validators] JSON Validator: Parse and validate JSON structures
  • [Metrics] Coherence Score: Calculate text coherence metrics
  • [Metrics] Diversity Score: Measure lexical diversity
  • [Metrics] Rouge Score: Calculate ROUGE similarity scores
  • [Seeders] Markdown Chunker: Split markdown documents into chunks for processing
  • ... other blocks will be added over time, you can contribute new ones too!

Extend with Custom Blocks

The real power of DataGenFlow is creating your own blocks. Add domain-specific logic in minutes with automatic discovery:

from lib.blocks.base import BaseBlock
from lib.entities.block_execution_context import BlockExecutionContext
from typing import Any

class SentimentAnalyzerBlock(BaseBlock):
    name = "Sentiment Analyzer"
    description = "Analyzes text sentiment"
    category = "validators"  # generators, validators, metrics, seeders, general
    inputs = ["text"]  # what this block needs from accumulated state
    outputs = ["sentiment", "confidence"]  # what it adds to accumulated state

    async def execute(self, context: BlockExecutionContext) -> dict[str, Any]:
        text = context.get_state("text", "")  # access from accumulated state
        sentiment = analyze_sentiment(text)  # your own domain logic (not shown)

        # return values are added to accumulated state automatically
        return {
            "sentiment": sentiment.label,
            "confidence": sentiment.score
        }

Drop your file in user_blocks/ and it's automatically discovered on restart; no configuration needed.

Why this matters:

  • Adapt to your specific domain or workflow instantly
  • Integrate proprietary validation logic or data sources
  • Build reusable components for your team
  • Share blocks as Python files-simple as copy/paste

Debugging Custom Blocks

Need to debug your custom block? Use the included debug_pipeline.py script with VS Code debugger. See Developer Documentation for details.
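Independently of the debugger, a block's execute contract can also be exercised in isolation with a stubbed context. Everything below is a self-contained sketch: FakeContext and the toy sentiment heuristic are stand-ins for illustration, not the real classes from lib/:

```python
import asyncio
from typing import Any


class FakeContext:
    """Stub mimicking BlockExecutionContext.get_state for local testing."""
    def __init__(self, state: dict[str, Any]):
        self._state = state

    def get_state(self, key: str, default: Any = None) -> Any:
        return self._state.get(key, default)


class SentimentAnalyzerBlock:
    name = "Sentiment Analyzer"
    outputs = ["sentiment", "confidence"]

    async def execute(self, context: FakeContext) -> dict[str, Any]:
        text = context.get_state("text", "")
        # toy heuristic standing in for a real sentiment model
        label = "positive" if "great" in text.lower() else "neutral"
        return {"sentiment": label, "confidence": 0.9}


result = asyncio.run(
    SentimentAnalyzerBlock().execute(FakeContext({"text": "DataGenFlow is great"}))
)
# result == {"sentiment": "positive", "confidence": 0.9}
```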

📚 Complete guide: Custom Block Development

Accumulated State

Data flows automatically through your pipeline. Each block adds its outputs to an accumulated state that every subsequent block can access; no manual wiring:

┌─────────────────────┐
│ Structured Generator│ → outputs: {"generated": {"title": "...", "description": "..."}}
└─────────────────────┘
    │
    ▼ (state: content, generated)
┌─────────────────────┐
│   JSON Validator    │ → outputs: {"valid": true, "parsed_json": {...}}
└─────────────────────┘
    │
    ▼ (state: content, generated, valid, parsed_json)
    All subsequent blocks can access all fields

This makes building complex pipelines incredibly simple: connect blocks and they automatically share data.
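Conceptually, this accumulation amounts to merging each block's returned dict into a shared state dict. A rough sketch of the pattern (not the actual executor):

```python
from typing import Any, Callable

Block = Callable[[dict[str, Any]], dict[str, Any]]


def run_pipeline(seed_metadata: dict[str, Any], blocks: list[Block]) -> dict[str, Any]:
    """Each block reads the accumulated state; its outputs are merged back in."""
    state = dict(seed_metadata)
    for block in blocks:
        state.update(block(state))
    return state


# two toy blocks standing in for Structured Generator and JSON Validator
generator = lambda s: {"generated": {"title": f"About {s['topic']}"}}
validator = lambda s: {"valid": isinstance(s["generated"], dict)}

final = run_pipeline({"topic": "Python"}, [generator, validator])
# final == {"topic": "Python", "generated": {"title": "About Python"}, "valid": True}
```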

3. Review and Refine

Review your results with keyboard shortcuts (Accept: A, Reject: R, Edit: E) and full execution traces to see how each result was generated.

4. Export Your Data

Export your data in JSONL format, filtered by status (accepted, rejected, pending).
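Since JSONL is one JSON object per line, exported records can be consumed with the standard library alone. The record shape below is illustrative:

```python
import io
import json

# in-memory stand-in for an exported JSONL file
exported = io.StringIO(
    '{"status": "accepted", "generated": {"title": "Python lists"}}\n'
    '{"status": "rejected", "generated": {"title": "Off-topic"}}\n'
)

records = [json.loads(line) for line in exported if line.strip()]
accepted = [r for r in records if r["status"] == "accepted"]
# len(accepted) == 1
```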

Configuration

Create .env file (or copy from .env.example):

# LLM Configuration
LLM_ENDPOINT=http://localhost:11434/v1/chat/completions  # Ollama, OpenAI, etc.
LLM_API_KEY=                            # Optional for some endpoints
LLM_MODEL=llama3.2

# Database
DATABASE_PATH=data/qa_records.db

# Server
HOST=0.0.0.0
PORT=8000

# Debug mode (optional)
DEBUG=false  # set to true for detailed logging
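LLM_ENDPOINT is expected to accept OpenAI-compatible chat completion requests (Ollama serves this format at /v1/chat/completions). The helper below sketches the request body such an endpoint expects; it is illustrative, not part of DataGenFlow:

```python
import json


def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


payload = build_chat_request("llama3.2", "Summarize Python lists in one line.")
body = json.dumps(payload)  # POST this to LLM_ENDPOINT, with a Bearer token if LLM_API_KEY is set
```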

Documentation

📖 Comprehensive Guides

Contributing

Contributions are welcome and appreciated. Before submitting a contribution, please review the guidelines below.

Prerequisites:

  • Read the Contributing Guidelines thoroughly
  • Check existing issues and pull requests to avoid duplication
  • Follow the project's commit conventions and code style standards

Areas for Contribution:

  • New processing blocks and pipeline templates
  • Documentation improvements and examples
  • Bug fixes and performance optimizations
  • Test coverage expansion
  • Integration examples and use cases

For detailed technical requirements and development setup, refer to the Developer Documentation.

Get Started • View Documentation

Happy Data Generating! 🌱