
🚀 Feat: template conversational data augmentation #38

@nicofretti

Description

Given a small set of high-quality, multi-turn conversations (e.g., 100 customer support chats), generate 1,000 new, realistic, and high-quality conversations.
This is critical for training chatbots and customer service models, which are starved for high-quality, diverse training data.
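
The core loop can be sketched as few-shot prompting: sample a handful of seed chats and ask the model for one new conversation per call. This is a minimal, hypothetical sketch only - the turn schema, the `build_augmentation_prompt` helper, and the seed data below are illustrative assumptions, not part of this issue.

```python
import random

def build_augmentation_prompt(seed_conversations, n_examples=3, seed=None):
    """Sample a few seed chats and format them as few-shot examples
    for an LLM that is asked to produce one new conversation."""
    rng = random.Random(seed)
    examples = rng.sample(seed_conversations, k=min(n_examples, len(seed_conversations)))
    # Render each conversation as "role: text" lines, blank line between chats.
    shots = "\n\n".join(
        "\n".join(f"{turn['role']}: {turn['text']}" for turn in conv)
        for conv in examples
    )
    return (
        "Here are example customer-support conversations:\n\n"
        f"{shots}\n\n"
        "Write one NEW conversation in the same format, on a different topic."
    )

# Toy seed set - real seeds would be the ~100 curated support chats.
seeds = [
    [{"role": "user", "text": "My order is late."},
     {"role": "agent", "text": "Sorry about that - let me check the status."}],
    [{"role": "user", "text": "How do I reset my password?"},
     {"role": "agent", "text": "Use the 'Forgot password' link on the login page."}],
]
prompt = build_augmentation_prompt(seeds, n_examples=2, seed=0)
```

The resulting `prompt` would be sent to whatever LLM backend the project settles on; the call itself is omitted here.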

Constraints

  • Maintaining Coherence: each generated conversation must be logical from turn to turn. The "agent" and "user" must respond to what the other just said. The LLM can't just generate two good "user" lines and two good "agent" lines in isolation; the turns must flow together.

  • Semantic Diversity (Avoiding "Mode Collapse"): the LLM will tend to find a few "safe" or common conversational paths and generate 1,000 minor variations of them. The challenge is to generate truly different conversations, covering a wide range of topics, user intents, and emotional tones (e.g., angry users, confused users, happy users).

  • Adherence to Persona and Goals: the generated "agent" must consistently follow its script, persona, or rules. The "user" must have a clear and consistent intent from the beginning of the chat to the end.

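One way to address the diversity and persona constraints together is to plan each generation up front: walk a shuffled topic x intent x tone grid so the 1,000 requests are spread across combinations instead of collapsing onto a few "safe" paths, and pin the agent to a fixed persona prompt. The attribute pools, `AGENT_PERSONA` text, and `plan_conversations` helper below are all hypothetical placeholders for discussion, not a committed design.

```python
import itertools
import random

# Hypothetical attribute pools - real ones would be mined from the seed chats.
TOPICS = ["billing", "shipping", "returns", "account access"]
INTENTS = ["get a refund", "track an order", "fix a login issue", "update details"]
TONES = ["angry", "confused", "happy", "neutral"]

# Fixed persona so every generated "agent" follows the same rules.
AGENT_PERSONA = (
    "You are a polite support agent. Always greet the user, "
    "never promise refunds beyond policy, and end by asking if anything else is needed."
)

def plan_conversations(n, seed=0):
    """Cycle through the full topic x intent x tone grid (shuffled once)
    so n requested conversations cover every attribute combination
    before any combination repeats."""
    grid = list(itertools.product(TOPICS, INTENTS, TONES))
    random.Random(seed).shuffle(grid)
    plans = []
    for i in range(n):
        topic, intent, tone = grid[i % len(grid)]  # wrap around if n > grid size
        plans.append({
            "system": AGENT_PERSONA,
            "instruction": (
                f"Generate a support chat about {topic}. "
                f"The user is {tone} and wants to {intent}. "
                "The user must keep this intent from the first turn to the last."
            ),
        })
    return plans

plans = plan_conversations(1000)
```

Each plan would become one LLM call; because the grid is enumerated rather than sampled independently, coverage of user intents and emotional tones is guaranteed by construction.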
Before starting work on this issue, please suggest a solution and wait for approval.

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
Projects: none
Milestone: none
Relationships: none
Development: no branches or pull requests