🚀 feat: integrate RAGAS evaluation framework #47
Merged

31 commits:
e82c32f  feat: partial ragas intergation  (kevinagyeman)
6a651af  Merge remote-tracking branch 'origin/develop' into ragas-integration-…  (kevinagyeman)
f7a53ca  feat: ragas integration (partial)  (kevinagyeman)
5ea1e61  feat: partial ragas integration wip  (kevinagyeman)
a41d158  edit: add .worktrees/ to gitignore  (nicofretti)
c814c1b  fix: theme + storybook  (nicofretti)
a9fe6fd  fix: bug config pipeline  (nicofretti)
5562d01  edit: improve generator validator  (nicofretti)
20d7202  fix: stop job handling  (nicofretti)
20c8d1e  Merge remote-tracking branch 'origin/feat/replace-modals' into ragas-…  (kevinagyeman)
918f89e  feat: integration of answer_relevancy and context_precision metrics  (kevinagyeman)
783eadc  feat: aggregate multiple ragas metrics  (kevinagyeman)
46452fc  feat: ragas integration completion  (kevinagyeman)
157af89  Merge remote-tracking branch 'origin/develop' into ragas-integration-…  (kevinagyeman)
8caaf62  fix: update blocks to use BlockExecutionContext pattern and add depen…  (kevinagyeman)
ceda594  fix: address copilot code review feedback  (kevinagyeman)
d0f3601  wip: fixing the ragas block + field mapper  (nicofretti)
06d553c  wip: fixing ragas block  (nicofretti)
1f03105  fix: langfuse error  (nicofretti)
da1196b  fix: integration ragas  (nicofretti)
1f10c30  fix: renaming + review  (nicofretti)
57ad2c0  fix: format  (nicofretti)
350374c  fix: missing import + doc  (nicofretti)
fb5221f  fix: pre-merge + docs  (nicofretti)
52a1645  fix: ui view  (nicofretti)
627daf2  edit: changelog  (nicofretti)
2249e1a  fix: pipeline errors  (nicofretti)
8acc9c0  fix: pipeline  (nicofretti)
7e0b20c  edit: import on ragas  (nicofretti)
735107a  fix: debug_pipeline.py  (nicofretti)
b9c3339  Merge branch 'develop' into ragas-integration-block  (nicofretti)
# RAGAS Evaluation Guide

## Overview

RAGAS (Retrieval Augmented Generation Assessment) is a framework for evaluating the quality of RAG-generated answers. The **RagasMetrics** block evaluates a single QA pair against multiple quality metrics.
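The block wraps the `ragas` library. For reference, here is a minimal standalone sketch of the same evaluation; it assumes the ragas 0.1-style column names (newer releases rename them) and an LLM plus embedding backend already configured in your environment:

```python
# Standalone sketch of the evaluation the RagasMetrics block performs.
# Assumes ragas 0.1-style column names; newer releases rename these columns.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_precision,
    context_recall,
)

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and largest city of France."]],
    "ground_truth": ["Paris is the capital of France."],
})

# evaluate() calls the configured LLM (and, for answer_relevancy, the
# embedding model) once per sample and metric.
result = evaluate(
    data,
    metrics=[answer_relevancy, faithfulness, context_precision, context_recall],
)
print(result)
```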
## Metrics

### 1. Answer Relevancy

**What it measures**: How relevant the answer is to the question.

**Range**: 0.0 - 1.0 (higher is better)

**Requires**:
- question
- answer
- embeddings (configured via embedding model)

**Example**:
- Question: "What is the capital of France?"
- Answer: "Paris is the capital of France" -> High score (0.9+)
- Answer: "France is a European country" -> Low score (0.3-)
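Under the hood, RAGAS computes this by having the LLM regenerate candidate questions from the answer and measuring their embedding similarity to the original question, which is why this metric needs an embedding model. A toy sketch of the similarity step, with made-up vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for the embedding of the original question
# and for embeddings of questions the LLM regenerated from the answer.
original_question = np.array([0.9, 0.1, 0.2])
regenerated = [np.array([0.88, 0.12, 0.25]), np.array([0.85, 0.15, 0.22])]

# Answer relevancy averages the similarities: close paraphrases score near 1.0.
score = np.mean([cosine_similarity(original_question, q) for q in regenerated])
print(round(float(score), 3))
```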
### 2. Faithfulness

**What it measures**: Whether the answer is factually consistent with the provided context.

**Range**: 0.0 - 1.0 (higher is better)

**Requires**:
- question
- answer
- contexts

**Example**:
- Context: "The Eiffel Tower is 330 meters tall"
- Answer: "The Eiffel Tower is 330 meters tall" -> High score (0.9+)
- Answer: "The Eiffel Tower is 500 meters tall" -> Low score (0.3-)

### 3. Context Precision

**What it measures**: Whether the relevant context chunks appear earlier in the context list.

**Range**: 0.0 - 1.0 (higher is better)

**Requires**:
- question
- contexts
- ground_truth

**Example**:
- If the most relevant context appears first in the list -> High score
- If relevant context is buried at the end -> Low score
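The scoring behind this is essentially average precision over the ranked context list, as in the sketch below (a simplified model in which an evaluator LLM has already judged each chunk relevant or not):

```python
def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over the positions that hold a relevant chunk;
    the relevance judgments themselves come from the evaluator LLM."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            total += hits / k
    return total / max(hits, 1)

print(context_precision([True, False, False]))  # relevant chunk first -> 1.0
print(context_precision([False, False, True]))  # buried at the end -> ~0.33
```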
### 4. Context Recall

**What it measures**: Whether all information needed to answer the question is present in the contexts.

**Range**: 0.0 - 1.0 (higher is better)

**Requires**:
- question
- contexts
- ground_truth

**Example**:
- Ground truth: "Paris is the capital of France, located on the Seine river"
- Context includes both facts -> High score (1.0)
- Context only includes capital fact -> Lower score (0.5)
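Both context recall and faithfulness boil down to a ratio of supported claims; the 0.5 in the example above is one supported ground-truth claim out of two. A back-of-the-envelope sketch (in the real metric, the claim splitting and support checks are done by the evaluator LLM):

```python
def claim_ratio(claims: list[str], supported: set[str]) -> float:
    """Share of claims that are supported by the retrieved contexts."""
    return sum(claim in supported for claim in claims) / len(claims)

# The context recall example above: two ground-truth claims, one of which
# is found in the contexts -> 1/2 = 0.5.
gt_claims = ["Paris is the capital of France", "Paris is located on the Seine"]
print(claim_ratio(gt_claims, supported={"Paris is the capital of France"}))
```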
## Configuration

### Field References

The block uses field references to locate data in the pipeline state:
- **question_field**: Field containing the question
- **answer_field**: Field containing the answer
- **contexts_field**: Field containing the contexts (a list of strings)
- **ground_truth_field**: Field containing the expected answer

These are dropdowns populated from the available pipeline fields. You can use the **FieldMapper** block to rename or create fields as needed (e.g. to extract fields from nested structures).
### Selecting Metrics

Use the **metrics** multi-select to choose which metrics to compute:
- Check all metrics you want to evaluate
- Uncheck metrics you don't need
- Note: `answer_relevancy` requires an embedding model
### Score Threshold

The **score_threshold** field sets the minimum value each metric must reach to count as passing. The block outputs a boolean `passed` indicating whether all selected metrics meet or exceed this threshold, as sketched below.
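A one-line sketch of that pass/fail rule (the names are illustrative, not the block's actual internals):

```python
def all_passed(scores: dict[str, float], score_threshold: float) -> bool:
    """Every selected metric must meet or exceed the threshold."""
    return all(value >= score_threshold for value in scores.values())

print(all_passed({"faithfulness": 0.88, "context_recall": 0.85}, 0.8))  # True
print(all_passed({"faithfulness": 0.88, "context_recall": 0.75}, 0.8))  # False
```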
### Model Configuration

- **model**: LLM model used for evaluation (leave empty for the pipeline default)
- **embedding_model**: Embedding model for `answer_relevancy` (leave empty for the default)
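Putting these options together, a block configuration could look like the sketch below. The parameter names are the ones documented above, but the dict shape is illustrative rather than the block's exact schema:

```python
# Illustrative RagasMetrics configuration; the field names match the
# documentation above, but the exact config schema may differ.
ragas_block_config = {
    "question_field": "question",
    "answer_field": "answer",
    "contexts_field": "retrieved_contexts",
    "ground_truth_field": "expected_answer",
    "metrics": [
        "answer_relevancy",
        "faithfulness",
        "context_precision",
        "context_recall",
    ],
    "score_threshold": 0.8,   # minimum passing value per metric
    "model": None,            # empty -> pipeline default LLM
    "embedding_model": None,  # empty -> default embedding model
}
```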
## Output Format

The block outputs a single `ragas_scores` object:

```json
{
  ...
  "ragas_scores": {
    "answer_relevancy": 0.92,
    "faithfulness": 0.88,
    "context_precision": 0.95,
    "context_recall": 0.85,
    "passed": true
  }
}
```
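A downstream block or script can then gate on `passed`, or inspect the individual scores. A hypothetical example, reusing the 0.8 threshold from the configuration sketch above:

```python
# Hypothetical downstream check; `state` stands in for the pipeline state
# after the RagasMetrics block has run, using the example values above.
state = {
    "ragas_scores": {
        "answer_relevancy": 0.92,
        "faithfulness": 0.88,
        "context_precision": 0.95,
        "context_recall": 0.85,
        "passed": True,
    }
}

scores = state["ragas_scores"]
if scores["passed"]:
    print("all selected metrics cleared the threshold")
else:
    failing = [m for m, v in scores.items() if m != "passed" and v < 0.8]
    print(f"failed metrics: {failing}")
```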