# DeepEvalEx

An LLM evaluation framework for Elixir: an idiomatic yet compatible Elixir port of DeepEval.
Attribution: This project is a derivative work of DeepEval by Confident AI, licensed under Apache 2.0. The core evaluation algorithms, metrics, and prompt templates are derived from the original Python implementation.
## Installation

Add `deep_eval_ex` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:deep_eval_ex, "~> 0.1.0"}
  ]
end
```

## Quick Start

```elixir
# Create a test case
test_case = DeepEvalEx.TestCase.new!(
  input: "What is the capital of France?",
  actual_output: "The capital of France is Paris.",
  expected_output: "Paris"
)

# Evaluate with the ExactMatch metric
{:ok, result} = DeepEvalEx.Metrics.ExactMatch.measure(test_case)

# Check the result
result.score   # => 0.0 (not an exact match)
result.success # => false
result.reason  # => "The actual and expected outputs are different."
```

## Configuration

Configure your LLM provider in `config/config.exs`:

```elixir
config :deep_eval_ex,
  default_model: {:openai, "gpt-4o-mini"},
  openai_api_key: System.get_env("OPENAI_API_KEY"),
  default_threshold: 0.5
```

## Metrics

| Metric | Purpose |
|---|---|
| ExactMatch | Simple string comparison |
| GEval | Flexible criteria-based evaluation using LLM-as-judge |
| Faithfulness | RAG: claims supported by retrieval context |
| Hallucination | Detects unsupported statements |
| AnswerRelevancy | Response relevance to input question |
| ContextualPrecision | RAG retrieval ranking quality |
| ContextualRecall | RAG coverage of ground truth |
See the Metrics Overview for detailed documentation on each metric.
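As a rough sketch of how a criteria-based metric such as GEval might be invoked, the call below assumes `Metrics.GEval.measure/2` mirrors the `ExactMatch` flow shown above and accepts a `:criteria` option; neither the arity nor the option name is confirmed by this README, so treat it as illustrative only:

```elixir
# Hypothetical GEval usage; the :criteria option and measure/2 shape are assumptions,
# not documented here. See the Metrics Overview for the actual API.
test_case = DeepEvalEx.TestCase.new!(
  input: "Summarize the article in one sentence.",
  actual_output: "The article argues that renewable energy adoption is accelerating worldwide."
)

{:ok, result} =
  DeepEvalEx.Metrics.GEval.measure(test_case,
    criteria: "The summary must be a single sentence that captures the article's main argument."
  )

result.score # score assigned by the LLM judge, compared against the configured threshold
```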
## Guides

| Guide | Description |
|---|---|
| Quick Start | Get up and running in 5 minutes |
| Configuration | LLM provider setup and options |
| Metrics Overview | All available metrics explained |
| ExUnit Integration | Test assertions for CI/CD |
| Custom Metrics | Build your own evaluation metrics |
| Telemetry | Observability and monitoring |
## API Reference

- TestCase - Test case structure
- Result - Evaluation results
- Evaluator - Batch evaluation
- LLM Adapters - Provider adapters
- Architecture Decision Records - Design decisions and rationale
## LLM Providers

DeepEvalEx supports multiple LLM providers:
- OpenAI - GPT-4o, GPT-4o-mini, GPT-3.5-turbo
- Anthropic - Claude 3 family (planned)
- Ollama - Local models (planned)
See LLM Adapters and Custom LLM Adapters for details.
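Switching providers presumably comes down to the `default_model` tuple shown in the configuration above; the non-OpenAI tuples below are guesses at what the planned adapters might accept, not a documented API:

```elixir
# config/config.exs -- provider selection sketch
config :deep_eval_ex,
  default_model: {:openai, "gpt-4o"}
  # Planned adapters (tuple shapes assumed, not yet available):
  # default_model: {:anthropic, "claude-3-haiku-20240307"}
  # default_model: {:ollama, "llama3"}
```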
## ExUnit Integration

```elixir
defmodule MyApp.LLMTest do
  use ExUnit.Case

  alias DeepEvalEx.{TestCase, Metrics}

  test "LLM generates accurate responses" do
    # get_llm_response/1 stands in for your application's own function that calls the LLM
    test_case = TestCase.new!(
      input: "What is 2 + 2?",
      actual_output: get_llm_response("What is 2 + 2?"),
      expected_output: "4"
    )

    {:ok, result} = Metrics.ExactMatch.measure(test_case)
    assert result.success, result.reason
  end
end
```

## Batch Evaluation

Evaluate multiple test cases concurrently:

```elixir
test_cases = [
  TestCase.new!(input: "Q1", actual_output: "A1", expected_output: "A1"),
  TestCase.new!(input: "Q2", actual_output: "A2", expected_output: "A2")
]

results = DeepEvalEx.evaluate_batch(test_cases, [Metrics.ExactMatch],
  concurrency: 20
)
```
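A common follow-up is to summarize the batch. The sketch below assumes `evaluate_batch/3` returns a flat list of result structs carrying the same `success` field seen on single-metric results, which this README does not spell out:

```elixir
# Assumes results is a flat list of result structs with a :success field.
passed = Enum.count(results, & &1.success)
IO.puts("#{passed}/#{length(results)} evaluations passed")
```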
## Telemetry

DeepEvalEx emits telemetry events for observability:

```elixir
:telemetry.attach(
  "my-handler",
  [:deep_eval_ex, :metric, :stop],
  fn _event, measurements, metadata, _config ->
    IO.puts("Metric #{metadata.metric} completed with score #{measurements.score}")
  end,
  nil
)
```

See the Telemetry Guide for all events and integration patterns.
## License

Apache 2.0 - See LICENSE and NOTICE for details.
This project is a derivative work of DeepEval by Confident AI, also licensed under Apache 2.0.
