DeepEvalEx

LLM evaluation framework for Elixir: an idiomatic and compatible port of DeepEval.

Attribution: This project is a derivative work of DeepEval by Confident AI, licensed under Apache 2.0. The core evaluation algorithms, metrics, and prompt templates are derived from the original Python implementation.

Installation

Add deep_eval_ex to your list of dependencies in mix.exs:

def deps do
  [
    {:deep_eval_ex, "~> 0.1.0"}
  ]
end
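
Then fetch the dependency:

mix deps.get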

Quick Start

# Create a test case
test_case = DeepEvalEx.TestCase.new!(
  input: "What is the capital of France?",
  actual_output: "The capital of France is Paris.",
  expected_output: "Paris"
)

# Evaluate with ExactMatch metric
{:ok, result} = DeepEvalEx.Metrics.ExactMatch.measure(test_case)

# Check result
result.score      # => 0.0 (not an exact match)
result.success    # => false
result.reason     # => "The actual and expected outputs are different."

Configuration

Configure your LLM provider in config/config.exs:

config :deep_eval_ex,
  default_model: {:openai, "gpt-4o-mini"},
  openai_api_key: System.get_env("OPENAI_API_KEY"),
  default_threshold: 0.5
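
For releases, API keys are typically read at boot in config/runtime.exs rather than at compile time. A minimal sketch of that standard Elixir pattern (not specific to this library):

# config/runtime.exs: evaluated when the release boots,
# so the environment variable is read at runtime
import Config

config :deep_eval_ex,
  openai_api_key: System.fetch_env!("OPENAI_API_KEY")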

Available Metrics

Metric               Purpose
-------------------  -----------------------------------------------------
ExactMatch           Simple string comparison
GEval                Flexible criteria-based evaluation using LLM-as-judge
Faithfulness         RAG: claims supported by retrieval context
Hallucination        Detects unsupported statements
AnswerRelevancy      Response relevance to input question
ContextualPrecision  RAG retrieval ranking quality
ContextualRecall     RAG coverage of ground truth

See the Metrics Overview for detailed documentation on each metric.
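
For the LLM-as-judge metrics, a GEval call might look like the sketch below. The criteria: and threshold: option names are assumptions modelled on the Python DeepEval API, not confirmed signatures; see the Metrics Overview for the actual options.

test_case = DeepEvalEx.TestCase.new!(
  input: "What is the capital of France?",
  actual_output: "Paris is the capital of France."
)

# criteria: and threshold: are assumed option names
{:ok, result} =
  DeepEvalEx.Metrics.GEval.measure(test_case,
    criteria: "Determine whether the output correctly answers the question.",
    threshold: 0.7
  )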

Documentation

Guide               Description
------------------  ----------------------------------
Quick Start         Get up and running in 5 minutes
Configuration       LLM provider setup and options
Metrics Overview    All available metrics explained
ExUnit Integration  Test assertions for CI/CD
Custom Metrics      Build your own evaluation metrics
Telemetry           Observability and monitoring

API Reference

Architecture

LLM Adapters

DeepEvalEx supports multiple LLM providers:

  • OpenAI - GPT-4o, GPT-4o-mini, GPT-3.5-turbo
  • Anthropic - Claude 3 family (planned)
  • Ollama - Local models (planned)

See LLM Adapters and Custom LLM Adapters for details.
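
As a rough sketch only: a custom adapter is presumably a module implementing an adapter behaviour. The DeepEvalEx.LLM.Adapter name and generate/2 callback below are hypothetical; the Custom LLM Adapters guide documents the actual contract.

defmodule MyApp.EchoAdapter do
  # Hypothetical behaviour module and callback name; consult the
  # Custom LLM Adapters guide for the real contract.
  @behaviour DeepEvalEx.LLM.Adapter

  @impl true
  def generate(prompt, _opts) do
    # Trivial adapter that echoes the prompt, handy for offline tests
    {:ok, "echo: " <> prompt}
  end
end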

Usage with ExUnit

defmodule MyApp.LLMTest do
  use ExUnit.Case

  alias DeepEvalEx.{TestCase, Metrics}

  test "LLM generates accurate responses" do
    test_case = TestCase.new!(
      input: "What is 2 + 2?",
      actual_output: get_llm_response("What is 2 + 2?"),
      expected_output: "4"
    )

    {:ok, result} = Metrics.ExactMatch.measure(test_case)
    assert result.success, result.reason
  end
end

Concurrent Evaluation

Evaluate multiple test cases concurrently:

alias DeepEvalEx.{TestCase, Metrics}

test_cases = [
  TestCase.new!(input: "Q1", actual_output: "A1", expected_output: "A1"),
  TestCase.new!(input: "Q2", actual_output: "A2", expected_output: "A2")
]

results = DeepEvalEx.evaluate_batch(test_cases, [Metrics.ExactMatch],
  concurrency: 20
)
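
Assuming evaluate_batch returns a flat list of result structs with the success and reason fields shown in Quick Start (the actual return shape may differ), failed evaluations can be collected afterwards:

# Assumes a flat list of result structs; check the API Reference
# for the actual return shape of evaluate_batch
failures = Enum.reject(results, & &1.success)
Enum.each(failures, &IO.puts(&1.reason))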

Telemetry

DeepEvalEx emits telemetry events for observability:

:telemetry.attach(
  "my-handler",
  [:deep_eval_ex, :metric, :stop],
  fn _event, measurements, metadata, _config ->
    IO.puts("Metric #{metadata.metric} completed with score #{measurements.score}")
  end,
  nil
)

See Telemetry Guide for all events and integration patterns.

License

Apache 2.0 - See LICENSE and NOTICE for details.

This project is a derivative work of DeepEval by Confident AI, also licensed under Apache 2.0.
