coze-dev / coze-loop

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

agent open-source playground ai monitoring evaluation openai observability agentops coze langchain llmops prompt-management llm-observability agent-evaluation eino agent-observability

Updated Dec 5, 2025
Go

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Nov 18, 2025
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Dec 5, 2025
Python

mozilla-ai / any-agent

A single interface to use and evaluate different agent frameworks

ai mcp agents a2a agent-evaluation

Updated Dec 5, 2025
Python

rungalileo / agent-leaderboard

Ranking LLMs on agentic tasks

ai evaluation ai-agents synthetic-data ai-evaluation llms ai-benchmark agent-evaluation

Updated Nov 18, 2025
Jupyter Notebook

Cre4T3Tiv3 / ai-agents-reality-check

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible methodology. Separates architectural theater from real systems through stress testing, network resilience, and failure analysis.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-workflow agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Aug 8, 2025
Python

microsoft / ignite25-PREL13-observe-manage-and-scale-agentic-ai-apps-with-microsoft-foundry

Learn how to observe, manage, and scale agentic AI apps using Azure AI Foundry with this hands-on workshop

observability quality-evaluation aiops distillation-model azure-openai azure-ai-search safety-evaluation azure-ai-foundry supervised-fine-tuning agent-evaluation azure-ai-foundry-models

Updated Nov 24, 2025
Jupyter Notebook

SparkBeyond / agentune

Tune your AI Agent to best meet its KPI with a cyclic process of analyze, improve and simulate

customer-support customer-service conversational-agents ai-agents chatbot-evaluation agent-simulator kpi-analysis agent-evaluation agent-optimization sales-agents customer-facing-agents kpi-optimization

Updated Dec 4, 2025
Python

chaosync-org / awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

testing qa benchmark machine-learning evaluation chaos artificial-intelligence chaos-monkey testing-tools awesome-list quality-assurance ai-safety ai-agents chaos-engineering llm llm-evaluation agentic-ai ai-benchmark agent-evaluation

Updated May 28, 2025

shiragannavar / Testing-RAG

evaluation ground-truth llm generative-ai agent-evaluation

Updated May 12, 2025
Python

hidai25 / eval-view

EvalView: pytest-style test harness for AI agents - YAML scenarios, tool-call checks, cost/latency & safety evals, CI-friendly reports

testing evaluation pytest ai-agents mlops llm llmops anthropic openai-assistants crewai langgraph langgraph-python crewai-tools agent-evaluation agent-benchmark

Updated Dec 5, 2025
Python

lml2468 / ContextOptimizer

Intelligent Context Engineering Assistant for Multi-Agent Systems. Analyze, optimize, and enhance your AI agent configurations with AI-powered insights

multi-agent-systems prompt-engineering agent-evaluation context-engineering agent-optimizer

Updated Jul 5, 2025
Python

JetBrains / teamcity-ai-agent-testing-demo

End-to-end TeamCity framework to run AI agents on SWE-Bench Lite. Spin up isolated Docker images per task, extract patches, score with the official harness, and aggregate success rates. As an example, we'll look at Junie and Google Gemini CLI

ai evaluation eval evaluation-framework agentic-ai agent-evaluation evaluation-tools

Updated Aug 13, 2025
Kotlin

Rayyan-Oumlil / CustoFlow

Multi-agent customer support system built with Google ADK and Gemini 2.5 Flash Lite. Kaggle capstone demonstrating 11+ concepts. Automates 80%+ of queries with a <10s response time.

production-ready observability multi-agent-system openapi-tools supabase agent-evaluation google-adk a2a-protocol

Updated Dec 1, 2025
Python

anaishowland / neurosim

Neurosim is a Python framework for building, running, and evaluating AI agent systems. It provides core primitives for agent evaluation, cloud storage integration, and an LLM-as-a-judge system for automated scoring.

computer-vision evaluation-metrics evaluation-framework web-agent evals computer-use agent-evaluation

Updated Oct 29, 2025
Python

Arc-Computer / CL-Bench

Benchmark framework for evaluating LLM agent continual learning in stateful environments. Features production-realistic CRM workflows with multi-turn conversations, state mutations, and cross-entity relationships. Extensible to additional domains

benchmark continual-learning agent-evaluation

Updated Nov 14, 2025
Python

srikanthbaride / reflection-timing

Experiments and analysis on reflection timing in reinforcement learning agents, exploring self-evaluation, meta-learning, and adaptive reflection intervals.

python machine-learning reflection latex reinforcement-learning research-paper meta-learning self-play agent-evaluation

Updated Oct 8, 2025
Python

PabloCabaleiro / pondera

Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.

python ai agents model-agnostic ai-evaluation llms llm-evaluation llm-evaluation-framework llm-judge agent-evaluation ai-evaluation-framework rubric-based-evaluation yaml-first

Updated Oct 23, 2025
Python

pyros-projects / agent-comparison

Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks

orchestration ai-agents ai-benchmarks qualitative-evaluation llm-agents coding-agents agentic-workflows agent-evaluation agent-testing ai-coding-assistants agent-comparison development-tasks

Updated Nov 25, 2025
Python

ajmal-uk / kaggle-capstone-ai-agent

A safety-first multi-agent mental health companion with real-time distress tracking, triple-layer guardrails, and evidence-based grounding techniques. Built for Kaggle × Google Agents Intensive 2025 Capstone (Agents for Good Track)

gemini mental-health gradio observability crisis-support ai-agents multi-agent-system responsible-ai huggingface-spaces llm-safety agent-evaluation a2a-protocol grounding-techniques agents-for-good

Updated Nov 26, 2025
Python

agent-evaluation

Here are 28 public repositories matching this topic...
