
Commit 65494a3

olesho and Claude authored
Local Dockerised Eval Server (#52)
# Local Dockerised Eval Server

## Summary by CodeRabbit

- New Features
  - Centralized LLM configuration with programmatic panel API, per-request overrides, persistent settings, multi-tier models (main/mini/nano), and per-client application during evaluations.
  - Dynamic configuration via JSON-RPC (`configure_llm`) for runtime provider updates.
- Documentation
  - Added comprehensive LLM configuration docs, provider guides, protocol examples, usage snippets, and an environment template.
- Chores
  - Default ports updated for evaluation/HTTP endpoints and examples adjusted for automated mode (auth disabled by default).

Co-authored-by: Claude <noreply@anthropic.com>
1 parent: c561c79 · commit: 65494a3

22 files changed: 2,170 additions, 188 deletions

MODEL-CONFIGS.md

Lines changed: 450 additions & 0 deletions
Large diffs are not rendered by default.

eval-server/nodejs/.env.example

Lines changed: 45 additions & 0 deletions
```bash
# Evaluation Server Configuration
# Copy this file to .env and configure your settings

# Server Configuration
PORT=8080
HOST=127.0.0.1

# LLM Provider API Keys
# Configure one or more providers for evaluation

# OpenAI Configuration
OPENAI_API_KEY=sk-your-openai-api-key-here

# LiteLLM Configuration (if using a LiteLLM server)
LITELLM_ENDPOINT=http://localhost:4000
LITELLM_API_KEY=your-litellm-api-key-here

# Groq Configuration
GROQ_API_KEY=gsk_your-groq-api-key-here

# OpenRouter Configuration
OPENROUTER_API_KEY=sk-or-v1-your-openrouter-api-key-here

# Default LLM Configuration for Evaluations
# These will be used as fallbacks when not specified in evaluation requests
DEFAULT_PROVIDER=openai
DEFAULT_MAIN_MODEL=gpt-4
DEFAULT_MINI_MODEL=gpt-4-mini
DEFAULT_NANO_MODEL=gpt-3.5-turbo

# Logging Configuration
LOG_LEVEL=info
LOG_DIR=./logs

# Client Configuration
CLIENTS_DIR=./clients
EVALS_DIR=./evals

# RPC Configuration
RPC_TIMEOUT=30000

# Security
# Set this to enable authentication for client connections
# Leave empty to disable authentication
AUTH_SECRET_KEY=
```
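For orientation, here is a minimal sketch of how a Node process might consume this template. It is an assumption for illustration only: the commit's actual loading logic lives in `src/config.js` (not shown in this excerpt), and the fallback values below simply mirror the template.

```javascript
// Hypothetical config loader; the real src/config.js may differ.
import 'dotenv/config'; // reads .env into process.env

export const config = {
  port: Number(process.env.PORT ?? 8080),
  host: process.env.HOST ?? '127.0.0.1',
  rpcTimeout: Number(process.env.RPC_TIMEOUT ?? 30000),
  defaults: {
    provider: process.env.DEFAULT_PROVIDER ?? 'openai',
    mainModel: process.env.DEFAULT_MAIN_MODEL ?? 'gpt-4',
    miniModel: process.env.DEFAULT_MINI_MODEL ?? 'gpt-4-mini',
    nanoModel: process.env.DEFAULT_NANO_MODEL ?? 'gpt-3.5-turbo',
  },
  // An empty AUTH_SECRET_KEY disables authentication, per the template comments.
  authSecretKey: process.env.AUTH_SECRET_KEY || null,
};
```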

eval-server/nodejs/CLAUDE.md

Lines changed: 89 additions & 2 deletions
````diff
@@ -22,6 +22,16 @@ bo-eval-server is a WebSocket-based evaluation server for LLM agents that implem
 - `OPENAI_API_KEY` - OpenAI API key for LLM judge functionality
 - `PORT` - WebSocket server port (default: 8080)
 
+### LLM Provider Configuration (Optional)
+- `GROQ_API_KEY` - Groq API key for Groq provider support
+- `OPENROUTER_API_KEY` - OpenRouter API key for OpenRouter provider support
+- `LITELLM_ENDPOINT` - LiteLLM server endpoint URL
+- `LITELLM_API_KEY` - LiteLLM API key for LiteLLM provider support
+- `DEFAULT_PROVIDER` - Default LLM provider (openai, groq, openrouter, litellm)
+- `DEFAULT_MAIN_MODEL` - Default main model name
+- `DEFAULT_MINI_MODEL` - Default mini model name
+- `DEFAULT_NANO_MODEL` - Default nano model name
+
 ## Architecture
 
 ### Core Components
@@ -33,10 +43,11 @@
 - Handles bidirectional RPC communication
 
 **RPC Client** (`src/rpc-client.js`)
-- Implements JSON-RPC 2.0 protocol for server-to-client calls
+- Implements JSON-RPC 2.0 protocol for bidirectional communication
 - Manages request/response correlation with unique IDs
 - Handles timeouts and error conditions
 - Calls `Evaluate(request: String) -> String` method on connected agents
+- Supports `configure_llm` method for dynamic LLM provider configuration
 
 **LLM Evaluator** (`src/evaluator.js`)
 - Integrates with OpenAI API for LLM-as-a-judge functionality
@@ -78,7 +89,10 @@ logs/ # Log files (created automatically)
 ### Key Features
 
 - **Bidirectional RPC**: Server can call methods on connected clients
-- **LLM-as-a-Judge**: Automated evaluation of agent responses using GPT-4
+- **Multi-Provider LLM Support**: Support for OpenAI, Groq, OpenRouter, and LiteLLM providers
+- **Dynamic LLM Configuration**: Runtime configuration via `configure_llm` JSON-RPC method
+- **Per-Client Configuration**: Each connected client can have different LLM settings
+- **LLM-as-a-Judge**: Automated evaluation of agent responses using configurable LLM providers
 - **Concurrent Evaluations**: Support for multiple agents and parallel evaluations
 - **Structured Logging**: All interactions logged as JSON for analysis
 - **Interactive CLI**: Built-in CLI for testing and server management
@@ -93,6 +107,79 @@ Agents must implement:
 - `Evaluate(task: string) -> string` method
 - "ready" message to signal availability for evaluations
 
+### Model Configuration Schema
+
+The server uses a canonical nested model configuration format that allows per-tier provider and API key settings:
+
+#### Model Configuration Structure
+
+```typescript
+interface ModelTierConfig {
+  provider: string;  // "openai" | "groq" | "openrouter" | "litellm"
+  model: string;     // Model name (e.g., "gpt-4", "llama-3.1-8b-instant")
+  api_key: string;   // API key for this tier
+}
+
+interface ModelConfig {
+  main_model: ModelTierConfig;  // Primary model for complex tasks
+  mini_model: ModelTierConfig;  // Secondary model for simpler tasks
+  nano_model: ModelTierConfig;  // Tertiary model for basic tasks
+}
+```
+
+#### Example: Evaluation with Model Configuration
+
+```json
+{
+  "jsonrpc": "2.0",
+  "method": "evaluate",
+  "params": {
+    "tool": "chat",
+    "input": {"message": "Hello"},
+    "model": {
+      "main_model": {
+        "provider": "openai",
+        "model": "gpt-4",
+        "api_key": "sk-main-key"
+      },
+      "mini_model": {
+        "provider": "openai",
+        "model": "gpt-4-mini",
+        "api_key": "sk-mini-key"
+      },
+      "nano_model": {
+        "provider": "groq",
+        "model": "llama-3.1-8b-instant",
+        "api_key": "gsk-nano-key"
+      }
+    }
+  }
+}
+```
+
+### Dynamic LLM Configuration
+
+The server supports runtime LLM configuration via the `configure_llm` JSON-RPC method:
+
+```json
+{
+  "jsonrpc": "2.0",
+  "method": "configure_llm",
+  "params": {
+    "provider": "openai|groq|openrouter|litellm",
+    "apiKey": "your-api-key",
+    "endpoint": "endpoint-url-for-litellm",
+    "models": {
+      "main": "main-model-name",
+      "mini": "mini-model-name",
+      "nano": "nano-model-name"
+    },
+    "partial": false
+  },
+  "id": "config-request-id"
+}
+```
+
 ### Configuration
 
 All configuration is managed through environment variables and `src/config.js`. Key settings:
````
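The diff documents only the request side of `configure_llm`. JSON-RPC 2.0 requires a response that echoes the request `id`, so a success reply would look roughly like the sketch below; the exact `result` payload is an assumption, not something this commit shows.

```json
{
  "jsonrpc": "2.0",
  "result": {
    "status": "updated",
    "provider": "groq"
  },
  "id": "config-request-id"
}
```

The `partial` flag in the request presumably controls whether the supplied fields replace the whole configuration or are merged into it; with `partial: true`, a client could update, say, only the `mini` model while leaving the other tiers untouched.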

eval-server/nodejs/README.md

Lines changed: 17 additions & 1 deletion
```diff
@@ -145,7 +145,23 @@ server.onConnect(async client => {
       message: "Your question here"
     },
     timeout: 30000, // Optional timeout (ms)
-    model: {}, // Optional model config
+    model: { // Optional nested model config
+      main_model: {
+        provider: "openai",
+        model: "gpt-4",
+        api_key: "sk-..."
+      },
+      mini_model: {
+        provider: "openai",
+        model: "gpt-4-mini",
+        api_key: "sk-..."
+      },
+      nano_model: {
+        provider: "groq",
+        model: "llama-3.1-8b-instant",
+        api_key: "gsk-..."
+      }
+    },
     metadata: { // Optional metadata
       tags: ['api', 'test']
     }
```
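Since the `.env` template marks the `DEFAULT_*` variables as fallbacks "used when not specified in evaluation requests", the nested `model` block above should be omittable. A hedged sketch follows; the `client.evaluate(...)` call shape is extrapolated from the README fragment above and is not confirmed by this excerpt.

```javascript
// Hypothetical call relying on server-side defaults:
// omitting `model` should fall back to DEFAULT_PROVIDER / DEFAULT_*_MODEL.
const response = await client.evaluate({
  tool: "chat",
  input: { message: "Your question here" },
  timeout: 30000
  // no `model` key: the .env defaults apply
});
```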
