LLM Benchmarking #3486

bradleyshep · 2025-10-25T02:20:39Z

Description of Changes

Introduce a new LLM benchmarking app and supporting code.

CLI: llm with subcommands run, routes list, diff, ci-check.
Runner: executes globally numbered tasks; filters by --lang, --categories, --tasks, --providers, --models.
Providers/clients: route layer (provider:model) with HTTP clients for LLM vendors; API keys and base URLs come from environment variables.
Evaluation: deterministic scorers (hash/equality, JSON shape/count, light schema/reducer parity) with clear failure messages.
Results: stable JSON schema; single-file HTML viewer to inspect/filter/export CSV.
Build & guards: build script for compile-time setup.
Docs: DEVELOP.md includes cargo llm … usage.

This PR is the initial addition of the app and its modules (runner, config, routes, prompt/segmentation, scorers, schema/types, defaults/constants/paths/hashing/combine, publishers, spacetime guard, HTML stats viewer).
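To make the route layer concrete, here is a minimal sketch of parsing a `provider:model` string. The `Route` struct and `parse_route` function are hypothetical names for illustration and are not taken from the crate; the actual implementation may validate or normalize differently.

```rust
/// A route pairs a vendor with one of its models, e.g. `openai:gpt-5`.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq, Eq)]
struct Route {
    provider: String,
    model: String,
}

/// Parse `provider:model`, splitting on the first colon only so that a
/// model name containing further colons is kept intact.
fn parse_route(spec: &str) -> Result<Route, String> {
    match spec.split_once(':') {
        Some((provider, model)) if !provider.trim().is_empty() && !model.trim().is_empty() => {
            Ok(Route {
                provider: provider.trim().to_ascii_lowercase(),
                model: model.trim().to_string(),
            })
        }
        _ => Err(format!("expected provider:model, got {spec:?}")),
    }
}

fn main() {
    match parse_route("openai:gpt-5") {
        Ok(route) => println!("provider={} model={}", route.provider, route.model),
        Err(e) => eprintln!("{e}"),
    }
    // A bare model name without a provider is rejected.
    println!("{:?}", parse_route("gpt-5"));
}
```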

How it works

Pick what to run
- Choose tasks (--tasks 0,7,12), a language (--lang rust|csharp), or categories (--categories basics,schema).
- Optionally limit vendors/models (--providers …, --models …).

Resolve routes
- Read the environment (API keys + base URLs) and build the active set of routes (e.g., openai:gpt-5).

Build context
- Start Spacetime.
- Publish the golden answer modules.
- Prepare prompts and send them to the LLM.
- Attempt to publish the LLM-generated module.

Execute calls
- Run the selected tasks within each test against the selected models and languages.

Score outputs
- Apply deterministic scorers (hash/equality, JSON shape/count, simple schema/reducer checks).
- Record the score and any short failure reason.

Update results file
- Write/update the single results JSON with task/route outcomes, timings, and summaries.
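The scoring step above can be sketched as a count check followed by an equality check that returns a short failure reason. This is only an illustration under assumed names (`ScoreResult`, `score_rows`); the real scorers also cover hashes, JSON shape, and schema/reducer parity, which are not shown here.

```rust
/// Outcome of one deterministic check: pass/fail plus a short reason on failure.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct ScoreResult {
    passed: bool,
    reason: Option<String>,
}

/// Compare rows produced by the golden module with rows produced by the
/// LLM-generated module: first the row count, then row-by-row equality.
fn score_rows(golden: &[&str], candidate: &[&str]) -> ScoreResult {
    if golden.len() != candidate.len() {
        return ScoreResult {
            passed: false,
            reason: Some(format!(
                "row count mismatch: expected {}, got {}",
                golden.len(),
                candidate.len()
            )),
        };
    }
    for (i, (g, c)) in golden.iter().zip(candidate).enumerate() {
        if g != c {
            return ScoreResult {
                passed: false,
                reason: Some(format!("row {i} differs: expected {g:?}, got {c:?}")),
            };
        }
    }
    ScoreResult { passed: true, reason: None }
}

fn main() {
    let result = score_rows(&["alice", "bob"], &["alice", "carol"]);
    println!("passed={} reason={:?}", result.passed, result.reason);
}
```

Because the same inputs always give the same verdict and message, reruns of a task can be compared directly in the results file.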

API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

Expected complexity level and risk

4/5. New CLI, routing, evaluation, and artifact format.

External model APIs may rate-limit or time out; concurrency is tunable via LLM_BENCH_CONCURRENCY and LLM_BENCH_ROUTE_CONCURRENCY.
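As a rough sketch of how such an environment-driven limit could be read: the helper names and the default values below are assumptions for illustration, not the crate's actual defaults. Only the two variable names come from this PR.

```rust
use std::env;

/// Parse a concurrency limit, falling back to `default` when the value is
/// missing, not a number, or zero. (Hypothetical helper.)
fn parse_concurrency(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}

/// Read the limit from the named environment variable.
fn concurrency_from_env(var: &str, default: usize) -> usize {
    parse_concurrency(env::var(var).ok().as_deref(), default)
}

fn main() {
    // The defaults 8 and 4 are placeholders chosen for this example.
    let tasks = concurrency_from_env("LLM_BENCH_CONCURRENCY", 8);
    let routes = concurrency_from_env("LLM_BENCH_ROUTE_CONCURRENCY", 4);
    println!("task concurrency = {tasks}, route concurrency = {routes}");
}
```

Lowering either value is the simplest mitigation when a vendor starts returning rate-limit errors.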

Testing

I ran the full test matrix and generated results for every task against every vendor, model, and language (Rust and C#). I also tested the CI check locally using act.

.github/workflows/ci.yml

In general, the fix is to add an explicit permissions block to the csharp-testsuite job (or at workflow root) to limit the default GITHUB_TOKEN permissions to the minimal set required, which in this case appears to be read-only repository contents. This prevents the job from inheriting broader default permissions such as write access.

The best targeted fix, without changing behavior, is to add permissions: contents: read directly under the csharp-testsuite job definition. Other jobs already have their own permission blocks, so setting it at the workflow root is unnecessary. Nothing in csharp-testsuite needs to write checks, statuses, or pull requests; it only checks out code and runs tests, so contents: read is sufficient and consistent with the CodeQL recommendation. Concretely, in .github/workflows/ci.yml, between line 650 (csharp-testsuite:) and line 651 (needs: [lints, llm_ci_check]), insert:

permissions:
  contents: read

No additional imports, methods, or definitions are required since this is a configuration-only change in the workflow file.

Add retry logic for signal-killed processes (SIGSEGV) with up to 2 retries and 500ms delay between attempts. Also reduce C# build concurrency from 8 to 4 by default to prevent resource contention in dotnet/WASI SDK builds. The C# concurrency can be configured via LLM_BENCH_CSHARP_CONCURRENCY env var.
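A minimal sketch of this retry policy, assuming a Unix-like platform where `ExitStatus::code()` returns `None` when a process was terminated by a signal. The function names are hypothetical, and the example command in `main` is a stand-in for the real build invocation.

```rust
use std::io;
use std::process::{Command, ExitStatus};
use std::thread::sleep;
use std::time::Duration;

const MAX_RETRIES: u32 = 2;
const RETRY_DELAY: Duration = Duration::from_millis(500);

/// On Unix, a missing exit code means the process was killed by a signal
/// (for example SIGSEGV) rather than exiting normally.
fn killed_by_signal(status: &ExitStatus) -> bool {
    status.code().is_none()
}

/// Run `attempt`, retrying up to MAX_RETRIES times with `delay` between
/// attempts, but only when the process was killed by a signal.
/// Ordinary non-zero exits are returned immediately.
fn run_with_retry<F>(mut attempt: F, delay: Duration) -> io::Result<ExitStatus>
where
    F: FnMut() -> io::Result<ExitStatus>,
{
    let mut retries = 0;
    loop {
        let status = attempt()?;
        if killed_by_signal(&status) && retries < MAX_RETRIES {
            retries += 1;
            sleep(delay);
            continue;
        }
        return Ok(status);
    }
}

fn main() {
    // `rustc --version` is only a placeholder command for this example.
    match run_with_retry(|| Command::new("rustc").arg("--version").status(), RETRY_DELAY) {
        Ok(status) => println!("process finished: {status}"),
        Err(e) => eprintln!("failed to spawn process: {e}"),
    }
}
```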

Set MSBUILDDISABLENODEREUSE=1 and DOTNET_CLI_USE_MSBUILD_SERVER=0 to prevent resource contention when running multiple dotnet publish commands in parallel on GitHub Actions runners. See: dotnet/msbuild#6657
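For illustration, the two variables can be set per child process with `std::process::Command::env`. The helper name and the bare `publish` argument list are assumptions; the real invocation likely passes more flags.

```rust
use std::process::Command;

/// Build a `dotnet publish` command with MSBuild node reuse and the MSBuild
/// server disabled, so parallel builds do not share long-lived worker processes.
/// (Hypothetical helper; the actual argument list is not shown in this PR text.)
fn dotnet_publish_command(project_dir: &str) -> Command {
    let mut cmd = Command::new("dotnet");
    cmd.arg("publish")
        .current_dir(project_dir)
        .env("MSBUILDDISABLENODEREUSE", "1")
        .env("DOTNET_CLI_USE_MSBUILD_SERVER", "0");
    cmd
}

fn main() {
    // Only print the configured command; nothing is spawned here.
    let cmd = dotnet_publish_command(".");
    println!("{cmd:?}");
}
```

Setting the variables on the child command rather than process-wide keeps the change scoped to the C# builds.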

clockwork-labs-bot · 2026-01-06T00:39:46Z

LLM Benchmark Results (ci-quickfix)

Language	Mode	Category	Tests Passed	Pass %	Task Pass %
Rust	rustdoc_json	basics	20/27	74.1%	75.0%
Rust	rustdoc_json	schema	23/34	67.6%	60.0%
Rust	rustdoc_json	total	43/61	70.5%	68.2%
C#	docs	basics	27/27	100.0%	100.0%
C#	docs	schema	31/34	91.2%	90.0%
C#	docs	total	58/61	95.1%	95.5%

_Generated at: 2026-01-06T00:39:43.087Z_
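The Pass % column in the table above is the passed/total ratio rounded to one decimal place, which the short check below reproduces. The function name is hypothetical. The Task Pass % column appears to be aggregated per task and is not reproduced here, since its exact formula is not stated in this thread.

```rust
/// Percentage of passed tests; an empty category is reported as 0.0.
fn pass_percent(passed: u32, total: u32) -> f64 {
    if total == 0 {
        return 0.0;
    }
    f64::from(passed) / f64::from(total) * 100.0
}

fn main() {
    // (label, passed, total) taken from the "Tests Passed" column.
    let rows = [
        ("Rust basics", 20, 27),
        ("Rust schema", 23, 34),
        ("Rust total", 43, 61),
        ("C# basics", 27, 27),
        ("C# schema", 31, 34),
        ("C# total", 58, 61),
    ];
    for (label, passed, total) in rows {
        println!("{label}: {passed}/{total} = {:.1}%", pass_percent(passed, total));
    }
}
```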

cloutiertyler · 2026-01-06T01:28:29Z

I think we're okay to merge this now that /update-llm-benchmark can run the fix automatically on GitHub.

bradleyshep added 8 commits October 24, 2025 15:00

init files

7110164

Update llm-benchmark-details.json

ef61f28

llm benchmarks (moved from private)

a200254

remove dotenvy

58a5a08

ignore registry

af45a20

summary updates; command

961dd1c

Merge branch 'LLM-benchmarks' into bradley/llm-benchmark

8e6624f

develop updates

b59650f

bradleyshep requested a review from cloutiertyler October 25, 2025 02:20

github-advanced-security bot found potential problems Oct 25, 2025

.github/workflows/ci.yml: fixed

bradleyshep added 4 commits October 24, 2025 22:28

DEVELOP + registry ignored

38596ba

change generated registry to use relative paths + include in git

7d69779

attempt fix to pass

3aa051b

DEVELOP updates; clippy fixes?

e443251

bfops added the release-any label (To be landed in any release window) Oct 27, 2025

bradleyshep and others added 6 commits November 3, 2025 13:11

clippy fixes

8161e45

Update ci.yml

26de9c4

Potential fix for code scanning alert no. 106: Workflow does not cont…

79d4abe

…ain permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>

Merge branch 'master' into bradley/llm-benchmark

0f606b0

Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>

bump to 1.6, fixes

edeefb1

partial category scores

3a0c2de

cloutiertyler reviewed Nov 4, 2025

docs/DEVELOP.md: resolved

Update DEVELOP.md

4466fa0

cloutiertyler reviewed Nov 7, 2025

docs/llms/DEVELOP.md: outdated, resolved

bradleyshep and others added 5 commits November 7, 2025 15:25

Remove diff; add ci-quickfix

eb46333

Merge branch 'master' into bradley/llm-benchmark

1534e75

Merge branch 'master' into bradley/llm-benchmark

fd1933e

Fixes whitespace

bc5e5bc

Merges in master

7d811b1

cloutiertyler reviewed Dec 31, 2025

crates/xtask-llm-benchmark/src/bin/llm_benchmark.rs: outdated, resolved

cloutiertyler added 5 commits January 4, 2026 16:21

Made long running jobs dependent on short running basic checks

8d916b8

Forgot to save file

25e79c5

Consolidated internal tests into the CI workflow

861c804

slight name change

a0817d5

Small fix. Lints now needed to run c sharp test suite

0d13679

github-advanced-security bot found potential problems Jan 4, 2026

cloutiertyler added 7 commits January 5, 2026 00:00

cargo fmt

e278696

Try to fix C# problems

1095c95

Add MSBuild env vars to fix "Pipe is broken" errors in CI

320c46c

Removed errant file

916fafb

Hopefully fix thing

1faf580

Added nix flake check

f570c3b

github-advanced-security bot found potential problems Jan 6, 2026

.github/workflows/ci.yml: fixed

spacetimedb-bot and others added 2 commits January 6, 2026 00:39

Update LLM benchmark results

b36fc11

Fixed workflow

04eb91a

cloutiertyler approved these changes Jan 6, 2026

jdetter added 11 commits January 5, 2026 21:49

Nix flake CI on new runner

4213fe4

Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>

Update ci.yml

7396996

Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>

Quick fix from chatgpt

cb629a7

Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>

More debug for nix flake failure

e8c8ebc

Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>

Even more nix debug info

7408818

Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>

Even more logs for nix flake

b041cfd

Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>

Update parallelism for nix flake

9a56549

Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>

Fix syntax issue in nix flake check

1536389

Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>

Rust full backtrace for nix flake test

0adcb86

Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>

Don't depend on spacetimedb-lib 1.6.0 because the nix flake can't bui…

e1b3325

…ld it

debugging jemalloc

e51b4e2

@@ -648,6 +648,8 @@
           UNITY_SERIAL: ${{ secrets.UNITY_SERIAL }}

   csharp-testsuite:
+    permissions:
+      contents: read
     needs: [lints, llm_ci_check]
     runs-on: spacetimedb-new-runner
     container:
