@bradleyshep bradleyshep commented Oct 25, 2025

Description of Changes

Introduce a new LLM benchmarking app and supporting code.

  • CLI: llm with subcommands run, routes list, diff, ci-check.
  • Runner: executes globally numbered tasks; filters by --lang, --categories, --tasks, --providers, --models.
  • Providers/clients: route layer (provider:model) with HTTP clients for LLM vendors; env-driven keys/base URLs.
  • Evaluation: deterministic scorers (hash/equality, JSON shape/count, light schema/reducer parity) with clear failure messages.
  • Results: stable JSON schema; single-file HTML viewer to inspect/filter/export CSV.
  • Build & guards: build script for compile-time setup.
  • Docs: DEVELOP.md includes cargo llm … usage.

This PR is the initial addition of the app and its modules (runner, config, routes, prompt/segmentation, scorers, schema/types, defaults/constants/paths/hashing/combine, publishers, spacetime guard, HTML stats viewer).
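
The route layer identifies each target as provider:model (e.g., openai:gpt-5). A minimal sketch of parsing such a string is shown here for illustration; the `Route` struct and `parse_route` function are hypothetical names, not the types this PR adds.

```rust
/// Hypothetical route type for illustration; the PR's real type may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Route {
    pub provider: String,
    pub model: String,
}

/// Parse a `provider:model` string such as `openai:gpt-5`.
/// Splits on the first `:` only, so a model name containing `:` stays intact.
pub fn parse_route(s: &str) -> Result<Route, String> {
    match s.trim().split_once(':') {
        Some((provider, model)) if !provider.is_empty() && !model.is_empty() => Ok(Route {
            provider: provider.to_string(),
            model: model.to_string(),
        }),
        _ => Err(format!("expected `provider:model`, got `{}`", s)),
    }
}
```

With this sketch, `parse_route("openai:gpt-5")` yields a route with provider `openai` and model `gpt-5`, while an input without a colon is rejected with a short message.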

How it works

  1. Pick what to run

    • Choose tasks (--tasks 0,7,12), or a language (--lang rust|csharp), or categories (--categories basics,schema).
    • Optionally limit vendors/models (--providers …, --models …).
  2. Resolve routes

    • Read env (API keys + base URLs) and build the active set (e.g., openai:gpt-5).
  3. Build context

    • Start Spacetime
    • Publish golden answer modules
    • Prepare prompts and send them to the LLM
    • Attempt to publish LLM module
  4. Execute calls

    • Run the selected tasks within each test against selected models and languages.
  5. Score outputs

    • Apply deterministic scorers (hash/equality, JSON shape/count, simple schema/reducer checks).
    • Record the score and any short failure reason.
  6. Update results file

    • Write/update the single results JSON with task/route outcomes, timings, and summaries.
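
The scoring step above (hash/equality plus a short failure reason) could look roughly like the following. This is a sketch under assumptions: the `Score` struct, the whitespace normalization, and the use of the standard library's `DefaultHasher` are illustrative choices, not the scorers implemented in this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical score record: pass/fail plus an optional short failure reason.
#[derive(Debug, PartialEq)]
pub struct Score {
    pub passed: bool,
    pub reason: Option<String>,
}

/// Collapse runs of whitespace so formatting differences do not affect the result.
/// (Assumption: the real scorers may normalize differently, or not at all.)
fn normalize(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hash a string with the standard library's default hasher.
fn hash_of(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

/// Compare the golden output with the model's output by hash of the normalized text.
pub fn score_equality(expected: &str, actual: &str) -> Score {
    let (e, a) = (normalize(expected), normalize(actual));
    if hash_of(&e) == hash_of(&a) {
        Score { passed: true, reason: None }
    } else {
        Score {
            passed: false,
            reason: Some(format!("output mismatch: expected `{}`, got `{}`", e, a)),
        }
    }
}
```

Because the comparison involves no model call and no randomness, the same pair of inputs always produces the same score within a run, which is the property the description calls deterministic.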

API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

Expected complexity level and risk

4/5. New CLI, routing, evaluation, and artifact format.

  • External model APIs may rate-limit/timeout; concurrency tunable via LLM_BENCH_CONCURRENCY / LLM_BENCH_ROUTE_CONCURRENCY.
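
As an illustration of the concurrency knobs, the sketch below reads a positive integer from an environment variable and falls back to a default. The function names and the fallback behavior (ignoring zero or unparsable values) are assumptions; the description only states that LLM_BENCH_CONCURRENCY and LLM_BENCH_ROUTE_CONCURRENCY exist.

```rust
use std::env;

/// Interpret an optional raw value as a positive concurrency limit.
/// Falls back to `default` when the value is missing, unparsable, or zero.
pub fn parse_concurrency(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}

/// Read the limit from the named environment variable.
/// `default` is supplied by the caller; this sketch does not assume the tool's defaults.
pub fn concurrency_from_env(var: &str, default: usize) -> usize {
    parse_concurrency(env::var(var).ok().as_deref(), default)
}
```

A caller might use `concurrency_from_env("LLM_BENCH_CONCURRENCY", 8)`, where 8 is a placeholder default chosen for the example.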

Testing

I ran the full test matrix and generated results for every task against every vendor, model, and language (Rust and C#). I also tested the CI check locally using act.

Please verify

  • llm run --tasks 0,1,2 (explicit run)
  • llm run --lang rust --categories basics (filters)
  • llm run --categories basics,schema (multiple categories)
  • llm run --lang csharp (language switch)
  • llm run --providers openai,anthropic --models "openai:gpt-5 anthropic:claude-sonnet-4-5" (provider/model limits)
  • llm run --hash-only (dry integrity)
  • llm run --goldens-only (test goldens only)
  • llm run --force (skip hash check)
  • llm ci-check
  • Stats viewer loads the JSON; filtering and CSV export work
  • CI works as intended
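
One way to read the filter invocations above is as a conjunction of optional predicates over the task list. The sketch below assumes that interpretation (filters combine with AND, and an unset filter matches everything); the `Task` and `Filter` types and the parsing helpers are hypothetical and not taken from the PR.

```rust
/// Hypothetical task descriptor for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Task {
    pub id: u32,
    pub lang: &'static str,
    pub category: &'static str,
}

/// Hypothetical filter built from --tasks, --lang and --categories.
#[derive(Debug, Default)]
pub struct Filter {
    pub tasks: Option<Vec<u32>>,
    pub lang: Option<String>,
    pub categories: Option<Vec<String>>,
}

/// Split a comma-separated value such as `basics,schema`, dropping empty parts.
pub fn parse_csv(s: &str) -> Vec<String> {
    s.split(',')
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .map(String::from)
        .collect()
}

/// Parse a comma-separated list of task numbers such as `0,7,12`.
pub fn parse_task_ids(s: &str) -> Result<Vec<u32>, std::num::ParseIntError> {
    s.split(',').map(|p| p.trim().parse::<u32>()).collect()
}

impl Filter {
    /// A task is selected when every filter that is set accepts it.
    pub fn matches(&self, t: &Task) -> bool {
        self.tasks.as_ref().map_or(true, |ids| ids.contains(&t.id))
            && self.lang.as_deref().map_or(true, |l| l == t.lang)
            && self
                .categories
                .as_ref()
                .map_or(true, |cs| cs.iter().any(|c| c.as_str() == t.category))
    }
}
```

Under these assumptions, `--lang rust --categories basics` corresponds to a `Filter` with `lang` and `categories` set and `tasks` left as `None`.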

@bfops bfops added the release-any To be landed in any release window label Oct 27, 2025
bradleyshep and others added 6 commits November 3, 2025 13:11
…ain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI about 4 hours ago

In general, the fix is to add an explicit permissions block to the csharp-testsuite job (or at workflow root) to limit the default GITHUB_TOKEN permissions to the minimal set required, which in this case appears to be read-only repository contents. This prevents the job from inheriting broader default permissions such as write access.

The best targeted fix, without changing behavior, is to add permissions: contents: read directly under the csharp-testsuite job definition. Other jobs already have their own permission blocks, so setting it at the workflow root is unnecessary. Nothing in csharp-testsuite needs to write checks, statuses, or pull requests; it only checks out code and runs tests, so contents: read is sufficient and consistent with the CodeQL recommendation. Concretely, in .github/workflows/ci.yml, between line 650 (csharp-testsuite:) and line 651 (needs: [lints, llm_ci_check]), insert:

    permissions:
      contents: read

No additional imports, methods, or definitions are required since this is a configuration-only change in the workflow file.

Suggested changeset 1
.github/workflows/ci.yml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -648,6 +648,8 @@
           UNITY_SERIAL: ${{ secrets.UNITY_SERIAL }}
 
   csharp-testsuite:
+    permissions:
+      contents: read
     needs: [lints, llm_ci_check]
     runs-on: spacetimedb-new-runner
     container:
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
Add retry logic for signal-killed processes (SIGSEGV) with up to 2 retries
and 500ms delay between attempts. Also reduce C# build concurrency from 8
to 4 by default to prevent resource contention in dotnet/WASI SDK builds.

The C# concurrency can be configured via LLM_BENCH_CSHARP_CONCURRENCY env var.
Set MSBUILDDISABLENODEREUSE=1 and DOTNET_CLI_USE_MSBUILD_SERVER=0 to
prevent resource contention when running multiple dotnet publish commands
in parallel on GitHub Actions runners.

See: dotnet/msbuild#6657
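
The retry behavior described in this commit message (up to 2 retries, 500 ms apart, only when the process was killed by a signal) can be sketched as follows. This is an illustration, not the committed code: `retry_on_signal` and `exit_code` are hypothetical names, and the sketch relies on the fact that on Unix `ExitStatus::code()` returns `None` when a signal terminated the process.

```rust
use std::io;
use std::process::Command;
use std::thread;
use std::time::Duration;

/// Run `attempt`, retrying while it reports a signal kill (`Ok(None)`).
/// Makes at most `max_retries` extra attempts and returns the last outcome.
/// I/O errors from `attempt` are returned immediately without retrying.
pub fn retry_on_signal<F>(
    mut attempt: F,
    max_retries: u32,
    delay: Duration,
) -> io::Result<Option<i32>>
where
    F: FnMut() -> io::Result<Option<i32>>,
{
    let mut retries = 0;
    loop {
        match attempt()? {
            // Killed by a signal and retries remain: wait, then try again.
            None if retries < max_retries => {
                retries += 1;
                thread::sleep(delay);
            }
            // Normal exit (any code), or retries exhausted.
            outcome => return Ok(outcome),
        }
    }
}

/// Run a command to completion and return its exit code.
/// On Unix the code is `None` when the process was terminated by a signal.
pub fn exit_code(cmd: &mut Command) -> io::Result<Option<i32>> {
    Ok(cmd.status()?.code())
}
```

A build step could then be wrapped as `retry_on_signal(|| exit_code(Command::new("dotnet").arg("publish")), 2, Duration::from_millis(500))`; the exact command line here is only an example.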
@clockwork-labs-bot
Collaborator

LLM Benchmark Results (ci-quickfix)

Language  Mode          Category  Tests Passed  Pass %  Task Pass %
Rust      rustdoc_json  basics    20/27         74.1%   75.0%
Rust      rustdoc_json  schema    23/34         67.6%   60.0%
Rust      rustdoc_json  total     43/61         70.5%   68.2%
C#        docs          basics    27/27         100.0%  100.0%
C#        docs          schema    31/34         91.2%   90.0%
C#        docs          total     58/61         95.1%   95.5%

Generated at: 2026-01-06T00:39:43.087Z
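
The Pass % column in the results above is consistent with passed tests divided by total tests, shown to one decimal place. The sketch below reproduces that arithmetic; the function names are hypothetical, and the Task Pass % column is not covered because the per-task counts behind it are not shown here.

```rust
/// Percentage of passed tests, or `None` when the inputs are not meaningful.
pub fn pass_percent(passed: u32, total: u32) -> Option<f64> {
    if total == 0 || passed > total {
        return None;
    }
    Some(f64::from(passed) / f64::from(total) * 100.0)
}

/// Render a "passed/total pct%" cell with one decimal place.
pub fn format_cell(passed: u32, total: u32) -> String {
    match pass_percent(passed, total) {
        Some(p) => format!("{}/{} {:.1}%", passed, total, p),
        None => format!("{}/{} n/a", passed, total),
    }
}
```

For example, the Rust totals row combines 20 + 23 = 43 passed tests out of 27 + 34 = 61, and 43/61 formats as 70.5%.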

@cloutiertyler
Contributor

I think we're okay to merge this now that /update-llm-benchmark can run the fix automatically on GitHub.

jdetter added 11 commits January 5, 2026 21:49
Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>
