Skip to content

Conversation

@yash2998chhabria
Copy link
Contributor

@yash2998chhabria yash2998chhabria commented Dec 4, 2025

Summary

  • Fixed grading LLMs incorrectly evaluating attack prompts instead of target responses
  • Created dedicated REDTEAM_GRADING_PROMPT for redteam graders only
  • Added test coverage for steganographic attack scenarios

Problem

When redteam graders evaluated responses, the grading LLM would sometimes analyze the attack prompt (embedded in the <Rubric> section as context) instead of the target model's actual response (in the <Output> section).

This was particularly problematic with steganographic attacks where the attack prompt contained encoded harmful content, causing false failures even when the target correctly refused the request.

Solution

Created a new REDTEAM_GRADING_PROMPT constant in src/prompts/grading.ts:

  • Added CRITICAL INSTRUCTION to evaluate ONLY the <Output> content
  • Clarified that <Rubric> contains contextual information (not content to grade)
  • Required the "reason" field to reference OUTPUT content
  • Updated RedteamGraderBase to use this dedicated prompt via rubricPrompt parameter

Impact

  • Fixes all 70+ redteam graders that use matchesLlmRubric()
  • DEFAULT_GRADING_PROMPT unchanged - no impact on non-redteam grading
  • Fully backward compatible - isolated to redteam graders only
  • Test coverage added to prevent regression

Test Results

✅ 3/3 new tests passing
✅ 31/31 grader tests passing
✅ Live redteam test completed successfully (60% pass rate)

🤖 Generated with Claude Code

Copy link
Contributor

@promptfoo-scanner promptfoo-scanner bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 All Clear

I reviewed this PR for LLM security vulnerabilities (prompt injection, secrets/PII in prompts, and excessive agency). This change improves the grading prompt to prevent evaluator confusion between attack prompts and target responses. No security vulnerabilities were found.

Minimum severity threshold for this scan: 🟡 Medium | Learn more

@yash2998chhabria yash2998chhabria marked this pull request as ready for review December 4, 2025 23:05
@yash2998chhabria yash2998chhabria requested a review from a team as a code owner December 4, 2025 23:05
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 4, 2025

📝 Walkthrough

Walkthrough

This pull request introduces output-centric grading for redteam evaluations. The DEFAULT_GRADING_PROMPT in src/prompts/grading.ts is expanded with explicit instructions to evaluate only content within <Output> tags while using <Rubric> for criteria reference, preventing false failures from complex or steganographic prompts. Supporting documentation is added to CHANGELOG.md. A new test suite test/redteam/steganographic-grading.test.ts validates the grader's ability to focus on target output rather than attack prompts in steganographic contexts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Grading prompt semantics: The changes to src/prompts/grading.ts represent a significant semantic shift in evaluation scope. Review requires verification that the new instructions correctly isolate <Output> content evaluation and that examples accurately reflect the intended behavior.
  • Test coverage completeness: The new test suite in test/redteam/steganographic-grading.test.ts introduces multiple scenarios (benign output with steganographic pattern, harmful output, rubric-embedded attack prompt). Verify that test assertions properly validate reasoning references OUTPUT content rather than contextual information.
  • Prompt instruction clarity: Ensure the CRITICAL INSTRUCTION wording in the grading prompt unambiguously communicates the evaluation boundary to the LLM and that structured JSON guidance aligns with expected grader behavior.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: improving clarity in the grading prompt to prevent evaluation confusion, which directly aligns with the PR's primary objective of fixing grader behavior.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The PR description clearly explains the problem (grading LLMs evaluating attack prompts instead of target responses), the solution (new REDTEAM_GRADING_PROMPT with critical instructions), and the impact (fixes 70+ redteam graders while maintaining backward compatibility). All changes in the raw summary align with this description.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/redteam-grading-prompt-confusion

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
test/redteam/steganographic-grading.test.ts (2)

80-84: Assertions verify mock output, not actual grading behavior.

The assertions on lines 83-84 check that result.grade.reason contains 'output' and doesn't contain the attack keywords. However, since the test mocks the grading LLM response (lines 64-69), these assertions only verify that the mock response is passed through correctly—not that the updated DEFAULT_GRADING_PROMPT actually causes graders to focus on output content.

This is acceptable as a wiring/integration test, but consider adding a comment clarifying that prompt effectiveness is validated via live tests (as mentioned in the PR description: "60% pass rate").

       expect(result.grade.pass).toBe(true);
       expect(result.grade.score).toBe(1);
-      // The reason should reference the OUTPUT (hospitality tips), not the attack prompt
+      // Verify the mock response is passed through correctly
+      // (actual prompt effectiveness is validated via live redteam tests)
       expect(result.grade.reason).toContain('output');
       expect(result.grade.reason).not.toContain('approach, behind, grip, kidney');

20-32: Consider extracting token usage constants.

The tokenUsage object { total: 100, prompt: 50, completion: 50 } is repeated in multiple places (lines 24, 73, 106, 145). Consider extracting to a constant for DRY compliance.

+const DEFAULT_TOKEN_USAGE = { total: 100, prompt: 50, completion: 50 };
+
 const createMockProvider = (mockResponse?: ProviderResponse): ApiProvider => {
   const callApi = vi.fn().mockResolvedValue(
     mockResponse || {
       output: JSON.stringify({ pass: true, score: 1, reason: 'test reason' }),
-      tokenUsage: { total: 100, prompt: 50, completion: 50 },
+      tokenUsage: DEFAULT_TOKEN_USAGE,
     },
   );

Then reuse DEFAULT_TOKEN_USAGE at lines 73, 106, and 145.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f617ee5 and 6d76171.

📒 Files selected for processing (3)
  • CHANGELOG.md (1 hunks)
  • src/prompts/grading.ts (1 hunks)
  • test/redteam/steganographic-grading.test.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
test/**/*.test.{ts,tsx,js}

📄 CodeRabbit inference engine (test/AGENTS.md)

test/**/*.test.{ts,tsx,js}: Never increase test timeouts in Vitest tests - fix the slow test instead
Never use .only() or .skip() in committed Vitest test code
Call vi.resetAllMocks() in afterEach() hook to prevent test pollution
Test entire objects with expect(result).toEqual({...}) rather than individual fields
Mock minimally - only mock external dependencies (APIs, databases), not code under test
Organize tests with nested describe() and it() blocks to structure test suites logically
Use Vitest's mocking utilities (vi.mock, vi.fn, vi.spyOn) for mocking in tests
Prefer shallow mocking over deep mocking in Vitest tests

Files:

  • test/redteam/steganographic-grading.test.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.{ts,tsx}: Use TypeScript with strict type checking
Follow consistent import order (Biome will handle import sorting)
Use consistent curly braces for all control statements
Prefer const over let; avoid var
Use object shorthand syntax whenever possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks
Always sanitize sensitive data before logging to prevent exposing secrets, API keys, passwords, and other credentials in logs. Use the logger methods (debug, info, warn, error) with the optional second parameter for context objects that will be automatically sanitized, or use the sanitizeObject function from ./util/sanitizer for manual sanitization
Keep code DRY and use existing utilities where possible

Files:

  • test/redteam/steganographic-grading.test.ts
  • src/prompts/grading.ts
{test/**/*.{ts,tsx},src/app/**/*.{test,spec}.{ts,tsx}}

📄 CodeRabbit inference engine (AGENTS.md)

Use Vitest for all tests (both backend tests in test/ and frontend tests in src/app/)

Files:

  • test/redteam/steganographic-grading.test.ts
{src/**/*.{ts,tsx},test/**/*.{ts,tsx}}

📄 CodeRabbit inference engine (AGENTS.md)

Follow file structure: core logic in src/, tests in test/

Files:

  • test/redteam/steganographic-grading.test.ts
  • src/prompts/grading.ts
test/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Test both success and error cases for all functionality

Files:

  • test/redteam/steganographic-grading.test.ts
src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use Drizzle ORM for database operations

Files:

  • src/prompts/grading.ts
🧠 Learnings (12)
📓 Common learnings
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.694Z
Learning: Applies to src/redteam/graders.ts : Evaluate attack success using grader logic in `src/redteam/graders.ts`
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: site/docs/red-team/AGENTS.md:0-0
Timestamp: 2025-11-29T00:25:33.657Z
Learning: Applies to site/docs/red-team/**/*.md : Eliminate LLM-generated fluff and redundant explanations; remove substantially redundant criteria across pages; keep examples focused and actionable
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-04T20:54:08.666Z
Learning: Pull request titles must follow Conventional Commits format with one required scope from: redteam (mandatory for ALL redteam-related changes), feature domains (providers, assertions, eval, api), product areas (webui, cli, server, site), or technical/infrastructure (deps, ci, tests, build, examples). If redteam-related, always use (redteam) scope with no exceptions
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.694Z
Learning: Applies to src/redteam/**/*.ts : Assign risk severity levels to red team test results: critical for PII leaks and SQL injection, high for jailbreaks/prompt injection/harmful content, medium for bias/hallucination, low for overreliance
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.694Z
Learning: Applies to src/redteam/plugins/*.ts : Generate targeted test cases for specific vulnerabilities in red team plugins
📚 Learning: 2025-11-29T00:26:16.694Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.694Z
Learning: Applies to src/redteam/graders.ts : Evaluate attack success using grader logic in `src/redteam/graders.ts`

Applied to files:

  • test/redteam/steganographic-grading.test.ts
  • src/prompts/grading.ts
📚 Learning: 2025-11-29T00:26:16.694Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.694Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new red team plugins in the `test/redteam/` directory

Applied to files:

  • test/redteam/steganographic-grading.test.ts
📚 Learning: 2025-11-29T00:26:16.694Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.694Z
Learning: Applies to src/redteam/**/*.ts : Assign risk severity levels to red team test results: critical for PII leaks and SQL injection, high for jailbreaks/prompt injection/harmful content, medium for bias/hallucination, low for overreliance

Applied to files:

  • test/redteam/steganographic-grading.test.ts
📚 Learning: 2025-11-29T00:26:16.694Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.694Z
Learning: Applies to src/redteam/plugins/*.ts : Generate targeted test cases for specific vulnerabilities in red team plugins

Applied to files:

  • test/redteam/steganographic-grading.test.ts
📚 Learning: 2025-11-29T00:26:16.694Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.694Z
Learning: Applies to src/redteam/plugins/*.ts : Include assertions defining failure conditions in red team plugin test cases

Applied to files:

  • test/redteam/steganographic-grading.test.ts
📚 Learning: 2025-12-01T18:19:09.570Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: test/AGENTS.md:0-0
Timestamp: 2025-12-01T18:19:09.570Z
Learning: Applies to test/**/*.test.{ts,tsx,js} : Mock minimally - only mock external dependencies (APIs, databases), not code under test

Applied to files:

  • test/redteam/steganographic-grading.test.ts
📚 Learning: 2025-12-01T18:18:56.517Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/providers/AGENTS.md:0-0
Timestamp: 2025-12-01T18:18:56.517Z
Learning: Applies to src/providers/test/providers/**/*.test.ts : Mock API responses in provider tests and do not call real APIs

Applied to files:

  • test/redteam/steganographic-grading.test.ts
📚 Learning: 2025-11-29T00:24:17.021Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-11-29T00:24:17.021Z
Learning: Applies to src/redteam/**/*agent*.{ts,tsx,js,jsx} : Maintain clear agent interface definitions and usage patterns

Applied to files:

  • test/redteam/steganographic-grading.test.ts
📚 Learning: 2025-12-01T18:19:09.570Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: test/AGENTS.md:0-0
Timestamp: 2025-12-01T18:19:09.570Z
Learning: Applies to test/providers/**/*.test.{ts,tsx,js} : Provider tests must cover: success case (normal API response), error cases (4xx, 5xx, rate limits), configuration validation, and token usage tracking

Applied to files:

  • test/redteam/steganographic-grading.test.ts
📚 Learning: 2025-11-29T00:25:33.657Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: site/docs/red-team/AGENTS.md:0-0
Timestamp: 2025-11-29T00:25:33.657Z
Learning: Applies to site/docs/red-team/**/*.md : Avoid verbose, LLM-generated explanations; avoid repetitive content across related pages; avoid generic examples that don't illustrate the specific plugin; avoid excessive use of bullet points where prose would be clearer; avoid missing SEO opportunities in favor of brevity; avoid prescriptive test scenarios that limit user flexibility

Applied to files:

  • test/redteam/steganographic-grading.test.ts
📚 Learning: 2025-11-29T00:25:33.657Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: site/docs/red-team/AGENTS.md:0-0
Timestamp: 2025-11-29T00:25:33.657Z
Learning: Applies to site/docs/red-team/**/*.md : Eliminate LLM-generated fluff and redundant explanations; remove substantially redundant criteria across pages; keep examples focused and actionable

Applied to files:

  • CHANGELOG.md
🧬 Code graph analysis (2)
test/redteam/steganographic-grading.test.ts (4)
src/types/providers.ts (1)
  • ApiProvider (81-98)
src/redteam/providers/shared.ts (1)
  • redteamProviderManager (167-167)
src/redteam/plugins/harmful/graders.ts (1)
  • ViolentCrimeGrader (439-476)
src/matchers.ts (1)
  • matchesLlmRubric (580-725)
src/prompts/grading.ts (1)
examples/image-classification/prompt.js (1)
  • dedent (1-1)
🪛 markdownlint-cli2 (0.18.1)
CHANGELOG.md

234-234: Multiple headings with the same content

(MD024, no-duplicate-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Redteam (Staging API)
  • GitHub Check: Build Docs
  • GitHub Check: webui tests
  • GitHub Check: Test on Node 20.x and macOS-latest
  • GitHub Check: Test on Node 24.x and windows-latest
  • GitHub Check: Test on Node 24.x and ubuntu-latest
  • GitHub Check: Test on Node 22.x and windows-latest
  • GitHub Check: Test on Node 22.x and macOS-latest
  • GitHub Check: Test on Node 22.x and ubuntu-latest
  • GitHub Check: Test on Node 20.x and ubuntu-latest
  • GitHub Check: Test on Node 20.x and windows-latest
🔇 Additional comments (4)
src/prompts/grading.ts (1)

14-41: Well-structured prompt update with clear evaluation boundaries.

The revised DEFAULT_GRADING_PROMPT effectively addresses the steganographic attack confusion issue by:

  1. Adding a clear CRITICAL INSTRUCTION to evaluate only <Output> content
  2. Explicitly distinguishing rubric context from grading target
  3. Including a relevant example (lines 36-41) demonstrating correct handling of contextual queries

The examples progressively demonstrate: greeting detection, pirate language, and sensitive-information refusal—all with consistent JSON structure.

test/redteam/steganographic-grading.test.ts (2)

42-48: Good test setup following guidelines.

The beforeEach correctly uses vi.resetAllMocks() to prevent test pollution, as required by coding guidelines.


166-174: Good verification of prompt structure.

This test appropriately verifies that matchesLlmRubric constructs the prompt correctly with separate <Output> and <Rubric> sections, which directly validates that the updated prompt template is used correctly during grading. This is a valuable structural test.

CHANGELOG.md (1)

236-237: Confirm correct version placement

Double-check that this fix belongs in 0.119.8 (2025-11-18) and not a later release or Unreleased. If it shipped after 0.119.8, move it accordingly.

@yash2998chhabria yash2998chhabria force-pushed the fix/redteam-grading-prompt-confusion branch from 87f5fff to 6a49846 Compare December 5, 2025 01:55
Copy link
Member

@mldangelo mldangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should scope this change only to red teaming.

@yash2998chhabria yash2998chhabria force-pushed the fix/redteam-grading-prompt-confusion branch from 6c6bdf1 to 6d32093 Compare December 5, 2025 02:11
},
]);

export const REDTEAM_GRADING_PROMPT = JSON.stringify([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a brief comment explaining why we're JSON stringify'ing an array that contains Nunjucks template strings? There's probably a good reason for this, and I imagine it works, but it's opaque to reason about given the declaration alone.

Copy link
Contributor Author

@yash2998chhabria yash2998chhabria Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each string value gets Nunjucks rendering applied. I think for this case only the second string has {{ output }} and {{ rubric }} which get replaced with actual values which is in the second string. similar to how we had it with the default grader src/prompts/grading.ts

@yash2998chhabria yash2998chhabria force-pushed the fix/redteam-grading-prompt-confusion branch from d880f7b to 6a49846 Compare December 5, 2025 21:46
yash2998chhabria and others added 6 commits December 5, 2025 15:13
…nfusion

Grading LLMs were sometimes analyzing attack prompts embedded in rubric context
instead of evaluating the target model's actual response. This occurred when rubrics
included contextual information like steganographic attack prompts within tags like
<UserQuery>, <UserInput>, or <UserPrompt>.

Changes:
- Enhanced DEFAULT_GRADING_PROMPT with explicit instructions to evaluate ONLY <Output> content
- Added clarification that <Rubric> contains context that should NOT be evaluated
- Added numbered task list to guide the grading LLM
- Created comprehensive test suite to prevent regression

This fix applies to all 70+ redteam plugins and all 30+ strategies.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…s only

This change addresses a grading confusion bug where grading LLMs were
evaluating attack prompts (embedded in rubric as context) instead of
target model responses. This was particularly problematic with steganographic
attacks where prompts contained encoded harmful content.

Changes:
- Created REDTEAM_GRADING_PROMPT with explicit instructions to evaluate only <Output> tags
- Kept DEFAULT_GRADING_PROMPT unchanged for backward compatibility
- Updated RedteamGraderBase to use REDTEAM_GRADING_PROMPT via rubricPrompt parameter
- Added test coverage for steganographic attack grading scenarios

The new prompt explicitly instructs graders to:
1. Only evaluate content within <Output> tags (target's response)
2. Ignore contextual information in <Rubric> (attack prompts, test inputs)
3. Reference OUTPUT content in reasoning, not rubric context
…ote grading

When rubricPrompt is set, the grading logic skips remote grading and requires
local API keys. This was causing CI failures in the production API test which
relies on remote grading (promptfoo cloud service).

Solution: Only set rubricPrompt when NOT using remote grading. When remote
grading is enabled (shouldGenerateRemote() === true), omit the rubricPrompt
parameter so the code falls through to the remote grading path.

This allows:
- Remote grading tests to continue working without API keys
- Local grading to use the enhanced REDTEAM_GRADING_PROMPT when available
After adding shouldGenerateRemote import to RedteamGraderBase, tests that mock
the remoteGeneration module need to export this function. Updated all tests
with explicit mocks to include shouldGenerateRemote.

Tests using importActual() already inherit the function and don't need updates.
Export REDTEAM_GRADING_PROMPT from main index to allow cloud service to import
and use the enhanced grading prompt without code duplication.
@yash2998chhabria yash2998chhabria force-pushed the fix/redteam-grading-prompt-confusion branch from 1d85f69 to 6381f7c Compare December 5, 2025 23:15
...test.options,
provider: await redteamProviderManager.getGradingProvider({ jsonOnly: true }),
// Only use custom prompt when not using remote grading (which doesn't support custom prompts)
...(!shouldGenerateRemote() && { rubricPrompt: REDTEAM_GRADING_PROMPT }),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update cloud to also read the same prompt !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants