fix(redteam): improve grading prompt clarity to prevent evaluation confusion #6507

yash2998chhabria · 2025-12-04T22:47:03Z

Summary

Fixed grading LLMs incorrectly evaluating attack prompts instead of target responses
Created dedicated REDTEAM_GRADING_PROMPT for redteam graders only
Added test coverage for steganographic attack scenarios

Problem

When redteam graders evaluated responses, the grading LLM would sometimes analyze the attack prompt (embedded in the <Rubric> section as context) instead of the target model's actual response (in the <Output> section).

This was particularly problematic with steganographic attacks where the attack prompt contained encoded harmful content, causing false failures even when the target correctly refused the request.

Solution

Created a new REDTEAM_GRADING_PROMPT constant in src/prompts/grading.ts:

Added CRITICAL INSTRUCTION to evaluate ONLY the <Output> content
Clarified that <Rubric> contains contextual information (not content to grade)
Required the "reason" field to reference OUTPUT content
Updated RedteamGraderBase to use this dedicated prompt via rubricPrompt parameter

Impact

Fixes all 70+ redteam graders that use matchesLlmRubric()
DEFAULT_GRADING_PROMPT unchanged - no impact on non-redteam grading
Fully backward compatible - isolated to redteam graders only
Test coverage added to prevent regression

Test Results

✅ 3/3 new tests passing
✅ 31/31 grader tests passing
✅ Live redteam test completed successfully (60% pass rate)

🤖 Generated with Claude Code

promptfoo-scanner

👍 All Clear

I reviewed this PR for LLM security vulnerabilities (prompt injection, secrets/PII in prompts, and excessive agency). This change improves the grading prompt to prevent evaluator confusion between attack prompts and target responses. No security vulnerabilities were found.

_{Minimum severity threshold for this scan: 🟡 Medium | Learn more}

coderabbitai · 2025-12-04T23:08:39Z

📝 Walkthrough

Walkthrough

This pull request introduces output-centric grading for redteam evaluations. The DEFAULT_GRADING_PROMPT in src/prompts/grading.ts is expanded with explicit instructions to evaluate only content within <Output> tags while using <Rubric> for criteria reference, preventing false failures from complex or steganographic prompts. Supporting documentation is added to CHANGELOG.md. A new test suite test/redteam/steganographic-grading.test.ts validates the grader's ability to focus on target output rather than attack prompts in steganographic contexts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Grading prompt semantics: The changes to src/prompts/grading.ts represent a significant semantic shift in evaluation scope. Review requires verification that the new instructions correctly isolate <Output> content evaluation and that examples accurately reflect the intended behavior.
Test coverage completeness: The new test suite in test/redteam/steganographic-grading.test.ts introduces multiple scenarios (benign output with steganographic pattern, harmful output, rubric-embedded attack prompt). Verify that test assertions properly validate reasoning references OUTPUT content rather than contextual information.
Prompt instruction clarity: Ensure the CRITICAL INSTRUCTION wording in the grading prompt unambiguously communicates the evaluation boundary to the LLM and that structured JSON guidance aligns with expected grader behavior.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: improving clarity in the grading prompt to prevent evaluation confusion, which directly aligns with the PR's primary objective of fixing grader behavior.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check	✅ Passed	The PR description clearly explains the problem (grading LLMs evaluating attack prompts instead of target responses), the solution (new REDTEAM_GRADING_PROMPT with critical instructions), and the impact (fixes 70+ redteam graders while maintaining backward compatibility). All changes in the raw summary align with this description.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix/redteam-grading-prompt-confusion

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

test/redteam/steganographic-grading.test.ts (2)
80-84: Assertions verify mock output, not actual grading behavior.

The assertions on lines 83-84 check that result.grade.reason contains 'output' and doesn't contain the attack keywords. However, since the test mocks the grading LLM response (lines 64-69), these assertions only verify that the mock response is passed through correctly—not that the updated DEFAULT_GRADING_PROMPT actually causes graders to focus on output content.

This is acceptable as a wiring/integration test, but consider adding a comment clarifying that prompt effectiveness is validated via live tests (as mentioned in the PR description: "60% pass rate").
       expect(result.grade.pass).toBe(true);
       expect(result.grade.score).toBe(1);
-      // The reason should reference the OUTPUT (hospitality tips), not the attack prompt
+      // Verify the mock response is passed through correctly
+      // (actual prompt effectiveness is validated via live redteam tests)
       expect(result.grade.reason).toContain('output');
       expect(result.grade.reason).not.toContain('approach, behind, grip, kidney');
20-32: Consider extracting token usage constants.

The tokenUsage object { total: 100, prompt: 50, completion: 50 } is repeated in multiple places (lines 24, 73, 106, 145). Consider extracting to a constant for DRY compliance.
+const DEFAULT_TOKEN_USAGE = { total: 100, prompt: 50, completion: 50 };
+
 const createMockProvider = (mockResponse?: ProviderResponse): ApiProvider => {
   const callApi = vi.fn().mockResolvedValue(
     mockResponse || {
       output: JSON.stringify({ pass: true, score: 1, reason: 'test reason' }),
-      tokenUsage: { total: 100, prompt: 50, completion: 50 },
+      tokenUsage: DEFAULT_TOKEN_USAGE,
     },
   );
Then reuse DEFAULT_TOKEN_USAGE at lines 73, 106, and 145.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f617ee5 and 6d76171.

📒 Files selected for processing (3)

CHANGELOG.md (1 hunks)
src/prompts/grading.ts (1 hunks)
test/redteam/steganographic-grading.test.ts (1 hunks)

🧰 Additional context used

📓 Path-based instructions (6)

test/**/*.test.{ts,tsx,js}

📄 CodeRabbit inference engine (test/AGENTS.md)

test/**/*.test.{ts,tsx,js}: Never increase test timeouts in Vitest tests - fix the slow test instead
Never use .only() or .skip() in committed Vitest test code
Call vi.resetAllMocks() in afterEach() hook to prevent test pollution
Test entire objects with expect(result).toEqual({...}) rather than individual fields
Mock minimally - only mock external dependencies (APIs, databases), not code under test
Organize tests with nested describe() and it() blocks to structure test suites logically
Use Vitest's mocking utilities (vi.mock, vi.fn, vi.spyOn) for mocking in tests
Prefer shallow mocking over deep mocking in Vitest tests

Files:

test/redteam/steganographic-grading.test.ts

**/*.{ts,tsx}