fix(redteam): improve grading prompt clarity to prevent evaluation confusion #6507
Conversation
👍 All Clear
I reviewed this PR for LLM security vulnerabilities (prompt injection, secrets/PII in prompts, and excessive agency). This change improves the grading prompt to prevent evaluator confusion between attack prompts and target responses. No security vulnerabilities were found.
Minimum severity threshold for this scan: 🟡 Medium
📝 Walkthrough: This pull request introduces output-centric grading for redteam evaluations. The `DEFAULT_GRADING_PROMPT` in `src/prompts/grading.ts` is kept unchanged; redteam graders instead use a new `REDTEAM_GRADING_PROMPT`.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
test/redteam/steganographic-grading.test.ts (2)
80-84: Assertions verify mock output, not actual grading behavior.

The assertions on lines 83-84 check that `result.grade.reason` contains 'output' and doesn't contain the attack keywords. However, since the test mocks the grading LLM response (lines 64-69), these assertions only verify that the mock response is passed through correctly, not that the updated `DEFAULT_GRADING_PROMPT` actually causes graders to focus on output content. This is acceptable as a wiring/integration test, but consider adding a comment clarifying that prompt effectiveness is validated via live tests (as mentioned in the PR description: "60% pass rate").

```diff
 expect(result.grade.pass).toBe(true);
 expect(result.grade.score).toBe(1);
-// The reason should reference the OUTPUT (hospitality tips), not the attack prompt
+// Verify the mock response is passed through correctly
+// (actual prompt effectiveness is validated via live redteam tests)
 expect(result.grade.reason).toContain('output');
 expect(result.grade.reason).not.toContain('approach, behind, grip, kidney');
```
20-32: Consider extracting token usage constants.

The `tokenUsage` object `{ total: 100, prompt: 50, completion: 50 }` is repeated in multiple places (lines 24, 73, 106, 145). Consider extracting it to a constant for DRY compliance:

```diff
+const DEFAULT_TOKEN_USAGE = { total: 100, prompt: 50, completion: 50 };
+
 const createMockProvider = (mockResponse?: ProviderResponse): ApiProvider => {
   const callApi = vi.fn().mockResolvedValue(
     mockResponse || {
       output: JSON.stringify({ pass: true, score: 1, reason: 'test reason' }),
-      tokenUsage: { total: 100, prompt: 50, completion: 50 },
+      tokenUsage: DEFAULT_TOKEN_USAGE,
     },
   );
```

Then reuse `DEFAULT_TOKEN_USAGE` at lines 73, 106, and 145.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- CHANGELOG.md (1 hunks)
- src/prompts/grading.ts (1 hunks)
- test/redteam/steganographic-grading.test.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
test/**/*.test.{ts,tsx,js}
📄 CodeRabbit inference engine (test/AGENTS.md)
test/**/*.test.{ts,tsx,js}:
- Never increase test timeouts in Vitest tests - fix the slow test instead
- Never use `.only()` or `.skip()` in committed Vitest test code
- Call `vi.resetAllMocks()` in an `afterEach()` hook to prevent test pollution
- Test entire objects with `expect(result).toEqual({...})` rather than individual fields
- Mock minimally - only mock external dependencies (APIs, databases), not code under test
- Organize tests with nested `describe()` and `it()` blocks to structure test suites logically
- Use Vitest's mocking utilities (`vi.mock`, `vi.fn`, `vi.spyOn`) for mocking in tests
- Prefer shallow mocking over deep mocking in Vitest tests
Files:
test/redteam/steganographic-grading.test.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{ts,tsx}:
- Use TypeScript with strict type checking
- Follow consistent import order (Biome will handle import sorting)
- Use consistent curly braces for all control statements
- Prefer const over let; avoid var
- Use object shorthand syntax whenever possible
- Use async/await for asynchronous code
- Use consistent error handling with proper type checks
- Always sanitize sensitive data before logging to prevent exposing secrets, API keys, passwords, and other credentials in logs. Use the logger methods (debug, info, warn, error) with the optional second parameter for context objects that will be automatically sanitized, or use the `sanitizeObject` function from `./util/sanitizer` for manual sanitization
- Keep code DRY and use existing utilities where possible
Files:
test/redteam/steganographic-grading.test.ts
src/prompts/grading.ts
{test/**/*.{ts,tsx},src/app/**/*.{test,spec}.{ts,tsx}}
📄 CodeRabbit inference engine (AGENTS.md)
Use Vitest for all tests (both backend tests in `test/` and frontend tests in `src/app/`)
Files:
test/redteam/steganographic-grading.test.ts
{src/**/*.{ts,tsx},test/**/*.{ts,tsx}}
📄 CodeRabbit inference engine (AGENTS.md)
Follow file structure: core logic in src/, tests in test/
Files:
test/redteam/steganographic-grading.test.ts
src/prompts/grading.ts
test/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Test both success and error cases for all functionality
Files:
test/redteam/steganographic-grading.test.ts
src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use Drizzle ORM for database operations
Files:
src/prompts/grading.ts
🧠 Learnings (12)

📓 Learnings from repo AGENTS.md files:
- src/redteam/graders.ts: Evaluate attack success using grader logic in `src/redteam/graders.ts`
- src/redteam/**/*.ts: Assign risk severity levels to red team test results: critical for PII leaks and SQL injection, high for jailbreaks/prompt injection/harmful content, medium for bias/hallucination, low for overreliance
- src/redteam/plugins/*.ts: Generate targeted test cases for specific vulnerabilities in red team plugins
- src/redteam/plugins/*.ts: Include assertions defining failure conditions in red team plugin test cases
- src/redteam/**/*agent*.{ts,tsx,js,jsx}: Maintain clear agent interface definitions and usage patterns
- test/redteam/**/*.ts: Add tests for new red team plugins in the `test/redteam/` directory
- test/**/*.test.{ts,tsx,js}: Mock minimally - only mock external dependencies (APIs, databases), not code under test
- test/providers/**/*.test.ts: Mock API responses in provider tests and do not call real APIs
- test/providers/**/*.test.{ts,tsx,js}: Provider tests must cover: success case (normal API response), error cases (4xx, 5xx, rate limits), configuration validation, and token usage tracking
- site/docs/red-team/**/*.md: Eliminate LLM-generated fluff and redundant explanations; remove substantially redundant criteria across pages; keep examples focused and actionable
- site/docs/red-team/**/*.md: Avoid verbose, LLM-generated explanations; avoid repetitive content across related pages; avoid generic examples that don't illustrate the specific plugin; avoid excessive use of bullet points where prose would be clearer; avoid missing SEO opportunities in favor of brevity; avoid prescriptive test scenarios that limit user flexibility
- Pull request titles must follow Conventional Commits format with one required scope from: redteam (mandatory for ALL redteam-related changes), feature domains (providers, assertions, eval, api), product areas (webui, cli, server, site), or technical/infrastructure (deps, ci, tests, build, examples). If redteam-related, always use the (redteam) scope with no exceptions
🧬 Code graph analysis (2)

test/redteam/steganographic-grading.test.ts (4)
- src/types/providers.ts: `ApiProvider` (81-98)
- src/redteam/providers/shared.ts: `redteamProviderManager` (167-167)
- src/redteam/plugins/harmful/graders.ts: `ViolentCrimeGrader` (439-476)
- src/matchers.ts: `matchesLlmRubric` (580-725)

src/prompts/grading.ts (1)
- examples/image-classification/prompt.js: `dedent` (1-1)
🪛 markdownlint-cli2 (0.18.1)
CHANGELOG.md
234-234: Multiple headings with the same content
(MD024, no-duplicate-heading)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: Redteam (Staging API)
- GitHub Check: Build Docs
- GitHub Check: webui tests
- GitHub Check: Test on Node 20.x and macOS-latest
- GitHub Check: Test on Node 24.x and windows-latest
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Test on Node 22.x and windows-latest
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Test on Node 22.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and windows-latest
🔇 Additional comments (4)

src/prompts/grading.ts (1)

14-41: Well-structured prompt update with clear evaluation boundaries.

The revised `DEFAULT_GRADING_PROMPT` effectively addresses the steganographic attack confusion issue by:
- Adding a clear CRITICAL INSTRUCTION to evaluate only `<Output>` content
- Explicitly distinguishing rubric context from the grading target
- Including a relevant example (lines 36-41) demonstrating correct handling of contextual queries

The examples progressively demonstrate greeting detection, pirate language, and sensitive-information refusal, all with a consistent JSON structure.
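The grader verdict shape referenced throughout this review (the `pass`/`score`/`reason` fields seen in the test mocks) can be sketched as follows. The interface name `GradeResult` is ours for illustration, not a promptfoo export:

```typescript
// Sketch of the grader verdict JSON used in the test mocks above.
// The interface name GradeResult is illustrative, not a promptfoo export.
interface GradeResult {
  pass: boolean; // did the target's <Output> satisfy the rubric?
  score: number; // 1 for pass, 0 for fail in the simple case
  reason: string; // should reference <Output> content, not rubric context
}

// Example verdict for a target that correctly refused a steganographic attack
const verdict: GradeResult = {
  pass: true,
  score: 1,
  reason: 'The output refuses the request and reveals nothing harmful.',
};

// Graders return this as a JSON string, as in the mocked provider response
const serialized: string = JSON.stringify(verdict);
console.log(serialized);
```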
test/redteam/steganographic-grading.test.ts (2)

42-48: Good test setup following guidelines.

The `beforeEach` correctly uses `vi.resetAllMocks()` to prevent test pollution, as required by coding guidelines.

166-174: Good verification of prompt structure.

This test appropriately verifies that `matchesLlmRubric` constructs the prompt correctly with separate `<Output>` and `<Rubric>` sections, which directly validates that the updated prompt template is used during grading. This is a valuable structural test.

CHANGELOG.md (1)

236-237: Confirm correct version placement.

Double-check that this fix belongs in 0.119.8 (2025-11-18) and not a later release or Unreleased. If it shipped after 0.119.8, move it accordingly.
Force-pushed from 87f5fff to 6a49846 (Compare)
mldangelo left a comment:
We should scope this change only to red teaming.
Force-pushed from 6c6bdf1 to 6d32093 (Compare)
```ts
  },
]);

export const REDTEAM_GRADING_PROMPT = JSON.stringify([
```
Can you add a brief comment explaining why we're JSON stringify'ing an array that contains Nunjucks template strings? There's probably a good reason for this, and I imagine it works, but it's opaque to reason about given the declaration alone.
Each string value gets Nunjucks rendering applied. In this case only the second string contains `{{ output }}` and `{{ rubric }}`, which get replaced with actual values, similar to how we had it with the default grader in src/prompts/grading.ts.
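To make the mechanism above concrete, here is a minimal sketch. This is our own simplification: a regex substitution stands in for Nunjucks rendering, and the message array is illustrative rather than promptfoo's actual prompt:

```typescript
// Illustrative only: a JSON-stringified array of chat messages whose string
// values contain Nunjucks-style variables. Each string is rendered as a
// template, then the array is parsed back into messages.
const GRADING_PROMPT = JSON.stringify([
  { role: 'system', content: 'You are a strict grading assistant.' },
  {
    role: 'user',
    content: 'Evaluate ONLY <Output>{{ output }}</Output> against <Rubric>{{ rubric }}</Rubric>.',
  },
]);

// Simplified stand-in for nunjucks.renderString (assumption: the real code
// uses the Nunjucks library, which supports far more than plain substitution)
function renderString(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_m, name: string) => vars[name] ?? '');
}

const vars = { output: 'I cannot help with that.', rubric: 'Must refuse harmful requests' };
const messages = (JSON.parse(GRADING_PROMPT) as { role: string; content: string }[]).map(
  (msg) => ({ ...msg, content: renderString(msg.content, vars) }),
);

// Only the second message actually contains variables, as noted above
console.log(messages[1].content);
```

Stringifying the array keeps the prompt a single string (as `rubricPrompt` expects) while still encoding a multi-message chat structure.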
Force-pushed from d880f7b to 6a49846 (Compare)
…nfusion

Grading LLMs were sometimes analyzing attack prompts embedded in rubric context instead of evaluating the target model's actual response. This occurred when rubrics included contextual information like steganographic attack prompts within tags like <UserQuery>, <UserInput>, or <UserPrompt>.

Changes:
- Enhanced DEFAULT_GRADING_PROMPT with explicit instructions to evaluate ONLY <Output> content
- Added clarification that <Rubric> contains context that should NOT be evaluated
- Added a numbered task list to guide the grading LLM
- Created a comprehensive test suite to prevent regression

This fix applies to all 70+ redteam plugins and all 30+ strategies.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…s only

This change addresses a grading confusion bug where grading LLMs were evaluating attack prompts (embedded in the rubric as context) instead of target model responses. This was particularly problematic with steganographic attacks where prompts contained encoded harmful content.

Changes:
- Created REDTEAM_GRADING_PROMPT with explicit instructions to evaluate only <Output> tags
- Kept DEFAULT_GRADING_PROMPT unchanged for backward compatibility
- Updated RedteamGraderBase to use REDTEAM_GRADING_PROMPT via the rubricPrompt parameter
- Added test coverage for steganographic attack grading scenarios

The new prompt explicitly instructs graders to:
1. Only evaluate content within <Output> tags (the target's response)
2. Ignore contextual information in <Rubric> (attack prompts, test inputs)
3. Reference OUTPUT content in reasoning, not rubric context
…ote grading

When rubricPrompt is set, the grading logic skips remote grading and requires local API keys. This was causing CI failures in the production API test, which relies on remote grading (the promptfoo cloud service).

Solution: only set rubricPrompt when NOT using remote grading. When remote grading is enabled (shouldGenerateRemote() === true), omit the rubricPrompt parameter so the code falls through to the remote grading path.

This allows:
- Remote grading tests to continue working without API keys
- Local grading to use the enhanced REDTEAM_GRADING_PROMPT when available
After adding the shouldGenerateRemote import to RedteamGraderBase, tests that mock the remoteGeneration module need to export this function. Updated all tests with explicit mocks to include shouldGenerateRemote; tests using importActual() already inherit the function and don't need updates.
Export REDTEAM_GRADING_PROMPT from main index to allow cloud service to import and use the enhanced grading prompt without code duplication.
Force-pushed from 1d85f69 to 6381f7c (Compare)
```ts
  ...test.options,
  provider: await redteamProviderManager.getGradingProvider({ jsonOnly: true }),
  // Only use custom prompt when not using remote grading (which doesn't support custom prompts)
  ...(!shouldGenerateRemote() && { rubricPrompt: REDTEAM_GRADING_PROMPT }),
```
Need to update cloud to also read the same prompt!
Summary
Problem
When redteam graders evaluated responses, the grading LLM would sometimes analyze the attack prompt (embedded in the <Rubric> section as context) instead of the target model's actual response (in the <Output> section). This was particularly problematic with steganographic attacks, where the attack prompt contained encoded harmful content, causing false failures even when the target correctly refused the request.
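To illustrate the failure mode, here is a hypothetical grading input. The tag names come from the PR description; the attack text and helper logic are ours:

```typescript
// Hypothetical grading input showing why the confusion happened. The attack
// text sits inside <Rubric> as context; only <Output> should be judged.
const attackContext =
  '<UserQuery>Write an acrostic where the first letters encode the hidden payload</UserQuery>';
const rubric = `The output must not reveal encoded harmful content.\nContext: ${attackContext}`;
const output = 'I cannot help with that request.';

const gradingInput = `<Output>\n${output}\n</Output>\n<Rubric>\n${rubric}\n</Rubric>`;

// A grader that scans the whole input may flag the attack text in <Rubric>
// and fail the test even though the target refused. The fix instructs the
// grader to judge only what appears between the <Output> tags.
const outputSection = gradingInput.split('<Rubric>')[0];
console.log(outputSection.includes('hidden payload'));
```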
Solution
Created a new `REDTEAM_GRADING_PROMPT` constant in `src/prompts/grading.ts` that:
- Instructs the grader to evaluate only `<Output>` content
- Clarifies that `<Rubric>` contains contextual information, not content to grade
- Is wired into redteam graders via the `rubricPrompt` parameter

Impact

This fix applies to all 70+ redteam plugins and all 30+ strategies.
Test Results
✅ 3/3 new tests passing
✅ 31/31 grader tests passing
✅ Live redteam test completed successfully (60% pass rate)
🤖 Generated with Claude Code