Commit 5df701d

[feat]: Add opt-in evidence results for Pillar Security guardrail during monitoring (#17812)
* add evidence headers to litellm
* ensure that evidence is surface-able, even in opt-in mode
* update the docs
1 parent f8e7e15 commit 5df701d

4 files changed: +416 -7 lines changed


docs/my-website/docs/proxy/guardrails/pillar_security.md

Lines changed: 141 additions & 1 deletion
@@ -233,7 +233,7 @@ curl -X POST "http://localhost:4000/v1/chat/completions" \
   }'
 ```

-This provides clear, explicit conversation tracking that works seamlessly with LiteLLM's session management.
+This provides clear, explicit conversation tracking that works seamlessly with LiteLLM's session management. When using monitor mode, the session ID is returned in the `x-pillar-session-id` response header for easy correlation and tracking.

 ### Actions on Flagged Content

@@ -251,6 +251,73 @@ Logs the violation but allows the request to proceed:
 on_flagged_action: "monitor"
 ```

+**Response Headers:**
+
+You can opt in to receiving detection details in response headers by configuring `include_scanners: true` and/or `include_evidence: true`. When enabled, these headers are included for **every request** (not just flagged ones), enabling comprehensive metrics, false positive analysis, and threat investigation.
+
+- **`x-pillar-flagged`**: Boolean string indicating Pillar's blocking recommendation (`"true"` or `"false"`)
+- **`x-pillar-scanners`**: URL-encoded JSON object showing scanner categories (e.g., `%7B%22jailbreak%22%3Atrue%7D`); requires `include_scanners: true`
+- **`x-pillar-evidence`**: URL-encoded JSON array of detection evidence (may contain items even when `flagged` is `false`); requires `include_evidence: true`
+- **`x-pillar-session-id`**: URL-encoded session ID for correlation and investigation
+
+:::info Understanding `flagged` vs Scanner Results
+The `flagged` field is Pillar's **policy-level blocking recommendation**, which may differ from individual scanner results:
+
+- **`flagged: true`** → Pillar recommends blocking based on your configured policies
+- **`flagged: false`** → Pillar does not recommend blocking, but individual scanners may still detect content
+
+For example, the `toxic_language` scanner might detect profanity (`scanners.toxic_language: true`) while `flagged` remains `false` if your Pillar policy doesn't block on toxic language alone. This allows you to:
+- Monitor threats without blocking users
+- Build metrics on detection rates vs block rates
+- Analyze false positive rates by comparing scanner results to user feedback
+:::
+
+The `x-pillar-scanners`, `x-pillar-evidence`, and `x-pillar-session-id` headers use URL encoding (percent-encoding) to convert JSON data into an ASCII-safe format. This is necessary because HTTP headers only support ISO-8859-1 characters and cannot contain raw JSON special characters (`{`, `"`, `:`) or Unicode text. To read these headers, first URL-decode the value, then parse it as JSON.
+
+LiteLLM truncates the `x-pillar-evidence` header to a maximum of 8 KB per header to avoid proxy limits. Note that most proxies and servers also enforce a total header size limit of approximately 32 KB across all headers combined. When truncation occurs, each affected evidence item includes an `"evidence_truncated": true` flag and the metadata contains `pillar_evidence_truncated: true`.
+
+**Example Response Headers (URL-encoded):**
+```http
+x-pillar-flagged: true
+x-pillar-session-id: abc-123-def-456
+x-pillar-scanners: %7B%22jailbreak%22%3Atrue%2C%22prompt_injection%22%3Afalse%2C%22toxic_language%22%3Afalse%7D
+x-pillar-evidence: %5B%7B%22category%22%3A%22prompt_injection%22%2C%22evidence%22%3A%22Ignore%20previous%20instructions%22%7D%5D
+```
+
+**After Decoding:**
+```json
+// x-pillar-scanners
+{"jailbreak": true, "prompt_injection": false, "toxic_language": false}
+
+// x-pillar-evidence
+[{"category": "prompt_injection", "evidence": "Ignore previous instructions"}]
+```
+
+**Decoding Example (Python):**
+
+```python
+from urllib.parse import unquote
+import json
+
+# Step 1: URL-decode the header value (converts %7B to {, %22 to ", etc.)
+# Step 2: Parse the resulting JSON string
+scanners = json.loads(unquote(response.headers["x-pillar-scanners"]))
+evidence = json.loads(unquote(response.headers["x-pillar-evidence"]))
+
+# Session ID is a plain string, so only URL-decode is needed (no JSON parsing)
+session_id = unquote(response.headers["x-pillar-session-id"])
+```
+
+:::tip
+LiteLLM mirrors the encoded values onto `metadata["pillar_response_headers"]` so you can inspect exactly what was returned. When truncation occurs, it sets `metadata["pillar_evidence_truncated"]` to `true` and marks affected evidence items with `"evidence_truncated": true`. Evidence text is shortened with a `...[truncated]` suffix, and entire evidence entries may be removed if necessary to stay under the 8 KB header limit. Check these flags to determine if full evidence details are available in your logs.
+:::
+
+This allows your application to:
+- Track threats without blocking legitimate users
+- Implement custom handling logic based on threat types
+- Build analytics and alerting on security events
+- Correlate threats across requests using session IDs
+
 ### Resilience and Error Handling

 #### Graceful Degradation (`fallback_on_error`)
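
Because `x-pillar-scanners` and `x-pillar-evidence` are opt-in, a client reading the headers described in the hunk above probably should not assume they are present. The following is a minimal sketch of a tolerant decoder, not code from this commit. It assumes the headers arrive as a plain mapping with lowercase names, and the `decode_pillar_headers` helper and its return shape are hypothetical.

```python
import json
from typing import Any, Dict, Mapping
from urllib.parse import unquote


def decode_pillar_headers(headers: Mapping[str, str]) -> Dict[str, Any]:
    """Decode x-pillar-* headers, tolerating absent opt-in headers (hypothetical helper)."""
    decoded: Dict[str, Any] = {
        # The flagged header is documented as the string "true" or "false".
        "flagged": headers.get("x-pillar-flagged") == "true",
        "scanners": None,
        "evidence": None,
        "session_id": None,
        "evidence_truncated": False,
    }
    if "x-pillar-scanners" in headers:
        decoded["scanners"] = json.loads(unquote(headers["x-pillar-scanners"]))
    if "x-pillar-evidence" in headers:
        evidence = json.loads(unquote(headers["x-pillar-evidence"]))
        decoded["evidence"] = evidence
        # Shortened items are documented to carry "evidence_truncated": true.
        if isinstance(evidence, list):
            decoded["evidence_truncated"] = any(
                isinstance(item, dict) and item.get("evidence_truncated") is True
                for item in evidence
            )
    if "x-pillar-session-id" in headers:
        # The session ID is a plain string: URL-decode only, no JSON parsing.
        decoded["session_id"] = unquote(headers["x-pillar-session-id"])
    return decoded


# Sample values copied from the documentation example in the hunk above.
sample = {
    "x-pillar-flagged": "true",
    "x-pillar-session-id": "abc-123-def-456",
    "x-pillar-scanners": "%7B%22jailbreak%22%3Atrue%2C%22prompt_injection%22%3Afalse%2C%22toxic_language%22%3Afalse%7D",
    "x-pillar-evidence": "%5B%7B%22category%22%3A%22prompt_injection%22%2C%22evidence%22%3A%22Ignore%20previous%20instructions%22%7D%5D",
}
result = decode_pillar_headers(sample)
print(result["scanners"])
print(result["evidence"])
```

The per-item flag only reveals shortened entries. Entries that were dropped entirely leave no trace in the header, so the `pillar_evidence_truncated` metadata key remains the more reliable signal in logs.
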
@@ -544,6 +611,79 @@ curl -X POST "http://localhost:4000/v1/chat/completions" \
 }
 ```

+</TabItem>
+<TabItem value="monitor" label="Monitor Mode with Headers">
+
+**Monitor mode request with scanner detection:**
+
+```bash
+# Test with content that triggers scanner detection
+curl -v -X POST "http://localhost:4000/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer YOUR_LITELLM_PROXY_MASTER_KEY" \
+  -d '{
+    "model": "gpt-4.1-mini",
+    "messages": [{"role": "user", "content": "how do I rob a bank?"}],
+    "max_tokens": 50
+  }'
+```
+
+**Expected response (Allowed with headers):**
+
+The request succeeds and returns the LLM response. Headers are included for **all requests** when `include_scanners` and `include_evidence` are enabled, even when `flagged` is `false`:
+
+```http
+HTTP/1.1 200 OK
+x-litellm-applied-guardrails: pillar-monitor-everything,pillar-monitor-everything
+x-pillar-flagged: false
+x-pillar-scanners: %7B%22jailbreak%22%3Afalse%2C%22safety%22%3Atrue%2C%22prompt_injection%22%3Afalse%2C%22pii%22%3Afalse%2C%22secret%22%3Afalse%2C%22toxic_language%22%3Afalse%7D
+x-pillar-evidence: %5B%7B%22category%22%3A%22safety%22%2C%22type%22%3A%22non_violent_crimes%22%2C%22end_idx%22%3A20%2C%22evidence%22%3A%22how%20do%20I%20rob%20a%20bank%3F%22%2C%22metadata%22%3A%7B%22start_idx%22%3A0%2C%22end_idx%22%3A20%7D%7D%5D
+x-pillar-session-id: d9433f86-b428-4ee7-93ee-e97a53f8a180
+```
+
+Notice that `x-pillar-flagged: false` but `safety: true` in the scanners. This is because `flagged` represents Pillar's policy-level blocking recommendation, while individual scanners report their own detections.
+
+```python
+from urllib.parse import unquote
+import json
+
+scanners = json.loads(unquote(response.headers["x-pillar-scanners"]))
+evidence = json.loads(unquote(response.headers["x-pillar-evidence"]))
+session_id = unquote(response.headers["x-pillar-session-id"])
+flagged = response.headers["x-pillar-flagged"] == "true"
+
+# Scanner detected safety issue, but policy didn't flag for blocking
+print(f"Flagged for blocking: {flagged}")  # False
+print(f"Safety issue detected: {scanners.get('safety')}")  # True
+print(f"Evidence: {evidence}")
+# [{'category': 'safety', 'type': 'non_violent_crimes', 'evidence': 'how do I rob a bank?', ...}]
+```
+
+```json
+{
+  "id": "chatcmpl-xyz123",
+  "object": "chat.completion",
+  "model": "gpt-4.1-mini",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "I'm sorry, but I can't assist with that request."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 14,
+    "completion_tokens": 11,
+    "total_tokens": 25
+  }
+}
+```
+
+**Note:** In monitor mode, scanner results and evidence are included in response headers for every request, allowing you to build metrics and analyze detection patterns. The `flagged` field indicates whether Pillar's policy recommends blocking; your application can use the detailed scanner data for custom alerting, analytics, or false positive analysis.
+
 </TabItem>
 <TabItem value="secrets" label="Secrets">
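
The note in the hunk above suggests using the per-request headers to compare detection rates with block rates. A minimal sketch of that comparison follows, assuming the caller has already decoded each response into a `(flagged, scanners)` pair. The `detection_vs_block_rate` function and the sample records are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def detection_vs_block_rate(
    records: Iterable[Tuple[bool, Dict[str, bool]]]
) -> Dict[str, float]:
    """Summarize (flagged, scanners) pairs collected from decoded headers."""
    total = 0
    blocked = 0
    detected = 0
    per_scanner: Counter = Counter()
    for flagged, scanners in records:
        total += 1
        blocked += int(flagged)
        # A request counts as "detected" if any scanner reported a hit.
        hits = [name for name, hit in scanners.items() if hit]
        detected += int(bool(hits))
        per_scanner.update(hits)
    if total == 0:
        return {"detection_rate": 0.0, "block_rate": 0.0}
    summary = {"detection_rate": detected / total, "block_rate": blocked / total}
    for name, count in per_scanner.items():
        summary[f"scanner:{name}"] = count / total
    return summary


# Made-up sample data: four requests, two with detections, one recommended for blocking.
records = [
    (False, {"jailbreak": False, "safety": True}),   # detected, not blocked
    (True, {"jailbreak": True, "safety": False}),    # detected and blocked
    (False, {"jailbreak": False, "safety": False}),  # clean
    (False, {"jailbreak": False, "safety": False}),  # clean
]
summary = detection_vs_block_rate(records)
print(summary)
```

A gap between `detection_rate` and `block_rate` is expected in monitor mode, since `flagged` reflects the configured policy rather than individual scanner hits.
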

litellm/proxy/common_utils/callback_utils.py

Lines changed: 11 additions & 1 deletion
@@ -359,7 +359,11 @@ def get_remaining_tokens_and_requests_from_request_data(data: Dict) -> Dict[str,


 def get_logging_caching_headers(request_data: Dict) -> Optional[Dict]:
-    _metadata = request_data.get("metadata", None) or {}
+    _metadata = request_data.get("metadata", None)
+    if not _metadata:
+        _metadata = request_data.get("litellm_metadata", None)
+    if not isinstance(_metadata, dict):
+        _metadata = {}
     headers = {}
     if "applied_guardrails" in _metadata:
         headers["x-litellm-applied-guardrails"] = ",".join(
@@ -369,6 +373,12 @@ def get_logging_caching_headers(request_data: Dict) -> Optional[Dict]:
     if "semantic-similarity" in _metadata:
         headers["x-litellm-semantic-similarity"] = str(_metadata["semantic-similarity"])

+    pillar_headers = _metadata.get("pillar_response_headers")
+    if isinstance(pillar_headers, dict):
+        headers.update(pillar_headers)
+    elif "pillar_flagged" in _metadata:
+        headers["x-pillar-flagged"] = str(_metadata["pillar_flagged"]).lower()
+
     return headers

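
The two hunks above change the lookup order: `metadata` is tried first, then `litellm_metadata`, and any non-dict value falls back to an empty dict. Prebuilt `pillar_response_headers` take precedence over the bare `pillar_flagged` value. The standalone sketch below copies only that logic so the precedence can be exercised in isolation. The function name `pillar_headers_from_request` is hypothetical and is not part of LiteLLM.

```python
from typing import Any, Dict


def pillar_headers_from_request(request_data: Dict[str, Any]) -> Dict[str, str]:
    """Illustrative copy of the Pillar-related lookup order in the hunks above."""
    metadata = request_data.get("metadata")
    if not metadata:
        # Empty or missing "metadata" falls through to "litellm_metadata".
        metadata = request_data.get("litellm_metadata")
    if not isinstance(metadata, dict):
        metadata = {}

    headers: Dict[str, str] = {}
    pillar_headers = metadata.get("pillar_response_headers")
    if isinstance(pillar_headers, dict):
        # Prebuilt, already URL-encoded headers win.
        headers.update(pillar_headers)
    elif "pillar_flagged" in metadata:
        # Fallback: only the flagged status, lowercased to "true" or "false".
        headers["x-pillar-flagged"] = str(metadata["pillar_flagged"]).lower()
    return headers


fallback = pillar_headers_from_request(
    {"metadata": {}, "litellm_metadata": {"pillar_flagged": True}}
)
prebuilt = pillar_headers_from_request(
    {
        "metadata": {
            "pillar_response_headers": {"x-pillar-flagged": "false"},
            "pillar_flagged": True,
        }
    }
)
malformed = pillar_headers_from_request({"metadata": "not-a-dict"})
print(fallback, prebuilt, malformed)
```
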

litellm/proxy/guardrails/guardrail_hooks/pillar/pillar.py

Lines changed: 129 additions & 5 deletions
@@ -6,8 +6,10 @@
 # +-------------------------------------------------------------+

 # Standard library imports
+import json
 import os
-from typing import TYPE_CHECKING, Any, Dict, Literal, Optional, Tuple, Type, Union
+from urllib.parse import quote
+from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Tuple, Type, Union

 # Third-party imports
 from fastapi import HTTPException
@@ -28,13 +30,117 @@
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.common_utils.callback_utils import (
     add_guardrail_to_applied_guardrails_header,
+    get_metadata_variable_name_from_kwargs,
 )
 from litellm.types.guardrails import GuardrailEventHooks
 from litellm.types.utils import LLMResponseTypes

 if TYPE_CHECKING:
     from litellm.types.proxy.guardrails.guardrail_hooks.base import GuardrailConfigModel

+MAX_PILLAR_HEADER_VALUE_BYTES = 8 * 1024
+
+
+def _encode_json_for_header(data: Any) -> str:
+    """
+    JSON-serialize and URL-encode data for safe header transmission.
+    """
+    json_payload = json.dumps(data, ensure_ascii=False, separators=(",", ":"))
+    return quote(json_payload, safe="")
+
+
+def _truncate_evidence_payload(
+    evidence: Any, max_bytes: int = MAX_PILLAR_HEADER_VALUE_BYTES
+) -> Tuple[Any, str, bool]:
+    """
+    Truncate evidence payload so the encoded header value stays within max_bytes.
+
+    Returns:
+        truncated_evidence: Evidence list/value after truncation
+        encoded_value: URL-encoded JSON string for header
+        was_truncated: Whether truncation occurred
+    """
+    if not isinstance(evidence, list):
+        encoded = _encode_json_for_header(evidence)
+        if len(encoded.encode("utf-8")) <= max_bytes:
+            return evidence, encoded, False
+        truncated_value = "[truncated]"
+        return truncated_value, _encode_json_for_header(truncated_value), True
+
+    truncated: List[Any] = []
+    encoded = _encode_json_for_header(truncated)
+    truncated_flag = False
+
+    for entry in evidence:
+        working_entry: Any
+        if isinstance(entry, dict):
+            working_entry = dict(entry)
+        else:
+            working_entry = entry
+
+        truncated.append(working_entry)
+        encoded = _encode_json_for_header(truncated)
+
+        if len(encoded.encode("utf-8")) <= max_bytes:
+            continue
+
+        truncated_flag = True
+        if isinstance(working_entry, dict):
+            evidence_text = str(working_entry.get("evidence", ""))
+            if evidence_text:
+                step = max(1, len(evidence_text) // 2)
+                while len(encoded.encode("utf-8")) > max_bytes and evidence_text:
+                    evidence_text = (
+                        evidence_text[:-step] if len(evidence_text) > step else evidence_text[:-1]
+                    )
+                    step = max(1, step // 2)
+                    truncated_text = (
+                        f"{evidence_text}...[truncated]" if evidence_text else "[truncated]"
+                    )
+                    working_entry["evidence"] = truncated_text
+                    working_entry["evidence_truncated"] = True
+                    encoded = _encode_json_for_header(truncated)
+
+            if len(encoded.encode("utf-8")) <= max_bytes:
+                continue
+
+        truncated.pop()
+        encoded = _encode_json_for_header(truncated)
+
+    return truncated, encoded, truncated_flag
+
+
+def build_pillar_response_headers(metadata_store: Dict[str, Any]) -> Dict[str, str]:
+    """
+    Create URL-safe Pillar response headers and apply truncation metadata.
+    """
+    headers: Dict[str, str] = {}
+
+    if "pillar_flagged" in metadata_store:
+        headers["x-pillar-flagged"] = str(metadata_store["pillar_flagged"]).lower()
+
+    if "pillar_scanners" in metadata_store:
+        headers["x-pillar-scanners"] = _encode_json_for_header(metadata_store["pillar_scanners"])
+
+    if "pillar_evidence" in metadata_store:
+        truncated_evidence, encoded_value, truncated_flag = _truncate_evidence_payload(
+            metadata_store["pillar_evidence"]
+        )
+        metadata_store["pillar_evidence"] = truncated_evidence
+        if truncated_flag:
+            metadata_store["pillar_evidence_truncated"] = True
+        headers["x-pillar-evidence"] = encoded_value
+
+    if "pillar_session_id_response" in metadata_store:
+        headers["x-pillar-session-id"] = quote(
+            str(metadata_store["pillar_session_id_response"]), safe=""
+        )
+
+    if headers:
+        metadata_store["pillar_response_headers"] = headers
+
+    return headers
+

 # Exception classes
 class PillarGuardrailMissingSecrets(Exception):
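
The 8 KB limit in the hunk above is measured on the URL-encoded value, not on the raw JSON. Percent-encoding turns every reserved or non-ASCII byte into three characters, so the encoded header can be considerably larger than the payload it carries. The sketch below reproduces the same two encoding steps with the standard library to show the round trip and the size growth. The function name `encode_json_for_header` and the sample evidence are illustrative only.

```python
import json
from urllib.parse import quote, unquote


def encode_json_for_header(data):
    # Same two steps as `_encode_json_for_header` in the hunk above:
    # compact JSON without ASCII escaping, then percent-encode everything.
    payload = json.dumps(data, ensure_ascii=False, separators=(",", ":"))
    return quote(payload, safe="")


# Made-up evidence containing a non-ASCII character.
evidence = [{"category": "pii", "evidence": "café"}]
raw = json.dumps(evidence, ensure_ascii=False, separators=(",", ":"))
encoded = encode_json_for_header(evidence)
decoded = json.loads(unquote(encoded))

# Compare the UTF-8 size of the raw JSON with the size of the encoded header value.
print(len(raw.encode("utf-8")), len(encoded))
print(encoded)
```

Since the encoded value is pure ASCII, its character count equals its byte count, which is the quantity the truncation routine compares against `max_bytes`.
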
@@ -637,15 +743,31 @@ def _process_pillar_response(self, pillar_response: Dict[str, Any], original_dat

         flagged = pillar_response.get("flagged", False)

+        metadata_field = get_metadata_variable_name_from_kwargs(original_data)
+        if metadata_field not in original_data or not isinstance(original_data.get(metadata_field), dict):
+            original_data[metadata_field] = {}
+        metadata_store = original_data[metadata_field]
+
+        # Backwards compatibility - ensure metadata alias exists when different key used
+        if metadata_field != "metadata":
+            if "metadata" not in original_data or not isinstance(original_data.get("metadata"), dict):
+                original_data["metadata"] = metadata_store
+
         # Store session_id from Pillar response for potential reuse
         pillar_session_id = pillar_response.get("session_id")
         if pillar_session_id:
             verbose_proxy_logger.debug(f"Pillar Guardrail: Received session_id from server: {pillar_session_id}")
             # Store in request metadata for use in subsequent hooks
-            if "metadata" not in original_data:
-                original_data["metadata"] = {}
-            if "pillar_session_id" not in original_data["metadata"]:
-                original_data["metadata"]["pillar_session_id"] = pillar_session_id
+            if "pillar_session_id" not in metadata_store:
+                metadata_store["pillar_session_id"] = pillar_session_id
+            metadata_store["pillar_session_id_response"] = pillar_session_id
+
+        # Always set flagged status and scanner/evidence data for monitor mode
+        metadata_store["pillar_flagged"] = flagged
+        if self.include_scanners:
+            metadata_store["pillar_scanners"] = pillar_response.get("scanners", {})
+        if self.include_evidence:
+            metadata_store["pillar_evidence"] = pillar_response.get("evidence", [])

         if flagged:
             verbose_proxy_logger.warning("Pillar Guardrail: Threat detected")
@@ -654,6 +776,8 @@ def _process_pillar_response(self, pillar_response: Dict[str, Any], original_dat
             elif self.on_flagged_action == "monitor":
                 verbose_proxy_logger.info("Pillar Guardrail: Monitoring mode - allowing flagged content to proceed")

+        build_pillar_response_headers(metadata_store)
+
     def _raise_pillar_detection_exception(self, pillar_response: Dict[str, Any]) -> None:
         """
         Raise an HTTPException for Pillar security detections.
