[feat]: Add opt-in evidence results for Pillar Security guardrail during monitoring (#17812)

afogel · web-flow · commit 5df701d15ce9 · 2025-12-12T04:09:13.000-08:00
* add evidence headers to litellm

* ensure that evidence is surface-able, even in opt-in mode

* update the docs
diff --git a/docs/my-website/docs/proxy/guardrails/pillar_security.md b/docs/my-website/docs/proxy/guardrails/pillar_security.md
@@ -233,7 +233,7 @@ curl -X POST "http://localhost:4000/v1/chat/completions" \
   }'
 ```
 
-This provides clear, explicit conversation tracking that works seamlessly with LiteLLM's session management.
+This provides clear, explicit conversation tracking that works seamlessly with LiteLLM's session management. When using monitor mode, the session ID is returned in the `x-pillar-session-id` response header for easy correlation and tracking.
 
 ### Actions on Flagged Content
 
@@ -251,6 +251,73 @@ Logs the violation but allows the request to proceed:
 on_flagged_action: "monitor"
 ```
 
+**Response Headers:**
+
+You can opt in to receiving detection details in response headers by configuring `include_scanners: true` and/or `include_evidence: true`. When enabled, these headers are included for **every request**—not just flagged ones—enabling comprehensive metrics, false positive analysis, and threat investigation.
+
+- **`x-pillar-flagged`**: Boolean string indicating Pillar's blocking recommendation (`"true"` or `"false"`)
+- **`x-pillar-scanners`**: URL-encoded JSON object showing scanner categories (e.g., `%7B%22jailbreak%22%3Atrue%7D`) — requires `include_scanners: true`
+- **`x-pillar-evidence`**: URL-encoded JSON array of detection evidence (may contain items even when `flagged` is `false`) — requires `include_evidence: true`
+- **`x-pillar-session-id`**: URL-encoded session ID for correlation and investigation
+
+:::info Understanding `flagged` vs Scanner Results
+The `flagged` field is Pillar's **policy-level blocking recommendation**, which may differ from individual scanner results:
+
+- **`flagged: true`** → Pillar recommends blocking based on your configured policies
+- **`flagged: false`** → Pillar does not recommend blocking, but individual scanners may still detect content
+
+For example, the `toxic_language` scanner might detect profanity (`scanners.toxic_language: true`) while `flagged` remains `false` if your Pillar policy doesn't block on toxic language alone. This allows you to:
+- Monitor threats without blocking users
+- Build metrics on detection rates vs block rates
+- Analyze false positive rates by comparing scanner results to user feedback
+:::
+
+The `x-pillar-scanners`, `x-pillar-evidence`, and `x-pillar-session-id` headers are URL-encoded (percent-encoded) so that their values are ASCII-safe. This is necessary because HTTP headers only support ISO-8859-1 characters and cannot reliably carry raw JSON special characters (`{`, `"`, `:`) or Unicode text. To read `x-pillar-scanners` and `x-pillar-evidence`, first URL-decode the value, then parse it as JSON. The session ID is a plain string, so URL-decoding alone is sufficient.
+
+LiteLLM caps the URL-encoded `x-pillar-evidence` value at 8 KB to avoid proxy limits. Note that most proxies and servers also enforce a total header size limit of approximately 32 KB across all headers combined. When truncation occurs, each shortened evidence item includes an `"evidence_truncated": true` flag and the metadata contains `pillar_evidence_truncated: true`.
+
+**Example Response Headers (URL-encoded):**
+```http
+x-pillar-flagged: true
+x-pillar-session-id: abc-123-def-456
+x-pillar-scanners: %7B%22jailbreak%22%3Atrue%2C%22prompt_injection%22%3Afalse%2C%22toxic_language%22%3Afalse%7D
+x-pillar-evidence: %5B%7B%22category%22%3A%22prompt_injection%22%2C%22evidence%22%3A%22Ignore%20previous%20instructions%22%7D%5D
+```
+
+**After Decoding:**
+```json
+// x-pillar-scanners
+{"jailbreak": true, "prompt_injection": false, "toxic_language": false}
+
+// x-pillar-evidence
+[{"category": "prompt_injection", "evidence": "Ignore previous instructions"}]
+```
+
+**Decoding Example (Python):**
+
+```python
+from urllib.parse import unquote
+import json
+
+# Step 1: URL-decode the header value (converts %7B to {, %22 to ", etc.)
+# Step 2: Parse the resulting JSON string
+scanners = json.loads(unquote(response.headers["x-pillar-scanners"]))
+evidence = json.loads(unquote(response.headers["x-pillar-evidence"]))
+
+# Session ID is a plain string, so only URL-decode is needed (no JSON parsing)
+session_id = unquote(response.headers["x-pillar-session-id"])
+```
+
+:::tip
+LiteLLM mirrors the encoded values onto `metadata["pillar_response_headers"]` so you can inspect exactly what was returned. When truncation occurs, it sets `metadata["pillar_evidence_truncated"]` to `true` and marks affected evidence items with `"evidence_truncated": true`. Evidence text is shortened with a `...[truncated]` suffix, and entire evidence entries may be removed if necessary to stay under the 8 KB header limit. Check these flags to determine if full evidence details are available in your logs.
+:::
+
+These headers allow your application to:
+- Track threats without blocking legitimate users
+- Implement custom handling logic based on threat types
+- Build analytics and alerting on security events
+- Correlate threats across requests using session IDs
+
 ### Resilience and Error Handling
 
 #### Graceful Degradation (`fallback_on_error`)
@@ -544,6 +611,79 @@ curl -X POST "http://localhost:4000/v1/chat/completions" \
 }
 ```
 
+</TabItem>
+<TabItem value="monitor" label="Monitor Mode with Headers">
+
+**Monitor mode request with scanner detection:**
+
+```bash
+# Test with content that triggers scanner detection
+curl -v -X POST "http://localhost:4000/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer YOUR_LITELLM_PROXY_MASTER_KEY" \
+  -d '{
+    "model": "gpt-4.1-mini",
+    "messages": [{"role": "user", "content": "how do I rob a bank?"}],
+    "max_tokens": 50
+  }'
+```
+
+**Expected response (Allowed with headers):**
+
+The request succeeds and returns the LLM response. Headers are included for **all requests** when `include_scanners` and `include_evidence` are enabled—even when `flagged` is `false`:
+
+```http
+HTTP/1.1 200 OK
+x-litellm-applied-guardrails: pillar-monitor-everything,pillar-monitor-everything
+x-pillar-flagged: false
+x-pillar-scanners: %7B%22jailbreak%22%3Afalse%2C%22safety%22%3Atrue%2C%22prompt_injection%22%3Afalse%2C%22pii%22%3Afalse%2C%22secret%22%3Afalse%2C%22toxic_language%22%3Afalse%7D
+x-pillar-evidence: %5B%7B%22category%22%3A%22safety%22%2C%22type%22%3A%22non_violent_crimes%22%2C%22end_idx%22%3A20%2C%22evidence%22%3A%22how%20do%20I%20rob%20a%20bank%3F%22%2C%22metadata%22%3A%7B%22start_idx%22%3A0%2C%22end_idx%22%3A20%7D%7D%5D
+x-pillar-session-id: d9433f86-b428-4ee7-93ee-e97a53f8a180
+```
+
+Notice that `x-pillar-flagged` is `false` even though the `safety` scanner reports `true`. This is because `flagged` represents Pillar's policy-level blocking recommendation, while individual scanners report their own detections.
+
+```python
+from urllib.parse import unquote
+import json
+
+scanners = json.loads(unquote(response.headers["x-pillar-scanners"]))
+evidence = json.loads(unquote(response.headers["x-pillar-evidence"]))
+session_id = unquote(response.headers["x-pillar-session-id"])
+flagged = response.headers["x-pillar-flagged"] == "true"
+
+# Scanner detected safety issue, but policy didn't flag for blocking
+print(f"Flagged for blocking: {flagged}")  # False
+print(f"Safety issue detected: {scanners.get('safety')}")  # True
+print(f"Evidence: {evidence}")
+# [{'category': 'safety', 'type': 'non_violent_crimes', 'evidence': 'how do I rob a bank?', ...}]
+```
+
+```json
+{
+  "id": "chatcmpl-xyz123",
+  "object": "chat.completion",
+  "model": "gpt-4.1-mini",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "I'm sorry, but I can't assist with that request."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 14,
+    "completion_tokens": 11,
+    "total_tokens": 25
+  }
+}
+```
+
+**Note:** In monitor mode, scanner results and evidence are included in response headers for every request, allowing you to build metrics and analyze detection patterns. The `flagged` field indicates whether Pillar's policy recommends blocking—your application can use the detailed scanner data for custom alerting, analytics, or false positive analysis.
+
 </TabItem>
 <TabItem value="secrets" label="Secrets">
 
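Reviewer sketch for the documentation hunk above: the decoding snippets index `response.headers[...]` directly, which would fail on a missing key when `include_scanners` or `include_evidence` is not enabled. A minimal client-side helper that tolerates absent headers might look like the following. `decode_pillar_headers` is a hypothetical name, not part of LiteLLM, and it assumes a plain mapping with lowercase header names (real HTTP clients usually provide a case-insensitive mapping).

```python
import json
from typing import Any, Dict, Mapping
from urllib.parse import unquote


def decode_pillar_headers(headers: Mapping[str, str]) -> Dict[str, Any]:
    """Decode whichever x-pillar-* headers are present and skip the rest.

    Hypothetical helper for illustration; assumes lowercase header names.
    """
    decoded: Dict[str, Any] = {}
    if "x-pillar-flagged" in headers:
        # The header carries the string "true" or "false".
        decoded["flagged"] = headers["x-pillar-flagged"] == "true"
    if "x-pillar-session-id" in headers:
        # Plain string: URL-decode only, no JSON parsing.
        decoded["session_id"] = unquote(headers["x-pillar-session-id"])
    for name, key in (
        ("x-pillar-scanners", "scanners"),
        ("x-pillar-evidence", "evidence"),
    ):
        if name in headers:
            # URL-decode first, then parse the JSON payload.
            decoded[key] = json.loads(unquote(headers[name]))
    return decoded


# Sample values in the same shape as the documented headers; evidence is omitted
# to simulate a deployment where include_evidence is not enabled.
sample = {
    "x-pillar-flagged": "false",
    "x-pillar-scanners": "%7B%22jailbreak%22%3Afalse%2C%22safety%22%3Atrue%7D",
    "x-pillar-session-id": "d9433f86-b428-4ee7-93ee-e97a53f8a180",
}
result = decode_pillar_headers(sample)
print(result)
```

Returning only the keys that were actually present lets callers distinguish "no evidence header was sent" from "an empty evidence list was sent".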
diff --git a/litellm/proxy/common_utils/callback_utils.py b/litellm/proxy/common_utils/callback_utils.py
@@ -359,7 +359,11 @@ def get_remaining_tokens_and_requests_from_request_data(data: Dict) -> Dict[str,
 
 
 def get_logging_caching_headers(request_data: Dict) -> Optional[Dict]:
-    _metadata = request_data.get("metadata", None) or {}
+    _metadata = request_data.get("metadata", None)
+    if not _metadata:
+        _metadata = request_data.get("litellm_metadata", None)
+    if not isinstance(_metadata, dict):
+        _metadata = {}
     headers = {}
     if "applied_guardrails" in _metadata:
         headers["x-litellm-applied-guardrails"] = ",".join(
@@ -369,6 +373,12 @@ def get_logging_caching_headers(request_data: Dict) -> Optional[Dict]:
     if "semantic-similarity" in _metadata:
         headers["x-litellm-semantic-similarity"] = str(_metadata["semantic-similarity"])
 
+    pillar_headers = _metadata.get("pillar_response_headers")
+    if isinstance(pillar_headers, dict):
+        headers.update(pillar_headers)
+    elif "pillar_flagged" in _metadata:
+        headers["x-pillar-flagged"] = str(_metadata["pillar_flagged"]).lower()
+
     return headers
 
 
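Reviewer sketch for the `callback_utils.py` hunk above: the lookup order it introduces can be exercised in isolation. The code below re-implements only the logic visible in this hunk under hypothetical names (`pick_metadata`, `pillar_headers_from`); it is not the LiteLLM function itself.

```python
from typing import Any, Dict


def pick_metadata(request_data: Dict[str, Any]) -> Dict[str, Any]:
    """Prefer "metadata"; fall back to "litellm_metadata"; always return a dict."""
    metadata = request_data.get("metadata")
    if not metadata:
        # Missing, None, or empty: try the alternate key.
        metadata = request_data.get("litellm_metadata")
    if not isinstance(metadata, dict):
        # Guards against None and against non-dict values such as strings.
        metadata = {}
    return metadata


def pillar_headers_from(metadata: Dict[str, Any]) -> Dict[str, str]:
    """Use prebuilt Pillar headers when present, else derive only the flagged header."""
    headers: Dict[str, str] = {}
    pillar_headers = metadata.get("pillar_response_headers")
    if isinstance(pillar_headers, dict):
        headers.update(pillar_headers)
    elif "pillar_flagged" in metadata:
        headers["x-pillar-flagged"] = str(metadata["pillar_flagged"]).lower()
    return headers


# Empty "metadata" falls through to "litellm_metadata".
fallback = pick_metadata({"metadata": {}, "litellm_metadata": {"pillar_flagged": True}})
print(pillar_headers_from(fallback))

# A non-dict value is replaced by an empty dict, so no headers are produced.
print(pillar_headers_from(pick_metadata({"metadata": "oops"})))
```

Note that a truthy non-dict `metadata` value does not trigger the `litellm_metadata` fallback; it is simply replaced by an empty dict.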
diff --git a/litellm/proxy/guardrails/guardrail_hooks/pillar/pillar.py b/litellm/proxy/guardrails/guardrail_hooks/pillar/pillar.py
@@ -6,8 +6,10 @@
 # +-------------------------------------------------------------+
 
 # Standard library imports
+import json
 import os
-from typing import TYPE_CHECKING, Any, Dict, Literal, Optional, Tuple, Type, Union
+from urllib.parse import quote
+from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Tuple, Type, Union
 
 # Third-party imports
 from fastapi import HTTPException
@@ -28,13 +30,117 @@
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.proxy.common_utils.callback_utils import (
     add_guardrail_to_applied_guardrails_header,
+    get_metadata_variable_name_from_kwargs,
 )
 from litellm.types.guardrails import GuardrailEventHooks
 from litellm.types.utils import LLMResponseTypes
 
 if TYPE_CHECKING:
     from litellm.types.proxy.guardrails.guardrail_hooks.base import GuardrailConfigModel
 
+MAX_PILLAR_HEADER_VALUE_BYTES = 8 * 1024
+
+
+def _encode_json_for_header(data: Any) -> str:
+    """
+    JSON-serialize and URL-encode data for safe header transmission.
+    """
+    json_payload = json.dumps(data, ensure_ascii=False, separators=(",", ":"))
+    return quote(json_payload, safe="")
+
+
+def _truncate_evidence_payload(
+    evidence: Any, max_bytes: int = MAX_PILLAR_HEADER_VALUE_BYTES
+) -> Tuple[Any, str, bool]:
+    """
+    Truncate evidence payload so the encoded header value stays within max_bytes.
+
+    Returns:
+        truncated_evidence: Evidence list/value after truncation
+        encoded_value: URL-encoded JSON string for header
+        was_truncated: Whether truncation occurred
+    """
+    if not isinstance(evidence, list):
+        encoded = _encode_json_for_header(evidence)
+        if len(encoded.encode("utf-8")) <= max_bytes:
+            return evidence, encoded, False
+        truncated_value = "[truncated]"
+        return truncated_value, _encode_json_for_header(truncated_value), True
+
+    truncated: List[Any] = []
+    encoded = _encode_json_for_header(truncated)
+    truncated_flag = False
+
+    for entry in evidence:
+        working_entry: Any
+        if isinstance(entry, dict):
+            working_entry = dict(entry)
+        else:
+            working_entry = entry
+
+        truncated.append(working_entry)
+        encoded = _encode_json_for_header(truncated)
+
+        if len(encoded.encode("utf-8")) <= max_bytes:
+            continue
+
+        truncated_flag = True
+        if isinstance(working_entry, dict):
+            evidence_text = str(working_entry.get("evidence", ""))
+            if evidence_text:
+                step = max(1, len(evidence_text) // 2)
+                while len(encoded.encode("utf-8")) > max_bytes and evidence_text:
+                    evidence_text = (
+                        evidence_text[:-step] if len(evidence_text) > step else evidence_text[:-1]
+                    )
+                    step = max(1, step // 2)
+                    truncated_text = (
+                        f"{evidence_text}...[truncated]" if evidence_text else "[truncated]"
+                    )
+                    working_entry["evidence"] = truncated_text
+                    working_entry["evidence_truncated"] = True
+                    encoded = _encode_json_for_header(truncated)
+
+                if len(encoded.encode("utf-8")) <= max_bytes:
+                    continue
+
+        truncated.pop()
+        encoded = _encode_json_for_header(truncated)
+
+    return truncated, encoded, truncated_flag
+
+
+def build_pillar_response_headers(metadata_store: Dict[str, Any]) -> Dict[str, str]:
+    """
+    Create URL-safe Pillar response headers and apply truncation metadata.
+    """
+    headers: Dict[str, str] = {}
+
+    if "pillar_flagged" in metadata_store:
+        headers["x-pillar-flagged"] = str(metadata_store["pillar_flagged"]).lower()
+
+    if "pillar_scanners" in metadata_store:
+        headers["x-pillar-scanners"] = _encode_json_for_header(metadata_store["pillar_scanners"])
+
+    if "pillar_evidence" in metadata_store:
+        truncated_evidence, encoded_value, truncated_flag = _truncate_evidence_payload(
+            metadata_store["pillar_evidence"]
+        )
+        metadata_store["pillar_evidence"] = truncated_evidence
+        if truncated_flag:
+            metadata_store["pillar_evidence_truncated"] = True
+        headers["x-pillar-evidence"] = encoded_value
+
+    if "pillar_session_id_response" in metadata_store:
+        headers["x-pillar-session-id"] = quote(
+            str(metadata_store["pillar_session_id_response"]), safe=""
+        )
+
+    if headers:
+        metadata_store["pillar_response_headers"] = headers
+
+    return headers
+
 
 # Exception classes
 class PillarGuardrailMissingSecrets(Exception):
@@ -637,15 +743,31 @@ def _process_pillar_response(self, pillar_response: Dict[str, Any], original_dat
 
         flagged = pillar_response.get("flagged", False)
 
+        metadata_field = get_metadata_variable_name_from_kwargs(original_data)
+        if metadata_field not in original_data or not isinstance(original_data.get(metadata_field), dict):
+            original_data[metadata_field] = {}
+        metadata_store = original_data[metadata_field]
+
+        # Backwards compatibility - ensure metadata alias exists when different key used
+        if metadata_field != "metadata":
+            if "metadata" not in original_data or not isinstance(original_data.get("metadata"), dict):
+                original_data["metadata"] = metadata_store
+
         # Store session_id from Pillar response for potential reuse
         pillar_session_id = pillar_response.get("session_id")
         if pillar_session_id:
             verbose_proxy_logger.debug(f"Pillar Guardrail: Received session_id from server: {pillar_session_id}")
             # Store in request metadata for use in subsequent hooks
-            if "metadata" not in original_data:
-                original_data["metadata"] = {}
-            if "pillar_session_id" not in original_data["metadata"]:
-                original_data["metadata"]["pillar_session_id"] = pillar_session_id
+            if "pillar_session_id" not in metadata_store:
+                metadata_store["pillar_session_id"] = pillar_session_id
+            metadata_store["pillar_session_id_response"] = pillar_session_id
+
+        # Always set flagged status and scanner/evidence data for monitor mode
+        metadata_store["pillar_flagged"] = flagged
+        if self.include_scanners:
+            metadata_store["pillar_scanners"] = pillar_response.get("scanners", {})
+        if self.include_evidence:
+            metadata_store["pillar_evidence"] = pillar_response.get("evidence", [])
 
         if flagged:
             verbose_proxy_logger.warning("Pillar Guardrail: Threat detected")
@@ -654,6 +776,8 @@ def _process_pillar_response(self, pillar_response: Dict[str, Any], original_dat
             elif self.on_flagged_action == "monitor":
                 verbose_proxy_logger.info("Pillar Guardrail: Monitoring mode - allowing flagged content to proceed")
 
+        build_pillar_response_headers(metadata_store)
+
     def _raise_pillar_detection_exception(self, pillar_response: Dict[str, Any]) -> None:
         """
         Raise an HTTPException for Pillar security detections.
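Reviewer sketch for the `pillar.py` hunk above: `_truncate_evidence_payload` measures the URL-encoded string rather than the raw JSON, which matters because percent-encoding expands the payload. The code below mirrors the two steps of `_encode_json_for_header` under a hypothetical name (`encode_json_for_header`) and compares the sizes; the sample evidence text is made up for illustration.

```python
import json
from typing import Any
from urllib.parse import quote, unquote


def encode_json_for_header(data: Any) -> str:
    # Same two steps as the helper in this diff: compact JSON, then quote every
    # reserved character (safe="" means "/" is encoded as well).
    payload = json.dumps(data, ensure_ascii=False, separators=(",", ":"))
    return quote(payload, safe="")


# Illustrative evidence entry containing non-ASCII characters.
evidence = [{"category": "safety", "evidence": "café ☕ how do I rob a bank?"}]

raw_json = json.dumps(evidence, ensure_ascii=False, separators=(",", ":"))
encoded = encode_json_for_header(evidence)

raw_bytes = len(raw_json.encode("utf-8"))
encoded_bytes = len(encoded.encode("utf-8"))
print(f"raw JSON: {raw_bytes} bytes, encoded header value: {encoded_bytes} bytes")

# The encoded value is pure ASCII and decodes back to the original data.
print(encoded.isascii(), json.loads(unquote(encoded)) == evidence)
```

Because every quote, brace, colon, and space becomes a three-character escape, and each non-ASCII byte does too, an evidence list well under 8 KB as JSON can still exceed the header limit once encoded.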
diff --git a/tests/test_litellm/proxy/guardrails/test_pillar_guardrails.py b/tests/test_litellm/proxy/guardrails/test_pillar_guardrails.py