
Conversation

@galzilber
Contributor

@galzilber galzilber commented Dec 30, 2025

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Important

This PR introduces dynamic evaluator model generation and management using Pydantic, enhancing input validation and output deserialization for evaluators.

  • Behavior:
    • Evaluators validate inputs against request schemas before execution in evaluator.py.
    • Execution results can be deserialized into user-provided typed models using typed_result() in model.py.
    • Dynamic evaluator factory exposes evaluator methods at runtime and adds a create_evaluator entry point in evaluators_made_by_traceloop.py.
  • Models:
    • Adds automated generation of evaluator request/response models in generated/evaluators.
    • Central registry for models in registry.py.
  • Scripts:
    • Adds generate_evaluator_models.py for model generation from OpenAPI specs.
    • Adds generate-models.sh for running the generation script.
  • Misc:
    • Updated linter configuration in .flake8 to exempt generated evaluator files from line-length (E501) checks.
    • Added datamodel-code-generator as a dev dependency in pyproject.toml.

This description was created by Ellipsis for b1e4b30. You can customize this summary. It will automatically update as commits are pushed.
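A minimal usage sketch of the two entry points described above, create_evaluator and typed_result, pieced together from the examples quoted in the reviews below; import paths and signatures follow those quotes and may differ slightly from the shipped SDK:

from traceloop.sdk.evaluator import create_evaluator
from traceloop.sdk.evaluators_generated import PIIDetectorResponse

# Resolve an evaluator by slug; unknown slugs raise a ValueError listing the
# available evaluators, and None-valued config entries are dropped.
evaluator = create_evaluator("pii-detector", probability_threshold=0.8)

# After an evaluator run returns an ExecutionResponse, the raw result dict can
# be deserialized into the matching generated Pydantic model:
#     pii = result.typed_result(PIIDetectorResponse)
#     print(pii.has_pii)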


Summary by CodeRabbit

Release Notes

  • New Features

    • Evaluator input validation now enforced at runtime
    • Typed result deserialization for execution responses
    • New create_evaluator API for evaluator instantiation
    • Model generation tooling from OpenAPI specifications
  • Chores

    • Added development dependencies for code generation
    • Updated linting configuration for generated files

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai bot commented Dec 30, 2025

📝 Walkthrough

This PR introduces automated code generation infrastructure for creating Pydantic models from OpenAPI/Swagger specifications, adds runtime input validation for evaluators, and refactors the evaluator factory to use a dynamic, registry-driven design instead of static methods.

Changes

  • Code Generation Infrastructure — scripts/codegen/generate_evaluator_models.py, scripts/generate-models.sh, packages/traceloop-sdk/pyproject.toml
    New codegen script that parses Swagger JSON, recursively resolves $ref definitions, generates request/response Pydantic models per evaluator slug, and produces a registry with accessor functions. A Bash wrapper script orchestrates the codegen via Poetry. Adds datamodel-code-generator ^0.26.0 as a dev dependency.
  • Evaluator Input Validation — packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
    Adds a _validate_evaluator_input() helper that retrieves the request model from the registry and validates input using Pydantic. Invoked at the start of run_experiment_evaluator and trigger_experiment_evaluator to enforce validation before request construction.
  • Model Type-Safe Result Deserialization — packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
    Extends ExecutionResponse with a new typed_result(model: Type[T]) -> T method enabling type-safe deserialization of results into arbitrary Pydantic models. Adds a module-level TypeVar T bound to BaseModel.
  • Dynamic Evaluator Factory — packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py, packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
    Replaces static factory methods with a registry-driven design: removes all hardcoded methods, introduces the metaclass _EvaluatorMadeByTraceloopMeta to dynamically generate methods per slug at runtime, adds create_evaluator(slug, **config) for validated config resolution, and exposes the new public API in __init__.py.
  • Linting Configuration — packages/traceloop-sdk/.flake8
    Expands per-file-ignores to suppress E501 (line too long) for generated files under traceloop/sdk/generated/**/*.py.
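A hedged reconstruction of the Evaluator Input Validation entry above, assembled from the code quoted later in this review (the import path for get_request_model inside evaluator.py is an assumption):

from typing import Dict

from pydantic import ValidationError

from traceloop.sdk.evaluators_generated import get_request_model


def _validate_evaluator_input(slug: str, input: Dict[str, str]) -> None:
    # Look up the generated request model for this evaluator slug; unknown
    # slugs currently skip validation, as a later review comment points out.
    request_model = get_request_model(slug)
    if request_model:
        try:
            request_model(**input)
        except ValidationError as e:
            raise ValueError(f"Invalid input for '{slug}': {e}") from e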

Sequence Diagram(s)

sequenceDiagram
    participant User as Developer
    participant Script as generate-models.sh
    participant Python as generate_evaluator_models.py
    participant Swagger as Swagger JSON
    participant Codegen as datamodel-codegen
    participant Output as Generated Files

    User->>Script: ./generate-models.sh swagger.json output/
    Script->>Python: poetry run python generate_evaluator_models.py
    Python->>Swagger: Load & parse JSON
    Swagger-->>Python: Swagger definitions
    Note over Python: Extract definitions,<br/>resolve $ref,<br/>build slug mappings
    Python->>Output: Write registry.py
    Python->>Output: Write temp schema.json
    Python->>Codegen: Invoke with filtered schema
    Codegen-->>Output: Generate Pydantic models
    Python->>Output: Generate __init__.py
    Python->>Output: Clean temp files
    Output-->>User: Request/response models + registry
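The "resolve $ref" step in the diagram above can be sketched roughly as follows; this illustrates the technique, is not the PR's actual script, and assumes the spec contains no circular references:

import json
from typing import Any, Dict


def resolve_refs(node: Any, definitions: Dict[str, Any]) -> Any:
    # Recursively inline every {"$ref": "#/definitions/Name"} with the
    # referenced schema so each evaluator model can be generated standalone.
    if isinstance(node, dict):
        if "$ref" in node:
            name = node["$ref"].split("/")[-1]
            return resolve_refs(definitions[name], definitions)
        return {key: resolve_refs(value, definitions) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, definitions) for item in node]
    return node


with open("swagger.json", encoding="utf-8") as f:
    spec = json.load(f)

definitions = spec.get("definitions", {})
resolved = {name: resolve_refs(schema, definitions) for name, schema in definitions.items()}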
sequenceDiagram
    participant Client as Evaluator Client
    participant Run as run_experiment_evaluator()
    participant Validate as _validate_evaluator_input()
    participant Registry as REQUEST_MODELS
    participant Pydantic as Pydantic Validator
    participant Build as Build Evaluator Request

    Client->>Run: run(slug, input)
    Run->>Validate: _validate_evaluator_input(slug, input)
    Validate->>Registry: get_request_model(slug)
    Registry-->>Validate: Request Model
    Validate->>Pydantic: model(**input)
    alt Validation Success
        Pydantic-->>Validate: Validated instance
        Validate-->>Run: Return (no error)
    else Validation Failure
        Pydantic-->>Validate: ValidationError
        Validate-->>Run: Raise ValueError
        Run-->>Client: Error
    end
    Run->>Build: Construct request
    Build-->>Client: Response
sequenceDiagram
    participant User as Code
    participant EvalFactory as EvaluatorMadeByTraceloop
    participant Meta as _EvaluatorMadeByTraceloopMeta
    participant Registry as REQUEST_MODELS
    participant CreateEval as create_evaluator()

    User->>EvalFactory: EvaluatorMadeByTraceloop.pii_detector(...)
    Note over EvalFactory: Metaclass intercepts<br/>attribute access
    EvalFactory->>Meta: __getattr__('pii_detector')
    Meta->>Registry: Lookup 'pii_detector' in slugs
    alt Slug exists
        Meta->>Meta: Generate method from slug
        Meta-->>EvalFactory: Dynamic method
    else Slug not found
        Meta-->>EvalFactory: AttributeError
    end
    User->>CreateEval: create_evaluator(slug='pii_detector', config)
    CreateEval->>Registry: get_request_model(slug)
    CreateEval->>CreateEval: Validate slug,<br/>filter None config,<br/>resolve required fields
    CreateEval-->>User: EvaluatorDetails

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A code generator hops to life,
Swagger specs turn models without strife,
Validation guards each eager input,
Metaclass magic makes factories submit,
Dynamic methods bloom—evaluators fly free!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'fix(evals): auto generate evals' accurately describes the main objective of the PR, which is to automate the generation of evaluator models and enhance the evaluator system.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 84.21%, which meets the required threshold of 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed everything up to a808557 in 2 minutes and 27 seconds.
  • Reviewed 1280 lines of code in 9 files
  • Skipped 1 files when reviewing.
  • Skipped posting 5 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/pyproject.toml:88
  • Draft comment:
    Consider moving 'datamodel-code-generator' to a dev dependency if it’s only used during model generation.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:18
  • Draft comment:
    The '_validate_evaluator_input' function type hints input as Dict[str, str] but some evaluator models expect non-string types (e.g., float). Consider using Dict[str, Any].
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% The comment makes a reasonable point about type hints. If evaluator models can accept non-string types (like floats), then the type hint should reflect that. However, I notice that the calling methods (run_experiment_evaluator and trigger_experiment_evaluator) also have input: Dict[str, str] in their signatures. This suggests either: (1) the entire chain should be updated to use Dict[str, Any], or (2) the current Dict[str, str] is intentional. Without seeing the actual evaluator models or get_request_model implementation, I cannot definitively say whether evaluators accept non-string types. The comment is speculative ("some evaluator models expect non-string types") without showing concrete evidence. This violates the rule about not making speculative comments. The comment assumes that evaluator models expect non-string types without providing concrete evidence from the codebase. The fact that multiple methods in the chain use Dict[str, str] suggests this might be intentional design. Without seeing actual evaluator model definitions or examples where non-string types are required, this is speculative. While the comment could be speculative, type hints should accurately reflect what the code accepts. If the validation function uses pydantic models that can accept various types, then Dict[str, str] would be too restrictive and could cause type checking issues. However, without strong evidence that this is actually a problem, the comment remains speculative. The comment is speculative without concrete evidence that evaluator models actually require non-string types. The consistent use of Dict[str, str] throughout the call chain suggests this might be intentional. Following the rule that comments should not be speculative ("If X, then Y is an issue"), this comment should be deleted.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/model.py:48
  • Draft comment:
    In 'typed_result()', consider using 'model.parse_obj(self.result)' for more robust parsing.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50% The comment suggests using parse_obj() for "more robust parsing". However, I need to consider: 1) In Pydantic v2, parse_obj() is deprecated and replaced with model_validate(). 2) In Pydantic v1, parse_obj() and direct instantiation with **kwargs are functionally similar for dictionaries. 3) The comment doesn't explain what "more robust" means - both methods perform validation. 4) Without knowing the Pydantic version being used, this advice could be outdated or incorrect. 5) The current implementation model(**self.result) is a standard and valid way to instantiate Pydantic models. The comment is somewhat speculative about what would be "more robust" without clear evidence. I might be missing that parse_obj() could handle edge cases better in certain Pydantic versions, or that there are specific validation behaviors that differ between the two approaches. The comment might be based on legitimate Pydantic best practices that I'm not fully aware of. While there might be subtle differences, the comment doesn't provide specific evidence of why parse_obj() would be more robust in this context. Additionally, in modern Pydantic (v2), parse_obj() is deprecated, making this advice potentially outdated. The current implementation is standard and valid, and without clear evidence of a problem, this comment appears to be a speculative suggestion rather than addressing a definite issue. This comment should be deleted. It makes a speculative suggestion without explaining what specific robustness issue it addresses. The current implementation using model(**self.result) is a standard, valid approach for Pydantic model instantiation, and the suggested alternative might even be deprecated in newer Pydantic versions.
4. scripts/codegen/generate_evaluator_models.py:180
  • Draft comment:
    The regex-based extraction in 'generate_init_py' might be fragile if the generated file format changes; consider using an AST-based approach if this becomes an issue.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None
5. scripts/codegen/generate_evaluator_models.py:212
  • Draft comment:
    Typographical issue: In the multi-line string for the registry import in init.py (lines 207-214), there's an extra colon after get_response_model, before the closing parenthesis. This appears to be a typo—please remove the extra colon so that the closing parenthesis is correctly formatted.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_t6RIKriUveS7wogj

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)

18-33: Validation silently skips unknown evaluator slugs.

When get_request_model(slug) returns None (slug not in registry), validation is bypassed without warning. This could mask typos in evaluator slugs until the API call fails.

Consider logging a warning or raising an error for unknown slugs to provide earlier feedback:

🔎 Optional: Add warning for unknown slugs
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _validate_evaluator_input(slug: str, input: Dict[str, str]) -> None:
     """Validate input against the evaluator's request model if available.
     ...
     """
     request_model = get_request_model(slug)
-    if request_model:
+    if request_model is None:
+        logger.debug(f"No request model found for evaluator '{slug}', skipping validation")
+        return
+    try:
+        request_model(**input)
+    except ValidationError as e:
+        raise ValueError(f"Invalid input for '{slug}': {e}") from e
-        try:
-            request_model(**input)
-        except ValidationError as e:
-            raise ValueError(f"Invalid input for '{slug}': {e}") from e
packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (1)

48-63: Consider handling ValidationError in typed_result.

If self.result doesn't conform to the provided model's schema, model(**self.result) will raise a Pydantic ValidationError. This is reasonable behavior, but you may want to document this in the docstring or wrap it with additional context:

🔎 Optional: Document or wrap ValidationError
     def typed_result(self, model: Type[T]) -> T:
         """Parse result into a typed Pydantic model.

         Args:
             model: The Pydantic model class to parse the result into

         Returns:
             An instance of the provided model class

+        Raises:
+            pydantic.ValidationError: If result doesn't match the model schema
+
         Example:
             from traceloop.sdk.evaluators_generated import PIIDetectorResponse
             result = await evaluator.run_experiment_evaluator(...)
             pii = result.typed_result(PIIDetectorResponse)
             print(pii.has_pii)  # IDE autocomplete works!
         """
         return model(**self.result)
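For context, the generic signature mentioned in the walkthrough (typed_result(model: Type[T]) -> T with T bound to BaseModel) would look roughly like this sketch; the surrounding class and its result field are assumptions, not the file's exact contents:

from typing import Any, Dict, Type, TypeVar

from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


class ExecutionResponseSketch(BaseModel):
    result: Dict[str, Any]

    def typed_result(self, model: Type[T]) -> T:
        # Deserialize the raw result dict into the caller-provided model.
        return model(**self.result)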
scripts/codegen/generate_evaluator_models.py (2)

42-42: Unused loop variable method.

The loop variable method is not used within the loop body. Rename to _method to indicate it's intentionally unused.

🔎 Proposed fix
-            for method, details in methods.items():
+            for _method, details in methods.items():

27-28: Consider specifying explicit encoding for file operations.

While json.load handles encoding, specifying encoding="utf-8" explicitly ensures consistent behavior across different system locales.

🔎 Proposed fix
-    with open(swagger_path) as f:
+    with open(swagger_path, encoding="utf-8") as f:
         data = json.load(f)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py (1)

87-162: __all__ sorting should be addressed in the generator.

The static analysis tool notes that __all__ is not isort-style sorted. Since this file is auto-generated, the sorting logic should be fixed in scripts/codegen/generate_evaluator_models.py rather than manually here. The current grouping (registry functions, then request models, then response models) is intentional and logical.

Consider whether ErrorResponse (line 63, 140) should also be added to the RESPONSE_MODELS registry in registry.py, or if it's intentionally a general-purpose error model not tied to any specific evaluator slug.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9290933 and a808557.

⛔ Files ignored due to path filters (1)
  • packages/traceloop-sdk/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • packages/traceloop-sdk/pyproject.toml
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py
  • scripts/codegen/generate_evaluator_models.py
  • scripts/generate-models.sh
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py
  • scripts/codegen/generate_evaluator_models.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py
🧠 Learnings (1)
📚 Learning: 2025-08-22T14:41:26.962Z
Learnt from: prane-eth
Repo: traceloop/openllmetry PR: 3336
File: packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/utils.py:8-8
Timestamp: 2025-08-22T14:41:26.962Z
Learning: In the openllmetry project, the `packaging` library is available as a transitive dependency through other packages (visible in poetry.lock) and doesn't need to be explicitly declared in pyproject.toml dependencies.

Applied to files:

  • packages/traceloop-sdk/pyproject.toml
🧬 Code graph analysis (3)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py (33)
  • AgentEfficiencyRequest (11-17)
  • AgentFlowQualityRequest (20-34)
  • AgentGoalAccuracyRequest (37-47)
  • AgentGoalCompletenessRequest (50-59)
  • AgentToolErrorDetectorRequest (62-71)
  • AnswerCompletenessRequest (74-77)
  • AnswerCorrectnessRequest (80-83)
  • AnswerRelevancyRequest (86-88)
  • CharCountRatioRequest (91-95)
  • CharCountRequest (98-99)
  • ContextRelevanceRequest (102-110)
  • ConversationQualityRequest (113-123)
  • FaithfulnessRequest (126-136)
  • InstructionAdherenceRequest (139-148)
  • IntentChangeRequest (151-162)
  • JSONValidatorRequest (165-168)
  • PIIDetectorRequest (171-178)
  • PerplexityRequest (181-182)
  • PlaceholderRegexRequest (185-193)
  • ProfanityDetectorRequest (196-197)
  • PromptInjectionRequest (200-202)
  • PromptPerplexityRequest (205-206)
  • RegexValidatorRequest (209-215)
  • SQLValidatorRequest (218-219)
  • SecretsDetectorRequest (222-225)
  • SemanticSimilarityRequest (228-230)
  • SexismDetectorRequest (233-238)
  • ToneDetectionRequest (241-242)
  • TopicAdherenceRequest (245-255)
  • ToxicityDetectorRequest (258-260)
  • UncertaintyDetectorRequest (263-266)
  • WordCountRatioRequest (269-273)
  • WordCountRequest (276-277)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (33)
  • AgentEfficiencyResponse (11-19)
  • AgentFlowQualityResponse (22-27)
  • AgentGoalAccuracyResponse (30-31)
  • AgentGoalCompletenessResponse (34-37)
  • AgentToolErrorDetectorResponse (40-44)
  • AnswerCompletenessResponse (47-48)
  • AnswerCorrectnessResponse (51-52)
  • AnswerRelevancyResponse (55-56)
  • CharCountRatioResponse (59-60)
  • CharCountResponse (63-64)
  • ContextRelevanceResponse (67-68)
  • ConversationQualityResponse (71-72)
  • FaithfulnessResponse (79-80)
  • InstructionAdherenceResponse (83-84)
  • IntentChangeResponse (87-92)
  • JSONValidatorResponse (95-96)
  • PIIDetectorResponse (99-100)
  • PerplexityResponse (103-104)
  • PlaceholderRegexResponse (107-108)
  • ProfanityDetectorResponse (111-112)
  • PromptInjectionResponse (115-116)
  • PromptPerplexityResponse (119-120)
  • RegexValidatorResponse (123-124)
  • SQLValidatorResponse (127-128)
  • SecretsDetectorResponse (131-132)
  • SemanticSimilarityResponse (135-136)
  • SexismDetectorResponse (139-140)
  • ToneDetectionResponse (143-145)
  • TopicAdherenceResponse (148-149)
  • ToxicityDetectorResponse (152-153)
  • UncertaintyDetectorResponse (156-158)
  • WordCountRatioResponse (161-162)
  • WordCountResponse (165-166)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py (3)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py (1)
  • AgentEfficiencyRequest (11-17)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py (2)
  • get_request_model (161-163)
  • get_response_model (166-168)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (1)
  • AgentEfficiencyResponse (11-19)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)
  • char_count (183-196)
  • word_count (221-233)
🪛 Ruff (0.14.10)
scripts/codegen/generate_evaluator_models.py

42-42: Loop control variable method not used within loop body

Rename unused method to _method

(B007)


299-299: subprocess call: check for execution of untrusted input

(S603)


300-310: Starting a process with a partial executable path

(S607)

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py

33-33: Avoid specifying long messages outside the exception class

(TRY003)

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py

87-162: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (7)
packages/traceloop-sdk/pyproject.toml (1)

86-86: LGTM!

Adding datamodel-code-generator as a dev dependency is appropriate since it's only used during code generation and not at runtime. This keeps the production dependency footprint minimal.

scripts/generate-models.sh (1)

1-35: LGTM!

The shell script follows best practices:

  • Uses strict mode (set -euo pipefail) for early failure detection
  • Properly validates input arguments and file existence
  • Correctly resolves script directory using BASH_SOURCE for portability
  • All variable expansions are quoted to prevent word splitting
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)

117-118: Good placement of input validation.

Validation is correctly placed at the start of both run_experiment_evaluator and trigger_experiment_evaluator, ensuring inputs are validated before building requests or making API calls. This follows the fail-fast principle.

Also applies to: 161-162

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py (1)

1-168: LGTM!

The generated registry module is well-structured:

  • Clear "DO NOT EDIT MANUALLY" warning with regeneration instructions
  • Consistent alphabetical ordering of slug mappings
  • Simple, effective helper functions for model lookup
  • Type annotations are correct (Dict[str, Type[BaseModel]])
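As a rough illustration of the registry shape described above (the request model and its field are hypothetical; the real file maps every evaluator slug to its generated request/response models):

from typing import Dict, Optional, Type

from pydantic import BaseModel


class PIIDetectorRequestSketch(BaseModel):
    text: str  # hypothetical field, for illustration only


REQUEST_MODELS: Dict[str, Type[BaseModel]] = {
    "pii-detector": PIIDetectorRequestSketch,
}


def get_request_model(slug: str) -> Optional[Type[BaseModel]]:
    return REQUEST_MODELS.get(slug)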
scripts/codegen/generate_evaluator_models.py (1)

298-315: LGTM - subprocess usage is appropriate for dev tooling.

The subprocess call uses a hardcoded command (datamodel-codegen) with validated file inputs. The check=True parameter ensures failures are raised as exceptions, and the finally block properly cleans up temporary files. This is safe for a code generation script.

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (1)

1-166: LGTM!

The auto-generated response models are well-structured:

  • Correct use of Optional fields with defaults for response flexibility
  • Proper handling of Python reserved keyword pass using alias='pass' in IntentChangeResponse
  • Effective use of inheritance to reduce duplication (RegexValidatorResponse, ToxicityDetectorResponse)
  • Consistent use of Field() with examples for documentation
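The alias='pass' point above can be illustrated with a small hedged sketch; the field name and type here are approximate rather than copied from the generated file:

from typing import Optional

from pydantic import BaseModel, Field


class IntentChangeResponseSketch(BaseModel):
    # "pass" is a Python reserved keyword, so the generated model exposes the
    # JSON field "pass" through an aliased Python attribute instead.
    pass_: Optional[bool] = Field(None, alias="pass")


# Parsing still uses the wire name: IntentChangeResponseSketch(**{"pass": True})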
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py (1)

1-8: LGTM - Clean generated file structure.

The file header correctly identifies this as auto-generated code, and imports are appropriate for Pydantic model definitions. No security concerns.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed a8a77fd in 1 minute and 8 seconds.
  • Reviewed 392 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/.flake8:12
  • Draft comment:
    Added per-file ignore for E501 on generated evaluator files. Ensure this intentional ignore doesn't hide style issues if generated code changes.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 30% <= threshold 50% The comment is asking the author to ensure that the ignore doesn't hide style issues, which is a form of asking for confirmation. This violates the rule against asking the PR author to ensure behavior is intended. However, it does point out a potential issue with ignoring style checks on generated files, which could be useful if phrased differently.
2. packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py:13
  • Draft comment:
    Standardized example strings to use double quotes for consistency across models.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py:11
  • Draft comment:
    Consistent update of example strings to double quotes in responses; changes look purely stylistic and improve uniformity.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, as it only praises the consistency of using double quotes without providing any actionable feedback or suggestions. It doesn't ask for any confirmation or suggest any improvements.
4. packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py:109
  • Draft comment:
    It looks like 'gpt-4o' might be a typographical error. Did you mean 'gpt-4'?
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_3oaQT6TeYzm4Zj0T



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py (1)

1-277: Past review: Flake8 line length violations require generator configuration update.

A previous review identified 23 Flake8 E501 line-length violations in this file. The recommendation was to update the generator configuration in scripts/codegen/generate_evaluator_models.py to enforce --max-line-length 79 and regenerate the models.

Since this is auto-generated code, please ensure the generator configuration is updated to prevent these violations in future regenerations. This should be addressed together with the response.py file to maintain consistency across all generated models.

🧹 Nitpick comments (3)
packages/traceloop-sdk/.flake8 (1)

12-14: LGTM! Good configuration for generated code.

The multi-line format improves readability, and ignoring E501 (line too long) for auto-generated evaluator models is a pragmatic approach since code generators typically produce long lines that are difficult to control.

Optional: Consider whether additional ignores may be needed for generated code, such as F401 (unused imports) or W503 (line break before binary operator), depending on what the code generator produces. However, the current configuration may be sufficient.

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (2)

113-115: Inheritance pattern creates alias without additional functionality.

RegexValidatorResponse inherits from PlaceholderRegexResponse with no additional fields or methods. While valid, this pattern simply creates an alias. Consider whether this is intentional or if the models should be merged in the source specification.

If these are meant to be identical, the source OpenAPI spec could reference the same schema to avoid duplication. If they're meant to diverge in the future, the current structure is appropriate.


142-144: Inheritance pattern creates alias without additional functionality.

ToxicityDetectorResponse inherits from SexismDetectorResponse with no additional fields. Same consideration as RegexValidatorResponse above.

Review the source API specification to determine if these should share a schema definition or remain separate for future extensibility.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a808557 and a8a77fd.

📒 Files selected for processing (3)
  • packages/traceloop-sdk/.flake8
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py
  • packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (1)

1-156: E501 violations are intentionally ignored for generated evaluator models.

The file is already configured for Flake8 compliance. The packages/traceloop-sdk/setup.cfg has per-file-ignores = traceloop/sdk/evaluators_generated/*.py:E501, which explicitly excludes line-length violations (E501) for all files in this generated directory. No changes to the datamodel-codegen script or the response.py file are needed.

Comment on lines 105 to 106
class PromptInjectionResponse(BaseModel):
has_injection: Optional[str] = Field(None, examples=["safe"])


⚠️ Potential issue | 🟡 Minor

Type mismatch: field name suggests boolean but typed as string.

The field has_injection is typed as Optional[str] with example "safe", but the name suggests it should be a boolean. This likely reflects the source API specification, but it may cause confusion for developers expecting a boolean return value.

Verify whether the API actually returns a string or if this should be a boolean in the OpenAPI spec. If the API spec is correct, consider adding a docstring or comment to clarify that this field returns a string status (e.g., "safe", "unsafe") rather than a boolean.

🤖 Prompt for AI Agents
In packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py around
lines 105-106, the field has_injection is declared as Optional[str] but its name
implies a boolean; verify the OpenAPI source to see whether the API actually
returns a boolean or a status string. If the API returns a boolean, change the
field type to Optional[bool] and update the Field example and any related
serializers/tests accordingly; if the API returns a status string, keep the type
but add a concise docstring or inline comment describing allowed values (e.g.,
"safe"/"unsafe"/"unknown") and update the Field examples to reflect valid status
strings so callers aren’t confused. Ensure generated types and unit tests are
updated to match the chosen correction.

Comment on lines 129 to 130
class SexismDetectorResponse(BaseModel):
is_safe: Optional[str] = Field(None, examples=["safe"])


⚠️ Potential issue | 🟡 Minor

Type mismatch: field name suggests boolean but typed as string.

Similar to has_injection above, the field is_safe is typed as Optional[str] with example "safe", but the name convention suggests a boolean. This naming could mislead developers.

Verify the source API specification to confirm the return type. If a string is correct, consider documenting this in the source spec or adding clarifying comments.

🤖 Prompt for AI Agents
In packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py around
lines 129-130, the field is_safe is declared Optional[str] but its name implies
a boolean; verify the API spec and if the endpoint returns a boolean change the
type to Optional[bool] and update the Field examples to True/False accordingly
(and regenerate/adjust any type hints/usages), otherwise if the API actually
returns string values keep the str type but rename the field to is_safe_label or
add a clarifying comment/docstring and update the spec to avoid confusion.

@galzilber changed the title from "fix(evals): auto generate evals by the code" to "fix(evals): auto generate evals" on Dec 30, 2025
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 0265e2a in 2 minutes and 32 seconds.
  • Reviewed 847 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py:3
  • Draft comment:
    Import of create_evaluator and its addition to __all__ is clear and improves the module interface.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:60
  • Draft comment:
    In create_evaluator, the config is filtered but defaults from _get_config_fields aren’t merged. Consider merging defaults so unset options can automatically use default values.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% This is a code quality suggestion about merging defaults. However, I need to consider: 1) Is this actually a problem? The config is passed to EvaluatorDetails which might handle defaults elsewhere, or the backend might handle them. 2) The old code (before the diff) explicitly set defaults in each factory method (e.g., probability_threshold: float = 0.5), so users could omit parameters and get defaults. 3) The new code uses **config: Any which means users must explicitly pass values or they won't be in the config dict at all. 4) The comment is speculative - it says "Consider merging" which is a suggestion, not pointing out a definite bug. 5) Without seeing how EvaluatorDetails or the backend handles missing config values, we can't be sure this is needed. The comment is speculative ("Consider merging") rather than pointing out a definite issue. We don't know if missing config values are actually a problem - the backend or EvaluatorDetails might handle defaults. The old code had explicit defaults in function signatures, but we don't know if removing them breaks anything or if it's intentional. This requires cross-file context to verify. While the comment makes a reasonable suggestion, it's speculative and we lack evidence that this is actually causing issues. The refactoring from explicit factory methods to dynamic generation might intentionally rely on backend defaults. Without seeing test failures or knowing the full system behavior, we can't confirm this is needed. This is a speculative code quality suggestion without strong evidence of an actual problem. The comment uses "Consider" language and we lack context about whether missing config values are handled elsewhere. Following the rule to only keep comments with strong evidence of correctness, this should be deleted.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:73
  • Draft comment:
    The dynamic factory generation in getattr is neat. Optionally, consider caching default config values (from _get_config_fields) if repeated lookups become a performance concern.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None
4. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:84
  • Draft comment:
    In __dir__, dynamic addition of evaluator method names is implemented. Ensure that potential duplicates with inherited attributes are acceptable.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None

Workflow ID: wflow_x2knBzUFK7yWhbup



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (3)

42-67: Consider: Validate config keys against model fields.

The function accepts arbitrary **config parameters but doesn't verify that the keys match the request model's fields. This could lead to typos or invalid parameters being silently ignored by the evaluator backend.

Consider validating config keys or at minimum logging a warning for unrecognized parameters.

🔎 Example validation approach
 def create_evaluator(slug: str, **config: Any) -> EvaluatorDetails:
     """Create an EvaluatorDetails for the given slug with optional config.
 
     Args:
         slug: The evaluator slug (e.g., "pii-detector")
         **config: Configuration options for the evaluator
 
     Returns:
         EvaluatorDetails configured for the specified evaluator
 
     Example:
         >>> from traceloop.sdk.evaluator import create_evaluator
         >>> evaluator = create_evaluator("pii-detector", probability_threshold=0.8)
     """
     if slug not in REQUEST_MODELS:
         available = ", ".join(sorted(REQUEST_MODELS.keys()))
         raise ValueError(f"Unknown evaluator slug: '{slug}'. Available: {available}")
-
+    
+    # Validate config keys against model fields
+    model = REQUEST_MODELS[slug]
+    valid_fields = set(model.model_fields.keys())
+    invalid_keys = set(config.keys()) - valid_fields
+    if invalid_keys:
+        raise ValueError(f"Invalid config keys for '{slug}': {invalid_keys}. Valid keys: {valid_fields}")
+    
     # Remove None values from config
     config = {k: v for k, v in config.items() if v is not None}
     return EvaluatorDetails(
         slug=slug,
         version=None,
         config=config,
         required_input_fields=_get_required_fields(slug),
     )

58-58: Nitpick: Long error message in exception.

Ruff suggests avoiding long messages directly in exception raises. Consider extracting to a variable or creating a custom exception class for better maintainability.

🔎 Suggested refactor
     if slug not in REQUEST_MODELS:
         available = ", ".join(sorted(REQUEST_MODELS.keys()))
-        raise ValueError(f"Unknown evaluator slug: '{slug}'. Available: {available}")
+        message = f"Unknown evaluator slug: '{slug}'. Available: {available}"
+        raise ValueError(message)

Based on static analysis (Ruff TRY003).


82-82: Enhance error message with available options.

The error message could be more helpful by listing available evaluators, similar to line 58. Also, consider extracting the message to a variable per Ruff TRY003.

🔎 Suggested improvement
-        raise AttributeError(f"'{cls.__name__}' has no attribute '{name}'")
+        available = ", ".join(sorted(_slug_to_method_name(s) for s in REQUEST_MODELS.keys()))
+        message = f"'{cls.__name__}' has no attribute '{name}'. Available evaluators: {available}"
+        raise AttributeError(message)

Based on static analysis (Ruff TRY003).

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a8a77fd and 0265e2a.

📒 Files selected for processing (2)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
  • create_evaluator (42-67)
🪛 Ruff (0.14.10)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py

58-58: Avoid specifying long messages outside the exception class

(TRY003)


82-82: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.11)
🔇 Additional comments (5)
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)

3-3: LGTM: Clean public API expansion.

The new create_evaluator function is properly imported and exported, enabling the registry-driven evaluator creation pattern described in the PR objectives.

Also applies to: 9-9

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)

12-39: LGTM: Well-structured helper functions.

The helper functions appropriately extract metadata from the registry models. The defensive returns (empty list/dict) are safe since create_evaluator validates slug existence before these are called.


70-89: Clever metaclass pattern with IDE support tradeoff.

The dynamic method generation via __getattr__ and __dir__ is a clean implementation that enables registry-driven evaluator creation. The tradeoff is reduced IDE autocomplete and static type checking support for the generated methods.

This is acceptable given the flexibility benefits, but users may need to reference documentation or use dir(EvaluatorMadeByTraceloop) to discover available evaluators.
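A condensed, hedged sketch of the __getattr__/__dir__ pattern described here, reconstructed from the review excerpts rather than copied from the PR; names, fields, and return values are simplified:

from typing import Any, Dict, List, Type

from pydantic import BaseModel


class PIIDetectorRequestSketch(BaseModel):
    probability_threshold: float = 0.5  # illustrative config field


REQUEST_MODELS: Dict[str, Type[BaseModel]] = {"pii-detector": PIIDetectorRequestSketch}


def _slug_to_method_name(slug: str) -> str:
    return slug.replace("-", "_")


def _method_name_to_slug(name: str) -> str:
    return name.replace("_", "-")


class _EvaluatorMetaSketch(type):
    def __getattr__(cls, name: str) -> Any:
        # Attribute lookup such as EvaluatorSketch.pii_detector lands here when
        # the name is not defined statically on the class.
        slug = _method_name_to_slug(name)
        if slug in REQUEST_MODELS:
            def factory(**config: Any) -> Dict[str, Any]:
                # The real factory returns EvaluatorDetails; a dict stands in here.
                return {"slug": slug, "config": {k: v for k, v in config.items() if v is not None}}
            factory.__name__ = name
            return factory
        raise AttributeError(f"'{cls.__name__}' has no attribute '{name}'")

    def __dir__(cls) -> List[str]:
        # Expose the dynamic factory names alongside inherited attributes.
        return list(super().__dir__()) + [_slug_to_method_name(s) for s in REQUEST_MODELS]


class EvaluatorSketch(metaclass=_EvaluatorMetaSketch):
    pass


# EvaluatorSketch.pii_detector(probability_threshold=0.8)
# -> {"slug": "pii-detector", "config": {"probability_threshold": 0.8}}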


92-115: LGTM: Clear documentation for dynamic factory class.

The comprehensive docstring and examples effectively document the dynamic evaluator creation pattern, helping users understand the registry-driven approach despite limited IDE support for autocomplete.


60-61: No issues found. The None filtering in config (lines 60-61) is intentional and correct behavior. All optional config parameters in the request models default to None, so filtering explicit None values is a valid cleanup pattern that doesn't prevent any meaningful use cases.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 436e856 in 1 minute and 12 seconds.
  • Reviewed 36 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:7
  • Draft comment:
    Consider verifying the import order. The new addition of 'from ..evaluators_generated import REQUEST_MODELS' and repositioning of 'EvaluatorDetails' might benefit from following a standard order (e.g. standard library, then third-party, then local) to improve clarity.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:83
  • Draft comment:
    Consider formatting the 'config_fields' output in the factory method's docstring. Instead of embedding a raw list (or 'none'), joining the keys into a comma‐separated string could enhance readability.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None

Workflow ID: wflow_ygFrcNJJqi5hIF5K



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (3)

57-59: Consider using a custom exception or shorter message.

Ruff flags this as a long exception message. Consider extracting to a custom exception class or shortening the message.

🔎 Proposed refactor using a custom exception
+class UnknownEvaluatorError(ValueError):
+    """Raised when an unknown evaluator slug is provided."""
+    pass
+
 def create_evaluator(slug: str, **config: Any) -> EvaluatorDetails:
     """Create an EvaluatorDetails for the given slug with optional config.
 
     Args:
         slug: The evaluator slug (e.g., "pii-detector")
         **config: Configuration options for the evaluator
 
     Returns:
         EvaluatorDetails configured for the specified evaluator
 
     Example:
         >>> from traceloop.sdk.evaluator import create_evaluator
         >>> evaluator = create_evaluator("pii-detector", probability_threshold=0.8)
     """
     if slug not in REQUEST_MODELS:
         available = ", ".join(sorted(REQUEST_MODELS.keys()))
-        raise ValueError(f"Unknown evaluator slug: '{slug}'. Available: {available}")
+        raise UnknownEvaluatorError(f"Unknown evaluator slug: '{slug}'. Available: {available}")

83-84: Consider clarifying the config_fields display logic.

The expression list(_get_config_fields(slug).keys()) or "none" produces either a list or the string "none", which is type-inconsistent. While this only affects the docstring display, it could be clearer.

🔎 Proposed refactor for clearer display logic
             factory.__name__ = name
-            config_fields = list(_get_config_fields(slug).keys()) or "none"
-            factory.__doc__ = f"Create {slug} evaluator. Config fields: {config_fields}"
+            config_keys = list(_get_config_fields(slug).keys())
+            config_display = ", ".join(config_keys) if config_keys else "none"
+            factory.__doc__ = f"Create {slug} evaluator. Config fields: {config_display}"
             return factory

86-86: Consider using a shorter exception message.

Ruff flags this as a long exception message. The error message could be more concise.

🔎 Proposed refactor with shorter message
-        raise AttributeError(f"'{cls.__name__}' has no attribute '{name}'")
+        raise AttributeError(f"Unknown evaluator method: '{name}'")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0265e2a and 436e856.

📒 Files selected for processing (1)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
🪛 Ruff (0.14.10)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py

59-59: Avoid specifying long messages outside the exception class

(TRY003)


86-86: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Build Packages (3.11)
🔇 Additional comments (7)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (7)

13-30: LGTM - Helper functions correctly extract field information.

The helper functions properly introspect Pydantic model fields to extract required fields and config fields with defaults. The empty return values when the model is not found are acceptable since create_evaluator validates the slug separately.


33-40: LGTM - Conversion logic is simple and appropriate.

The slug/method name conversion correctly handles kebab-case to snake_case transformation. This should work well as long as evaluator slugs follow the kebab-case convention (e.g., "pii-detector") and don't contain underscores.


62-62: Verify None-filtering behavior is intentional.

The code filters out None values from the config dict. Confirm this is the desired behavior, as it prevents users from explicitly passing None to unset optional config values.


65-65: Version parameter is hardcoded to None.

The version parameter is always set to None when creating EvaluatorDetails. Verify whether this is temporary or if version support will be added later.


88-93: LGTM - __dir__ implementation enables proper introspection.

The __dir__ override correctly combines parent class methods with dynamically generated evaluator methods, enabling IDE autocomplete and proper dir() introspection.


96-120: LGTM - Well-documented class with metaclass-driven functionality.

The class definition is clean with a comprehensive docstring that provides clear usage examples and lists available evaluators. The empty body is appropriate since all functionality is provided by the metaclass.


9-9: No action required. REQUEST_MODELS is properly populated at module load time with 32+ evaluator models and is safely handled throughout the codebase with clear error messages listing available evaluators when an unknown slug is provided.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 54c3c66 in 51 seconds.
  • Reviewed 22 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:74
  • Draft comment:
    Added return type hint '-> Any' on __getattr__. Consider using a more specific Callable type (e.g., Callable[..., EvaluatorDetails]) if possible for stronger type safety.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:88
  • Draft comment:
    Added explicit return type '-> List[str]' on __dir__, which improves clarity for consumers using the metaclass.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_ta6L6BMAVbc41MQh



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

83-84: Inconsistent type in docstring generation.

config_fields is either a list or the string "none", making the docstring format inconsistent:

  • With fields: "Config fields: ['field1', 'field2']"
  • Without: "Config fields: none"
🔎 Suggested fix for consistent formatting
-            config_fields = list(_get_config_fields(slug).keys()) or "none"
-            factory.__doc__ = f"Create {slug} evaluator. Config fields: {config_fields}"
+            config_fields = list(_get_config_fields(slug).keys())
+            fields_str = ", ".join(config_fields) if config_fields else "none"
+            factory.__doc__ = f"Create {slug} evaluator. Config fields: {fields_str}"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 436e856 and 54c3c66.

📒 Files selected for processing (1)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
🪛 Ruff (0.14.10)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py

59-59: Avoid specifying long messages outside the exception class

(TRY003)


86-86: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Analyze (python)
🔇 Additional comments (6)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (6)

1-10: LGTM!

Imports are clean and appropriate for the dynamic factory pattern implementation.


13-18: LGTM!

The silent fallback to an empty list is acceptable since create_evaluator validates the slug before this function is called.


21-30: LGTM!

Clean extraction of optional config fields with their defaults.


43-68: LGTM with a minor note.

The implementation is clean and provides a good user experience with descriptive error messages. The version parameter is hardcoded to None—if evaluator versioning becomes relevant, consider exposing it as an optional parameter.


96-120: LGTM!

The metaclass-based dynamic factory is a clean design that eliminates boilerplate. The docstring appropriately guides users to dir(EvaluatorMadeByTraceloop) for the complete list of available evaluators.


38-40: No issues found. All slugs in REQUEST_MODELS use hyphens (e.g., "pii-detector", "json-validator"), not underscores. The conversion functions _slug_to_method_name() and _method_name_to_slug() are properly bidirectional for all actual evaluator slugs in the registry.

packages/traceloop-sdk/.flake8

-per-file-ignores = __init__.py:F401
+per-file-ignores =
+    __init__.py:F401
+    traceloop/sdk/evaluators_generated/*.py:E501

Member

Suggested change
-    traceloop/sdk/evaluators_generated/*.py:E501
+    traceloop/sdk/generated/**/*.py:E501
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed b1e4b30 in 1 minute and 39 seconds. Click for details.
  • Reviewed 488 lines of code in 8 files
  • Skipped 0 files when reviewing.
  • Skipped posting 9 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/.flake8:14
  • Draft comment:
    Updated exclusion pattern to use 'generated/**' instead of 'evaluators_generated'; ensure the glob pattern covers all intended nested files.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the author to ensure that the glob pattern covers all intended nested files. This falls under the category of asking the author to double-check or ensure something, which is against the rules. Therefore, this comment should be removed.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:15
  • Draft comment:
    Refactored import to use the new generated evaluator module location.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states what was done without offering any insight or guidance.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:9
  • Draft comment:
    Updated reference to the evaluator registry; now importing from the new 'generated/evaluators' module.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, as it only states what has been done without providing any actionable feedback or suggestions. It doesn't ask for confirmation or suggest any improvements.
4. packages/traceloop-sdk/traceloop/sdk/evaluator/model.py:55
  • Draft comment:
    Updated example import in docstring to reference new generated evaluators module.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, as it only states that an example import in a docstring was updated to reference a new module. It doesn't provide any actionable feedback or suggestions for improvement.
5. packages/traceloop-sdk/traceloop/sdk/generated/__init__.py:1
  • Draft comment:
    New __init__.py correctly consolidates generated evaluator modules. Make sure to update documentation to reflect the new module paths.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to update documentation, which is not allowed according to the rules. It doesn't provide a specific code suggestion or point out a potential issue with the code itself.
6. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py:12
  • Draft comment:
    Auto-generated evaluator request models now use consistent single quotes in examples. Verify consistency with project style guidelines.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
7. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py:11
  • Draft comment:
    Auto-generated evaluator response models have updated examples with consistent quoting. Confirm that the field types and examples match expected API responses.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to confirm that the field types and examples match expected API responses. This falls under the rule of not asking the author to confirm their intention or to double-check things. Therefore, this comment should be removed.
8. packages/traceloop-sdk/scripts/generate-models.sh:17
  • Draft comment:
    OUTPUT_DIR updated to the new 'generated/evaluators' path; verify that the new directory structure aligns with the rest of the project.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
9. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:12
  • Draft comment:
    Line 12 appears to contain an extraneous ")". It doesn't seem to match any opening statement. Please verify if this is intentional or remove it if it's a typographical error.
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_UNmXKLhLPLh2IlO9

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)

18-33: LGTM!

The validation logic correctly validates input against the evaluator's request model schema. The error chaining preserves context, and the silent skip when no model exists is sensible for backward compatibility.

Optional: Address Ruff TRY003 hint

Ruff suggests avoiding long messages in exception constructors. If you prefer tighter style compliance, consider extracting to a custom exception class:

+class EvaluatorInputValidationError(ValueError):
+    """Raised when evaluator input fails validation."""
+    def __init__(self, slug: str, validation_error: ValidationError):
+        super().__init__(f"Invalid input for '{slug}': {validation_error}")
+
+
 def _validate_evaluator_input(slug: str, input: Dict[str, str]) -> None:
     """Validate input against the evaluator's request model if available.
     ...
     """
     request_model = get_request_model(slug)
     if request_model:
         try:
             request_model(**input)
         except ValidationError as e:
-            raise ValueError(f"Invalid input for '{slug}': {e}") from e
+            raise EvaluatorInputValidationError(slug, e) from e
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

43-68: LGTM!

The function correctly validates the evaluator slug and constructs EvaluatorDetails with dynamically derived required fields. The None-filtering in config (line 62) ensures model defaults are used when callers pass explicit None values.

Optional: Address Ruff TRY003 hint

Similar to the earlier validation function, Ruff suggests avoiding long exception messages. You could extract to a custom exception if desired:

+class UnknownEvaluatorError(ValueError):
+    """Raised when an unknown evaluator slug is requested."""
+    def __init__(self, slug: str, available: str):
+        super().__init__(f"Unknown evaluator slug: '{slug}'. Available: {available}")
+
+
 def create_evaluator(slug: str, **config: Any) -> EvaluatorDetails:
     """..."""
     if slug not in REQUEST_MODELS:
         available = ", ".join(sorted(REQUEST_MODELS.keys()))
-        raise ValueError(f"Unknown evaluator slug: '{slug}'. Available: {available}")
+        raise UnknownEvaluatorError(slug, available)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 54c3c66 and b1e4b30.

⛔ Files ignored due to path filters (5)
  • packages/traceloop-sdk/traceloop/sdk/generated/__init__.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/__init__.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/registry.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py is excluded by !**/generated/**
  • packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py is excluded by !**/generated/**
📒 Files selected for processing (5)
  • packages/traceloop-sdk/.flake8
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
  • scripts/generate-models.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/traceloop-sdk/.flake8
  • scripts/generate-models.sh
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
🧬 Code graph analysis (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
packages/traceloop-sdk/traceloop/sdk/client/http.py (1)
  • get (40-60)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/registry.py (1)
  • get_request_model (161-163)
🪛 Ruff (0.14.10)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py

59-59: Avoid specifying long messages outside the exception class

(TRY003)


86-86: Avoid specifying long messages outside the exception class

(TRY003)

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py

33-33: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Lint
  • GitHub Check: Build Packages (3.11)
🔇 Additional comments (8)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)

3-3: LGTM!

The new imports support input validation and are used appropriately in the new _validate_evaluator_input helper.

Also applies to: 15-15


117-118: LGTM!

Input validation is correctly placed at the start of both execution methods, ensuring inputs are valid before building requests or triggering evaluator execution.

Also applies to: 161-162

packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (2)

2-2: LGTM!

Standard TypeVar pattern for enabling generic, type-safe result parsing in the new typed_result method.

Also applies to: 5-6


48-63: LGTM!

Clean convenience method that enables type-safe result parsing. The implementation is straightforward, and the documentation clearly illustrates the usage pattern. Pydantic's ValidationError will naturally propagate if the result doesn't match the model schema, which is appropriate behavior for the caller to handle.
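As a usage sketch (the result model and its fields are hypothetical; `response` stands for an ExecutionResponse returned by the SDK):

from typing import List

from pydantic import BaseModel

class PiiFindings(BaseModel):
    # Hypothetical shape of a PII evaluator's result payload
    has_pii: bool
    entities: List[str] = []

# findings = response.typed_result(PiiFindings)
# if findings.has_pii:
#     print(findings.entities)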

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)

1-10: LGTM!

The updated docstring clearly describes the dynamic factory approach, and the imports correctly reference the registry and configuration models.


13-30: LGTM!

The helper functions correctly introspect Pydantic models to extract required fields and configuration defaults using standard Pydantic APIs. Graceful handling of missing models is appropriate.


33-40: LGTM!

Clean bidirectional conversion between slug and method name formats. The simple string replacement approach is appropriate for this use case.


96-120: LGTM!

Excellent documentation with clear examples and guidance. The clean separation of concerns (metaclass handles behavior, class provides interface) makes the dynamic factory pattern easy to understand and use.
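For illustration, both access paths end up in create_evaluator; slugs are resolved from snake_case method names. Evaluator-specific config keyword arguments are omitted here:

from traceloop.sdk.evaluator.evaluators_made_by_traceloop import (
    EvaluatorMadeByTraceloop,
    create_evaluator,
)

# Dynamic attribute access: pii_detector -> slug "pii-detector"
evaluator = EvaluatorMadeByTraceloop.pii_detector()

# Equivalent explicit entry point
evaluator = create_evaluator("pii-detector")

# Discover the available factory methods
print([m for m in dir(EvaluatorMadeByTraceloop) if not m.startswith("_")])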

Comment on lines +71 to +94
class _EvaluatorMadeByTraceloopMeta(type):
    """Metaclass that dynamically generates evaluator factory methods."""

    def __getattr__(cls, name: str) -> Any:
        """Dynamically create factory methods for any evaluator slug."""
        slug = _method_name_to_slug(name)
        if slug in REQUEST_MODELS:

            def factory(**config: Any) -> EvaluatorDetails:
                return create_evaluator(slug, **config)

            factory.__name__ = name
            config_fields = list(_get_config_fields(slug).keys()) or "none"
            factory.__doc__ = f"Create {slug} evaluator. Config fields: {config_fields}"
            return factory
        raise AttributeError(f"'{cls.__name__}' has no attribute '{name}'")

    def __dir__(cls) -> List[str]:
        """List all available evaluator methods."""
        methods = list(super().__dir__())
        for slug in REQUEST_MODELS:
            methods.append(_slug_to_method_name(slug))
        return methods


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# First, let's verify the file exists and check the exact code at lines 71-94
echo "=== Checking code at lines 71-94 ==="
if [ -f "packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py" ]; then
    sed -n '71,94p' "packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py"
else
    echo "File not found"
fi

# Check for REQUEST_MODELS import
echo -e "\n=== Checking REQUEST_MODELS import ==="
head -30 "packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py"

Repository: traceloop/openllmetry

Length of output: 2089


🏁 Script executed:

#!/bin/bash

# Check if registry file exists and REQUEST_MODELS definition
echo "=== Searching for REQUEST_MODELS definition ==="
rg "REQUEST_MODELS" packages/traceloop-sdk/ -A 2 -B 2

# Look for the registry file
echo -e "\n=== Looking for registry or models file ==="
fd "registry" packages/traceloop-sdk/
fd "models" packages/traceloop-sdk/ | grep -E "\.(py|json)$" | head -10

Repository: traceloop/openllmetry

Length of output: 7234


🏁 Script executed:

#!/bin/bash

# Check for hardcoded API keys or secrets in the target file
echo "=== Checking for hardcoded secrets in target file ==="
rg -i "(api[_-]?key|secret|password|token|credential)" packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py

# Check for environment variable patterns
echo -e "\n=== Checking for env var usage pattern ==="
rg "os\.environ|getenv" packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py

Repository: traceloop/openllmetry

Length of output: 162


🏁 Script executed:

#!/bin/bash

# Check Flake8 compliance if available
echo "=== Checking Flake8 compliance ==="
if command -v flake8 &> /dev/null; then
    flake8 packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py --show-source
else
    echo "Flake8 not installed, checking for linting issues manually..."
    # Look for common issues
    rg "except.*:" packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py -A 1
fi

Repository: traceloop/openllmetry

Length of output: 3469


Fix Flake8 E501 violations before approval.

The metaclass implementation is sound and REQUEST_MODELS is properly populated at import time. However, the file has multiple Flake8 E501 (line too long) violations at lines 4, 18, 22, 55, 59, 84, and 107, which break the coding guideline requiring Flake8 compliance. They must be fixed by wrapping the long lines or using line continuation before this change can be approved.

🧰 Tools
🪛 Ruff (0.14.10)

86-86: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
around lines 71-94 (and elsewhere where flake8 flagged E501 at lines 4, 18, 22,
55, 59, 84, 107), several lines exceed the 88-character limit; break those long
lines by introducing intermediate variables, splitting long f-strings into
concatenated parts or using parentheses for implicit continuation, wrap long
list or dict expressions across multiple lines, or move complex expressions onto
separate lines so each line stays within the limit; ensure docstrings and long
assignment lines are wrapped, prefer parentheses for multi-line returns and keep
line length under 88 characters to resolve the E501 violations.
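For example, the flagged docstring assignment can be wrapped with parentheses and implicit f-string concatenation (a fragment; `factory`, `slug`, and `config_fields` come from the surrounding factory closure):

# Before: one line exceeding the configured limit
# factory.__doc__ = f"Create {slug} evaluator. Config fields: {config_fields}"

# After: implicit continuation keeps each physical line short
factory.__doc__ = (
    f"Create {slug} evaluator. "
    f"Config fields: {config_fields}"
)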

@galzilber galzilber merged commit a6d3fa9 into main Dec 30, 2025
12 checks passed