fix(evals): auto generate evals #3529
Conversation
📝 Walkthrough

This PR introduces automated code generation infrastructure for creating Pydantic models from OpenAPI/Swagger specifications, adds runtime input validation for evaluators, and refactors the evaluator factory to use a dynamic, registry-driven design instead of static methods.
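For illustration only (not part of the PR or this review): a self-contained sketch of the registry-driven pattern the walkthrough describes, where a slug-to-Pydantic-model registry drives both input validation and evaluator construction. The names REQUEST_MODELS, get_request_model, and create_evaluator mirror the ones discussed below; the toy PIIDetectorRequest fields and the dict returned by create_evaluator are assumptions made to keep the snippet standalone.

```python
from typing import Any, Dict, Optional, Type
from pydantic import BaseModel, ValidationError

class PIIDetectorRequest(BaseModel):  # toy stand-in for a generated request model
    text: str
    probability_threshold: Optional[float] = 0.5

# Slug -> request model registry (the PR generates this from the Swagger spec)
REQUEST_MODELS: Dict[str, Type[BaseModel]] = {"pii-detector": PIIDetectorRequest}

def get_request_model(slug: str) -> Optional[Type[BaseModel]]:
    return REQUEST_MODELS.get(slug)

def validate_input(slug: str, input: Dict[str, Any]) -> None:
    model = get_request_model(slug)
    if model is None:
        return  # unknown slugs are skipped, as noted in the review below
    try:
        model(**input)
    except ValidationError as e:
        raise ValueError(f"Invalid input for '{slug}': {e}") from e

def create_evaluator(slug: str, **config: Any) -> Dict[str, Any]:
    if slug not in REQUEST_MODELS:
        raise ValueError(f"Unknown evaluator slug: '{slug}'")
    # Drop explicit None values so model defaults apply
    return {"slug": slug, "config": {k: v for k, v in config.items() if v is not None}}

validate_input("pii-detector", {"text": "hello"})  # passes validation
print(create_evaluator("pii-detector", probability_threshold=0.8))
```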
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User as Developer
participant Script as generate-models.sh
participant Python as generate_evaluator_models.py
participant Swagger as Swagger JSON
participant Codegen as datamodel-codegen
participant Output as Generated Files
User->>Script: ./generate-models.sh swagger.json output/
Script->>Python: poetry run python generate_evaluator_models.py
Python->>Swagger: Load & parse JSON
Swagger-->>Python: Swagger definitions
Note over Python: Extract definitions,<br/>resolve $ref,<br/>build slug mappings
Python->>Output: Write registry.py
Python->>Output: Write temp schema.json
Python->>Codegen: Invoke with filtered schema
Codegen-->>Output: Generate Pydantic models
Python->>Output: Generate __init__.py
Python->>Output: Clean temp files
Output-->>User: Request/response models + registry
```
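For concreteness, a rough sketch of the generation step shown in the diagram above, under the assumption that the real generate_evaluator_models.py works roughly this way: load the Swagger JSON, derive slug-to-schema mappings, and hand a filtered schema to the datamodel-codegen CLI. The command is only printed here so the snippet stays self-contained; the in-memory swagger dict and the naive slug rule are assumptions.

```python
import json
from pathlib import Path

# Stand-in for the Swagger document the script would load from disk.
swagger = {
    "definitions": {
        "PIIDetectorRequest": {"type": "object", "properties": {"text": {"type": "string"}}},
        "PIIDetectorResponse": {"type": "object", "properties": {"has_pii": {"type": "boolean"}}},
    }
}

def to_slug(name: str, suffix: str) -> str:
    # Naive CamelCase -> kebab-case conversion; the real slug rules will differ.
    stem = name[: -len(suffix)]
    return "".join("-" + c.lower() if c.isupper() else c for c in stem).lstrip("-")

request_slugs = {to_slug(n, "Request"): n for n in swagger["definitions"] if n.endswith("Request")}
print(request_slugs)  # {'p-i-i-detector': 'PIIDetectorRequest'}

# Write the filtered schema and show the codegen invocation the script would run.
schema_path = Path("schema.json")
schema_path.write_text(json.dumps(swagger))
cmd = ["datamodel-codegen", "--input", str(schema_path),
       "--input-file-type", "jsonschema", "--output", "request.py"]
print("would run:", " ".join(cmd))
schema_path.unlink()  # clean up the temp file, mirroring the script's finally block
```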
```mermaid
sequenceDiagram
participant Client as Evaluator Client
participant Run as run_experiment_evaluator()
participant Validate as _validate_evaluator_input()
participant Registry as REQUEST_MODELS
participant Pydantic as Pydantic Validator
participant Build as Build Evaluator Request
Client->>Run: run(slug, input)
Run->>Validate: _validate_evaluator_input(slug, input)
Validate->>Registry: get_request_model(slug)
Registry-->>Validate: Request Model
Validate->>Pydantic: model(**input)
alt Validation Success
Pydantic-->>Validate: Validated instance
Validate-->>Run: Return (no error)
else Validation Failure
Pydantic-->>Validate: ValidationError
Validate-->>Run: Raise ValueError
Run-->>Client: Error
end
Run->>Build: Construct request
Build-->>Client: Response
```

```mermaid
sequenceDiagram
participant User as Code
participant EvalFactory as EvaluatorMadeByTraceloop
participant Meta as _EvaluatorMadeByTraceloopMeta
participant Registry as REQUEST_MODELS
participant CreateEval as create_evaluator()
User->>EvalFactory: EvaluatorMadeByTraceloop.pii_detector(...)
Note over EvalFactory: Metaclass intercepts<br/>attribute access
EvalFactory->>Meta: __getattr__('pii_detector')
Meta->>Registry: Lookup 'pii_detector' in slugs
alt Slug exists
Meta->>Meta: Generate method from slug
Meta-->>EvalFactory: Dynamic method
else Slug not found
Meta-->>EvalFactory: AttributeError
end
User->>CreateEval: create_evaluator(slug='pii_detector', config)
CreateEval->>Registry: get_request_model(slug)
CreateEval->>CreateEval: Validate slug,<br/>filter None config,<br/>resolve required fields
CreateEval-->>User: EvaluatorDetails
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
✨ Finishing touches
Important
Looks good to me! 👍
Reviewed everything up to a808557 in 2 minutes and 27 seconds.
- Reviewed 1280 lines of code in 9 files
- Skipped 1 file when reviewing
- Skipped posting 5 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/pyproject.toml:88
- Draft comment:
Consider moving 'datamodel-code-generator' to a dev dependency if it's only used during model generation.
- Reason this comment was not posted: Comment looked like it was already resolved.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:18
- Draft comment:
The '_validate_evaluator_input' function type hints input as Dict[str, str], but some evaluator models expect non-string types (e.g., float). Consider using Dict[str, Any].
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. The comment makes a reasonable point about type hints. If evaluator models can accept non-string types (like floats), then the type hint should reflect that. However, I notice that the calling methods (run_experiment_evaluator and trigger_experiment_evaluator) also have input: Dict[str, str] in their signatures. This suggests either: (1) the entire chain should be updated to use Dict[str, Any], or (2) the current Dict[str, str] is intentional. Without seeing the actual evaluator models or the get_request_model implementation, I cannot definitively say whether evaluators accept non-string types. The comment is speculative ("some evaluator models expect non-string types") without showing concrete evidence. This violates the rule about not making speculative comments. The comment assumes that evaluator models expect non-string types without providing concrete evidence from the codebase. The fact that multiple methods in the chain use Dict[str, str] suggests this might be intentional design. Without seeing actual evaluator model definitions or examples where non-string types are required, this is speculative. While the comment could be speculative, type hints should accurately reflect what the code accepts. If the validation function uses pydantic models that can accept various types, then Dict[str, str] would be too restrictive and could cause type checking issues. However, without strong evidence that this is actually a problem, the comment remains speculative. The comment is speculative without concrete evidence that evaluator models actually require non-string types. The consistent use of Dict[str, str] throughout the call chain suggests this might be intentional. Following the rule that comments should not be speculative ("If X, then Y is an issue"), this comment should be deleted.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/model.py:48
- Draft comment:
In 'typed_result()', consider using 'model.parse_obj(self.result)' for more robust parsing.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50%. The comment suggests using parse_obj() for "more robust parsing". However, I need to consider: 1) In Pydantic v2, parse_obj() is deprecated and replaced with model_validate(). 2) In Pydantic v1, parse_obj() and direct instantiation with **kwargs are functionally similar for dictionaries. 3) The comment doesn't explain what "more robust" means - both methods perform validation. 4) Without knowing the Pydantic version being used, this advice could be outdated or incorrect. 5) The current implementation model(**self.result) is a standard and valid way to instantiate Pydantic models. The comment is somewhat speculative about what would be "more robust" without clear evidence. I might be missing that parse_obj() could handle edge cases better in certain Pydantic versions, or that there are specific validation behaviors that differ between the two approaches. The comment might be based on legitimate Pydantic best practices that I'm not fully aware of. While there might be subtle differences, the comment doesn't provide specific evidence of why parse_obj() would be more robust in this context. Additionally, in modern Pydantic (v2), parse_obj() is deprecated, making this advice potentially outdated. The current implementation is standard and valid, and without clear evidence of a problem, this comment appears to be a speculative suggestion rather than addressing a definite issue. This comment should be deleted. It makes a speculative suggestion without explaining what specific robustness issue it addresses. The current implementation using model(**self.result) is a standard, valid approach for Pydantic model instantiation, and the suggested alternative might even be deprecated in newer Pydantic versions.
4. scripts/codegen/generate_evaluator_models.py:180
- Draft comment:
The regex-based extraction in 'generate_init_py' might be fragile if the generated file format changes; consider using an AST-based approach if this becomes an issue.
- Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%.
5. scripts/codegen/generate_evaluator_models.py:212
- Draft comment:
Typographical issue: In the multi-line string for the registry import in __init__.py (lines 207-214), there's an extra colon after get_response_model, before the closing parenthesis. This appears to be a typo; please remove the extra colon so that the closing parenthesis is correctly formatted.
- Reason this comment was not posted: Comment looked like it was already resolved.
Workflow ID: wflow_t6RIKriUveS7wogj
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 1
🧹 Nitpick comments (5)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
18-33: Validation silently skips unknown evaluator slugs. When get_request_model(slug) returns None (slug not in registry), validation is bypassed without warning. This could mask typos in evaluator slugs until the API call fails. Consider logging a warning or raising an error for unknown slugs to provide earlier feedback:
🔎 Optional: Add warning for unknown slugs
```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _validate_evaluator_input(slug: str, input: Dict[str, str]) -> None:
     """Validate input against the evaluator's request model if available.
     ...
     """
     request_model = get_request_model(slug)
-    if request_model:
+    if request_model is None:
+        logger.debug(f"No request model found for evaluator '{slug}', skipping validation")
+        return
+    try:
+        request_model(**input)
+    except ValidationError as e:
+        raise ValueError(f"Invalid input for '{slug}': {e}") from e
-        try:
-            request_model(**input)
-        except ValidationError as e:
-            raise ValueError(f"Invalid input for '{slug}': {e}") from e
```

packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (1)
48-63: Consider handling ValidationError in typed_result. If self.result doesn't conform to the provided model's schema, model(**self.result) will raise a Pydantic ValidationError. This is reasonable behavior, but you may want to document this in the docstring or wrap it with additional context:

🔎 Optional: Document or wrap ValidationError
```diff
 def typed_result(self, model: Type[T]) -> T:
     """Parse result into a typed Pydantic model.

     Args:
         model: The Pydantic model class to parse the result into

     Returns:
         An instance of the provided model class

+    Raises:
+        pydantic.ValidationError: If result doesn't match the model schema
+
     Example:
         from traceloop.sdk.evaluators_generated import PIIDetectorResponse

         result = await evaluator.run_experiment_evaluator(...)
         pii = result.typed_result(PIIDetectorResponse)
         print(pii.has_pii)  # IDE autocomplete works!
     """
     return model(**self.result)
```
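As a side note (not part of the review): a tiny standalone sketch of catching that ValidationError at the call site. PIIDetectorResponse here is a stand-in model, not the SDK's generated class, and the raw result dict is invented.

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class PIIDetectorResponse(BaseModel):  # stand-in for the generated response model
    has_pii: Optional[bool] = None

raw_result = {"has_pii": "not-a-bool"}  # a malformed evaluator result

try:
    typed = PIIDetectorResponse(**raw_result)  # same pattern typed_result() uses
except ValidationError as e:
    print(f"Result did not match PIIDetectorResponse: {e}")
```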
scripts/codegen/generate_evaluator_models.py (2)

42-42: Unused loop variable method. The loop variable method is not used within the loop body. Rename to _method to indicate it's intentionally unused.

🔎 Proposed fix
```diff
-for method, details in methods.items():
+for _method, details in methods.items():
```
27-28: Consider specifying explicit encoding for file operations. While json.load handles encoding, specifying encoding="utf-8" explicitly ensures consistent behavior across different system locales.

🔎 Proposed fix
```diff
-    with open(swagger_path) as f:
+    with open(swagger_path, encoding="utf-8") as f:
         data = json.load(f)
```

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py (1)
87-162: __all__ sorting should be addressed in the generator. The static analysis tool notes that __all__ is not isort-style sorted. Since this file is auto-generated, the sorting logic should be fixed in scripts/codegen/generate_evaluator_models.py rather than manually here. The current grouping (registry functions, then request models, then response models) is intentional and logical.

Consider whether ErrorResponse (line 63, 140) should also be added to the RESPONSE_MODELS registry in registry.py, or if it's intentionally a general-purpose error model not tied to any specific evaluator slug.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
packages/traceloop-sdk/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
- packages/traceloop-sdk/pyproject.toml
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py
- scripts/codegen/generate_evaluator_models.py
- scripts/generate-models.sh
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py
- scripts/codegen/generate_evaluator_models.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py
🧠 Learnings (1)
📚 Learning: 2025-08-22T14:41:26.962Z
Learnt from: prane-eth
Repo: traceloop/openllmetry PR: 3336
File: packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/utils.py:8-8
Timestamp: 2025-08-22T14:41:26.962Z
Learning: In the openllmetry project, the `packaging` library is available as a transitive dependency through other packages (visible in poetry.lock) and doesn't need to be explicitly declared in pyproject.toml dependencies.
Applied to files:
packages/traceloop-sdk/pyproject.toml
🧬 Code graph analysis (3)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py (33)
AgentEfficiencyRequest (11-17), AgentFlowQualityRequest (20-34), AgentGoalAccuracyRequest (37-47), AgentGoalCompletenessRequest (50-59), AgentToolErrorDetectorRequest (62-71), AnswerCompletenessRequest (74-77), AnswerCorrectnessRequest (80-83), AnswerRelevancyRequest (86-88), CharCountRatioRequest (91-95), CharCountRequest (98-99), ContextRelevanceRequest (102-110), ConversationQualityRequest (113-123), FaithfulnessRequest (126-136), InstructionAdherenceRequest (139-148), IntentChangeRequest (151-162), JSONValidatorRequest (165-168), PIIDetectorRequest (171-178), PerplexityRequest (181-182), PlaceholderRegexRequest (185-193), ProfanityDetectorRequest (196-197), PromptInjectionRequest (200-202), PromptPerplexityRequest (205-206), RegexValidatorRequest (209-215), SQLValidatorRequest (218-219), SecretsDetectorRequest (222-225), SemanticSimilarityRequest (228-230), SexismDetectorRequest (233-238), ToneDetectionRequest (241-242), TopicAdherenceRequest (245-255), ToxicityDetectorRequest (258-260), UncertaintyDetectorRequest (263-266), WordCountRatioRequest (269-273), WordCountRequest (276-277)

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (33)
AgentEfficiencyResponse (11-19), AgentFlowQualityResponse (22-27), AgentGoalAccuracyResponse (30-31), AgentGoalCompletenessResponse (34-37), AgentToolErrorDetectorResponse (40-44), AnswerCompletenessResponse (47-48), AnswerCorrectnessResponse (51-52), AnswerRelevancyResponse (55-56), CharCountRatioResponse (59-60), CharCountResponse (63-64), ContextRelevanceResponse (67-68), ConversationQualityResponse (71-72), FaithfulnessResponse (79-80), InstructionAdherenceResponse (83-84), IntentChangeResponse (87-92), JSONValidatorResponse (95-96), PIIDetectorResponse (99-100), PerplexityResponse (103-104), PlaceholderRegexResponse (107-108), ProfanityDetectorResponse (111-112), PromptInjectionResponse (115-116), PromptPerplexityResponse (119-120), RegexValidatorResponse (123-124), SQLValidatorResponse (127-128), SecretsDetectorResponse (131-132), SemanticSimilarityResponse (135-136), SexismDetectorResponse (139-140), ToneDetectionResponse (143-145), TopicAdherenceResponse (148-149), ToxicityDetectorResponse (152-153), UncertaintyDetectorResponse (156-158), WordCountRatioResponse (161-162), WordCountResponse (165-166)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py (3)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py (1)
AgentEfficiencyRequest (11-17)

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py (2)

get_request_model (161-163), get_response_model (166-168)

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (1)
AgentEfficiencyResponse(11-19)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)
char_count (183-196), word_count (221-233)
🪛 Ruff (0.14.10)
scripts/codegen/generate_evaluator_models.py
42-42: Loop control variable method not used within loop body
Rename unused method to _method
(B007)
299-299: subprocess call: check for execution of untrusted input
(S603)
300-310: Starting a process with a partial executable path
(S607)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
33-33: Avoid specifying long messages outside the exception class
(TRY003)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/__init__.py
87-162: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (7)
packages/traceloop-sdk/pyproject.toml (1)
86-86: LGTM! Adding datamodel-code-generator as a dev dependency is appropriate since it's only used during code generation and not at runtime. This keeps the production dependency footprint minimal.

scripts/generate-models.sh (1)
1-35: LGTM! The shell script follows best practices:
- Uses strict mode (set -euo pipefail) for early failure detection
- Properly validates input arguments and file existence
- Correctly resolves script directory using BASH_SOURCE for portability
- All variable expansions are quoted to prevent word splitting
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
117-118: Good placement of input validation. Validation is correctly placed at the start of both run_experiment_evaluator and trigger_experiment_evaluator, ensuring inputs are validated before building requests or making API calls. This follows the fail-fast principle.

Also applies to: 161-162
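To make the fail-fast ordering concrete, a minimal standalone sketch (the names mirror the SDK, but the bodies are invented for illustration and the validation check is a placeholder for the real Pydantic-based one):

```python
from typing import Dict

def _validate_evaluator_input(slug: str, input: Dict[str, str]) -> None:
    # Stand-in for the real request-model validation described in this PR.
    if "text" not in input:
        raise ValueError(f"Invalid input for '{slug}': missing 'text'")

def run_experiment_evaluator(slug: str, input: Dict[str, str]) -> dict:
    _validate_evaluator_input(slug, input)  # raises before any request is built or sent
    return {"slug": slug, "status": "submitted"}  # placeholder for the actual API call

print(run_experiment_evaluator("pii-detector", {"text": "hello"}))
```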
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/registry.py (1)
1-168: LGTM! The generated registry module is well-structured:
- Clear "DO NOT EDIT MANUALLY" warning with regeneration instructions
- Consistent alphabetical ordering of slug mappings
- Simple, effective helper functions for model lookup
- Type annotations are correct (Dict[str, Type[BaseModel]])

scripts/codegen/generate_evaluator_models.py (1)
298-315: LGTM - subprocess usage is appropriate for dev tooling. The subprocess call uses a hardcoded command (datamodel-codegen) with validated file inputs. The check=True parameter ensures failures are raised as exceptions, and the finally block properly cleans up temporary files. This is safe for a code generation script.

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (1)
1-166: LGTM! The auto-generated response models are well-structured:
- Correct use of Optional fields with defaults for response flexibility
- Proper handling of the Python reserved keyword pass using alias='pass' in IntentChangeResponse
- Effective use of inheritance to reduce duplication (RegexValidatorResponse, ToxicityDetectorResponse)
- Consistent use of Field() with examples for documentation

packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py (1)
1-8: LGTM - Clean generated file structure. The file header correctly identifies this as auto-generated code, and imports are appropriate for Pydantic model definitions. No security concerns.
Important
Looks good to me! 👍
Reviewed a8a77fd in 1 minute and 8 seconds.
- Reviewed 392 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 4 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/.flake8:12
- Draft comment:
Added per-file ignore for E501 on generated evaluator files. Ensure this intentional ignore doesn't hide style issues if generated code changes.
- Reason this comment was not posted: Comment did not seem useful. Confidence is useful = 30% <= threshold 50%. The comment is asking the author to ensure that the ignore doesn't hide style issues, which is a form of asking for confirmation. This violates the rule against asking the PR author to ensure behavior is intended. However, it does point out a potential issue with ignoring style checks on generated files, which could be useful if phrased differently.
2. packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py:13
- Draft comment:
Standardized example strings to use double quotes for consistency across models.
- Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py:11
- Draft comment:
Consistent update of example strings to double quotes in responses; changes look purely stylistic and improve uniformity.
- Reason this comment was not posted: Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative, as it only praises the consistency of using double quotes without providing any actionable feedback or suggestions. It doesn't ask for any confirmation or suggest any improvements.
4. packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py:109
- Draft comment:
It looks like 'gpt-4o' might be a typographical error. Did you mean 'gpt-4'?
- Reason this comment was not posted: Comment was on unchanged code.
Workflow ID: wflow_3oaQT6TeYzm4Zj0T
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 2
♻️ Duplicate comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py (1)
1-277: Past review: Flake8 line-length violations require a generator configuration update. A previous review identified 23 Flake8 E501 line-length violations in this file. The recommendation was to update the generator configuration in scripts/codegen/generate_evaluator_models.py to enforce --max-line-length 79 and regenerate the models.

Since this is auto-generated code, please ensure the generator configuration is updated to prevent these violations in future regenerations. This should be addressed together with the response.py file to maintain consistency across all generated models.
🧹 Nitpick comments (3)
packages/traceloop-sdk/.flake8 (1)
12-14: LGTM! Good configuration for generated code. The multi-line format improves readability, and ignoring E501 (line too long) for auto-generated evaluator models is a pragmatic approach since code generators typically produce long lines that are difficult to control.
Optional: Consider whether additional ignores may be needed for generated code, such as F401 (unused imports) or W503 (line break before binary operator), depending on what the code generator produces. However, the current configuration may be sufficient.
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (2)
113-115: Inheritance pattern creates alias without additional functionality.
RegexValidatorResponse inherits from PlaceholderRegexResponse with no additional fields or methods. While valid, this pattern simply creates an alias. Consider whether this is intentional or if the models should be merged in the source specification. If these are meant to be identical, the source OpenAPI spec could reference the same schema to avoid duplication. If they're meant to diverge in the future, the current structure is appropriate.
142-144: Inheritance pattern creates alias without additional functionality.
ToxicityDetectorResponse inherits from SexismDetectorResponse with no additional fields. Same consideration as RegexValidatorResponse above. Review the source API specification to determine if these should share a schema definition or remain separate for future extensibility.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- packages/traceloop-sdk/.flake8
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/request.py
- packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py (1)
1-156: E501 violations are intentionally ignored for generated evaluator models. The file is already configured for Flake8 compliance. The packages/traceloop-sdk/setup.cfg has per-file-ignores = traceloop/sdk/evaluators_generated/*.py:E501, which explicitly excludes line-length violations (E501) for all files in this generated directory. No changes to the datamodel-codegen script or the response.py file are needed.
```python
class PromptInjectionResponse(BaseModel):
    has_injection: Optional[str] = Field(None, examples=["safe"])
```
Type mismatch: field name suggests boolean but typed as string.
The field has_injection is typed as Optional[str] with example "safe", but the name suggests it should be a boolean. This likely reflects the source API specification, but it may cause confusion for developers expecting a boolean return value.
Verify whether the API actually returns a string or if this should be a boolean in the OpenAPI spec. If the API spec is correct, consider adding a docstring or comment to clarify that this field returns a string status (e.g., "safe", "unsafe") rather than a boolean.
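For illustration only (not something the PR does): two hypothetical ways the field could be modeled once the spec question is settled, one as a boolean flag and one as a string status with the allowed values made explicit. The class names and allowed values below are assumptions.

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class PromptInjectionResponseAsBool(BaseModel):
    # If the API really returns a boolean flag.
    has_injection: Optional[bool] = Field(None, examples=[False])

class PromptInjectionResponseAsStatus(BaseModel):
    # If the API returns a status string; the allowed values here are assumed.
    has_injection: Optional[Literal["safe", "unsafe"]] = Field(None, examples=["safe"])
```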
🤖 Prompt for AI Agents
In packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py around
lines 105-106, the field has_injection is declared as Optional[str] but its name
implies a boolean; verify the OpenAPI source to see whether the API actually
returns a boolean or a status string. If the API returns a boolean, change the
field type to Optional[bool] and update the Field example and any related
serializers/tests accordingly; if the API returns a status string, keep the type
but add a concise docstring or inline comment describing allowed values (e.g.,
"safe"/"unsafe"/"unknown") and update the Field examples to reflect valid status
strings so callers aren’t confused. Ensure generated types and unit tests are
updated to match the chosen correction.
```python
class SexismDetectorResponse(BaseModel):
    is_safe: Optional[str] = Field(None, examples=["safe"])
```
Type mismatch: field name suggests boolean but typed as string.
Similar to has_injection above, the field is_safe is typed as Optional[str] with example "safe", but the name convention suggests a boolean. This naming could mislead developers.
Verify the source API specification to confirm the return type. If a string is correct, consider documenting this in the source spec or adding clarifying comments.
🤖 Prompt for AI Agents
In packages/traceloop-sdk/traceloop/sdk/evaluators_generated/response.py around
lines 129-130, the field is_safe is declared Optional[str] but its name implies
a boolean; verify the API spec and if the endpoint returns a boolean change the
type to Optional[bool] and update the Field examples to True/False accordingly
(and regenerate/adjust any type hints/usages), otherwise if the API actually
returns string values keep the str type but rename the field to is_safe_label or
add a clarifying comment/docstring and update the spec to avoid confusion.
Important
Looks good to me! 👍
Reviewed 0265e2a in 2 minutes and 32 seconds.
- Reviewed 847 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 4 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py:3
- Draft comment:
Import of create_evaluator and its addition to __all__ is clear and improves the module interface.
- Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:60
- Draft comment:
In create_evaluator, the config is filtered but defaults from _get_config_fields aren't merged. Consider merging defaults so unset options can automatically use default values.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. This is a code quality suggestion about merging defaults. However, I need to consider: 1) Is this actually a problem? The config is passed to EvaluatorDetails which might handle defaults elsewhere, or the backend might handle them. 2) The old code (before the diff) explicitly set defaults in each factory method (e.g., probability_threshold: float = 0.5), so users could omit parameters and get defaults. 3) The new code uses **config: Any, which means users must explicitly pass values or they won't be in the config dict at all. 4) The comment is speculative - it says "Consider merging", which is a suggestion, not pointing out a definite bug. 5) Without seeing how EvaluatorDetails or the backend handles missing config values, we can't be sure this is needed. The comment is speculative ("Consider merging") rather than pointing out a definite issue. We don't know if missing config values are actually a problem - the backend or EvaluatorDetails might handle defaults. The old code had explicit defaults in function signatures, but we don't know if removing them breaks anything or if it's intentional. This requires cross-file context to verify. While the comment makes a reasonable suggestion, it's speculative and we lack evidence that this is actually causing issues. The refactoring from explicit factory methods to dynamic generation might intentionally rely on backend defaults. Without seeing test failures or knowing the full system behavior, we can't confirm this is needed. This is a speculative code quality suggestion without strong evidence of an actual problem. The comment uses "Consider" language and we lack context about whether missing config values are handled elsewhere. Following the rule to only keep comments with strong evidence of correctness, this should be deleted.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:73
- Draft comment:
The dynamic factory generation in __getattr__ is neat. Optionally, consider caching default config values (from _get_config_fields) if repeated lookups become a performance concern.
- Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%.
4. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:84
- Draft comment:
In __dir__, dynamic addition of evaluator method names is implemented. Ensure that potential duplicates with inherited attributes are acceptable.
- Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%.
Workflow ID: wflow_x2knBzUFK7yWhbup
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
🧹 Nitpick comments (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (3)
42-67: Consider: Validate config keys against model fields. The function accepts arbitrary **config parameters but doesn't verify that the keys match the request model's fields. This could lead to typos or invalid parameters being silently ignored by the evaluator backend. Consider validating config keys or at minimum logging a warning for unrecognized parameters.
🔎 Example validation approach
```diff
 def create_evaluator(slug: str, **config: Any) -> EvaluatorDetails:
     """Create an EvaluatorDetails for the given slug with optional config.

     Args:
         slug: The evaluator slug (e.g., "pii-detector")
         **config: Configuration options for the evaluator

     Returns:
         EvaluatorDetails configured for the specified evaluator

     Example:
         >>> from traceloop.sdk.evaluator import create_evaluator
         >>> evaluator = create_evaluator("pii-detector", probability_threshold=0.8)
     """
     if slug not in REQUEST_MODELS:
         available = ", ".join(sorted(REQUEST_MODELS.keys()))
         raise ValueError(f"Unknown evaluator slug: '{slug}'. Available: {available}")
-
+
+    # Validate config keys against model fields
+    model = REQUEST_MODELS[slug]
+    valid_fields = set(model.model_fields.keys())
+    invalid_keys = set(config.keys()) - valid_fields
+    if invalid_keys:
+        raise ValueError(f"Invalid config keys for '{slug}': {invalid_keys}. Valid keys: {valid_fields}")
+
     # Remove None values from config
     config = {k: v for k, v in config.items() if v is not None}

     return EvaluatorDetails(
         slug=slug,
         version=None,
         config=config,
         required_input_fields=_get_required_fields(slug),
     )
```
58-58: Nitpick: Long error message in exception. Ruff suggests avoiding long messages directly in exception raises. Consider extracting to a variable or creating a custom exception class for better maintainability.
🔎 Suggested refactor
```diff
 if slug not in REQUEST_MODELS:
     available = ", ".join(sorted(REQUEST_MODELS.keys()))
-    raise ValueError(f"Unknown evaluator slug: '{slug}'. Available: {available}")
+    message = f"Unknown evaluator slug: '{slug}'. Available: {available}"
+    raise ValueError(message)
```

Based on static analysis (Ruff TRY003).
82-82: Enhance error message with available options. The error message could be more helpful by listing available evaluators, similar to line 58. Also, consider extracting the message to a variable per Ruff TRY003.
🔎 Suggested improvement
```diff
-    raise AttributeError(f"'{cls.__name__}' has no attribute '{name}'")
+    available = ", ".join(sorted(_slug_to_method_name(s) for s in REQUEST_MODELS.keys()))
+    message = f"'{cls.__name__}' has no attribute '{name}'. Available evaluators: {available}"
+    raise AttributeError(message)
```

Based on static analysis (Ruff TRY003).
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
create_evaluator (42-67)
🪛 Ruff (0.14.10)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
58-58: Avoid specifying long messages outside the exception class
(TRY003)
82-82: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Lint
- GitHub Check: Test Packages (3.11)
🔇 Additional comments (5)
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)
3-3: LGTM: Clean public API expansion. The new create_evaluator function is properly imported and exported, enabling the registry-driven evaluator creation pattern described in the PR objectives.

Also applies to: 9-9
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
12-39: LGTM: Well-structured helper functions. The helper functions appropriately extract metadata from the registry models. The defensive returns (empty list/dict) are safe since create_evaluator validates slug existence before these are called.
70-89: Clever metaclass pattern with IDE support tradeoff. The dynamic method generation via __getattr__ and __dir__ is a clean implementation that enables registry-driven evaluator creation. The tradeoff is reduced IDE autocomplete and static type checking support for the generated methods.

This is acceptable given the flexibility benefits, but users may need to reference documentation or use dir(EvaluatorMadeByTraceloop) to discover available evaluators.
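As a standalone illustration of this pattern (not the SDK's actual code; the registry contents, metaclass name, and factory body here are toy stand-ins), a metaclass can synthesize factory methods from a slug registry via __getattr__ and advertise them via __dir__:

```python
from typing import Any, Callable, Dict, List

# Toy stand-in for the generated REQUEST_MODELS registry.
REQUEST_MODELS: Dict[str, object] = {"pii-detector": object, "toxicity-detector": object}

def _slug_to_method_name(slug: str) -> str:
    return slug.replace("-", "_")

class _EvaluatorFactoryMeta(type):
    def __getattr__(cls, name: str) -> Callable[..., Dict[str, Any]]:
        slug = name.replace("_", "-")
        if slug not in REQUEST_MODELS:
            raise AttributeError(f"'{cls.__name__}' has no attribute '{name}'")

        def factory(**config: Any) -> Dict[str, Any]:
            # The real code returns an EvaluatorDetails; a dict keeps this runnable.
            return {"slug": slug, "config": config}

        factory.__name__ = name
        return factory

    def __dir__(cls) -> List[str]:
        # Combine normal class attributes with the generated evaluator method names.
        return list(super().__dir__()) + [_slug_to_method_name(s) for s in REQUEST_MODELS]

class EvaluatorMadeByTraceloop(metaclass=_EvaluatorFactoryMeta):
    """Factory whose evaluator methods are generated on attribute access."""

print(EvaluatorMadeByTraceloop.pii_detector(probability_threshold=0.8))
print([n for n in dir(EvaluatorMadeByTraceloop) if n.endswith("_detector")])
```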
92-115: LGTM: Clear documentation for dynamic factory class. The comprehensive docstring and examples effectively document the dynamic evaluator creation pattern, helping users understand the registry-driven approach despite limited IDE support for autocomplete.
60-61: No issues found. The None filtering in config (lines 60-61) is intentional and correct behavior. All optional config parameters in the request models default to None, so filtering explicit None values is a valid cleanup pattern that doesn't prevent any meaningful use cases.
Important
Looks good to me! 👍
Reviewed 436e856 in 1 minute and 12 seconds.
- Reviewed 36 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:7
- Draft comment:
Consider verifying the import order. The new addition of 'from ..evaluators_generated import REQUEST_MODELS' and repositioning of 'EvaluatorDetails' might benefit from following a standard order (e.g. standard library, then third-party, then local) to improve clarity.
- Reason this comment was not posted: Confidence changes required: 33% <= threshold 50%.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:83
- Draft comment:
Consider formatting the 'config_fields' output in the factory method's docstring. Instead of embedding a raw list (or 'none'), joining the keys into a comma-separated string could enhance readability.
- Reason this comment was not posted: Confidence changes required: 33% <= threshold 50%.
Workflow ID: wflow_ygFrcNJJqi5hIF5K
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
🧹 Nitpick comments (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (3)
57-59: Consider using a custom exception or shorter message. Ruff flags this as a long exception message. Consider extracting to a custom exception class or shortening the message.
🔎 Proposed refactor using a custom exception
```diff
+class UnknownEvaluatorError(ValueError):
+    """Raised when an unknown evaluator slug is provided."""
+    pass
+
 def create_evaluator(slug: str, **config: Any) -> EvaluatorDetails:
     """Create an EvaluatorDetails for the given slug with optional config.

     Args:
         slug: The evaluator slug (e.g., "pii-detector")
         **config: Configuration options for the evaluator

     Returns:
         EvaluatorDetails configured for the specified evaluator

     Example:
         >>> from traceloop.sdk.evaluator import create_evaluator
         >>> evaluator = create_evaluator("pii-detector", probability_threshold=0.8)
     """
     if slug not in REQUEST_MODELS:
         available = ", ".join(sorted(REQUEST_MODELS.keys()))
-        raise ValueError(f"Unknown evaluator slug: '{slug}'. Available: {available}")
+        raise UnknownEvaluatorError(f"Unknown evaluator slug: '{slug}'. Available: {available}")
```
83-84: Consider clarifying the config_fields display logic. The expression list(_get_config_fields(slug).keys()) or "none" produces either a list or the string "none", which is type-inconsistent. While this only affects the docstring display, it could be clearer.

🔎 Proposed refactor for clearer display logic
```diff
     factory.__name__ = name
-    config_fields = list(_get_config_fields(slug).keys()) or "none"
-    factory.__doc__ = f"Create {slug} evaluator. Config fields: {config_fields}"
+    config_keys = list(_get_config_fields(slug).keys())
+    config_display = ", ".join(config_keys) if config_keys else "none"
+    factory.__doc__ = f"Create {slug} evaluator. Config fields: {config_display}"
     return factory
```
86-86: Consider using a shorter exception message. Ruff flags this as a long exception message. The error message could be more concise.
🔎 Proposed refactor with shorter message
```diff
-    raise AttributeError(f"'{cls.__name__}' has no attribute '{name}'")
+    raise AttributeError(f"Unknown evaluator method: '{name}'")
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails (5-23)
🪛 Ruff (0.14.10)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
59-59: Avoid specifying long messages outside the exception class
(TRY003)
86-86: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Lint
- GitHub Check: Test Packages (3.12)
- GitHub Check: Build Packages (3.11)
🔇 Additional comments (7)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (7)
13-30: LGTM - Helper functions correctly extract field information. The helper functions properly introspect Pydantic model fields to extract required fields and config fields with defaults. The empty return values when the model is not found are acceptable since create_evaluator validates the slug separately.
33-40: LGTM - Conversion logic is simple and appropriate. The slug/method name conversion correctly handles kebab-case to snake_case transformation. This should work well as long as evaluator slugs follow the kebab-case convention (e.g., "pii-detector") and don't contain underscores.
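A minimal sketch of such a kebab-case/snake_case conversion pair. The SDK's helpers are named _slug_to_method_name and _method_name_to_slug; their exact bodies are assumed here.

```python
def _slug_to_method_name(slug: str) -> str:
    # "pii-detector" -> "pii_detector"
    return slug.replace("-", "_")

def _method_name_to_slug(name: str) -> str:
    # "pii_detector" -> "pii-detector"; only bidirectional if slugs contain no underscores
    return name.replace("_", "-")

assert _method_name_to_slug(_slug_to_method_name("pii-detector")) == "pii-detector"
```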
62-62: Verify None-filtering behavior is intentional. The code filters out None values from the config dict. Confirm this is the desired behavior, as it prevents users from explicitly passing None to unset optional config values.
65-65: Version parameter is hardcoded to None. The version parameter is always set to None when creating EvaluatorDetails. Verify whether this is temporary or if version support will be added later.
88-93: LGTM - __dir__ implementation enables proper introspection. The __dir__ override correctly combines parent class methods with dynamically generated evaluator methods, enabling IDE autocomplete and proper dir() introspection.
96-120: LGTM - Well-documented class with metaclass-driven functionality. The class definition is clean with a comprehensive docstring that provides clear usage examples and lists available evaluators. The empty body is appropriate since all functionality is provided by the metaclass.
9-9: No action required. REQUEST_MODELS is properly populated at module load time with 32+ evaluator models and is safely handled throughout the codebase with clear error messages listing available evaluators when an unknown slug is provided.
Important
Looks good to me! 👍
Reviewed 54c3c66 in 51 seconds.
- Reviewed 22 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:74
- Draft comment:
Added return type hint '-> Any' on __getattr__. Consider using a more specific Callable type (e.g., Callable[..., EvaluatorDetails]) if possible for stronger type safety.
- Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:88
- Draft comment:
Added explicit return type '-> List[str]' on __dir__, which improves clarity for consumers using the metaclass.
- Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%.
Workflow ID: wflow_ta6L6BMAVbc41MQh
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
83-84: Inconsistent type in docstring generation. config_fields is either a list or the string "none", making the docstring format inconsistent:
- With fields: "Config fields: ['field1', 'field2']"
- Without: "Config fields: none"

🔎 Suggested fix for consistent formatting
```diff
-    config_fields = list(_get_config_fields(slug).keys()) or "none"
-    factory.__doc__ = f"Create {slug} evaluator. Config fields: {config_fields}"
+    config_fields = list(_get_config_fields(slug).keys())
+    fields_str = ", ".join(config_fields) if config_fields else "none"
+    factory.__doc__ = f"Create {slug} evaluator. Config fields: {fields_str}"
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails (5-23)
🪛 Ruff (0.14.10)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
59-59: Avoid specifying long messages outside the exception class
(TRY003)
86-86: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
- GitHub Check: Analyze (python)
🔇 Additional comments (6)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (6)
1-10: LGTM! Imports are clean and appropriate for the dynamic factory pattern implementation.
13-18: LGTM! The silent fallback to an empty list is acceptable since create_evaluator validates the slug before this function is called.
21-30: LGTM! Clean extraction of optional config fields with their defaults.
43-68: LGTM with a minor note. The implementation is clean and provides a good user experience with descriptive error messages. The version parameter is hardcoded to None; if evaluator versioning becomes relevant, consider exposing it as an optional parameter.
96-120: LGTM! The metaclass-based dynamic factory is a clean design that eliminates boilerplate. The docstring appropriately guides users to dir(EvaluatorMadeByTraceloop) for the complete list of available evaluators.
38-40: No issues found. All slugs in REQUEST_MODELS use hyphens (e.g., "pii-detector", "json-validator"), not underscores. The conversion functions _slug_to_method_name() and _method_name_to_slug() are properly bidirectional for all actual evaluator slugs in the registry.
packages/traceloop-sdk/.flake8
Outdated
```diff
-per-file-ignores = __init__.py:F401
+per-file-ignores =
+    __init__.py:F401
+    traceloop/sdk/evaluators_generated/*.py:E501
```
Suggested change:

```diff
-    traceloop/sdk/evaluators_generated/*.py:E501
+    traceloop/sdk/generated/**/*.py:E501
```
Important
Looks good to me! 👍
Reviewed b1e4b30 in 1 minute and 39 seconds.
- Reviewed 488 lines of code in 8 files
- Skipped 0 files when reviewing
- Skipped posting 9 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/.flake8:14
- Draft comment:
Updated exclusion pattern to use 'generated/**' instead of 'evaluators_generated'; ensure the glob pattern covers all intended nested files.
- Reason this comment was not posted: Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the author to ensure that the glob pattern covers all intended nested files. This falls under the category of asking the author to double-check or ensure something, which is against the rules. Therefore, this comment should be removed.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:15
- Draft comment:
Refactored import to use the new generated evaluator module location.
- Reason this comment was not posted: Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states what was done without offering any insight or guidance.
3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:9
- Draft comment:
Updated reference to the evaluator registry; now importing from the new 'generated/evaluators' module.
- Reason this comment was not posted: Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative, as it only states what has been done without providing any actionable feedback or suggestions. It doesn't ask for confirmation or suggest any improvements.
4. packages/traceloop-sdk/traceloop/sdk/evaluator/model.py:55
- Draft comment:
Updated example import in docstring to reference new generated evaluators module.
- Reason this comment was not posted: Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative, as it only states that an example import in a docstring was updated to reference a new module. It doesn't provide any actionable feedback or suggestions for improvement.
5. packages/traceloop-sdk/traceloop/sdk/generated/__init__.py:1
- Draft comment:
New __init__.py correctly consolidates generated evaluator modules. Make sure to update documentation to reflect the new module paths.
- Reason this comment was not posted: Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the PR author to update documentation, which is not allowed according to the rules. It doesn't provide a specific code suggestion or point out a potential issue with the code itself.
6. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py:12
- Draft comment:
Auto-generated evaluator request models now use consistent single quotes in examples. Verify consistency with project style guidelines.
- Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
7. packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py:11
- Draft comment:
Auto-generated evaluator response models have updated examples with consistent quoting. Confirm that the field types and examples match expected API responses.
- Reason this comment was not posted: Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the PR author to confirm that the field types and examples match expected API responses. This falls under the rule of not asking the author to confirm their intention or to double-check things. Therefore, this comment should be removed.
8. packages/traceloop-sdk/scripts/generate-models.sh:17
- Draft comment:
OUTPUT_DIR updated to the new 'generated/evaluators' path; verify that the new directory structure aligns with the rest of the project. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
9. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:12
- Draft comment:
Line 12 appears to contain an extraneous ")". It doesn't seem to match any opening statement. Please verify if this is intentional or remove it if it's a typographical error. - Reason this comment was not posted:
Comment was on unchanged code.
Workflow ID: wflow_UNmXKLhLPLh2IlO9
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 1
🧹 Nitpick comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
18-33: LGTM! The validation logic correctly validates input against the evaluator's request model schema. The error chaining preserves context, and the silent skip when no model exists is sensible for backward compatibility.
Optional: Address Ruff TRY003 hint
Ruff suggests avoiding long messages in exception constructors. If you prefer tighter style compliance, consider extracting to a custom exception class:
```diff
+class EvaluatorInputValidationError(ValueError):
+    """Raised when evaluator input fails validation."""
+    def __init__(self, slug: str, validation_error: ValidationError):
+        super().__init__(f"Invalid input for '{slug}': {validation_error}")
+
+
 def _validate_evaluator_input(slug: str, input: Dict[str, str]) -> None:
     """Validate input against the evaluator's request model if available. ..."""
     request_model = get_request_model(slug)
     if request_model:
         try:
             request_model(**input)
         except ValidationError as e:
-            raise ValueError(f"Invalid input for '{slug}': {e}") from e
+            raise EvaluatorInputValidationError(slug, e) from e
```
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
43-68: LGTM! The function correctly validates the evaluator slug and constructs EvaluatorDetails with dynamically derived required fields. The None-filtering in config (line 62) ensures model defaults are used when callers pass explicit None values.
Optional: Address Ruff TRY003 hint
Similar to the earlier validation function, Ruff suggests avoiding long exception messages. You could extract to a custom exception if desired:
```diff
+class UnknownEvaluatorError(ValueError):
+    """Raised when an unknown evaluator slug is requested."""
+    def __init__(self, slug: str, available: str):
+        super().__init__(f"Unknown evaluator slug: '{slug}'. Available: {available}")
+
+
 def create_evaluator(slug: str, **config: Any) -> EvaluatorDetails:
     """..."""
     if slug not in REQUEST_MODELS:
         available = ", ".join(sorted(REQUEST_MODELS.keys()))
-        raise ValueError(f"Unknown evaluator slug: '{slug}'. Available: {available}")
+        raise UnknownEvaluatorError(slug, available)
```
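To make the None-filtering concrete, here is a rough usage sketch; the slug and the probe_threshold field are invented for illustration and are not taken from the actual registry:

```python
from traceloop.sdk.evaluator.evaluators_made_by_traceloop import create_evaluator

# Hypothetical slug and config field, used only to illustrate the behavior.
# An explicit None is filtered out, so the request model's default applies.
with_default = create_evaluator("pii_detector", probe_threshold=None)

# A concrete value overrides the model default.
with_override = create_evaluator("pii_detector", probe_threshold=0.8)

# An unknown slug raises ValueError listing the available slugs.
try:
    create_evaluator("not-a-real-evaluator")
except ValueError as err:
    print(err)
```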
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (5)
- packages/traceloop-sdk/traceloop/sdk/generated/__init__.py is excluded by !**/generated/**
- packages/traceloop-sdk/traceloop/sdk/generated/evaluators/__init__.py is excluded by !**/generated/**
- packages/traceloop-sdk/traceloop/sdk/generated/evaluators/registry.py is excluded by !**/generated/**
- packages/traceloop-sdk/traceloop/sdk/generated/evaluators/request.py is excluded by !**/generated/**
- packages/traceloop-sdk/traceloop/sdk/generated/evaluators/response.py is excluded by !**/generated/**
📒 Files selected for processing (5)
- packages/traceloop-sdk/.flake8
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
- scripts/generate-models.sh
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/traceloop-sdk/.flake8
- scripts/generate-models.sh
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
🧬 Code graph analysis (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails (5-23)
packages/traceloop-sdk/traceloop/sdk/client/http.py (1)
get(40-60)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
packages/traceloop-sdk/traceloop/sdk/generated/evaluators/registry.py (1)
get_request_model(161-163)
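For orientation, get_request_model is a thin lookup over a slug-to-model mapping; a sketch of that shape (the slug and model below are invented stand-ins, not the generated code):

```python
from typing import Dict, Optional, Type

from pydantic import BaseModel


class PiiDetectorRequest(BaseModel):
    """Invented stand-in for a generated request model."""

    text: str


REQUEST_MODELS: Dict[str, Type[BaseModel]] = {
    "pii_detector": PiiDetectorRequest,
}


def get_request_model(slug: str) -> Optional[Type[BaseModel]]:
    # Return None for unknown slugs so callers can skip validation gracefully.
    return REQUEST_MODELS.get(slug)
```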
🪛 Ruff (0.14.10)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
59-59: Avoid specifying long messages outside the exception class
(TRY003)
86-86: Avoid specifying long messages outside the exception class
(TRY003)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
33-33: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Lint
- GitHub Check: Build Packages (3.11)
🔇 Additional comments (8)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)
3-3: LGTM! The new imports support input validation and are used appropriately in the new _validate_evaluator_input helper.
Also applies to: 15-15
117-118: LGTM! Input validation is correctly placed at the start of both execution methods, ensuring inputs are valid before building requests or triggering evaluator execution.
Also applies to: 161-162
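A self-contained sketch of the validate-before-execute ordering being approved here; the model, slug, and run function below are placeholders, not the SDK's real types or signatures:

```python
from typing import Dict, Optional, Type

from pydantic import BaseModel, ValidationError


class ProbeRequest(BaseModel):
    """Placeholder request model with one required field."""

    text: str


_MODELS: Dict[str, Type[BaseModel]] = {"probe": ProbeRequest}


def _validate(slug: str, payload: Dict[str, str]) -> None:
    model: Optional[Type[BaseModel]] = _MODELS.get(slug)
    if model is None:
        return  # no schema registered for this slug: skip silently
    try:
        model(**payload)
    except ValidationError as exc:
        raise ValueError(f"Invalid input for '{slug}': {exc}") from exc


def run(slug: str, payload: Dict[str, str]) -> dict:
    # Fail fast on bad input before any request is built or sent.
    _validate(slug, payload)
    return {"status": "submitted", "slug": slug}


print(run("probe", {"text": "hello"}))
```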
packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (2)
2-2: LGTM! Standard TypeVar pattern for enabling generic, type-safe result parsing in the new typed_result method.
Also applies to: 5-6
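The TypeVar pattern referred to here usually looks something like this sketch; the result container and verdict model are assumptions for illustration, not the SDK's actual classes:

```python
from typing import Any, Dict, Type, TypeVar

from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


class EvaluatorResult(BaseModel):
    """Stand-in result wrapper: `result` carries the raw payload as a dict."""

    result: Dict[str, Any]

    def typed_result(self, model: Type[T]) -> T:
        # Parse the raw dict into the caller-supplied Pydantic model so the
        # caller gets validation and attribute access instead of dict lookups.
        return model(**self.result)


class PiiVerdict(BaseModel):
    contains_pii: bool
    score: float


raw = EvaluatorResult(result={"contains_pii": False, "score": 0.02})
verdict = raw.typed_result(PiiVerdict)
print(verdict.contains_pii, verdict.score)
```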
48-63: LGTM! Clean convenience method that enables type-safe result parsing. The implementation is straightforward, and the documentation clearly illustrates the usage pattern. Pydantic's ValidationError will naturally propagate if the result doesn't match the model schema, which is appropriate behavior for the caller to handle.
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
1-10: LGTM! The updated docstring clearly describes the dynamic factory approach, and the imports correctly reference the registry and configuration models.
13-30: LGTM! The helper functions correctly introspect Pydantic models to extract required fields and configuration defaults using standard Pydantic APIs. Graceful handling of missing models is appropriate.
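As an illustration of that introspection (Pydantic v2 APIs; the model here is invented):

```python
from pydantic import BaseModel, Field


class ExampleRequest(BaseModel):
    text: str                      # required: no default
    threshold: float = Field(0.5)  # optional config field with a default


required_fields = [
    name for name, field in ExampleRequest.model_fields.items() if field.is_required()
]
config_defaults = {
    name: field.default
    for name, field in ExampleRequest.model_fields.items()
    if not field.is_required()
}

print(required_fields)   # ['text']
print(config_defaults)   # {'threshold': 0.5}
```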
33-40: LGTM! Clean bidirectional conversion between slug and method name formats. The simple string replacement approach is appropriate for this use case.
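Presumably the conversion is a plain character swap along these lines (assuming slugs use hyphens where Python method names need underscores):

```python
def _slug_to_method_name(slug: str) -> str:
    # e.g. "pii-detector" -> "pii_detector"
    return slug.replace("-", "_")


def _method_name_to_slug(name: str) -> str:
    # e.g. "pii_detector" -> "pii-detector"
    return name.replace("_", "-")


assert _slug_to_method_name("pii-detector") == "pii_detector"
assert _method_name_to_slug("pii_detector") == "pii-detector"
```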
96-120: LGTM! Excellent documentation with clear examples and guidance. The clean separation of concerns (metaclass handles behavior, class provides interface) makes the dynamic factory pattern easy to understand and use.
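From the caller's perspective, the documented usage presumably reads like the following; the evaluator name and config value are placeholders, not guaranteed registry entries:

```python
from traceloop.sdk.evaluator.evaluators_made_by_traceloop import (
    EvaluatorMadeByTraceloop,
)

# Attribute access is resolved by the metaclass below: any registered slug
# becomes a factory method without per-evaluator boilerplate.
evaluator = EvaluatorMadeByTraceloop.pii_detector(probe_threshold=0.9)

# dir() also lists the dynamically generated method names.
print([m for m in dir(EvaluatorMadeByTraceloop) if not m.startswith("_")])
```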
```python
class _EvaluatorMadeByTraceloopMeta(type):
    """Metaclass that dynamically generates evaluator factory methods."""

    def __getattr__(cls, name: str) -> Any:
        """Dynamically create factory methods for any evaluator slug."""
        slug = _method_name_to_slug(name)
        if slug in REQUEST_MODELS:

            def factory(**config: Any) -> EvaluatorDetails:
                return create_evaluator(slug, **config)

            factory.__name__ = name
            config_fields = list(_get_config_fields(slug).keys()) or "none"
            factory.__doc__ = f"Create {slug} evaluator. Config fields: {config_fields}"
            return factory
        raise AttributeError(f"'{cls.__name__}' has no attribute '{name}'")

    def __dir__(cls) -> List[str]:
        """List all available evaluator methods."""
        methods = list(super().__dir__())
        for slug in REQUEST_MODELS:
            methods.append(_slug_to_method_name(slug))
        return methods
```
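One note for readers unfamiliar with the mechanism: a metaclass __getattr__ is only consulted when normal class-attribute lookup fails, which is what lets the registry drive dispatch without defining methods up front. A toy, self-contained illustration:

```python
from typing import Any


class _DynamicMeta(type):
    def __getattr__(cls, name: str) -> Any:
        # Reached only when `name` is not found on the class itself.
        if name.startswith("make_"):
            def factory() -> str:
                return f"built by {name}"

            factory.__name__ = name
            return factory
        raise AttributeError(f"'{cls.__name__}' has no attribute '{name}'")


class Factory(metaclass=_DynamicMeta):
    pass


print(Factory.make_widget())  # built by make_widget
```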
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's verify the file exists and check the exact code at lines 71-94
echo "=== Checking code at lines 71-94 ==="
if [ -f "packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py" ]; then
sed -n '71,94p' "packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py"
else
echo "File not found"
fi
# Check for REQUEST_MODELS import
echo -e "\n=== Checking REQUEST_MODELS import ==="
head -30 "packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py"Repository: traceloop/openllmetry
Length of output: 2089
🏁 Script executed:
#!/bin/bash
# Check if registry file exists and REQUEST_MODELS definition
echo "=== Searching for REQUEST_MODELS definition ==="
rg "REQUEST_MODELS" packages/traceloop-sdk/ -A 2 -B 2
# Look for the registry file
echo -e "\n=== Looking for registry or models file ==="
fd "registry" packages/traceloop-sdk/
fd "models" packages/traceloop-sdk/ | grep -E "\.(py|json)$" | head -10Repository: traceloop/openllmetry
Length of output: 7234
🏁 Script executed:
#!/bin/bash
# Check for hardcoded API keys or secrets in the target file
echo "=== Checking for hardcoded secrets in target file ==="
rg -i "(api[_-]?key|secret|password|token|credential)" packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
# Check for environment variable patterns
echo -e "\n=== Checking for env var usage pattern ==="
rg "os\.environ|getenv" packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.pyRepository: traceloop/openllmetry
Length of output: 162
🏁 Script executed:
#!/bin/bash
# Check Flake8 compliance if available
echo "=== Checking Flake8 compliance ==="
if command -v flake8 &> /dev/null; then
flake8 packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py --show-source
else
echo "Flake8 not installed, checking for linting issues manually..."
# Look for common issues
rg "except.*:" packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py -A 1
fi
Repository: traceloop/openllmetry
Length of output: 3469
Fix Flake8 E501 violations before approval.
The metaclass implementation is sound and REQUEST_MODELS is properly populated at import time. However, the file has multiple Flake8 E501 violations (line too long) at lines 4, 18, 22, 55, 59, 84, and 107 that violate the coding guideline requiring Flake8 compliance. These must be fixed by breaking long lines or using line continuation before this change can be approved.
🧰 Tools
🪛 Ruff (0.14.10)
86-86: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
around lines 71-94 (and elsewhere where flake8 flagged E501 at lines 4, 18, 22,
55, 59, 84, 107), several lines exceed the 88-character limit; break those long
lines by introducing intermediate variables, splitting long f-strings into
concatenated parts or using parentheses for implicit continuation, wrapping long
list or dict expressions across multiple lines, or moving complex expressions onto
separate lines so each line stays within the limit; ensure docstrings and long
assignment lines are wrapped, prefer parentheses for multi-line returns and keep
line length under 88 characters to resolve the E501 violations.
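For instance, the long factory docstring assignment could be wrapped with implicit string concatenation inside parentheses; the values below are placeholders used only to make the snippet runnable:

```python
slug = "pii-detector"                  # placeholder
config_fields = ["probe_threshold"]    # placeholder

factory_doc = (
    f"Create {slug} evaluator. "
    f"Config fields: {config_fields}"
)
print(factory_doc)
```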
feat(instrumentation): ... or fix(instrumentation): ...
Important
This PR introduces dynamic evaluator model generation and management using Pydantic, enhancing input validation and output deserialization for evaluators.
- Runtime input validation in evaluator.py.
- typed_result() in model.py.
- create_evaluator entry point in evaluators_made_by_traceloop.py.
- Generated models under generated/evaluators.
- Evaluator registry in registry.py.
- generate_evaluator_models.py for model generation from OpenAPI specs.
- generate-models.sh for running the generation script.
- .flake8 updated to allow generated evaluator files.
- datamodel-code-generator as a dev dependency in pyproject.toml.
This description was created by
for b1e4b30. You can customize this summary. It will automatically update as commits are pushed.
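For background on the generation step, datamodel-code-generator is typically driven from a spec file roughly as below; the concrete flags, schema filtering, and paths used by generate-models.sh and generate_evaluator_models.py are not visible in this view, so treat this only as an illustrative sketch:

```python
import subprocess

# Illustrative invocation only: input/output paths are placeholders.
subprocess.run(
    [
        "datamodel-codegen",
        "--input", "swagger.json",
        "--input-file-type", "openapi",
        "--output", "traceloop/sdk/generated/evaluators/request.py",
    ],
    check=True,
)
```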
Summary by CodeRabbit
Release Notes
New Features
- create_evaluator API for evaluator instantiation
Chores
✏️ Tip: You can customize this high-level summary in your review settings.