53 changes: 53 additions & 0 deletions .codex/skills/final-release-review/SKILL.md
@@ -0,0 +1,53 @@
---
name: final-release-review
description: Perform a release-readiness review by locating the previous release tag from remote tags and auditing the diff (e.g., v1.2.3...main) for breaking changes, regressions, improvement opportunities, and risks before releasing openai-agents-python.
---

# Final Release Review

## Purpose

Use this skill when validating main for release. It guides you to fetch remote tags, pick the previous release tag, and thoroughly inspect the `BASE_TAG...TARGET` diff for breaking changes, introduced bugs/regressions, improvement opportunities, and release risks.

## Quick start

1. Ensure repository root: `pwd` → `path-to-workspace/openai-agents-python`.
2. Sync tags and pick base (default `v*`):
```bash
BASE_TAG="$(.codex/skills/final-release-review/scripts/find_latest_release_tag.sh origin 'v*')"
```
3. Choose target (default `main`, ensure fresh): `git fetch origin main --prune` then `TARGET="main"`.
4. Snapshot scope:
```bash
git diff --stat "${BASE_TAG}"..."${TARGET}"
git diff --dirstat=files,0 "${BASE_TAG}"..."${TARGET}"
git log --oneline --reverse "${BASE_TAG}".."${TARGET}"
git diff --name-status "${BASE_TAG}"..."${TARGET}"
```
5. Deep review using `references/review-checklist.md` to spot breaking changes, regressions, and improvement chances.
6. Capture findings and call the release gate: ship/block with conditions; propose focused tests for risky areas.

## Workflow

- **Prepare**
- Run the quick-start tag command to ensure you use the latest remote tag. If the tag pattern differs, override the pattern argument (e.g., `'*.*.*'`; see the sketch after this list).
- If the user specifies a base tag, prefer it but still fetch remote tags first.
- Keep the working tree clean to avoid diff noise.
- **Map the diff**
- Use `--stat`, `--dirstat`, and `--name-status` outputs to spot hot directories and file types.
- For suspicious files, prefer `git diff --word-diff BASE...TARGET -- <path>`.
- Note any deleted or newly added tests, config (for example `pyproject.toml`, `uv.lock`, `mkdocs.yml`), migrations, or scripts.
- **Analyze risk**
- Walk through the categories in `references/review-checklist.md` (breaking changes, regression clues, improvement opportunities).
- When you suspect a risk, cite the specific file/commit and explain the behavioral impact.
- Suggest minimal, high-signal validation commands (targeted tests or linters) instead of generic reruns when time is tight.
- **Form a recommendation**
- State BASE_TAG and TARGET explicitly.
- Provide a concise diff summary (key directories/files and counts).
- List: breaking-change candidates, probable regressions/bugs, improvement opportunities, missing release notes/migrations.
- Recommend ship/block and the exact checks needed to unblock if blocking.
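
A minimal sketch of the prepare and map-the-diff steps above, assuming the default `origin` remote; the word-diff target path is illustrative:

```bash
# Override the tag pattern when release tags are not prefixed with "v".
BASE_TAG="$(.codex/skills/final-release-review/scripts/find_latest_release_tag.sh origin '*.*.*')"

# Refresh the target ref, then inspect one suspicious file at word level.
git fetch origin main --prune
TARGET="main"
git diff --word-diff "${BASE_TAG}"..."${TARGET}" -- src/agents/run.py  # illustrative path
```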

## Resources

- `scripts/find_latest_release_tag.sh`: Fetches remote tags and returns the newest tag matching a pattern (default `v*`).
- `references/review-checklist.md`: Detailed signals and commands for spotting breaking changes, regressions, and release polish gaps.
40 changes: 40 additions & 0 deletions .codex/skills/final-release-review/references/review-checklist.md
@@ -0,0 +1,40 @@
# Release Diff Review Checklist

## Quick commands

- Sync tags: `git fetch origin --tags --prune`.
- Identify latest release tag (default pattern `v*`): `git tag -l 'v*' --sort=-v:refname | head -n1` or use `.codex/skills/final-release-review/scripts/find_latest_release_tag.sh`.
- Generate overview: `git diff --stat BASE...TARGET`, `git diff --dirstat=files,0 BASE...TARGET`, `git log --oneline --reverse BASE..TARGET`.
- Inspect risky files quickly: `git diff --name-status BASE...TARGET`, `git diff --word-diff BASE...TARGET -- <path>`.

## Breaking change signals

- Public API surface: removed/renamed modules, classes, functions, or re-exports; changed parameters, return types, or default values; new required options; stricter validation.
- Protocol/schema: request/response fields added/removed/renamed, enum changes, JSON shape changes, ID formats, pagination defaults.
- Config/CLI/env: renamed flags, default behavior flips, removed fallbacks, environment variable changes, logging levels tightened.
- Dependencies/platform: Python version requirement changes, dependency major bumps, `pyproject.toml`/`uv.lock` changes, removed or renamed extras.
- Persistence/data: migration scripts missing, data model changes, stored file formats, cache keys altered without invalidation.
- Docs/examples drift: examples still reflect old behavior or lack migration note.
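
One hedged way to surface candidate public-API removals, assuming the package lives under `src/agents/` and with `BASE`/`TARGET` standing in for the refs from the quick commands:

```bash
# Removed def/class lines in the public package hint at deleted or renamed symbols.
git diff BASE...TARGET -- src/agents/ \
  | grep -E '^-[[:space:]]*(def|class) ' | head -n 50
```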

## Regression risk clues

- Large refactors with light test deltas or deleted tests; new `skip`/`todo` markers.
- Concurrency/timing: new async flows, asyncio event-loop changes, retries, timeouts, debounce/caching changes, race-prone patterns.
- Error handling: catch blocks removed, swallowed errors, broader catch-all added without logging, stricter throws without caller updates.
- Stateful components: mutable shared state, global singletons, lifecycle changes (init/teardown), resource cleanup removal.
- Third-party changes: swapped core libraries, feature flags toggled, observability removed or gated.
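
A sketch for spotting test-weakening changes, assuming pytest-style tests under `tests/`; `BASE`/`TARGET` are again placeholders:

```bash
# Newly added skip/xfail/TODO markers inside tests.
git diff BASE...TARGET -- tests/ | grep -E '^\+.*(pytest\.mark\.skip|xfail|TODO)'

# Test files deleted in the release range.
git diff --diff-filter=D --name-only BASE...TARGET -- tests/
```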

## Improvement opportunities

- Missing coverage for new code paths; add focused tests.
- Performance: obvious N+1 loops, repeated I/O without caching, excessive serialization.
- Developer ergonomics: unclear naming, missing inline docs for public APIs, missing examples for new features.
- Release hygiene: add migration/upgrade note when behavior changes; ensure changelog/notes capture user-facing shifts.

## Evidence to capture in the review output

- BASE tag and TARGET ref used for the diff; confirm tags fetched.
- High-level diff stats and key directories touched.
- Concrete files/commits that indicate breaking changes or risk, with brief rationale.
- Tests or commands suggested to validate suspected risks.
- Explicit release gate call (ship/block) with conditions to unblock.
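
One way to capture this evidence is a short scratch note; the template below is only a hypothetical starting point, not a required format:

```bash
# Hypothetical scratch template for the release gate decision (adjust freely).
cat > release-review-notes.md <<'EOF'
BASE_TAG: <tag>    TARGET: <ref>
Diff summary: <key directories/files and counts>
Breaking-change candidates: <files/commits + rationale>
Regression risks: <files/commits + rationale>
Suggested validation: <targeted tests or linters>
Gate: ship | block (conditions to unblock)
EOF
```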
17 changes: 17 additions & 0 deletions .codex/skills/final-release-review/scripts/find_latest_release_tag.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env bash
set -euo pipefail

remote="${1:-origin}"
pattern="${2:-v*}"

# Sync tags from the remote to ensure the latest release tag is available locally.
git fetch "$remote" --tags --prune --quiet

latest_tag=$(git tag -l "$pattern" --sort=-v:refname | head -n1)

if [[ -z "$latest_tag" ]]; then
echo "No tags found matching pattern '$pattern' after fetching from $remote." >&2
exit 1
fi

echo "$latest_tag"
42 changes: 42 additions & 0 deletions .codex/skills/test-coverage-improver/SKILL.md
@@ -0,0 +1,42 @@
---
name: test-coverage-improver
description: 'Improve test coverage in the OpenAI Agents Python repository: run `make coverage`, inspect coverage artifacts, identify low-coverage files, propose high-impact tests, and confirm with the user before writing tests.'
---

# Test Coverage Improver

## Overview

Use this skill whenever coverage needs assessment or improvement (coverage regressions, failing thresholds, or user requests for stronger tests). It runs the coverage suite, analyzes results, highlights the biggest gaps, and prepares test additions while confirming with the user before changing code.

## Quick Start

1. From the repo root run `make coverage` to regenerate `.coverage` data and `coverage.xml`.
2. Collect artifacts: `.coverage` and `coverage.xml`, plus the console output from `coverage report -m` for drill-downs.
3. Summarize coverage: total percentages, lowest files, and uncovered lines/paths.
4. Draft test ideas per file: scenario, behavior under test, expected outcome, and likely coverage gain.
5. Ask the user for approval to implement the proposed tests; pause until they agree.
6. After approval, write the tests in `tests/`, rerun `make coverage`, and then run `$code-change-verification` before marking work complete.
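
A minimal command sketch for steps 1–3; it assumes the `coverage` CLI is available via `uv run`, as with the HTML report mentioned later in this skill:

```bash
# Regenerate coverage data and artifacts (.coverage, coverage.xml).
make coverage

# Per-file totals with missing line numbers, for the drill-down summary.
uv run coverage report -m
```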

## Workflow Details

- **Run coverage**: Execute `make coverage` at repo root. Avoid watch flags and keep prior coverage artifacts only if comparing trends.
- **Parse summaries efficiently**:
- Prefer the console output from `coverage report -m` for file-level totals; fall back to `coverage.xml` for tooling or spreadsheets.
- Use `uv run coverage html` to generate `htmlcov/index.html` if you need an interactive drill-down.
- **Prioritize targets** (see the ranking sketch after this list):
- Public APIs or shared utilities in `src/agents/` before examples or docs.
- Files with low statement coverage or newly added code at 0%.
- Recent bug fixes or risky code paths (error handling, retries, timeouts, concurrency).
- **Design impactful tests**:
- Hit uncovered paths: error cases, boundary inputs, optional flags, and cancellation/timeouts.
- Cover branching logic and input combinations rather than trivial happy paths.
- Place tests under `tests/` and avoid flaky async timing.
- **Coordinate with the user**: Present a numbered, concise list of proposed test additions and expected coverage gains. Ask explicitly before editing code or fixtures.
- **After implementation**: Rerun coverage, report the updated summary, and note any remaining low-coverage areas.
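
A hedged ranking sketch for the prioritization step; it assumes a coverage.py version whose `report` command supports `--sort` and existing `.coverage` data from `make coverage`:

```bash
# Lowest-coverage files first; review the top entries for high-impact test targets.
uv run coverage report -m --sort=cover | head -n 25
```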

## Notes

- Keep any added comments or code in English.
- Do not create `scripts/`, `references/`, or `assets/` unless needed later.
- If coverage artifacts are missing or stale, rerun `make coverage` instead of guessing.