Commit b7aa5d6

Merge branch 'main' into fix/antigravity-credential-stuck-unavailable
2 parents 0af8a39 + 73a2395 commit b7aa5d6

12 files changed (+4307, -937 lines)

.env.example

Lines changed: 77 additions & 0 deletions
@@ -159,6 +159,83 @@ MAX_CONCURRENT_REQUESTS_PER_KEY_GEMINI=1
159159
MAX_CONCURRENT_REQUESTS_PER_KEY_ANTHROPIC=1
160160
MAX_CONCURRENT_REQUESTS_PER_KEY_IFLOW=1
161161

162+
# --- Credential Rotation Mode ---
163+
# Controls how credentials are rotated when multiple are available for a provider.
164+
# This affects how the proxy selects the next credential to use for requests.
165+
#
166+
# Available modes:
167+
# balanced - (Default) Rotate credentials evenly across requests to distribute load.
168+
# Best for API keys with per-minute rate limits.
169+
# sequential - Use one credential until it's exhausted (429 error), then switch to next.
170+
# Best for credentials with daily/weekly quotas (e.g., free tier accounts).
171+
# When a credential hits quota, it's put on cooldown based on the reset time
172+
# parsed from the provider's error response.
173+
#
174+
# Format: ROTATION_MODE_<PROVIDER_NAME>=<mode>
175+
#
176+
# Provider Defaults:
177+
# - antigravity: sequential (free tier accounts with daily quotas)
178+
# - All others: balanced
179+
#
180+
# Example:
181+
# ROTATION_MODE_GEMINI=sequential # Use Gemini keys until quota exhausted
182+
# ROTATION_MODE_OPENAI=balanced # Distribute load across OpenAI keys (default)
183+
# ROTATION_MODE_ANTIGRAVITY=balanced # Override Antigravity's sequential default
184+
#
185+
# ROTATION_MODE_GEMINI=balanced
186+
# ROTATION_MODE_ANTIGRAVITY=sequential
187+
188+
# --- Priority-Based Concurrency Multipliers ---
189+
# Credentials can be assigned to priority tiers (1=highest, 2, 3, etc.).
190+
# Each tier can have a concurrency multiplier that increases the effective
191+
# concurrent request limit for credentials in that tier.
192+
#
193+
# How it works:
194+
# effective_concurrent_limit = MAX_CONCURRENT_REQUESTS_PER_KEY * tier_multiplier
195+
#
196+
# This allows paid/premium credentials to handle more concurrent requests than
197+
# free tier credentials, regardless of rotation mode.
198+
#
199+
# Provider Defaults (built into provider classes):
200+
# Antigravity:
201+
# Priority 1: 5x (paid ultra tier)
202+
# Priority 2: 3x (standard paid tier)
203+
# Priority 3+: 2x (sequential mode) or 1x (balanced mode)
204+
# Gemini CLI:
205+
# Priority 1: 5x
206+
# Priority 2: 3x
207+
# Others: 1x (all modes)
208+
#
209+
# Format: CONCURRENCY_MULTIPLIER_<PROVIDER>_PRIORITY_<N>=<multiplier>
210+
#
211+
# Mode-specific overrides (optional):
212+
# Format: CONCURRENCY_MULTIPLIER_<PROVIDER>_PRIORITY_<N>_<MODE>=<multiplier>
213+
#
214+
# Examples:
215+
# CONCURRENCY_MULTIPLIER_ANTIGRAVITY_PRIORITY_1=10 # Override P1 to 10x
216+
# CONCURRENCY_MULTIPLIER_ANTIGRAVITY_PRIORITY_3=1 # Override P3 to 1x
217+
# CONCURRENCY_MULTIPLIER_ANTIGRAVITY_PRIORITY_2_BALANCED=1 # P2 = 1x in balanced mode only
218+
219+
# --- Model Quota Groups ---
220+
# Models that share quota/cooldown timing. When one model in a group hits
221+
# quota exhausted (429), all models in the group receive the same cooldown timestamp.
222+
# They also reset (archive stats) together when the quota period expires.
223+
#
224+
# This is useful for providers where multiple model variants share the same
225+
# underlying quota (e.g., Claude Sonnet and Opus on Antigravity).
226+
#
227+
# Format: QUOTA_GROUPS_<PROVIDER>_<GROUP>="model1,model2,model3"
228+
#
229+
# To DISABLE a default group, set it to empty string:
230+
# QUOTA_GROUPS_ANTIGRAVITY_CLAUDE=""
231+
#
232+
# Default groups:
233+
# ANTIGRAVITY.CLAUDE: claude-sonnet-4-5,claude-opus-4-5
234+
#
235+
# Examples:
236+
# QUOTA_GROUPS_ANTIGRAVITY_CLAUDE="claude-sonnet-4-5,claude-opus-4-5"
237+
# QUOTA_GROUPS_ANTIGRAVITY_GEMINI="gemini-3-pro-preview,gemini-3-pro-image-preview"
238+
162239
# ------------------------------------------------------------------------------
163240
# | [ADVANCED] Proxy Configuration |
164241
# ------------------------------------------------------------------------------

DOCUMENTATION.md

Lines changed: 243 additions & 5 deletions
@@ -96,37 +96,50 @@ The `_safe_streaming_wrapper` is a critical component for stability. It:
9696

9797
### 2.2. `usage_manager.py` - Stateful Concurrency & Usage Management
9898

99-
This class is the stateful core of the library, managing concurrency, usage tracking, and cooldowns.
99+
This class is the stateful core of the library, managing concurrency, usage tracking, cooldowns, and quota resets.
100100

101101
#### Key Concepts
102102

103103
* **Async-Native & Lazy-Loaded**: Fully asynchronous, using `aiofiles` for non-blocking file I/O. Usage data is loaded only when needed.
104104
* **Fine-Grained Locking**: Each API key has its own `asyncio.Lock` and `asyncio.Condition`. This allows for highly granular control.
105+
* **Multiple Reset Modes**: Supports three reset strategies:
106+
- **per_model**: Each model has an independent usage window with an authoritative `quota_reset_ts` (taken from provider errors)
107+
- **credential**: One window per credential with custom duration (e.g., 5 hours, 7 days)
108+
- **daily**: Legacy daily reset at `daily_reset_time_utc`
109+
* **Model Quota Groups**: Models can be grouped to share quota limits. When one model in a group hits quota, all receive the same reset timestamp.
105110

106111
#### Tiered Key Acquisition Strategy
107112

108113
The `acquire_key` method uses a sophisticated strategy to balance load:
109114

110115
1. **Filtering**: Keys currently on cooldown (global or model-specific) are excluded.
111-
2. **Tiering**: Valid keys are split into two tiers:
116+
2. **Rotation Mode**: Determines credential selection strategy:
117+
* **Balanced Mode** (default): Credentials sorted by usage count - least-used first for even distribution
118+
* **Sequential Mode**: Credentials sorted by usage count descending - most-used first to maintain sticky behavior until exhausted
119+
3. **Tiering**: Valid keys are split into two tiers:
112120
* **Tier 1 (Ideal)**: Keys that are completely idle (0 concurrent requests).
113121
* **Tier 2 (Acceptable)**: Keys that are busy but still under their configured `MAX_CONCURRENT_REQUESTS_PER_KEY_<PROVIDER>` limit for the requested model. This allows a single key to be used multiple times for the same model, maximizing throughput.
114-
3. **Selection Strategy** (configurable via `rotation_tolerance`):
122+
4. **Selection Strategy** (configurable via `rotation_tolerance`):
115123
* **Deterministic (tolerance=0.0)**: Within each tier, keys are sorted by daily usage count and the least-used key is always selected. This provides perfect load balance but predictable patterns.
116124
* **Weighted Random (tolerance>0, default)**: Keys are selected randomly with weights biased toward less-used ones:
117125
- Formula: `weight = (max_usage - credential_usage) + tolerance + 1`
118126
- `tolerance=2.0` (recommended): Balanced randomness - credentials within 2 uses of the maximum can still be selected with reasonable probability
119127
- `tolerance=5.0+`: High randomness - even heavily-used credentials have significant probability
120128
- **Security Benefit**: Unpredictable selection patterns make rate limit detection and fingerprinting harder
121129
- **Load Balance**: Lower-usage credentials still preferred, maintaining reasonable distribution
122-
4. **Concurrency Limits**: Checks against `max_concurrent` limits to prevent overloading a single key.
123-
5. **Priority Groups**: When credential prioritization is enabled, higher-tier credentials (lower priority numbers) are tried first before moving to lower tiers.
130+
5. **Concurrency Limits**: Checks against `max_concurrent` limits (with priority multipliers applied) to prevent overloading a single key.
131+
6. **Priority Groups**: When credential prioritization is enabled, higher-tier credentials (lower priority numbers) are tried first before moving to lower tiers.
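
The selection step above can be sketched in a few lines. This is a minimal illustration of the documented weight formula, not the library's `acquire_key` implementation; the helper names `selection_weights` and `pick_credential` are hypothetical, and the sketch assumes usage counts are already available per credential within one tier.

```python
import random
from typing import Dict, Optional


def selection_weights(usage: Dict[str, int], tolerance: float) -> Dict[str, float]:
    """Weight per credential: (max_usage - credential_usage) + tolerance + 1."""
    max_usage = max(usage.values())
    return {cred: (max_usage - used) + tolerance + 1 for cred, used in usage.items()}


def pick_credential(usage: Dict[str, int], tolerance: float = 2.0,
                    rng: Optional[random.Random] = None) -> str:
    """Pick one credential from a tier (hypothetical helper, not the library API)."""
    if tolerance == 0.0:
        # Deterministic mode: always the least-used credential.
        return min(usage, key=usage.get)
    rng = rng or random.Random()
    weights = selection_weights(usage, tolerance)
    creds = list(weights)
    # Weighted random mode: less-used credentials get larger weights.
    return rng.choices(creds, weights=[weights[c] for c in creds], k=1)[0]
```

With usage counts of 0, 5, and 10 and `tolerance=2.0`, the weights are 13, 8, and 3, so the most-used credential is still picked about one time in eight.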
124132

125133
#### Failure Handling & Cooldowns
126134

127135
* **Escalating Backoff**: When a failure occurs, the key gets a temporary cooldown for that specific model. Consecutive failures increase this time (10s -> 30s -> 60s -> 120s).
128136
* **Key-Level Lockouts**: If a key accumulates failures across multiple distinct models (3+), it is assumed to be dead/revoked and placed on a global 5-minute lockout.
129137
* **Authentication Errors**: Immediate 5-minute global lockout.
138+
* **Quota Exhausted Errors**: When a provider returns a quota exhausted error with an authoritative reset timestamp:
139+
- The `quota_reset_ts` is extracted from the error response (via provider's `parse_quota_error()` method)
140+
- Applied to the affected model (and all models in its quota group if defined)
141+
- Cooldown preserved even during daily/window resets until the actual quota reset time
142+
- Logs show the exact reset time in local timezone with ISO format
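
A rough sketch of the backoff and lockout rules listed above. The constants come from the text; the function names are hypothetical, and capping the cooldown at 120s after the fourth consecutive failure is an assumption, since the text does not say what happens beyond that point.

```python
from typing import Set

BACKOFF_SECONDS = [10, 30, 60, 120]  # per-model cooldown schedule from the text
KEY_LOCKOUT_SECONDS = 300            # 5-minute global lockout
FAILED_MODELS_FOR_LOCKOUT = 3        # distinct models with failures


def model_cooldown(consecutive_failures: int) -> int:
    """Cooldown in seconds after the Nth consecutive failure (N >= 1)."""
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be >= 1")
    # Assumption: the schedule stays at its last value once exhausted.
    index = min(consecutive_failures, len(BACKOFF_SECONDS)) - 1
    return BACKOFF_SECONDS[index]


def needs_key_lockout(failed_models: Set[str], is_auth_error: bool = False) -> bool:
    """True when the whole key should be locked out for KEY_LOCKOUT_SECONDS."""
    return is_auth_error or len(failed_models) >= FAILED_MODELS_FOR_LOCKOUT
```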
130143

131144
### 2.3. `batch_manager.py` - Efficient Request Aggregation
132145

@@ -406,6 +419,10 @@ The most sophisticated provider implementation, supporting Google's internal Ant
406419
- **Thought Signature Caching**: Server-side caching of encrypted signatures for multi-turn Gemini 3 conversations
407420
- **Model-Specific Logic**: Automatic configuration based on model type (Gemini 3, Claude Sonnet, Claude Opus)
408421
- **Credential Prioritization**: Automatic tier detection with paid credentials prioritized over free (paid tier resets every 5 hours, free tier resets weekly)
422+
- **Sequential Rotation Mode**: Default rotation mode is sequential (use credentials until exhausted) to maximize thought signature cache hits
423+
- **Per-Model Quota Tracking**: Each model tracks independent usage windows with authoritative reset timestamps from quota errors
424+
- **Quota Groups**: Claude models (Sonnet 4.5 + Opus 4.5) are grouped to share quota limits (a default group per `.env.example`; set `QUOTA_GROUPS_ANTIGRAVITY_CLAUDE=""` to disable it)
425+
- **Priority Multipliers**: Paid tier credentials get higher concurrency limits (Priority 1: 5x, Priority 2: 3x, Priority 3+: 2x in sequential mode)
409426

410427
#### Model Support
411428

@@ -585,6 +602,221 @@ cache/
585602

586603
---
587604

605+
### 2.13. Sequential Rotation & Per-Model Quota Tracking
606+
607+
PR #31 introduced per-provider rotation modes, per-model quota tracking, model quota groups, and priority-based concurrency multipliers.
608+
609+
#### Rotation Modes
610+
611+
Two rotation strategies are available per provider:
612+
613+
**Balanced Mode (Default)**:
614+
- Distributes load evenly across all credentials
615+
- Least-used credentials selected first
616+
- Best for providers with per-minute rate limits
617+
- Prevents any single credential from being overused
618+
619+
**Sequential Mode**:
620+
- Uses one credential until it's exhausted (429 quota error)
621+
- Switches to next credential only after current one fails
622+
- Most-used credentials selected first (sticky behavior)
623+
- Best for providers with daily/weekly quotas
624+
- Maximizes cache hit rates (e.g., Antigravity thought signatures)
625+
- Default for Antigravity provider
626+
627+
**Configuration**:
628+
```env
# Set per provider
ROTATION_MODE_GEMINI=sequential
ROTATION_MODE_OPENAI=balanced
ROTATION_MODE_ANTIGRAVITY=balanced  # Override default
```
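
As an illustration of how these settings could be resolved and applied, the sketch below reads `ROTATION_MODE_<PROVIDER>` and orders credentials accordingly. The function names are hypothetical, and falling back to the provider default on an unrecognized value is an assumption rather than documented behavior.

```python
import os
from typing import Dict, List, Mapping

# Provider defaults from the text: antigravity is sequential, all others balanced.
PROVIDER_DEFAULT_MODES = {"antigravity": "sequential"}


def rotation_mode(provider: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve the rotation mode for a provider from ROTATION_MODE_<PROVIDER>."""
    value = env.get(f"ROTATION_MODE_{provider.upper()}", "").strip().lower()
    if value in ("balanced", "sequential"):
        return value
    # Assumption: unset or invalid values fall back to the provider default.
    return PROVIDER_DEFAULT_MODES.get(provider.lower(), "balanced")


def order_credentials(usage: Dict[str, int], mode: str) -> List[str]:
    """Balanced: least-used first. Sequential: most-used first (sticky)."""
    return sorted(usage, key=usage.get, reverse=(mode == "sequential"))
```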
634+
635+
#### Per-Model Quota Tracking
636+
637+
Instead of tracking usage at the credential level, the system now supports granular per-model tracking:
638+
639+
**Data Structure** (when `mode="per_model"`):
640+
```json
{
  "credential_id": {
    "models": {
      "gemini-2.5-pro": {
        "window_start_ts": 1733678400.0,
        "quota_reset_ts": 1733696400.0,
        "success_count": 15,
        "prompt_tokens": 5000,
        "completion_tokens": 1000,
        "approx_cost": 0.05,
        "window_started": "2025-12-08 14:00:00 +0100",
        "quota_resets": "2025-12-08 19:00:00 +0100"
      }
    },
    "global": {...},
    "model_cooldowns": {...}
  }
}
```
660+
661+
**Key Features**:
662+
- Each model tracks its own usage window independently
663+
- `window_start_ts`: When the current quota period started
664+
- `quota_reset_ts`: Authoritative reset time from provider error response
665+
- Human-readable timestamps added for debugging
666+
- Supports custom window durations (5h, 7d, etc.)
667+
668+
#### Provider-Specific Quota Parsing
669+
670+
Providers can implement `parse_quota_error()` to extract precise reset times from error responses:
671+
672+
```python
from typing import Dict, Optional

# Defined as a static method on the provider class.
@staticmethod
def parse_quota_error(error, error_body) -> Optional[Dict]:
    """Extract quota reset timestamp from provider error.

    Returns:
        {
            'quota_reset_timestamp': 1733696400.0,  # Unix timestamp
            'retry_after': 18000  # Seconds until reset
        }
    """
```
684+
685+
**Google RPC Format** (Antigravity, Gemini CLI):
686+
- Parses `RetryInfo` and `ErrorInfo` from error details
687+
- Handles duration strings: `"143h4m52.73s"` or `"515092.73s"`
688+
- Extracts `quotaResetTimeStamp` and converts to Unix timestamp
689+
- Falls back to `quotaResetDelay` if timestamp not available
690+
691+
**Example Error Response**:
692+
```json
{
  "error": {
    "code": 429,
    "message": "Quota exceeded",
    "details": [{
      "@type": "type.googleapis.com/google.rpc.RetryInfo",
      "retryDelay": "143h4m52.73s"
    }, {
      "@type": "type.googleapis.com/google.rpc.ErrorInfo",
      "metadata": {
        "quotaResetTimeStamp": "2025-12-08T19:00:00Z"
      }
    }]
  }
}
```
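
To make the parsing rules concrete, here is a minimal sketch that handles a body shaped like the example above. It is not the project's parser: `parse_duration` and `parse_google_quota_error` are hypothetical names, only `h`/`m`/`s` units are handled, and placing `quotaResetDelay` inside the `ErrorInfo` metadata is an assumption.

```python
import re
from datetime import datetime
from typing import Any, Dict, Optional

_PART = re.compile(r"(\d+(?:\.\d+)?)([hms])")
_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0}


def parse_duration(text: str) -> float:
    """Convert '143h4m52.73s' or '515092.73s' to seconds (h/m/s units only)."""
    parts = _PART.findall(text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(value) * _SECONDS[unit] for value, unit in parts)


def parse_google_quota_error(body: Dict[str, Any], now: float) -> Optional[Dict[str, float]]:
    """Return the dict shape documented for parse_quota_error(), or None."""
    reset_ts: Optional[float] = None
    delay: Optional[float] = None
    for detail in body.get("error", {}).get("details", []):
        kind = detail.get("@type", "")
        if kind.endswith("google.rpc.RetryInfo") and "retryDelay" in detail:
            delay = parse_duration(detail["retryDelay"])
        elif kind.endswith("google.rpc.ErrorInfo"):
            meta = detail.get("metadata", {})
            if "quotaResetTimeStamp" in meta:
                # Prefer the authoritative timestamp when present.
                stamp = meta["quotaResetTimeStamp"].replace("Z", "+00:00")
                reset_ts = datetime.fromisoformat(stamp).timestamp()
            elif "quotaResetDelay" in meta:  # assumed location of the fallback field
                delay = parse_duration(meta["quotaResetDelay"])
    if reset_ts is None:
        if delay is None:
            return None
        reset_ts = now + delay  # fall back to a relative delay
    return {"quota_reset_timestamp": reset_ts, "retry_after": max(0.0, reset_ts - now)}
```

Both duration strings quoted in the text describe the same interval: 143 hours, 4 minutes, and 52.73 seconds add up to 515092.73 seconds.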
709+
710+
#### Model Quota Groups
711+
712+
Models that share the same quota limits can be grouped:
713+
714+
**Configuration**:
715+
```env
# Models in a group share quota/cooldown timing
QUOTA_GROUPS_ANTIGRAVITY_CLAUDE="claude-sonnet-4-5,claude-opus-4-5"

# To disable a default group:
QUOTA_GROUPS_ANTIGRAVITY_CLAUDE=""
```
722+
723+
**Behavior**:
724+
- When one model hits quota, all models in the group receive the same `quota_reset_ts`
725+
- Combined weighted usage for credential selection (e.g., Opus counts 2x vs Sonnet)
726+
- Group resets only when ALL models' quotas have reset
727+
- Preserves unexpired cooldowns during other resets
728+
729+
**Provider Implementation**:
730+
```python
class AntigravityProvider(ProviderInterface):
    model_quota_groups = {
        "claude": ["claude-sonnet-4-5", "claude-opus-4-5"]
    }

    model_usage_weights = {
        "claude-opus-4-5": 2  # Opus counts 2x vs Sonnet
    }
```
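
The group behavior can be sketched with plain dictionaries. Everything here is illustrative: the helper names are hypothetical, the state layout mirrors the per-model JSON shown earlier, and treating `success_count` as the usage figure that gets weighted is an assumption.

```python
from typing import Dict, List, Mapping

DEFAULT_GROUPS = {"claude": ["claude-sonnet-4-5", "claude-opus-4-5"]}
USAGE_WEIGHTS = {"claude-opus-4-5": 2}  # Opus counts 2x vs Sonnet


def resolve_groups(provider: str, env: Mapping[str, str]) -> Dict[str, List[str]]:
    """Apply QUOTA_GROUPS_<PROVIDER>_<GROUP> overrides on top of the defaults."""
    groups = {name: list(models) for name, models in DEFAULT_GROUPS.items()}
    prefix = f"QUOTA_GROUPS_{provider.upper()}_"
    for key, value in env.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].lower()
        models = [m.strip() for m in value.split(",") if m.strip()]
        if models:
            groups[name] = models
        else:
            groups.pop(name, None)  # an empty string disables the group
    return groups


def apply_quota_reset(model: str, reset_ts: float, groups: Dict[str, List[str]],
                      state: Dict[str, dict]) -> List[str]:
    """Give every model in the affected group the same quota_reset_ts."""
    affected = next((m for m in groups.values() if model in m), [model])
    for name in affected:
        state.setdefault(name, {})["quota_reset_ts"] = reset_ts
    return list(affected)


def weighted_usage(models: List[str], state: Dict[str, dict]) -> int:
    """Combined usage for a group, applying per-model weights."""
    return sum(state.get(m, {}).get("success_count", 0) * USAGE_WEIGHTS.get(m, 1)
               for m in models)
```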
740+
741+
#### Priority-Based Concurrency Multipliers
742+
743+
Credentials can be assigned to priority tiers with configurable concurrency limits:
744+
745+
**Configuration**:
746+
```env
# Universal multipliers (all modes)
CONCURRENCY_MULTIPLIER_ANTIGRAVITY_PRIORITY_1=10
CONCURRENCY_MULTIPLIER_ANTIGRAVITY_PRIORITY_2=3

# Mode-specific overrides
CONCURRENCY_MULTIPLIER_ANTIGRAVITY_PRIORITY_2_BALANCED=1  # Lower in balanced mode
```
754+
755+
**How it works**:
756+
```python
effective_concurrent_limit = MAX_CONCURRENT_REQUESTS_PER_KEY * tier_multiplier
```
759+
760+
**Provider Defaults** (Antigravity):
761+
- Priority 1 (paid ultra): 5x multiplier
762+
- Priority 2 (standard paid): 3x multiplier
763+
- Priority 3+ (free): 2x (sequential mode) or 1x (balanced mode)
764+
765+
**Benefits**:
766+
- Paid credentials handle more load without manual configuration
767+
- Different concurrency for different rotation modes
768+
- Automatic tier detection based on credential properties
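
One way the multiplier lookup could work is sketched below, using the Antigravity defaults listed above. The function names are hypothetical, and the precedence order (mode-specific variable, then universal variable, then provider default) is an assumption; the text does not spell it out.

```python
from typing import Mapping

PRIORITY_DEFAULTS = {1: 5, 2: 3}                     # Antigravity P1 and P2
FALLBACK_BY_MODE = {"sequential": 2, "balanced": 1}  # Priority 3+


def tier_multiplier(provider: str, priority: int, mode: str,
                    env: Mapping[str, str]) -> int:
    """Resolve the concurrency multiplier for one priority tier."""
    base = f"CONCURRENCY_MULTIPLIER_{provider.upper()}_PRIORITY_{priority}"
    # Assumed precedence: mode-specific override first, then the universal one.
    for key in (f"{base}_{mode.upper()}", base):
        if key in env:
            return int(env[key])
    return PRIORITY_DEFAULTS.get(priority, FALLBACK_BY_MODE[mode])


def effective_limit(max_per_key: int, provider: str, priority: int, mode: str,
                    env: Mapping[str, str]) -> int:
    """effective_concurrent_limit = MAX_CONCURRENT_REQUESTS_PER_KEY * tier_multiplier"""
    return max_per_key * tier_multiplier(provider, priority, mode, env)
```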
769+
770+
#### Reset Window Configuration
771+
772+
Providers can specify custom reset windows per priority tier:
773+
774+
```python
class AntigravityProvider(ProviderInterface):
    usage_reset_configs = {
        frozenset([1, 2]): UsageResetConfigDef(
            mode="per_model",
            window_hours=5,  # 5-hour rolling window for paid tiers
            field_name="5h_window"
        ),
        frozenset([3, 4, 5]): UsageResetConfigDef(
            mode="per_model",
            window_hours=168,  # 7-day window for free tier
            field_name="7d_window"
        )
    }
```
789+
790+
**Supported Modes**:
791+
- `per_model`: Independent window per model with authoritative reset times
792+
- `credential`: Single window per credential (legacy)
793+
- `daily`: Daily reset at configured UTC hour (legacy)
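
Because the config keys are `frozenset`s of priorities, finding the config for one credential is a membership scan. The sketch below uses a stand-in dataclass with the three fields shown above; the real `UsageResetConfigDef` may carry more fields, and `reset_config_for` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class UsageResetConfigDef:
    """Stand-in for the library class, limited to the fields shown above."""
    mode: str
    window_hours: int
    field_name: str


USAGE_RESET_CONFIGS: Dict[FrozenSet[int], UsageResetConfigDef] = {
    frozenset([1, 2]): UsageResetConfigDef("per_model", 5, "5h_window"),
    frozenset([3, 4, 5]): UsageResetConfigDef("per_model", 168, "7d_window"),
}


def reset_config_for(priority: int) -> Optional[UsageResetConfigDef]:
    """Return the reset config whose priority set contains this priority."""
    for priorities, config in USAGE_RESET_CONFIGS.items():
        if priority in priorities:
            return config
    return None  # no tier-specific config for this priority


def window_seconds(config: UsageResetConfigDef) -> int:
    """Window length in seconds."""
    return config.window_hours * 3600
```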
794+
795+
#### Usage Flow
796+
797+
1. **Request arrives** for model X (credential Y is chosen in step 3)
798+
2. **Check rotation mode**: Sequential or balanced?
799+
3. **Select credential**:
800+
- Filter by priority tier requirements
801+
- Apply concurrency multiplier for effective limit
802+
- Sort by rotation mode strategy
803+
4. **Check quota**:
804+
- Load model's usage data
805+
- Check if within window (`window_start_ts` to `quota_reset_ts`)
806+
- Check model quota groups for combined usage
807+
5. **Execute request**
808+
6. **On success**: Increment model usage count
809+
7. **On quota error**:
810+
- Parse error for `quota_reset_ts`
811+
- Apply to model (and quota group)
812+
- Credential remains on cooldown until reset time
813+
8. **On window expiration**:
814+
- Archive model data to global stats
815+
- Start fresh window with new `window_start_ts`
816+
- Preserve unexpired quota cooldowns
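
Step 8 of the flow above can be sketched as follows. This is an illustration only: the helper names are hypothetical, the field names follow the per-model JSON shown earlier, and treating the window as expired once `window_start_ts + window_seconds` has passed is an assumption.

```python
from typing import Dict

COUNTERS = ("success_count", "prompt_tokens", "completion_tokens")


def on_cooldown(model_data: Dict, now: float) -> bool:
    """True while the authoritative quota reset time is still in the future."""
    reset_ts = model_data.get("quota_reset_ts")
    return reset_ts is not None and now < reset_ts


def roll_window(model_data: Dict, global_stats: Dict, now: float,
                window_seconds: float) -> bool:
    """Archive an expired window and start a new one; return True if rolled."""
    start = model_data.get("window_start_ts")
    if start is None or now < start + window_seconds:
        return False  # window still open (or never started)
    for field in COUNTERS:
        # Archive per-model counters into the global stats, then reset them.
        global_stats[field] = global_stats.get(field, 0) + model_data.get(field, 0)
        model_data[field] = 0
    model_data["window_start_ts"] = now
    if not on_cooldown(model_data, now):
        model_data["quota_reset_ts"] = None  # expired cooldowns are cleared
    return True  # unexpired cooldowns are left untouched
```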
817+
818+
---
819+
588820
### 2.12. Google OAuth Base (`providers/google_oauth_base.py`)
589821

590822
A refactored, reusable OAuth2 base class that eliminates code duplication across Google-based providers.
@@ -637,6 +869,12 @@ The library handles provider idiosyncrasies through specialized "Provider" class
637869

638870
The `GeminiCliProvider` is the most complex implementation, mimicking the Google Cloud Code extension.
639871

872+
**New in PR #31**:
873+
- **Quota Parsing**: Implements `parse_quota_error()` using Google RPC format parser
874+
- **Tier Configuration**: Defines `tier_priorities` and `usage_reset_configs` for automatic priority resolution
875+
- **Balanced Rotation**: Defaults to balanced mode (unlike Antigravity which uses sequential)
876+
- **Priority Multipliers**: Same as Antigravity for P1 (5x) and P2 (3x); all other priorities use 1x in every mode
877+
640878
#### Authentication (`gemini_auth_base.py`)
641879

642880
* **Device Flow**: Uses a standard OAuth 2.0 flow. The `credential_tool` spins up a local web server (`localhost:8085`) to capture the callback from Google's auth page.
