## Overview
This PR implements SDK-native rate limiting by leveraging each LLM
provider's built-in retry mechanisms, significantly improving
reliability while simplifying our codebase.
## Changes
### Core Improvements
- **Increased `max_retries` to 10** for all LLM providers (up from 2-3)
- **OpenAI**: Uses built-in retry with exponential backoff, jitter, and
Retry-After header support
- **Anthropic**: Uses built-in retry with intelligent error
classification
- **Groq**: Uses built-in retry with capacity exceeded handling (498
errors)
- **Google**: Implements custom retry logic with pattern-based error
detection
- **Azure**: Inherits improved retry from OpenAI base class
### Provider-Specific Details
#### OpenAI (`browser_use/llm/openai/chat.py`)
- ✅ Full SDK retry support with `max_retries=10`
- Exponential backoff with jitter
- Retry-After header support
- Automatic rate limit detection (429 errors)
- Server error retries (5xx errors)
#### Anthropic (`browser_use/llm/anthropic/chat.py`)
- ✅ Full SDK retry support with `max_retries=10`
- Intelligent error classification (RateLimitError, APIStatusError)
- Server overload handling (529 errors)
#### Groq (`browser_use/llm/groq/chat.py`)
- ✅ Full SDK retry support with `max_retries=10`
- Rate limit handling (429 errors)
- Capacity exceeded handling (498 errors)
#### Google/Gemini (`browser_use/llm/google/chat.py`)
- ❌ Limited SDK retry support (requires custom implementation)
- Custom retry with 10 attempts and exponential backoff
- Pattern-based error detection for rate limits, server errors,
connection issues
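Since the Google SDK lacks a `max_retries`-style knob, the custom logic amounts to a classify-and-retry loop. A minimal sketch under stated assumptions (the function names and exact pattern list here are illustrative, not the PR's actual code):

```python
import time

# Substrings that mark an error as transient; matching is case-insensitive.
RETRYABLE_PATTERNS = ("rate limit", "resource exhausted", "429",
                      "500", "503", "connection", "timeout")


def is_retryable(error: Exception) -> bool:
    """Pattern-based classification of an error's string representation."""
    message = str(error).lower()
    return any(pattern in message for pattern in RETRYABLE_PATTERNS)


def call_with_retry(fn, max_attempts: int = 10, base_delay: float = 1.0):
    """Retry fn() with exponential backoff; re-raise non-retryable errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as error:
            if not is_retryable(error) or attempt == max_attempts - 1:
                raise
            time.sleep(min(base_delay * 2 ** attempt, 60.0))
```

Pattern matching on error text is more brittle than the typed exception classes the other SDKs expose, which is why this provider is marked ❌ above.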
#### Azure OpenAI (`browser_use/llm/azure/chat.py`)
- ✅ Inherits improved retry from ChatOpenAI base class
## Benefits
1. **🛡️ Reliability**: Provider SDKs know best how to handle their own
error patterns
2. **🧹 Maintenance**: Reduces custom retry code we need to maintain
3. **⚡ Performance**: SDKs implement optimizations like jitter and
Retry-After headers
4. **🎯 Accuracy**: Provider-specific error classification is more
accurate
5. **🚀 Future-Proof**: Automatically benefits from SDK improvements
## Error Handling Strategy
### Retryable Errors
- Rate limits (429, Resource Exhausted)
- Server errors (5xx)
- Connection timeouts/network issues
- Provider-specific capacity errors (e.g., Groq 498)
### Non-Retryable Errors
- Authentication errors (401, 403)
- Bad requests (400, 422)
- Not found (404)
- Quota/billing issues
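The split above can be expressed as a small classifier. A hedged sketch, with the status sets taken from the two lists (`should_retry` is a hypothetical helper; quota/billing errors that surface as 429s would need provider-specific handling on top of this):

```python
# Explicitly retryable: rate limits, Groq capacity, Anthropic overload.
RETRYABLE_STATUS = {429, 498, 529}
# Fail fast: auth, bad requests, not found.
NON_RETRYABLE_STATUS = {400, 401, 403, 404, 422}


def should_retry(status_code: int) -> bool:
    """Decide whether an HTTP status code warrants an automatic retry."""
    if status_code in NON_RETRYABLE_STATUS:
        return False
    # Any 5xx counts as a transient server error.
    return status_code in RETRYABLE_STATUS or 500 <= status_code < 600
```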
## Testing
✅ **Stress testing completed** with 200+ concurrent requests using 24k
tokens each:
- **100% success rate** - no failed requests
- **Automatic retry handling** - response times varied from 3s to 17s,
indicating internal retries
- **Consistent agent output** - all responses correctly returned "Hello
World"
- **Rate limit recovery** - slower requests indicate the SDK handled rate
limits gracefully
## Configuration
All providers now default to **10 retries**, making browser-use
significantly more resilient to temporary API issues during long-running
automation tasks.
Users can still override this by setting `max_retries` when instantiating
a model:
```python
# Custom retry configuration
model = ChatOpenAI(model="gpt-4o", max_retries=15)
```
## Impact
This change makes browser-use much more reliable for production
automation by leveraging battle-tested retry logic that each provider
has optimized for their APIs.
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Added SDK-native rate limiting for all LLM providers by increasing
default retries to 10 and using each provider's built-in retry logic.
This makes browser-use more reliable during automation by handling rate
limits and server errors automatically.
- **Bug Fixes**
- OpenAI, Anthropic, and Groq now use their SDKs' retry features for
rate limits and server errors.
- Google uses custom retry logic with pattern-based error detection.
- Azure inherits improved retry from OpenAI base.
<!-- End of auto-generated description by cubic. -->
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Fixed a bug where the done callback was called twice when a task
completed. Now the callback is only triggered once at the correct time.
<!-- End of auto-generated description by cubic. -->
Auto-generated PR for branch: Set-default-parameter
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Changed default settings to allow up to 10 actions per step, disable
memory by default, and reduce structured data extraction size to 40,000
characters. Improved error message for missing element indices.
- **Bug Fixes**
- Updated error message to suggest scrolling for more elements when an
index is missing.
- **Refactors**
- Set max_actions_per_step default to 10.
- Set enable_memory default to False.
- Reduced max_chars for extraction to 40,000.
<!-- End of auto-generated description by cubic. -->
This PR improves the evaluation judge system by adding explicit criteria
for scoring tasks that produce no output.
**Changes:**
- Added clarification that tasks producing no output must receive low
scores
- Improved evaluation consistency for incomplete task scenarios
- Ensures judges properly penalize agents that fail to produce results
**Problem Addressed:**
Previously, the evaluation criteria didn't explicitly state how to
handle cases where agents complete their steps but produce no meaningful
output. This could lead to inconsistent scoring where some judges might
give moderate scores for "effort" even when no results were delivered.
**Benefits:**
- **Clearer guidance**: Judges now have explicit instruction on scoring
no-output scenarios
- **Consistent evaluation**: Reduces variability in how different judges
score incomplete tasks
- **Better agent assessment**: Ensures that failure to produce output is
properly reflected in scoring
- **Improved training data**: More accurate scores help improve future
agent development
**Impact:**
This change helps maintain evaluation accuracy by making it clear that
lack of output is a significant failure condition that should result in
low scores, regardless of the apparent "effort" shown in the agent's
trajectory.
**Testing:**
- No breaking changes to existing functionality
- Pre-commit hooks pass successfully
- Change only affects evaluation scoring guidance
This PR cleans up unused telemetry code in the controller registry
system.
**Changes:**
- Remove `ControllerRegisteredFunctionsTelemetryEvent` and
`RegisteredFunction` classes from telemetry views
- Remove telemetry capture call from controller registry service that
was collecting registered function data
- Clean up unused imports in telemetry module
- Simplify telemetry exports to only include actively used classes
**Benefits:**
- Reduces code complexity by removing dead code
- Eliminates unnecessary telemetry data collection that wasn't being
used
- Cleaner imports and module exports
- Improved maintainability
**Testing:**
- All existing functionality remains unchanged
- No breaking changes to public APIs
- Pre-commit hooks pass successfully
This is a safe cleanup that removes functionality that was implemented
but never actually utilized in the application.
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Removed unused telemetry code from the controller registry to reduce
complexity and clean up the codebase. No changes to existing
functionality.
<!-- End of auto-generated description by cubic. -->
This PR enhances the evaluation judge system with more precise criteria
and guidelines:
**Changes:**
- Removed verbose architecture documentation comments that were
cluttering the file
- Added guideline to avoid file system usage for short tasks unless
specifically required
- Added stricter evaluation criteria for tasks blocked by captchas or
impossible scenarios
- Added criteria for evaluating bad trajectories and inappropriate tool
usage
- Added requirement that incomplete tasks must receive low scores
- Added consideration for trajectory optimization opportunities
**Benefits:**
- Cleaner, more focused code with reduced comment noise
- More precise evaluation guidelines for better agent assessment
- Stricter criteria to prevent false positive evaluations
- Better guidance for scoring edge cases and failure scenarios
These improvements will help provide more accurate and consistent
evaluations of browser-use agent performance.
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Improved the judge system by adding stricter evaluation criteria and
clearer guidelines for scoring agent performance, while removing
unnecessary documentation comments.
- **Evaluation Criteria Updates**
- Added rules to penalize incomplete tasks, bad trajectories, and poor
tool usage.
- Introduced guidelines for handling captchas, impossible tasks, and
unnecessary file system use.
- Reduced comment clutter for easier code maintenance.
<!-- End of auto-generated description by cubic. -->