Commit Graph

3637 Commits

Author SHA1 Message Date
Nick Sweeting
984b287ede lower CI performance expectations 2025-06-27 00:12:37 -07:00
Nick Sweeting
db7998efd0 fix type checks 2025-06-27 00:10:30 -07:00
Nick Sweeting
6c05a720ed auto-clear singletonLock on startup 2025-06-27 00:09:40 -07:00
Nick Sweeting
1415562d6e add portalocker multiprocess semaphore 2025-06-26 23:53:28 -07:00
Nick Sweeting
51b7e8ae66 always clear context and check connection properly using async call 2025-06-26 23:52:54 -07:00
Nick Sweeting
946ef4068d add stop hints and better logging 2025-06-26 23:52:19 -07:00
Nick Sweeting
ce2db1b9a8 exclude debug files from pytest 2025-06-26 22:58:36 -07:00
Nick Sweeting
747959a3d0 keep reference to the original browser session when copying 2025-06-26 22:45:05 -07:00
Nick Sweeting
112162cd3f prevent parent session from being garbage collected 2025-06-26 22:35:07 -07:00
Nick Sweeting
4f1fd7523f fix playwright internals calls 2025-06-26 22:33:23 -07:00
Nick Sweeting
7e5fc61b08 use raw channel directly for cdp 2025-06-26 20:06:45 -07:00
Nick Sweeting
7f4a3017c4 use raw channel directly for cdp 2025-06-26 20:03:36 -07:00
Nick Sweeting
c3bd596961 manually copy over browser_context instance when forking BrowserSession 2025-06-26 20:00:56 -07:00
Nick Sweeting
e054756f13 increase parallel browsers for stress test 2025-06-26 19:53:43 -07:00
Nick Sweeting
48fd583ae1 use raw CDP call for screenshot instead of playwright CDP session 2025-06-26 19:52:50 -07:00
Nick Sweeting
13048669e2 reset browser connection when screenshot fails 2025-06-26 19:07:09 -07:00
Nick Sweeting
8dfad7e557 tweak chrome launch args in docker to fix screenshots 2025-06-26 18:48:46 -07:00
Magnus Müller
b2059653e9 feat: implement SDK-native rate limiting for improved reliability (#2136)
## Overview

This PR implements SDK-native rate limiting by leveraging each LLM
provider's built-in retry mechanisms, significantly improving
reliability while simplifying our codebase.

## Changes

### Core Improvements
- **Increased max_retries to 10** for all LLM providers (was 2-3)
- **OpenAI**: Uses built-in retry with exponential backoff, jitter, and
Retry-After header support
- **Anthropic**: Uses built-in retry with intelligent error
classification
- **Groq**: Uses built-in retry with capacity exceeded handling (498
errors)
- **Google**: Implements custom retry logic with pattern-based error
detection
- **Azure**: Inherits improved retry from OpenAI base class

### Provider-Specific Details

#### OpenAI (`browser_use/llm/openai/chat.py`)
- Full SDK retry support with `max_retries=10`
- Exponential backoff with jitter
- Retry-After header support
- Automatic rate limit detection (429 errors)
- Server error retries (5xx errors)

#### Anthropic (`browser_use/llm/anthropic/chat.py`)
- Full SDK retry support with `max_retries=10`
- Intelligent error classification (RateLimitError, APIStatusError)
- Server overload handling (529 errors)

#### Groq (`browser_use/llm/groq/chat.py`)
- Full SDK retry support with `max_retries=10`
- Rate limit handling (429 errors)
- Capacity exceeded handling (498 errors)

#### Google/Gemini (`browser_use/llm/google/chat.py`)
- Limited SDK retry support (requires custom implementation)
- Custom retry with 10 attempts and exponential backoff
- Pattern-based error detection for rate limits, server errors,
connection issues

#### Azure OpenAI (`browser_use/llm/azure/chat.py`)
- Inherits improved retry from ChatOpenAI base class

## Benefits

1. **🛡️ Reliability**: Provider SDKs know best how to handle their own
error patterns
2. **🧹 Maintenance**: Reduces custom retry code we need to maintain
3. **Performance**: SDKs implement optimizations like jitter and
Retry-After headers
4. **🎯 Accuracy**: Provider-specific error classification is more
accurate
5. **🚀 Future-Proof**: Automatically benefits from SDK improvements

## Error Handling Strategy

### Retryable Errors
- Rate limits (429, Resource Exhausted)
- Server errors (5xx)
- Connection timeouts/network issues
- Provider-specific capacity errors (e.g., Groq 498)

### Non-Retryable Errors  
- Authentication errors (401, 403)
- Bad requests (400, 422)
- Not found (404)
- Quota/billing issues
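The retryable vs. non-retryable split above can be expressed as a small status-code classifier. This is a simplified sketch of the strategy, not the providers' SDK logic; `should_retry` is an invented helper.

```python
# Rate limits (429), provider-specific capacity errors (Groq 498),
# and server overload (Anthropic 529) are retryable by enumeration;
# any other 5xx is retryable as a generic server error.
RETRYABLE_STATUS = {429, 498, 529}
# Auth errors, bad requests, and not-found are permanent failures.
NON_RETRYABLE_STATUS = {400, 401, 403, 404, 422}

def should_retry(status_code: int) -> bool:
    if status_code in NON_RETRYABLE_STATUS:
        return False
    return status_code in RETRYABLE_STATUS or 500 <= status_code < 600
```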

## Testing

**Stress testing completed** with 200+ concurrent requests using 24k
tokens each:
- **100% success rate** - no failed requests
- **Automatic retry handling** - response times varied from 3s to 17s,
indicating internal retries
- **Consistent agent output** - all responses correctly returned "Hello
World"
- **Rate limit recovery** - slower requests indicate the SDK handled rate
limits gracefully

## Configuration

All providers now default to **10 retries**, making browser-use
significantly more resilient to temporary API issues during long-running
automation tasks.

Users can still override by setting `max_retries` when instantiating
models:

```python
# Custom retry configuration
model = ChatOpenAI(model="gpt-4o", max_retries=15)
```

## Impact

This change makes browser-use much more reliable for production
automation by leveraging battle-tested retry logic that each provider
has optimized for their APIs.
    
<!-- This is an auto-generated description by cubic. -->
---

## Summary by cubic
Added SDK-native rate limiting for all LLM providers by increasing
default retries to 10 and using each provider's built-in retry logic.
This makes browser-use more reliable during automation by handling rate
limits and server errors automatically.

- **Bug Fixes**
  - OpenAI, Anthropic, and Groq now use their SDKs' retry features for rate limits and server errors.
  - Google uses custom retry logic with pattern-based error detection.
  - Azure inherits improved retry from OpenAI base.

<!-- End of auto-generated description by cubic. -->
2025-06-26 23:00:40 +02:00
Magnus Müller
8b80b06566 cleanup: remove temporary documentation file 2025-06-26 18:09:06 +02:00
Magnus Müller
7fb8e5b39b feat: implement SDK-native rate limiting for improved reliability
- Increase max_retries to 10 for all LLM providers (was 2-3)
- OpenAI: Use built-in retry with exponential backoff, jitter, Retry-After header support
- Anthropic: Use built-in retry with intelligent error classification
- Groq: Use built-in retry with capacity exceeded handling (498 errors)
- Google: Implement custom retry logic with pattern-based error detection
- Azure: Inherits improved retry from OpenAI base class

Benefits:
- More resilient to temporary API issues during automation
- Leverages provider SDKs' optimized retry logic
- Reduces custom retry code maintenance burden
- Automatically handles provider-specific error patterns

All providers now retry rate limits, server errors, and connection issues
while skipping authentication errors and bad requests.
2025-06-26 18:08:51 +02:00
Mert Unsal
081eb7bfe6 fix MAJOR bug with done callback being called twice (#2130)
<!-- This is an auto-generated description by cubic. -->

## Summary by cubic
Fixed a bug where the done callback was called twice when a task
completed. Now the callback is only triggered once at the correct time.

<!-- End of auto-generated description by cubic. -->
2025-06-26 17:42:31 +02:00
Nick Sweeting
8e42ed5c6d Merge branch 'main' into mert/fix_done_callback 2025-06-26 08:33:13 -07:00
Nick Sweeting
15ffc8c2fd Use a device UUID to associate sync sessions instead of ephemeral usercode (#2127) 2025-06-26 08:32:57 -07:00
Mert Unsal
785e14fed8 Merge branch 'main' into mert/fix_done_callback 2025-06-26 17:32:24 +02:00
mertunsall
7b1630b443 fix MAJOR bug with done callback being caleld twice 2025-06-26 17:31:57 +02:00
Nick Sweeting
6f1e7596cc Merge branch 'main' into sync-device-id 2025-06-26 08:29:17 -07:00
Magnus Müller
aecc1becee Set-default-parameter (#2128)
Auto-generated PR for branch: Set-default-parameter
    
<!-- This is an auto-generated description by cubic. -->
---

## Summary by cubic
Changed default settings to allow up to 10 actions per step, disable
memory by default, and reduce structured data extraction size to 40,000
characters. Improved error message for missing element indices.

- **Bug Fixes**
  - Updated error message to suggest scrolling for more elements when an index is missing.

- **Refactors**
  - Set max_actions_per_step default to 10.
  - Set enable_memory default to False.
  - Reduced max_chars for extraction to 40,000.
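For illustration, the new defaults listed above can be captured in a small settings record. `AgentDefaults` is a hypothetical stand-in for this sketch, not browser-use's actual settings class.

```python
from dataclasses import dataclass

# Hypothetical record of the defaults changed by this PR.
@dataclass
class AgentDefaults:
    max_actions_per_step: int = 10      # up to 10 actions per step
    enable_memory: bool = False         # memory disabled by default
    max_chars_extraction: int = 40_000  # structured extraction size cap

defaults = AgentDefaults()
```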

<!-- End of auto-generated description by cubic. -->
2025-06-26 17:23:48 +02:00
Nick Sweeting
5158419fc7 Update browser_use/browser/session.py
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-06-26 11:16:48 -04:00
Magnus Müller
03893b8adf Update comment to 40000 in service.py 2025-06-26 17:16:43 +02:00
Magnus Müller
42961a49f3 Merge branch 'main' into Set-default-parameter 2025-06-26 17:08:47 +02:00
Magnus Müller
4844bd048f Set default parameter - no memory, actions per step 10, and extraction input size 2025-06-26 17:08:15 +02:00
Nick Sweeting
90862026fa persist a unique device_id and send it with sync events 2025-06-26 08:00:20 -07:00
Nick Sweeting
18bd8fe715 tweak timeouts to fail faster 2025-06-26 07:59:55 -07:00
Nick Sweeting
dd837fd3d4 use retry decorator for locking and timeouts instead of manual logic (#2126) 2025-06-26 07:38:29 -07:00
Nick Sweeting
a1ce6ed5c2 fix linter 2025-06-26 07:34:47 -07:00
Nick Sweeting
b3ac5eb152 Merge branch 'main' into locking 2025-06-26 07:30:21 -07:00
Nick Sweeting
4b2d7c270b fix tests 2025-06-26 07:30:03 -07:00
Nick Sweeting
ea2c2205a3 swap decorator decorateee 2025-06-26 07:14:02 -07:00
Nick Sweeting
fc075c0152 use retry decorator for locking and timeouts instead ofmanual logic 2025-06-26 07:07:13 -07:00
Nick Sweeting
12583f917d start auth flow on first step 2025-06-26 06:50:33 -07:00
Magnus Müller
67d3f136eb Add explicit scoring criteria for tasks with no output (#2124)
This PR improves the evaluation judge system by adding explicit criteria
for scoring tasks that produce no output.

**Changes:**
- Added clarification that tasks producing no output must receive low
scores
- Improved evaluation consistency for incomplete task scenarios
- Ensures judges properly penalize agents that fail to produce results

**Problem Addressed:**
Previously, the evaluation criteria didn't explicitly state how to
handle cases where agents complete their steps but produce no meaningful
output. This could lead to inconsistent scoring where some judges might
give moderate scores for "effort" even when no results were delivered.

**Benefits:**
- **Clearer guidance**: Judges now have explicit instruction on scoring
no-output scenarios
- **Consistent evaluation**: Reduces variability in how different judges
score incomplete tasks
- **Better agent assessment**: Ensures that failure to produce output is
properly reflected in scoring
- **Improved training data**: More accurate scores help improve future
agent development

**Impact:**
This change helps maintain evaluation accuracy by making it clear that
lack of output is a significant failure condition that should result in
low scores, regardless of the apparent "effort" shown in the agent's
trajectory.

**Testing:**
- No breaking changes to existing functionality
- Pre-commit hooks pass successfully
- Change only affects evaluation scoring guidance
2025-06-26 15:34:52 +02:00
Magnus Müller
e004a7efd8 Add explicit scoring criteria for tasks with no output
- Clarify that tasks producing no output must receive low scores
- Improve evaluation consistency for incomplete task scenarios
- Ensure judges properly penalize agents that fail to produce results

This helps maintain evaluation accuracy by explicitly stating that
lack of output is a significant failure condition that should be
reflected in low scoring.
2025-06-26 15:34:01 +02:00
Magnus Müller
92fccdee82 Remove unused controller telemetry functionality (#2123)
This PR cleans up unused telemetry code in the controller registry
system.

**Changes:**
- Remove ControllerRegisteredFunctionsTelemetryEvent and RegisteredFunction classes from telemetry views
- Remove telemetry capture call from controller registry service that
was collecting registered function data
- Clean up unused imports in telemetry module
- Simplify telemetry `__all__` exports to only include actively used classes

**Benefits:**
- Reduces code complexity by removing dead code
- Eliminates unnecessary telemetry data collection that wasn't being
used
- Cleaner imports and module exports
- Improved maintainability

**Testing:**
- All existing functionality remains unchanged
- No breaking changes to public APIs
- Pre-commit hooks pass successfully

This is a safe cleanup that removes functionality that was implemented
but never actually utilized in the application.
    
<!-- This is an auto-generated description by cubic. -->
---

## Summary by cubic
Removed unused telemetry code from the controller registry to reduce
complexity and clean up the codebase. No changes to existing
functionality.

<!-- End of auto-generated description by cubic. -->
2025-06-26 15:15:02 +02:00
Magnus Müller
a8718d02cd Merge branch 'main' into cleanup/remove-unused-telemetry 2025-06-26 15:11:15 +02:00
Magnus Müller
55492b7e3e Improve judge system evaluation criteria (#2121)
This PR enhances the evaluation judge system with more precise criteria
and guidelines:

**Changes:**
- Removed verbose architecture documentation comments that were
cluttering the file
- Added guideline to avoid file system usage for short tasks unless
specifically required
- Added stricter evaluation criteria for tasks blocked by captchas or
impossible scenarios
- Added criteria for evaluating bad trajectories and inappropriate tool
usage
- Added requirement that incomplete tasks must receive low scores
- Added consideration for trajectory optimization opportunities

**Benefits:**
- Cleaner, more focused code with reduced comment noise
- More precise evaluation guidelines for better agent assessment
- Stricter criteria to prevent false positive evaluations
- Better guidance for scoring edge cases and failure scenarios

These improvements will help provide more accurate and consistent
evaluations of browser-use agent performance.
    
<!-- This is an auto-generated description by cubic. -->
---

## Summary by cubic
Improved the judge system by adding stricter evaluation criteria and
clearer guidelines for scoring agent performance, while removing
unnecessary documentation comments.

- **Evaluation Criteria Updates**
  - Added rules to penalize incomplete tasks, bad trajectories, and poor tool usage.
  - Introduced guidelines for handling captchas, impossible tasks, and unnecessary file system use.
  - Reduced comment clutter for easier code maintenance.

<!-- End of auto-generated description by cubic. -->
2025-06-26 15:11:03 +02:00
Magnus Müller
54736224eb Remove unused controller telemetry functionality
- Remove ControllerRegisteredFunctionsTelemetryEvent and RegisteredFunction classes
- Remove telemetry capture call from controller registry service
- Clean up unused imports in telemetry module
- Simplify telemetry __all__ exports

This removes dead code that was capturing registered function telemetry
data but wasn't being used anywhere in the application.
2025-06-26 15:06:17 +02:00
Magnus Müller
4a4336e1a1 Improve judge system evaluation criteria
- Remove verbose architecture documentation comments
- Add guideline to avoid file system usage for short tasks
- Add stricter evaluation criteria for blocked/impossible tasks
- Add criteria for bad trajectories and tool usage
- Add requirement for task completion scoring
- Add trajectory optimization considerations
2025-06-26 15:01:10 +02:00
Nick Sweeting
74d02c07a7 increase screenshot timeout to default timeout 0.3.3 2025-06-26 02:59:57 -07:00
Nick Sweeting
c7b29c9116 tweak browser shutdown locking and logging 2025-06-26 02:32:23 -07:00
Nick Sweeting
c0825fe3a5 increase timeout for tests 2025-06-26 02:05:52 -07:00