Commit Graph

3637 Commits

Author SHA1 Message Date
Nick Sweeting
984b287ede lower CI performance expectations 2025-06-27 00:12:37 -07:00
Nick Sweeting
db7998efd0 fix type checks 2025-06-27 00:10:30 -07:00
Nick Sweeting
6c05a720ed auto-clear singletonLock on startup 2025-06-27 00:09:40 -07:00
Nick Sweeting
1415562d6e add portalocker multiprocess semaphore 2025-06-26 23:53:28 -07:00
Nick Sweeting
51b7e8ae66 always clear context and check connection properly using async call 2025-06-26 23:52:54 -07:00
Nick Sweeting
946ef4068d add stop hints and better logging 2025-06-26 23:52:19 -07:00
Nick Sweeting
ce2db1b9a8 exclude debug files from pytest 2025-06-26 22:58:36 -07:00
Nick Sweeting
747959a3d0 keep reference to the original browser session when copying 2025-06-26 22:45:05 -07:00
Nick Sweeting
112162cd3f prevent parent session from being garbage collected 2025-06-26 22:35:07 -07:00
Nick Sweeting
4f1fd7523f fix playwright internals calls 2025-06-26 22:33:23 -07:00
Nick Sweeting
7e5fc61b08 use raw channel directly for cdp 2025-06-26 20:06:45 -07:00
Nick Sweeting
7f4a3017c4 use raw channel directly for cdp 2025-06-26 20:03:36 -07:00
Nick Sweeting
c3bd596961 manually copy over browser_context instance when forking BrowserSession 2025-06-26 20:00:56 -07:00
Nick Sweeting
e054756f13 increase parallel browsers for stress test 2025-06-26 19:53:43 -07:00
Nick Sweeting
48fd583ae1 use raw CDP call for screenshot instead of playwright CDP session 2025-06-26 19:52:50 -07:00
Nick Sweeting
13048669e2 reset browser connection when screenshot fails 2025-06-26 19:07:09 -07:00
Nick Sweeting
8dfad7e557 tweak chrome launch args in docker to fix screenshots 2025-06-26 18:48:46 -07:00
Magnus Müller
b2059653e9 feat: implement SDK-native rate limiting for improved reliability (#2136)
## Overview

This PR implements SDK-native rate limiting by leveraging each LLM
provider's built-in retry mechanisms, significantly improving
reliability while simplifying our codebase.

## Changes

### Core Improvements
- **Increased max_retries to 10** for all LLM providers (was 2-3)
- **OpenAI**: Uses built-in retry with exponential backoff, jitter, and
Retry-After header support
- **Anthropic**: Uses built-in retry with intelligent error
classification
- **Groq**: Uses built-in retry with capacity exceeded handling (498
errors)
- **Google**: Implements custom retry logic with pattern-based error
detection
- **Azure**: Inherits improved retry from OpenAI base class

### Provider-Specific Details

#### OpenAI (`browser_use/llm/openai/chat.py`)
- Full SDK retry support with `max_retries=10`
- Exponential backoff with jitter
- Retry-After header support
- Automatic rate limit detection (429 errors)
- Server error retries (5xx errors)

#### Anthropic (`browser_use/llm/anthropic/chat.py`)
- Full SDK retry support with `max_retries=10`
- Intelligent error classification (RateLimitError, APIStatusError)
- Server overload handling (529 errors)

#### Groq (`browser_use/llm/groq/chat.py`)
- Full SDK retry support with `max_retries=10`
- Rate limit handling (429 errors)
- Capacity exceeded handling (498 errors)

#### Google/Gemini (`browser_use/llm/google/chat.py`)
- Limited SDK retry support (requires custom implementation)
- Custom retry with 10 attempts and exponential backoff
- Pattern-based error detection for rate limits, server errors,
connection issues

#### Azure OpenAI (`browser_use/llm/azure/chat.py`)
- Inherits improved retry from ChatOpenAI base class

## Benefits

1. **🛡️ Reliability**: Provider SDKs know best how to handle their own
error patterns
2. **🧹 Maintenance**: Reduces custom retry code we need to maintain
3. **Performance**: SDKs implement optimizations like jitter and
Retry-After headers
4. **🎯 Accuracy**: Provider-specific error classification is more
accurate
5. **🚀 Future-Proof**: Automatically benefits from SDK improvements

## Error Handling Strategy

### Retryable Errors
- Rate limits (429, Resource Exhausted)
- Server errors (5xx)
- Connection timeouts/network issues
- Provider-specific capacity errors (e.g., Groq 498)

### Non-Retryable Errors  
- Authentication errors (401, 403)
- Bad requests (400, 422)
- Not found (404)
- Quota/billing issues
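The retryable vs. non-retryable split above can be expressed as a small status-code classifier. This is a simplified sketch of the strategy, not the providers' SDK logic; `should_retry` is an invented helper.

```python
# Rate limits (429), provider-specific capacity errors (Groq 498),
# and server overload (Anthropic 529) are retryable by enumeration;
# any other 5xx is retryable as a generic server error.
RETRYABLE_STATUS = {429, 498, 529}
# Auth errors, bad requests, and not-found are permanent failures.
NON_RETRYABLE_STATUS = {400, 401, 403, 404, 422}

def should_retry(status_code: int) -> bool:
    if status_code in NON_RETRYABLE_STATUS:
        return False
    return status_code in RETRYABLE_STATUS or 500 <= status_code < 600
```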

## Testing

**Stress testing completed** with 200+ concurrent requests using 24k
tokens each:
- **100% success rate** - no failed requests
- **Automatic retry handling** - response times varied from 3s to 17s,
indicating internal retries
- **Consistent agent output** - all responses correctly returned "Hello
World"
- **Rate limit recovery** - slower requests indicate the SDK handled rate
limits gracefully

## Configuration

All providers now default to **10 retries**, making browser-use
significantly more resilient to temporary API issues during long-running
automation tasks.

Users can still override by setting `max_retries` when instantiating
models:

```python
# Custom retry configuration
model = ChatOpenAI(model="gpt-4o", max_retries=15)
```

## Impact

This change makes browser-use much more reliable for production
automation by leveraging battle-tested retry logic that each provider
has optimized for their APIs.
    
<!-- This is an auto-generated description by cubic. -->
---

## Summary by cubic
Added SDK-native rate limiting for all LLM providers by increasing
default retries to 10 and using each provider's built-in retry logic.
This makes browser-use more reliable during automation by handling rate
limits and server errors automatically.

- **Bug Fixes**
  - OpenAI, Anthropic, and Groq now use their SDKs' retry features for rate limits and server errors.
  - Google uses custom retry logic with pattern-based error detection.
  - Azure inherits improved retry from OpenAI base.

<!-- End of auto-generated description by cubic. -->
2025-06-26 23:00:40 +02:00
Magnus Müller
8b80b06566 cleanup: remove temporary documentation file 2025-06-26 18:09:06 +02:00
Magnus Müller
7fb8e5b39b feat: implement SDK-native rate limiting for improved reliability
- Increase max_retries to 10 for all LLM providers (was 2-3)
- OpenAI: Use built-in retry with exponential backoff, jitter, Retry-After header support
- Anthropic: Use built-in retry with intelligent error classification
- Groq: Use built-in retry with capacity exceeded handling (498 errors)
- Google: Implement custom retry logic with pattern-based error detection
- Azure: Inherits improved retry from OpenAI base class

Benefits:
- More resilient to temporary API issues during automation
- Leverages provider SDKs' optimized retry logic
- Reduces custom retry code maintenance burden
- Automatically handles provider-specific error patterns

All providers now retry rate limits, server errors, and connection issues
while skipping authentication errors and bad requests.
2025-06-26 18:08:51 +02:00
Mert Unsal
081eb7bfe6 fix MAJOR bug with done callback being called twice (#2130)
<!-- This is an auto-generated description by cubic. -->

## Summary by cubic
Fixed a bug where the done callback was called twice when a task
completed. Now the callback is only triggered once at the correct time.

<!-- End of auto-generated description by cubic. -->
2025-06-26 17:42:31 +02:00
Nick Sweeting
8e42ed5c6d Merge branch 'main' into mert/fix_done_callback 2025-06-26 08:33:13 -07:00
Nick Sweeting
15ffc8c2fd Use a device UUID to associate sync sessions instead of ephemeral usercode (#2127) 2025-06-26 08:32:57 -07:00
Mert Unsal
785e14fed8 Merge branch 'main' into mert/fix_done_callback 2025-06-26 17:32:24 +02:00
mertunsall
7b1630b443 fix MAJOR bug with done callback being caleld twice 2025-06-26 17:31:57 +02:00
Nick Sweeting
6f1e7596cc Merge branch 'main' into sync-device-id 2025-06-26 08:29:17 -07:00
Magnus Müller
aecc1becee Set-default-parameter (#2128)
Auto-generated PR for branch: Set-default-parameter
    
<!-- This is an auto-generated description by cubic. -->
---

## Summary by cubic
Changed default settings to allow up to 10 actions per step, disable
memory by default, and reduce structured data extraction size to 40,000
characters. Improved error message for missing element indices.

- **Bug Fixes**
  - Updated error message to suggest scrolling for more elements when an index is missing.

- **Refactors**
  - Set max_actions_per_step default to 10.
  - Set enable_memory default to False.
  - Reduced max_chars for extraction to 40,000.
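For illustration, the new defaults listed above can be captured in a small settings record. `AgentDefaults` is a hypothetical stand-in for this sketch, not browser-use's actual settings class.

```python
from dataclasses import dataclass

# Hypothetical record of the defaults changed by this PR.
@dataclass
class AgentDefaults:
    max_actions_per_step: int = 10      # up to 10 actions per step
    enable_memory: bool = False         # memory disabled by default
    max_chars_extraction: int = 40_000  # structured extraction size cap

defaults = AgentDefaults()
```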

<!-- End of auto-generated description by cubic. -->
2025-06-26 17:23:48 +02:00
Nick Sweeting
5158419fc7 Update browser_use/browser/session.py
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-06-26 11:16:48 -04:00
Magnus Müller
03893b8adf Update comment to 40000 in service.py 2025-06-26 17:16:43 +02:00
Magnus Müller
42961a49f3 Merge branch 'main' into Set-default-parameter 2025-06-26 17:08:47 +02:00
Magnus Müller
4844bd048f Set default parameter - no memory, actions per step 10, and extraction input size 2025-06-26 17:08:15 +02:00
Nick Sweeting
90862026fa persist a unique device_id and send it with sync events 2025-06-26 08:00:20 -07:00
Nick Sweeting
18bd8fe715 tweak timeouts to fail faster 2025-06-26 07:59:55 -07:00
Nick Sweeting
dd837fd3d4 use retry decorator for locking and timeouts instead of manual logic (#2126) 2025-06-26 07:38:29 -07:00
Nick Sweeting
a1ce6ed5c2 fix linter 2025-06-26 07:34:47 -07:00
Nick Sweeting
b3ac5eb152 Merge branch 'main' into locking 2025-06-26 07:30:21 -07:00
Nick Sweeting
4b2d7c270b fix tests 2025-06-26 07:30:03 -07:00
Nick Sweeting
ea2c2205a3 swap decorator decorateee 2025-06-26 07:14:02 -07:00
Nick Sweeting
fc075c0152 use retry decorator for locking and timeouts instead ofmanual logic 2025-06-26 07:07:13 -07:00
Nick Sweeting
12583f917d start auth flow on first step 2025-06-26 06:50:33 -07:00
Magnus Müller
67d3f136eb Add explicit scoring criteria for tasks with no output (#2124)
This PR improves the evaluation judge system by adding explicit criteria
for scoring tasks that produce no output.

**Changes:**
- Added clarification that tasks producing no output must receive low
scores
- Improved evaluation consistency for incomplete task scenarios
- Ensures judges properly penalize agents that fail to produce results

**Problem Addressed:**
Previously, the evaluation criteria didn't explicitly state how to
handle cases where agents complete their steps but produce no meaningful
output. This could lead to inconsistent scoring where some judges might
give moderate scores for "effort" even when no results were delivered.

**Benefits:**
- **Clearer guidance**: Judges now have explicit instruction on scoring
no-output scenarios
- **Consistent evaluation**: Reduces variability in how different judges
score incomplete tasks
- **Better agent assessment**: Ensures that failure to produce output is
properly reflected in scoring
- **Improved training data**: More accurate scores help improve future
agent development

**Impact:**
This change helps maintain evaluation accuracy by making it clear that
lack of output is a significant failure condition that should result in
low scores, regardless of the apparent "effort" shown in the agent's
trajectory.

**Testing:**
- No breaking changes to existing functionality
- Pre-commit hooks pass successfully
- Change only affects evaluation scoring guidance
2025-06-26 15:34:52 +02:00
Magnus Müller
e004a7efd8 Add explicit scoring criteria for tasks with no output
- Clarify that tasks producing no output must receive low scores
- Improve evaluation consistency for incomplete task scenarios
- Ensure judges properly penalize agents that fail to produce results

This helps maintain evaluation accuracy by explicitly stating that
lack of output is a significant failure condition that should be
reflected in low scoring.
2025-06-26 15:34:01 +02:00
Magnus Müller
92fccdee82 Remove unused controller telemetry functionality (#2123)
This PR cleans up unused telemetry code in the controller registry
system.

**Changes:**
- Remove ControllerRegisteredFunctionsTelemetryEvent and RegisteredFunction classes from telemetry views
- Remove telemetry capture call from controller registry service that
was collecting registered function data
- Clean up unused imports in telemetry module
- Simplify telemetry `__all__` exports to only include actively used classes

**Benefits:**
- Reduces code complexity by removing dead code
- Eliminates unnecessary telemetry data collection that wasn't being
used
- Cleaner imports and module exports
- Improved maintainability

**Testing:**
- All existing functionality remains unchanged
- No breaking changes to public APIs
- Pre-commit hooks pass successfully

This is a safe cleanup that removes functionality that was implemented
but never actually utilized in the application.
    
<!-- This is an auto-generated description by cubic. -->
---

## Summary by cubic
Removed unused telemetry code from the controller registry to reduce
complexity and clean up the codebase. No changes to existing
functionality.

<!-- End of auto-generated description by cubic. -->
2025-06-26 15:15:02 +02:00
Magnus Müller
a8718d02cd Merge branch 'main' into cleanup/remove-unused-telemetry 2025-06-26 15:11:15 +02:00
Magnus Müller
55492b7e3e Improve judge system evaluation criteria (#2121)
This PR enhances the evaluation judge system with more precise criteria
and guidelines:

**Changes:**
- Removed verbose architecture documentation comments that were
cluttering the file
- Added guideline to avoid file system usage for short tasks unless
specifically required
- Added stricter evaluation criteria for tasks blocked by captchas or
impossible scenarios
- Added criteria for evaluating bad trajectories and inappropriate tool
usage
- Added requirement that incomplete tasks must receive low scores
- Added consideration for trajectory optimization opportunities

**Benefits:**
- Cleaner, more focused code with reduced comment noise
- More precise evaluation guidelines for better agent assessment
- Stricter criteria to prevent false positive evaluations
- Better guidance for scoring edge cases and failure scenarios

These improvements will help provide more accurate and consistent
evaluations of browser-use agent performance.
    
<!-- This is an auto-generated description by cubic. -->
---

## Summary by cubic
Improved the judge system by adding stricter evaluation criteria and
clearer guidelines for scoring agent performance, while removing
unnecessary documentation comments.

- **Evaluation Criteria Updates**
  - Added rules to penalize incomplete tasks, bad trajectories, and poor tool usage.
  - Introduced guidelines for handling captchas, impossible tasks, and unnecessary file system use.
  - Reduced comment clutter for easier code maintenance.

<!-- End of auto-generated description by cubic. -->
2025-06-26 15:11:03 +02:00
Magnus Müller
54736224eb Remove unused controller telemetry functionality
- Remove ControllerRegisteredFunctionsTelemetryEvent and RegisteredFunction classes
- Remove telemetry capture call from controller registry service
- Clean up unused imports in telemetry module
- Simplify telemetry __all__ exports

This removes dead code that was capturing registered function telemetry
data but wasn't being used anywhere in the application.
2025-06-26 15:06:17 +02:00
Magnus Müller
4a4336e1a1 Improve judge system evaluation criteria
- Remove verbose architecture documentation comments
- Add guideline to avoid file system usage for short tasks
- Add stricter evaluation criteria for blocked/impossible tasks
- Add criteria for bad trajectories and tool usage
- Add requirement for task completion scoring
- Add trajectory optimization considerations
2025-06-26 15:01:10 +02:00
Nick Sweeting
74d02c07a7 increase screenshot timeout to default timeout 0.3.3 2025-06-26 02:59:57 -07:00
Nick Sweeting
c7b29c9116 tweak browser shutdown locking and logging 2025-06-26 02:32:23 -07:00
Nick Sweeting
c0825fe3a5 increase timeout for tests 2025-06-26 02:05:52 -07:00