Merge branch 'main' into attempt-fix-real-browser

This commit is contained in:
kalil0321
2025-11-07 01:33:14 +01:00
committed by GitHub
21 changed files with 2876 additions and 44 deletions


@@ -1,3 +1,4 @@
# AGENTS.md Version 1
<guidelines>
Browser-Use is an AI agent that autonomously interacts with the web. It takes a user-defined task, navigates web pages using Chromium via CDP, processes HTML, and repeatedly queries a language model to decide the next action—until the task is completed.

CLOUD.md (new file, 2701 lines)

File diff suppressed because it is too large.


@@ -158,7 +158,7 @@ https://github.com/user-attachments/assets/a6813fa7-4a7c-40a6-b4aa-382bf88b1850
[Example code ↗](https://github.com/browser-use/browser-use/blob/main/examples/use-cases/buy_groceries.py)
### 💻 Personal-Assistant.
#### Task = "Help me find parts for a custom PC."
https://github.com/user-attachments/assets/ac34f75c-057a-43ef-ad06-5b2c9d42bf06
@@ -182,9 +182,9 @@ https://github.com/user-attachments/assets/ac34f75c-057a-43ef-ad06-5b2c9d42bf06
We optimized **ChatBrowserUse()** specifically for browser automation tasks. On average, it completes tasks 3-5x faster than other models with SOTA accuracy.
**Pricing (per 1M tokens):**
- Input tokens: $0.50
- Output tokens: $3.00
- Cached tokens: $0.10
- Input tokens: $0.20
- Output tokens: $2.00
- Cached tokens: $0.02
For other LLM providers, see our [supported models documentation](https://docs.browser-use.com/supported-models).
</details>
@@ -253,7 +253,7 @@ For production use cases, use our [Browser Use Cloud API](https://cloud.browser-
<br/>
<div align="center">
**Tell your computer what to do, and it gets it done.**
<img src="https://github.com/user-attachments/assets/06fa3078-8461-4560-b434-445510c1766f" width="400"/>


@@ -75,7 +75,7 @@ await element.drag_to(target_element) # Drag and drop
value = await element.get_attribute("value")
box = await element.get_bounding_box() # Returns BoundingBox or None
info = await element.get_basic_info() # Comprehensive element info
screenshot_b64 = await element.screenshot(format='jpeg')
screenshot_b64 = await element.screenshot(format='png')
# Execute JavaScript on element (this context is the element)
text = await element.evaluate("() => this.textContent")
@@ -108,7 +108,7 @@ await page.press("Escape") # Single keys
# Page controls
await page.set_viewport_size(width=1920, height=1080)
page_screenshot = await page.screenshot() # JPEG by default
page_screenshot = await page.screenshot() # PNG by default
page_png = await page.screenshot(format="png", quality=90)
# Page information
@@ -166,7 +166,7 @@ products = await page.extract_content(
- `evaluate(page_function: str, *args)` → `str` - Execute JavaScript (MUST use (...args) => format)
- `press(key: str)` - Press key on page (supports "Control+A" format)
- `set_viewport_size(width: int, height: int)` - Set viewport dimensions
- `screenshot(format='jpeg', quality=None)` → `str` - Take page screenshot, return base64
- `screenshot(format='png', quality=None)` → `str` - Take page screenshot, return base64
- `get_url()` → `str`, `get_title()` → `str` - Get page information
- `mouse` → `Mouse` - Get mouse interface for this page
@@ -181,7 +181,7 @@ products = await page.extract_content(
- `evaluate(page_function: str, *args)` → `str` - Execute JavaScript on element (this = element)
- `get_attribute(name: str)` → `str | None` - Get attribute value
- `get_bounding_box()` → `BoundingBox | None` - Get element position/size
- `screenshot(format='jpeg', quality=None)` → `str` - Take element screenshot, return base64
- `screenshot(format='png', quality=None)` → `str` - Take element screenshot, return base64
- `get_basic_info()` → `ElementInfo` - Get comprehensive element information


@@ -679,7 +679,7 @@ class Element:
except Exception:
return None
async def screenshot(self, format: str = 'jpeg', quality: int | None = None) -> str:
async def screenshot(self, format: str = 'png', quality: int | None = None) -> str:
"""Take a screenshot of this element and return base64 encoded image.
Args:


@@ -188,7 +188,7 @@ class Page:
return js_code
async def screenshot(self, format: str = 'jpeg', quality: int | None = None) -> str:
async def screenshot(self, format: str = 'png', quality: int | None = None) -> str:
"""Take a screenshot and return base64 encoded image.
Args:


@@ -155,7 +155,7 @@ class CreateAgentStepEvent(BaseEvent):
# Capture screenshot as base64 data URL if available
screenshot_url = None
if browser_state_summary.screenshot:
screenshot_url = f'data:image/jpeg;base64,{browser_state_summary.screenshot}'
screenshot_url = f'data:image/png;base64,{browser_state_summary.screenshot}'
import logging
logger = logging.getLogger(__name__)
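For readers consuming these events, a minimal stdlib-only sketch of how the PNG data URL round-trips; the payload here is a hypothetical placeholder standing in for the real base64-encoded screenshot:

```python
import base64

# Hypothetical payload; in browser-use the string comes from
# browser_state_summary.screenshot (already base64 encoded)
screenshot_b64 = base64.b64encode(b'\x89PNG\r\n\x1a\n').decode('ascii')

# Build the data URL the same way the event handler above does
screenshot_url = f'data:image/png;base64,{screenshot_b64}'

# Consumers can split on the first comma to recover the raw bytes
header, _, payload = screenshot_url.partition(',')
assert header == 'data:image/png;base64'
raw = base64.b64decode(payload)
assert raw[:4] == b'\x89PNG'  # PNG magic bytes
```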


@@ -122,6 +122,28 @@ def construct_judge_messages(
- The agent made up content that is not in the screenshot or the page state
- The agent calls done action before completing all key points of the task
**IMPOSSIBLE TASK DETECTION:**
Set `impossible_task` to true when the task fundamentally could not be completed due to:
- Vague or ambiguous task instructions that cannot be reasonably interpreted
- Website genuinely broken or non-functional (be conservative - temporary issues don't count)
- Required links/pages truly inaccessible (404, 403, etc.)
- Task requires authentication/login but no credentials were provided
- Task asks for functionality that doesn't exist on the target site
- Other insurmountable external obstacles beyond the agent's control
Do NOT mark as impossible if:
- Agent made poor decisions but task was achievable
- Temporary page loading issues that could be retried
- Agent didn't try the right approach
- Website works but agent struggled with it
**CAPTCHA DETECTION:**
Set `reached_captcha` to true if:
- Screenshots show captcha challenges (reCAPTCHA, hCaptcha, etc.)
- Agent reports being blocked by bot detection
- Error messages indicate captcha/verification requirements
- Any evidence the agent encountered anti-bot measures during execution
**IMPORTANT EVALUATION NOTES:**
- **evaluate for action** - For each key step of the trace, double check whether the action that the agent tried to perform actually happened. If the required action did not actually occur, the verdict should be false.
- **screenshot is not entire content** - The agent has the entire DOM content, but the screenshot is only part of the content. If the agent extracts information from the page, but you do not see it in the screenshot, you can assume this information is there.
@@ -136,9 +158,11 @@ def construct_judge_messages(
Respond with EXACTLY this JSON structure (no additional text before or after):
{{
"reasoning": "Breakdown of user task into key points. Detailed analysis covering: what went well, what didn't work, trajectory quality assessment, tool usage evaluation, output quality review, and overall user satisfaction prediction",
"reasoning": "Breakdown of user task into key points. Detailed analysis covering: what went well, what didn't work, trajectory quality assessment, tool usage evaluation, output quality review, and overall user satisfaction prediction.",
"verdict": true or false,
"failure_reason": "If verdict is false, provide the key reason why the task was not completed successfully. If verdict is true, use an empty string."
"failure_reason": "A brief explanation of key reasons why the task was not completed successfully in case of failure. If verdict is true, use an empty string. Keep it concise and easy to read.",
"impossible_task": true or false,
"reached_captcha": true or false
}}
</response_format>
"""


@@ -42,8 +42,7 @@ async def _format_conversation(messages: list[BaseMessage], response: Any) -> st
lines.append('') # Empty line after each message
# Format response
lines.append(' RESPONSE')
lines.append(json.dumps(json.loads(response.model_dump_json(exclude_unset=True)), indent=2))
lines.append(json.dumps(json.loads(response.model_dump_json(exclude_unset=True)), indent=2, ensure_ascii=False))
return '\n'.join(lines)
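The effect of the `ensure_ascii=False` change can be seen with a small stdlib-only sketch (the sample dict is illustrative, not from the codebase):

```python
import json

data = {'title': 'Café résumé'}

# Default: non-ASCII characters are escaped to \uXXXX sequences
escaped = json.dumps(data, indent=2)

# With ensure_ascii=False, characters stay readable in saved conversations
readable = json.dumps(data, indent=2, ensure_ascii=False)

assert '\\u00e9' in escaped   # 'é' escaped in the default form
assert 'Café' in readable     # kept verbatim with ensure_ascii=False
```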


@@ -373,8 +373,8 @@ Available tabs:
content_parts.append(
ContentPartImageParam(
image_url=ImageURL(
url=f'data:image/jpeg;base64,{screenshot}',
media_type='image/jpeg',
url=f'data:image/png;base64,{screenshot}',
media_type='image/png',
detail=self.vision_detail_level,
),
)


@@ -973,7 +973,10 @@ class Agent(Generic[Context, AgentStructuredOutput]):
verdict_text = '✅ PASS' if judgement.verdict else '❌ FAIL'
judge_log += f'⚖️ {verdict_color}Judge Verdict: {verdict_text}\033[0m\n'
if judgement.failure_reason:
judge_log += f' Failure: {judgement.failure_reason}\n'
judge_log += f' Failure Reason: {judgement.failure_reason}\n'
if judgement.reached_captcha:
judge_log += ' 🤖 Captcha Detected: Agent encountered captcha challenges\n'
judge_log += ' 👉 🥷 Use Browser Use Cloud for the most stealth browser infra: https://docs.browser-use.com/customize/browser/remote\n'
judge_log += f' {judgement.reasoning}\n'
self.logger.info(judge_log)


@@ -93,7 +93,18 @@ class JudgementResult(BaseModel):
reasoning: str | None = Field(default=None, description='Explanation of the judgement')
verdict: bool = Field(description='Whether the trace was successful or not')
failure_reason: str | None = Field(default=None, description='If the trace was not successful, the reason why')
failure_reason: str | None = Field(
default=None,
description='A brief explanation of key reasons why the task was not completed successfully in case of failure. If verdict is true, use an empty string. Keep it concise and easy to read.',
)
impossible_task: bool = Field(
default=False,
description='True if the task was impossible to complete due to vague instructions, broken website, inaccessible links, missing login credentials, or other insurmountable obstacles',
)
reached_captcha: bool = Field(
default=False,
description='True if the agent encountered captcha challenges during task execution',
)
class ActionResult(BaseModel):


@@ -39,7 +39,7 @@ class ScreenshotWatchdog(BaseWatchdog):
cdp_session = await self.browser_session.get_or_create_cdp_session()
# Prepare screenshot parameters
params = CaptureScreenshotParameters(format='jpeg', quality=60, captureBeyondViewport=False)
params = CaptureScreenshotParameters(format='png', captureBeyondViewport=False)
# Take screenshot using CDP
self.logger.debug(f'[ScreenshotWatchdog] Taking screenshot with params: {params}')


@@ -614,8 +614,8 @@ class CodeAgent:
content_parts.append(
ContentPartImageParam(
image_url=ImageURL(
url=f'data:image/jpeg;base64,{self._last_screenshot}',
media_type='image/jpeg',
url=f'data:image/png;base64,{self._last_screenshot}',
media_type='image/png',
detail='auto',
),
)


@@ -61,7 +61,7 @@ class ImageURL(BaseModel):
[Vision guide](https://platform.openai.com/docs/guides/vision#low-or-high-fidelity-image-understanding).
"""
# needed for Anthropic
media_type: SupportedImageMediaType = 'image/jpeg'
media_type: SupportedImageMediaType = 'image/png'
def __str__(self) -> str:
url_display = _format_image_url(self.url)


@@ -42,6 +42,12 @@ class ChatOpenAI(BaseChatModel):
top_p: float | None = None
add_schema_to_system_prompt: bool = False # Add JSON schema to system prompt instead of using response_format
dont_force_structured_output: bool = False # If True, the model will not be forced to output a structured output
remove_min_items_from_schema: bool = (
False # If True, remove minItems from JSON schema (for compatibility with some providers)
)
remove_defaults_from_schema: bool = (
False # If True, remove default values from JSON schema (for compatibility with some providers)
)
# Client initialization parameters
api_key: str | None = None
@@ -206,7 +212,11 @@ class ChatOpenAI(BaseChatModel):
response_format: JSONSchema = {
'name': 'agent_output',
'strict': True,
'schema': SchemaOptimizer.create_optimized_json_schema(output_format),
'schema': SchemaOptimizer.create_optimized_json_schema(
output_format,
remove_min_items=self.remove_min_items_from_schema,
remove_defaults=self.remove_defaults_from_schema,
),
}
# Add JSON schema to system prompt if requested


@@ -9,13 +9,20 @@ from pydantic import BaseModel
class SchemaOptimizer:
@staticmethod
def create_optimized_json_schema(model: type[BaseModel]) -> dict[str, Any]:
def create_optimized_json_schema(
model: type[BaseModel],
*,
remove_min_items: bool = False,
remove_defaults: bool = False,
) -> dict[str, Any]:
"""
Create the most optimized schema by flattening all $ref/$defs while preserving
FULL descriptions and ALL action definitions. Also ensures OpenAI strict mode compatibility.
Args:
model: The Pydantic model to optimize
remove_min_items: If True, remove minItems from the schema
remove_defaults: If True, remove default values from the schema
Returns:
Optimized schema with all $refs resolved and strict mode compatibility
@@ -26,12 +33,9 @@ class SchemaOptimizer:
# Extract $defs for reference resolution, then flatten everything
defs_lookup = original_schema.get('$defs', {})
def optimize_schema(
obj: Any,
defs_lookup: dict[str, Any] | None = None,
*,
in_properties: bool = False, # NEW: track context
) -> Any:
# Create optimized schema with flattening
# Pass flags to optimize_schema via closure
def optimize_schema(obj: Any, defs_lookup: dict[str, Any] | None = None, *, in_properties: bool = False) -> Any:
"""Apply all optimization techniques including flattening all $ref/$defs"""
if isinstance(obj, dict):
optimized: dict[str, Any] = {}
@@ -65,6 +69,12 @@ class SchemaOptimizer:
referenced_def = defs_lookup[ref_path]
flattened_ref = optimize_schema(referenced_def, defs_lookup)
# Skip minItems/min_items and default if requested (check BEFORE processing)
elif key in ('minItems', 'min_items') and remove_min_items:
continue # Skip minItems/min_items
elif key == 'default' and remove_defaults:
continue # Skip default values
# Keep all anyOf structures (action unions) and resolve any $refs within
elif key == 'anyOf' and isinstance(value, list):
optimized[key] = [optimize_schema(item, defs_lookup) for item in value]
@@ -78,7 +88,17 @@ class SchemaOptimizer:
)
# Keep essential validation fields
elif key in ['type', 'required', 'minimum', 'maximum', 'minItems', 'maxItems', 'pattern', 'default']:
elif key in [
'type',
'required',
'minimum',
'maximum',
'minItems',
'min_items',
'maxItems',
'pattern',
'default',
]:
optimized[key] = value if not isinstance(value, (dict, list)) else optimize_schema(value, defs_lookup)
# Recursively process all other fields
@@ -111,7 +131,6 @@ class SchemaOptimizer:
return [optimize_schema(item, defs_lookup, in_properties=in_properties) for item in obj]
return obj
# Create optimized schema with flattening
optimized_result = optimize_schema(original_schema, defs_lookup)
# Ensure we have a dictionary (should always be the case for schema root)
@@ -140,6 +159,29 @@ class SchemaOptimizer:
ensure_additional_properties_false(optimized_schema)
SchemaOptimizer._make_strict_compatible(optimized_schema)
# Final pass to remove minItems/min_items and default values if requested
if remove_min_items or remove_defaults:
def remove_forbidden_fields(obj: Any) -> None:
"""Recursively remove minItems/min_items and default values"""
if isinstance(obj, dict):
# Remove forbidden keys
if remove_min_items:
obj.pop('minItems', None)
obj.pop('min_items', None)
if remove_defaults:
obj.pop('default', None)
# Recursively process all values
for value in obj.values():
if isinstance(value, (dict, list)):
remove_forbidden_fields(value)
elif isinstance(obj, list):
for item in obj:
if isinstance(item, (dict, list)):
remove_forbidden_fields(item)
remove_forbidden_fields(optimized_schema)
return optimized_schema
@staticmethod
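The final cleanup pass above can be condensed into a small standalone sketch; the names here are illustrative, not the library's API:

```python
from typing import Any


def strip_schema_keys(obj: Any, keys: tuple[str, ...]) -> Any:
    """Recursively drop the given keys from a JSON-schema-like structure.

    A simplified version of the final pass: keys would be
    ('minItems', 'min_items') and/or ('default',) depending on the flags.
    """
    if isinstance(obj, dict):
        return {k: strip_schema_keys(v, keys) for k, v in obj.items() if k not in keys}
    if isinstance(obj, list):
        return [strip_schema_keys(item, keys) for item in obj]
    return obj


schema = {
    'type': 'object',
    'properties': {
        'actions': {
            'type': 'array',
            'minItems': 1,
            'items': {'type': 'string', 'default': 'noop'},
        },
    },
}
cleaned = strip_schema_keys(schema, ('minItems', 'min_items', 'default'))
assert 'minItems' not in cleaned['properties']['actions']
assert 'default' not in cleaned['properties']['actions']['items']
```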


@@ -4,8 +4,14 @@
"name": "Browser Use",
"colors": {
"primary": "#FE750E",
"light": "#FFF7ED",
"dark": "#C2410C"
"light": "#FE750E",
"dark": "#FE750E"
},
"background": {
"color": {
"light": "#FFFFFF",
"dark": "#09090B"
}
},
"favicon": "/favicon.ico",
"contextual": {


@@ -6,5 +6,5 @@ icon: "brain"
1. Copy all content [🔗 from here](https://docs.browser-use.com/llms-full.txt) (~32k tokens)
1. Copy all content [🔗 from here](https://github.com/browser-use/browser-use/blob/main/AGENTS.md) (~32k tokens)
2. Paste it into your favorite coding agent (Cursor, Claude, ChatGPT ...).


@@ -32,17 +32,14 @@ Get your API key from the [Browser Use Cloud](https://cloud.browser-use.com/new-
#### Pricing
ChatBrowserUse offers competitive pricing per 1 million tokens:
ChatBrowserUse offers the best pricing per 1 million tokens:
| Token Type | Price per 1M tokens |
|------------|---------------------|
| Input tokens | $0.50 |
| Output tokens | $3.00 |
| Cached tokens | $0.10 |
| Input tokens | $0.20 |
| Cached tokens | $0.02 |
| Output tokens | $2.00 |
<Note>
Cached tokens provide significant cost savings on repeated context, reducing input costs by 80%.
</Note>
### Google Gemini [example](https://github.com/browser-use/browser-use/blob/main/examples/models/gemini.py)


@@ -0,0 +1,38 @@
import asyncio
import os

from dotenv import load_dotenv

from browser_use import Agent, ChatOpenAI

load_dotenv()

# Get API key from environment variable
api_key = os.getenv('MOONSHOT_API_KEY')
if api_key is None:
    print('Make sure you have MOONSHOT_API_KEY set in your .env file')
    print('Get your API key from https://platform.moonshot.ai/console/api-keys')
    exit(1)

# Configure Moonshot AI model
llm = ChatOpenAI(
    model='kimi-k2-thinking',
    base_url='https://api.moonshot.ai/v1',
    api_key=api_key,
    add_schema_to_system_prompt=True,
    remove_min_items_from_schema=True,  # Moonshot doesn't support minItems in JSON schema
    remove_defaults_from_schema=True,  # Moonshot doesn't allow default values with anyOf
)


async def main():
    agent = Agent(
        task='Search for the latest news about AI and summarize the top 3 articles',
        llm=llm,
        flash_mode=True,
    )
    await agent.run()


if __name__ == '__main__':
    asyncio.run(main())