Merge branch 'main' into attempt-fix-real-browser

This commit is contained in:
kalil0321
2025-11-07 01:33:14 +01:00
committed by GitHub
21 changed files with 2876 additions and 44 deletions


@@ -1,3 +1,4 @@
# AGENTS.md Version 1
<guidelines>
Browser-Use is an AI agent that autonomously interacts with the web. It takes a user-defined task, navigates web pages using Chromium via CDP, processes HTML, and repeatedly queries a language model to decide the next action—until the task is completed.

CLOUD.md (new file, 2701 lines)

File diff suppressed because it is too large.


@@ -158,7 +158,7 @@ https://github.com/user-attachments/assets/a6813fa7-4a7c-40a6-b4aa-382bf88b1850
[Example code ↗](https://github.com/browser-use/browser-use/blob/main/examples/use-cases/buy_groceries.py)
### 💻 Personal-Assistant.
#### Task = "Help me find parts for a custom PC."
https://github.com/user-attachments/assets/ac34f75c-057a-43ef-ad06-5b2c9d42bf06
@@ -182,9 +182,9 @@ https://github.com/user-attachments/assets/ac34f75c-057a-43ef-ad06-5b2c9d42bf06
We optimized **ChatBrowserUse()** specifically for browser automation tasks. On average, it completes tasks 3-5x faster than other models with SOTA accuracy.
**Pricing (per 1M tokens):**
- Input tokens: $0.50
- Output tokens: $3.00
- Cached tokens: $0.10
- Input tokens: $0.20
- Output tokens: $2.00
- Cached tokens: $0.02
For other LLM providers, see our [supported models documentation](https://docs.browser-use.com/supported-models).
</details>
@@ -253,7 +253,7 @@ For production use cases, use our [Browser Use Cloud API](https://cloud.browser-
<br/>
<div align="center">
**Tell your computer what to do, and it gets it done.**
<img src="https://github.com/user-attachments/assets/06fa3078-8461-4560-b434-445510c1766f" width="400"/>


@@ -75,7 +75,7 @@ await element.drag_to(target_element) # Drag and drop
value = await element.get_attribute("value")
box = await element.get_bounding_box() # Returns BoundingBox or None
info = await element.get_basic_info() # Comprehensive element info
screenshot_b64 = await element.screenshot(format='jpeg')
screenshot_b64 = await element.screenshot(format='png')
# Execute JavaScript on element (this context is the element)
text = await element.evaluate("() => this.textContent")
@@ -108,7 +108,7 @@ await page.press("Escape") # Single keys
# Page controls
await page.set_viewport_size(width=1920, height=1080)
page_screenshot = await page.screenshot() # JPEG by default
page_screenshot = await page.screenshot() # PNG by default
page_png = await page.screenshot(format="png", quality=90)
# Page information
@@ -166,7 +166,7 @@ products = await page.extract_content(
- `evaluate(page_function: str, *args)` → `str` - Execute JavaScript (MUST use (...args) => format)
- `press(key: str)` - Press key on page (supports "Control+A" format)
- `set_viewport_size(width: int, height: int)` - Set viewport dimensions
- `screenshot(format='jpeg', quality=None)` → `str` - Take page screenshot, return base64
- `screenshot(format='png', quality=None)` → `str` - Take page screenshot, return base64
- `get_url()` → `str`, `get_title()` → `str` - Get page information
- `mouse` → `Mouse` - Get mouse interface for this page
@@ -181,7 +181,7 @@ products = await page.extract_content(
- `evaluate(page_function: str, *args)` → `str` - Execute JavaScript on element (this = element)
- `get_attribute(name: str)` → `str | None` - Get attribute value
- `get_bounding_box()` → `BoundingBox | None` - Get element position/size
- `screenshot(format='jpeg', quality=None)` → `str` - Take element screenshot, return base64
- `screenshot(format='png', quality=None)` → `str` - Take element screenshot, return base64
- `get_basic_info()` → `ElementInfo` - Get comprehensive element information


@@ -679,7 +679,7 @@ class Element:
except Exception:
return None
async def screenshot(self, format: str = 'jpeg', quality: int | None = None) -> str:
async def screenshot(self, format: str = 'png', quality: int | None = None) -> str:
"""Take a screenshot of this element and return base64 encoded image.
Args:


@@ -188,7 +188,7 @@ class Page:
return js_code
async def screenshot(self, format: str = 'jpeg', quality: int | None = None) -> str:
async def screenshot(self, format: str = 'png', quality: int | None = None) -> str:
"""Take a screenshot and return base64 encoded image.
Args:


@@ -155,7 +155,7 @@ class CreateAgentStepEvent(BaseEvent):
# Capture screenshot as base64 data URL if available
screenshot_url = None
if browser_state_summary.screenshot:
screenshot_url = f'data:image/jpeg;base64,{browser_state_summary.screenshot}'
screenshot_url = f'data:image/png;base64,{browser_state_summary.screenshot}'
import logging
logger = logging.getLogger(__name__)
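For readers consuming these events, a minimal stdlib-only sketch of how the PNG data URL round-trips; the payload here is a hypothetical placeholder standing in for the real base64-encoded screenshot:

```python
import base64

# Hypothetical payload; in browser-use the string comes from
# browser_state_summary.screenshot (already base64 encoded)
screenshot_b64 = base64.b64encode(b'\x89PNG\r\n\x1a\n').decode('ascii')

# Build the data URL the same way the event handler above does
screenshot_url = f'data:image/png;base64,{screenshot_b64}'

# Consumers can split on the first comma to recover the raw bytes
header, _, payload = screenshot_url.partition(',')
assert header == 'data:image/png;base64'
raw = base64.b64decode(payload)
assert raw[:4] == b'\x89PNG'  # PNG magic bytes
```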


@@ -122,6 +122,28 @@ def construct_judge_messages(
- The agent made up content that is not in the screenshot or the page state
- The agent calls done action before completing all key points of the task
**IMPOSSIBLE TASK DETECTION:**
Set `impossible_task` to true when the task fundamentally could not be completed due to:
- Vague or ambiguous task instructions that cannot be reasonably interpreted
- Website genuinely broken or non-functional (be conservative - temporary issues don't count)
- Required links/pages truly inaccessible (404, 403, etc.)
- Task requires authentication/login but no credentials were provided
- Task asks for functionality that doesn't exist on the target site
- Other insurmountable external obstacles beyond the agent's control
Do NOT mark as impossible if:
- Agent made poor decisions but task was achievable
- Temporary page loading issues that could be retried
- Agent didn't try the right approach
- Website works but agent struggled with it
**CAPTCHA DETECTION:**
Set `reached_captcha` to true if:
- Screenshots show captcha challenges (reCAPTCHA, hCaptcha, etc.)
- Agent reports being blocked by bot detection
- Error messages indicate captcha/verification requirements
- Any evidence the agent encountered anti-bot measures during execution
**IMPORTANT EVALUATION NOTES:**
- **evaluate for action** - For each key step of the trace, double check whether the action that the agent tried to perform actually happened. If the required action did not actually occur, the verdict should be false.
- **screenshot is not entire content** - The agent has the entire DOM content, but the screenshot is only part of the content. If the agent extracts information from the page, but you do not see it in the screenshot, you can assume this information is there.
@@ -136,9 +158,11 @@ def construct_judge_messages(
Respond with EXACTLY this JSON structure (no additional text before or after):
{{
"reasoning": "Breakdown of user task into key points. Detailed analysis covering: what went well, what didn't work, trajectory quality assessment, tool usage evaluation, output quality review, and overall user satisfaction prediction",
"reasoning": "Breakdown of user task into key points. Detailed analysis covering: what went well, what didn't work, trajectory quality assessment, tool usage evaluation, output quality review, and overall user satisfaction prediction.",
"verdict": true or false,
"failure_reason": "If verdict is false, provide the key reason why the task was not completed successfully. If verdict is true, use an empty string."
"failure_reason": "A brief explanation of key reasons why the task was not completed successfully in case of failure. If verdict is true, use an empty string. Keep it concise and easy to read.",
"impossible_task": true or false,
"reached_captcha": true or false
}}
</response_format>
"""


@@ -42,8 +42,7 @@ async def _format_conversation(messages: list[BaseMessage], response: Any) -> st
lines.append('') # Empty line after each message
# Format response
lines.append(' RESPONSE')
lines.append(json.dumps(json.loads(response.model_dump_json(exclude_unset=True)), indent=2))
lines.append(json.dumps(json.loads(response.model_dump_json(exclude_unset=True)), indent=2, ensure_ascii=False))
return '\n'.join(lines)
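The effect of the `ensure_ascii=False` change can be seen with a small stdlib-only sketch (the sample dict is illustrative, not from the codebase):

```python
import json

data = {'title': 'Café résumé'}

# Default: non-ASCII characters are escaped to \uXXXX sequences
escaped = json.dumps(data, indent=2)

# With ensure_ascii=False, characters stay readable in saved conversations
readable = json.dumps(data, indent=2, ensure_ascii=False)

assert '\\u00e9' in escaped   # 'é' escaped in the default form
assert 'Café' in readable     # kept verbatim with ensure_ascii=False
```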


@@ -373,8 +373,8 @@ Available tabs:
content_parts.append(
ContentPartImageParam(
image_url=ImageURL(
url=f'data:image/jpeg;base64,{screenshot}',
media_type='image/jpeg',
url=f'data:image/png;base64,{screenshot}',
media_type='image/png',
detail=self.vision_detail_level,
),
)


@@ -973,7 +973,10 @@ class Agent(Generic[Context, AgentStructuredOutput]):
verdict_text = '✅ PASS' if judgement.verdict else '❌ FAIL'
judge_log += f'⚖️ {verdict_color}Judge Verdict: {verdict_text}\033[0m\n'
if judgement.failure_reason:
judge_log += f' Failure: {judgement.failure_reason}\n'
judge_log += f' Failure Reason: {judgement.failure_reason}\n'
if judgement.reached_captcha:
judge_log += ' 🤖 Captcha Detected: Agent encountered captcha challenges\n'
judge_log += ' 👉 🥷 Use Browser Use Cloud for the most stealth browser infra: https://docs.browser-use.com/customize/browser/remote\n'
judge_log += f' {judgement.reasoning}\n'
self.logger.info(judge_log)


@@ -93,7 +93,18 @@ class JudgementResult(BaseModel):
reasoning: str | None = Field(default=None, description='Explanation of the judgement')
verdict: bool = Field(description='Whether the trace was successful or not')
failure_reason: str | None = Field(default=None, description='If the trace was not successful, the reason why')
failure_reason: str | None = Field(
default=None,
description='A brief explanation of key reasons why the task was not completed successfully in case of failure. If verdict is true, use an empty string. Keep it concise and easy to read.',
)
impossible_task: bool = Field(
default=False,
description='True if the task was impossible to complete due to vague instructions, broken website, inaccessible links, missing login credentials, or other insurmountable obstacles',
)
reached_captcha: bool = Field(
default=False,
description='True if the agent encountered captcha challenges during task execution',
)
class ActionResult(BaseModel):


@@ -39,7 +39,7 @@ class ScreenshotWatchdog(BaseWatchdog):
cdp_session = await self.browser_session.get_or_create_cdp_session()
# Prepare screenshot parameters
params = CaptureScreenshotParameters(format='jpeg', quality=60, captureBeyondViewport=False)
params = CaptureScreenshotParameters(format='png', captureBeyondViewport=False)
# Take screenshot using CDP
self.logger.debug(f'[ScreenshotWatchdog] Taking screenshot with params: {params}')


@@ -614,8 +614,8 @@ class CodeAgent:
content_parts.append(
ContentPartImageParam(
image_url=ImageURL(
url=f'data:image/jpeg;base64,{self._last_screenshot}',
media_type='image/jpeg',
url=f'data:image/png;base64,{self._last_screenshot}',
media_type='image/png',
detail='auto',
),
)


@@ -61,7 +61,7 @@ class ImageURL(BaseModel):
[Vision guide](https://platform.openai.com/docs/guides/vision#low-or-high-fidelity-image-understanding).
"""
# needed for Anthropic
media_type: SupportedImageMediaType = 'image/jpeg'
media_type: SupportedImageMediaType = 'image/png'
def __str__(self) -> str:
url_display = _format_image_url(self.url)


@@ -42,6 +42,12 @@ class ChatOpenAI(BaseChatModel):
top_p: float | None = None
add_schema_to_system_prompt: bool = False # Add JSON schema to system prompt instead of using response_format
dont_force_structured_output: bool = False # If True, the model will not be forced to output a structured output
remove_min_items_from_schema: bool = (
False # If True, remove minItems from JSON schema (for compatibility with some providers)
)
remove_defaults_from_schema: bool = (
False # If True, remove default values from JSON schema (for compatibility with some providers)
)
# Client initialization parameters
api_key: str | None = None
@@ -206,7 +212,11 @@ class ChatOpenAI(BaseChatModel):
response_format: JSONSchema = {
'name': 'agent_output',
'strict': True,
'schema': SchemaOptimizer.create_optimized_json_schema(output_format),
'schema': SchemaOptimizer.create_optimized_json_schema(
output_format,
remove_min_items=self.remove_min_items_from_schema,
remove_defaults=self.remove_defaults_from_schema,
),
}
# Add JSON schema to system prompt if requested


@@ -9,13 +9,20 @@ from pydantic import BaseModel
class SchemaOptimizer:
@staticmethod
def create_optimized_json_schema(model: type[BaseModel]) -> dict[str, Any]:
def create_optimized_json_schema(
model: type[BaseModel],
*,
remove_min_items: bool = False,
remove_defaults: bool = False,
) -> dict[str, Any]:
"""
Create the most optimized schema by flattening all $ref/$defs while preserving
FULL descriptions and ALL action definitions. Also ensures OpenAI strict mode compatibility.
Args:
model: The Pydantic model to optimize
remove_min_items: If True, remove minItems from the schema
remove_defaults: If True, remove default values from the schema
Returns:
Optimized schema with all $refs resolved and strict mode compatibility
@@ -26,12 +33,9 @@ class SchemaOptimizer:
# Extract $defs for reference resolution, then flatten everything
defs_lookup = original_schema.get('$defs', {})
def optimize_schema(
obj: Any,
defs_lookup: dict[str, Any] | None = None,
*,
in_properties: bool = False, # NEW: track context
) -> Any:
# Create optimized schema with flattening
# Pass flags to optimize_schema via closure
def optimize_schema(obj: Any, defs_lookup: dict[str, Any] | None = None, *, in_properties: bool = False) -> Any:
"""Apply all optimization techniques including flattening all $ref/$defs"""
if isinstance(obj, dict):
optimized: dict[str, Any] = {}
@@ -65,6 +69,12 @@ class SchemaOptimizer:
referenced_def = defs_lookup[ref_path]
flattened_ref = optimize_schema(referenced_def, defs_lookup)
# Skip minItems/min_items and default if requested (check BEFORE processing)
elif key in ('minItems', 'min_items') and remove_min_items:
continue # Skip minItems/min_items
elif key == 'default' and remove_defaults:
continue # Skip default values
# Keep all anyOf structures (action unions) and resolve any $refs within
elif key == 'anyOf' and isinstance(value, list):
optimized[key] = [optimize_schema(item, defs_lookup) for item in value]
@@ -78,7 +88,17 @@ class SchemaOptimizer:
)
# Keep essential validation fields
elif key in ['type', 'required', 'minimum', 'maximum', 'minItems', 'maxItems', 'pattern', 'default']:
elif key in [
'type',
'required',
'minimum',
'maximum',
'minItems',
'min_items',
'maxItems',
'pattern',
'default',
]:
optimized[key] = value if not isinstance(value, (dict, list)) else optimize_schema(value, defs_lookup)
# Recursively process all other fields
@@ -111,7 +131,6 @@ class SchemaOptimizer:
return [optimize_schema(item, defs_lookup, in_properties=in_properties) for item in obj]
return obj
# Create optimized schema with flattening
optimized_result = optimize_schema(original_schema, defs_lookup)
# Ensure we have a dictionary (should always be the case for schema root)
@@ -140,6 +159,29 @@ class SchemaOptimizer:
ensure_additional_properties_false(optimized_schema)
SchemaOptimizer._make_strict_compatible(optimized_schema)
# Final pass to remove minItems/min_items and default values if requested
if remove_min_items or remove_defaults:
def remove_forbidden_fields(obj: Any) -> None:
"""Recursively remove minItems/min_items and default values"""
if isinstance(obj, dict):
# Remove forbidden keys
if remove_min_items:
obj.pop('minItems', None)
obj.pop('min_items', None)
if remove_defaults:
obj.pop('default', None)
# Recursively process all values
for value in obj.values():
if isinstance(value, (dict, list)):
remove_forbidden_fields(value)
elif isinstance(obj, list):
for item in obj:
if isinstance(item, (dict, list)):
remove_forbidden_fields(item)
remove_forbidden_fields(optimized_schema)
return optimized_schema
@staticmethod
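The final cleanup pass above can be condensed into a small standalone sketch; the names here are illustrative, not the library's API:

```python
from typing import Any


def strip_schema_keys(obj: Any, keys: tuple[str, ...]) -> Any:
    """Recursively drop the given keys from a JSON-schema-like structure.

    A simplified version of the final pass: keys would be
    ('minItems', 'min_items') and/or ('default',) depending on the flags.
    """
    if isinstance(obj, dict):
        return {k: strip_schema_keys(v, keys) for k, v in obj.items() if k not in keys}
    if isinstance(obj, list):
        return [strip_schema_keys(item, keys) for item in obj]
    return obj


schema = {
    'type': 'object',
    'properties': {
        'actions': {
            'type': 'array',
            'minItems': 1,
            'items': {'type': 'string', 'default': 'noop'},
        },
    },
}
cleaned = strip_schema_keys(schema, ('minItems', 'min_items', 'default'))
assert 'minItems' not in cleaned['properties']['actions']
assert 'default' not in cleaned['properties']['actions']['items']
```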


@@ -4,8 +4,14 @@
"name": "Browser Use",
"colors": {
"primary": "#FE750E",
"light": "#FFF7ED",
"dark": "#C2410C"
"light": "#FE750E",
"dark": "#FE750E"
},
"background": {
"color": {
"light": "#FFFFFF",
"dark": "#09090B"
}
},
"favicon": "/favicon.ico",
"contextual": {


@@ -6,5 +6,5 @@ icon: "brain"
1. Copy all content [🔗 from here](https://docs.browser-use.com/llms-full.txt) (~32k tokens)
1. Copy all content [🔗 from here](https://github.com/browser-use/browser-use/blob/main/AGENTS.md) (~32k tokens)
2. Paste it into your favorite coding agent (Cursor, Claude, ChatGPT ...).


@@ -32,17 +32,14 @@ Get your API key from the [Browser Use Cloud](https://cloud.browser-use.com/new-
#### Pricing
ChatBrowserUse offers competitive pricing per 1 million tokens:
ChatBrowserUse offers the best pricing per 1 million tokens:
| Token Type | Price per 1M tokens |
|------------|---------------------|
| Input tokens | $0.50 |
| Output tokens | $3.00 |
| Cached tokens | $0.10 |
| Input tokens | $0.20 |
| Cached tokens | $0.02 |
| Output tokens | $2.00 |
<Note>
Cached tokens provide significant cost savings on repeated context, reducing input costs by 80%.
</Note>
### Google Gemini [example](https://github.com/browser-use/browser-use/blob/main/examples/models/gemini.py)


@@ -0,0 +1,38 @@
import asyncio
import os

from dotenv import load_dotenv

from browser_use import Agent, ChatOpenAI

load_dotenv()

# Get API key from environment variable
api_key = os.getenv('MOONSHOT_API_KEY')
if api_key is None:
    print('Make sure you have MOONSHOT_API_KEY set in your .env file')
    print('Get your API key from https://platform.moonshot.ai/console/api-keys')
    exit(1)

# Configure Moonshot AI model
llm = ChatOpenAI(
    model='kimi-k2-thinking',
    base_url='https://api.moonshot.ai/v1',
    api_key=api_key,
    add_schema_to_system_prompt=True,
    remove_min_items_from_schema=True,  # Moonshot doesn't support minItems in JSON schema
    remove_defaults_from_schema=True,  # Moonshot doesn't allow default values with anyOf
)


async def main():
    agent = Agent(
        task='Search for the latest news about AI and summarize the top 3 articles',
        llm=llm,
        flash_mode=True,
    )
    await agent.run()


if __name__ == '__main__':
    asyncio.run(main())