Commit Graph

3344 Commits

Author SHA1 Message Date
Magnus Müller
d4a29c4b93 Improves evaluation robustness and reporting
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.

The changes include:

- Adds comprehensive judge fallback to Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
2025-06-23 00:08:14 +02:00
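A minimal sketch of the local summary calculation described in d4a29c4b93, reporting total tasks, success rate, and average score. The `TaskResult` record, its field names, and `summarize_local_results` are hypothetical names chosen for illustration; the real function in the repository may be shaped differently.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record for one locally evaluated task (not the project's type)."""
    task_id: str
    success: bool
    score: float


def summarize_local_results(results: list[TaskResult]) -> dict[str, float]:
    """Return total tasks, success rate, and average score for a run."""
    total = len(results)
    if total == 0:
        # Avoid dividing by zero when nothing was evaluated.
        return {"total_tasks": 0, "success_rate": 0.0, "average_score": 0.0}
    successes = sum(1 for r in results if r.success)
    return {
        "total_tasks": total,
        "success_rate": successes / total,
        "average_score": sum(r.score for r in results) / total,
    }


summary = summarize_local_results(
    [TaskResult("a", True, 1.0), TaskResult("b", False, 0.5)]
)
empty_summary = summarize_local_results([])
```

Returning zeros for an empty run keeps the report printable even when every task failed before producing a result.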
Magnus Müller
be170fb17a Ensures payload serialization preserves dict structure
Adds a type assertion to ensure that the payload remains a dictionary after serialization.

Also, adds type hints to `make_json_serializable` for better code clarity and maintainability.
2025-06-22 23:50:46 +02:00
Magnus Müller
4a26f07c66 Ensures JSON serializability of task results
Adds a utility function to convert objects within a payload to JSON-serializable types before returning the task result.

This change addresses potential issues where the task result contains non-serializable objects (e.g., enums, custom objects), preventing proper data handling.
2025-06-22 23:46:39 +02:00
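A rough sketch of what a helper like `make_json_serializable` (named in be170fb17a) could do for enums and custom objects. The body below is an assumption, not the repository's code; the `Status` enum, `Step` dataclass, and `payload` are made up for the demonstration. The final assertion mirrors the dict-structure check that be170fb17a describes.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from enum import Enum
from typing import Any


def make_json_serializable(obj: Any) -> Any:
    """Recursively convert obj into JSON-friendly types (assumed behavior)."""
    if isinstance(obj, Enum):
        # Checked first so that IntEnum/str-based enums become plain values.
        return make_json_serializable(obj.value)
    if is_dataclass(obj) and not isinstance(obj, type):
        return make_json_serializable(asdict(obj))
    if isinstance(obj, dict):
        return {str(k): make_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [make_json_serializable(v) for v in obj]
    if isinstance(obj, (str, int, float, bool)) or obj is None:
        return obj
    # Last resort for unknown custom objects: fall back to their string form.
    return str(obj)


class Status(Enum):  # hypothetical enum for the demo
    DONE = "done"


@dataclass
class Step:  # hypothetical custom object for the demo
    index: int
    status: Status


payload = {"status": Status.DONE, "steps": (Step(1, Status.DONE),)}
result = make_json_serializable(payload)
# The payload should still be a dict after conversion.
assert isinstance(result, dict)
encoded = json.dumps(result, sort_keys=True)
```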
Magnus Müller
56568a6ce0 Updates judge system import path (#2048)
Updates the import path for the comprehensive judge system to reflect
its new location in the project structure.

This resolves an issue where the previous relative import was causing
import errors.
    
---

## Summary by cubic
Updated the import path for the judge system to fix import errors after
moving the file.

2025-06-22 23:30:25 +02:00
Magnus Müller
d960a538cb Merge branch 'main' into fix/relative-import-eval 2025-06-22 23:28:07 +02:00
Magnus Müller
0a5a29e4a8 Updates judge system import path
Updates the import path for the comprehensive judge system to reflect its new location in the project structure.

This resolves an issue where the previous relative import was causing import errors.
2025-06-22 23:26:56 +02:00
Magnus Müller
d015c6cb94 Implement comprehensive judge system for task evaluation (#2047)
Added a new judge system in `judge_system.py` that evaluates browser-use
agent runs, providing detailed structured feedback. Updated the
evaluation workflow in `eval.yaml` to include a new command-line
argument for using the comprehensive judge. Modified `service.py` to
integrate the new judge system, allowing for fallback to the original
Mind2Web evaluation if specified. Enhanced error handling and logging
throughout the evaluation process.
    
---

## Summary by cubic
Added a comprehensive judge system for evaluating browser-use agent
runs, providing detailed, structured feedback and multi-dimensional
scoring. Updated the evaluation workflow to support both the new judge
and the original Mind2Web judge, with improved error handling and
logging.

- **New Features**
  - Introduced `judge_system.py` with a multi-criteria evaluation and JSON
    feedback.
  - Integrated the new judge into `service.py` with a command-line flag
    for judge selection.
  - Enhanced error handling and logging during evaluation.

- **Dependencies**
  - Updated `.github/workflows/eval.yaml` to add a flag for selecting the
    judge system.

2025-06-22 23:17:53 +02:00
Magnus Müller
eeb8024184 Handles varied LLM response formats
Ensures the judge system can correctly parse LLM responses, accommodating both string and list content types.

Adds a fallback mechanism to guarantee a result even if maximum retry attempts are exceeded, enhancing robustness and type safety.
2025-06-22 23:12:37 +02:00
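One way the handling described in eeb8024184 could look: normalize response content that arrives either as a string or as a list of parts, then return a fallback verdict once the retry budget is spent. The function names, the `{"type": "text", "text": ...}` part shape, and the fallback dict are illustrative assumptions, not the judge system's actual interface.

```python
import json
from typing import Any, Callable, Union


def extract_text(content: Union[str, list]) -> str:
    """Flatten string content or a list of parts into one string."""
    if isinstance(content, str):
        return content
    parts = []
    for part in content:
        if isinstance(part, str):
            parts.append(part)
        elif isinstance(part, dict) and isinstance(part.get("text"), str):
            parts.append(part["text"])
    return "".join(parts)


def judge_with_fallback(call_llm: Callable[[], Any], max_retries: int = 3) -> dict:
    """Try to parse a JSON verdict; always return a dict, even after failures."""
    for _ in range(max_retries):
        try:
            parsed = json.loads(extract_text(call_llm()))
        except (TypeError, ValueError):
            # Unparseable or unexpected content: try again.
            continue
        if isinstance(parsed, dict):
            return parsed
    # Hypothetical fallback shape so callers never receive None.
    return {"passed": False, "error": "max retries exceeded"}


list_reply = [{"type": "text", "text": '{"passed": true}'}]
verdict = judge_with_fallback(lambda: list_reply)
fallback = judge_with_fallback(lambda: "not json", max_retries=2)
```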
Magnus Müller
4629a8d9b7 Fixes relative import and type hints
Fixes a relative import issue for the judge system.

Updates type hints to allow None values for laminar_link and critical_error.

Comments out unused code related to Laminar link updates.
2025-06-22 23:07:10 +02:00
Magnus Müller
be16ff3f69 Implement comprehensive judge system for task evaluation
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
2025-06-22 22:43:57 +02:00
Mert Unsal
4255cb6cc2 Update action description for input_text to clarify functionality (#2044)
Updated the action description for input_text to clarify that it both
clicks and inputs text into an interactive element.
2025-06-22 17:58:07 +02:00
mertunsall
b3ccedf632 Update action description for input_text to clarify functionality 2025-06-22 17:56:53 +02:00
Mert Unsal
952c63a761 Fix input_text (#2040)
Improved input text handling for DOM elements by prioritizing
click-and-type, adding better error handling, and returning clear error
messages on failure.
Text input previously failed on https://flights.google.com/ and now works.
2025-06-22 17:22:54 +02:00
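The click-and-type approach with error reporting from #2040 can be pictured roughly as follows. This is a sketch under stated assumptions: `BrowserError` here is a local stand-in for the project's exception, and `FakeElement`, `click_and_type`, and `input_text_action` are hypothetical names that do not come from the repository.

```python
class BrowserError(Exception):
    """Local stand-in for the project's BrowserError so the sketch runs."""


class FakeElement:
    """Hypothetical element used only to exercise the sketch."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.value = ""

    def click(self) -> None:
        if self.fail:
            raise RuntimeError("element is not clickable")

    def type(self, text: str) -> None:
        self.value += text


def click_and_type(element: FakeElement, text: str) -> None:
    """Click first, then type; wrap any failure in BrowserError."""
    try:
        element.click()
        element.type(text)
    except Exception as exc:
        raise BrowserError(f"Failed to input text: {exc}") from exc


def input_text_action(element: FakeElement, text: str) -> str:
    """Return a clear message instead of letting the error escape."""
    try:
        click_and_type(element, text)
    except BrowserError as exc:
        return f"Error: {exc}"
    return f"Typed {len(text)} characters"


ok_message = input_text_action(FakeElement(), "Zurich")
error_message = input_text_action(FakeElement(fail=True), "Zurich")
```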
Mert Unsal
4ce6f769bf Merge branch 'main' into mert/fix_input_message 2025-06-22 17:17:22 +02:00
mertunsall
4e88b221b6 Fix exception handling for text input in DOM elements by raising BrowserError within the try-except block, ensuring proper error reporting. 2025-06-22 17:16:28 +02:00
mertunsall
f896fe40c6 Prioritize click and enter text for input text 2025-06-22 15:35:50 +02:00
mertunsall
258d0b0bec Implemented a try-except block to handle exceptions when inputting text into DOM elements. If an error occurs, a descriptive message is returned to indicate the failure, improving robustness in browser interactions. 2025-06-22 15:34:37 +02:00
Mert Unsal
f825477f3c fix: fixing open file encoding error in windows (#2039)
fix: [Bug: Failed to load system prompt template: 'gbk' codec can't
decode byte 0x94 in position 2390: illegal multibyte sequence
#2038](https://github.com/browser-use/browser-use/issues/2038)

In Python, when open() is called without an explicit encoding, the
default depends on the platform:
Python 3.10+: defaults to the encoding returned by
locale.getpreferredencoding(False)

On my Windows machine, locale.getpreferredencoding(False) returns
`cp936` (equivalent to GBK), so I added an `encoding` parameter to the
open() call to use `utf-8`.
```
import locale

print(locale.getpreferredencoding(False))
```
    
---

## Summary by cubic
Fixed a file encoding error when loading the system prompt template on
Windows by setting the file to open with UTF-8 encoding.

2025-06-22 12:29:26 +02:00
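The fix in #2039 comes down to passing `encoding="utf-8"` to `open()` so the result no longer depends on the machine's locale. A small sketch follows; `load_prompt_template` and the temporary file name are hypothetical, chosen only to make the example runnable.

```python
import tempfile
from pathlib import Path


def load_prompt_template(path: Path) -> str:
    """Read a template as UTF-8 regardless of the system locale."""
    with open(path, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    template_path = Path(tmp) / "system_prompt.md"  # hypothetical file name
    # Write raw UTF-8 bytes that include a non-ASCII character.
    template_path.write_bytes("caf\u00e9 prompt".encode("utf-8"))
    text = load_prompt_template(template_path)
```

Without the explicit `encoding` argument, the same read would use the locale's preferred encoding, which is how a GBK default can fail on UTF-8 bytes.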
mingzhong.li
dbae589623 fix: fixing open file encoding error in windows 2025-06-22 17:52:14 +08:00
Mert Unsal
16f845ca79 Update system prompt (#2034)
Improved the system prompt instructions for browser automation agents to
clarify reasoning steps, element selection, and file handling.
2025-06-22 08:52:18 +02:00
mertunsall
95ca883894 Update system prompt 2025-06-21 23:36:48 +02:00
Mert Unsal
98d08cc040 Revert "Enhance system prompt reasoning (#2022)" (#2033)
Reverted recent changes to the system prompt.
0.3.2
2025-06-21 23:31:11 +02:00
mertunsall
b4783745b9 Revert "Enhance system prompt reasoning (#2022)"
This reverts commit 25a2eecbfd, reversing
changes made to 8194ecbc3e.
2025-06-21 23:27:19 +02:00
Nick Sweeting
e6dd2ae475 error handling during browser launch 2025-06-21 08:05:41 -07:00
Nick Sweeting
4a8a4155b3 try keep alive browsers 2025-06-21 07:23:54 -07:00
Nick Sweeting
b5193c445f make browser launch timeout set using playwright kwarg (#2030) 2025-06-21 07:16:08 -07:00
Nick Sweeting
959e0b2911 make browser launch timeout set using playwright kwarg 2025-06-21 07:14:48 -07:00
Nick Sweeting
6d84267a1f fix for indexerrors during browser launches 2025-06-21 07:05:03 -07:00
Nick Sweeting
6b50bda566 add pyright to lint script 2025-06-21 06:33:47 -07:00
Nick Sweeting
ac22e6ae20 Test fixes, eventbus tweaks, docs updates, and better warnings (#2027) 2025-06-21 06:32:11 -07:00
Nick Sweeting
0af8c8c0fe imports 2025-06-21 06:29:10 -07:00
Nick Sweeting
eb21d92d34 include extras packages in CI to avoid missing imports errors 2025-06-21 06:23:23 -07:00
Nick Sweeting
de67673b79 test fix 2025-06-21 06:19:05 -07:00
Nick Sweeting
046c53a171 hint and lint fixes 2025-06-21 06:16:53 -07:00
Nick Sweeting
a1144052ad tests sync client auth 2025-06-21 06:09:57 -07:00
Nick Sweeting
3209fd95f7 lint and hint fixes 2025-06-21 06:07:21 -07:00
Nick Sweeting
aad78d93ab more type hint fixes 2025-06-21 05:44:49 -07:00
Nick Sweeting
6c695d0a42 more lint and hint fixes 2025-06-21 05:39:17 -07:00
Nick Sweeting
f878b8f07c type hint fixes 2025-06-21 05:16:02 -07:00
Nick Sweeting
eb12440558 fix CI numpy wheel not available on py 3.12 errors 2025-06-21 05:15:42 -07:00
Nick Sweeting
6bc1f7985f more type hint fixes 2025-06-21 04:56:27 -07:00
Nick Sweeting
340bafdd29 move old tests to old folder 2025-06-21 04:47:46 -07:00
Nick Sweeting
e3c145377b fix window resizing 2025-06-21 04:35:31 -07:00
Nick Sweeting
b67be37490 fix type hint errors 2025-06-21 04:35:24 -07:00
Nick Sweeting
875c8fc831 fix tests 2025-06-21 04:08:51 -07:00
Magnus Müller
84c69dacb0 feat: Add highlight_elements flag for controlling element highlighting (#2028)
## Summary
This PR adds a new `highlight_elements` flag that allows users to
control whether interactive elements are highlighted on web pages during
browser automation.

## Changes Made
- **Frontend (UI)**: Added `highlightElements` field to run settings
store with default value `true`
- **Backend**: Added `--highlight-elements` CLI argument to
`eval/service.py`
- **Pipeline**: Pass `highlight_elements` parameter through entire
execution pipeline
- **GitHub Workflow**: Added support for `highlight_elements` in
`eval.yaml`
- **Browser Configuration**: Correctly pass flag to `BrowserSession` →
`BrowserProfile`

## How it works
- **UI**: Users can toggle "Highlight Elements" in the Flags section
- **CLI**: Can be enabled with `--highlight-elements` argument  
- **Backend**: Parameter flows through all execution stages
- **Browser**: Controls whether interactive elements are highlighted on
pages during automation

## Benefits
- 🎯 **Better debugging**: Users can see exactly which elements the agent
is interacting with
- 🔧 **Flexible control**: Can be disabled for performance or cleaner
screenshots
- 📱 **UI Integration**: Seamlessly integrated into the evaluation
platform interface
- 🛠️ **CLI Support**: Available for both UI and command-line usage

## Testing
- Verified UI toggle functionality in evaluation platform
- Tested CLI argument parsing and parameter flow
- Confirmed GitHub workflow integration
- Validated browser configuration handling

Resolves the need for user-controllable element highlighting during
browser automation.
    
---

## Summary by cubic
Added a highlight_elements flag to let users control if interactive
elements are highlighted during browser automation.

- **New Features**
  - Added UI toggle and CLI flag for element highlighting.
  - Passed highlight_elements through the backend and GitHub workflow.

2025-06-21 13:07:34 +02:00
Nick Sweeting
3cf9f3410c fix config issues 2025-06-21 04:03:58 -07:00
Magnus Müller
341a305b2c chore: remove unused dataclass import
- Remove unused 'from dataclasses import dataclass' import from dom/service.py
- Applied by pre-commit hooks cleanup
2025-06-21 13:02:33 +02:00
Magnus Müller
aeea1788fa fix: CLI argument default conflict for highlight_elements
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements

Resolves bug where CLI users got highlighting disabled by default instead of enabled
2025-06-21 12:52:40 +02:00
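The default conflict fixed in aeea1788fa can be reproduced with plain `argparse`: a `store_true` flag defaults to `False` when omitted, while a `store_false` flag defaults to `True`. The `dest="highlight_elements"` mapping below is an assumption about how the value is exposed; the commit message only names the flag and its action.

```python
import argparse

# Old behavior: omitting the flag yields False, which conflicts with a
# function default of True.
old_parser = argparse.ArgumentParser()
old_parser.add_argument("--highlight-elements", action="store_true")
old_default = old_parser.parse_args([])

# New behavior: highlighting stays on unless explicitly disabled.
new_parser = argparse.ArgumentParser()
new_parser.add_argument(
    "--no-highlight-elements",
    dest="highlight_elements",  # assumed destination name
    action="store_false",
)
default_args = new_parser.parse_args([])
disabled_args = new_parser.parse_args(["--no-highlight-elements"])
```

With this shape, a workflow only needs to append `--no-highlight-elements` when highlighting should be turned off, matching the flag logic the commit describes.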
Nick Sweeting
c451bca15c fix spaces 2025-06-21 03:46:00 -07:00