Enhances evaluation with improved error handling, more detailed logging, and a local summary calculation.
The changes include:
- Adds a fallback from the comprehensive judge to the Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the tail of the output on failure.
- Adds a new function that summarizes local evaluation results: total tasks, success rate, and average score (see the sketch after this list).
- Includes comprehensive evaluation data for debugging purposes.
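A minimal sketch of what the local summary calculation could look like; the per-result field names (`success`, `score`) are assumptions, not the eval code's actual schema.
```
def calculate_local_summary(results: list[dict]) -> dict:
    """Aggregate local evaluation results into a short summary."""
    total_tasks = len(results)
    if total_tasks == 0:
        return {'total_tasks': 0, 'success_rate': 0.0, 'average_score': 0.0}
    successes = sum(1 for r in results if r.get('success'))
    average_score = sum(r.get('score', 0.0) for r in results) / total_tasks
    return {
        'total_tasks': total_tasks,
        'success_rate': successes / total_tasks,
        'average_score': average_score,
    }
```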
Adds a type assertion to ensure that the payload remains a dictionary after serialization.
Also, adds type hints to `make_json_serializable` for better code clarity and maintainability.
Adds a utility function that converts objects within a payload to JSON-serializable types before the task result is returned.
This addresses cases where the task result contains non-serializable objects (e.g., enums, custom objects) that previously broke downstream data handling.
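The PR names `make_json_serializable` and the dictionary assertion; the following is a hedged sketch of one common way to implement such a converter, not the repository's actual code.
```
import dataclasses
from enum import Enum
from typing import Any

def make_json_serializable(obj: Any) -> Any:
    """Recursively convert enums, dataclasses, and other custom objects
    into plain JSON-serializable types."""
    if isinstance(obj, Enum):
        return make_json_serializable(obj.value)
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return make_json_serializable(dataclasses.asdict(obj))
    if isinstance(obj, dict):
        return {str(k): make_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [make_json_serializable(v) for v in obj]
    if isinstance(obj, (str, int, float, bool)) or obj is None:
        return obj
    return str(obj)  # last resort: stringify unknown objects

class Status(Enum):
    DONE = 'done'

payload = make_json_serializable({'status': Status.DONE, 'steps': (1, 2)})
assert isinstance(payload, dict)  # the type assertion described above
```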
Updates the import path for the comprehensive judge system to reflect
its new location in the project structure.
This resolves an issue where the previous relative import was causing
import errors.
---
## Summary by cubic
Updated the import path for the judge system to fix import errors after
moving the file.
Added a new judge system in `judge_system.py` that evaluates browser-use
agent runs, providing detailed structured feedback. Updated the
evaluation workflow in `eval.yaml` to include a new command-line
argument for using the comprehensive judge. Modified `service.py` to
integrate the new judge system, allowing for fallback to the original
Mind2Web evaluation if specified. Enhanced error handling and logging
throughout the evaluation process.
---
## Summary by cubic
Added a comprehensive judge system for evaluating browser-use agent
runs, providing detailed, structured feedback and multi-dimensional
scoring. Updated the evaluation workflow to support both the new judge
and the original Mind2Web judge, with improved error handling and
logging.
- **New Features**
- Introduced `judge_system.py` with a multi-criteria evaluation and JSON
feedback.
- Integrated the new judge into `service.py` with a command-line flag
for judge selection.
- Enhanced error handling and logging during evaluation.
- **Dependencies**
- Updated `.github/workflows/eval.yaml` to add a flag for selecting the
judge system.
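A hedged sketch of the judge routing described above, with hypothetical stand-ins for the real judge entry points in `service.py` and `judge_system.py`:
```
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-ins for the real judges.
def comprehensive_judge(task, result):
    raise NotImplementedError  # detailed, multi-criteria evaluation

def mind2web_judge(task, result):
    return {'judge': 'mind2web', 'success': False}

def evaluate_task(task, result, use_mind2web: bool = False):
    """Route to the requested judge, falling back to Mind2Web on failure."""
    if use_mind2web:
        return mind2web_judge(task, result)
    try:
        return comprehensive_judge(task, result)
    except Exception as exc:
        # A judge crash should never leave the task without a result.
        logger.warning('Comprehensive judge failed (%s); falling back to Mind2Web', exc)
        return mind2web_judge(task, result)
```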
Ensures the judge system can correctly parse LLM responses, accommodating both string and list content types.
Adds a fallback mechanism to guarantee a result even if maximum retry attempts are exceeded, enhancing robustness and type safety.
Fixes a relative import issue for the judge system.
Updates type hints to allow `None` values for `laminar_link` and `critical_error`.
Comments out unused code related to Laminar link updates.
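A minimal sketch of the string/list content handling and the retry fallback described above, assuming an LLM response whose `content` is either a string or a list of text blocks; the retry wrapper and the default-result shape are assumptions:
```
def extract_text(content) -> str:
    """Accept both string and list-of-blocks LLM content."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict):
                parts.append(block.get('text', ''))
            else:
                parts.append(getattr(block, 'text', ''))
        return ''.join(parts)
    return str(content)

def judge_with_retries(call_llm, parse, max_retries: int = 3) -> dict:
    for _ in range(max_retries):
        try:
            return parse(extract_text(call_llm()))
        except Exception:
            continue
    # Guarantee a typed result even when every attempt fails.
    return {'success': False, 'score': 0.0, 'reasoning': 'max retries exceeded'}
```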
Improved input text handling for DOM elements by prioritizing
click-and-type, adding better error handling, and returning clear error
messages on failure.
Text input previously failed on https://flights.google.com/ and now works.
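A rough sketch of the click-and-type-first strategy using a Playwright-style element handle; the `fill()` fallback and the returned error shape are assumptions, not the controller's actual code:
```
async def input_text_to_element(element_handle, text: str) -> dict:
    try:
        # Prefer click-and-type: it fires the key events that JS-heavy
        # widgets (e.g. the Google Flights search box) listen for.
        await element_handle.click()
        await element_handle.type(text)
        return {'success': True}
    except Exception as primary_error:
        try:
            await element_handle.fill(text)  # fallback: set the value directly
            return {'success': True}
        except Exception:
            return {'success': False, 'error': f'Failed to input text: {primary_error}'}
```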
fix: [Bug: Failed to load system prompt template: 'gbk' codec can't
decode byte 0x94 in position 2390: illegal multibyte sequence
#2038](https://github.com/browser-use/browser-use/issues/2038)
In Python, when `open()` is called without an explicit encoding, the default is platform-dependent: since Python 3.10 it uses the encoding returned by `locale.getpreferredencoding(False)`.
On my Windows machine, `locale.getpreferredencoding(False)` returns `cp936` (equivalent to GBK), so I added an `encoding` parameter to the `open()` call to force `utf-8`.
You can check your platform default with:
```
import locale
print(locale.getpreferredencoding(False))
```
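The fix itself is a one-line change: pass an explicit encoding instead of relying on the platform default. The template path below is illustrative, not the repository's actual location.
```
from pathlib import Path

template_path = Path('system_prompt.md')  # illustrative path
with open(template_path, encoding='utf-8') as f:  # was: open(template_path)
    prompt_template = f.read()
```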
---
## Summary by cubic
Fixed a file encoding error when loading the system prompt template on
Windows by setting the file to open with UTF-8 encoding.
## Summary
This PR adds a new `highlight_elements` flag that allows users to
control whether interactive elements are highlighted on web pages during
browser automation.
## Changes Made
- ✅ **Frontend (UI)**: Added `highlightElements` field to run settings
store with default value `true`
- ✅ **Backend**: Added `--highlight-elements` CLI argument to
`eval/service.py`
- ✅ **Pipeline**: Pass `highlight_elements` parameter through entire
execution pipeline
- ✅ **GitHub Workflow**: Added support for `highlight_elements` in
`eval.yaml`
- ✅ **Browser Configuration**: Correctly pass flag to `BrowserSession` →
`BrowserProfile`
## How it works
- **UI**: Users can toggle "Highlight Elements" in the Flags section
- **CLI**: Can be enabled with `--highlight-elements` argument
- **Backend**: Parameter flows through all execution stages
- **Browser**: Controls whether interactive elements are highlighted on
pages during automation
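A hedged sketch of the plumbing from CLI flag to browser configuration; the constructor calls follow the PR's description (`BrowserSession` → `BrowserProfile`) but may not match the repository's current API exactly:
```
import argparse

from browser_use import BrowserProfile, BrowserSession

parser = argparse.ArgumentParser()
parser.add_argument('--highlight-elements', action='store_true',
                    help='Highlight interactive elements on pages')
args = parser.parse_args()

# The flag flows through the pipeline into the browser configuration.
profile = BrowserProfile(highlight_elements=args.highlight_elements)
session = BrowserSession(browser_profile=profile)
```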
## Benefits
- 🎯 **Better debugging**: Users can see exactly which elements the agent
is interacting with
- 🔧 **Flexible control**: Can be disabled for performance or cleaner
screenshots
- 📱 **UI Integration**: Seamlessly integrated into the evaluation
platform interface
- 🛠️ **CLI Support**: Available for both UI and command-line usage
## Testing
- Verified UI toggle functionality in evaluation platform
- Tested CLI argument parsing and parameter flow
- Confirmed GitHub workflow integration
- Validated browser configuration handling
Resolves the need for user-controllable element highlighting during
browser automation.
---
## Summary by cubic
Added a highlight_elements flag to let users control if interactive
elements are highlighted during browser automation.
- **New Features**
- Added UI toggle and CLI flag for element highlighting.
- Passed highlight_elements through the backend and GitHub workflow.
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements
Resolves bug where CLI users got highlighting disabled by default instead of enabled
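A minimal sketch of the corrected argparse semantics: highlighting stays on by default and is only disabled when the new flag is passed.
```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--no-highlight-elements',
    dest='highlight_elements',
    action='store_false',  # flag absent -> True, flag present -> False
    help='Disable highlighting of interactive elements',
)

assert parser.parse_args([]).highlight_elements is True
assert parser.parse_args(['--no-highlight-elements']).highlight_elements is False
```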