Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.
The changes include:
- Adds a fallback from the comprehensive judge to the Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
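A minimal sketch of what such a local summary could look like. The function name `calculate_local_summary` and the `success` / `score` keys are assumptions for illustration, not the names used in the actual change:

```python
from typing import Any


def calculate_local_summary(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize local evaluation results.

    Assumes (hypothetically) that each result is a dict with a boolean
    "success" key and a numeric "score" key.
    """
    total = len(results)
    if total == 0:
        # Avoid division by zero when no tasks were evaluated.
        return {"total_tasks": 0, "success_rate": 0.0, "average_score": 0.0}

    successes = sum(1 for r in results if r.get("success"))
    scores = [float(r.get("score", 0.0)) for r in results]
    return {
        "total_tasks": total,
        "success_rate": successes / total,
        "average_score": sum(scores) / total,
    }


example_summary = calculate_local_summary(
    [
        {"success": True, "score": 1.0},
        {"success": False, "score": 0.5},
    ]
)
```

Missing keys are treated as a failed task with a score of 0.0, so a partially populated result lowers the averages instead of raising an error.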
Adds a type assertion to ensure that the payload remains a dictionary after serialization.
Also, adds type hints to `make_json_serializable` for better code clarity and maintainability.
Adds a utility function to convert objects within a payload to JSON-serializable types before returning the task result.
This addresses cases where the task result contains non-serializable objects (e.g., enums, custom objects) that would otherwise fail JSON serialization.
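For illustration, a conversion helper along these lines might look like the sketch below. Only the name `make_json_serializable` comes from the change itself; the body, the handled types, and the example classes `Status` and `Custom` are assumptions:

```python
import json
from dataclasses import asdict, is_dataclass
from enum import Enum
from typing import Any


def make_json_serializable(obj: Any) -> Any:
    """Recursively convert obj into JSON-serializable types (illustrative sketch)."""
    # Check enums first: str- or int-based enums would also match the primitive check.
    if isinstance(obj, Enum):
        return make_json_serializable(obj.value)
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, dict):
        return {str(k): make_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [make_json_serializable(v) for v in obj]
    if is_dataclass(obj) and not isinstance(obj, type):
        return make_json_serializable(asdict(obj))
    if hasattr(obj, "__dict__"):
        # Custom objects: fall back to their instance attributes.
        return make_json_serializable(vars(obj))
    # Last resort: use the string representation.
    return str(obj)


class Status(Enum):  # hypothetical enum, for the example only
    OK = "ok"


class Custom:  # hypothetical custom object, for the example only
    def __init__(self) -> None:
        self.x = 1


payload = {"status": Status.OK, "obj": Custom(), "tags": ("a", "b")}
clean = make_json_serializable(payload)
# Type assertion: the payload must still be a dictionary after conversion.
assert isinstance(clean, dict)
encoded = json.dumps(clean)
```

The `isinstance` assertion after conversion mirrors the type assertion described above: the return type is `Any`, so the caller narrows it back to a dictionary before using it.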
Updates the import path for the comprehensive judge system to reflect its new location in the project structure.
This resolves an issue where the previous relative import was causing import errors.
Ensures the judge system can correctly parse LLM responses, accommodating both string and list content types.
Adds a fallback mechanism to guarantee a result even if maximum retry attempts are exceeded, enhancing robustness and type safety.
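The two behaviors above can be sketched roughly as follows. The helper names `extract_text` and `judge_with_retries`, the shape of list content (strings or dicts with a `text` key), and the fields of the fallback result are all assumptions, not the judge system's actual API:

```python
import json
from typing import Any, Callable


def extract_text(content: Any) -> str:
    """Flatten LLM message content that may be a string or a list of parts."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts: list[str] = []
        for part in content:
            if isinstance(part, str):
                parts.append(part)
            elif isinstance(part, dict) and isinstance(part.get("text"), str):
                parts.append(part["text"])
        return "".join(parts)
    return str(content)


def judge_with_retries(call_llm: Callable[[], Any], max_retries: int = 3) -> dict[str, Any]:
    """Call the LLM up to max_retries times and always return a dict."""
    for _ in range(max_retries):
        try:
            result = json.loads(extract_text(call_llm()))
            if isinstance(result, dict):
                return result
        except Exception:
            # Any failure (call error, invalid JSON) triggers another attempt.
            continue
    # Fallback result (hypothetical fields) once all attempts are exhausted.
    return {"success": False, "score": 0.0, "error": "max retries exceeded"}


parsed = judge_with_retries(lambda: [{"type": "text", "text": '{"score": 1}'}])
fallback = judge_with_retries(lambda: "not json")
```

Returning a fixed fallback dict instead of `None` keeps the return type non-optional, which is one way to get the type-safety benefit mentioned above.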
Fixes a relative import issue for the judge system.
Updates type hints to allow None values for laminar_link and critical_error.
Comments out unused code related to Laminar link updates.
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements
Resolves a bug where CLI users got highlighting disabled by default instead of enabled.
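The flag logic described above can be illustrated with a small `argparse` sketch. The `dest="highlight_elements"` mapping and the `build_parser` helper are assumptions; the actual parser in `eval/service.py` may be set up differently:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # With action="store_false" the default is True, so highlighting stays
    # enabled unless the flag is passed explicitly.
    parser.add_argument(
        "--no-highlight-elements",
        dest="highlight_elements",  # assumed destination name
        action="store_false",
        help="Disable highlighting of interactive elements (enabled by default)",
    )
    return parser


default_args = build_parser().parse_args([])
disabled_args = build_parser().parse_args(["--no-highlight-elements"])
```

By contrast, `--highlight-elements` with `action="store_true"` defaults to `False` when the flag is omitted, which is the mismatch with the function default of `True` that this change fixes.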
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation
- Added integration for Laminar tracing by initializing `AsyncLaminarClient` and creating evaluation links during task execution.
- Updated `TaskResult` class to include an optional `laminar_task_link` for task-specific links.
- Enhanced logging for task results and added error handling for Laminar datapoint creation.
- Improved the `run_task_with_semaphore` function to manage Laminar evaluation links and update datapoints with evaluation scores.
These changes aim to streamline the evaluation process and improve tracking of task performance.
- Renamed variable `link` to `laminar_task_link` for clarity in the `run_task_with_semaphore` function.
- Updated the creation of `TaskResult` to utilize `laminar_task_link` instead of the previous `link` variable.
- Improved logging to reflect the new variable name, enhancing readability and maintainability.
These changes aim to improve code clarity and maintain consistency in the evaluation workflow.
- Removed the `lmnr[all]` dependency from `pyproject.toml`.
- Added `browser-use[dev,eval]` to `dev-dependencies` for improved development support.
- Updated `TaskResult` class to include an optional `laminar_task_link` for task-specific links.
- Modified `run_task_with_semaphore` to handle Laminar evaluation links and improved logging for task results.
- Added logic to create a Laminar evaluation link during task execution.
These changes aim to streamline the evaluation process and enhance the overall functionality of the evaluation workflow.
- Updated `run_task_with_semaphore` to use `lmnr_run_id` for evaluation ID instead of `run_id`.
- Added a new helper function `start_new_run` to initiate evaluation runs on the server.
- Improved logging for task results and server interactions.
- Ensured proper handling of environment variables for server configuration.
This refactor aims to streamline the evaluation process and improve error handling.