- Eliminated the branch argument from both eval.yaml and service.py for single task mode, simplifying argument parsing.
- Updated related logic to ensure backward compatibility while maintaining functionality for task ID, text, and website.
- Enhanced environment variable loading for improved clarity and consistency.
- Introduced parameters for single task mode in eval.yaml, allowing task ID, text, website, and branch to be specified.
- Updated service.py to handle single task mode, including conditional saving to the server and local run ID generation.
- Enhanced argument parsing to accommodate single task mode, ensuring backward compatibility with existing multi-task functionality.
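The single-task arguments described above can be sketched roughly as follows; the actual flag names and run-ID format in `service.py` are assumptions, not confirmed by this changelog.

```python
import argparse
import uuid

def parse_args(argv=None):
    # Hypothetical sketch of single-task-mode parsing; real flag names
    # in service.py may differ.
    parser = argparse.ArgumentParser(description="Evaluation service")
    parser.add_argument("--task-id", help="Run a single task by ID")
    parser.add_argument("--task-text", help="Inline task description for single task mode")
    parser.add_argument("--task-website", help="Target website for the single task")
    args = parser.parse_args(argv)
    # Generate a local run ID when results are not saved to the server.
    args.run_id = f"local-{uuid.uuid4().hex[:8]}"
    return args

args = parse_args(["--task-id", "t1", "--task-text", "Find the pricing page"])
print(args.task_id)  # -> t1
```

Omitting all three task arguments would leave the parser compatible with the existing multi-task invocation, which is what the backward-compatibility bullet above implies.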
Removes the `fresh_start` option and the stage for loading existing results.
This change streamlines the evaluation pipeline by removing the option to load existing results. The pipeline now always executes from the browser setup stage, ensuring consistent and repeatable evaluation runs.
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.
The changes include:
- Adds a fallback from the comprehensive judge to the Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
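The local summary calculation mentioned above might look like the following sketch; the result-dict keys (`success`, `score`) are assumptions about the local result format, not taken from the actual code.

```python
def summarize_local_results(results):
    """Hypothetical sketch of the local summary described above; assumes each
    result is a dict with a boolean 'success' and a numeric 'score'."""
    total = len(results)
    if total == 0:
        return {"total_tasks": 0, "success_rate": 0.0, "average_score": 0.0}
    successes = sum(1 for r in results if r.get("success"))
    return {
        "total_tasks": total,
        "success_rate": successes / total,
        "average_score": sum(r.get("score", 0.0) for r in results) / total,
    }

summary = summarize_local_results([
    {"success": True, "score": 1.0},
    {"success": False, "score": 0.4},
])
print(summary["total_tasks"], summary["success_rate"])  # -> 2 0.5
```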
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
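The fallback flow described above can be sketched as below. This is a minimal illustration, not the real `service.py` wiring: the judges are passed in as callables, and the exception-based trigger for falling back is an assumption.

```python
import logging

logger = logging.getLogger(__name__)

def evaluate_run(run_data, primary_judge, fallback_judge):
    # Hypothetical sketch: try the comprehensive judge first and fall back
    # to the Mind2Web judge if it raises.
    try:
        return primary_judge(run_data)
    except Exception as exc:
        logger.warning("Comprehensive judge failed (%s); falling back to Mind2Web", exc)
        return fallback_judge(run_data)

def failing_judge(run):
    raise RuntimeError("LLM call timed out")

result = evaluate_run({"task": "demo"}, failing_judge, lambda run: {"score": 0.5})
print(result)  # -> {'score': 0.5}
```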
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix the CLI argument defaulting to False when the flag was not provided, which conflicted with the function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements
Resolves a bug where CLI users got highlighting disabled by default instead of enabled
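The flag change above amounts to the following argparse sketch: with `action='store_false'` and an explicit `dest`, the parsed value defaults to True and the flag becomes an opt-out.

```python
import argparse

parser = argparse.ArgumentParser()
# Before (buggy): --highlight-elements with action='store_true' defaulted to
# False, contradicting the function default of True.
# After: an opt-out flag keeps highlighting on unless explicitly disabled.
parser.add_argument(
    "--no-highlight-elements",
    dest="highlight_elements",
    action="store_false",
    help="Disable highlighting of interactive elements (enabled by default)",
)

print(parser.parse_args([]).highlight_elements)                           # True
print(parser.parse_args(["--no-highlight-elements"]).highlight_elements)  # False
```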
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation
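The workflow side of this might be wired roughly as follows. This is a hypothetical sketch, not the actual `eval.yaml`: the input name, step layout, and entry-point path are assumptions, and the flag logic shown matches the later opt-out fix (the flag is added only when highlighting is turned off).

```yaml
# Hypothetical sketch of the eval.yaml input and flag wiring.
on:
  workflow_dispatch:
    inputs:
      highlight_elements:
        description: "Highlight interactive elements during automation"
        type: boolean
        default: true

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - name: Run evaluation
        run: |
          FLAGS=""
          if [ "${{ inputs.highlight_elements }}" = "false" ]; then
            FLAGS="--no-highlight-elements"
          fi
          python eval/service.py $FLAGS
```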
- Removed duplicate lmnr dependency from the `dependencies` section in `pyproject.toml`.
- Updated `lmnr` version to `0.6.11` in the `eval` extras group.
- Renamed the evaluation workflow from "Run Evaluation Script" to "Run Laminar Eval Script" for clarity.
- Adjusted the dependency installation command in the workflow to include the `--extra eval` flag.
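After these changes, the relevant `pyproject.toml` sections would look roughly like this; the project name and surrounding layout are placeholders, with only the `lmnr==0.6.11` pin and the `eval` extras group taken from the bullets above.

```toml
# Hypothetical sketch: lmnr appears only once, pinned in the eval extras group.
[project]
name = "example-project"
dependencies = [
    # lmnr removed from here to avoid duplicating the eval extra below
]

[project.optional-dependencies]
eval = [
    "lmnr==0.6.11",
]
```

The workflow then pulls in the pinned version via the extras flag, e.g. `uv sync --extra eval` (assuming the installation command is uv-based, as the `--extra` flag suggests).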