Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.
The changes include:
- Adds a comprehensive judge with fallback to the Mind2Web judge, preserving backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
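
As a rough illustration of the local summary calculation listed above, a minimal sketch might look like the following. The function name `calculate_local_summary` and the per-task `success` and `score` fields are assumptions made for this example, not the actual names or schema used in `service.py`.

```python
from typing import Any


def calculate_local_summary(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize local evaluation results.

    Assumes (hypothetically) that each result is a dict with a boolean
    `success` field and a numeric `score` field.
    """
    total = len(results)
    if total == 0:
        # Avoid division by zero when no tasks were evaluated.
        return {"total_tasks": 0, "success_rate": 0.0, "average_score": 0.0}

    successes = sum(1 for r in results if r.get("success"))
    score_sum = sum(float(r.get("score", 0.0)) for r in results)
    return {
        "total_tasks": total,
        "success_rate": successes / total,
        "average_score": score_sum / total,
    }


if __name__ == "__main__":
    sample = [
        {"success": True, "score": 1.0},
        {"success": False, "score": 0.5},
    ]
    summary = calculate_local_summary(sample)
    print(f"Total tasks:   {summary['total_tasks']}")
    print(f"Success rate:  {summary['success_rate']:.1%}")
    print(f"Average score: {summary['average_score']:.2f}")
```

Returning a plain dict keeps the summary easy to log or serialize; missing `score` values are treated as 0.0 in this sketch, which may not match the real behavior.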
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
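
The judge selection described above could be sketched roughly as follows. Everything here is hypothetical: the function names, the `use_mind2web_judge` parameter, the returned dict shape, and the choice of the comprehensive judge as the default are assumptions for illustration and do not reflect the actual code in `judge_system.py` or `service.py`.

```python
from typing import Any, Callable

# A judge takes a task description and returns an evaluation dict.
Judge = Callable[[str], dict[str, Any]]


def comprehensive_judge(task: str) -> dict[str, Any]:
    # Placeholder standing in for the new structured-feedback judge.
    return {"judge": "comprehensive", "task": task}


def mind2web_judge(task: str) -> dict[str, Any]:
    # Placeholder standing in for the original Mind2Web evaluation.
    return {"judge": "mind2web", "task": task}


def select_judge(use_mind2web_judge: bool = False) -> Judge:
    """Pick a judge; the default shown here is an assumption."""
    return mind2web_judge if use_mind2web_judge else comprehensive_judge


if __name__ == "__main__":
    print(select_judge()("example task")["judge"])
    print(select_judge(use_mind2web_judge=True)("example task")["judge"])
```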
- Replace `--highlight-elements` (`action='store_true'`) with `--no-highlight-elements` (`action='store_false'`)
- Fix the CLI argument defaulting to False when the flag is not provided, which conflicted with the function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements
Resolves a bug where CLI users got highlighting disabled by default instead of enabled
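
The flag change described above can be illustrated with a small `argparse` sketch. The `dest="highlight_elements"` setting and the `build_parser` helper are assumptions for this example; the actual argument definition in `eval/service.py` may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the highlight flag")
    # With action='store_false', the stored value defaults to True and
    # becomes False only when the flag is passed on the command line.
    parser.add_argument(
        "--no-highlight-elements",
        dest="highlight_elements",  # assumed destination name
        action="store_false",
        help="Disable highlighting of interactive elements",
    )
    return parser


if __name__ == "__main__":
    default_args = build_parser().parse_args([])
    disabled_args = build_parser().parse_args(["--no-highlight-elements"])
    print(default_args.highlight_elements)   # True
    print(disabled_args.highlight_elements)  # False
```

By contrast, a `--highlight-elements` flag with `action='store_true'` defaults to False when omitted, which is the mismatch with the function default that this change addresses.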
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improve debugging and visibility options for browser automation
- Removed duplicate `lmnr` dependency from the `dependencies` section in `pyproject.toml`.
- Updated `lmnr` version to `0.6.11` in the `eval` extras group.
- Renamed the evaluation workflow from "Run Evaluation Script" to "Run Laminar Eval Script" for clarity.
- Adjusted the dependency installation command in the workflow to include the `--extra eval` flag.