27 Commits

Author SHA1 Message Date
Magnus Müller
d4a29c4b93 Improves evaluation robustness and reporting
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.

The changes include:

- Adds a fallback from the comprehensive judge to the Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
2025-06-23 00:08:14 +02:00
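The local summary described in the commit above (total tasks, success rate, average score) could look roughly like the sketch below. It is an illustration only: the function name `summarize_results` and the `success` and `score` result fields are assumptions, not names taken from the repository.

```python
from typing import Any, Iterable, Mapping


def summarize_results(results: Iterable[Mapping[str, Any]]) -> dict:
    """Summarize local evaluation results.

    Assumes (hypothetically) that each result carries a boolean
    'success' field and a numeric 'score' field.
    """
    items = list(results)
    total = len(items)
    if total == 0:
        # Avoid division by zero when no tasks were evaluated.
        return {"total_tasks": 0, "success_rate": 0.0, "average_score": 0.0}

    successes = sum(1 for r in items if r.get("success"))
    score_sum = sum(float(r.get("score", 0.0)) for r in items)
    return {
        "total_tasks": total,
        "success_rate": successes / total,
        "average_score": score_sum / total,
    }
```

Returning a plain dict keeps the summary easy to log or serialize; a missing `score` is treated as 0.0 here, which is a design choice of this sketch rather than documented behavior.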
Magnus Müller
be16ff3f69 Implement comprehensive judge system for task evaluation
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
2025-06-22 22:43:57 +02:00
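The fallback behavior described in the commit above can be sketched as a small wrapper. This is a minimal illustration under stated assumptions: `evaluate_with_fallback`, `primary_judge`, `fallback_judge`, and `use_primary` are hypothetical names, and the real integration in `service.py` may differ.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def evaluate_with_fallback(
    task: Any,
    primary_judge: Callable[[Any], Any],
    fallback_judge: Callable[[Any], Any],
    use_primary: bool = True,
) -> Any:
    """Run the primary judge, falling back to the second judge on failure.

    All names here are hypothetical; in the commit's terms the primary
    would be the comprehensive judge and the fallback the original
    Mind2Web evaluation.
    """
    if not use_primary:
        # Caller opted out of the primary judge entirely.
        return fallback_judge(task)
    try:
        return primary_judge(task)
    except Exception:
        # Log the failure with its traceback, then degrade gracefully.
        logger.exception("Primary judge failed; using fallback judge")
        return fallback_judge(task)
```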
Magnus Müller
aeea1788fa fix: CLI argument default conflict for highlight_elements
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements

Resolves bug where CLI users got highlighting disabled by default instead of enabled
2025-06-21 12:52:40 +02:00
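The default conflict fixed in the commit above can be reproduced with the standard `argparse` module. The sketch below is a standalone illustration, not the code from `eval/service.py`; the two builder function names are hypothetical.

```python
import argparse


def build_buggy_parser() -> argparse.ArgumentParser:
    # With action='store_true' the value defaults to False when the flag
    # is omitted, which conflicts with a function default of True.
    parser = argparse.ArgumentParser()
    parser.add_argument("--highlight-elements", action="store_true")
    return parser


def build_fixed_parser() -> argparse.ArgumentParser:
    # With action='store_false' the value defaults to True, so
    # highlighting stays enabled unless the user opts out explicitly.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--no-highlight-elements",
        dest="highlight_elements",
        action="store_false",
    )
    return parser
```

Parsing an empty argument list with the first parser yields `highlight_elements=False`, while the second yields `True`; passing `--no-highlight-elements` to the second parser sets it to `False`.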
Magnus Müller
9292e6c48d feat: add highlight_elements flag for controlling element highlighting
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation
2025-06-21 12:40:41 +02:00
Magnus Müller
f1d5dc5a17 Pass laminar_eval_id from frontend 2025-06-21 09:31:14 +02:00
Magnus Müller
83d92513a4 Monitor eval cpu 2025-06-20 23:35:56 +02:00
mertunsall
1864e52635 Merge branch 'main' into mert/new_everything 2025-06-18 10:00:25 +02:00
Magnus Müller
c66880a8fa Add lmnr package for tracing integration and update eval workflow 2025-06-17 18:58:56 +02:00
mertunsall
0c34c399db Merge branch 'main' into mert/new_everything 2025-06-17 11:19:06 +02:00
Nick Sweeting
4d8f1342e8 try and revert evals change 2025-06-10 23:57:41 -07:00
Nick Sweeting
ffd36eb5da tweak env vars for CI 2025-06-10 03:58:37 -07:00
Magnus Müller
061f2cf089 Increase timeout duration in evaluation workflow from 60 to 360 minutes to accommodate longer-running jobs. 2025-06-09 20:51:11 +02:00
Magnus Müller
fe64e35b3c Add SERPER_API_KEY to evaluation workflow for enhanced API integration 2025-06-09 12:25:50 +02:00
Magnus Müller
b1af578bf3 Add SERPER_API_KEY to evaluation workflow for enhanced API integration 2025-06-09 11:58:34 +02:00
Alezander9
739a3b2c87 add flag to toggle whether webjudge gets to see final agent result 2025-06-08 17:49:05 -07:00
Alezander9
31dcce5390 decrease workflow timeout 2025-06-07 22:08:29 -07:00
Alezander9
bc56c73fe7 add ability to append results to existing run, so we can parallelize 2025-06-07 21:47:47 -07:00
Alezander9
61f9c31a3d feat: support changing eval task set 2025-06-03 10:27:34 -07:00
BroskyBrowser
15cb992618 fix: expose ANCHOR_BROWSER_API_KEY variable in the evals workflow .yaml 2025-06-01 12:58:14 +02:00
Alezander9
ad71ba8d29 add branch name selection into workflow 2025-05-24 19:12:01 -07:00
Alezander9
45dd0a26c2 update eval workflow with new arguments 2025-05-23 14:57:18 -07:00
Alezander9
a3dd8b004b update eval workflow with new arguments 2025-05-23 14:46:14 -07:00
Alezander9
aab470243f update user message default argument 2025-05-22 15:53:49 -07:00
Alezander9
529e43fdd1 update args in workflow script to match new format 2025-05-22 11:36:47 -07:00
Alezander9
4a7e9113ca add claude 4 support and cleanup eval script arguments 2025-05-22 10:54:00 -07:00
Alezander9
0fbfc82da0 switch eval workflow to use new consolidated branch 2025-05-21 14:39:09 -07:00
Alezander9
a8d661b2d0 consolidated changes: adapt refactored eval service to work with new browser and on github actions 2025-05-21 14:36:57 -07:00