Commit Graph

238 Commits

Nick Sweeting
74d02c07a7 increase screenshot timeout to default timeout 2025-06-26 02:59:57 -07:00
Nick Sweeting
170c3e0bb7 remove duplicate timeout config 2025-06-26 01:24:03 -07:00
Nick Sweeting
7c317e9515 always group tests by class so they can reuse fixtures 2025-06-25 23:31:02 -07:00
Gregor Žunič
7a10ae0c96 Squashed commit langchain to native 2025-06-24 12:26:55 +02:00
Nick Sweeting
27f63622bf don't send telemetry or cloud sync events during evals 2025-06-23 15:32:28 -07:00
Magnus Müller
537e86da4c Simplifies evaluation pipeline execution
Removes the `fresh_start` option and the stage for loading existing results.

This change streamlines the evaluation pipeline by removing the option to load existing results. The pipeline now always executes from the browser setup stage, ensuring consistent and repeatable evaluation runs.
2025-06-23 10:11:07 +02:00
Magnus Müller
d4a29c4b93 Improves evaluation robustness and reporting
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.

The changes include:

- Adds comprehensive judge fallback to Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
2025-06-23 00:08:14 +02:00
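The local summary described in this commit (total tasks, success rate, average score) can be sketched as follows. The function name and the result schema (`success` and `score` fields) are assumptions for illustration, not the actual code from the repository:

```python
def summarize_local_results(results):
    """Summarize local evaluation results.

    `results` is assumed to be a list of dicts, each with a boolean
    'success' and a numeric 'score' field; the real schema may differ.
    """
    total = len(results)
    if total == 0:
        return {'total_tasks': 0, 'success_rate': 0.0, 'average_score': 0.0}
    passed = sum(1 for r in results if r.get('success'))
    avg_score = sum(r.get('score', 0.0) for r in results) / total
    return {
        'total_tasks': total,
        'success_rate': passed / total,
        'average_score': avg_score,
    }
```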
Magnus Müller
be16ff3f69 Implement comprehensive judge system for task evaluation
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
2025-06-22 22:43:57 +02:00
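The fallback behavior this commit describes (use the comprehensive judge when requested, revert to the original Mind2Web evaluation on failure) can be sketched like this. All function names and signatures here are hypothetical stand-ins, not the actual API of `judge_system.py` or `service.py`:

```python
def mind2web_judge(run):
    # Stand-in for the original Mind2Web evaluation (hypothetical signature).
    return {'judge': 'mind2web', 'success': run.get('final_result') == 'done'}

def comprehensive_judge(run):
    # Stand-in for the new comprehensive judge (hypothetical signature).
    if 'history' not in run:
        raise ValueError('comprehensive judge needs full agent history')
    return {'judge': 'comprehensive', 'success': True}

def evaluate_run(run, use_comprehensive_judge=False):
    """Evaluate an agent run, falling back to Mind2Web if the
    comprehensive judge raises."""
    if use_comprehensive_judge:
        try:
            return comprehensive_judge(run)
        except Exception as exc:
            print(f'comprehensive judge failed ({exc}); falling back to Mind2Web')
    return mind2web_judge(run)
```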
Nick Sweeting
ac22e6ae20 Test fixes, eventbus tweaks, docs updates, and better warnings (#2027) 2025-06-21 06:32:11 -07:00
Nick Sweeting
0af8c8c0fe imports 2025-06-21 06:29:10 -07:00
Nick Sweeting
eb21d92d34 include extras packages in CI to avoid missing import errors 2025-06-21 06:23:23 -07:00
Magnus Müller
aeea1788fa fix: CLI argument default conflict for highlight_elements
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements

Resolves bug where CLI users got highlighting disabled by default instead of enabled
2025-06-21 12:52:40 +02:00
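The argparse conflict this commit fixes can be sketched as follows. The flag names match the commit message; the surrounding parser setup is a minimal hypothetical, not the actual `eval/service.py` code:

```python
import argparse

parser = argparse.ArgumentParser()

# Buggy pattern: action='store_true' implies a default of False, which
# silently overrode the function default of highlight_elements=True
# whenever the flag was omitted:
#   parser.add_argument('--highlight-elements', action='store_true')

# Fixed pattern: an inverted flag. store_false implies a default of True,
# so highlighting stays enabled unless explicitly disabled.
parser.add_argument(
    '--no-highlight-elements',
    dest='highlight_elements',
    action='store_false',
    help='Disable highlighting of interactive elements (enabled by default)',
)
```

With this shape, `parser.parse_args([])` yields `highlight_elements=True`, and passing `--no-highlight-elements` flips it to `False`, matching the function default described in the commit.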
Magnus Müller
9292e6c48d feat: add highlight_elements flag for controlling element highlighting
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation
2025-06-21 12:40:41 +02:00
Magnus Müller
f1d5dc5a17 Pass laminar_eval_id from frontend 2025-06-21 09:31:14 +02:00
Magnus Müller
83d92513a4 Monitor eval cpu 2025-06-20 23:35:56 +02:00
mertunsall
76ef41da45 Merge branch 'main' into mert/new_everything 2025-06-19 14:11:56 +02:00
Magnus Müller
f89a97f256 Delete laminar files since it is integrated to the main eval service 2025-06-19 11:26:53 +02:00
Magnus Müller
0b6ebea431 Merge branch 'main' into mert/new_everything 2025-06-19 09:49:43 +02:00
Nick Sweeting
829eafe982 Merge branch 'main' into eventbus 2025-06-18 10:26:50 -07:00
Magnus Müller
f89a1aac84 Update evaluation workflow to use laminar_eval.py
- Changed the script executed in the evaluation workflow from `eval/service.py` to `eval/laminar_eval.py` for consistency with recent updates.
2025-06-18 16:43:06 +02:00
Magnus Müller
7f1d256964 Update dependencies in pyproject.toml and rename evaluation workflow
- Removed duplicate lmnr dependency from the `dependencies` section in `pyproject.toml`.
- Updated `lmnr` version to `0.6.11` in the `eval` extras group.
- Renamed the evaluation workflow from "Run Evaluation Script" to "Run Laminar Eval Script" for clarity.
- Adjusted the dependency installation command in the workflow to include the `--extra eval` flag.
2025-06-18 16:31:00 +02:00
Robert Kim
01a36a587e gh action for laminar evals 2025-06-18 14:38:27 +01:00
mertunsall
1864e52635 Merge branch 'main' into mert/new_everything 2025-06-18 10:00:25 +02:00
Nick Sweeting
d0ec528802 fixed events 2025-06-17 22:38:59 -07:00
Nick Sweeting
ec32fee074 Merge branch 'main' into eventbus 2025-06-17 15:25:24 -07:00
Nick Sweeting
a114613c09 ignore formatting changes commits 2025-06-17 14:13:34 -07:00
Magnus Müller
c66880a8fa Add lmnr package for tracing integration and update eval workflow 2025-06-17 18:58:56 +02:00
mertunsall
0c34c399db Merge branch 'main' into mert/new_everything 2025-06-17 11:19:06 +02:00
Nick Sweeting
e3d21d33a1 fix evaluate_tasks.py errors in CI 2025-06-16 17:20:27 -07:00
Nick Sweeting
3972f335f9 allow claude github action to use more tools 2025-06-16 15:47:18 -07:00
Nick Sweeting
cdf0706f20 only create pre-release versions manually 2025-06-12 10:49:04 -07:00
Nick Sweeting
06488e11ba fix clickaction error handling test 2025-06-11 00:05:58 -07:00
Nick Sweeting
1fd8e0ec92 try statuses-write option 2025-06-10 23:57:41 -07:00
Nick Sweeting
4d8f1342e8 try and revert evals change 2025-06-10 23:57:41 -07:00
Nick Sweeting
ffd36eb5da tweak env vars for CI 2025-06-10 03:58:37 -07:00
Nick Sweeting
4d95a5db75 cleanup project root dir a bit 2025-06-09 22:34:34 -07:00
Magnus Müller
061f2cf089 Increase timeout duration in evaluation workflow from 60 to 360 minutes to accommodate longer-running jobs. 2025-06-09 20:51:11 +02:00
Magnus Müller
fe64e35b3c Add SERPER_API_KEY to evaluation workflow for enhanced API integration 2025-06-09 12:25:50 +02:00
Magnus Müller
b1af578bf3 Add SERPER_API_KEY to evaluation workflow for enhanced API integration 2025-06-09 11:58:34 +02:00
Alezander9
739a3b2c87 add flag to toggle whether webjudge gets to see final agent result 2025-06-08 17:49:05 -07:00
Alezander9
31dcce5390 decrease workflow timeout 2025-06-07 22:08:29 -07:00
Alezander9
c04710ef61 Merge remote-tracking branch 'upstream/main' into parallel-evals 2025-06-07 21:48:47 -07:00
Alezander9
bc56c73fe7 add ability to append results to existing run, so we can parallelize 2025-06-07 21:47:47 -07:00
Magnus Müller
eaab9f04d7 Enhance GitHub Actions workflow and evaluate_tasks.py to include detailed task evaluation results. The workflow now captures and displays detailed results in a structured format, while the Python script outputs detailed results as JSON for better integration with GitHub Actions. This improves visibility and understanding of task outcomes in the evaluation process. 2025-06-07 13:39:16 +02:00
Magnus Müller
3ecee462a2 Update GitHub Actions workflow permissions to allow writing comments on pull requests and issues, enhancing interaction capabilities for automated testing processes. 2025-06-07 13:22:43 +02:00
Magnus Müller
bdf29c34fb Add PR comment functionality to GitHub Actions workflow for agent task evaluation results. This includes a summary of passed tasks, percentage score, and status emoji based on task outcomes, enhancing visibility of evaluation results directly in pull requests. 2025-06-07 13:16:30 +02:00
Magnus Müller
8d9b24b03a Add summary output for agent tasks evaluation in CI workflow 2025-06-07 11:27:06 +02:00
Magnus Müller
52c60f8367 Remove dependency on tests job in evaluate-tasks step of CI workflow 2025-06-07 11:11:23 +02:00
Magnus Müller
576519ee40 Enhance CI workflow by adding agent tasks evaluation step and updating evaluate_tasks.py to output evaluation results 2025-06-07 10:59:24 +02:00
Magnus Müller
3666d2b077 Add agent tasks evaluation script and update CI workflow to include it 2025-06-07 10:49:03 +02:00