Commit Graph

263 Commits

Author SHA1 Message Date
Magnus Müller
239fd3f86b eval include runner link 2025-07-02 14:10:22 +02:00
Magnus Müller
a9488feead Change to 2 core runners 2025-07-01 12:56:36 +02:00
Magnus Müller
9376d9d91e Update evaluation workflow to use new runner label 'eval-4-core-500' 2025-07-01 12:54:18 +02:00
Magnus Müller
31b503fb42 Name group in eval 2025-07-01 11:14:07 +02:00
Magnus Müller
4e2b5b2f5c Fix typo in evaluation workflow runner name from '16-cores' to '16-core' 2025-07-01 10:45:42 +02:00
Magnus Müller
bd45b18508 Update evaluation workflow to maintain Ubuntu runner with 16 cores 2025-07-01 10:36:38 +02:00
Magnus Müller
bd4066354a eval-runner-16-core 2025-07-01 10:31:33 +02:00
Magnus Müller
81765de5ec Update evaluation workflow to specify Ubuntu runner with 8 cores 2025-07-01 10:09:04 +02:00
Magnus Müller
b9c81ec295 Add support for anchor usage in evaluation script arguments 2025-07-01 08:03:39 +02:00
Magnus Müller
c891af4ad1 Update eval workflow to use ubuntu-latest and streamline dependency installation
- Changed runner from Blacksmith to ubuntu-latest for improved compatibility.
- Updated setup-uv action to use astral-sh/setup-uv@v6.
- Simplified dependency installation steps by removing unnecessary verification and debug outputs.
- Adjusted Playwright version detection and caching actions for better performance.
2025-06-30 11:37:23 +02:00
blacksmith-sh[bot]
a172125484 Migrate workflows to Blacksmith 2025-06-30 09:16:19 +00:00
Nick Sweeting
7d0fb62bda don't re-run entire test suite before pypi release 2025-06-30 04:35:51 -04:00
Magnus Müller
035fba1f29 eval-runners-cache-enable 2025-06-30 09:30:57 +02:00
Magnus Müller
0954da334d eval-runners-disable-cache 2025-06-30 09:20:02 +02:00
Magnus Müller
32303e156c Add debug steps to eval workflow for repository structure and dependency checks 2025-06-30 09:14:46 +02:00
Magnus Müller
b271370f81 Add logging 2025-06-30 09:09:01 +02:00
Magnus Müller
42c5cb7e73 Remove unnecessary activate-environment option from eval workflow 2025-06-30 09:02:05 +02:00
Magnus Müller
6d344b54d9 Change config to blacksmith 2025-06-30 08:46:19 +02:00
Magnus Müller
5905ad949c feat: add thinking parameter to control agent system prompt
- Add --no-thinking flag to disable thinking in agent system prompt
- Default is true (thinking enabled) for backward compatibility
- Pass thinking parameter through entire evaluation pipeline
- Update GitHub Actions workflow to handle thinking parameter
2025-06-29 20:11:45 +02:00
Nick Sweeting
eb88fe98e9 disable fast docker for now 2025-06-27 06:05:57 -07:00
Nick Sweeting
db6b0ae440 fast-docker 2025-06-27 05:36:38 -07:00
Nick Sweeting
f3dc2b300a Merge branch 'main' into semaphores 2025-06-27 03:28:20 -07:00
Nick Sweeting
fe3af9479a run only one proc for tests for now 2025-06-27 03:02:03 -07:00
Magnus Müller
3de686d6af refactor: remove branch argument from single task mode in evaluation workflow
- Eliminated the branch argument from both eval.yaml and service.py for single task mode, simplifying argument parsing.
- Updated related logic to ensure backward compatibility while maintaining functionality for task ID, text, and website.
- Enhanced environment variable loading for improved clarity and consistency.
2025-06-27 10:11:21 +02:00
Magnus Müller
bb11c7e7ca feat: add single task mode support in evaluation workflow
- Introduced parameters for single task mode in eval.yaml, allowing task ID, text, website, and branch to be specified.
- Updated service.py to handle single task mode, including conditional saving to the server and local run ID generation.
- Enhanced argument parsing to accommodate single task mode, ensuring backward compatibility with existing multi-task functionality.
2025-06-27 09:57:37 +02:00
Nick Sweeting
74d02c07a7 increase screenshot timeout to default timeout 2025-06-26 02:59:57 -07:00
Nick Sweeting
170c3e0bb7 remove duplicate timeout config 2025-06-26 01:24:03 -07:00
Nick Sweeting
7c317e9515 always group tests by class so they can reuse fixtures 2025-06-25 23:31:02 -07:00
Gregor Žunič
7a10ae0c96 Squashed commit langchain to native 2025-06-24 12:26:55 +02:00
Nick Sweeting
27f63622bf don't send telemetry or cloud sync events during evals 2025-06-23 15:32:28 -07:00
Magnus Müller
537e86da4c Simplifies evaluation pipeline execution
Removes the `fresh_start` option and the stage for loading existing results.

This change streamlines the evaluation pipeline by removing the option to load existing results. The pipeline now always executes from the browser setup stage, ensuring consistent and repeatable evaluation runs.
2025-06-23 10:11:07 +02:00
Magnus Müller
d4a29c4b93 Improves evaluation robustness and reporting
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.

The changes include:

- Adds comprehensive judge fallback to Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
2025-06-23 00:08:14 +02:00
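Note on d4a29c4b93: the local summary described in this commit (total tasks, success rate, average score) could be computed roughly as follows. This is a minimal sketch, not the code from the commit; `TaskResult`, `summarize_results`, and the field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record shape; the real evaluation data may differ.
    task_id: str
    success: bool
    score: float


def summarize_results(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate local results into total tasks, success rate, and average score."""
    total = len(results)
    if total == 0:
        # Avoid division by zero when no tasks were evaluated.
        return {"total_tasks": 0, "success_rate": 0.0, "average_score": 0.0}
    successes = sum(1 for r in results if r.success)
    return {
        "total_tasks": total,
        "success_rate": successes / total,
        "average_score": sum(r.score for r in results) / total,
    }
```

Returning zeros for an empty run is a design choice in this sketch so that a failed or empty evaluation still produces a printable summary.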
Magnus Müller
be16ff3f69 Implement comprehensive judge system for task evaluation
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
2025-06-22 22:43:57 +02:00
Nick Sweeting
ac22e6ae20 Test fixes, eventbus tweaks, docs updates, and better warnings (#2027) 2025-06-21 06:32:11 -07:00
Nick Sweeting
0af8c8c0fe imports 2025-06-21 06:29:10 -07:00
Nick Sweeting
eb21d92d34 include extras packages in CI to avoid missing imports errors 2025-06-21 06:23:23 -07:00
Magnus Müller
aeea1788fa fix: CLI argument default conflict for highlight_elements
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements

Resolves bug where CLI users got highlighting disabled by default instead of enabled
2025-06-21 12:52:40 +02:00
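Note on aeea1788fa: the default conflict this commit describes can be reproduced with plain `argparse`. The sketch below is illustrative only; `build_parser` and its `buggy` switch are hypothetical, and the use of `dest='highlight_elements'` is an assumption about how the new flag maps onto the existing parameter name.

```python
import argparse


def build_parser(buggy: bool) -> argparse.ArgumentParser:
    """Build a parser using either the old (buggy) or the new flag style."""
    parser = argparse.ArgumentParser()
    if buggy:
        # store_true defaults to False when the flag is absent, which
        # silently overrides a function default of True.
        parser.add_argument("--highlight-elements", action="store_true")
    else:
        # store_false defaults to True when the flag is absent, so
        # highlighting stays enabled unless explicitly disabled.
        parser.add_argument(
            "--no-highlight-elements",
            dest="highlight_elements",  # assumed mapping to the existing name
            action="store_false",
        )
    return parser


if __name__ == "__main__":
    print(build_parser(buggy=True).parse_args([]).highlight_elements)
    print(build_parser(buggy=False).parse_args([]).highlight_elements)
```

With no flags passed, the old parser yields `False` and the new one yields `True`, which matches the "enabled by default" behavior the commit message states.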
Magnus Müller
9292e6c48d feat: add highlight_elements flag for controlling element highlighting
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation
2025-06-21 12:40:41 +02:00
Magnus Müller
f1d5dc5a17 Pass laminar_eval_id from frontend 2025-06-21 09:31:14 +02:00
Magnus Müller
83d92513a4 Monitor eval cpu 2025-06-20 23:35:56 +02:00
mertunsall
76ef41da45 Merge branch 'main' into mert/new_everything 2025-06-19 14:11:56 +02:00
Magnus Müller
f89a97f256 Delete laminar files since it is integrated to the main eval service 2025-06-19 11:26:53 +02:00
Magnus Müller
0b6ebea431 Merge branch 'main' into mert/new_everything 2025-06-19 09:49:43 +02:00
Nick Sweeting
829eafe982 Merge branch 'main' into eventbus 2025-06-18 10:26:50 -07:00
Magnus Müller
f89a1aac84 Update evaluation workflow to use laminar_eval.py
- Changed the script executed in the evaluation workflow from `eval/service.py` to `eval/laminar_eval.py` for consistency with recent updates.
2025-06-18 16:43:06 +02:00
Magnus Müller
7f1d256964 Update dependencies in pyproject.toml and rename evaluation workflow
- Removed duplicate lmnr dependency from the `dependencies` section in `pyproject.toml`.
- Updated `lmnr` version to `0.6.11` in the `eval` extras group.
- Renamed the evaluation workflow from "Run Evaluation Script" to "Run Laminar Eval Script" for clarity.
- Adjusted the dependency installation command in the workflow to include the `--extra eval` flag.
2025-06-18 16:31:00 +02:00
Robert Kim
01a36a587e gh action for laminar evals 2025-06-18 14:38:27 +01:00
mertunsall
1864e52635 Merge branch 'main' into mert/new_everything 2025-06-18 10:00:25 +02:00
Nick Sweeting
d0ec528802 fixed events 2025-06-17 22:38:59 -07:00
Nick Sweeting
ec32fee074 Merge branch 'main' into eventbus 2025-06-17 15:25:24 -07:00