Commit Graph

235 Commits

Author SHA1 Message Date
Nick Sweeting
fd07360f57 simpler page title fetching 2025-07-10 18:35:24 -07:00
Nick Sweeting
cdca6339f6 try all browsers for evals 2025-07-10 16:24:47 -07:00
Nick Sweeting
cb4f6be2a8 Merge branch 'main' into cdp-loading 2025-07-10 05:55:01 -07:00
Gregor Žunič
1eac01dfab openrouter evals support 2025-07-10 12:51:52 +02:00
Nick Sweeting
4b4b93f6cc tweak chrome used for test.yaml evaluate_Tasks 2025-07-09 15:32:10 -07:00
Nick Sweeting
3bd76cea98 tweak chrome used for test.yaml evaluate_Tasks 2025-07-09 15:26:07 -07:00
Magnus Müller
bc5ff33b09 eval remove laminar args 2025-07-09 20:02:00 +02:00
Magnus Müller
f5763bdf49 add browser use logging 2025-07-09 18:01:09 +02:00
Magnus Müller
db3fa28442 eval laminar key name 2025-07-09 17:59:47 +02:00
Nick Sweeting
435426fc9a bump cache action version 2025-07-08 18:40:36 -07:00
Nick Sweeting
b4a8776fec speed up chrome install in CI 2025-07-08 18:19:08 -07:00
Nick Sweeting
1fa7fee4f6 fix cache key for tests 2025-07-08 18:10:14 -07:00
Nick Sweeting
7cf6a26664 fix flipped order 2025-07-08 18:07:43 -07:00
Nick Sweeting
1c6b510f07 use sudo for curl to update 2025-07-08 18:05:20 -07:00
Nick Sweeting
28f0d4d401 use runner arch in cache key 2025-07-08 18:03:11 -07:00
Nick Sweeting
b206db41a1 use consistent bin name 2025-07-08 18:01:24 -07:00
Nick Sweeting
32e5430b62 only cache actual binary 2025-07-08 18:00:55 -07:00
Nick Sweeting
14030006db fix missing sudo 2025-07-08 17:57:18 -07:00
Nick Sweeting
4599f815f2 try to cache chrome apt package 2025-07-08 17:55:44 -07:00
Aitor
e409c36fd7 feat: forward unikraft secrets to the eval workflow .yaml 2025-07-08 17:40:59 +02:00
Nick Sweeting
3f84d1c460 set in_docker in evals 2025-07-08 06:09:30 -07:00
Nick Sweeting
4d8bdb3dbf install all browser versions for evals and tests 2025-07-08 06:05:20 -07:00
Nick Sweeting
7403c33be3 fix user-data-dir matching 2025-07-08 05:05:49 -07:00
Magnus Müller
8ed1f6cb88 Update failing test 2025-07-08 13:43:12 +02:00
Nick Sweeting
fdba54fb34 add pyright to pre-commit hooks 2025-07-07 18:03:55 -07:00
reformedot
b7fa04d336 feat: add parameters to remove images and css in the eval.yaml 2025-07-07 16:26:54 +02:00
Aitor
d032a1ec61 fix: update eval.yaml to use full HD screen resolution 2025-07-07 09:44:50 +02:00
reformedot
9de712d702 feat: added browser settings to browser profile 2025-07-06 20:00:06 +02:00
Mert Unsal
1124e82cd3 Merge branch 'main' into mert/fix_encoding 2025-07-06 18:15:22 +02:00
mertunsall
3bec6fc9bf add qwen2.5-vl-72b-instruct into evals 2025-07-06 18:10:13 +02:00
Magnus Müller
2b7367677f eval-log-level 2025-07-06 17:53:49 +02:00
Magnus Müller
25392d9cde add multiple last screenshots to llm input message 2025-07-06 13:50:31 +02:00
Magnus Müller
0612eb0aae eval-repeat-judge 2025-07-06 12:25:13 +02:00
reformedot
8c5672b1ef feat: added support for Browserbase and Hyperbrowser as available browsers in the eval
fix: revert example change
2025-07-05 16:20:24 +02:00
reformedot
8754e22ce3 feat: added browser arg to the eval script 2025-07-04 23:40:12 +02:00
Saurav Panda
8cf64699ad refc: removed debug logs from eval.yamml 2025-07-04 03:21:16 -07:00
Saurav Panda
f3fa86ea21 feat: added 2fa token parsing logic 2025-07-04 02:58:04 -07:00
Saurav Panda
d87380b643 debugging: gmail 2fa json data 2025-07-04 01:29:53 -07:00
Saurav Panda
4ed5d96ef5 updated eval with toJson mapping 2025-07-04 01:17:43 -07:00
Saurav Panda
bbfbcebd6e feat: added multi credential support 2025-07-04 00:15:38 -07:00
Saurav Panda
c7fedf5117 Merge branch 'main' into 2fa_gmail_integration 2025-07-03 22:03:35 -07:00
Magnus Müller
fc8d6b1c14 eval enable debug 2025-07-03 23:33:25 +02:00
Saurav Panda
2cd21e18e6 feat: removed debug from evals 2025-07-03 10:19:37 -07:00
Saurav Panda
20f66b9fc7 Update eval.yaml 2025-07-03 00:11:39 -07:00
Saurav Panda
673f342067 added some debug for the run 2025-07-02 23:46:01 -07:00
Saurav Panda
e7bfffc566 Merge remote-tracking branch 'upstream' into 2fa_gmail_integration 2025-07-02 18:09:48 -07:00
Saurav Panda
2944178691 feat: added 2fa token in eval 2025-07-02 16:41:36 -07:00
Magnus Müller
5b6a00032c Enhance evaluation workflow with validation and error handling improvements
- Added validation for START_INDEX and TOTAL_TASKS to ensure they are numeric, with default values set to prevent errors.
- Improved logging for task range calculations and runner ID generation, including warnings for non-numeric inputs.
- Enhanced evaluation output handling with comprehensive error capture and logging, ensuring better debugging information is available.
- Implemented checks for the existence of evaluation logs and provided statistics for better visibility into evaluation outcomes.
2025-07-02 19:39:54 +02:00
Magnus Müller
0cbda40a2c Enhance evaluation workflow with improved runner ID generation and progress tracking
- Added support for dynamic runner ID generation that aligns with GitHub Actions patterns, incorporating start index from environment variables.
- Updated the evaluation script to send detailed progress updates, including task range and total assigned tasks, to the tracking API.
- Improved error handling and logging for runner registration and completion updates to ensure reliability during evaluations.
2025-07-02 19:27:15 +02:00
Magnus Müller
44a180f716 eval-runner-status-updates 2025-07-02 17:38:58 +02:00