Nick Sweeting
fd07360f57
simpler page title fetching
2025-07-10 18:35:24 -07:00
Nick Sweeting
cdca6339f6
try all browsers for evals
2025-07-10 16:24:47 -07:00
Nick Sweeting
cb4f6be2a8
Merge branch 'main' into cdp-loading
2025-07-10 05:55:01 -07:00
Gregor Žunič
1eac01dfab
openrouter evals support
2025-07-10 12:51:52 +02:00
Nick Sweeting
4b4b93f6cc
tweak chrome used for test.yaml evaluate_Tasks
2025-07-09 15:32:10 -07:00
Nick Sweeting
3bd76cea98
tweak chrome used for test.yaml evaluate_Tasks
2025-07-09 15:26:07 -07:00
Magnus Müller
bc5ff33b09
eval remove laminar args
2025-07-09 20:02:00 +02:00
Magnus Müller
f5763bdf49
add browser use logging
2025-07-09 18:01:09 +02:00
Magnus Müller
db3fa28442
eval laminar key name
2025-07-09 17:59:47 +02:00
Nick Sweeting
435426fc9a
bump cache action version
2025-07-08 18:40:36 -07:00
Nick Sweeting
b4a8776fec
speed up chrome install in CI
2025-07-08 18:19:08 -07:00
Nick Sweeting
1fa7fee4f6
fix cache key for tests
2025-07-08 18:10:14 -07:00
Nick Sweeting
7cf6a26664
fix flipped order
2025-07-08 18:07:43 -07:00
Nick Sweeting
1c6b510f07
use sudo for curl to update
2025-07-08 18:05:20 -07:00
Nick Sweeting
28f0d4d401
use runner arch in cache key
2025-07-08 18:03:11 -07:00
Nick Sweeting
b206db41a1
use consistent bin name
2025-07-08 18:01:24 -07:00
Nick Sweeting
32e5430b62
only cache actual binary
2025-07-08 18:00:55 -07:00
Nick Sweeting
14030006db
fix missing sudo
2025-07-08 17:57:18 -07:00
Nick Sweeting
4599f815f2
try to cache chrome apt package
2025-07-08 17:55:44 -07:00
Aitor
e409c36fd7
feat: forward unikraft secrets to the eval workflow .yaml
2025-07-08 17:40:59 +02:00
Nick Sweeting
3f84d1c460
set in_docker in evals
2025-07-08 06:09:30 -07:00
Nick Sweeting
4d8bdb3dbf
install all browser versions for evals and tests
2025-07-08 06:05:20 -07:00
Nick Sweeting
7403c33be3
fix user-data-dir matching
2025-07-08 05:05:49 -07:00
Magnus Müller
8ed1f6cb88
Update failing test
2025-07-08 13:43:12 +02:00
Nick Sweeting
fdba54fb34
add pyright to pre-commit hooks
2025-07-07 18:03:55 -07:00
reformedot
b7fa04d336
feat: add parameters to remove images and css in the eval.yaml
2025-07-07 16:26:54 +02:00
Aitor
d032a1ec61
fix: update eval.yaml to use full HD screen resolution
2025-07-07 09:44:50 +02:00
reformedot
9de712d702
feat: added browser settings to browser profile
2025-07-06 20:00:06 +02:00
Mert Unsal
1124e82cd3
Merge branch 'main' into mert/fix_encoding
2025-07-06 18:15:22 +02:00
mertunsall
3bec6fc9bf
add qwen2.5-vl-72b-instruct into evals
2025-07-06 18:10:13 +02:00
Magnus Müller
2b7367677f
eval-log-level
2025-07-06 17:53:49 +02:00
Magnus Müller
25392d9cde
add multiple last screenshots to llm input message
2025-07-06 13:50:31 +02:00
Magnus Müller
0612eb0aae
eval-repeat-judge
2025-07-06 12:25:13 +02:00
reformedot
8c5672b1ef
feat: added support for Browserbase and Hyperbrowser as available browsers in the eval
fix: revert example change
2025-07-05 16:20:24 +02:00
mertunsall
220f0bc994
update models to gpt-4.1
2025-07-05 10:32:49 +02:00
reformedot
8754e22ce3
feat: added browser arg to the eval script
2025-07-04 23:40:12 +02:00
Saurav Panda
8cf64699ad
refc: removed debug logs from eval.yaml
2025-07-04 03:21:16 -07:00
Saurav Panda
f3fa86ea21
feat: added 2fa token parsing logic
2025-07-04 02:58:04 -07:00
Saurav Panda
d87380b643
debugging: gmail 2fa json data
2025-07-04 01:29:53 -07:00
Saurav Panda
4ed5d96ef5
updated eval with toJson mapping
2025-07-04 01:17:43 -07:00
Saurav Panda
bbfbcebd6e
feat: added multi credential support
2025-07-04 00:15:38 -07:00
Saurav Panda
c7fedf5117
Merge branch 'main' into 2fa_gmail_integration
2025-07-03 22:03:35 -07:00
Magnus Müller
fc8d6b1c14
eval enable debug
2025-07-03 23:33:25 +02:00
Saurav Panda
2cd21e18e6
feat: removed debug from evals
2025-07-03 10:19:37 -07:00
Saurav Panda
20f66b9fc7
Update eval.yaml
2025-07-03 00:11:39 -07:00
Saurav Panda
673f342067
added some debug for the run
2025-07-02 23:46:01 -07:00
Saurav Panda
e7bfffc566
Merge remote-tracking branch 'upstream' into 2fa_gmail_integration
2025-07-02 18:09:48 -07:00
Saurav Panda
2944178691
feat: added 2fa token in eval
2025-07-02 16:41:36 -07:00
Magnus Müller
5b6a00032c
Enhance evaluation workflow with validation and error handling improvements
- Added validation for START_INDEX and TOTAL_TASKS to ensure they are numeric, with default values set to prevent errors.
- Improved logging for task range calculations and runner ID generation, including warnings for non-numeric inputs.
- Enhanced evaluation output handling with comprehensive error capture and logging, ensuring better debugging information is available.
- Implemented checks for the existence of evaluation logs and provided statistics for better visibility into evaluation outcomes.
2025-07-02 19:39:54 +02:00
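The validation described in commit 5b6a00032c (numeric checks for START_INDEX and TOTAL_TASKS, a default on bad input, and a warning for non-numeric values) could be sketched roughly as below. This is only an illustration, not the workflow's actual code: the helper name `numeric_env`, the logger name, and the default values used in the usage lines are all assumptions.

```python
import logging
import os
import re

logger = logging.getLogger("eval_workflow")  # hypothetical logger name


def numeric_env(name, default, env=None):
    """Return env[name] as a non-negative int, or `default` if unset or non-numeric.

    Hypothetical helper: the real workflow may validate differently.
    """
    env = os.environ if env is None else env
    raw = env.get(name, "").strip()
    if re.fullmatch(r"[0-9]+", raw):
        return int(raw)
    if raw:
        # Warn only when a value was supplied but is not a plain number.
        logger.warning("%s=%r is not numeric, using default %d", name, raw, default)
    return default


# Assumed defaults, chosen only for illustration.
start_index = numeric_env("START_INDEX", 0)
total_tasks = numeric_env("TOTAL_TASKS", 100)
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment, for example `numeric_env("START_INDEX", 0, {"START_INDEX": "abc"})` falls back to the default and logs a warning.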
Magnus Müller
0cbda40a2c
Enhance evaluation workflow with improved runner ID generation and progress tracking
- Added support for dynamic runner ID generation that aligns with GitHub Actions patterns, incorporating start index from environment variables.
- Updated the evaluation script to send detailed progress updates, including task range and total assigned tasks, to the tracking API.
- Improved error handling and logging for runner registration and completion updates to ensure reliability during evaluations.
2025-07-02 19:27:15 +02:00