reformedot
c3949b8c92
chore: linting
2025-07-05 17:08:43 +02:00
reformedot
88c4f3ff12
fix: handle eval import errors
2025-07-05 17:07:42 +02:00
reformedot
fc34224692
fix: fix PR issues
2025-07-05 16:54:09 +02:00
reformedot
8c5672b1ef
feat: added support for Browserbase and Hyperbrowser as available browsers in the eval
fix: revert example change
2025-07-05 16:20:24 +02:00
Mert Unsal
38dee5e627
Merge branch 'main' into mert/improve_gmail_actions
2025-07-05 10:40:17 +02:00
mertunsall
220f0bc994
update models to gpt-4.1
2025-07-05 10:32:49 +02:00
Alexander Yue
50355fb8d5
Merge branch 'main' into auth-distribution
2025-07-04 20:01:38 -07:00
Alezander9
1a3ac38197
Merge remote-tracking branch 'upstream/main' into auth-distribution
2025-07-04 15:20:54 -07:00
reformedot
3dbaea1729
fix: improved anchor browser session creation
2025-07-04 23:43:15 +02:00
reformedot
8754e22ce3
feat: added browser arg to the eval script
2025-07-04 23:40:12 +02:00
Magnus Müller
2c564009f7
Remove unused anchor navigation argument from eval service
2025-07-04 21:37:18 +02:00
Magnus Müller
18fe7620de
Remove memory logging
2025-07-04 21:21:15 +02:00
Magnus Müller
74b4bfd363
Merge remote-tracking branch 'origin/main' into feat/evals-anchor-support
2025-07-04 21:13:09 +02:00
mertunsall
f80bf95260
add gmail connection only for tasks that have OTP
2025-07-04 17:21:18 +02:00
mertunsall
d8a08f088e
fix error in not initializing controller correctly
2025-07-04 14:42:32 +02:00
Saurav Panda
fd5adf4080
Merge branch 'main' into 2fa_gmail_integration
2025-07-04 04:30:01 -07:00
Saurav Panda
abdb8efa9e
refactor: removed authenticate function from action and updated ActionResult for gmail integrations
2025-07-04 04:27:18 -07:00
mertunsall
053d81a97e
overwrite comprehensive eval too if necessary
2025-07-04 12:34:09 +02:00
Saurav Panda
f3fa86ea21
feat: added 2fa token parsing logic
2025-07-04 02:58:04 -07:00
mertunsall
bcb84ee6b0
Add judge evaluation to login tasks
2025-07-04 11:38:06 +02:00
Saurav Panda
bbfbcebd6e
feat: added multi credential support
2025-07-04 00:15:38 -07:00
Saurav Panda
c7fedf5117
Merge branch 'main' into 2fa_gmail_integration
2025-07-03 22:03:35 -07:00
Saurav Panda
dfc5c916a0
lint issue fix
2025-07-03 21:54:57 -07:00
Saurav Panda
ea03f2dc4c
feat: added login cookie tracker for all the steps
2025-07-03 16:42:39 -07:00
Saurav Panda
6cdfdfd69c
lint fixes
2025-07-03 10:23:26 -07:00
mertunsall
9f41a166ac
bugfix
2025-07-03 13:47:33 +02:00
mertunsall
0d6d759c5c
hotfix
2025-07-03 13:46:00 +02:00
mertunsall
05ef50fdf1
add thinking budget to gemini and fix evals
2025-07-03 13:40:17 +02:00
Magnus Müller
b375f77d18
Update service.py
2025-07-03 10:34:05 +02:00
Magnus Müller
849ba31e2b
eval-dont-go-to-the-website
2025-07-03 10:26:49 +02:00
Saurav Panda
673f342067
added some debug for the run
2025-07-02 23:46:01 -07:00
Saurav Panda
1bfef7ac91
lint fixes
2025-07-02 22:39:53 -07:00
Saurav Panda
055733e8e9
linting fixes
2025-07-02 22:28:33 -07:00
Saurav Panda
335ee6133a
fix: updated the storage_state issue
2025-07-02 22:26:44 -07:00
Saurav Panda
e7bfffc566
Merge remote-tracking branch 'upstream' into 2fa_gmail_integration
2025-07-02 18:09:48 -07:00
Saurav Panda
1e13c0e03f
linting fixes
2025-07-02 17:09:47 -07:00
Alezander9
8beabf6970
fix typing
2025-07-02 17:01:35 -07:00
Saurav Panda
2944178691
feat: added 2fa token in eval
2025-07-02 16:41:36 -07:00
Alezander9
e8db375401
make eval service fetch rotating auth info from server
2025-07-02 16:30:48 -07:00
Magnus Müller
70149369eb
eval-change-timing
2025-07-03 00:19:51 +02:00
mertunsall
58ed5da177
migrate evals to ChatGroq
2025-07-02 23:19:10 +02:00
Magnus Müller
2f8da485b1
Merge branch 'main' into eval-runner-status-updates
2025-07-02 19:28:31 +02:00
Magnus Müller
0cbda40a2c
Enhance evaluation workflow with improved runner ID generation and progress tracking
- Added support for dynamic runner ID generation that aligns with GitHub Actions patterns, incorporating start index from environment variables.
- Updated the evaluation script to send detailed progress updates, including task range and total assigned tasks, to the tracking API.
- Improved error handling and logging for runner registration and completion updates to ensure reliability during evaluations.
2025-07-02 19:27:15 +02:00
Magnus Müller
fe3fe67d50
Refactor evaluation stages in service.py
- Moved the formatting and evaluation stages outside the browser session block to ensure they are executed regardless of session state.
- Updated error handling for evaluation and server save stages to maintain consistent logging and task result management.
- Ensured that server save attempts are always made, improving reliability in task completion.
2025-07-02 18:08:28 +02:00
Mert Unsal
f5a4c75579
add haiku to evals (#2257)
Added support for the claude-3.5-haiku-latest model in evals.
2025-07-02 17:41:47 +02:00
mertunsall
144870a024
add haiku to evals
2025-07-02 17:40:58 +02:00
Magnus Müller
44a180f716
eval-runner-status-updates
2025-07-02 17:38:58 +02:00
Aitor
8a5ca6ceec
Merge branch 'main' into feat/evals-anchor-support
2025-07-02 14:23:45 +02:00
Magnus Müller
239fd3f86b
eval include runner link
2025-07-02 14:10:22 +02:00
mertunsall
3a5f43bb3f
eval should run with temperature 1 for o3
2025-07-02 11:07:58 +02:00