Alezander9
422f910207
rename types helper file to not conflict with python types import
2025-07-05 23:57:16 -07:00
Alezander9
3c525b6007
file name standardization
2025-07-05 23:50:48 -07:00
Alezander9
e0e0ae2c06
fix merge errors
2025-07-05 23:43:50 -07:00
Alezander9
e5a6994a1c
fix merge errors
2025-07-05 23:43:19 -07:00
Alezander9
7c39427f1f
refactor eval script into smaller files and use relative imports for clean code
2025-07-05 23:40:13 -07:00
Aitor
05d2ac8d2a
Merge branch 'main' into fix/improve-browserbase-eval-params
2025-07-06 00:13:34 +02:00
reformedot
cfed7f6f86
fix: improve browserbase eval params
2025-07-05 18:22:22 +02:00
reformedot
c3949b8c92
chore: linting
2025-07-05 17:08:43 +02:00
reformedot
88c4f3ff12
fix: handle eval import errors
2025-07-05 17:07:42 +02:00
reformedot
fc34224692
fix: fix PR issues
2025-07-05 16:54:09 +02:00
reformedot
8c5672b1ef
feat: added support for Browserbase and Hyperbrowser as available browsers in the eval
fix: revert example change
2025-07-05 16:20:24 +02:00
Mert Unsal
38dee5e627
Merge branch 'main' into mert/improve_gmail_actions
2025-07-05 10:40:17 +02:00
mertunsall
220f0bc994
update models to gpt-4.1
2025-07-05 10:32:49 +02:00
Alexander Yue
50355fb8d5
Merge branch 'main' into auth-distribution
2025-07-04 20:01:38 -07:00
Alezander9
1a3ac38197
Merge remote-tracking branch 'upstream/main' into auth-distribution
2025-07-04 15:20:54 -07:00
reformedot
3dbaea1729
fix: improved anchor browser session creation
2025-07-04 23:43:15 +02:00
reformedot
8754e22ce3
feat: added browser arg to the eval script
2025-07-04 23:40:12 +02:00
Magnus Müller
2c564009f7
Remove unused anchor navigation argument from eval service
2025-07-04 21:37:18 +02:00
Magnus Müller
18fe7620de
Remove memory logging
2025-07-04 21:21:15 +02:00
Magnus Müller
74b4bfd363
Merge remote-tracking branch 'origin/main' into feat/evals-anchor-support
2025-07-04 21:13:09 +02:00
mertunsall
f80bf95260
add gmail connection only for tasks that have OTP
2025-07-04 17:21:18 +02:00
mertunsall
d8a08f088e
fix error in not initializing controller correctly
2025-07-04 14:42:32 +02:00
Saurav Panda
fd5adf4080
Merge branch 'main' into 2fa_gmail_integration
2025-07-04 04:30:01 -07:00
Saurav Panda
abdb8efa9e
refactor: removed authenticate function from action and updated ActionResult for gmail integrations
2025-07-04 04:27:18 -07:00
mertunsall
053d81a97e
overwrite comprehensive eval too if necessary
2025-07-04 12:34:09 +02:00
Saurav Panda
f3fa86ea21
feat: added 2fa token parsing logic
2025-07-04 02:58:04 -07:00
mertunsall
bcb84ee6b0
Add judge evaluation to login tasks
2025-07-04 11:38:06 +02:00
Saurav Panda
bbfbcebd6e
feat: added multi credential support
2025-07-04 00:15:38 -07:00
Saurav Panda
c7fedf5117
Merge branch 'main' into 2fa_gmail_integration
2025-07-03 22:03:35 -07:00
Saurav Panda
dfc5c916a0
lint issue fix
2025-07-03 21:54:57 -07:00
Saurav Panda
ea03f2dc4c
feat: added login cookie tracker for all the steps
2025-07-03 16:42:39 -07:00
Saurav Panda
6cdfdfd69c
lint fixes
2025-07-03 10:23:26 -07:00
mertunsall
9f41a166ac
bugfix
2025-07-03 13:47:33 +02:00
mertunsall
0d6d759c5c
hotfix
2025-07-03 13:46:00 +02:00
mertunsall
05ef50fdf1
add thinking budget to gemini and fix evals
2025-07-03 13:40:17 +02:00
Magnus Müller
b375f77d18
Update service.py
2025-07-03 10:34:05 +02:00
Magnus Müller
849ba31e2b
eval-dont-go-to-the-website
2025-07-03 10:26:49 +02:00
Saurav Panda
673f342067
added some debug for the run
2025-07-02 23:46:01 -07:00
Saurav Panda
1bfef7ac91
lint fixes
2025-07-02 22:39:53 -07:00
Saurav Panda
055733e8e9
linting fixes
2025-07-02 22:28:33 -07:00
Saurav Panda
335ee6133a
fix: updated the storage_state issue
2025-07-02 22:26:44 -07:00
Saurav Panda
e7bfffc566
Merge remote-tracking branch 'upstream' into 2fa_gmail_integration
2025-07-02 18:09:48 -07:00
Saurav Panda
1e13c0e03f
linting fixes
2025-07-02 17:09:47 -07:00
Alezander9
8beabf6970
fix typing
2025-07-02 17:01:35 -07:00
Saurav Panda
2944178691
feat: added 2fa token in eval
2025-07-02 16:41:36 -07:00
Alezander9
e8db375401
make eval service fetch rotating auth info from server
2025-07-02 16:30:48 -07:00
Magnus Müller
70149369eb
eval-change-timing
2025-07-03 00:19:51 +02:00
mertunsall
58ed5da177
migrate evals to ChatGroq
2025-07-02 23:19:10 +02:00
Magnus Müller
2f8da485b1
Merge branch 'main' into eval-runner-status-updates
2025-07-02 19:28:31 +02:00
Magnus Müller
0cbda40a2c
Enhance evaluation workflow with improved runner ID generation and progress tracking
- Added support for dynamic runner ID generation that aligns with GitHub Actions patterns, incorporating start index from environment variables.
- Updated the evaluation script to send detailed progress updates, including task range and total assigned tasks, to the tracking API.
- Improved error handling and logging for runner registration and completion updates to ensure reliability during evaluations.
2025-07-02 19:27:15 +02:00