Commit Graph

250 Commits

Author SHA1 Message Date
Alezander9
422f910207 rename types helper file to not conflict with python types import 2025-07-05 23:57:16 -07:00
Alezander9
3c525b6007 file name standardization 2025-07-05 23:50:48 -07:00
Alezander9
e0e0ae2c06 fix merge errors 2025-07-05 23:43:50 -07:00
Alezander9
e5a6994a1c fix merge errors 2025-07-05 23:43:19 -07:00
Alezander9
7c39427f1f refactor eval script into smaller files and use relative imports for clean code 2025-07-05 23:40:13 -07:00
Aitor
05d2ac8d2a Merge branch 'main' into fix/improve-browserbase-eval-params 2025-07-06 00:13:34 +02:00
reformedot
cfed7f6f86 fix: improve browserbase eval params 2025-07-05 18:22:22 +02:00
reformedot
c3949b8c92 chore: linting 2025-07-05 17:08:43 +02:00
reformedot
88c4f3ff12 fix: handle eval import errors 2025-07-05 17:07:42 +02:00
reformedot
fc34224692 fix: fix PR issues 2025-07-05 16:54:09 +02:00
reformedot
8c5672b1ef feat: added support for Browserbase and Hyperbrowser as available browsers in the eval
fix: revert example change
2025-07-05 16:20:24 +02:00
Mert Unsal
38dee5e627 Merge branch 'main' into mert/improve_gmail_actions 2025-07-05 10:40:17 +02:00
mertunsall
220f0bc994 update models to gpt-4.1 2025-07-05 10:32:49 +02:00
Alexander Yue
50355fb8d5 Merge branch 'main' into auth-distribution 2025-07-04 20:01:38 -07:00
Alezander9
1a3ac38197 Merge remote-tracking branch 'upstream/main' into auth-distribution 2025-07-04 15:20:54 -07:00
reformedot
3dbaea1729 fix: improved anchor browser session creation 2025-07-04 23:43:15 +02:00
reformedot
8754e22ce3 feat: added browser arg to the eval script 2025-07-04 23:40:12 +02:00
Magnus Müller
2c564009f7 Remove unused anchor navigation argument from eval service 2025-07-04 21:37:18 +02:00
Magnus Müller
18fe7620de Remove memory logging 2025-07-04 21:21:15 +02:00
Magnus Müller
74b4bfd363 Merge remote-tracking branch 'origin/main' into feat/evals-anchor-support 2025-07-04 21:13:09 +02:00
mertunsall
f80bf95260 add gmail connection only for tasks that have OTP 2025-07-04 17:21:18 +02:00
mertunsall
d8a08f088e fix error in not initializing controller correctly 2025-07-04 14:42:32 +02:00
Saurav Panda
fd5adf4080 Merge branch 'main' into 2fa_gmail_integration 2025-07-04 04:30:01 -07:00
Saurav Panda
abdb8efa9e refc: removed authenticate function from action and update ActionResult for gmail integrations 2025-07-04 04:27:18 -07:00
mertunsall
053d81a97e overwrite comprehensive eval too if necessary 2025-07-04 12:34:09 +02:00
Saurav Panda
f3fa86ea21 feat: added 2fa token parsing logic 2025-07-04 02:58:04 -07:00
mertunsall
bcb84ee6b0 Add judge evaluation to login tasks 2025-07-04 11:38:06 +02:00
Saurav Panda
bbfbcebd6e feat: added multi credential support 2025-07-04 00:15:38 -07:00
Saurav Panda
c7fedf5117 Merge branch 'main' into 2fa_gmail_integration 2025-07-03 22:03:35 -07:00
Saurav Panda
dfc5c916a0 lint issue fix 2025-07-03 21:54:57 -07:00
Saurav Panda
ea03f2dc4c feat: added login cookie tracker for all the steps 2025-07-03 16:42:39 -07:00
Saurav Panda
6cdfdfd69c lint fixes 2025-07-03 10:23:26 -07:00
mertunsall
9f41a166ac bugfix 2025-07-03 13:47:33 +02:00
mertunsall
0d6d759c5c hotfix 2025-07-03 13:46:00 +02:00
mertunsall
05ef50fdf1 add thinking budget to gemini and fix evals 2025-07-03 13:40:17 +02:00
Magnus Müller
b375f77d18 Update service.py 2025-07-03 10:34:05 +02:00
Magnus Müller
849ba31e2b eval-dont-go-to-the-website 2025-07-03 10:26:49 +02:00
Saurav Panda
673f342067 added some debug for the run 2025-07-02 23:46:01 -07:00
Saurav Panda
1bfef7ac91 lint fixes 2025-07-02 22:39:53 -07:00
Saurav Panda
055733e8e9 linting fixes 2025-07-02 22:28:33 -07:00
Saurav Panda
335ee6133a fix: updated the storage_state issue 2025-07-02 22:26:44 -07:00
Saurav Panda
e7bfffc566 Merge remote-tracking branch 'upstream' into 2fa_gmail_integration 2025-07-02 18:09:48 -07:00
Saurav Panda
1e13c0e03f linting fixes 2025-07-02 17:09:47 -07:00
Alezander9
8beabf6970 fix typing 2025-07-02 17:01:35 -07:00
Saurav Panda
2944178691 feat: added 2fa token in eval 2025-07-02 16:41:36 -07:00
Alezander9
e8db375401 make eval service fetch rotating auth info from server 2025-07-02 16:30:48 -07:00
Magnus Müller
70149369eb eval-change-timing 2025-07-03 00:19:51 +02:00
mertunsall
58ed5da177 migrate evals to ChatGroq 2025-07-02 23:19:10 +02:00
Magnus Müller
2f8da485b1 Merge branch 'main' into eval-runner-status-updates 2025-07-02 19:28:31 +02:00
Magnus Müller
0cbda40a2c Enhance evaluation workflow with improved runner ID generation and progress tracking
- Added support for dynamic runner ID generation that aligns with GitHub Actions patterns, incorporating start index from environment variables.
- Updated the evaluation script to send detailed progress updates, including task range and total assigned tasks, to the tracking API.
- Improved error handling and logging for runner registration and completion updates to ensure reliability during evaluations.
2025-07-02 19:27:15 +02:00