31 Commits

Author SHA1 Message Date
Alezander9
d9f1fb7bb4 track and report repo that code was run on in evals 2025-06-03 14:07:42 -07:00
Alezander9
61f9c31a3d feat: support changing eval task set 2025-06-03 10:27:34 -07:00
Alezander9
4c7e173b62 add logs to track semaphore management 2025-05-24 12:49:04 -07:00
Alezander9
25b9c44bc3 fix issue where out of place return statement could skip server upload function 2025-05-24 12:03:07 -07:00
Alezander9
38c7307169 add more logs to track down missing tasks in pipeline 2025-05-24 10:25:22 -07:00
Alezander9
a3dd8b004b update eval workflow with new arguments 2025-05-23 14:46:14 -07:00
Alezander9
3cfda361d1 move navigating to starting url code into the browser session setup 2025-05-23 11:44:04 -07:00
Alezander9
620fe6d254 add new layers of stage specific exception handlers 2025-05-23 11:35:12 -07:00
Alezander9
8233338d1f new fix for gpt-o4-mini 2025-05-22 16:46:26 -07:00
Alezander9
c2d1ec50dc refactor run task function and exception handling to be more elegant, and add fix for gpt-o4-mini 2025-05-22 16:42:28 -07:00
Alezander9
b8c453ee47 add gpt o4 mini 2025-05-22 16:03:57 -07:00
Alezander9
5f5a0931fe add claude 4 opus 2025-05-22 11:07:34 -07:00
Alezander9
4a7e9113ca add claude 4 support and cleanup eval script arguments 2025-05-22 10:54:00 -07:00
Alezander9
bd4091fd1a remove old debug messages 2025-05-21 16:01:22 -07:00
Alezander9
a68b371ab5 remove old code and add timeouts to agent runs to prevent possible github action infinite stall 2025-05-21 14:50:23 -07:00
Alezander9
a8d661b2d0 consolidated changes: adapt refactored eval service to work with new browser and on github actions 2025-05-21 14:36:57 -07:00
Nick Sweeting
3e66046046 linter 2025-05-13 17:18:50 -07:00
Shoya SHIRAKI
4f117a5956 fix(eval): update GOOGLE_API_KEY comment to GEMINI_API_KEY 2025-05-03 09:16:57 +09:00
Nick Sweeting
2be4ba4f70 more pyupgrade changes 2025-05-02 20:50:21 +08:00
Magnus Müller
db29a1c9d5 Track the source with a parameter 2025-05-01 20:04:31 +08:00
Christian Clauss
fb3282527d Detect blocking synchronous commands in asyncio code 2025-04-21 22:30:43 +02:00
Alezander9
aeaef4af57 feat: add gpt-4o-mini 2025-04-21 10:24:22 -07:00
Alezander9
8449b11ae0 refactored run task to more elegantly handle various failure modes and still send results to server 2025-04-19 15:28:05 -07:00
Alezander9
bc95b0e2fa feat: add user message as argument, pass total tasks into run info sent to server 2025-04-19 11:58:33 -07:00
Alezander9
0890da1eb3 feat: clear old files on new eval run unless fresh start is disabled 2025-04-19 11:15:20 -07:00
Alezander9
8cf206997c addressed feedback from mrge bot 2025-04-18 10:57:30 -07:00
Alezander9
4b0ebc3189 feat: update service to improve efficiency, fetch task via server and post results to server 2025-04-18 10:18:07 -07:00
Alezander9
8e07e31bb7 feat: add --no-vision as an argument to disable vision 2025-04-17 19:58:09 -07:00
Alezander9
057a69e298 feat: allow eval script to use any model for BU agent, add vision disabling for XAI models, test all models work in eval script 2025-04-17 19:51:50 -07:00
Alezander9
a386390652 address feedback from PR 2025-04-16 14:20:36 -07:00
Alezander9
a999325181 feat: add new evaluation script 2025-04-16 12:09:57 -07:00