Commit Graph

32 Commits

Author SHA1 Message Date
Shoya SHIRAKI
4f117a5956 fix(eval): update GOOGLE_API_KEY comment to GEMINI_API_KEY 2025-05-03 09:16:57 +09:00
Nick Sweeting
2be4ba4f70 more pyupgrade changes 2025-05-02 20:50:21 +08:00
Magnus Müller
db29a1c9d5 Track the source with an parameter 2025-05-01 20:04:31 +08:00
Parth A. Patel
5250aefe58 fix: temperature can not be zero for reasoning models 2025-04-29 01:45:21 -07:00
Parth A. Patel
3860d526d4 eval: adding o4 mini support 2025-04-29 01:38:34 -07:00
Christian Clauss
fb3282527d Detect blocking synchronous commands in asyncio code 2025-04-21 22:30:43 +02:00
Nick Sweeting
25d067d67b Merge pull request #1421 from Alezander9/connect-eval-tool 2025-04-21 12:45:19 -07:00
Nick Sweeting
26d869c4f1 Merge branch 'main' into improve-eval-tool 2025-04-21 11:18:45 -07:00
Alezander9
aeaef4af57 feat: add gpt-4o-mini 2025-04-21 10:24:22 -07:00
Alezander9
8449b11ae0 refactored run task to more elegantly handle various failure modes and still send results to server 2025-04-19 15:28:05 -07:00
Alezander9
bc95b0e2fa feat: add user message as argument, pass total tasks into run info sent to server 2025-04-19 11:58:33 -07:00
Alezander9
0890da1eb3 feat: clear old files on new eval run unless fresh start is disabled 2025-04-19 11:15:20 -07:00
dha-aa
9b51ddd773 upgrade Grok model to for improved capabilities 2025-04-19 01:37:18 +00:00
Alezander9
8cf206997c addressed feedback from mrge bot 2025-04-18 10:57:30 -07:00
Alezander9
4b0ebc3189 feat: update service to improve efficiency, fetch task via server and post results to server 2025-04-18 10:18:07 -07:00
Alezander9
8e07e31bb7 feat: add --no-vision as an argument to disable vision 2025-04-17 19:58:09 -07:00
Alezander9
057a69e298 feat: allow eval script to use any model for BU agent, add vision disabling for XAI models, test all models work in eval script 2025-04-17 19:51:50 -07:00
Alezander9
a386390652 address feedback from PR 2025-04-16 14:20:36 -07:00
Alezander9
a999325181 feat: add new evaluation script 2025-04-16 12:09:57 -07:00
Parth A. Patel
0c4a1ee0f9 nit: rename eval file to correct model name 2025-04-14 15:50:12 -07:00
Parth A. Patel
e448513f9d evals: more models for evals 2025-04-14 15:48:54 -07:00
lorenss-m
2615e1286d custom browser addition 2025-04-04 18:42:49 -07:00
Nick Sweeting
fb6fa259a8 apply ruff safe fixes 2025-03-28 18:11:36 -07:00
Nick Sweeting
e85e8f468d add more pre-commit-hooks 2025-03-28 17:15:10 -07:00
Nick Sweeting
ec56bfe81b run ruff on the entire codebase 2025-03-28 01:22:48 -07:00
Siddhant Somani
1f9386d636 Add grok eval 2025-03-15 18:09:27 -07:00
Magnus Müller
f2f8cf850d Eval models 2025-02-23 09:15:28 -08:00
Magnus Müller
de57b4f55c More eval examples 2025-02-22 08:51:55 -08:00
Magnus Müller
688a465fc5 Include test for no bounding-boxes 2025-02-21 16:18:57 -08:00
Magnus Müller
01a01312d6 Eval for claude and no-vision 2025-02-20 19:40:50 -08:00
Magnus Müller
76e4c2630d Update gpt-4o.py 2025-02-16 02:25:04 +01:00
magmueller
7fb04ed9f1 Eval file 2025-02-10 16:20:14 -08:00