Shoya SHIRAKI
|
4f117a5956
|
fix(eval): update GOOGLE_API_KEY comment to GEMINI_API_KEY
|
2025-05-03 09:16:57 +09:00 |
|
Nick Sweeting
|
2be4ba4f70
|
more pyupgrade changes
|
2025-05-02 20:50:21 +08:00 |
|
Magnus Müller
|
db29a1c9d5
|
Track the source with a parameter
|
2025-05-01 20:04:31 +08:00 |
|
Parth A. Patel
|
5250aefe58
|
fix: temperature can not be zero for reasoning models
|
2025-04-29 01:45:21 -07:00 |
|
Parth A. Patel
|
3860d526d4
|
eval: adding o4 mini support
|
2025-04-29 01:38:34 -07:00 |
|
Christian Clauss
|
fb3282527d
|
Detect blocking synchronous commands in asyncio code
|
2025-04-21 22:30:43 +02:00 |
|
Nick Sweeting
|
25d067d67b
|
Merge pull request #1421 from Alezander9/connect-eval-tool
|
2025-04-21 12:45:19 -07:00 |
|
Nick Sweeting
|
26d869c4f1
|
Merge branch 'main' into improve-eval-tool
|
2025-04-21 11:18:45 -07:00 |
|
Alezander9
|
aeaef4af57
|
feat: add gpt-4o-mini
|
2025-04-21 10:24:22 -07:00 |
|
Alezander9
|
8449b11ae0
|
refactored run task to more elegantly handle various failure modes and still send results to server
|
2025-04-19 15:28:05 -07:00 |
|
Alezander9
|
bc95b0e2fa
|
feat: add user message as argument, pass total tasks into run info sent to server
|
2025-04-19 11:58:33 -07:00 |
|
Alezander9
|
0890da1eb3
|
feat: clear old files on new eval run unless fresh start is disabled
|
2025-04-19 11:15:20 -07:00 |
|
dha-aa
|
9b51ddd773
|
upgrade Grok model to for improved capabilities
|
2025-04-19 01:37:18 +00:00 |
|
Alezander9
|
8cf206997c
|
addressed feedback from mrge bot
|
2025-04-18 10:57:30 -07:00 |
|
Alezander9
|
4b0ebc3189
|
feat: update service to improve efficiency, fetch task via server and post results to server
|
2025-04-18 10:18:07 -07:00 |
|
Alezander9
|
8e07e31bb7
|
feat: add --no-vision as an argument to disable vision
|
2025-04-17 19:58:09 -07:00 |
|
Alezander9
|
057a69e298
|
feat: allow eval script to use any model for BU agent, add vision disabling for XAI models, test all models work in eval script
|
2025-04-17 19:51:50 -07:00 |
|
Alezander9
|
a386390652
|
address feedback from PR
|
2025-04-16 14:20:36 -07:00 |
|
Alezander9
|
a999325181
|
feat: add new evaluation script
|
2025-04-16 12:09:57 -07:00 |
|
Parth A. Patel
|
0c4a1ee0f9
|
nit: rename eval file to correct model name
|
2025-04-14 15:50:12 -07:00 |
|
Parth A. Patel
|
e448513f9d
|
evals: more models for evals
|
2025-04-14 15:48:54 -07:00 |
|
lorenss-m
|
2615e1286d
|
custom browser addition
|
2025-04-04 18:42:49 -07:00 |
|
Nick Sweeting
|
fb6fa259a8
|
apply ruff safe fixes
|
2025-03-28 18:11:36 -07:00 |
|
Nick Sweeting
|
e85e8f468d
|
add more pre-commit-hooks
|
2025-03-28 17:15:10 -07:00 |
|
Nick Sweeting
|
ec56bfe81b
|
run ruff on the entire codebase
|
2025-03-28 01:22:48 -07:00 |
|
Siddhant Somani
|
1f9386d636
|
Add grok eval
|
2025-03-15 18:09:27 -07:00 |
|
Magnus Müller
|
f2f8cf850d
|
Eval models
|
2025-02-23 09:15:28 -08:00 |
|
Magnus Müller
|
de57b4f55c
|
More eval examples
|
2025-02-22 08:51:55 -08:00 |
|
Magnus Müller
|
688a465fc5
|
Include test for no bounding-boxes
|
2025-02-21 16:18:57 -08:00 |
|
Magnus Müller
|
01a01312d6
|
Eval for claude and no-vision
|
2025-02-20 19:40:50 -08:00 |
|
Magnus Müller
|
76e4c2630d
|
Update gpt-4o.py
|
2025-02-16 02:25:04 +01:00 |
|
magmueller
|
7fb04ed9f1
|
Eval file
|
2025-02-10 16:20:14 -08:00 |
|