Magnus Müller
f1d5dc5a17
Pass laminar_eval_id from frontend
2025-06-21 09:31:14 +02:00
Magnus Müller
83d92513a4
Monitor eval cpu
2025-06-20 23:35:56 +02:00
mertunsall
76ef41da45
Merge branch 'main' into mert/new_everything
2025-06-19 14:11:56 +02:00
Magnus Müller
f89a97f256
Delete laminar files since they are integrated into the main eval service
2025-06-19 11:26:53 +02:00
Magnus Müller
0b6ebea431
Merge branch 'main' into mert/new_everything
2025-06-19 09:49:43 +02:00
Magnus Müller
f89a1aac84
Update evaluation workflow to use laminar_eval.py
- Changed the script executed in the evaluation workflow from `eval/service.py` to `eval/laminar_eval.py` for consistency with recent updates.
2025-06-18 16:43:06 +02:00
Magnus Müller
7f1d256964
Update dependencies in pyproject.toml and rename evaluation workflow
- Removed duplicate lmnr dependency from the `dependencies` section in `pyproject.toml`.
- Updated `lmnr` version to `0.6.11` in the `eval` extras group.
- Renamed the evaluation workflow from "Run Evaluation Script" to "Run Laminar Eval Script" for clarity.
- Adjusted the dependency installation command in the workflow to include the `--extra eval` flag.
2025-06-18 16:31:00 +02:00
Robert Kim
01a36a587e
gh action for laminar evals
2025-06-18 14:38:27 +01:00
mertunsall
1864e52635
Merge branch 'main' into mert/new_everything
2025-06-18 10:00:25 +02:00
Magnus Müller
c66880a8fa
Add lmnr package for tracing integration and update eval workflow
2025-06-17 18:58:56 +02:00
mertunsall
0c34c399db
Merge branch 'main' into mert/new_everything
2025-06-17 11:19:06 +02:00
Nick Sweeting
e3d21d33a1
fix evaluate_tasks.py errors in CI
2025-06-16 17:20:27 -07:00
Nick Sweeting
3972f335f9
allow claude github action to use more tools
2025-06-16 15:47:18 -07:00
Nick Sweeting
cdf0706f20
only create pre-release versions manually
2025-06-12 10:49:04 -07:00
Nick Sweeting
06488e11ba
fix clickaction error handling test
2025-06-11 00:05:58 -07:00
Nick Sweeting
1fd8e0ec92
try statuses-write option
2025-06-10 23:57:41 -07:00
Nick Sweeting
4d8f1342e8
try and revert evals change
2025-06-10 23:57:41 -07:00
Nick Sweeting
ffd36eb5da
tweak env vars for CI
2025-06-10 03:58:37 -07:00
Magnus Müller
061f2cf089
Increase timeout duration in evaluation workflow from 60 to 360 minutes to accommodate longer-running jobs.
2025-06-09 20:51:11 +02:00
Magnus Müller
fe64e35b3c
Add SERPER_API_KEY to evaluation workflow for enhanced API integration
2025-06-09 12:25:50 +02:00
Magnus Müller
b1af578bf3
Add SERPER_API_KEY to evaluation workflow for enhanced API integration
2025-06-09 11:58:34 +02:00
Alezander9
739a3b2c87
add flag to toggle whether webjudge gets to see final agent result
2025-06-08 17:49:05 -07:00
Alezander9
31dcce5390
decrease workflow timeout
2025-06-07 22:08:29 -07:00
Alezander9
c04710ef61
Merge remote-tracking branch 'upstream/main' into parallel-evals
2025-06-07 21:48:47 -07:00
Alezander9
bc56c73fe7
add ability to append results to existing run, so we can parallelize
2025-06-07 21:47:47 -07:00
Magnus Müller
eaab9f04d7
Enhance GitHub Actions workflow and evaluate_tasks.py to include detailed task evaluation results. The workflow now captures and displays detailed results in a structured format, while the Python script outputs detailed results as JSON for better integration with GitHub Actions. This improves visibility and understanding of task outcomes in the evaluation process.
2025-06-07 13:39:16 +02:00
Magnus Müller
3ecee462a2
Update GitHub Actions workflow permissions to allow writing comments on pull requests and issues, enhancing interaction capabilities for automated testing processes.
2025-06-07 13:22:43 +02:00
Magnus Müller
bdf29c34fb
Add PR comment functionality to GitHub Actions workflow for agent task evaluation results. This includes a summary of passed tasks, percentage score, and status emoji based on task outcomes, enhancing visibility of evaluation results directly in pull requests.
2025-06-07 13:16:30 +02:00
Magnus Müller
8d9b24b03a
Add summary output for agent tasks evaluation in CI workflow
2025-06-07 11:27:06 +02:00
Magnus Müller
52c60f8367
Remove dependency on tests job in evaluate-tasks step of CI workflow
2025-06-07 11:11:23 +02:00
Magnus Müller
576519ee40
Enhance CI workflow by adding agent tasks evaluation step and updating evaluate_tasks.py to output evaluation results
2025-06-07 10:59:24 +02:00
Magnus Müller
3666d2b077
Add agent tasks evaluation script and update CI workflow to include it
2025-06-07 10:49:03 +02:00
Magnus Müller
074169f80f
Merge branch 'main' into tests/eval
2025-06-07 09:15:41 +02:00
Nick Sweeting
4358924964
only install chromium in tests
2025-06-06 19:32:20 -07:00
Magnus Müller
42dac3dce1
Add API key environment variables to GitHub Actions workflow for enhanced test capabilities
2025-06-07 01:11:48 +02:00
Nick Sweeting
8504bc4c7b
parallelize playwright tests using loop scope=session and pytest-xdist
2025-06-06 02:03:25 -07:00
Alezander9
61f9c31a3d
feat: support changing eval task set
2025-06-03 10:27:34 -07:00
BroskyBrowser
15cb992618
fix: expose ANCHOR_BROWSER_API_KEY variable in the evals workflow .yaml
2025-06-01 12:58:14 +02:00
Nick Sweeting
bfb6b26274
Merge branch 'main' into new-eval
2025-05-25 18:58:08 -07:00
Nick Sweeting
1d2cb46d73
Update claude.yml
2025-05-25 06:07:37 -04:00
Nick Sweeting
6d0758764a
Add Claude PR Assistant workflow
2025-05-25 03:04:06 -07:00
Alezander9
ad71ba8d29
add branch name selection into workflow
2025-05-24 19:12:01 -07:00
Alezander9
1f113fa640
Merge remote-tracking branch 'upstream/main' into new-eval
2025-05-24 10:25:43 -07:00
Nick Sweeting
04a3c881df
add docker setting for ci
2025-05-23 22:22:10 -07:00
Nick Sweeting
cacc7c2020
allow running publish manually
2025-05-23 22:18:09 -07:00
Nick Sweeting
8e8f9a2381
allow running publish manually
2025-05-23 22:17:09 -07:00
Nick Sweeting
4196e79faa
fix publish action
2025-05-23 22:15:34 -07:00
Nick Sweeting
3d10260543
fix missing link between find_tests and test job in CI
2025-05-23 19:22:55 -07:00
Nick Sweeting
37a36dbd28
catch failure case up-front
2025-05-23 19:12:46 -07:00
Nick Sweeting
6a1ed628e3
properly split filenames out of ls results in test discovery
2025-05-23 19:10:25 -07:00