Gregor Žunič
1eac01dfab
openrouter evals support
2025-07-10 12:51:52 +02:00
Gregor Žunič
b850a43a4f
added grok-4
2025-07-10 11:00:17 +02:00
Alezander9
bb7db834d5
add 2FA usage into the auth prompt injection section
2025-07-09 18:24:27 -07:00
Alezander9
e035d490c3
always tell agent to not use credentials unless necessary when provided credentials in evals
2025-07-09 18:03:49 -07:00
Magnus Müller
04bbe1c10b
Remove debug judge
2025-07-09 11:17:07 +02:00
Magnus Müller
8b4418cd39
eval fix cookie judge
2025-07-09 10:57:00 +02:00
Alezander9
88b8dd3a02
more error categories
2025-07-08 18:37:31 -07:00
Magnus Müller
9a83fedd53
Refactor Agent step handling and logging
- Moved step timing initialization and step counter increment to the start of the step method for better tracking.
- Updated logging messages to reflect the correct step number after the refactor.
- Increased timeout for step execution to 5 minutes and added graceful handling for timeout errors.
- Adjusted initial value of n_steps in AgentState to start from 0 instead of 1.
- Added observability decorators to resource monitoring functions for improved debugging.
2025-07-09 01:07:32 +02:00
Magnus Müller
5bde6dbe98
Enhance observability by adding debug observation points in browser profile and session classes
- Added `@observe_debug` decorators to methods in `BrowserProfile` and `BrowserSession` to track display configuration, connection methods, and setup processes.
- Updated `eval/service.py` to initialize Laminar with specific disabled instruments and corrected the usage of the `observe` decorator for the `run_agent_with_browser` function.
2025-07-07 12:55:22 +02:00
Magnus Müller
d03196c09c
Merge remote-tracking branch 'origin/main' into include-lmnr
2025-07-06 23:42:14 +02:00
Magnus Müller
904a68d94f
Merge branch 'main' into update-judge-summary
2025-07-06 23:04:38 +02:00
Magnus Müller
d13d01eeeb
Update judge summary
2025-07-06 23:04:29 +02:00
Magnus Müller
d8c36921a0
system prompt for open-ended and specific tasks
2025-07-06 21:36:04 +02:00
Magnus Müller
f43a8044aa
eval-update-judge-file-system-error
2025-07-06 21:01:22 +02:00
reformedot
9de712d702
feat: added browser settings to browser profile
2025-07-06 20:00:06 +02:00
mertunsall
e38bf7b457
add model back
2025-07-06 18:16:12 +02:00
Magnus Müller
f3c4ef9927
Merge branch 'main' into add-multiple-last-screenshots-to-llm-input-message
2025-07-06 14:26:12 +02:00
Magnus Müller
25392d9cde
add multiple last screenshots to llm input message
2025-07-06 13:50:31 +02:00
Magnus Müller
286bed3599
add max diff between judges
2025-07-06 13:00:16 +02:00
Magnus Müller
e42450060c
fix reasoning text display
2025-07-06 12:53:50 +02:00
Magnus Müller
98bc0d5057
parallel judges
2025-07-06 12:45:26 +02:00
Magnus Müller
3f0861b5f3
Fix enumerate bug
2025-07-06 12:28:31 +02:00
Magnus Müller
551deea62b
Include score in reasoning
2025-07-06 12:27:18 +02:00
Magnus Müller
0612eb0aae
eval-repeat-judge
2025-07-06 12:25:13 +02:00
Alezander9
14e0a510ac
replace relative imports with package-qualified imports
2025-07-06 00:10:59 -07:00
Alezander9
492753a0eb
address cubic comments
2025-07-06 00:03:02 -07:00
Alezander9
422f910207
rename types helper file to not conflict with python types import
2025-07-05 23:57:16 -07:00
Alezander9
3c525b6007
file name standardization
2025-07-05 23:50:48 -07:00
Alezander9
e0e0ae2c06
fix merge errors
2025-07-05 23:43:50 -07:00
Alezander9
e5a6994a1c
fix merge errors
2025-07-05 23:43:19 -07:00
Alezander9
7c39427f1f
refactor eval script into smaller files and use relative imports for clean code
2025-07-05 23:40:13 -07:00
Aitor
05d2ac8d2a
Merge branch 'main' into fix/improve-browserbase-eval-params
2025-07-06 00:13:34 +02:00
Magnus Müller
af738d8da7
Add logging to judge
2025-07-05 23:31:47 +02:00
Magnus Müller
5e5dfd3eda
Add span_type
2025-07-05 23:25:47 +02:00
Magnus Müller
a4c9b992ba
Add observability module for browser-use with optional lmnr integration and debug mode support
2025-07-05 23:08:00 +02:00
reformedot
cfed7f6f86
fix: improve browserbase eval params
2025-07-05 18:22:22 +02:00
reformedot
c3949b8c92
chore: linting
2025-07-05 17:08:43 +02:00
reformedot
88c4f3ff12
fix: handle eval import errors
2025-07-05 17:07:42 +02:00
reformedot
fc34224692
fix: fix PR issues
2025-07-05 16:54:09 +02:00
reformedot
8c5672b1ef
feat: added support for Browserbase and Hyperbrowser as available browsers in the eval
fix: revert example change
2025-07-05 16:20:24 +02:00
Mert Unsal
38dee5e627
Merge branch 'main' into mert/improve_gmail_actions
2025-07-05 10:40:17 +02:00
mertunsall
220f0bc994
update models to gpt-4.1
2025-07-05 10:32:49 +02:00
Magnus Müller
625aedce7e
Merge branch 'main' into fix-browser-error-recovery
2025-07-05 10:26:21 +02:00
Alexander Yue
50355fb8d5
Merge branch 'main' into auth-distribution
2025-07-04 20:01:38 -07:00
Alezander9
1a3ac38197
Merge remote-tracking branch 'upstream/main' into auth-distribution
2025-07-04 15:20:54 -07:00
reformedot
3dbaea1729
fix: improved anchor browser session creation
2025-07-04 23:43:15 +02:00
reformedot
8754e22ce3
feat: added browser arg to the eval script
2025-07-04 23:40:12 +02:00
Magnus Müller
2c564009f7
Remove unused anchor navigation argument from eval service
2025-07-04 21:37:18 +02:00
Magnus Müller
18fe7620de
Remove memory logging
2025-07-04 21:21:15 +02:00
Magnus Müller
74b4bfd363
Merge remote-tracking branch 'origin/main' into feat/evals-anchor-support
2025-07-04 21:13:09 +02:00