Commit Graph

283 Commits

Author SHA1 Message Date
Gregor Žunič
1eac01dfab openrouter evals support 2025-07-10 12:51:52 +02:00
Gregor Žunič
b850a43a4f added grok-4 2025-07-10 11:00:17 +02:00
Alezander9
bb7db834d5 add 2FA usage into the auth prompt injection section 2025-07-09 18:24:27 -07:00
Alezander9
e035d490c3 always tell agent to not use credentials unless necessary when provided credentials in evals 2025-07-09 18:03:49 -07:00
Magnus Müller
04bbe1c10b Remove debug judge 2025-07-09 11:17:07 +02:00
Magnus Müller
8b4418cd39 eval fix cookie judge 2025-07-09 10:57:00 +02:00
Alezander9
88b8dd3a02 more error categories 2025-07-08 18:37:31 -07:00
Magnus Müller
9a83fedd53 Refactor Agent step handling and logging
- Moved step timing initialization and step counter increment to the start of the step method for better tracking.
- Updated logging messages to reflect the correct step number after the refactor.
- Increased timeout for step execution to 5 minutes and added graceful handling for timeout errors.
- Adjusted initial value of n_steps in AgentState to start from 0 instead of 1.
- Added observability decorators to resource monitoring functions for improved debugging.
2025-07-09 01:07:32 +02:00
Magnus Müller
5bde6dbe98 Enhance observability by adding debug observation points in browser profile and session classes
- Added `@observe_debug` decorators to methods in `BrowserProfile` and `BrowserSession` to track display configuration, connection methods, and setup processes.
- Updated `eval/service.py` to initialize Laminar with specific disabled instruments and corrected the usage of the `observe` decorator for the `run_agent_with_browser` function.
2025-07-07 12:55:22 +02:00
Magnus Müller
d03196c09c Merge remote-tracking branch 'origin/main' into include-lmnr 2025-07-06 23:42:14 +02:00
Magnus Müller
904a68d94f Merge branch 'main' into update-judge-summary 2025-07-06 23:04:38 +02:00
Magnus Müller
d13d01eeeb Update judge summary 2025-07-06 23:04:29 +02:00
Magnus Müller
d8c36921a0 system prompt for open ended and specific tasks 2025-07-06 21:36:04 +02:00
Magnus Müller
f43a8044aa eval-update-judge-file-system-error 2025-07-06 21:01:22 +02:00
reformedot
9de712d702 feat: added browser settings to browser profile 2025-07-06 20:00:06 +02:00
mertunsall
e38bf7b457 add model back 2025-07-06 18:16:12 +02:00
Magnus Müller
f3c4ef9927 Merge branch 'main' into add-multiple-last-screenshots-to-llm-input-message 2025-07-06 14:26:12 +02:00
Magnus Müller
25392d9cde add multiple last screenshots to llm input message 2025-07-06 13:50:31 +02:00
Magnus Müller
286bed3599 add max diff btw judges 2025-07-06 13:00:16 +02:00
Magnus Müller
e42450060c fix reasoning text display 2025-07-06 12:53:50 +02:00
Magnus Müller
98bc0d5057 parallel judges 2025-07-06 12:45:26 +02:00
Magnus Müller
3f0861b5f3 Fix enumerate bug 2025-07-06 12:28:31 +02:00
Magnus Müller
551deea62b Include score in reasoning 2025-07-06 12:27:18 +02:00
Magnus Müller
0612eb0aae eval-repeat-judge 2025-07-06 12:25:13 +02:00
Alezander9
14e0a510ac replace relative imports with package-qualified imports 2025-07-06 00:10:59 -07:00
Alezander9
492753a0eb address cubic comments 2025-07-06 00:03:02 -07:00
Alezander9
422f910207 rename types helper file to not conflict with python types import 2025-07-05 23:57:16 -07:00
Alezander9
3c525b6007 file name standardization 2025-07-05 23:50:48 -07:00
Alezander9
e0e0ae2c06 fix merge errors 2025-07-05 23:43:50 -07:00
Alezander9
e5a6994a1c fix merge errors 2025-07-05 23:43:19 -07:00
Alezander9
7c39427f1f refactor eval script into smaller files and use relative imports for clean code 2025-07-05 23:40:13 -07:00
Aitor
05d2ac8d2a Merge branch 'main' into fix/improve-browserbase-eval-params 2025-07-06 00:13:34 +02:00
Magnus Müller
af738d8da7 Add logging to judge 2025-07-05 23:31:47 +02:00
Magnus Müller
5e5dfd3eda Add span_type 2025-07-05 23:25:47 +02:00
Magnus Müller
a4c9b992ba Add observability module for browser-use with optional lmnr integration and debug mode support 2025-07-05 23:08:00 +02:00
reformedot
cfed7f6f86 fix: improve browserbase eval params 2025-07-05 18:22:22 +02:00
reformedot
c3949b8c92 chore: linting 2025-07-05 17:08:43 +02:00
reformedot
88c4f3ff12 fix: handle eval import errors 2025-07-05 17:07:42 +02:00
reformedot
fc34224692 fix: fix PR issues 2025-07-05 16:54:09 +02:00
reformedot
8c5672b1ef feat: added support for Browserbase and Hyperbrowser as available browsers in the eval
fix: revert example change
2025-07-05 16:20:24 +02:00
Mert Unsal
38dee5e627 Merge branch 'main' into mert/improve_gmail_actions 2025-07-05 10:40:17 +02:00
mertunsall
220f0bc994 update models to gpt-4.1 2025-07-05 10:32:49 +02:00
Magnus Müller
625aedce7e Merge branch 'main' into fix-browser-error-recovery 2025-07-05 10:26:21 +02:00
Alexander Yue
50355fb8d5 Merge branch 'main' into auth-distribution 2025-07-04 20:01:38 -07:00
Alezander9
1a3ac38197 Merge remote-tracking branch 'upstream/main' into auth-distribution 2025-07-04 15:20:54 -07:00
reformedot
3dbaea1729 fix: improved anchor browser session creation 2025-07-04 23:43:15 +02:00
reformedot
8754e22ce3 feat: added browser arg to the eval script 2025-07-04 23:40:12 +02:00
Magnus Müller
2c564009f7 Remove unused anchor navigation argument from eval service 2025-07-04 21:37:18 +02:00
Magnus Müller
18fe7620de Remove memory logging 2025-07-04 21:21:15 +02:00
Magnus Müller
74b4bfd363 Merge remote-tracking branch 'origin/main' into feat/evals-anchor-support 2025-07-04 21:13:09 +02:00