Magnus Müller
f1d5dc5a17
Pass laminar_eval_id from frontend
2025-06-21 09:31:14 +02:00
Magnus Müller
83d92513a4
Monitor eval cpu
2025-06-20 23:35:56 +02:00
Magnus Müller
4a8cf30dac
Merge branch 'main' into mert/new_everything
2025-06-20 12:27:19 +02:00
Magnus Müller
0e5a8942f3
Add gemini-2.5-flash
2025-06-20 12:19:47 +02:00
Magnus Müller
4c2952d640
Squashed commit of the following:
commit a9cf53a1b1
Merge: 5aa62c11 0f9ffa10
Author: Magnus Müller <67061560+MagMueller@users.noreply.github.com>
Date: Fri Jun 20 10:41:19 2025 +0200
Set user_data_dir to None (#2015)
<!-- This is an auto-generated description by cubic. -->
Changed browser session setup to use incognito mode by setting
user_data_dir to None, preventing persistent state between evaluation
runs.
<!-- End of auto-generated description by cubic. -->
commit 0f9ffa1072
Author: Magnus Müller <67061560+MagMueller@users.noreply.github.com>
Date: Fri Jun 20 10:38:01 2025 +0200
Set user_data_dir to None
commit 5aa62c1113
Merge: d8a9d21b e559ff5e
Author: Nick Sweeting <git@sweeting.me>
Date: Thu Jun 19 23:01:49 2025 -0700
Fix cross-origin iframe DOM retrieval (#1965)
commit d8a9d21b00
Merge: 3e5f3049 b6be1583
Author: Nick Sweeting <git@sweeting.me>
Date: Thu Jun 19 23:01:21 2025 -0700
Fix critical domain restriction bypass vulnerability (#2006)
commit b6be158319
Author: Sahar <saharhashai@gmail.com>
Date: Thu Jun 19 02:28:34 2025 -0700
Delete tests/ci/test_security_url_validation.py
commit aca4b57329
Author: Sahar <saharhashai@gmail.com>
Date: Thu Jun 19 02:27:57 2025 -0700
Delete SECURITY_FIX_REPORT.md
commit 45872c1e45
Author: Your Name <your.email@example.com>
Date: Thu Jun 19 11:24:50 2025 +0200
fix(security): prevent domain restriction bypass in controller actions
- Add domain validation to controller.click() and controller.type() methods
- Implement comprehensive security checks before executing actions
- Prevent potential prompt injection and unauthorized data access
- Add extensive test coverage for domain validation scenarios
- Update documentation with security considerations
This critical fix prevents complete bypass of domain restrictions that
could enable attackers to perform unauthorized actions on any domain.
commit e559ff5eaa
Merge: 19ae8a11 f348e0c5
Author: Nick Sweeting <git@sweeting.me>
Date: Sat Jun 14 01:56:09 2025 -0700
Merge branch 'main' into main
commit 19ae8a1146
Merge: e1b3ff9e 08ed0be3
Author: Nick Sweeting <git@sweeting.me>
Date: Sat Jun 14 00:31:30 2025 -0700
Merge branch 'main' into main
commit e1b3ff9e9d
Author: Ilya Biryukov <ilbiryuk@microsoft.com>
Date: Thu Jun 12 17:40:40 2025 -0700
Revert changes to examples/features/multiple_agents_same_browser.py
commit d20a3b55d6
Author: Ilya Biryukov <ilbiryuk@microsoft.com>
Date: Thu Jun 12 17:30:59 2025 -0700
Fix pre-commit lint issues and compile error in multiple_agents_same_browser
commit 13d5468aa2
Author: Ilya Biryukov <ilbiryuk@microsoft.com>
Date: Thu Jun 12 14:07:21 2025 -0700
Fix cross-origin iframe DOM retrieval
2025-06-20 10:51:06 +02:00
Magnus Müller
0f9ffa1072
Set user_data_dir to None
2025-06-20 10:38:01 +02:00
Magnus Müller
90ae26316e
Refactor ActionResult to standardize the inclusion of extracted content, replacing update_only_read_state with include_extracted_content_only_once across multiple services. This change enhances clarity in memory management and ensures consistent handling of extracted content.
2025-06-19 23:18:30 +02:00
Magnus Müller
ce880e5e35
Refactor ActionResult handling across multiple services to standardize the use of long_term_memory, replacing memory references. Update related logic to ensure extracted content is consistently managed for improved clarity and error handling.
2025-06-19 23:11:55 +02:00
Magnus Müller
c62d14d9ed
Update action result handling in service.py to include extracted content in memory for paused agent states and SERPER API key checks, enhancing error reporting consistency.
2025-06-19 22:52:53 +02:00
mertunsall
86083b09c9
reset eval to main to be comparable
2025-06-19 17:02:58 +02:00
mertunsall
76ef41da45
Merge branch 'main' into mert/new_everything
2025-06-19 14:11:56 +02:00
Magnus Müller
f89a97f256
Delete laminar files since it is integrated to the main eval service
2025-06-19 11:26:53 +02:00
Magnus Müller
0b6ebea431
Merge branch 'main' into mert/new_everything
2025-06-19 09:49:43 +02:00
Magnus Müller
2c627a93fc
Fix event loop error
2025-06-18 23:55:27 +02:00
Magnus Müller
5b4b3dd5aa
Integrate Laminar tracing and enhance evaluation workflow
- Added integration for Laminar tracing by initializing `AsyncLaminarClient` and creating evaluation links during task execution.
- Updated `TaskResult` class to include an optional `laminar_task_link` for task-specific links.
- Enhanced logging for task results and added error handling for Laminar datapoint creation.
- Improved the `run_task_with_semaphore` function to manage Laminar evaluation links and update datapoints with evaluation scores.
These changes aim to streamline the evaluation process and improve tracking of task performance.
2025-06-18 23:27:30 +02:00
Magnus Müller
081f5747c6
Refactor run_task_with_semaphore to use laminar_task_link
- Renamed variable `link` to `laminar_task_link` for clarity in the `run_task_with_semaphore` function.
- Updated the creation of `TaskResult` to utilize `laminar_task_link` instead of the previous `link` variable.
- Improved logging to reflect the new variable name, enhancing readability and maintainability.
These changes aim to improve code clarity and maintain consistency in the evaluation workflow.
2025-06-18 22:41:56 +02:00
Magnus Müller
158aa5b719
Enhance evaluation workflow in laminar_eval.py and update dependencies
- Removed the `lmnr[all]` dependency from `pyproject.toml`.
- Added `browser-use[dev,eval]` to `dev-dependencies` for improved development support.
- Updated `TaskResult` class to include an optional `laminar_task_link` for task-specific links.
- Modified `run_task_with_semaphore` to handle Laminar evaluation links and improved logging for task results.
- Added logic to create a Laminar evaluation link during task execution.
These changes aim to streamline the evaluation process and enhance the overall functionality of the evaluation workflow.
2025-06-18 22:41:05 +02:00
Magnus Müller
9547ebb3bb
Refactor laminar_eval.py to enhance task evaluation workflow
- Updated `run_task_with_semaphore` to use `lmnr_run_id` for evaluation ID instead of `run_id`.
- Added a new helper function `start_new_run` to initiate evaluation runs on the server.
- Improved logging for task results and server interactions.
- Ensured proper handling of environment variables for server configuration.
This refactor aims to streamline the evaluation process and improve error handling.
2025-06-18 21:04:47 +02:00
Robert Kim
3548d0fe35
lint
2025-06-18 14:55:00 +01:00
Magnus Müller
3e796a5773
Merge branch 'main' into laminar_evals
2025-06-18 15:34:26 +02:00
Robert Kim
a0d6a08119
v0
2025-06-18 14:21:48 +01:00
mertunsall
49bdba1578
Add evaluation criterion for handling vague user tasks in service.py
2025-06-18 13:18:40 +02:00
mertunsall
3c5c3b69cb
Refine evaluation criteria and system message in service.py
2025-06-18 11:53:34 +02:00
mertunsall
1864e52635
Merge branch 'main' into mert/new_everything
2025-06-18 10:00:25 +02:00
Magnus Müller
eb2aabb7f8
Remove duplicate
2025-06-17 19:21:29 +02:00
Magnus Müller
c66880a8fa
Add lmnr package for tracing integration and update eval workflow
2025-06-17 18:58:56 +02:00
mertunsall
0c34c399db
Merge branch 'main' into mert/new_everything
2025-06-17 11:19:06 +02:00
Nick Sweeting
86abd92b79
reset timeouts in evals back to faster defaults
2025-06-11 01:46:41 -07:00
Nick Sweeting
520bbbe6c0
Update eval/service.py
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-06-11 02:28:13 -04:00
Nick Sweeting
1c64c483d9
allow https errors on evals because they include http urls for some reason
2025-06-10 06:27:35 -07:00
Nick Sweeting
ed6db5802b
fix timeouts order of magnitude
2025-06-10 05:42:23 -07:00
Nick Sweeting
a0ee5de2ad
tweak browsersession timeouts
2025-06-10 05:18:56 -07:00
Alezander9
8e7663758d
modify service to accept tasks with login cookies
2025-06-09 22:09:48 -07:00
Magnus Müller
c48d83f54d
Refactor web search functionality to improve SERP API integration. Updated action description for clarity, streamlined data handling by excluding unnecessary fields, and enhanced logging for debugging. Returns only relevant organic search results.
2025-06-09 15:38:19 +02:00
Magnus Müller
75413c31c7
Merge branch 'main' into mert/new_everything
2025-06-09 11:48:57 +02:00
Alezander9
739a3b2c87
add flag to toggle whether webjudge gets to see final agent result
2025-06-08 17:49:05 -07:00
Alezander9
bc56c73fe7
add ability to append results to existing run, so we can parallelize
2025-06-07 21:47:47 -07:00
Magnus Müller
52d4191ba9
Merge branch 'main' into mert/new_everything
2025-06-07 17:58:09 +02:00
Alezander9
b1cc677b01
add samba nova models
2025-06-06 17:38:36 -07:00
Alezander9
d9f1fb7bb4
track and report repo that code was run on in evals
2025-06-03 14:07:42 -07:00
Magnus Müller
aec9fd29a3
Merge branch 'main' into mert/new_everything
2025-06-03 19:59:17 +02:00
Alezander9
61f9c31a3d
feat: support changing eval task set
2025-06-03 10:27:34 -07:00
mertunsall
9025e1ccba
clean a little
2025-06-03 17:37:24 +02:00
mertunsall
286d3dec60
introduce max tokens
2025-05-31 18:22:34 +02:00
Alezander9
4c7e173b62
add logs to track semaphore management
2025-05-24 12:49:04 -07:00
Alezander9
25b9c44bc3
fix issue where out of place return statement could skip server upload function
2025-05-24 12:03:07 -07:00
Alezander9
38c7307169
add more logs to track down missing tasks in pipeline
2025-05-24 10:25:22 -07:00
Alezander9
a3dd8b004b
update eval workflow with new arguments
2025-05-23 14:46:14 -07:00
Alezander9
3cfda361d1
move navigating to starting url code into the browser session setup
2025-05-23 11:44:04 -07:00
Alezander9
620fe6d254
add new layers of stage specific exception handlers
2025-05-23 11:35:12 -07:00