Commit Graph

92 Commits

Author SHA1 Message Date
Magnus Müller
f1d5dc5a17 Pass laminar_eval_id from frontend 2025-06-21 09:31:14 +02:00
Magnus Müller
83d92513a4 Monitor eval cpu 2025-06-20 23:35:56 +02:00
Magnus Müller
4a8cf30dac Merge branch 'main' into mert/new_everything 2025-06-20 12:27:19 +02:00
Magnus Müller
0e5a8942f3 Add gemini-2.5-flash 2025-06-20 12:19:47 +02:00
Magnus Müller
4c2952d640 Squashed commit of the following:
commit a9cf53a1b1
Merge: 5aa62c11 0f9ffa10
Author: Magnus Müller <67061560+MagMueller@users.noreply.github.com>
Date:   Fri Jun 20 10:41:19 2025 +0200

    Set user_data_dir to None (#2015)

    Changed browser session setup to use incognito mode by setting
    user_data_dir to None, preventing persistent state between evaluation
    runs.


commit 0f9ffa1072
Author: Magnus Müller <67061560+MagMueller@users.noreply.github.com>
Date:   Fri Jun 20 10:38:01 2025 +0200

    Set user_data_dir to None

commit 5aa62c1113
Merge: d8a9d21b e559ff5e
Author: Nick Sweeting <git@sweeting.me>
Date:   Thu Jun 19 23:01:49 2025 -0700

    Fix cross-origin iframe DOM retrieval (#1965)

commit d8a9d21b00
Merge: 3e5f3049 b6be1583
Author: Nick Sweeting <git@sweeting.me>
Date:   Thu Jun 19 23:01:21 2025 -0700

    Fix critical domain restriction bypass vulnerability (#2006)

commit b6be158319
Author: Sahar <saharhashai@gmail.com>
Date:   Thu Jun 19 02:28:34 2025 -0700

    Delete tests/ci/test_security_url_validation.py

commit aca4b57329
Author: Sahar <saharhashai@gmail.com>
Date:   Thu Jun 19 02:27:57 2025 -0700

    Delete SECURITY_FIX_REPORT.md

commit 45872c1e45
Author: Your Name <your.email@example.com>
Date:   Thu Jun 19 11:24:50 2025 +0200

    fix(security): prevent domain restriction bypass in controller actions

    - Add domain validation to controller.click() and controller.type() methods
    - Implement comprehensive security checks before executing actions
    - Prevent potential prompt injection and unauthorized data access
    - Add extensive test coverage for domain validation scenarios
    - Update documentation with security considerations

    This critical fix prevents complete bypass of domain restrictions that
    could enable attackers to perform unauthorized actions on any domain.

commit e559ff5eaa
Merge: 19ae8a11 f348e0c5
Author: Nick Sweeting <git@sweeting.me>
Date:   Sat Jun 14 01:56:09 2025 -0700

    Merge branch 'main' into main

commit 19ae8a1146
Merge: e1b3ff9e 08ed0be3
Author: Nick Sweeting <git@sweeting.me>
Date:   Sat Jun 14 00:31:30 2025 -0700

    Merge branch 'main' into main

commit e1b3ff9e9d
Author: Ilya Biryukov <ilbiryuk@microsoft.com>
Date:   Thu Jun 12 17:40:40 2025 -0700

    Revert changes to examples/features/multiple_agents_same_browser.py

commit d20a3b55d6
Author: Ilya Biryukov <ilbiryuk@microsoft.com>
Date:   Thu Jun 12 17:30:59 2025 -0700

    Fix pre-commit lint issues and compile error in multiple_agents_same_browser

commit 13d5468aa2
Author: Ilya Biryukov <ilbiryuk@microsoft.com>
Date:   Thu Jun 12 14:07:21 2025 -0700

    Fix cross-origin iframe DOM retrieval
2025-06-20 10:51:06 +02:00
Magnus Müller
0f9ffa1072 Set user_data_dir to None 2025-06-20 10:38:01 +02:00
Magnus Müller
90ae26316e Refactor ActionResult to standardize the inclusion of extracted content, replacing update_only_read_state with include_extracted_content_only_once across multiple services. This change enhances clarity in memory management and ensures consistent handling of extracted content. 2025-06-19 23:18:30 +02:00
Magnus Müller
ce880e5e35 Refactor ActionResult handling across multiple services to standardize the use of long_term_memory, replacing memory references. Update related logic to ensure extracted content is consistently managed for improved clarity and error handling. 2025-06-19 23:11:55 +02:00
Magnus Müller
c62d14d9ed Update action result handling in service.py to include extracted content in memory for paused agent states and SERPER API key checks, enhancing error reporting consistency. 2025-06-19 22:52:53 +02:00
mertunsall
86083b09c9 reset eval to main to be comparable 2025-06-19 17:02:58 +02:00
mertunsall
76ef41da45 Merge branch 'main' into mert/new_everything 2025-06-19 14:11:56 +02:00
Magnus Müller
f89a97f256 Delete laminar files since it is integrated to the main eval service 2025-06-19 11:26:53 +02:00
Magnus Müller
0b6ebea431 Merge branch 'main' into mert/new_everything 2025-06-19 09:49:43 +02:00
Magnus Müller
2c627a93fc Fix event loop error 2025-06-18 23:55:27 +02:00
Magnus Müller
5b4b3dd5aa Integrate Laminar tracing and enhance evaluation workflow
- Added integration for Laminar tracing by initializing `AsyncLaminarClient` and creating evaluation links during task execution.
- Updated `TaskResult` class to include an optional `laminar_task_link` for task-specific links.
- Enhanced logging for task results and added error handling for Laminar datapoint creation.
- Improved the `run_task_with_semaphore` function to manage Laminar evaluation links and update datapoints with evaluation scores.

These changes aim to streamline the evaluation process and improve tracking of task performance.
2025-06-18 23:27:30 +02:00
Magnus Müller
081f5747c6 Refactor run_task_with_semaphore to use laminar_task_link
- Renamed variable `link` to `laminar_task_link` for clarity in the `run_task_with_semaphore` function.
- Updated the creation of `TaskResult` to utilize `laminar_task_link` instead of the previous `link` variable.
- Improved logging to reflect the new variable name, enhancing readability and maintainability.

These changes aim to improve code clarity and maintain consistency in the evaluation workflow.
2025-06-18 22:41:56 +02:00
Magnus Müller
158aa5b719 Enhance evaluation workflow in laminar_eval.py and update dependencies
- Removed the `lmnr[all]` dependency from `pyproject.toml`.
- Added `browser-use[dev,eval]` to `dev-dependencies` for improved development support.
- Updated `TaskResult` class to include an optional `laminar_task_link` for task-specific links.
- Modified `run_task_with_semaphore` to handle Laminar evaluation links and improved logging for task results.
- Added logic to create a Laminar evaluation link during task execution.

These changes aim to streamline the evaluation process and enhance the overall functionality of the evaluation workflow.
2025-06-18 22:41:05 +02:00
Magnus Müller
9547ebb3bb Refactor laminar_eval.py to enhance task evaluation workflow
- Updated `run_task_with_semaphore` to use `lmnr_run_id` for evaluation ID instead of `run_id`.
- Added a new helper function `start_new_run` to initiate evaluation runs on the server.
- Improved logging for task results and server interactions.
- Ensured proper handling of environment variables for server configuration.

This refactor aims to streamline the evaluation process and improve error handling.
2025-06-18 21:04:47 +02:00
Robert Kim
3548d0fe35 lint 2025-06-18 14:55:00 +01:00
Magnus Müller
3e796a5773 Merge branch 'main' into laminar_evals 2025-06-18 15:34:26 +02:00
Robert Kim
a0d6a08119 v0 2025-06-18 14:21:48 +01:00
mertunsall
49bdba1578 Add evaluation criterion for handling vague user tasks in service.py 2025-06-18 13:18:40 +02:00
mertunsall
3c5c3b69cb Refine evaluation criteria and system message in service.py 2025-06-18 11:53:34 +02:00
mertunsall
1864e52635 Merge branch 'main' into mert/new_everything 2025-06-18 10:00:25 +02:00
Magnus Müller
eb2aabb7f8 Remove duplicate 2025-06-17 19:21:29 +02:00
Magnus Müller
c66880a8fa Add lmnr package for tracing integration and update eval workflow 2025-06-17 18:58:56 +02:00
mertunsall
0c34c399db Merge branch 'main' into mert/new_everything 2025-06-17 11:19:06 +02:00
Nick Sweeting
86abd92b79 reset timeouts in evals back to faster defaults 2025-06-11 01:46:41 -07:00
Nick Sweeting
520bbbe6c0 Update eval/service.py
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-06-11 02:28:13 -04:00
Nick Sweeting
1c64c483d9 allow https errors on evals because they include http urls for some reason 2025-06-10 06:27:35 -07:00
Nick Sweeting
ed6db5802b fix timeouts order of magnitude 2025-06-10 05:42:23 -07:00
Nick Sweeting
a0ee5de2ad tweak browsersession timeouts 2025-06-10 05:18:56 -07:00
Alezander9
8e7663758d modify service to accept tasks with login cookies 2025-06-09 22:09:48 -07:00
Magnus Müller
c48d83f54d Refactor web search functionality to improve SERP API integration. Updated action description for clarity, streamlined data handling by excluding unnecessary fields, and enhanced logging for debugging. Returns only relevant organic search results. 2025-06-09 15:38:19 +02:00
Magnus Müller
75413c31c7 Merge branch 'main' into mert/new_everything 2025-06-09 11:48:57 +02:00
Alezander9
739a3b2c87 add flag to toggle whether webjudge gets to see final agent result 2025-06-08 17:49:05 -07:00
Alezander9
bc56c73fe7 add ability to append results to existing run, so we can parallelize 2025-06-07 21:47:47 -07:00
Magnus Müller
52d4191ba9 Merge branch 'main' into mert/new_everything 2025-06-07 17:58:09 +02:00
Alezander9
b1cc677b01 add samba nova models 2025-06-06 17:38:36 -07:00
Alezander9
d9f1fb7bb4 track and report repo that code was run on in evals 2025-06-03 14:07:42 -07:00
Magnus Müller
aec9fd29a3 Merge branch 'main' into mert/new_everything 2025-06-03 19:59:17 +02:00
Alezander9
61f9c31a3d feat: support changing eval task set 2025-06-03 10:27:34 -07:00
mertunsall
9025e1ccba clean a little 2025-06-03 17:37:24 +02:00
mertunsall
286d3dec60 introduce max tokens 2025-05-31 18:22:34 +02:00
Alezander9
4c7e173b62 add logs to track semaphore management 2025-05-24 12:49:04 -07:00
Alezander9
25b9c44bc3 fix issue where out of place return statement could skip server upload function 2025-05-24 12:03:07 -07:00
Alezander9
38c7307169 add more logs to track down missing tasks in pipeline 2025-05-24 10:25:22 -07:00
Alezander9
a3dd8b004b update eval workflow with new arguments 2025-05-23 14:46:14 -07:00
Alezander9
3cfda361d1 move navigating to starting url code into the browser session setup 2025-05-23 11:44:04 -07:00
Alezander9
620fe6d254 add new layers of stage specific exception handlers 2025-05-23 11:35:12 -07:00