Commit Graph

109 Commits

Author SHA1 Message Date
Magnus Müller
d4a29c4b93 Improves evaluation robustness and reporting
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.

The changes include:

- Adds a fallback from the comprehensive judge to the Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
2025-06-23 00:08:14 +02:00
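
The local summary described in this commit (total tasks, success rate, average score) could look roughly like the sketch below. The log does not show the actual function or result schema, so `LocalResult` and `summarize_local_results` are hypothetical names and the field layout is an assumption.

```python
from typing import TypedDict


class LocalResult(TypedDict):
    # Hypothetical per-task record; the real schema is not shown in the log.
    task_id: str
    success: bool
    score: float


def summarize_local_results(results: list[LocalResult]) -> dict[str, float]:
    """Return total tasks, success rate, and average score for one run."""
    total = len(results)
    if total == 0:
        # Avoid dividing by zero when no task produced a result.
        return {"total_tasks": 0, "success_rate": 0.0, "average_score": 0.0}
    successes = sum(1 for r in results if r["success"])
    return {
        "total_tasks": total,
        "success_rate": successes / total,
        "average_score": sum(r["score"] for r in results) / total,
    }
```

Returning a plain dict keeps the summary easy to log or dump as JSON next to the per-task results.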
Magnus Müller
be170fb17a Ensures payload serialization preserves dict structure
Adds a type assertion to ensure that the payload remains a dictionary after serialization.

Also, adds type hints to `make_json_serializable` for better code clarity and maintainability.
2025-06-22 23:50:46 +02:00
Magnus Müller
4a26f07c66 Ensures JSON serializability of task results
Adds a utility function to convert objects within a payload to JSON-serializable types before returning the task result.

This change addresses potential issues where the task result contains non-serializable objects (e.g., enums, custom objects), preventing proper data handling.
2025-06-22 23:46:39 +02:00
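
A minimal sketch of such a conversion utility follows. The name `make_json_serializable` appears in the log, but its body is not shown, so the handling rules here (enums to their values, dataclasses to dicts, unknown objects to `str`) are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from enum import Enum
from typing import Any


def make_json_serializable(obj: Any) -> Any:
    """Recursively convert obj into JSON-friendly builtins (assumed rules)."""
    if isinstance(obj, Enum):
        return make_json_serializable(obj.value)
    if is_dataclass(obj) and not isinstance(obj, type):
        return make_json_serializable(asdict(obj))
    if isinstance(obj, dict):
        # JSON object keys must be strings.
        return {str(k): make_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [make_json_serializable(v) for v in obj]
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    # Last resort for custom objects: fall back to their string form.
    return str(obj)
```

A dict payload stays a dict after conversion, so a caller can assert `isinstance(result, dict)` before handing it to `json.dumps`.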
Magnus Müller
0a5a29e4a8 Updates judge system import path
Updates the import path for the comprehensive judge system to reflect its new location in the project structure.

This resolves an issue where the previous relative import was causing import errors.
2025-06-22 23:26:56 +02:00
Magnus Müller
eeb8024184 Handles varied LLM response formats
Ensures the judge system can correctly parse LLM responses, accommodating both string and list content types.

Adds a fallback mechanism to guarantee a result even if maximum retry attempts are exceeded, enhancing robustness and type safety.
2025-06-22 23:12:37 +02:00
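
One way to handle both content types and still guarantee a result is sketched below. The helper names, the `{"text": ...}` block shape for list content, and the fallback payload are all assumptions; the log does not show the judge system's actual code.

```python
import json
from typing import Any, Callable


def content_to_text(content: Any) -> str:
    """Flatten LLM message content that may be a string or a list of parts."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts: list[str] = []
        for item in content:
            if isinstance(item, str):
                parts.append(item)
            elif isinstance(item, dict) and isinstance(item.get("text"), str):
                # Assumed block shape: {"type": "text", "text": "..."}
                parts.append(item["text"])
        return "".join(parts)
    return str(content)


def judge_with_fallback(call_llm: Callable[[], Any], max_retries: int = 3) -> dict[str, Any]:
    """Retry parsing the judge reply; return a fallback verdict if all attempts fail."""
    for _ in range(max_retries):
        try:
            parsed = json.loads(content_to_text(call_llm()))
            if isinstance(parsed, dict):
                return parsed
        except Exception:
            continue  # malformed reply or failed call: try again
    # Hypothetical fallback so callers always receive a dict.
    return {"success": False, "reason": "judge failed after retries"}
```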
Magnus Müller
4629a8d9b7 Fixes relative import and type hints
Fixes a relative import issue for the judge system.

Updates type hints to allow None values for laminar_link and critical_error.

Comments out unused code related to Laminar link updates.
2025-06-22 23:07:10 +02:00
Magnus Müller
be16ff3f69 Implement comprehensive judge system for task evaluation
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
2025-06-22 22:43:57 +02:00
Nick Sweeting
4a8a4155b3 try keep alive browsers 2025-06-21 07:23:54 -07:00
Nick Sweeting
ac22e6ae20 Test fixes, eventbus tweaks, docs updates, and better warnings (#2027) 2025-06-21 06:32:11 -07:00
Nick Sweeting
0af8c8c0fe imports 2025-06-21 06:29:10 -07:00
Nick Sweeting
3209fd95f7 lint and hint fixes 2025-06-21 06:07:21 -07:00
Nick Sweeting
aad78d93ab more type hint fixes 2025-06-21 05:44:49 -07:00
Nick Sweeting
6bc1f7985f more type hint fixes 2025-06-21 04:56:27 -07:00
Magnus Müller
aeea1788fa fix: CLI argument default conflict for highlight_elements
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements

Resolves bug where CLI users got highlighting disabled by default instead of enabled
2025-06-21 12:52:40 +02:00
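
The flag change in this commit maps onto standard `argparse` behavior: `action='store_true'` defaults to `False` when the flag is absent, while `action='store_false'` defaults to `True`. The sketch below shows the corrected form; `build_parser` is a hypothetical wrapper, not the code in `eval/service.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the highlight flag")
    parser.add_argument(
        "--no-highlight-elements",
        dest="highlight_elements",  # value is stored under the positive name
        action="store_false",       # passing the flag sets it to False
        default=True,               # explicit, though store_false implies it
        help="Disable highlighting of interactive elements",
    )
    return parser
```

With this parser, running without the flag yields `highlight_elements=True`, which matches a function default of `True`; passing `--no-highlight-elements` turns highlighting off.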
Magnus Müller
9292e6c48d feat: add highlight_elements flag for controlling element highlighting
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation
2025-06-21 12:40:41 +02:00
Nick Sweeting
d8c6876e08 tweak BrowserSession failure message details 2025-06-21 01:03:09 -07:00
Magnus Müller
9b72be9ea0 Update logging format in service.py to include logger name for better context in log messages. 2025-06-21 09:55:38 +02:00
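
Including the logger name is done with the `%(name)s` field of the standard `logging` format string. The exact format used in `service.py` is not shown in the log, so `LOG_FORMAT` and `configure_logging` below are assumptions.

```python
import logging

# Assumed layout; %(name)s is the field that adds the logger name.
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def configure_logging(level: int = logging.INFO) -> None:
    """Apply the format to the root logger, replacing existing handlers."""
    logging.basicConfig(level=level, format=LOG_FORMAT, force=True)
```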
Magnus Müller
f1d5dc5a17 Pass laminar_eval_id from frontend 2025-06-21 09:31:14 +02:00
Magnus Müller
83d92513a4 Monitor eval cpu 2025-06-20 23:35:56 +02:00
Magnus Müller
4a8cf30dac Merge branch 'main' into mert/new_everything 2025-06-20 12:27:19 +02:00
Magnus Müller
0e5a8942f3 Add gemini-2.5-flash 2025-06-20 12:19:47 +02:00
Magnus Müller
4c2952d640 Squashed commit of the following:
commit a9cf53a1b1
Merge: 5aa62c11 0f9ffa10
Author: Magnus Müller <67061560+MagMueller@users.noreply.github.com>
Date:   Fri Jun 20 10:41:19 2025 +0200

    Set user_data_dir to None (#2015)

    <!-- This is an auto-generated description by cubic. -->
    Changed browser session setup to use incognito mode by setting
    user_data_dir to None, preventing persistent state between evaluation
    runs.

    <!-- End of auto-generated description by cubic. -->

commit 0f9ffa1072
Author: Magnus Müller <67061560+MagMueller@users.noreply.github.com>
Date:   Fri Jun 20 10:38:01 2025 +0200

    Set user_data_dir to None

commit 5aa62c1113
Merge: d8a9d21b e559ff5e
Author: Nick Sweeting <git@sweeting.me>
Date:   Thu Jun 19 23:01:49 2025 -0700

    Fix cross-origin iframe DOM retrieval (#1965)

commit d8a9d21b00
Merge: 3e5f3049 b6be1583
Author: Nick Sweeting <git@sweeting.me>
Date:   Thu Jun 19 23:01:21 2025 -0700

    Fix critical domain restriction bypass vulnerability (#2006)

commit b6be158319
Author: Sahar <saharhashai@gmail.com>
Date:   Thu Jun 19 02:28:34 2025 -0700

    Delete tests/ci/test_security_url_validation.py

commit aca4b57329
Author: Sahar <saharhashai@gmail.com>
Date:   Thu Jun 19 02:27:57 2025 -0700

    Delete SECURITY_FIX_REPORT.md

commit 45872c1e45
Author: Your Name <your.email@example.com>
Date:   Thu Jun 19 11:24:50 2025 +0200

    fix(security): prevent domain restriction bypass in controller actions

    - Add domain validation to controller.click() and controller.type() methods
    - Implement comprehensive security checks before executing actions
    - Prevent potential prompt injection and unauthorized data access
    - Add extensive test coverage for domain validation scenarios
    - Update documentation with security considerations

    This critical fix prevents complete bypass of domain restrictions that
    could enable attackers to perform unauthorized actions on any domain.

commit e559ff5eaa
Merge: 19ae8a11 f348e0c5
Author: Nick Sweeting <git@sweeting.me>
Date:   Sat Jun 14 01:56:09 2025 -0700

    Merge branch 'main' into main

commit 19ae8a1146
Merge: e1b3ff9e 08ed0be3
Author: Nick Sweeting <git@sweeting.me>
Date:   Sat Jun 14 00:31:30 2025 -0700

    Merge branch 'main' into main

commit e1b3ff9e9d
Author: Ilya Biryukov <ilbiryuk@microsoft.com>
Date:   Thu Jun 12 17:40:40 2025 -0700

    Revert changes to examples/features/multiple_agents_same_browser.py

commit d20a3b55d6
Author: Ilya Biryukov <ilbiryuk@microsoft.com>
Date:   Thu Jun 12 17:30:59 2025 -0700

    Fix pre-commit lint issues and compile error in multiple_agents_same_browser

commit 13d5468aa2
Author: Ilya Biryukov <ilbiryuk@microsoft.com>
Date:   Thu Jun 12 14:07:21 2025 -0700

    Fix cross-origin iframe DOM retrieval
2025-06-20 10:51:06 +02:00
Magnus Müller
0f9ffa1072 Set user_data_dir to None 2025-06-20 10:38:01 +02:00
Magnus Müller
90ae26316e Refactor ActionResult to standardize the inclusion of extracted content, replacing update_only_read_state with include_extracted_content_only_once across multiple services. This change enhances clarity in memory management and ensures consistent handling of extracted content. 2025-06-19 23:18:30 +02:00
Magnus Müller
ce880e5e35 Refactor ActionResult handling across multiple services to standardize the use of long_term_memory, replacing memory references. Update related logic to ensure extracted content is consistently managed for improved clarity and error handling. 2025-06-19 23:11:55 +02:00
Magnus Müller
c62d14d9ed Update action result handling in service.py to include extracted content in memory for paused agent states and SERPER API key checks, enhancing error reporting consistency. 2025-06-19 22:52:53 +02:00
mertunsall
86083b09c9 reset eval to main to be comparable 2025-06-19 17:02:58 +02:00
mertunsall
76ef41da45 Merge branch 'main' into mert/new_everything 2025-06-19 14:11:56 +02:00
Magnus Müller
f89a97f256 Delete laminar files since it is integrated to the main eval service 2025-06-19 11:26:53 +02:00
Magnus Müller
0b6ebea431 Merge branch 'main' into mert/new_everything 2025-06-19 09:49:43 +02:00
Magnus Müller
2c627a93fc Fix event loop error 2025-06-18 23:55:27 +02:00
Magnus Müller
5b4b3dd5aa Integrate Laminar tracing and enhance evaluation workflow
- Added integration for Laminar tracing by initializing `AsyncLaminarClient` and creating evaluation links during task execution.
- Updated `TaskResult` class to include an optional `laminar_task_link` for task-specific links.
- Enhanced logging for task results and added error handling for Laminar datapoint creation.
- Improved the `run_task_with_semaphore` function to manage Laminar evaluation links and update datapoints with evaluation scores.

These changes aim to streamline the evaluation process and improve tracking of task performance.
2025-06-18 23:27:30 +02:00
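
The general pattern behind `run_task_with_semaphore` and an optional `laminar_task_link` on `TaskResult` can be sketched as follows. The names come from the log, but the fields, the signature, and the `run_all` helper are assumptions, and no Laminar client calls are shown because their usage is not visible here.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    task_id: str
    score: float
    laminar_task_link: Optional[str] = None  # optional, per the log


async def run_task_with_semaphore(
    task_id: str,
    semaphore: asyncio.Semaphore,
    laminar_task_link: Optional[str] = None,
) -> TaskResult:
    # The semaphore caps how many tasks run at the same time.
    async with semaphore:
        await asyncio.sleep(0)  # stand-in for running the agent
        return TaskResult(task_id, score=1.0, laminar_task_link=laminar_task_link)


async def run_all(task_ids: list[str], max_parallel: int = 2) -> list[TaskResult]:
    semaphore = asyncio.Semaphore(max_parallel)
    return await asyncio.gather(
        *(run_task_with_semaphore(t, semaphore) for t in task_ids)
    )
```

Calling `asyncio.run(run_all(["a", "b", "c"]))` returns results in input order, because `asyncio.gather` preserves the order of its awaitables.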
Magnus Müller
081f5747c6 Refactor run_task_with_semaphore to use laminar_task_link
- Renamed variable `link` to `laminar_task_link` for clarity in the `run_task_with_semaphore` function.
- Updated the creation of `TaskResult` to utilize `laminar_task_link` instead of the previous `link` variable.
- Improved logging to reflect the new variable name, enhancing readability and maintainability.

These changes aim to improve code clarity and maintain consistency in the evaluation workflow.
2025-06-18 22:41:56 +02:00
Magnus Müller
158aa5b719 Enhance evaluation workflow in laminar_eval.py and update dependencies
- Removed the `lmnr[all]` dependency from `pyproject.toml`.
- Added `browser-use[dev,eval]` to `dev-dependencies` for improved development support.
- Updated `TaskResult` class to include an optional `laminar_task_link` for task-specific links.
- Modified `run_task_with_semaphore` to handle Laminar evaluation links and improved logging for task results.
- Added logic to create a Laminar evaluation link during task execution.

These changes aim to streamline the evaluation process and enhance the overall functionality of the evaluation workflow.
2025-06-18 22:41:05 +02:00
Magnus Müller
9547ebb3bb Refactor laminar_eval.py to enhance task evaluation workflow
- Updated `run_task_with_semaphore` to use `lmnr_run_id` for evaluation ID instead of `run_id`.
- Added a new helper function `start_new_run` to initiate evaluation runs on the server.
- Improved logging for task results and server interactions.
- Ensured proper handling of environment variables for server configuration.

This refactor aims to streamline the evaluation process and improve error handling.
2025-06-18 21:04:47 +02:00
Robert Kim
3548d0fe35 lint 2025-06-18 14:55:00 +01:00
Magnus Müller
3e796a5773 Merge branch 'main' into laminar_evals 2025-06-18 15:34:26 +02:00
Robert Kim
a0d6a08119 v0 2025-06-18 14:21:48 +01:00
mertunsall
49bdba1578 Add evaluation criterion for handling vague user tasks in service.py 2025-06-18 13:18:40 +02:00
mertunsall
3c5c3b69cb Refine evaluation criteria and system message in service.py 2025-06-18 11:53:34 +02:00
mertunsall
1864e52635 Merge branch 'main' into mert/new_everything 2025-06-18 10:00:25 +02:00
Magnus Müller
eb2aabb7f8 Remove duplicate 2025-06-17 19:21:29 +02:00
Magnus Müller
c66880a8fa Add lmnr package for tracing integration and update eval workflow 2025-06-17 18:58:56 +02:00
mertunsall
0c34c399db Merge branch 'main' into mert/new_everything 2025-06-17 11:19:06 +02:00
Nick Sweeting
86abd92b79 reset timeouts in evals back to faster defaults 2025-06-11 01:46:41 -07:00
Nick Sweeting
520bbbe6c0 Update eval/service.py
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
2025-06-11 02:28:13 -04:00
Nick Sweeting
1c64c483d9 allow https errors on evals because they include http urls for some reason 2025-06-10 06:27:35 -07:00
Nick Sweeting
ed6db5802b fix timeouts order of magnitude 2025-06-10 05:42:23 -07:00
Nick Sweeting
a0ee5de2ad tweak browsersession timeouts 2025-06-10 05:18:56 -07:00
Alezander9
8e7663758d modify service to accept tasks with login cookies 2025-06-09 22:09:48 -07:00