Commit Graph

3672 Commits

Author SHA1 Message Date
Nick Sweeting
8ba2ddaef1 switch_tab and bring_to_front + autorestart browser 2025-06-27 04:17:26 -07:00
Nick Sweeting
96e1743962 hack to force browser validity 2025-06-27 03:56:51 -07:00
Nick Sweeting
93b6b365f5 lower max screenshot size 2025-06-27 03:56:32 -07:00
Nick Sweeting
248dcedab9 dunno why pid detection stopped working 2025-06-27 03:43:45 -07:00
Nick Sweeting
072ad02a1b fix tests 2025-06-27 03:41:15 -07:00
Nick Sweeting
69fa8fde09 fix logging 2025-06-27 03:40:16 -07:00
Nick Sweeting
9760ac76c2 more test fixes 2025-06-27 03:36:54 -07:00
Nick Sweeting
88736d704d bump version 2025-06-27 03:36:41 -07:00
Nick Sweeting
d5678324cb comment 2025-06-27 03:28:34 -07:00
Nick Sweeting
dd5edc5324 test fixes 2025-06-27 03:28:34 -07:00
Nick Sweeting
f3dc2b300a Merge branch 'main' into semaphores 2025-06-27 03:28:20 -07:00
Nick Sweeting
ffba61d8f0 tweak step timeouts 2025-06-27 03:16:42 -07:00
Nick Sweeting
6fd0883923 more test fixes 2025-06-27 03:03:12 -07:00
Nick Sweeting
fe3af9479a run only one proc for tests for now 2025-06-27 03:02:03 -07:00
Nick Sweeting
96ed583d68 fix connection tests 2025-06-27 02:57:17 -07:00
Magnus Müller
6dce3ea50d Fix single task run ID handling for cloud execution (#2146)
## Problem

Single task cloud execution was failing with validation errors because
the evaluation script was generating local run IDs instead of using the
provided Convex run ID.

## Solution

- Modified the evaluation script to respect the provided run ID parameter in single task mode
- Removed fragile local run ID generation that caused database
validation errors
- Now follows the same robust pattern as regular evaluation runs
- Maintains backward compatibility for local single task runs without
server

## Changes

- **service.py**: Use provided run_id when available in single task mode
instead of always generating a local ID

## Testing

- Single task cloud execution now works without validation errors
- Results are properly saved to the correct Convex run
- Local single task runs still work as fallback

Fixes the issue where single task runs were showing success but failing
to save results due to run ID mismatch.
    
---

## Summary by cubic
Fixed single task cloud execution to use the provided Convex run ID,
preventing validation errors and ensuring results are saved correctly.

- **Bug Fixes**
  - Uses the given run ID in single task mode instead of always generating
    a local one.
  - Keeps local single task runs working as before.

2025-06-27 11:51:49 +02:00
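The run ID handling described in the commit above can be illustrated with a minimal sketch. It assumes the fix amounts to "prefer the provided ID, fall back to a locally generated one". The helper name `resolve_run_id` and the `local-` prefix are hypothetical and are not taken from `service.py`.

```python
import uuid
from typing import Optional


def resolve_run_id(provided_run_id: Optional[str]) -> str:
    """Pick the run ID to save results under (hypothetical helper)."""
    if provided_run_id:
        # A server-issued ID wins, so results are saved to the existing run.
        return provided_run_id
    # Fallback for local single task runs without a server: generate a local ID.
    # The "local-" prefix is an assumption made for this illustration.
    return f"local-{uuid.uuid4().hex[:12]}"
```

Calling `resolve_run_id("abc123")` returns the ID unchanged, while `resolve_run_id(None)` yields a fresh local ID, which is the backward-compatible path the commit message mentions.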
Nick Sweeting
c0c895da58 fix tests 2025-06-27 02:50:49 -07:00
Magnus Müller
772bc063a2 fix: use provided run_id in single task mode instead of generating local ID
- Fixes single task cloud execution to properly use Convex run ID
- Removes fragile local run ID generation that caused validation errors
- Ensures single task results are saved to correct run in database
- Maintains backward compatibility for local single task runs
2025-06-27 11:40:54 +02:00
Nick Sweeting
4bbae404ca whoops 2025-06-27 02:35:59 -07:00
Nick Sweeting
c6dae628b7 fix type hinter 2025-06-27 02:35:37 -07:00
Nick Sweeting
fff91453d1 dont model_copy if possible 2025-06-27 02:33:46 -07:00
Nick Sweeting
584a68cb89 always copy pid 2025-06-27 02:30:20 -07:00
Nick Sweeting
13142831a2 allow proceeding without pid 2025-06-27 02:20:36 -07:00
Nick Sweeting
641a73c509 remove CDP direct calls to playwright private APIs 2025-06-27 02:10:08 -07:00
Nick Sweeting
28ca11319c add overload warnings 2025-06-27 01:59:38 -07:00
Nick Sweeting
bc6c305005 fix cdp screenshot args 2025-06-27 01:56:11 -07:00
Magnus Müller
f1cb0079d0 eval-single-task-run (#2142)
Auto-generated PR for branch: eval-single-task-run
    
---

## Summary by cubic
Added support for running a single evaluation task by passing task
details directly, without needing to fetch tasks from the server.

- **New Features**
  - Accepts task ID, text, website, and branch as parameters for single
    task runs.
  - Skips server fetch and uses a local run ID if only a single task is
    provided.
  - Results are saved locally or to the server if credentials are present.

2025-06-27 10:16:26 +02:00
Magnus Müller
3de686d6af refactor: remove branch argument from single task mode in evaluation workflow
- Eliminated the branch argument from both eval.yaml and service.py for single task mode, simplifying argument parsing.
- Updated related logic to ensure backward compatibility while maintaining functionality for task ID, text, and website.
- Enhanced environment variable loading for improved clarity and consistency.
2025-06-27 10:11:21 +02:00
Nick Sweeting
9cc9164285 tweak browser session start test 2025-06-27 01:07:25 -07:00
Nick Sweeting
bce64a4840 update dockerfile 2025-06-27 00:59:06 -07:00
Magnus Müller
bb11c7e7ca feat: add single task mode support in evaluation workflow
- Introduced parameters for single task mode in eval.yaml, allowing task ID, text, website, and branch to be specified.
- Updated service.py to handle single task mode, including conditional saving to the server and local run ID generation.
- Enhanced argument parsing to accommodate single task mode, ensuring backward compatibility with existing multi-task functionality.
2025-06-27 09:57:37 +02:00
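As a rough illustration of the argument parsing this commit describes, the sketch below adds optional single task parameters alongside a run ID. All flag names here are hypothetical, and so is the rule that single task mode requires both a task ID and task text; the actual names and conditions in `service.py` and `eval.yaml` may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser; every flag name is an assumption."""
    parser = argparse.ArgumentParser(description="Evaluation runner (illustrative)")
    parser.add_argument("--run-id", default=None)
    # Optional single task parameters; omitting them keeps multi-task behavior.
    parser.add_argument("--single-task-id", default=None)
    parser.add_argument("--single-task-text", default=None)
    parser.add_argument("--single-task-website", default=None)
    parser.add_argument("--single-task-branch", default=None)
    return parser


def is_single_task(args: argparse.Namespace) -> bool:
    """Treat the run as single task mode when an ID and task text are both given."""
    return bool(args.single_task_id and args.single_task_text)
```

For example, `build_parser().parse_args(["--single-task-id", "t1", "--single-task-text", "Find the pricing page"])` produces arguments for which `is_single_task` returns `True`, while an empty argument list leaves the existing multi-task path untouched.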
Nick Sweeting
1e93507fe3 fix cdp calls 2025-06-27 00:57:09 -07:00
Nick Sweeting
35b9d754da add new tests 2025-06-27 00:55:42 -07:00
Nick Sweeting
892db27b3c clone session inside model_copy 2025-06-27 00:44:06 -07:00
Nick Sweeting
12801b7754 only kill browser resources if owned by the session 2025-06-27 00:40:51 -07:00
Nick Sweeting
984b287ede lower CI performance expectations 2025-06-27 00:12:37 -07:00
Nick Sweeting
db7998efd0 fix type checks 2025-06-27 00:10:30 -07:00
Nick Sweeting
6c05a720ed auto-clear singletonLock on startup 2025-06-27 00:09:40 -07:00
Nick Sweeting
1415562d6e add portalocker multiprocess semaphore 2025-06-26 23:53:28 -07:00
Nick Sweeting
51b7e8ae66 always clear context and check connection properly using async call 2025-06-26 23:52:54 -07:00
Nick Sweeting
946ef4068d add stop hints and better logging 2025-06-26 23:52:19 -07:00
Nick Sweeting
ce2db1b9a8 exclude debug files from pytest 2025-06-26 22:58:36 -07:00
Nick Sweeting
747959a3d0 keep reference to the original browser session when copying 2025-06-26 22:45:05 -07:00
Nick Sweeting
112162cd3f prevent parent session from being garbage collected 2025-06-26 22:35:07 -07:00
Nick Sweeting
4f1fd7523f fix playwright internals calls 2025-06-26 22:33:23 -07:00
Nick Sweeting
7e5fc61b08 use raw channel directly for cdp 2025-06-26 20:06:45 -07:00
Nick Sweeting
7f4a3017c4 use raw channel directly for cdp 2025-06-26 20:03:36 -07:00
Nick Sweeting
c3bd596961 manually copy over browser_context instance when forking BrowserSession 2025-06-26 20:00:56 -07:00
Nick Sweeting
e054756f13 increase parallel browsers for stress test 2025-06-26 19:53:43 -07:00
Nick Sweeting
48fd583ae1 use raw CDP call for screenshot instead of playwright CDP session 2025-06-26 19:52:50 -07:00