34 Commits

Author SHA1 Message Date
Magnus Müller
672e749d0f fix evaluate test 2025-10-24 14:17:48 -07:00
Magnus Müller
3cbea6fbf6 Add env key 2025-10-24 02:13:27 -07:00
Magnus Müller
6cf5c54075 Add api key 2025-10-24 02:12:40 -07:00
Magnus Müller
1404935a5c Passed 0% 2025-10-24 01:56:39 -07:00
Magnus Müller
2c7d3372e1 Add setup-chromium job to pre-cache chromium before parallel test runs
Prevents 40 parallel runners from racing to install chromium simultaneously on cache miss.

Before: 40 runners × 2min = 80 runner-minutes wasted on first run
After: 1 runner installs (2min), then 40 runners use cached version (10s each)

Savings on cache miss: ~78 runner-minutes per workflow run
2025-10-24 01:19:06 -07:00
Magnus Müller
7a463675ff Update LLM integration in evaluate_tasks.py to use ChatGoogle instead of ChatOpenAI for agent and judge initialization 2025-10-12 00:50:22 -07:00
Cyrus
9d7dfd7738 replace aiofiles with anyio.Path 2025-09-08 11:37:34 -06:00
Magnus Müller
4a8d1e883b Fix test 2025-08-29 18:29:14 -07:00
Magnus Müller
0373511de6 Set is local to false by default 2025-08-27 10:08:17 -07:00
Magnus Müller
82b241ddbb Linter new line 2025-08-26 18:13:11 -07:00
Nick Sweeting
878586edcd examples fixes 2025-08-14 14:32:32 -07:00
Nick Sweeting
5c551272cb fix screenshots 2025-08-08 05:14:01 -07:00
Nick Sweeting
d958260c42 tweak test fixture scope 2025-07-11 00:44:10 -07:00
Nick Sweeting
e461380998 switch evaluate_tasks back to patchright+chrome 2025-07-09 14:51:42 -07:00
mertunsall
13aa27b84e fix tests 2025-07-09 16:20:07 +02:00
Gregor Žunič
7a10ae0c96 Squashed commit langchain to native 2025-06-24 12:26:55 +02:00
Nick Sweeting
3209fd95f7 lint and hint fixes 2025-06-21 06:07:21 -07:00
Nick Sweeting
e3d21d33a1 fix evaluate_tasks.py errors in CI 2025-06-16 17:20:27 -07:00
Nick Sweeting
b0ed680419 use patchright for stealth tests 2025-06-09 15:29:37 -07:00
Magnus Müller
eaab9f04d7 Enhance GitHub Actions workflow and evaluate_tasks.py to include detailed task evaluation results. The workflow now captures and displays detailed results in a structured format, while the Python script outputs detailed results as JSON for better integration with GitHub Actions. This improves visibility and understanding of task outcomes in the evaluation process. 2025-06-07 13:39:16 +02:00
Magnus Müller
e6171e0fc1 Disable Chromium sandbox in BrowserProfile for CI environment in evaluate_tasks.py to improve compatibility with GitHub Actions. 2025-06-07 12:59:22 +02:00
Magnus Müller
43d96da06c Enhance error handling and debugging in evaluate_tasks.py by adding browser and LLM test calls. Capture and log errors during browser session initiation and agent execution, improving overall troubleshooting capabilities. 2025-06-07 12:53:14 +02:00
Magnus Müller
9899262186 Add additional debug logging in evaluate_tasks.py to track task execution flow and subprocess outputs. Enhance error reporting by capturing and displaying full stdout and stderr for better troubleshooting during agent execution. 2025-06-07 12:48:04 +02:00
Magnus Müller
54226c7b51 Add detailed debug logging in evaluate_tasks.py to capture agent execution steps and outputs. Enhance error reporting for subprocess failures and improve output handling when no agent output is provided. 2025-06-07 12:43:51 +02:00
Magnus Müller
592b9afbce Refactor evaluate_tasks.py to run agent tasks in separate subprocesses, enhancing isolation and error handling. Introduce argparse for task selection and improve logging management during task execution. 2025-06-07 12:37:07 +02:00
Magnus Müller
8246bcd299 Refactor run_task in evaluate_tasks.py to create a dedicated BrowserSession for each agent, improving session management and ensuring headless execution with --no-sandbox argument. 2025-06-07 11:58:01 +02:00
Magnus Müller
38cfa86738 Update browser_use_pip.yaml to simplify output requirements and refactor run_task in evaluate_tasks.py to remove shared profile parameter, enhancing browser session management with a dedicated profile. 2025-06-07 11:50:30 +02:00
Magnus Müller
bf13dbb452 Refactor run_task function in evaluate_tasks.py to accept a shared profile for consistent browser sessions across tasks 2025-06-07 11:40:19 +02:00
Magnus Müller
d50581e87c Remove --no-sandbox argument from BrowserProfile in evaluate_tasks.py to simplify configuration 2025-06-07 11:28:46 +02:00
Magnus Müller
df14af7f00 Add --no-sandbox argument to BrowserProfile in evaluate_tasks.py for enhanced security during task execution 2025-06-07 11:20:26 +02:00
Magnus Müller
b29dbe5f2a Remove keep_alive option from BrowserProfile in evaluate_tasks.py to streamline session management 2025-06-07 11:10:51 +02:00
Magnus Müller
576519ee40 Enhance CI workflow by adding agent tasks evaluation step and updating evaluate_tasks.py to output evaluation results 2025-06-07 10:59:24 +02:00
Magnus Müller
412904dc65 Add keep_alive option to BrowserProfile in evaluate_tasks.py for improved session persistence 2025-06-07 10:53:58 +02:00
Magnus Müller
3666d2b077 Add agent tasks evaluation script and update CI workflow to include it 2025-06-07 10:49:03 +02:00
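
The runner-minute estimate in commit 2c7d3372e1 can be checked with a short calculation. The sketch below uses only the figures quoted in that commit message (40 runners, a 2-minute install, a 10-second cached restore). The function names and the accounting model are assumptions made for illustration and are not taken from the repository. Under this model the saving comes to roughly 71 runner-minutes. The commit's figure of ~78 appears to correspond to 80 minus the single 2-minute install, without counting the 10-second restores.

```python
# Figures quoted in the commit message for 2c7d3372e1.
RUNNERS = 40                 # parallel test runners
INSTALL_MIN = 2.0            # minutes to install chromium on a cache miss
CACHED_RESTORE_MIN = 10 / 60 # minutes to restore chromium from cache (10s)


def runner_minutes_before(runners=RUNNERS, install_min=INSTALL_MIN):
    """Every runner installs chromium itself on a cache miss."""
    return runners * install_min


def runner_minutes_after(runners=RUNNERS, install_min=INSTALL_MIN,
                         restore_min=CACHED_RESTORE_MIN):
    """One setup job installs once; every test runner restores from cache."""
    return install_min + runners * restore_min


before = runner_minutes_before()
after = runner_minutes_after()
print(f"before: {before:.1f} runner-minutes")
print(f"after:  {after:.1f} runner-minutes")
print(f"saved:  {before - after:.1f} runner-minutes")
```

The saving applies only to workflow runs that start with a cold cache. On a cache hit, both layouts restore chromium in about the same time, and the extra setup job adds a small fixed cost.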