Magnus Müller
672e749d0f
fix evaluate test
2025-10-24 14:17:48 -07:00
Magnus Müller
3cbea6fbf6
Add env key
2025-10-24 02:13:27 -07:00
Magnus Müller
6cf5c54075
Add api key
2025-10-24 02:12:40 -07:00
Magnus Müller
1404935a5c
Passed 0%
2025-10-24 01:56:39 -07:00
Magnus Müller
2c7d3372e1
Add setup-chromium job to pre-cache chromium before parallel test runs
Prevents 40 parallel runners from racing to install chromium simultaneously on cache miss.
Before: 40 runners × 2min = 80 runner-minutes wasted on first run
After: 1 runner installs (2min), then 40 runners use cached version (10s each)
Savings on cache miss: ~78 runner-minutes per workflow run
2025-10-24 01:19:06 -07:00
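The runner-minute estimate in the setup-chromium commit above can be checked with a short calculation. This is only a sketch: the numbers (40 runners, 2 min install, 10 s cache restore) come from the commit message, while the cost model and all names below are assumptions made for illustration.

```python
# Figures from the commit message; the cost model itself is an assumption.
RUNNERS = 40
INSTALL_MIN = 2.0             # minutes to install chromium on a cache miss
CACHE_RESTORE_MIN = 10 / 60   # minutes for one runner to restore the cached copy


def runner_minutes_before(runners: int = RUNNERS) -> float:
    """Every parallel runner installs chromium itself on a cache miss."""
    return runners * INSTALL_MIN


def runner_minutes_after(runners: int = RUNNERS, count_restores: bool = True) -> float:
    """One setup runner installs; the parallel runners only restore the cache."""
    restores = runners * CACHE_RESTORE_MIN if count_restores else 0.0
    return INSTALL_MIN + restores


before = runner_minutes_before()
after_with_restores = runner_minutes_after()
after_install_only = runner_minutes_after(count_restores=False)

print(f"before:                     {before:.1f} runner-minutes")
print(f"after (restores counted):   {after_with_restores:.1f} runner-minutes")
print(f"savings (restores counted): {before - after_with_restores:.1f}")
print(f"savings (install only):     {before - after_install_only:.1f}")
```

Under these assumptions the ~78 figure in the commit message corresponds to counting only the single install; including the 40 cache restores brings the savings closer to 71 runner-minutes.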
Magnus Müller
7a463675ff
Update LLM integration in evaluate_tasks.py to use ChatGoogle instead of ChatOpenAI for agent and judge initialization
2025-10-12 00:50:22 -07:00
Cyrus
9d7dfd7738
replace aiofiles with anyio.Path
2025-09-08 11:37:34 -06:00
Magnus Müller
4a8d1e883b
Fix test
2025-08-29 18:29:14 -07:00
Magnus Müller
0373511de6
Set is local to false by default
2025-08-27 10:08:17 -07:00
Magnus Müller
82b241ddbb
Linter new line
2025-08-26 18:13:11 -07:00
Nick Sweeting
878586edcd
examples fixes
2025-08-14 14:32:32 -07:00
Nick Sweeting
5c551272cb
fix screenshots
2025-08-08 05:14:01 -07:00
Nick Sweeting
d958260c42
tweak test fixture scope
2025-07-11 00:44:10 -07:00
Nick Sweeting
e461380998
switch evaluate_tasks back to patchright+chrome
2025-07-09 14:51:42 -07:00
mertunsall
13aa27b84e
fix tests
2025-07-09 16:20:07 +02:00
Gregor Žunič
7a10ae0c96
Squashed commit langchain to native
2025-06-24 12:26:55 +02:00
Nick Sweeting
3209fd95f7
lint and hint fixes
2025-06-21 06:07:21 -07:00
Nick Sweeting
e3d21d33a1
fix evaluate_tasks.py errors in CI
2025-06-16 17:20:27 -07:00
Nick Sweeting
b0ed680419
use patchright for stealth tests
2025-06-09 15:29:37 -07:00
Magnus Müller
eaab9f04d7
Enhance GitHub Actions workflow and evaluate_tasks.py to include detailed task evaluation results. The workflow now captures and displays detailed results in a structured format, while the Python script outputs detailed results as JSON for better integration with GitHub Actions. This improves visibility and understanding of task outcomes in the evaluation process.
2025-06-07 13:39:16 +02:00
Magnus Müller
e6171e0fc1
Disable Chromium sandbox in BrowserProfile for CI environment in evaluate_tasks.py to improve compatibility with GitHub Actions.
2025-06-07 12:59:22 +02:00
Magnus Müller
43d96da06c
Enhance error handling and debugging in evaluate_tasks.py by adding browser and LLM test calls. Capture and log errors during browser session initiation and agent execution, improving overall troubleshooting capabilities.
2025-06-07 12:53:14 +02:00
Magnus Müller
9899262186
Add additional debug logging in evaluate_tasks.py to track task execution flow and subprocess outputs. Enhance error reporting by capturing and displaying full stdout and stderr for better troubleshooting during agent execution.
2025-06-07 12:48:04 +02:00
Magnus Müller
54226c7b51
Add detailed debug logging in evaluate_tasks.py to capture agent execution steps and outputs. Enhance error reporting for subprocess failures and improve output handling when no agent output is provided.
2025-06-07 12:43:51 +02:00
Magnus Müller
592b9afbce
Refactor evaluate_tasks.py to run agent tasks in separate subprocesses, enhancing isolation and error handling. Introduce argparse for task selection and improve logging management during task execution.
2025-06-07 12:37:07 +02:00
Magnus Müller
8246bcd299
Refactor run_task in evaluate_tasks.py to create a dedicated BrowserSession for each agent, improving session management and ensuring headless execution with --no-sandbox argument.
2025-06-07 11:58:01 +02:00
Magnus Müller
38cfa86738
Update browser_use_pip.yaml to simplify output requirements and refactor run_task in evaluate_tasks.py to remove shared profile parameter, enhancing browser session management with a dedicated profile.
2025-06-07 11:50:30 +02:00
Magnus Müller
bf13dbb452
Refactor run_task function in evaluate_tasks.py to accept a shared profile for consistent browser sessions across tasks
2025-06-07 11:40:19 +02:00
Magnus Müller
d50581e87c
Remove --no-sandbox argument from BrowserProfile in evaluate_tasks.py to simplify configuration
2025-06-07 11:28:46 +02:00
Magnus Müller
df14af7f00
Add --no-sandbox argument to BrowserProfile in evaluate_tasks.py for enhanced security during task execution
2025-06-07 11:20:26 +02:00
Magnus Müller
b29dbe5f2a
Remove keep_alive option from BrowserProfile in evaluate_tasks.py to streamline session management
2025-06-07 11:10:51 +02:00
Magnus Müller
576519ee40
Enhance CI workflow by adding agent tasks evaluation step and updating evaluate_tasks.py to output evaluation results
2025-06-07 10:59:24 +02:00
Magnus Müller
412904dc65
Add keep_alive option to BrowserProfile in evaluate_tasks.py for improved session persistence
2025-06-07 10:53:58 +02:00
Magnus Müller
3666d2b077
Add agent tasks evaluation script and update CI workflow to include it
2025-06-07 10:49:03 +02:00