- Eliminated the branch argument from both eval.yaml and service.py for single task mode, simplifying argument parsing.
- Updated related logic to ensure backward compatibility while maintaining functionality for task ID, text, and website.
- Enhanced environment variable loading for improved clarity and consistency.
- Introduced parameters for single task mode in eval.yaml, allowing task ID, text, website, and branch to be specified.
- Updated service.py to handle single task mode, including conditional saving to the server and local run ID generation.
- Enhanced argument parsing to accommodate single task mode, ensuring backward compatibility with existing multi-task functionality.
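The single-task arguments described above can be sketched roughly as follows; the actual flag names and run-ID format in `service.py` are assumptions, not confirmed by this changelog.

```python
import argparse
import uuid

def parse_args(argv=None):
    # Hypothetical sketch of single-task-mode parsing; real flag names
    # in service.py may differ.
    parser = argparse.ArgumentParser(description="Evaluation service")
    parser.add_argument("--task-id", help="Run a single task by ID")
    parser.add_argument("--task-text", help="Inline task description for single task mode")
    parser.add_argument("--task-website", help="Target website for the single task")
    args = parser.parse_args(argv)
    # Generate a local run ID when results are not saved to the server.
    args.run_id = f"local-{uuid.uuid4().hex[:8]}"
    return args

args = parse_args(["--task-id", "t1", "--task-text", "Find the pricing page"])
print(args.task_id)  # -> t1
```

Omitting all three task arguments would leave the parser compatible with the existing multi-task invocation, which is what the backward-compatibility bullet above implies.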
Removes the `fresh_start` option and the stage for loading existing results.
This change streamlines the evaluation pipeline by removing the option to load existing results. The pipeline now always executes from the browser setup stage, ensuring consistent and repeatable evaluation runs.
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.
The changes include:
- Adds a fallback from the comprehensive judge to the Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
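The local summary calculation mentioned above might look like the following sketch; the result-dict keys (`success`, `score`) are assumptions about the local result format, not taken from the actual code.

```python
def summarize_local_results(results):
    """Hypothetical sketch of the local summary described above; assumes each
    result is a dict with a boolean 'success' and a numeric 'score'."""
    total = len(results)
    if total == 0:
        return {"total_tasks": 0, "success_rate": 0.0, "average_score": 0.0}
    successes = sum(1 for r in results if r.get("success"))
    return {
        "total_tasks": total,
        "success_rate": successes / total,
        "average_score": sum(r.get("score", 0.0) for r in results) / total,
    }

summary = summarize_local_results([
    {"success": True, "score": 1.0},
    {"success": False, "score": 0.4},
])
print(summary["total_tasks"], summary["success_rate"])  # -> 2 0.5
```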
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
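The fallback flow described above can be sketched as below. This is a minimal illustration, not the real `service.py` wiring: the judges are passed in as callables, and the exception-based trigger for falling back is an assumption.

```python
import logging

logger = logging.getLogger(__name__)

def evaluate_run(run_data, primary_judge, fallback_judge):
    # Hypothetical sketch: try the comprehensive judge first and fall back
    # to the Mind2Web judge if it raises.
    try:
        return primary_judge(run_data)
    except Exception as exc:
        logger.warning("Comprehensive judge failed (%s); falling back to Mind2Web", exc)
        return fallback_judge(run_data)

def failing_judge(run):
    raise RuntimeError("LLM call timed out")

result = evaluate_run({"task": "demo"}, failing_judge, lambda run: {"score": 0.5})
print(result)  # -> {'score': 0.5}
```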
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix the CLI argument defaulting to False when the flag was not provided, which conflicted with the function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements
Resolves a bug where CLI users got highlighting disabled by default instead of enabled
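The flag change above amounts to the following argparse sketch: with `action='store_false'` and an explicit `dest`, the parsed value defaults to True and the flag becomes an opt-out.

```python
import argparse

parser = argparse.ArgumentParser()
# Before (buggy): --highlight-elements with action='store_true' defaulted to
# False, contradicting the function default of True.
# After: an opt-out flag keeps highlighting on unless explicitly disabled.
parser.add_argument(
    "--no-highlight-elements",
    dest="highlight_elements",
    action="store_false",
    help="Disable highlighting of interactive elements (enabled by default)",
)

print(parser.parse_args([]).highlight_elements)                           # True
print(parser.parse_args(["--no-highlight-elements"]).highlight_elements)  # False
```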
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation
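The workflow side of this might be wired roughly as follows. This is a hypothetical sketch, not the actual `eval.yaml`: the input name, step layout, and entry-point path are assumptions, and the flag logic shown matches the later opt-out fix (the flag is added only when highlighting is turned off).

```yaml
# Hypothetical sketch of the eval.yaml input and flag wiring.
on:
  workflow_dispatch:
    inputs:
      highlight_elements:
        description: "Highlight interactive elements during automation"
        type: boolean
        default: true

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - name: Run evaluation
        run: |
          FLAGS=""
          if [ "${{ inputs.highlight_elements }}" = "false" ]; then
            FLAGS="--no-highlight-elements"
          fi
          python eval/service.py $FLAGS
```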
- Removed duplicate lmnr dependency from the `dependencies` section in `pyproject.toml`.
- Updated `lmnr` version to `0.6.11` in the `eval` extras group.
- Renamed the evaluation workflow from "Run Evaluation Script" to "Run Laminar Eval Script" for clarity.
- Adjusted the dependency installation command in the workflow to include the `--extra eval` flag.
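After these changes, the relevant `pyproject.toml` sections would look roughly like this; the project name and surrounding layout are placeholders, with only the `lmnr==0.6.11` pin and the `eval` extras group taken from the bullets above.

```toml
# Hypothetical sketch: lmnr appears only once, pinned in the eval extras group.
[project]
name = "example-project"
dependencies = [
    # lmnr removed from here to avoid duplicating the eval extra below
]

[project.optional-dependencies]
eval = [
    "lmnr==0.6.11",
]
```

The workflow then pulls in the pinned version via the extras flag, e.g. `uv sync --extra eval` (assuming the installation command is uv-based, as the `--extra` flag suggests).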