Commit Graph

232 Commits

Author SHA1 Message Date
Magnus Müller
d4a29c4b93 Improves evaluation robustness and reporting
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.

The changes include:

- Adds a fallback from the comprehensive judge to the Mind2Web judge, ensuring backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
2025-06-23 00:08:14 +02:00
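The local summary calculation this commit describes might look like the following minimal sketch (the function name and result fields are assumptions, not the repository's actual code):

```python
# Hypothetical sketch of a local evaluation summary, assuming each result
# is a dict with a boolean "success" and a numeric "score" field.
def calculate_local_summary(results: list[dict]) -> dict:
    total = len(results)
    successes = sum(1 for r in results if r.get("success"))
    scores = [r.get("score", 0.0) for r in results]
    return {
        "total_tasks": total,
        "success_rate": successes / total if total else 0.0,
        "average_score": sum(scores) / total if total else 0.0,
    }

# Example: two tasks, one passed.
print(calculate_local_summary([
    {"success": True, "score": 0.9},
    {"success": False, "score": 0.1},
]))
```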
Magnus Müller
be16ff3f69 Implement comprehensive judge system for task evaluation
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs and provides detailed structured feedback.

- Updated the evaluation workflow in `eval.yaml` with a new command-line argument for using the comprehensive judge.
- Modified `service.py` to integrate the new judge system, with fallback to the original Mind2Web evaluation when specified.
- Enhanced error handling and logging throughout the evaluation process.
2025-06-22 22:43:57 +02:00
Nick Sweeting
ac22e6ae20 Test fixes, eventbus tweaks, docs updates, and better warnings (#2027) 2025-06-21 06:32:11 -07:00
Nick Sweeting
0af8c8c0fe imports 2025-06-21 06:29:10 -07:00
Nick Sweeting
eb21d92d34 include extras packages in CI to avoid missing-import errors 2025-06-21 06:23:23 -07:00
Magnus Müller
aeea1788fa fix: CLI argument default conflict for highlight_elements
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix the CLI argument defaulting to False when the flag is not provided, which conflicted with the function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements

Resolves bug where CLI users got highlighting disabled by default instead of enabled
2025-06-21 12:52:40 +02:00
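The flag conflict this commit fixes can be illustrated with a minimal `argparse` sketch (the argument names mirror the commit message; the surrounding parser setup is an assumption):

```python
import argparse

# Buggy version: --highlight-elements with action='store_true' defaults to
# False when the flag is omitted, contradicting the function default of True.
buggy = argparse.ArgumentParser()
buggy.add_argument('--highlight-elements', action='store_true')
assert buggy.parse_args([]).highlight_elements is False  # wrong default

# Fixed version: --no-highlight-elements with action='store_false' keeps
# highlighting enabled by default and lets users opt out explicitly.
fixed = argparse.ArgumentParser()
fixed.add_argument('--no-highlight-elements', dest='highlight_elements',
                   action='store_false')
assert fixed.parse_args([]).highlight_elements is True   # enabled by default
assert fixed.parse_args(['--no-highlight-elements']).highlight_elements is False
```

With `action='store_false'`, argparse sets the default to `True` automatically, matching the "enabled by default, opt out via flag" behavior the commit describes.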
Magnus Müller
9292e6c48d feat: add highlight_elements flag for controlling element highlighting
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improve debugging and visibility options for browser automation
2025-06-21 12:40:41 +02:00
Magnus Müller
f1d5dc5a17 Pass laminar_eval_id from frontend 2025-06-21 09:31:14 +02:00
Magnus Müller
83d92513a4 Monitor eval cpu 2025-06-20 23:35:56 +02:00
mertunsall
76ef41da45 Merge branch 'main' into mert/new_everything 2025-06-19 14:11:56 +02:00
Magnus Müller
f89a97f256 Delete laminar files since they are integrated into the main eval service 2025-06-19 11:26:53 +02:00
Magnus Müller
0b6ebea431 Merge branch 'main' into mert/new_everything 2025-06-19 09:49:43 +02:00
Nick Sweeting
829eafe982 Merge branch 'main' into eventbus 2025-06-18 10:26:50 -07:00
Magnus Müller
f89a1aac84 Update evaluation workflow to use laminar_eval.py
- Changed the script executed in the evaluation workflow from `eval/service.py` to `eval/laminar_eval.py` for consistency with recent updates.
2025-06-18 16:43:06 +02:00
Magnus Müller
7f1d256964 Update dependencies in pyproject.toml and rename evaluation workflow
- Removed duplicate lmnr dependency from the `dependencies` section in `pyproject.toml`.
- Updated `lmnr` version to `0.6.11` in the `eval` extras group.
- Renamed the evaluation workflow from "Run Evaluation Script" to "Run Laminar Eval Script" for clarity.
- Adjusted the dependency installation command in the workflow to include the `--extra eval` flag.
2025-06-18 16:31:00 +02:00
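The dependency layout this commit describes might look roughly like the following `pyproject.toml` sketch (the surrounding entries and the exact version specifier are assumptions beyond what the message states):

```toml
[project]
dependencies = [
    # lmnr removed here to avoid duplicating the eval extras entry
]

[project.optional-dependencies]
eval = [
    "lmnr==0.6.11",
]
```

In CI the extras group would then be pulled in via the installation command's `--extra eval` flag, as the workflow adjustment notes.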
Robert Kim
01a36a587e gh action for laminar evals 2025-06-18 14:38:27 +01:00
mertunsall
1864e52635 Merge branch 'main' into mert/new_everything 2025-06-18 10:00:25 +02:00
Nick Sweeting
d0ec528802 fixed events 2025-06-17 22:38:59 -07:00
Nick Sweeting
ec32fee074 Merge branch 'main' into eventbus 2025-06-17 15:25:24 -07:00
Nick Sweeting
a114613c09 ignore formatting changes commits 2025-06-17 14:13:34 -07:00
Magnus Müller
c66880a8fa Add lmnr package for tracing integration and update eval workflow 2025-06-17 18:58:56 +02:00
mertunsall
0c34c399db Merge branch 'main' into mert/new_everything 2025-06-17 11:19:06 +02:00
Nick Sweeting
e3d21d33a1 fix evaluate_tasks.py errors in CI 2025-06-16 17:20:27 -07:00
Nick Sweeting
3972f335f9 allow claude github action to use more tools 2025-06-16 15:47:18 -07:00
Nick Sweeting
cdf0706f20 only create pre-release versions manually 2025-06-12 10:49:04 -07:00
Nick Sweeting
06488e11ba fix clickaction error handling test 2025-06-11 00:05:58 -07:00
Nick Sweeting
1fd8e0ec92 try statuses-write option 2025-06-10 23:57:41 -07:00
Nick Sweeting
4d8f1342e8 try and revert evals change 2025-06-10 23:57:41 -07:00
Nick Sweeting
ffd36eb5da tweak env vars for CI 2025-06-10 03:58:37 -07:00
Nick Sweeting
4d95a5db75 cleanup project root dir a bit 2025-06-09 22:34:34 -07:00
Magnus Müller
061f2cf089 Increase timeout duration in evaluation workflow from 60 to 360 minutes to accommodate longer-running jobs. 2025-06-09 20:51:11 +02:00
Magnus Müller
fe64e35b3c Add SERPER_API_KEY to evaluation workflow for enhanced API integration 2025-06-09 12:25:50 +02:00
Magnus Müller
b1af578bf3 Add SERPER_API_KEY to evaluation workflow for enhanced API integration 2025-06-09 11:58:34 +02:00
Alezander9
739a3b2c87 add flag to toggle whether webjudge gets to see final agent result 2025-06-08 17:49:05 -07:00
Alezander9
31dcce5390 decrease workflow timeout 2025-06-07 22:08:29 -07:00
Alezander9
c04710ef61 Merge remote-tracking branch 'upstream/main' into parallel-evals 2025-06-07 21:48:47 -07:00
Alezander9
bc56c73fe7 add ability to append results to existing run, so we can parallelize 2025-06-07 21:47:47 -07:00
Magnus Müller
eaab9f04d7 Enhance GitHub Actions workflow and evaluate_tasks.py to include detailed task evaluation results. The workflow now captures and displays detailed results in a structured format, and the Python script emits them as JSON for better integration with GitHub Actions, improving visibility into task outcomes during evaluation. 2025-06-07 13:39:16 +02:00
Magnus Müller
3ecee462a2 Update GitHub Actions workflow permissions to allow writing comments on pull requests and issues, enhancing interaction capabilities for automated testing processes. 2025-06-07 13:22:43 +02:00
Magnus Müller
bdf29c34fb Add PR comment functionality to GitHub Actions workflow for agent task evaluation results. This includes a summary of passed tasks, percentage score, and status emoji based on task outcomes, enhancing visibility of evaluation results directly in pull requests. 2025-06-07 13:16:30 +02:00
Magnus Müller
8d9b24b03a Add summary output for agent tasks evaluation in CI workflow 2025-06-07 11:27:06 +02:00
Magnus Müller
52c60f8367 Remove dependency on tests job in evaluate-tasks step of CI workflow 2025-06-07 11:11:23 +02:00
Magnus Müller
576519ee40 Enhance CI workflow by adding agent tasks evaluation step and updating evaluate_tasks.py to output evaluation results 2025-06-07 10:59:24 +02:00
Magnus Müller
3666d2b077 Add agent tasks evaluation script and update CI workflow to include it 2025-06-07 10:49:03 +02:00
Magnus Müller
074169f80f Merge branch 'main' into tests/eval 2025-06-07 09:15:41 +02:00
Nick Sweeting
4358924964 only install chromium in tests 2025-06-06 19:32:20 -07:00
Magnus Müller
42dac3dce1 Add API key environment variables to GitHub Actions workflow for enhanced test capabilities 2025-06-07 01:11:48 +02:00
Nick Sweeting
8504bc4c7b parallelize playwright tests using loop scope=session and pytest-xdist 2025-06-06 02:03:25 -07:00
Alezander9
61f9c31a3d feat: support changing eval task set 2025-06-03 10:27:34 -07:00
BroskyBrowser
15cb992618 fix: expose ANCHOR_BROWSER_API_KEY variable in the evals workflow .yaml 2025-06-01 12:58:14 +02:00