- Fixes single task cloud execution to properly use Convex run ID
- Removes fragile local run ID generation that caused validation errors
- Ensures single task results are saved to correct run in database
- Maintains backward compatibility for local single task runs
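A minimal sketch of the intended ID resolution, assuming the server response exposes a `runId` field (the real field name and flow live in service.py):

```python
import uuid

def resolve_run_id(server_response: dict | None) -> str:
    """Prefer the Convex-assigned run ID; generate a local ID only for
    purely local single-task runs (names here are illustrative)."""
    if server_response and server_response.get("runId"):
        return server_response["runId"]  # authoritative server-side run
    return f"local-{uuid.uuid4().hex}"   # backward-compatible local run
```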
- Eliminated the branch argument from both eval.yaml and service.py for single task mode, simplifying argument parsing.
- Updated related logic to ensure backward compatibility while maintaining functionality for task ID, text, and website.
- Enhanced environment variable loading for improved clarity and consistency.
- Introduced parameters for single task mode in eval.yaml, allowing task ID, text, website, and branch to be specified.
- Updated service.py to handle single task mode, including conditional saving to the server and local run ID generation.
- Enhanced argument parsing to accommodate single task mode, ensuring backward compatibility with existing multi-task functionality.
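For illustration, the single-task arguments might be wired roughly like this (flag names are assumptions; the real definitions are in service.py):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--task-id", help="ID of a single task to run")
parser.add_argument("--task-text", help="Inline description of the task")
parser.add_argument("--task-website", help="Website the task starts on")
parser.add_argument("--branch", help="Branch to record with the run")
args = parser.parse_args()

# Single task mode is inferred from the presence of a task ID; the
# multi-task path is untouched for backward compatibility.
single_task_mode = args.task_id is not None
```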
- Clarify that tasks producing no output must receive low scores
- Improve evaluation consistency for incomplete task scenarios
- Ensure judges properly penalize agents that fail to produce results
This maintains evaluation accuracy by stating explicitly that a lack of output is a significant failure condition and must be reflected in a low score.
- Update reformat_agent_history to accept and save last_message parameter
- Ensure last_message flows from run_agent_with_browser to comprehensive judge
- Add last_message to result.json structure for judge evaluation
- Judge now has complete context including agent's final reasoning
- Add last_message parameter to comprehensive_judge and judge_with_retry functions
- Include agent's last message in evaluation context for better assessment
- Add instruction not to evaluate for hallucination, since the agent has more information than the judge sees
- Update function signatures and docstrings with proper parameter documentation
- Remove dependency on action results since they're now in last_message
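A sketch of how the parameter threads through, with simplified signatures (the real functions take more arguments, and `call_judge_llm` is a stand-in for the actual LLM helper):

```python
async def comprehensive_judge(task: str, history: str, last_message: str | None = None) -> dict:
    """Judge a run; last_message carries the agent's final reasoning."""
    context = history
    if last_message:
        # The final message replaces the old dependency on action results.
        context += f"\n\nAgent's last message:\n{last_message}"
    return await call_judge_llm(task, context)  # assumed LLM helper

async def judge_with_retry(task: str, history: str, last_message: str | None = None, retries: int = 3) -> dict:
    """Retry wrapper that forwards last_message unchanged."""
    for attempt in range(retries):
        try:
            return await comprehensive_judge(task, history, last_message)
        except Exception:
            if attempt == retries - 1:
                raise
    raise RuntimeError("unreachable")  # for type-checkers
```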
- Add new error categories for output format and data extraction issues
- Update system prompt to clarify browser_state contains readable text
- Remove misaligned focus on extract_structured_data usage
- Improve evaluation criteria for browser-use specific behaviors
- Streamline error categories and remove unused ones
- Better alignment with agent's actual capabilities and expected behaviors
- Fixes an AttributeError where the evaluation called the non-existent model.invoke() method
- Changes the call to model.ainvoke(), the correct async method for browser-use LLM models
- Updates response handling from .content to .completion to match the browser-use format (see the sketch below)
- Fixes Convex validation error where onlineMind2WebEvaluationJudgement was null
- Affected functions: judge_task_result, identify_key_points, judge_image
This resolves the 'Failed to save task result' errors in evaluation pipeline.
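The fix in miniature, using one of the affected functions as a stand-in (the real signature differs):

```python
async def judge_image(model, messages) -> str:
    # Before: raised AttributeError — browser-use LLM models have no .invoke():
    #   response = model.invoke(messages)
    #   text = response.content
    # After: the correct async call and response field for browser-use models:
    response = await model.ainvoke(messages)
    return response.completion
```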
Refines the system prompt in judge_system.py by improving the context about the browser-use agent and updating evaluation criteria for better clarity. Adjusts the JSON response structure to reflect changes in task satisfaction and trajectory quality metrics.
Improves readability by restructuring the kwargs dictionary for the 'anthropic' case in the get_llm function, ensuring consistent formatting and clarity.
- Enhanced prepare_agent_steps function to properly process complete history
- Removed commented-out logic for excluding 'done' actions
- Improved trajectory quality assessment and evaluation logic
- Better handling of agent execution steps for comprehensive judging
Adds 'comprehensiveJudgeEvaluationCriticalIssues' to the TaskResult payload, allowing for better tracking and reporting of critical issues identified during evaluations.
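Sketched as a single payload assignment (the surrounding TaskResult code and the judge-output shape are assumptions):

```python
# Assumed judge output shape, for illustration only.
comprehensive_eval = {"critical_issues": ["checkout form never submitted"]}
payload: dict = {}

# Critical issues are stringified so the Convex payload stays JSON-friendly.
payload["comprehensiveJudgeEvaluationCriticalIssues"] = [
    str(issue) for issue in comprehensive_eval.get("critical_issues", [])
]
```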
Updates the judge system to always mention the error being addressed before providing improvement tips. Enhances clarity in the improvement tips by specifying the context of each suggestion, ensuring better guidance for users.
Removes the `fresh_start` option and the stage for loading existing results.
This change streamlines the evaluation pipeline by removing the option to load existing results. The pipeline now always executes from the browser setup stage, ensuring consistent and repeatable evaluation runs.
Updates the TaskResult class to improve the processing of comprehensive evaluation data. The changes include:
- Introduces a fallback that retrieves 'comprehensive_judge' when 'comprehensive_evaluation' is not present.
- Converts enum lists for 'task_categories' and 'error_categories' to string lists for better database compatibility.
- Updates payload to include the processed categories and errors, ensuring proper data structure for evaluation results.
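A sketch of the enum conversion (field names are assumptions):

```python
from enum import Enum

def enum_list_to_strings(values: list) -> list[str]:
    """Convert enum members to their values so the lists are
    database-compatible; plain strings pass through unchanged."""
    return [v.value if isinstance(v, Enum) else str(v) for v in values]

# Usage inside TaskResult, with assumed field names:
# payload["taskCategories"] = enum_list_to_strings(data["task_categories"])
# payload["errorCategories"] = enum_list_to_strings(data["error_categories"])
```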
Renames the 'laminar_link' key to 'laminarTaskLink' in the TaskResult dictionary for consistency.
Adds an async decorator and docstring to the comprehensive judge fallback function to improve code clarity and maintainability.
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.
The changes include:
- Adds comprehensive judge fallback to Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
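The local summary function might look roughly like this (result keys are assumptions):

```python
def calculate_local_summary(results: list[dict]) -> dict:
    """Summarize local evaluation results: total tasks, success rate,
    and average score."""
    total = len(results)
    successes = sum(1 for r in results if r.get("success"))
    total_score = sum(r.get("score", 0.0) for r in results)
    return {
        "total_tasks": total,
        "success_rate": successes / total if total else 0.0,
        "average_score": total_score / total if total else 0.0,
    }
```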
Adds a type assertion to ensure that the payload remains a dictionary after serialization.
Also, adds type hints to `make_json_serializable` for better code clarity and maintainability.
Adds a utility function to convert objects within a payload to JSON-serializable types before returning the task result.
This change addresses potential issues where the task result contains non-serializable objects (e.g., enums, custom objects), preventing proper data handling.
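A minimal sketch of the serializer and the assertion described above (the actual implementation may cover more types):

```python
from enum import Enum
from typing import Any

def make_json_serializable(obj: Any) -> Any:
    """Recursively convert enums, containers, and custom objects
    into JSON-safe types."""
    if isinstance(obj, Enum):
        return obj.value
    if isinstance(obj, dict):
        return {k: make_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [make_json_serializable(v) for v in obj]
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    return str(obj)  # last-resort fallback for custom objects

# payload = make_json_serializable(payload)
# assert isinstance(payload, dict)  # the type assertion mentioned above
```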
Updates the import path for the comprehensive judge system to reflect its new location in the project structure.
This resolves an issue where the previous relative import was causing import errors.
Ensures the judge system can correctly parse LLM responses, accommodating both string and list content types.
Adds a fallback mechanism to guarantee a result even if maximum retry attempts are exceeded, enhancing robustness and type safety.
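A sketch of the content normalization; the block shape is an assumption about the LLM client's response format, and the retry fallback simply returns a default judgement once attempts are exhausted:

```python
def extract_text(content: str | list) -> str:
    """Normalize LLM response content that may be a plain string or a
    list of content blocks."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and "text" in block:
            parts.append(block["text"])
    return "\n".join(parts)
```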
Fixes a relative import issue for the judge system.
Updates type hints to allow None values for laminar_link and critical_error.
Comments out unused code related to Laminar link updates.
- Added a new judge system in judge_system.py that evaluates browser-use agent runs and provides detailed, structured feedback.
- Updated the evaluation workflow in eval.yaml with a new command-line argument for using the comprehensive judge.
- Modified service.py to integrate the new judge system, with fallback to the original Mind2Web evaluation if specified.
- Enhanced error handling and logging throughout the evaluation process.
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements
Resolves bug where CLI users got highlighting disabled by default instead of enabled
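One idiomatic way to express the corrected flag (the dest name is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
# store_false with an explicit default keeps highlighting on unless the
# user passes --no-highlight-elements, matching the function default.
parser.add_argument(
    "--no-highlight-elements",
    dest="highlight_elements",
    action="store_false",
    default=True,
    help="Disable highlighting of interactive elements (enabled by default)",
)

args = parser.parse_args([])
assert args.highlight_elements is True  # on by default
```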
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation