- Added support for dynamic runner ID generation that aligns with GitHub Actions patterns, incorporating start index from environment variables.
- Updated the evaluation script to send detailed progress updates, including task range and total assigned tasks, to the tracking API.
- Improved error handling and logging for runner registration and completion updates to ensure reliability during evaluations.
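The runner-ID scheme described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the environment variable name `RUNNER_START_INDEX` and the `runner-<n>` format are assumptions.

```python
import os

def make_runner_id(index: int) -> str:
    """Generate a GitHub-Actions-style runner ID for a matrix job.

    The start index comes from an environment variable (name assumed
    here), so a batch of runners can be offset without code changes.
    """
    start = int(os.environ.get("RUNNER_START_INDEX", "0"))
    return f"runner-{start + index}"
```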
- Moved the formatting and evaluation stages outside the browser session block to ensure they are executed regardless of session state.
- Updated error handling for evaluation and server save stages to maintain consistent logging and task result management.
- Ensured that server save attempts are always made, improving reliability in task completion.
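The control flow above can be sketched with stub functions. All names here (`browser_session`, `format_result`, `evaluate`, `run_task`) are illustrative stand-ins; the point is only the structure: formatting, evaluation, and the server-save attempt sit outside the session block, so they run whether or not the session survived.

```python
from contextlib import contextmanager

@contextmanager
def browser_session():
    # Stand-in for a real browser session; may raise mid-task.
    yield object()

def format_result(result):
    return {"formatted": result}

def evaluate(formatted):
    return {"score": 1.0 if "error" not in str(formatted) else 0.0}

def run_task(task):
    result = None
    try:
        with browser_session() as session:
            result = f"ran {task}"
    except Exception as exc:
        result = {"error": str(exc)}
    # Formatting and evaluation happen outside the session block,
    # so they execute regardless of session state.
    evaluation = evaluate(format_result(result))
    try:
        # The server save is always attempted; failures are logged,
        # not re-raised, so the task result is never silently lost.
        print("saving", evaluation)
    except Exception as exc:
        print("save failed:", exc)
    return evaluation
```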
- Renamed and reorganized error categories for better readability and understanding.
- Added new error categories: LOGIN_FAILED, LLM_CALL_ERROR, WAIT_TOO_SHORT, BROWSER_CRASHES, and ELEMENT_INTERACTION_ERROR.
- Removed deprecated categories and consolidated related ones.
- Updated documentation to clarify the purpose of each error category and provided actionable improvement tips.
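The new categories could be collected in an enum along these lines. Only the five member names come from the changelog; the string values and the per-member comments are illustrative.

```python
from enum import Enum

class ErrorCategory(str, Enum):
    """Failure categories observed during evaluation runs."""
    LOGIN_FAILED = "login_failed"                 # auth step never completed
    LLM_CALL_ERROR = "llm_call_error"             # model API call failed
    WAIT_TOO_SHORT = "wait_too_short"             # agent acted before page settled
    BROWSER_CRASHES = "browser_crashes"           # browser process died mid-task
    ELEMENT_INTERACTION_ERROR = "element_interaction_error"  # click/type on element failed
```

Using `str` as a mixin keeps the categories JSON-serializable without custom encoders.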
- Add a `--no-thinking` flag to disable thinking in the agent system prompt.
- The default is true (thinking enabled) for backward compatibility.
- Pass the thinking parameter through the entire evaluation pipeline.
- Update the GitHub Actions workflow to handle the thinking parameter.
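A flag with this default/negation shape is typically wired up with `argparse`'s `store_false` action; a minimal sketch (the `dest` name `thinking` is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
# --no-thinking stores False into `thinking`; the default of True keeps
# the previous behavior (thinking enabled) for backward compatibility.
parser.add_argument(
    "--no-thinking",
    dest="thinking",
    action="store_false",
    default=True,
    help="Disable the thinking section in the agent system prompt",
)

args_default = parser.parse_args([])               # thinking enabled
args_disabled = parser.parse_args(["--no-thinking"])
```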
- Simplified the extraction of JudgeResult from the model's response by removing unnecessary error handling and logging for unexpected response structures.
- Valid responses are now returned directly without additional checks, streamlining the evaluation process.
These modifications aim to enhance the clarity and efficiency of the evaluation logic.
- Improved logging to capture the structure of unexpected responses from the model.
- Updated response handling to correctly extract the JudgeResult from the completion attribute.
- Enhanced fallback mechanism for invalid response structures, providing clearer error messages.
These changes aim to increase robustness and clarity in the evaluation process.
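The extraction-with-fallback described above can be sketched as below. The attribute name `completion` comes from the changelog; the function name and return convention are assumptions.

```python
import logging

logger = logging.getLogger(__name__)

def extract_judge_result(response):
    """Pull the parsed JudgeResult off a model response.

    The structured result is expected on the `completion` attribute;
    anything else is logged with its type so unexpected response
    shapes are easy to diagnose, then handled via the fallback path.
    """
    completion = getattr(response, "completion", None)
    if completion is not None:
        return completion
    # Fallback: capture the structure of the unexpected response.
    logger.error("Unexpected response structure: %r", type(response))
    return None
```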
- Replaced dataclass ScoreBreakdown with Pydantic's BaseModel for JudgeResult.
- Updated scoring guidelines to reflect percentage-based final scores.
- Removed unnecessary fields and improved JSON response structure.
- Enhanced error handling and logging for evaluation failures.
- Simplified parsing logic for structured responses from the model.
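The move from a dataclass to a Pydantic model might look like this. Only the class name `JudgeResult` and the percentage-based final score come from the changelog; the field names and bounds are illustrative.

```python
from pydantic import BaseModel, Field

class JudgeResult(BaseModel):
    # Replaces the old ScoreBreakdown dataclass. A BaseModel gives
    # validation and structured JSON parsing for free.
    final_score: float = Field(ge=0, le=100, description="Percentage score")
    reasoning: str = ""

# Structured responses from the model parse directly, no hand-rolled logic:
result = JudgeResult.model_validate_json('{"final_score": 85, "reasoning": "ok"}')
```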
- Add required imports (Optional, Union, etc.) to module namespace before exec()
- Use _types_namespace parameter in model_rebuild() to resolve forward references
- Fix handling of JSON schema union types like ['string', 'null']
- Now properly supports schemas with Optional fields and union types
This resolves the 'Optional is not defined' error when using structured
output schemas with nullable fields in the evaluation system.
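The three fixes above can be sketched together. The schema source, model name, and helper function are hypothetical; the key moves are seeding the `exec()` namespace with the typing imports and passing the same namespace to `model_rebuild(_types_namespace=...)`.

```python
from typing import Any, Dict, List, Optional, Union

# Names that dynamically generated model code must be able to resolve.
types_ns: Dict[str, Any] = {
    "Optional": Optional, "Union": Union,
    "List": List, "Dict": Dict, "Any": Any,
}

def json_type_to_annotation(json_type):
    """Map a JSON-schema 'type' entry to a Python annotation.

    Handles union lists such as ['string', 'null'] -> Optional[str].
    """
    table = {"string": str, "integer": int, "number": float, "boolean": bool}
    if isinstance(json_type, list):
        non_null = [t for t in json_type if t != "null"]
        inner = (table[non_null[0]] if len(non_null) == 1
                 else Union[tuple(table[t] for t in non_null)])
        return Optional[inner] if "null" in json_type else inner
    return table[json_type]

# Hypothetical generated model source with an Optional field.
source = (
    "from pydantic import BaseModel\n"
    "class Answer(BaseModel):\n"
    "    text: str\n"
    "    note: Optional[str] = None\n"
)
namespace = dict(types_ns)          # imports available before exec()
exec(source, namespace)             # 'Optional' now resolves correctly
Answer = namespace["Answer"]
# Resolve any remaining forward references against the same namespace.
Answer.model_rebuild(_types_namespace=types_ns)
```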