Commit Graph

138 Commits

Author SHA1 Message Date
Magnus Müller
772bc063a2 fix: use provided run_id in single task mode instead of generating local ID
- Fixes single task cloud execution to properly use Convex run ID
- Removes fragile local run ID generation that caused validation errors
- Ensures single task results are saved to correct run in database
- Maintains backward compatibility for local single task runs
2025-06-27 11:40:54 +02:00
Magnus Müller
3de686d6af refactor: remove branch argument from single task mode in evaluation workflow
- Eliminated the branch argument from both eval.yaml and service.py for single task mode, simplifying argument parsing.
- Updated related logic to ensure backward compatibility while maintaining functionality for task ID, text, and website.
- Enhanced environment variable loading for improved clarity and consistency.
2025-06-27 10:11:21 +02:00
Magnus Müller
bb11c7e7ca feat: add single task mode support in evaluation workflow
- Introduced parameters for single task mode in eval.yaml, allowing task ID, text, website, and branch to be specified.
- Updated service.py to handle single task mode, including conditional saving to the server and local run ID generation.
- Enhanced argument parsing to accommodate single task mode, ensuring backward compatibility with existing multi-task functionality.
2025-06-27 09:57:37 +02:00
Magnus Müller
e004a7efd8 Add explicit scoring criteria for tasks with no output
- Clarify that tasks producing no output must receive low scores
- Improve evaluation consistency for incomplete task scenarios
- Ensure judges properly penalize agents that fail to produce results

This helps maintain evaluation accuracy by explicitly stating that
lack of output is a significant failure condition that should be
reflected in low scoring.
2025-06-26 15:34:01 +02:00
Magnus Müller
4a4336e1a1 Improve judge system evaluation criteria
- Remove verbose architecture documentation comments
- Add guideline to avoid file system usage for short tasks
- Add stricter evaluation criteria for blocked/impossible tasks
- Add criteria for bad trajectories and tool usage
- Add requirement for task completion scoring
- Add trajectory optimization considerations
2025-06-26 15:01:10 +02:00
Magnus Müller
8121110caa Include max_steps_reached 2025-06-25 16:08:57 +02:00
Magnus Müller
c2fa3afd66 Improve system prompt 2025-06-25 15:47:38 +02:00
Magnus Müller
5abd4ee004 Limit number of input steps to judge 2025-06-25 15:17:08 +02:00
Magnus Müller
d7db550ce8 Update the system prompt for improvement tips 2025-06-25 11:43:26 +02:00
Magnus Müller
9132114dc8 Improve judge system prompt for browser_state 2025-06-25 11:23:31 +02:00
Magnus Müller
58c3b25cfd Complete last_message flow to comprehensive judge
- Update reformat_agent_history to accept and save last_message parameter
- Ensure last_message flows from run_agent_with_browser to comprehensive judge
- Add last_message to result.json structure for judge evaluation
- Judge now has complete context including agent's final reasoning
2025-06-25 10:34:50 +02:00
Magnus Müller
06c4af0622 Include last message in judge 2025-06-25 10:14:14 +02:00
Magnus Müller
1fb67ac676 Add last_message parameter and anti-hallucination instruction to judge
- Add last_message parameter to comprehensive_judge and judge_with_retry functions
- Include agent's last message in evaluation context for better assessment
- Add instruction to not evaluate for hallucination since agent has more info
- Update function signatures and docstrings with proper parameter documentation
- Remove dependency on action results since they're now in last_message
2025-06-25 09:51:25 +02:00
Magnus Müller
a65995846a System-prompt 2025-06-25 09:23:53 +02:00
Magnus Müller
1b51780428 Improve judge system alignment with browser-use architecture
- Add new error categories for output format and data extraction issues
- Update system prompt to clarify browser_state contains readable text
- Remove misaligned focus on extract_structured_data usage
- Improve evaluation criteria for browser-use specific behaviors
- Streamline error categories and remove unused ones
- Better alignment with agent's actual capabilities and expected behaviors
2025-06-25 08:36:20 +02:00
Magnus Müller
9b491e9578 Fix Mind2Web judge evaluation: Replace model.invoke() with model.ainvoke()
- Fixed AttributeError where evaluation was calling non-existent model.invoke() method
- Changed to use model.ainvoke() which is the correct async method for browser-use LLM models
- Updated response handling from .content to .completion to match browser-use format
- Fixes Convex validation error where onlineMind2WebEvaluationJudgement was null
- Affected functions: judge_task_result, identify_key_points, judge_image

This resolves the 'Failed to save task result' errors in evaluation pipeline.
2025-06-25 08:05:29 +02:00
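The change described in 9b491e9578 (awaiting `ainvoke()` and reading `.completion` instead of `.content`) can be sketched roughly as follows. `FakeModel` and `FakeResponse` are hypothetical stand-ins written for this illustration; they are not the browser-use classes, and the real `judge_task_result` signature is not shown in the log.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResponse:
    # Hypothetical response object; only the `.completion` attribute
    # name is taken from the commit message.
    completion: str


class FakeModel:
    # Hypothetical model exposing only an async `ainvoke`, with no sync `invoke`.
    async def ainvoke(self, messages: list[str]) -> FakeResponse:
        return FakeResponse(completion="success")


async def judge_task_result(model: FakeModel, messages: list[str]) -> str:
    # Await the async method, then read `.completion` rather than `.content`.
    response = await model.ainvoke(messages)
    return response.completion


print(asyncio.run(judge_task_result(FakeModel(), ["Was the task completed?"])))
```

Calling `model.invoke(...)` on an object like `FakeModel` raises `AttributeError`, which matches the failure the commit reports.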
Magnus Müller
033418092f Enhance system prompt clarity and evaluation criteria
Refines the system prompt in judge_system.py by improving the context about the browser-use agent and updating evaluation criteria for better clarity. Adjusts the JSON response structure to reflect changes in task satisfaction and trajectory quality metrics.
2025-06-24 22:41:26 +02:00
Magnus Müller
9a11523cd8 Refactor argument formatting in get_llm function
Improves readability by restructuring the kwargs dictionary for the 'anthropic' case in the get_llm function, ensuring consistent formatting and clarity.
2025-06-24 22:34:52 +02:00
Magnus Müller
22ba6bbfdb Improve context about browser-use 2025-06-24 22:02:18 +02:00
Magnus Müller
19f48a75a3 System prompt judge 2025-06-24 21:49:33 +02:00
Magnus Müller
46ba7b79a9 Improve judge system history processing
- Enhanced prepare_agent_steps function to properly process complete history
- Removed commented-out logic for excluding 'done' actions
- Improved trajectory quality assessment and evaluation logic
- Better handling of agent execution steps for comprehensive judging
2025-06-24 14:55:22 +02:00
Gregor Žunič
70e9b271ad changed model_name to model in evals for ChatAnthropic 2025-06-24 14:02:02 +02:00
Gregor Žunič
7a10ae0c96 Squashed commit langchain to native 2025-06-24 12:26:55 +02:00
Magnus Müller
e2325c31c6 Enhances TaskResult to include critical issues in comprehensive evaluation data
Adds 'comprehensiveJudgeEvaluationCriticalIssues' to the TaskResult payload, allowing for better tracking and reporting of critical issues identified during evaluations.
2025-06-23 11:04:29 +02:00
Magnus Müller
2cea68e1f0 Refines error reporting and improvement tips in judge system
Updates the judge system to always mention the error being addressed before providing improvement tips. Enhances clarity in the improvement tips by specifying the context of each suggestion, ensuring better guidance for users.
2025-06-23 10:50:45 +02:00
Magnus Müller
537e86da4c Simplifies evaluation pipeline execution
Removes the `fresh_start` option and the stage for loading existing results.

This change streamlines the evaluation pipeline by removing the option to load existing results. The pipeline now always executes from the browser setup stage, ensuring consistent and repeatable evaluation runs.
2025-06-23 10:11:07 +02:00
Magnus Müller
5c33e0e7f2 Improve-judge prompt 2025-06-23 09:04:44 +02:00
Magnus Müller
e4b03e712e Enhances comprehensive evaluation data handling in TaskResult
Updates the TaskResult class to improve the processing of comprehensive evaluation data. The changes include:

- Introduces fallback to retrieve 'comprehensive_judge' if 'comprehensive_evaluation' is not present.
- Converts enum lists for 'task_categories' and 'error_categories' to string lists for better database compatibility.
- Updates payload to include the processed categories and errors, ensuring proper data structure for evaluation results.
2025-06-23 01:23:29 +02:00
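A minimal sketch of the two behaviors listed in e4b03e712e: falling back to the 'comprehensive_judge' key, and converting enum lists to string lists. The `ErrorCategory` enum and both helper names are assumptions made for illustration; only the dictionary keys come from the commit message.

```python
from enum import Enum
from typing import Any


class ErrorCategory(Enum):  # hypothetical enum, for illustration only
    TIMEOUT = "timeout"
    CAPTCHA = "captcha"


def enum_list_to_strings(items: list[Any]) -> list[str]:
    # Enum members become their values; anything else is stringified.
    return [str(item.value) if isinstance(item, Enum) else str(item) for item in items]


def extract_evaluation(result: dict[str, Any]) -> dict[str, Any]:
    # Prefer 'comprehensive_evaluation', fall back to 'comprehensive_judge'.
    evaluation = (
        result.get("comprehensive_evaluation")
        or result.get("comprehensive_judge")
        or {}
    )
    return {
        "task_categories": enum_list_to_strings(evaluation.get("task_categories", [])),
        "error_categories": enum_list_to_strings(evaluation.get("error_categories", [])),
    }


print(extract_evaluation({"comprehensive_judge": {"error_categories": [ErrorCategory.TIMEOUT]}}))
```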
Magnus Müller
a00076973a Renames laminar link key in task result
Renames the 'laminar_link' key to 'laminarTaskLink' in the
TaskResult dictionary for consistency.

Adds an async decorator and docstring to the comprehensive judge
fallback function to improve code clarity and maintainability.
2025-06-23 00:35:40 +02:00
Magnus Müller
d4a29c4b93 Improves evaluation robustness and reporting
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.

The changes include:

- Adds comprehensive judge fallback to Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
2025-06-23 00:08:14 +02:00
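The local summary mentioned in d4a29c4b93 (total tasks, success rate, average score) might look something like the sketch below. The function name and the `success` and `score` field names are assumptions; the log does not show the actual result structure.

```python
from typing import Any


def summarize_results(results: list[dict[str, Any]]) -> dict[str, float]:
    # Hypothetical result shape: {"success": bool, "score": float}.
    total = len(results)
    if total == 0:
        return {"total_tasks": 0, "success_rate": 0.0, "average_score": 0.0}
    successes = sum(1 for r in results if r.get("success"))
    # Only numeric scores contribute to the average.
    scores = [r["score"] for r in results if isinstance(r.get("score"), (int, float))]
    average = sum(scores) / len(scores) if scores else 0.0
    return {
        "total_tasks": total,
        "success_rate": successes / total,
        "average_score": average,
    }


print(summarize_results([
    {"success": True, "score": 1.0},
    {"success": False, "score": 0.5},
]))
```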
Magnus Müller
be170fb17a Ensures payload serialization preserves dict structure
Adds a type assertion to ensure that the payload remains a dictionary after serialization.

Also, adds type hints to `make_json_serializable` for better code clarity and maintainability.
2025-06-22 23:50:46 +02:00
Magnus Müller
4a26f07c66 Ensures JSON serializability of task results
Adds a utility function to convert objects within a payload to JSON-serializable types before returning the task result.

This change addresses potential issues where the task result contains non-serializable objects (e.g., enums, custom objects), preventing proper data handling.
2025-06-22 23:46:39 +02:00
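One way a utility like the one in 4a26f07c66 could work is a recursive conversion. The function name `make_json_serializable` appears in the later commit be170fb17a; the body below and the `TaskCategory` enum are assumptions, not the repository's implementation.

```python
import json
from enum import Enum
from typing import Any


class TaskCategory(Enum):  # hypothetical enum, for illustration only
    SEARCH = "search"


def make_json_serializable(obj: Any) -> Any:
    # Check enums first so members are replaced by their values.
    if isinstance(obj, Enum):
        return make_json_serializable(obj.value)
    if isinstance(obj, dict):
        return {str(k): make_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [make_json_serializable(v) for v in obj]
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    # Last resort for custom objects: fall back to their string form.
    return str(obj)


payload = make_json_serializable({"categories": [TaskCategory.SEARCH], "score": 0.5})
assert isinstance(payload, dict)  # the payload should stay a dict, as in be170fb17a
print(json.dumps(payload))
```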
Magnus Müller
0a5a29e4a8 Updates judge system import path
Updates the import path for the comprehensive judge system to reflect its new location in the project structure.

This resolves an issue where the previous relative import was causing import errors.
2025-06-22 23:26:56 +02:00
Magnus Müller
eeb8024184 Handles varied LLM response formats
Ensures the judge system can correctly parse LLM responses, accommodating both string and list content types.

Adds a fallback mechanism to guarantee a result even if maximum retry attempts are exceeded, enhancing robustness and type safety.
2025-06-22 23:12:37 +02:00
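Handling both string and list content, as eeb8024184 describes, can be sketched as a small normalizer. The helper name and the assumption that list items are either plain strings or dicts with a "text" key are illustrative only; the log does not show the actual response shapes.

```python
from typing import Any


def content_to_text(content: Any) -> str:
    # Plain string content is returned unchanged.
    if isinstance(content, str):
        return content
    # List content: join string items and the "text" field of dict items.
    if isinstance(content, list):
        parts: list[str] = []
        for item in content:
            if isinstance(item, str):
                parts.append(item)
            elif isinstance(item, dict) and isinstance(item.get("text"), str):
                parts.append(item["text"])
        return "".join(parts)
    # Anything else is stringified so the caller always gets a str.
    return str(content)


print(content_to_text("verdict: pass"))
print(content_to_text(["verdict: ", {"type": "text", "text": "pass"}]))
```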
Magnus Müller
4629a8d9b7 Fixes relative import and type hints
Fixes a relative import issue for the judge system.

Updates type hints to allow None values for laminar_link and critical_error.

Comments out unused code related to Laminar link updates.
2025-06-22 23:07:10 +02:00
Magnus Müller
be16ff3f69 Implement comprehensive judge system for task evaluation
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
2025-06-22 22:43:57 +02:00
Nick Sweeting
4a8a4155b3 try keep alive browsers 2025-06-21 07:23:54 -07:00
Nick Sweeting
ac22e6ae20 Test fixes, eventbus tweaks, docs updates, and better warnings (#2027) 2025-06-21 06:32:11 -07:00
Nick Sweeting
0af8c8c0fe imports 2025-06-21 06:29:10 -07:00
Nick Sweeting
3209fd95f7 lint and hint fixes 2025-06-21 06:07:21 -07:00
Nick Sweeting
aad78d93ab more type hint fixes 2025-06-21 05:44:49 -07:00
Nick Sweeting
6bc1f7985f more type hint fixes 2025-06-21 04:56:27 -07:00
Magnus Müller
aeea1788fa fix: CLI argument default conflict for highlight_elements
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements

Resolves bug where CLI users got highlighting disabled by default instead of enabled
2025-06-21 12:52:40 +02:00
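The flag change in aeea1788fa relies on standard `argparse` behavior: a `store_true` flag defaults to False when omitted, while a `store_false` flag defaults to True. A minimal sketch, assuming a hypothetical parser rather than the actual one in eval/service.py:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="hypothetical eval CLI")
    # With store_false, the destination defaults to True and becomes
    # False only when the flag is passed on the command line.
    parser.add_argument(
        "--no-highlight-elements",
        dest="highlight_elements",
        action="store_false",
        help="Disable highlighting of interactive elements",
    )
    return parser


default_args = build_parser().parse_args([])
disabled_args = build_parser().parse_args(["--no-highlight-elements"])
print(default_args.highlight_elements)   # True
print(disabled_args.highlight_elements)  # False
```

This keeps the CLI default aligned with a function default of True, which is the conflict the commit message describes.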
Magnus Müller
9292e6c48d feat: add highlight_elements flag for controlling element highlighting
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation
2025-06-21 12:40:41 +02:00
Nick Sweeting
d8c6876e08 tweak BrowserSession failure message details 2025-06-21 01:03:09 -07:00
Magnus Müller
9b72be9ea0 Update logging format in service.py to include logger name for better context in log messages. 2025-06-21 09:55:38 +02:00
Magnus Müller
f1d5dc5a17 Pass laminar_eval_id from frontend 2025-06-21 09:31:14 +02:00
Magnus Müller
83d92513a4 Monitor eval cpu 2025-06-20 23:35:56 +02:00
Magnus Müller
4a8cf30dac Merge branch 'main' into mert/new_everything 2025-06-20 12:27:19 +02:00
Magnus Müller
0e5a8942f3 Add gemini-2.5-flash 2025-06-20 12:19:47 +02:00