- Fixes single task cloud execution to properly use Convex run ID
- Removes fragile local run ID generation that caused validation errors
- Ensures single task results are saved to correct run in database
- Maintains backward compatibility for local single task runs
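A minimal sketch of the intended ID resolution, assuming the server response exposes a `runId` field (the real field name and flow live in service.py):

```python
import uuid

def resolve_run_id(server_response: dict | None) -> str:
    """Prefer the Convex-assigned run ID; generate a local ID only for
    purely local single-task runs (names here are illustrative)."""
    if server_response and server_response.get("runId"):
        return server_response["runId"]  # authoritative server-side run
    return f"local-{uuid.uuid4().hex}"   # backward-compatible local run
```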
- Eliminated the branch argument from both eval.yaml and service.py for single task mode, simplifying argument parsing.
- Updated related logic to ensure backward compatibility while maintaining functionality for task ID, text, and website.
- Enhanced environment variable loading for improved clarity and consistency.
- Introduced parameters for single task mode in eval.yaml, allowing task ID, text, website, and branch to be specified.
- Updated service.py to handle single task mode, including conditional saving to the server and local run ID generation.
- Enhanced argument parsing to accommodate single task mode, ensuring backward compatibility with existing multi-task functionality.
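For illustration, the single-task arguments might be wired roughly like this (flag names are assumptions; the real definitions are in service.py):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--task-id", help="ID of a single task to run")
parser.add_argument("--task-text", help="Inline description of the task")
parser.add_argument("--task-website", help="Website the task starts on")
parser.add_argument("--branch", help="Branch to record with the run")
args = parser.parse_args()

# Single task mode is inferred from the presence of a task ID; the
# multi-task path is untouched for backward compatibility.
single_task_mode = args.task_id is not None
```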
- Clarify that tasks producing no output must receive low scores
- Improve evaluation consistency for incomplete task scenarios
- Ensure judges properly penalize agents that fail to produce results
This maintains evaluation accuracy by stating explicitly that a lack of output is a significant failure condition and must be reflected in a low score.
- Update reformat_agent_history to accept and save last_message parameter
- Ensure last_message flows from run_agent_with_browser to comprehensive judge
- Add last_message to result.json structure for judge evaluation
- Judge now has complete context including agent's final reasoning
- Add last_message parameter to comprehensive_judge and judge_with_retry functions
- Include agent's last message in evaluation context for better assessment
- Add instruction not to evaluate for hallucination, since the agent has more information than the judge sees
- Update function signatures and docstrings with proper parameter documentation
- Remove dependency on action results since they're now in last_message
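A sketch of how the parameter threads through, with simplified signatures (the real functions take more arguments, and `call_judge_llm` is a stand-in for the actual LLM helper):

```python
async def comprehensive_judge(task: str, history: str, last_message: str | None = None) -> dict:
    """Judge a run; last_message carries the agent's final reasoning."""
    context = history
    if last_message:
        # The final message replaces the old dependency on action results.
        context += f"\n\nAgent's last message:\n{last_message}"
    return await call_judge_llm(task, context)  # assumed LLM helper

async def judge_with_retry(task: str, history: str, last_message: str | None = None, retries: int = 3) -> dict:
    """Retry wrapper that forwards last_message unchanged."""
    for attempt in range(retries):
        try:
            return await comprehensive_judge(task, history, last_message)
        except Exception:
            if attempt == retries - 1:
                raise
    raise RuntimeError("unreachable")  # for type-checkers
```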
- Add new error categories for output format and data extraction issues
- Update system prompt to clarify browser_state contains readable text
- Remove misaligned focus on extract_structured_data usage
- Improve evaluation criteria for browser-use specific behaviors
- Streamline error categories and remove unused ones
- Better alignment with agent's actual capabilities and expected behaviors
- Fixes an AttributeError where the evaluation called the non-existent model.invoke() method
- Changes the call to model.ainvoke(), the correct async method for browser-use LLM models
- Updates response handling from .content to .completion to match the browser-use format (see the sketch below)
- Fixes Convex validation error where onlineMind2WebEvaluationJudgement was null
- Affected functions: judge_task_result, identify_key_points, judge_image
This resolves the 'Failed to save task result' errors in evaluation pipeline.
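The fix in miniature, using one of the affected functions as a stand-in (the real signature differs):

```python
async def judge_image(model, messages) -> str:
    # Before: raised AttributeError — browser-use LLM models have no .invoke():
    #   response = model.invoke(messages)
    #   text = response.content
    # After: the correct async call and response field for browser-use models:
    response = await model.ainvoke(messages)
    return response.completion
```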
Refines the system prompt in judge_system.py by improving the context about the browser-use agent and updating evaluation criteria for better clarity. Adjusts the JSON response structure to reflect changes in task satisfaction and trajectory quality metrics.
Improves readability by restructuring the kwargs dictionary for the 'anthropic' case in the get_llm function, ensuring consistent formatting and clarity.
- Enhanced prepare_agent_steps function to properly process complete history
- Removed commented-out logic for excluding 'done' actions
- Improved trajectory quality assessment and evaluation logic
- Better handling of agent execution steps for comprehensive judging
Adds 'comprehensiveJudgeEvaluationCriticalIssues' to the TaskResult payload, allowing for better tracking and reporting of critical issues identified during evaluations.
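Sketched as a single payload assignment (the surrounding TaskResult code and the judge-output shape are assumptions):

```python
# Assumed judge output shape, for illustration only.
comprehensive_eval = {"critical_issues": ["checkout form never submitted"]}
payload: dict = {}

# Critical issues are stringified so the Convex payload stays JSON-friendly.
payload["comprehensiveJudgeEvaluationCriticalIssues"] = [
    str(issue) for issue in comprehensive_eval.get("critical_issues", [])
]
```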
Updates the judge system to always mention the error being addressed before providing improvement tips. Enhances clarity in the improvement tips by specifying the context of each suggestion, ensuring better guidance for users.
Removes the `fresh_start` option and the stage for loading existing results.
This change streamlines the evaluation pipeline by removing the option to load existing results. The pipeline now always executes from the browser setup stage, ensuring consistent and repeatable evaluation runs.
Updates the TaskResult class to improve the processing of comprehensive evaluation data. The changes include:
- Introduces a fallback that retrieves 'comprehensive_judge' when 'comprehensive_evaluation' is not present.
- Converts enum lists for 'task_categories' and 'error_categories' to string lists for better database compatibility.
- Updates payload to include the processed categories and errors, ensuring proper data structure for evaluation results.
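A sketch of the enum conversion (field names are assumptions):

```python
from enum import Enum

def enum_list_to_strings(values: list) -> list[str]:
    """Convert enum members to their values so the lists are
    database-compatible; plain strings pass through unchanged."""
    return [v.value if isinstance(v, Enum) else str(v) for v in values]

# Usage inside TaskResult, with assumed field names:
# payload["taskCategories"] = enum_list_to_strings(data["task_categories"])
# payload["errorCategories"] = enum_list_to_strings(data["error_categories"])
```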
Renames the 'laminar_link' key to 'laminarTaskLink' in the TaskResult dictionary for consistency.
Adds an async decorator and docstring to the comprehensive judge fallback function to improve code clarity and maintainability.
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.
The changes include:
- Adds comprehensive judge fallback to Mind2Web judge and ensures backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
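The local summary function might look roughly like this (result keys are assumptions):

```python
def calculate_local_summary(results: list[dict]) -> dict:
    """Summarize local evaluation results: total tasks, success rate,
    and average score."""
    total = len(results)
    successes = sum(1 for r in results if r.get("success"))
    total_score = sum(r.get("score", 0.0) for r in results)
    return {
        "total_tasks": total,
        "success_rate": successes / total if total else 0.0,
        "average_score": total_score / total if total else 0.0,
    }
```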
Adds a type assertion to ensure that the payload remains a dictionary after serialization.
Also, adds type hints to `make_json_serializable` for better code clarity and maintainability.
Adds a utility function to convert objects within a payload to JSON-serializable types before returning the task result.
This change addresses potential issues where the task result contains non-serializable objects (e.g., enums, custom objects), preventing proper data handling.
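A minimal sketch of the serializer and the assertion described above (the actual implementation may cover more types):

```python
from enum import Enum
from typing import Any

def make_json_serializable(obj: Any) -> Any:
    """Recursively convert enums, containers, and custom objects
    into JSON-safe types."""
    if isinstance(obj, Enum):
        return obj.value
    if isinstance(obj, dict):
        return {k: make_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [make_json_serializable(v) for v in obj]
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    return str(obj)  # last-resort fallback for custom objects

# payload = make_json_serializable(payload)
# assert isinstance(payload, dict)  # the type assertion mentioned above
```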
Updates the import path for the comprehensive judge system to reflect its new location in the project structure.
This resolves an issue where the previous relative import was causing import errors.
Ensures the judge system can correctly parse LLM responses, accommodating both string and list content types.
Adds a fallback mechanism to guarantee a result even if maximum retry attempts are exceeded, enhancing robustness and type safety.
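A sketch of the content normalization; the block shape is an assumption about the LLM client's response format, and the retry fallback simply returns a default judgement once attempts are exhausted:

```python
def extract_text(content: str | list) -> str:
    """Normalize LLM response content that may be a plain string or a
    list of content blocks."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and "text" in block:
            parts.append(block["text"])
    return "\n".join(parts)
```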
Fixes a relative import issue for the judge system.
Updates type hints to allow None values for laminar_link and critical_error.
Comments out unused code related to Laminar link updates.
- Added a new judge system in judge_system.py that evaluates browser-use agent runs and provides detailed, structured feedback.
- Updated the evaluation workflow in eval.yaml with a new command-line argument for using the comprehensive judge.
- Modified service.py to integrate the new judge system, with fallback to the original Mind2Web evaluation if specified.
- Enhanced error handling and logging throughout the evaluation process.
- Change from --highlight-elements (action='store_true') to --no-highlight-elements (action='store_false')
- Fix CLI argument defaulting to False when flag not provided, conflicting with function default of True
- Update GitHub workflow to use new flag logic (add flag when highlight_elements=false)
- Ensure consistent behavior: highlighting enabled by default, can be disabled with --no-highlight-elements
Resolves bug where CLI users got highlighting disabled by default instead of enabled
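One idiomatic way to express the corrected flag (the dest name is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
# store_false with an explicit default keeps highlighting on unless the
# user passes --no-highlight-elements, matching the function default.
parser.add_argument(
    "--no-highlight-elements",
    dest="highlight_elements",
    action="store_false",
    default=True,
    help="Disable highlighting of interactive elements (enabled by default)",
)

args = parser.parse_args([])
assert args.highlight_elements is True  # on by default
```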
- Add --highlight-elements CLI argument to eval/service.py
- Pass highlight_elements through entire execution pipeline
- Add highlight_elements support to GitHub workflow (eval.yaml)
- Allow users to control whether interactive elements are highlighted on pages during automation
- Improves debugging and visibility options for browser automation