- Added validation for `START_INDEX` and `TOTAL_TASKS` to ensure they are numeric, falling back to default values to prevent errors.
- Improved logging for task range calculations and runner ID generation, including warnings for non-numeric inputs.
- Enhanced evaluation output handling with comprehensive error capture and logging, ensuring better debugging information is available.
- Implemented checks for the existence of evaluation logs and provided statistics for better visibility into evaluation outcomes.
- Added support for dynamic runner ID generation that aligns with GitHub Actions patterns, incorporating start index from environment variables.
- Updated the evaluation script to send detailed progress updates, including task range and total assigned tasks, to the tracking API.
- Improved error handling and logging for runner registration and completion updates to ensure reliability during evaluations.
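The environment-variable validation and runner-ID generation described above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation; the function name `parse_int_env` and the runner-ID format are assumptions.

```python
import os
import logging

logger = logging.getLogger(__name__)

def parse_int_env(name: str, default: int) -> int:
    """Read an env var as an int, warning and using the default on non-numeric input."""
    raw = os.environ.get(name, "")
    if raw.isdigit():
        return int(raw)
    logger.warning("%s=%r is not numeric; defaulting to %d", name, raw, default)
    return default

start_index = parse_int_env("START_INDEX", 0)
total_tasks = parse_int_env("TOTAL_TASKS", 1)

# Runner ID following a GitHub-Actions-style pattern, incorporating the start index.
runner_id = f"runner-{os.environ.get('GITHUB_RUN_ID', 'local')}-{start_index}"
```

With this shape, a malformed `START_INDEX=abc` logs a warning and the run proceeds with the default instead of crashing during task-range calculation.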
- Changed the runner from Blacksmith to `ubuntu-latest` for improved compatibility.
- Updated the setup-uv action to `astral-sh/setup-uv@v6`.
- Simplified dependency installation steps by removing unnecessary verification and debug outputs.
- Adjusted Playwright version detection and caching actions for better performance.
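A hypothetical excerpt of the updated workflow illustrating the runner and action changes (the job name and surrounding steps are assumptions; only the runner label and the `setup-uv` version come from the changes above):

```yaml
jobs:
  eval:
    runs-on: ubuntu-latest   # previously a Blacksmith runner
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v6
      - run: uv sync
```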
- Added a `--no-thinking` flag to disable thinking in the agent system prompt.
- The default is `true` (thinking enabled) for backward compatibility.
- Passed the thinking parameter through the entire evaluation pipeline.
- Updated the GitHub Actions workflow to handle the thinking parameter.
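The flag wiring could look like the following sketch, using `argparse`'s `store_false` action so that thinking stays enabled unless the flag is passed (the actual CLI may differ):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--no-thinking",
    dest="thinking",
    action="store_false",
    help="Disable thinking in the agent system prompt",
)
# Thinking enabled by default for backward compatibility.
parser.set_defaults(thinking=True)

args = parser.parse_args(["--no-thinking"])
# args.thinking is now False; without the flag it remains True
```

Downstream code can then pass `args.thinking` through the evaluation pipeline unchanged.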
- Eliminated the branch argument from both eval.yaml and service.py for single task mode, simplifying argument parsing.
- Updated related logic to ensure backward compatibility while maintaining functionality for task ID, text, and website.
- Enhanced environment variable loading for improved clarity and consistency.
- Introduced parameters for single task mode in eval.yaml, allowing task ID, text, website, and branch to be specified.
- Updated service.py to handle single task mode, including conditional saving to the server and local run ID generation.
- Enhanced argument parsing to accommodate single task mode, ensuring backward compatibility with existing multi-task functionality.
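The single-task-mode parsing might be sketched like this; the option names, example values, and the local run-ID format are assumptions for illustration:

```python
import argparse
import uuid

parser = argparse.ArgumentParser()
parser.add_argument("--task-id")
parser.add_argument("--task-text")
parser.add_argument("--website")

# Hypothetical single-task invocation.
args = parser.parse_args(
    ["--task-id", "task-001", "--task-text", "Find flights", "--website", "example.com"]
)

single_task_mode = args.task_id is not None
if single_task_mode:
    # In single-task mode, skip saving to the server and generate a local run ID.
    run_id = f"local-{uuid.uuid4().hex[:8]}"
```

When none of the single-task options are given, `single_task_mode` stays `False` and the existing multi-task path runs unchanged, preserving backward compatibility.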
Removes the `fresh_start` option and the stage for loading existing results.
This change streamlines the evaluation pipeline, which now always executes from the browser setup stage, ensuring consistent and repeatable evaluation runs.
Enhances evaluation by improving error handling, providing more detailed logging, and adding a local summary calculation.
The changes include:
- Adds a comprehensive judge fallback to the Mind2Web judge while ensuring backward compatibility.
- Improves error handling during evaluation by capturing and logging the last part of the output on failure.
- Adds a new function to calculate a summary of local evaluation results, displaying total tasks, success rate, and average score.
- Includes comprehensive evaluation data for debugging purposes.
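The local summary calculation could take a shape like the sketch below, aggregating per-task results into total tasks, success rate, and average score (the field names `success` and `score` are assumptions about the result records):

```python
def summarize_results(results: list[dict]) -> dict:
    """Compute a local summary: total tasks, success rate, and average score."""
    total = len(results)
    successes = sum(1 for r in results if r.get("success"))
    avg_score = sum(r.get("score", 0.0) for r in results) / total if total else 0.0
    return {
        "total_tasks": total,
        "success_rate": successes / total if total else 0.0,
        "average_score": avg_score,
    }
```

Guarding the division on `total` keeps the summary well-defined even when no tasks were evaluated.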