- Added support for dynamic runner ID generation that aligns with GitHub Actions patterns, incorporating start index from environment variables.
- Updated the evaluation script to send detailed progress updates, including task range and total assigned tasks, to the tracking API.
- Improved error handling and logging for runner registration and completion updates to ensure reliability during evaluations.
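The runner-ID scheme described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the environment variable name `RUNNER_START_INDEX` and the `runner-<n>` format are assumptions.

```python
import os

def make_runner_id(index: int) -> str:
    """Generate a GitHub-Actions-style runner ID for a matrix job.

    The start index comes from an environment variable (name assumed
    here), so a batch of runners can be offset without code changes.
    """
    start = int(os.environ.get("RUNNER_START_INDEX", "0"))
    return f"runner-{start + index}"
```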
- Moved the formatting and evaluation stages outside the browser session block to ensure they are executed regardless of session state.
- Updated error handling for evaluation and server save stages to maintain consistent logging and task result management.
- Ensured that server save attempts are always made, improving reliability in task completion.
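The control flow above can be sketched with stub functions. All names here (`browser_session`, `format_result`, `evaluate`, `run_task`) are illustrative stand-ins; the point is only the structure: formatting, evaluation, and the server-save attempt sit outside the session block, so they run whether or not the session survived.

```python
from contextlib import contextmanager

@contextmanager
def browser_session():
    # Stand-in for a real browser session; may raise mid-task.
    yield object()

def format_result(result):
    return {"formatted": result}

def evaluate(formatted):
    return {"score": 1.0 if "error" not in str(formatted) else 0.0}

def run_task(task):
    result = None
    try:
        with browser_session() as session:
            result = f"ran {task}"
    except Exception as exc:
        result = {"error": str(exc)}
    # Formatting and evaluation happen outside the session block,
    # so they execute regardless of session state.
    evaluation = evaluate(format_result(result))
    try:
        # The server save is always attempted; failures are logged,
        # not re-raised, so the task result is never silently lost.
        print("saving", evaluation)
    except Exception as exc:
        print("save failed:", exc)
    return evaluation
```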
- Renamed and reorganized error categories for better readability and understanding.
- Added new error categories: LOGIN_FAILED, LLM_CALL_ERROR, WAIT_TOO_SHORT, BROWSER_CRASHES, and ELEMENT_INTERACTION_ERROR.
- Removed deprecated categories and consolidated related ones.
- Updated documentation to clarify the purpose of each error category and provided actionable improvement tips.
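The new categories could be collected in an enum along these lines. Only the five member names come from the changelog; the string values and the per-member comments are illustrative.

```python
from enum import Enum

class ErrorCategory(str, Enum):
    """Failure categories observed during evaluation runs."""
    LOGIN_FAILED = "login_failed"                 # auth step never completed
    LLM_CALL_ERROR = "llm_call_error"             # model API call failed
    WAIT_TOO_SHORT = "wait_too_short"             # agent acted before page settled
    BROWSER_CRASHES = "browser_crashes"           # browser process died mid-task
    ELEMENT_INTERACTION_ERROR = "element_interaction_error"  # click/type on element failed
```

Using `str` as a mixin keeps the categories JSON-serializable without custom encoders.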
- Add a `--no-thinking` flag to disable thinking in the agent system prompt.
- The default is true (thinking enabled) for backward compatibility.
- Pass the thinking parameter through the entire evaluation pipeline.
- Update the GitHub Actions workflow to handle the thinking parameter.
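A flag with this default/negation shape is typically wired up with `argparse`'s `store_false` action; a minimal sketch (the `dest` name `thinking` is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
# --no-thinking stores False into `thinking`; the default of True keeps
# the previous behavior (thinking enabled) for backward compatibility.
parser.add_argument(
    "--no-thinking",
    dest="thinking",
    action="store_false",
    default=True,
    help="Disable the thinking section in the agent system prompt",
)

args_default = parser.parse_args([])               # thinking enabled
args_disabled = parser.parse_args(["--no-thinking"])
```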
- Simplified the extraction of JudgeResult from the model's response by removing unnecessary error handling and logging for unexpected response structures.
- Valid responses are now returned directly without additional checks, streamlining the evaluation process.
These modifications aim to enhance the clarity and efficiency of the evaluation logic.
- Improved logging to capture the structure of unexpected responses from the model.
- Updated response handling to correctly extract the JudgeResult from the completion attribute.
- Enhanced fallback mechanism for invalid response structures, providing clearer error messages.
These changes aim to increase robustness and clarity in the evaluation process.
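The extraction-with-fallback described above can be sketched as below. The attribute name `completion` comes from the changelog; the function name and return convention are assumptions.

```python
import logging

logger = logging.getLogger(__name__)

def extract_judge_result(response):
    """Pull the parsed JudgeResult off a model response.

    The structured result is expected on the `completion` attribute;
    anything else is logged with its type so unexpected response
    shapes are easy to diagnose, then handled via the fallback path.
    """
    completion = getattr(response, "completion", None)
    if completion is not None:
        return completion
    # Fallback: capture the structure of the unexpected response.
    logger.error("Unexpected response structure: %r", type(response))
    return None
```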
- Replaced dataclass ScoreBreakdown with Pydantic's BaseModel for JudgeResult.
- Updated scoring guidelines to reflect percentage-based final scores.
- Removed unnecessary fields and improved JSON response structure.
- Enhanced error handling and logging for evaluation failures.
- Simplified parsing logic for structured responses from the model.
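The move from a dataclass to a Pydantic model might look like this. Only the class name `JudgeResult` and the percentage-based final score come from the changelog; the field names and bounds are illustrative.

```python
from pydantic import BaseModel, Field

class JudgeResult(BaseModel):
    # Replaces the old ScoreBreakdown dataclass. A BaseModel gives
    # validation and structured JSON parsing for free.
    final_score: float = Field(ge=0, le=100, description="Percentage score")
    reasoning: str = ""

# Structured responses from the model parse directly, no hand-rolled logic:
result = JudgeResult.model_validate_json('{"final_score": 85, "reasoning": "ok"}')
```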
- Add required imports (Optional, Union, etc.) to module namespace before exec()
- Use _types_namespace parameter in model_rebuild() to resolve forward references
- Fix handling of JSON schema union types like ['string', 'null']
- Now properly supports schemas with Optional fields and union types
This resolves the 'Optional is not defined' error when using structured
output schemas with nullable fields in the evaluation system.
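The three fixes above can be sketched together. The schema source, model name, and helper function are hypothetical; the key moves are seeding the `exec()` namespace with the typing imports and passing the same namespace to `model_rebuild(_types_namespace=...)`.

```python
from typing import Any, Dict, List, Optional, Union

# Names that dynamically generated model code must be able to resolve.
types_ns: Dict[str, Any] = {
    "Optional": Optional, "Union": Union,
    "List": List, "Dict": Dict, "Any": Any,
}

def json_type_to_annotation(json_type):
    """Map a JSON-schema 'type' entry to a Python annotation.

    Handles union lists such as ['string', 'null'] -> Optional[str].
    """
    table = {"string": str, "integer": int, "number": float, "boolean": bool}
    if isinstance(json_type, list):
        non_null = [t for t in json_type if t != "null"]
        inner = (table[non_null[0]] if len(non_null) == 1
                 else Union[tuple(table[t] for t in non_null)])
        return Optional[inner] if "null" in json_type else inner
    return table[json_type]

# Hypothetical generated model source with an Optional field.
source = (
    "from pydantic import BaseModel\n"
    "class Answer(BaseModel):\n"
    "    text: str\n"
    "    note: Optional[str] = None\n"
)
namespace = dict(types_ns)          # imports available before exec()
exec(source, namespace)             # 'Optional' now resolves correctly
Answer = namespace["Answer"]
# Resolve any remaining forward references against the same namespace.
Answer.model_rebuild(_types_namespace=types_ns)
```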