Commit Graph

3 Commits

Author SHA1 Message Date
Magnus Müller
eeb8024184 Handles varied LLM response formats
Ensures the judge system can correctly parse LLM responses, accommodating both string and list content types.

Adds a fallback mechanism to guarantee a result even if maximum retry attempts are exceeded, enhancing robustness and type safety.
2025-06-22 23:12:37 +02:00
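
The behavior described in eeb8024184 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the names `content_to_text` and `judge_with_fallback`, the `{"type": "text", "text": ...}` part shape, and the default of three retries are all assumptions made for the example.

```python
import json
from typing import Any, Callable, List, Optional, Union

# Hypothetical alias: message content may be a plain string or a list of parts.
Content = Union[str, List[Any], None]


def content_to_text(content: Content) -> str:
    """Flatten LLM message content into a single string."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    parts = []
    for item in content:
        if isinstance(item, str):
            parts.append(item)
        elif isinstance(item, dict) and isinstance(item.get("text"), str):
            # Assumed part shape: {"type": "text", "text": "..."}
            parts.append(item["text"])
    return "".join(parts)


def judge_with_fallback(
    call_llm: Callable[[], Content],
    parse: Callable[[str], Any],
    max_retries: int = 3,
    fallback: Optional[Any] = None,
) -> Any:
    """Retry the call up to max_retries times, then return the fallback."""
    for _ in range(max_retries):
        try:
            return parse(content_to_text(call_llm()))
        except Exception:
            # Parsing or the call itself failed; try again.
            continue
    return fallback


if __name__ == "__main__":
    reply = [{"type": "text", "text": '{"score": 1}'}]
    print(judge_with_fallback(lambda: reply, json.loads))
```

Returning an explicit fallback value means the caller always receives a result of a known shape, which is one way to get the type safety the commit message mentions.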
Magnus Müller
4629a8d9b7 Fixes relative import and type hints
Fixes a relative import issue for the judge system.

Updates type hints to allow None values for laminar_link and critical_error.

Comments out unused code related to Laminar link updates.
2025-06-22 23:07:10 +02:00
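
A type-hint change like the one in 4629a8d9b7 might look something like this. Only the field names `laminar_link` and `critical_error` come from the commit message; the `RunResult` class, the `task_id` field, the `str` types, and the `describe` helper are hypothetical and exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """Hypothetical container; the real class and field types may differ."""

    task_id: str
    # Optional[...] lets type checkers accept None for these two fields.
    laminar_link: Optional[str] = None
    critical_error: Optional[str] = None


def describe(result: RunResult) -> str:
    """Summarize a run, handling the None cases explicitly."""
    if result.critical_error is not None:
        return f"{result.task_id}: failed ({result.critical_error})"
    link = result.laminar_link or "no trace link"
    return f"{result.task_id}: ok ({link})"


if __name__ == "__main__":
    print(describe(RunResult(task_id="t1")))
```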
Magnus Müller
be16ff3f69 Implement comprehensive judge system for task evaluation
Added a new judge system in `judge_system.py` that evaluates browser-use agent runs, providing detailed structured feedback. Updated the evaluation workflow in `eval.yaml` to include a new command-line argument for using the comprehensive judge. Modified `service.py` to integrate the new judge system, allowing for fallback to the original Mind2Web evaluation if specified. Enhanced error handling and logging throughout the evaluation process.
2025-06-22 22:43:57 +02:00
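
The switch between the comprehensive judge and the original Mind2Web evaluation in be16ff3f69 could be wired up along these lines. The commit message does not name the command-line argument, so the `--use-comprehensive-judge` flag below is a placeholder, and both judge functions are stand-ins rather than the code in `judge_system.py` or `service.py`.

```python
import argparse
from typing import Callable, Dict, List

Judge = Callable[[str], Dict[str, str]]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Evaluation runner (illustrative)")
    # Placeholder flag name; the real argument is not given in the commit message.
    parser.add_argument(
        "--use-comprehensive-judge",
        action="store_true",
        help="use the comprehensive judge instead of the Mind2Web evaluation",
    )
    return parser


def comprehensive_judge(task: str) -> Dict[str, str]:
    # Stand-in for the structured judge.
    return {"evaluator": "comprehensive", "task": task}


def mind2web_judge(task: str) -> Dict[str, str]:
    # Stand-in for the original Mind2Web-style evaluation.
    return {"evaluator": "mind2web", "task": task}


def select_evaluator(use_comprehensive: bool) -> Judge:
    """Pick the judge based on the parsed flag."""
    return comprehensive_judge if use_comprehensive else mind2web_judge


def evaluate(task: str, argv: List[str]) -> Dict[str, str]:
    args = build_parser().parse_args(argv)
    return select_evaluator(args.use_comprehensive_judge)(task)


if __name__ == "__main__":
    print(evaluate("book a flight", ["--use-comprehensive-judge"]))
```

Using a `store_true` flag keeps the original evaluation as the default in this sketch; whether the actual workflow defaults the same way is not stated in the commit message.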