Enable AI to control your browser 🤖
🌐 Browser-use is the easiest way to connect your AI agents with the browser.
💡 See what others are building and share your projects in our Discord! Want Swag? Check out our Merch store.
🌤️ Skip the setup - try our hosted version for instant browser automation! Try the cloud ☁︎.
Quick start
With pip (Python>=3.11):
pip install browser-use
For memory functionality (requires Python<3.13 due to PyTorch compatibility):
pip install "browser-use[memory]"
Install the browser:
playwright install chromium --with-deps --no-shell
Spin up your agent:
import asyncio
from dotenv import load_dotenv
load_dotenv()

from browser_use import Agent
from browser_use.llm import ChatOpenAI

async def main():
    agent = Agent(
        task="Compare the price of gpt-4o and DeepSeek-V3",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()

asyncio.run(main())
Add your API keys for the provider you want to use to your .env file.
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
AZURE_OPENAI_ENDPOINT=
AZURE_OPENAI_KEY=
GOOGLE_API_KEY=
DEEPSEEK_API_KEY=
GROK_API_KEY=
NOVITA_API_KEY=
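Under the hood, load_dotenv() simply parses these KEY=VALUE lines and exports them into the process environment so the LLM clients can pick them up. As a rough illustration of what it does (a stdlib-only sketch; the key name EXAMPLE_API_KEY is hypothetical, and real python-dotenv handles quoting, interpolation, and more):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv():
    parse KEY=VALUE lines, skip blanks and comments, and export
    each pair without overwriting variables already set."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

# Hypothetical key used purely for demonstration.
os.environ.pop("EXAMPLE_API_KEY", None)
Path(".env").write_text("# provider credentials\nEXAMPLE_API_KEY=sk-example\n")
load_env_file()
print(os.environ["EXAMPLE_API_KEY"])
```

In the real quick-start snippet you would keep using load_dotenv() from python-dotenv rather than this sketch.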
For other settings, models, and more, check out the documentation 📕.
Test with UI
You can test browser-use using its Web UI or Desktop App.
Test with an interactive CLI
You can also use our browser-use interactive CLI (similar to Claude Code):
pip install "browser-use[cli]"
browser-use
Demos
Task: Add grocery items to cart, and checkout.
Prompt: Add my latest LinkedIn follower to my leads in Salesforce.
Prompt: Read my CV & find ML jobs, save them to a file, and then start applying for them in new tabs. If you need help, ask me.
https://github.com/user-attachments/assets/171fb4d6-0355-46f2-863e-edb04a828d04
Prompt: Write a letter in Google Docs to my Papa, thanking him for everything, and save the document as a PDF.
Prompt: Look up models with a license of cc-by-sa-4.0 and sort by most likes on Hugging face, save top 5 to file.
https://github.com/user-attachments/assets/de73ee39-432c-4b97-b4e8-939fd7f323b3
More examples
For more examples see the examples folder or join the Discord and show off your project. You can also see our awesome-prompts repo for prompting inspiration.
Vision
Tell your computer what to do, and it gets it done.
Roadmap
Agent
- Improve agent memory to handle 100+ steps
- Enhance planning capabilities (load website specific context)
- Reduce token consumption (system prompt, DOM state)
DOM Extraction
- Enable detection for all possible UI elements
- Improve state representation for UI elements so that all LLMs can understand what's on the page
Workflows
- Let user record a workflow - which we can rerun with browser-use as a fallback
- Make rerunning of workflows work, even if pages change
User Experience
- Create various templates for tutorial execution, job application, QA testing, social media, etc. which users can just copy & paste.
- Improve docs
- Make it faster
Parallelization
- Human work is sequential. The real power of a browser agent emerges when similar tasks run in parallel. For example, finding contact information for 100 companies can be done entirely in parallel and reported back to a main agent, which processes the results and kicks off parallel subtasks again.
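The fan-out/fan-in pattern described above can be sketched with plain asyncio. Here lookup_contact is a hypothetical stand-in for running one agent; a real version would create an Agent(task=..., llm=...) per company and await agent.run():

```python
import asyncio

async def lookup_contact(company: str) -> str:
    # Placeholder for real browser work done by one agent instance.
    await asyncio.sleep(0)
    return f"{company}: contact@{company.lower()}.example"

async def main() -> list[str]:
    companies = ["Acme", "Globex", "Initech"]
    # Fan out one subtask per company, then gather all results
    # for the main agent to process.
    return await asyncio.gather(*(lookup_contact(c) for c in companies))

results = asyncio.run(main())
print(results)
```

In practice you would also cap concurrency (e.g. with an asyncio.Semaphore) so you don't open hundreds of browser sessions at once.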
Contributing
We love contributions! Feel free to open issues for bugs or feature requests. To contribute to the docs, check out the /docs folder.
🧪 How to make your agents robust?
We offer to run your tasks in our CI, automatically, on every update!
- Add your task: Add a YAML file in tests/agent_tasks/ (see the README there for details).
- Automatic validation: Every time we push updates, your task will be run by the agent and evaluated using your criteria.
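Such a task file pairs a prompt with the criteria a judge uses to score the run. The shape below is illustrative only; the field names are hypothetical, so check the README in tests/agent_tasks/ for the actual schema:

```yaml
# Illustrative sketch — field names are hypothetical.
name: price_comparison
task: Compare the price of gpt-4o and DeepSeek-V3
judge_criteria:
  - The final answer mentions both models
  - The final answer includes at least one concrete price
```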
Local Setup
To learn more about the library, check out the local setup 📕.
main is the primary development branch with frequent changes. For production use, install a stable versioned release instead.
Swag
Want to show off your Browser-use swag? Check out our Merch store. Good contributors will receive swag for free 👀.
Citation
If you use Browser Use in your research or project, please cite:
@software{browser_use2024,
  author = {Müller, Magnus and Žunič, Gregor},
  title = {Browser Use: Enable AI to control your browser},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/browser-use/browser-use}
}