Commit Graph

196 Commits

Author SHA1 Message Date
Aitor
8a5ca6ceec Merge branch 'main' into feat/evals-anchor-support 2025-07-02 14:23:45 +02:00
Magnus Müller
239fd3f86b eval include runner link 2025-07-02 14:10:22 +02:00
mertunsall
3a5f43bb3f eval should run with temperature 1 for o3 2025-07-02 11:07:58 +02:00
Magnus Müller
4a0eab6fb8 test git automation functions 2025-07-02 10:36:41 +02:00
mertunsall
4e7cf3a964 Update model names for gemini-2.5 to remove preview suffixes in service.py 2025-07-01 13:40:05 +02:00
Magnus Müller
b17e8f3f88 Max 3 min for judge 2025-07-01 12:36:17 +02:00
Aitor
0c225d2afa Merge branch 'main' into feat/evals-anchor-support 2025-07-01 09:03:59 +02:00
mertunsall
d56556a366 cloudflare and antibot are captcha 2025-07-01 08:51:38 +02:00
Magnus Müller
f5fe5dff66 Add --use-anchor argument to evaluation script for enhanced navigation 2025-07-01 08:14:28 +02:00
mertunsall
6951605ec2 fix 2025-07-01 07:55:17 +02:00
mertunsall
fc1d44cf87 small change to judge 2025-07-01 07:52:55 +02:00
BroskyBrowser
3b38d95601 chore: linter 2025-06-30 20:26:45 +02:00
BroskyBrowser
f875e89bf8 chore: merge main 2025-06-30 19:15:38 +02:00
Magnus Müller
86e02752fa Refactor error categories in judge_system.py for clarity and organization
- Renamed and reorganized error categories for better readability and understanding.
- Added new error categories: LOGIN_FAILED, LLM_CALL_ERROR, WAIT_TOO_SHORT, BROWSER_CRASHES, and ELEMENT_INTERACTION_ERROR.
- Removed deprecated categories and consolidated related ones.
- Updated documentation to clarify the purpose of each error category and provided actionable improvement tips.
2025-06-30 15:13:58 +02:00
BroskyBrowser
c14a363299 feat: added --use-anchor flag support 2025-06-30 11:35:14 +02:00
Nick Sweeting
b2698440e1 ignore missing datamodel_code_generator 2025-06-30 01:07:21 -07:00
mertunsall
63179e4450 allow eval to download 2025-06-29 21:24:59 +02:00
Magnus Müller
8968a24dc5 Enhance TaskResult to convert usage data to JSON string if it's a dictionary. This improves data handling and ensures consistent payload structure for usage information. 2025-06-29 21:07:56 +02:00
Magnus Müller
5905ad949c feat: add thinking parameter to control agent system prompt
- Add --no-thinking flag to disable thinking in agent system prompt
- Default is true (thinking enabled) for backward compatibility
- Pass thinking parameter through entire evaluation pipeline
- Update GitHub Actions workflow to handle thinking parameter
2025-06-29 20:11:45 +02:00
Magnus Müller
8e69bc603b Enhanced logging in reformat_agent_history to provide detailed insights into agent history usage data, including type checks and warnings for missing usage information. 2025-06-29 20:01:54 +02:00
Magnus Müller
01be75c20f Enhanced logging in TaskResult to include token usage details for improved debugging and monitoring of resource consumption. 2025-06-29 19:57:48 +02:00
Magnus Müller
df2e65fcd9 Added logging for format data in TaskResult to enhance debugging capabilities. 2025-06-29 19:48:59 +02:00
Gregor Žunič
2c884040c4 Merge branch 'main' into gregor/eng-238-fix-token-counts-on-eval-platform 2025-06-29 19:40:29 +02:00
Gregor Žunič
d281ab871f Merge remote-tracking branch 'origin/main' into gregor/eng-244-better-structured-output 2025-06-29 18:13:20 +02:00
Gregor Žunič
184fa8efb7 tests for schema generator 2025-06-29 18:09:25 +02:00
Magnus Müller
ea152c2c1e Fix judge system for structured output
- Convert JudgeResult from dataclass to Pydantic BaseModel for structured output compatibility
- Remove ScoreBreakdown class and simplify JudgeResult structure
- Update system prompt to clarify final_score as percentage of task completion
- Add specific examples for task completion scoring
- Fix ChatInvokeCompletion response handling for structured output
- Add Cloudflare challenge scoring rule (score = 0)
- Simplify response extraction logic
2025-06-29 17:53:30 +02:00
Gregor Žunič
48ae2d19cf much better structured output + utils for generating pydantic schema 2025-06-29 17:42:19 +02:00
Gregor Žunič
9f8a91a812 Merge remote-tracking branch 'origin/main' into gregor/eng-238-fix-token-counts-on-eval-platform 2025-06-29 17:04:54 +02:00
Magnus Müller
3f49254e8a Merge branch 'main' into eval-logins 2025-06-29 16:57:00 +02:00
Magnus Müller
344b35aa79 Refactor response handling in judge system evaluation
- Simplified the extraction of JudgeResult from the model's response by removing unnecessary error handling and logging for unexpected response structures.
- This change streamlines the evaluation process, ensuring that valid responses are directly returned without additional checks.

These modifications aim to enhance the clarity and efficiency of the evaluation logic.
2025-06-29 16:15:36 +02:00
Magnus Müller
7fe23c2d82 Enhance error handling and logging in judge system evaluation
- Improved logging to capture the structure of unexpected responses from the model.
- Updated response handling to correctly extract the JudgeResult from the completion attribute.
- Enhanced fallback mechanism for invalid response structures, providing clearer error messages.

These changes aim to increase robustness and clarity in the evaluation process.
2025-06-29 16:12:01 +02:00
Magnus Müller
b7fff509f9 Refactor JudgeResult to use Pydantic model and streamline evaluation process
- Replaced dataclass ScoreBreakdown with Pydantic's BaseModel for JudgeResult.
- Updated scoring guidelines to reflect percentage-based final scores.
- Removed unnecessary fields and improved JSON response structure.
- Enhanced error handling and logging for evaluation failures.
- Simplified parsing logic for structured responses from the model.
2025-06-29 15:47:45 +02:00
Gregor Žunič
b37e026f89 added usage data to agent history and evals 2025-06-29 11:28:20 +02:00
Magnus Müller
e22122a01e System prompt update 2025-06-29 08:57:50 +02:00
Magnus Müller
024681e871 Dont show ...[cut for eval system] if its fully displayed 2025-06-29 08:20:51 +02:00
Alezander9
68d150c522 last commit to retrigger tests 2025-06-28 21:21:11 -07:00
Alezander9
002ef30515 Resolve merge conflicts: integrate login cookie evaluation with new browser setup and evaluation architecture 2025-06-28 20:46:47 -07:00
Magnus Müller
beae2fd20b fix: add proper namespace support for Pydantic model_rebuild() with union types
- Add required imports (Optional, Union, etc.) to module namespace before exec()
- Use _types_namespace parameter in model_rebuild() to resolve forward references
- Fix handling of JSON schema union types like ['string', 'null']
- Now properly supports schemas with Optional fields and union types

This resolves the 'Optional is not defined' error when using structured
output schemas with nullable fields in the evaluation system.
2025-06-28 23:36:34 +02:00
Magnus Müller
3e0c0f74e4 feat: fix structured output schema conversion and improve error handling
- Fix datamodel-code-generator enum value (PydanticV2BaseModel)
- Add missing Optional import in fallback schema conversion
- Improve error handling with proper exception chaining
- Add debug logging for schema conversion process
- Fix temporary file cleanup to prevent NameError
- Ensure robust fallback when datamodel-code-generator unavailable

Resolves issues with JSON schema to Pydantic model conversion for
structured output functionality in eval system.
2025-06-28 23:14:05 +02:00
Magnus Müller
c3687ea9ea Refactor create_pydantic_model_from_schema to improve temporary file handling and error management during model generation 2025-06-28 21:21:02 +02:00
Magnus Müller
6305570942 Update pyproject.toml to exclude 'private_example/' and refactor service.py for improved Pydantic model generation and code cleanup 2025-06-28 21:18:51 +02:00
Cursor Agent
ad2614fb60 Changes from background composer bc-4b95c2c0-b3db-46d7-82ec-5bb4d84dd5b1 2025-06-28 18:58:34 +00:00
Cursor Agent
5b6edc50ee Implement structured output with datamodel-code-generator for dynamic Pydantic models
Co-authored-by: mamagnus00 <mamagnus00@gmail.com>
2025-06-28 18:53:33 +00:00
Cursor Agent
14e649ab97 Checkpoint before follow-up message 2025-06-28 18:51:49 +00:00
Cursor Agent
816d51839b Checkpoint before follow-up message 2025-06-28 18:43:41 +00:00
Cursor Agent
b2baabd934 Remove output_schema handling from Agent and judge system
Co-authored-by: mamagnus00 <mamagnus00@gmail.com>
2025-06-28 18:29:10 +00:00
Cursor Agent
d8684d5b9d Add structured output schema support for agent tasks and evaluation
Co-authored-by: mamagnus00 <mamagnus00@gmail.com>
2025-06-28 18:12:58 +00:00
Mert Unsal
bda38011cd Merge branch 'main' into cursor/remove-memory-support-and-raise-error-b7bd 2025-06-27 13:40:56 +02:00
mertunsall
98bd8813b1 format 2025-06-27 13:28:54 +02:00
Cursor Agent
f9741adba5 Remove deprecated memory system and related code
Co-authored-by: mailmertunsal <mailmertunsal@gmail.com>
2025-06-27 09:48:09 +00:00