Aitor
8a5ca6ceec
Merge branch 'main' into feat/evals-anchor-support
2025-07-02 14:23:45 +02:00
Magnus Müller
239fd3f86b
eval include runner link
2025-07-02 14:10:22 +02:00
mertunsall
3a5f43bb3f
eval should run with temperature 1 for o3
2025-07-02 11:07:58 +02:00
Magnus Müller
4a0eab6fb8
test git automation functions
2025-07-02 10:36:41 +02:00
mertunsall
4e7cf3a964
Update model names for gemini-2.5 to remove preview suffixes in service.py
2025-07-01 13:40:05 +02:00
Magnus Müller
b17e8f3f88
Max 3 min for judge
2025-07-01 12:36:17 +02:00
Aitor
0c225d2afa
Merge branch 'main' into feat/evals-anchor-support
2025-07-01 09:03:59 +02:00
mertunsall
d56556a366
cloudflare and antibot are captcha
2025-07-01 08:51:38 +02:00
Magnus Müller
f5fe5dff66
Add --use-anchor argument to evaluation script for enhanced navigation
2025-07-01 08:14:28 +02:00
mertunsall
6951605ec2
fix
2025-07-01 07:55:17 +02:00
mertunsall
fc1d44cf87
small change to judge
2025-07-01 07:52:55 +02:00
BroskyBrowser
3b38d95601
chore: linter
2025-06-30 20:26:45 +02:00
BroskyBrowser
f875e89bf8
chore: merge main
2025-06-30 19:15:38 +02:00
Magnus Müller
86e02752fa
Refactor error categories in judge_system.py for clarity and organization
- Renamed and reorganized error categories for better readability and understanding.
- Added new error categories: LOGIN_FAILED, LLM_CALL_ERROR, WAIT_TOO_SHORT, BROWSER_CRASHES, and ELEMENT_INTERACTION_ERROR.
- Removed deprecated categories and consolidated related ones.
- Updated documentation to clarify the purpose of each error category and provided actionable improvement tips.
2025-06-30 15:13:58 +02:00
BroskyBrowser
c14a363299
feat: added --use-anchor flag support
2025-06-30 11:35:14 +02:00
Nick Sweeting
b2698440e1
ignore missing datamodel_code_generator
2025-06-30 01:07:21 -07:00
mertunsall
63179e4450
allow eval to download
2025-06-29 21:24:59 +02:00
Magnus Müller
8968a24dc5
Enhance TaskResult to convert usage data to JSON string if it's a dictionary. This improves data handling and ensures consistent payload structure for usage information.
2025-06-29 21:07:56 +02:00
Magnus Müller
5905ad949c
feat: add thinking parameter to control agent system prompt
- Add --no-thinking flag to disable thinking in agent system prompt
- Default is true (thinking enabled) for backward compatibility
- Pass thinking parameter through entire evaluation pipeline
- Update GitHub Actions workflow to handle thinking parameter
2025-06-29 20:11:45 +02:00
Magnus Müller
8e69bc603b
Enhanced logging in reformat_agent_history to provide detailed insights into agent history usage data, including type checks and warnings for missing usage information.
2025-06-29 20:01:54 +02:00
Magnus Müller
01be75c20f
Enhanced logging in TaskResult to include token usage details for improved debugging and monitoring of resource consumption.
2025-06-29 19:57:48 +02:00
Magnus Müller
df2e65fcd9
Added logging for format data in TaskResult to enhance debugging capabilities.
2025-06-29 19:48:59 +02:00
Gregor Žunič
2c884040c4
Merge branch 'main' into gregor/eng-238-fix-token-counts-on-eval-platform
2025-06-29 19:40:29 +02:00
Gregor Žunič
d281ab871f
Merge remote-tracking branch 'origin/main' into gregor/eng-244-better-structured-output
2025-06-29 18:13:20 +02:00
Gregor Žunič
184fa8efb7
tests for schema generator
2025-06-29 18:09:25 +02:00
Magnus Müller
ea152c2c1e
Fix judge system for structured output
- Convert JudgeResult from dataclass to Pydantic BaseModel for structured output compatibility
- Remove ScoreBreakdown class and simplify JudgeResult structure
- Update system prompt to clarify final_score as percentage of task completion
- Add specific examples for task completion scoring
- Fix ChatInvokeCompletion response handling for structured output
- Add Cloudflare challenge scoring rule (score = 0)
- Simplify response extraction logic
2025-06-29 17:53:30 +02:00
Gregor Žunič
48ae2d19cf
much better structured output + utils for generating pydantic schema
2025-06-29 17:42:19 +02:00
Gregor Žunič
9f8a91a812
Merge remote-tracking branch 'origin/main' into gregor/eng-238-fix-token-counts-on-eval-platform
2025-06-29 17:04:54 +02:00
Magnus Müller
3f49254e8a
Merge branch 'main' into eval-logins
2025-06-29 16:57:00 +02:00
Magnus Müller
344b35aa79
Refactor response handling in judge system evaluation
- Simplified the extraction of JudgeResult from the model's response by removing unnecessary error handling and logging for unexpected response structures.
- This change streamlines the evaluation process, ensuring that valid responses are directly returned without additional checks.
These modifications aim to enhance the clarity and efficiency of the evaluation logic.
2025-06-29 16:15:36 +02:00
Magnus Müller
7fe23c2d82
Enhance error handling and logging in judge system evaluation
- Improved logging to capture the structure of unexpected responses from the model.
- Updated response handling to correctly extract the JudgeResult from the completion attribute.
- Enhanced fallback mechanism for invalid response structures, providing clearer error messages.
These changes aim to increase robustness and clarity in the evaluation process.
2025-06-29 16:12:01 +02:00
Magnus Müller
b7fff509f9
Refactor JudgeResult to use Pydantic model and streamline evaluation process
- Replaced dataclass ScoreBreakdown with Pydantic's BaseModel for JudgeResult.
- Updated scoring guidelines to reflect percentage-based final scores.
- Removed unnecessary fields and improved JSON response structure.
- Enhanced error handling and logging for evaluation failures.
- Simplified parsing logic for structured responses from the model.
2025-06-29 15:47:45 +02:00
Gregor Žunič
b37e026f89
added usage data to agent history and evals
2025-06-29 11:28:20 +02:00
Magnus Müller
e22122a01e
System prompt update
2025-06-29 08:57:50 +02:00
Magnus Müller
024681e871
Dont show ...[cut for eval system] if its fully displayed
2025-06-29 08:20:51 +02:00
Alezander9
68d150c522
last commit to retrigger tests
2025-06-28 21:21:11 -07:00
Alezander9
002ef30515
Resolve merge conflicts: integrate login cookie evaluation with new browser setup and evaluation architecture
2025-06-28 20:46:47 -07:00
Magnus Müller
beae2fd20b
fix: add proper namespace support for Pydantic model_rebuild() with union types
- Add required imports (Optional, Union, etc.) to module namespace before exec()
- Use _types_namespace parameter in model_rebuild() to resolve forward references
- Fix handling of JSON schema union types like ['string', 'null']
- Now properly supports schemas with Optional fields and union types
This resolves the 'Optional is not defined' error when using structured
output schemas with nullable fields in the evaluation system.
2025-06-28 23:36:34 +02:00
Magnus Müller
3e0c0f74e4
feat: fix structured output schema conversion and improve error handling
- Fix datamodel-code-generator enum value (PydanticV2BaseModel)
- Add missing Optional import in fallback schema conversion
- Improve error handling with proper exception chaining
- Add debug logging for schema conversion process
- Fix temporary file cleanup to prevent NameError
- Ensure robust fallback when datamodel-code-generator unavailable
Resolves issues with JSON schema to Pydantic model conversion for
structured output functionality in eval system.
2025-06-28 23:14:05 +02:00
Magnus Müller
c3687ea9ea
Refactor create_pydantic_model_from_schema to improve temporary file handling and error management during model generation
2025-06-28 21:21:02 +02:00
Magnus Müller
6305570942
Update pyproject.toml to exclude 'private_example/' and refactor service.py for improved Pydantic model generation and code cleanup
2025-06-28 21:18:51 +02:00
Cursor Agent
ad2614fb60
Changes from background composer bc-4b95c2c0-b3db-46d7-82ec-5bb4d84dd5b1
2025-06-28 18:58:34 +00:00
Cursor Agent
5b6edc50ee
Implement structured output with datamodel-code-generator for dynamic Pydantic models
Co-authored-by: mamagnus00 <mamagnus00@gmail.com>
2025-06-28 18:53:33 +00:00
Cursor Agent
14e649ab97
Checkpoint before follow-up message
2025-06-28 18:51:49 +00:00
Cursor Agent
816d51839b
Checkpoint before follow-up message
2025-06-28 18:43:41 +00:00
Cursor Agent
b2baabd934
Remove output_schema handling from Agent and judge system
Co-authored-by: mamagnus00 <mamagnus00@gmail.com>
2025-06-28 18:29:10 +00:00
Cursor Agent
d8684d5b9d
Add structured output schema support for agent tasks and evaluation
Co-authored-by: mamagnus00 <mamagnus00@gmail.com>
2025-06-28 18:12:58 +00:00
Mert Unsal
bda38011cd
Merge branch 'main' into cursor/remove-memory-support-and-raise-error-b7bd
2025-06-27 13:40:56 +02:00
mertunsall
98bd8813b1
format
2025-06-27 13:28:54 +02:00
Cursor Agent
f9741adba5
Remove deprecated memory system and related code
Co-authored-by: mailmertunsal <mailmertunsal@gmail.com>
2025-06-27 09:48:09 +00:00