Commit Graph

3952 Commits

Author SHA1 Message Date
Aitor
8a5ca6ceec Merge branch 'main' into feat/evals-anchor-support 2025-07-02 14:23:45 +02:00
Magnus Müller
31e243d789 eval include runner link (#2253)
Auto-generated PR for: eval include runner link
    
## Summary by cubic
Added support for passing the GitHub workflow run URL through the
evaluation pipeline for better tracking and visibility.

- **New Features**
- The workflow URL is now constructed and passed as a command-line
argument.
- The evaluation service accepts and stores the workflow URL for each
run.

2025-07-02 14:10:54 +02:00
Magnus Müller
239fd3f86b eval include runner link 2025-07-02 14:10:22 +02:00
Mert Unsal
a8b79accbc New element updates in browser state (#2248)
- Update DOM element highlighting from *[index]* to <new>[index]</new>
- Update system prompts to reflect new <new> tag format
- Provides clearer semantic meaning for new elements in browser state
2025-07-02 11:30:56 +02:00
Mert Unsal
8094204488 Merge branch 'main' into cursor/create-branch-for-new-element-updates-0831 2025-07-02 11:27:48 +02:00
Mert Unsal
c3f2c87f15 Clarify new elements tagging criteria (#2249)
Clarify system prompt description of `<new>` tags to specify new
clickable elements.
    
## Summary by cubic
Updated the system prompt to clarify that elements tagged with <new> are
clickable elements that appeared since the last step, if the URL has not
changed.

2025-07-02 11:27:22 +02:00
Cursor Agent
9dc85ee203 Checkpoint before follow-up message 2025-07-02 09:25:18 +00:00
Cursor Agent
fd5bb775dc Clarify description of <new> elements in browser state documentation
Co-authored-by: mailmertunsal <mailmertunsal@gmail.com>
2025-07-02 09:25:07 +00:00
Cursor Agent
755b96e6a0 Simplify note about <new> tag in browser state description
Co-authored-by: mailmertunsal <mailmertunsal@gmail.com>
2025-07-02 09:23:26 +00:00
Cursor Agent
e13c4cee9c Clarify definition of <new> tag in browser state documentation
Co-authored-by: mailmertunsal <mailmertunsal@gmail.com>
2025-07-02 09:22:12 +00:00
Mert Unsal
7ae2e70ce2 Remove mem0 dependencies from pyproject (#2245)
Removed the mem0ai dependency from pyproject.toml to simplify project
requirements.
2025-07-02 11:11:17 +02:00
Mert Unsal
bba5b6fbb5 Quick fix for eval (#2246)
Fixed the evaluation service so that both 'gpt-o4-mini' and 'gpt-o3'
models use temperature 1 as required.
2025-07-02 11:08:32 +02:00
mertunsall
3a5f43bb3f eval should run with temperature 1 for o3 2025-07-02 11:07:58 +02:00
Cursor Agent
695171f90b Remove mem0ai dependency from project requirements
Co-authored-by: mailmertunsal <mailmertunsal@gmail.com>
2025-07-02 09:00:45 +00:00
Magnus Müller
5a42354d53 remove git function helpers (#2244)
Auto-generated PR for: remove git function helpers
    
## Summary by cubic
Removed the git-functions.sh script, which included helper functions for
automating branch creation, commits, pushes, and pull requests. This
cleans up unused shell helpers from the codebase.

2025-07-02 10:45:01 +02:00
Magnus Müller
7ae04893ee Merge branch 'main' into remove-git-function-helpers 2025-07-02 10:44:45 +02:00
Magnus Müller
18db8926bd remove git function helpers 2025-07-02 10:44:24 +02:00
Mert Unsal
95045620de Add sensitive data example (#2243)
Added example for sensitive data handling
2025-07-02 10:41:52 +02:00
mertunsall
8880a2db37 add sensitive data example 2025-07-02 10:40:58 +02:00
Magnus Müller
4a0eab6fb8 test git automation functions 2025-07-02 10:36:41 +02:00
Mert Unsal
1f097275bd Remove xpath from click_element_by_index (#2240)
Removed the optional xpath field from the ClickElementAction model - this was wrong.
2025-07-02 10:15:12 +02:00
Mert Unsal
0d781ce20c fix (#2241)
Fixed a bug where the reasoning_effort parameter was sent to all models.
Now it is only sent to supported models.
2025-07-02 10:13:47 +02:00
mertunsall
b37f5294c5 fix 2025-07-02 10:08:11 +02:00
mertunsall
fbd3a11737 fix 2025-07-02 10:01:56 +02:00
Cursor Agent
ee2c1d2ad0 Remove optional xpath from ClickElementAction model
Co-authored-by: mailmertunsal <mailmertunsal@gmail.com>
2025-07-02 07:43:21 +00:00
Mert Unsal
fc98494b00 Remove xpath from input text action (#2239)
Removed the optional xpath field from the InputTextAction model to
simplify its structure.
2025-07-02 09:37:23 +02:00
Cursor Agent
c24d5b4320 Remove optional xpath parameter from InputTextAction model
Co-authored-by: mamagnus00 <mamagnus00@gmail.com>
2025-07-01 21:51:05 +00:00
Cursor Agent
05ef6dcc82 Replace asterisk markers with <new> tags for new elements
- Update DOM element highlighting from *[index]* to <new>[index]</new>
- Update system prompts to reflect new <new> tag format
- Provides clearer semantic meaning for new elements in browser state
2025-07-01 21:47:57 +00:00
Mert Unsal
b7f461c904 Add reasoning_low parameter to OpenAI chat model (#2236)
Added reasoning_low parameter to OpenAI chat model to control the
reasoning effort of the model.
2025-07-01 23:25:53 +02:00
mertunsall
b290e64b86 fix types 2025-07-01 23:24:20 +02:00
mertunsall
6bdb4d5920 Refactor reasoning_effort parameter to be non-optional and default to 'low' 2025-07-01 23:20:50 +02:00
mertunsall
1885dabc48 default reasoning effort to low 2025-07-01 23:18:54 +02:00
Mert Unsal
03a636b6d9 Update model names for gemini-2.5 to remove preview suffixes in service.py (#2228)
    
## Summary by cubic
Updated the model names for gemini-2.5-pro and gemini-2.5-flash to
remove preview suffixes and use the latest stable names.

2025-07-01 13:40:44 +02:00
mertunsall
4e7cf3a964 Update model names for gemini-2.5 to remove preview suffixes in service.py 2025-07-01 13:40:05 +02:00
Mert Unsal
bfb0fae642 feat: resilient memory (#2223)
- Introduce a max_history_items parameter to limit the memory of the model
- Changed the system messages to use a <sys> tag instead of <s> to avoid confusion with HTML
- Removed MessageMetadata and SupportedMessageTypes and implemented a cleaner MessageManagerState
- Implemented a HistoryItem class to cleanly reconstruct agent history description
2025-07-01 13:00:40 +02:00
Magnus Müller
fba794354e eval-4-core-runners (#2227)
Auto-generated PR for branch: eval-4-core-runners
    
## Summary by cubic
Updated the evaluation workflow to use the new 'eval-4-core-500' runner
label instead of 'ubuntu-latest-16-core'.

2025-07-01 12:56:55 +02:00
Magnus Müller
a9488feead Change to 2 core runners 2025-07-01 12:56:36 +02:00
Magnus Müller
0713b057fc Merge branch 'main' into eval-4-core-runners 2025-07-01 12:55:18 +02:00
Mert Unsal
e1aecf855b Merge branch 'main' into mert/resillient_memory 2025-07-01 12:54:20 +02:00
Magnus Müller
9376d9d91e Update evaluation workflow to use new runner label 'eval-4-core-500' 2025-07-01 12:54:18 +02:00
Mert Unsal
0842d84596 dont use fs for short tasks, dont initialize results.md (#2218)
Stopped initializing the file system with results.md and updated prompts
and tests to only use todo.md by default. Now, results.md is created
only for long tasks.

Also added an extract_links parameter to extract_content so that the agent can find URLs or links.
2025-07-01 12:54:06 +02:00
mertunsall
729fc798ca add assertion for safety 2025-07-01 12:49:03 +02:00
Mert Unsal
f396e448d7 Merge branch 'main' into mert/resillient_memory 2025-07-01 12:47:30 +02:00
Mert Unsal
ebd151aafd Merge branch 'main' into mert/dont_use_fs 2025-07-01 12:47:11 +02:00
Magnus Müller
27bea3622c eval-test-new-runners (#2225)
Auto-generated PR for branch: eval-test-new-runners
    
## Summary by cubic
Added a 3-minute timeout to the comprehensive judge evaluation to
prevent long-running tasks.

2025-07-01 12:44:49 +02:00
Magnus Müller
b17e8f3f88 Max 3 min for judge 2025-07-01 12:36:17 +02:00
mertunsall
63c0f5d086 convert to pydantic 2025-07-01 12:32:54 +02:00
mertunsall
3b53e8b495 fix linter 2025-07-01 12:23:52 +02:00
Mert Unsal
d5fe291a38 Merge branch 'main' into mert/dont_use_fs 2025-07-01 12:05:27 +02:00
Mert Unsal
9f1e1bd929 Merge branch 'main' into mert/resillient_memory 2025-07-01 12:05:10 +02:00