- Improve clarity on using extract_structured_data and user request
processing.
- Added guidance on file system to avoid overwriting existing content.
- Included additional reasoning patterns for better task management and
progress tracking.
<!-- This is an auto-generated description by cubic. -->
## Summary by cubic
Added detailed CPU and memory monitoring to the evaluation workflow and
service to help track resource usage and catch issues like high memory
or CPU load during eval runs.
- **New Features**
- Logs system resource stats before, during, and after evaluation runs.
- Starts a background resource monitor that checks CPU, memory, and
process counts every 30 seconds.
- Adds signal handling and heartbeat logging for better debugging and
graceful shutdowns.
- Collects and uploads resource logs and debug info as workflow
artifacts.
<!-- End of auto-generated description by cubic. -->
- Improve clarity on using extract_structured_data and user request processing.
- Added guidance on file system to avoid overwriting existing content.
- Included additional reasoning patterns for better task management and progress tracking.
- Change reasoning rules in system prompt
- Flatten AgentOutput to get rid of current_state
- Add agent initialization in history.
- Update the example tool call.
- Add a current_state property to AgentOutput for compatibility.
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Flattened the AgentOutput model by removing the current_state field,
updated the system prompt reasoning rules, and added agent
initialization to the history for better clarity and compatibility.
- **Refactors**
- AgentOutput now uses top-level fields instead of a nested
current_state.
- Added a current_state property for backward compatibility.
- Simplified and clarified reasoning rules in the system prompt.
- Updated example tool call and included agent initialization in the
history.
<!-- End of auto-generated description by cubic. -->
- Flatten AgentOutput to get rid of current_state
- Add agent initialization in history.
- Update the example tool call.
- Add a current_state property to AgentOutput for compatibility.
## Agent State and System Prompt
This PR aims to improve the agent by rewriting the way the state is
constructed, as described in `system_prompt.md`, new state consists of:
1. Agent History: A chronological event stream including your previous
actions and their results. This may be truncated or partially omitted.
2. Agent State: Includes the ultimate goal provided by the user, current
progress, and relevant contextual memory.
3. Browser State: Contains current URL, open tabs, interactive elements
indexed for actions, visible page content, and (sometimes) any visual
context provided by screenshots or page snapshots.
4. Read State: If your previous action involved reading a file or
extracting content (e.g., from a webpage), the full result will be
included here. This data is **only shown in the current step** and will
not appear in future Agent History. You are responsible for saving or
interpreting the information appropriately during this step.
Please refer to `system_prompt.md` for explanation of what the model
gets in its context. The PR rewrites the `MessageManager` to achieve the
displaying of this state. This helps maintaining a better state as the
model always sees a constant number of messages: system prompt, example
tool calls, and current state.
## File System
The model now has access to a File System it can interact with so that
it can make a plan, write intermediate results, etc.
1. Agent is initialized with two files `todo.md` and `results.md`. First
one is used so that the model can plan and the contents of this document
is always displayed in agent's state. Second one is so that model can
accumulate results for long tasks. File system is always displayed in
Agent State in format:
file_name — num_lines lines
2. Model is allowed to use 3 functionalities, `write_file`,
`append_file`, and `read_file`. We should improve the descriptions of
these functions. Finally, there should be an option to "edit a string"
(Replace line `"{content}"` to `"{new_content}"`.
3. Currently, for safety reasons, agent can ONLY create files in format
`{file_name}.{md|txt}` and cannot create subdirectories or go back. This
makes sure that agent's
4. The file system is currently launched in a temporary directory with a
random uuid - this can be changed so that we just use `tempfile`'s
temporary file and temporary directory functionality, stored in memory
and to be deleted when program terminates.
6. Optionally, the user might want to keep the results saved in the
directory so there should be an option to set a directory which will not
be deleted at the end.
7. Finally, the agent is not allowed to be initialized in an existing
directory. This is to make sure that files written in a previous do not
impact a new agent's behavior accidentally. However, this is annoying as
it requires the user to delete the directory between multiple runs.
## AgentBrain
AgentBrain object has a new thinking field - need to make sure that this
is taken care of in the entire codebase, saved/displayed when necessary.
## AgentResult
AgentResult object has 2 new fields, `memory` which is what will be
added to Action History that the agent will see and `update_read_state`,
a boolean that determines whether the action should result in updating
the read state. This is used for actions such as `extract_content` and
`read_file`, where the model should see the content of the file once but
file contents shouldn't remain in history forever.
## Details and Next Steps
1. Currently browser state is capped at 10k characters - this is very
primitive and should be changed with a smarter semantic processing
layer.
2. I think we want to permanently include file system in the Agent -
this feature seems very important. So the controller functions for using
a file system should be moved from agent into controller.
3. We should add a function call for sending user a message with the
results before calling `done` - this is simply more clean and more
intuitive for the model.
4. Save the current page as PDF file should directly use the file
system.
5. Currently, there seems to be a bug with `append_file`, should be
fixed.
6. Not all language models seem to work with current version - for
example, maybe we should get rid of `current_state` field.
7. Get rid of langchain and do our own LLM calls for better tractability
and more control over what goes into the LLM.
I think there are a lot more things to be addressed regarding this PR,
but this is all I could gather.
<!-- This is an auto-generated description by cubic. -->
## Summary by cubic
Changed browser session setup to use incognito mode by setting
user_data_dir to None, preventing persistent state between evaluation
runs.
<!-- End of auto-generated description by cubic. -->