fix(cdp): timeout-wrap CDPClient.send_raw to break silent WebSocket hangs

cdp_use.CDPClient.send_raw awaits a future that only resolves when the
browser sends a response with a matching message id. There is no timeout
on that await. Against the cloud browser service, the failure mode we
observed is: WebSocket stays alive at the TCP/keepalive layer (proxy
keeps pong-ing our pings), but the browser upstream is dead / unhealthy
and never sends any CDP response. send_raw's future never resolves, and
every higher-level timeout in browser-use (session.start's 15s connect
guard, agent.step_timeout, tools.act's action timeout) relies on
eventually getting a response — so they all wait forever too.

Evidence from a 170k-task collector run: 1,090 empty-history traces,
100% hit the 240s outer watchdog, median duration 582s, max 2214s, with
cloud HTTP layer clean throughout (all 200/201). One sample showed
/json/version returning 200 OK and then 5 minutes of total silence on
the WebSocket before forced stop — classic silent-hang.

Fix: add TimeoutWrappedCDPClient, a thin subclass of cdp_use.CDPClient
that wraps send_raw in asyncio.wait_for(timeout=cdp_request_timeout_s).
Any CDP method that doesn't respond within the cap raises plain
TimeoutError, which propagates through existing `except TimeoutError`
handlers in session.py / tools/service.py. Uses the same defensive env
parse pattern as BROWSER_USE_ACTION_TIMEOUT_S — rejects empty /
non-numeric / nan / inf / non-positive values with a warning fallback.

Default is 60s: generous for slow operations like Page.captureScreenshot
or Page.printToPDF on heavy pages, but well below the 180s step timeout
and any typical outer watchdog. Override via BROWSER_USE_CDP_TIMEOUT_S.

Wired into both CDPClient construction sites in session.py (initial
connect + reconnect path). All 17 existing real-browser tests
(test_action_blank_page, test_multi_act_guards) still pass.
This commit is contained in:
Saurav Panda
2026-04-20 17:40:32 -07:00
parent a9a7201d8d
commit 9a09c4d7dc
3 changed files with 194 additions and 2 deletions

View File

@@ -0,0 +1,107 @@
"""Per-CDP-request timeout wrapper around cdp_use.CDPClient.
cdp_use's `send_raw()` awaits a future that only resolves when the browser
sends a matching response. If the server goes silent mid-session (observed
failure mode against remote cloud browsers: WebSocket stays "alive" at the
TCP/keepalive layer while the browser container is dead or the proxy has
lost its upstream) the future never resolves and the whole agent hangs.
This module provides a thin subclass that wraps each `send_raw()` in
`asyncio.wait_for`. Any CDP method that doesn't get a response within the
cap raises `TimeoutError`, which propagates through existing
error-handling paths in browser-use instead of hanging indefinitely.
Configure the cap via:
- `BROWSER_USE_CDP_TIMEOUT_S` env var (process-wide default)
- `TimeoutWrappedCDPClient(..., cdp_request_timeout_s=...)` constructor arg
Default (60s) is generous for slow operations like `Page.captureScreenshot`
or `Page.printToPDF` on heavy pages, but well below the 180s agent step
timeout and the typical outer agent watchdog.
"""
from __future__ import annotations
import asyncio
import logging
import math
import os
from typing import Any
from cdp_use import CDPClient
logger = logging.getLogger(__name__)
_CDP_TIMEOUT_FALLBACK_S = 60.0
def _parse_env_cdp_timeout(raw: str | None) -> float:
"""Parse BROWSER_USE_CDP_TIMEOUT_S defensively.
Accepts only finite positive values; everything else falls back to the
hardcoded default with a warning. Mirrors the guard on
BROWSER_USE_ACTION_TIMEOUT_S in tools/service.py — a bad env value here
would otherwise make every CDP call time out immediately (nan) or never
(inf / negative / zero).
"""
if raw is None or raw == '':
return _CDP_TIMEOUT_FALLBACK_S
try:
parsed = float(raw)
except ValueError:
logger.warning(
'Invalid BROWSER_USE_CDP_TIMEOUT_S=%r; falling back to %.0fs',
raw,
_CDP_TIMEOUT_FALLBACK_S,
)
return _CDP_TIMEOUT_FALLBACK_S
if not math.isfinite(parsed) or parsed <= 0:
logger.warning(
'BROWSER_USE_CDP_TIMEOUT_S=%r is not a finite positive number; falling back to %.0fs',
raw,
_CDP_TIMEOUT_FALLBACK_S,
)
return _CDP_TIMEOUT_FALLBACK_S
return parsed
DEFAULT_CDP_REQUEST_TIMEOUT_S: float = _parse_env_cdp_timeout(os.getenv('BROWSER_USE_CDP_TIMEOUT_S'))
class TimeoutWrappedCDPClient(CDPClient):
"""CDPClient subclass that enforces a per-request timeout on send_raw.
Any CDP method that doesn't receive a response within `cdp_request_timeout_s`
raises `TimeoutError` instead of hanging forever. This turns silent-hang
failure modes (cloud proxy alive, browser dead) into fast observable errors.
"""
def __init__(
self,
*args: Any,
cdp_request_timeout_s: float | None = None,
**kwargs: Any,
) -> None:
super().__init__(*args, **kwargs)
self._cdp_request_timeout_s: float = (
cdp_request_timeout_s if cdp_request_timeout_s is not None else DEFAULT_CDP_REQUEST_TIMEOUT_S
)
async def send_raw(
self,
method: str,
params: Any | None = None,
session_id: str | None = None,
) -> dict[str, Any]:
try:
return await asyncio.wait_for(
super().send_raw(method=method, params=params, session_id=session_id),
timeout=self._cdp_request_timeout_s,
)
except TimeoutError as e:
# Raise a plain TimeoutError so existing `except TimeoutError`
# handlers in browser-use / tools treat this uniformly.
raise TimeoutError(
f'CDP method {method!r} did not respond within {self._cdp_request_timeout_s:.0f}s. '
f'The browser may be unresponsive (silent WebSocket — container crashed or proxy lost upstream).'
) from e