Client decode resilience — surfacing and recovering from decode stalls¶
SHIPPED. All six changes below are implemented. Client: a tagged logger (
widgets/src/debug.ts), observable decode/config error paths +optimizeForLatency(widgets/src/worker/videoDecode.ts), and a pure stall watchdog that rebuilds the decoder and requests recovery (widgets/src/worker/stallWatchdog.ts,tests/unit/stallWatchdog.test.ts). Server: adecoder_resetcontrol + an inflight-timeout backstop (src/pdum/rfb/session.py,tests/test_session.py). The reusable practice is written up indocs/agentic_frontend_debugging.md. Kept here for the design rationale.
Status: shipped (was: proposal). Scope: the browser client (@habemus-papadum/rfb-widgets
worker) plus one complementary server-side backstop.
Why this exists¶
A hardware-decoder quirk in the NVENC path (the SPS didn't signal
max_num_reorder_frames=0, so the browser's hardware VideoDecoder buffered its DPB
before emitting anything) froze the viewer. That root cause is fixed in the encoder
(packages/nvenc/src/cpp/nvenc_ext.cpp sets zeroReorderDelay + the VUI
bitstreamRestrictionFlag; regression test
tests/test_nvenc_gpu_pdum.py::test_binding_signals_zero_reorder_in_sps).
But the incident exposed a latent fragility that has nothing to do with NVENC: when the decoder stops producing output for any reason, the client and server deadlock silently and permanently, with nothing in the console to explain it and no path back — switching codec or scene doesn't recover it. The trigger will differ next time (a transient decode error on the frames in flight, a GPU reset, a dropped keyframe, a codec the HW decoder buffers differently); the failure shape will be identical. This proposal makes that failure observable and self-healing.
The deadlock, precisely¶
Two invariants collide:
- Client only ACKs on display. The single place the worker emits an
ackisonDisplayed()(widgets/src/worker/entry.ts:56), called from theVideoDecoder'soutputcallback (widgets/src/worker/videoDecode.ts:40). Every ACK carriesdisplayed: true. No output ⇒ no ACK. - Server only sends when
inflighthas room.RfbSessionadds each sent payload's seq toinflight(src/pdum/rfb/session.py:156) and only clears it when anackarrives (session.py:113,inflight.discard(seq)). Before encoding/sending the next frame it drops iflen(inflight) >= max_inflight(session.py:236, defaultmax_inflight=2).
So once ~2 sent frames fail to display:
decoder emits nothing ──► no displayed ACK ──► server inflight stays full
▲ │
└────────── server drops every new frame ◄─────┘ (never sends the frame
(incl. the forced keyframe) that might un-stick decode)
It is a mutual wait. Critically, the usual escape hatches don't fire:
request_keyframe(sent on gate reset / decode-queue backlog,videoDecode.ts:51,entry.ts:89) only setsforce_keyframe = Trueserver-side (session.py:122). The keyframe still can't be sent — it hits the sameinflight >= max_inflightdrop first.- Switching backend/scene doesn't help — the publisher's new frames enter the same full
inflightand are dropped. The session is stranded until the connection is torn down.
The observability gap¶
Even while deadlocked, the client says nothing:
- The
VideoDecodererror callback is silent (videoDecode.ts:49): it resets the gate and requests a keyframe, but neverconsole.warns and never posts anerrormessage — so the app'sonError(RemoteFramebufferView.ts:231) never fires and the console is empty. This is the literal "no console logs to help debug" the incident reported. - The stall watchdog that does exist keys off
decodeQueueSize(shouldRequestKeyframe,backpressure.ts:47). A decoder that accepts chunks and buffers them (the reorder case) keepsdecodeQueueSizelow, so this heuristic never trips. It detects a backlog, not a stall. - The HUD's
displayedcounter simply stops incrementing; there is no "received N chunks, displayed 0" signal that would name the problem.
Proposed changes¶
Ordered by leverage. (1) and (2) are the minimum that turns a silent brick into a visible, self-healing hiccup; (5) is the robust backstop that breaks the deadlock even against a buggy or old client.
-
Make decode errors observable. In the
VideoDecodererrorhandler (videoDecode.ts:49) and on watchdog trip:console.warnwith the codec + error, andpost({ type: "error", … })/ a softerdecode_warningsoonError/onStatscan show it. Distinguish recoverable (reset + keyframe) from fatal (configure threw / codec unsupported). -
A real stall watchdog. Track
queued − displayedand wall-clock since the lastoutput. If chunks were queued but nothing displayed for ~1 s, treat it as a stall: surface it (per 1), then run recovery (per 3). This is the signaldecodeQueueSizecan't give. -
Recovery that actually un-sticks decode. On stall: fully rebuild the decoder (
close()+ freshconfigure()), re-arm theKeyframeGate, and request a keyframe. A freshconfigure()clears any bad decoder buffering state that a keyframe alone won't. -
Release the server's
inflighton a client-side stall. Recovery is useless if the server still won't send. The client must tell the server to stop waiting for frames it will never display — e.g. adecoder_reset/nack{through_seq}control that clearsinflight(and forces a keyframe) server-side. Do not just ACK the stalled seqs withdisplayed: true— that would corrupt the displayed-FIFO seq attribution and RTT. A distinct control keeps the "displayed means displayed" invariant intact. -
Server-side inflight timeout (defense in depth). Independently of the client, if a seq sits in
inflightunacked for > T (e.g. 2 s), the session should drop it, force a keyframe, and log. This single change breaks the deadlock from the server side even when the client is an old build or wedged in a way (4) doesn't cover — the most robust backstop, and it belongs with the existing latest-frame-wins policy insession.py. -
optimizeForLatency: trueinconfigure()(complementary). Correct for a real-time, low-latency stream and reduces decoder-side buffering/latency generally. Note it is a hint: in this incident the hardware decoder ignored it (which is why the encoder-side VUI fix was the real cure), so it is defense in depth, not a substitute for (1)–(5).
Testing¶
- Vitest can cover the core logic without a browser: the deadlock is reproducible purely
from "N chunks queued via
BackpressureController.onQueued, zeroonDisplayed" → assert the watchdog trips, an error is surfaced, and a reset/nackis emitted. The server inflight-timeout is aRfbSessionunit test (advance the fake clock, assertinflightclears and a keyframe is forced). - Headless caveat. Playwright's SwiftShader is a software decoder and emits regardless
of the VUI, so it cannot reproduce the original hardware-buffering trigger — only a real
GPU-backed Chrome does. Resilience tests must therefore simulate the stall (no
output) rather than rely on a real decoder to stall.
Out of scope¶
- The NVENC reorder VUI itself (already fixed upstream of this doc).
- Reworking the backpressure model; this proposal keeps latest-frame-wins and the displayed-ACK semantics, adding only stall detection, recovery, and a timeout backstop.