Internals¶
How the pieces fit together, the wire protocol, and the design decisions behind the session loop, the H.264 path, and the worker. Read the Python and JavaScript guides first for the public API.
End-to-end data flow¶
This is the per-connection, session-internal view. The public entry point is
display.publish(ndarray), which stores the latest frame; each connection's
_ClientFeed is the FrameSource the session pulls from below, and input drained
by display.poll_events() is what arrives as handle_event here.
Python                             Browser (main thread)              Worker
------                             ---------------------              ------
FrameSource.next_frame() ─ RawFrame
        │
        ▼
EncoderBackend.encode() ─ EncodedPayload
        │   (image bytes | H.264 Annex B AU)
        ▼
RfbSession.encode_loop ── pack ──► WebSocket ───────────────────────► onmessage
        ▲                                                                 │
        │   ack / request_keyframe / event / set_viewport                 ▼
RfbSession.recv_loop ◄──────────── WebSocket ◄── normalized events    unpack header
        │                                               ▲                 │
        ▼                                               │      image_frame│video_chunk
FrameSource.handle_event           DOM events (pointer/key/         │           │
                                   wheel/resize) normalized createImageBitmap | VideoDecoder
                                   on the main thread               │           │
                                                                    ▼           ▼
                                                               draw → OffscreenCanvas
Three concerns stay independent so one API can negotiate the best backend: FrameSource → EncoderBackend → transport.
Wire protocol¶
Two kinds of messages share one WebSocket.
Control (JSON text). Client → server: hello, ack, request_keyframe,
set_viewport, event. Server → client: config (sent right after hello; carries
coords:"frame-pixels" and, when non-default, pixel_ratio / color), plus optional
set_quality / stats.
Payloads (binary). Each image or encoded video access unit is one binary message with a self-describing envelope:
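# pdum/rfb/protocol.py
import json
import struct

def pack_binary_message(header: dict, payload: bytes) -> bytes:
    # Compact JSON header, prefixed with its byte length as a little-endian uint32.
    h = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<I", len(h)) + h + bytes(payload)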
// widgets/src/protocol.ts
export function unpackBinaryMessage(input: ArrayBuffer | Uint8Array): UnpackedMessage;
A single self-describing envelope is deliberately chosen over a two-message
"JSON header, then binary payload" scheme: it is atomic (no pairing state, no
"binary arrived before its header" race) and keeps ordering trivial. The Python
packer and the TypeScript unpacker are kept byte-compatible by committed fixtures
(widgets/tests/fixtures/protocol/*) generated from pack_binary_message and
asserted in Vitest.
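As a rough illustration of the envelope, the sketch below pairs the packer with a Python reader. unpack_binary_message here is a hypothetical mirror of the TypeScript unpackBinaryMessage, written only for this example; the package is not known to export it. The packer is repeated so the block runs on its own.

```python
import json
import struct
from typing import Tuple


def pack_binary_message(header: dict, payload: bytes) -> bytes:
    """Length-prefixed compact JSON header followed by the raw payload."""
    h = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<I", len(h)) + h + bytes(payload)


def unpack_binary_message(message: bytes) -> Tuple[dict, bytes]:
    """Hypothetical reader: split one envelope back into (header, payload)."""
    if len(message) < 4:
        raise ValueError("envelope shorter than its 4-byte length prefix")
    (header_len,) = struct.unpack_from("<I", message, 0)
    end = 4 + header_len
    if end > len(message):
        raise ValueError("header length exceeds message size")
    header = json.loads(message[4:end].decode("utf-8"))
    return header, bytes(message[end:])


# Round-trip one image message; the payload bytes are a placeholder.
header = {"type": "image_frame", "seq": 7, "width": 2, "height": 1, "mime": "image/png"}
wire = pack_binary_message(header, b"\x89PNG")
decoded_header, decoded_payload = unpack_binary_message(wire)
```

Because the header length travels in the same message as the header and payload, the reader needs no state beyond the single buffer it was handed.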
Image header: {type:"image_frame", seq, timestamp_us, width, height, mime}.
Video header: {type:"video_chunk", seq, timestamp_us, duration_us, width, height,
codec, bitstream:"annexb", keyframe}. Both gain optional pixel_ratio (render DPR) and
color (descriptor) fields, emitted only when non-default so committed fixtures and
older clients are unaffected; the session stamps them from the source RawFrame after
encode (so no encoder needs to know they exist). See
Sizing, DPR & color.
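The "emitted only when non-default" rule can be sketched as a header builder. The function name image_frame_header, the default pixel_ratio of 1.0, and None as the default color are assumptions made for illustration; the real builder names and defaults are not stated here, and in the real session the two fields are stamped after encode rather than passed in.

```python
from typing import Optional


def image_frame_header(seq: int, timestamp_us: int, width: int, height: int,
                       mime: str, pixel_ratio: float = 1.0,
                       color: Optional[dict] = None) -> dict:
    """Build an image_frame header, adding optional fields only when non-default."""
    header = {
        "type": "image_frame",
        "seq": seq,
        "timestamp_us": timestamp_us,
        "width": width,
        "height": height,
        "mime": mime,
    }
    if pixel_ratio != 1.0:       # assumed default; omitted so older fixtures stay identical
        header["pixel_ratio"] = pixel_ratio
    if color is not None:        # assumed default; descriptor shape is not modelled here
        header["color"] = color
    return header


plain = image_frame_header(1, 0, 640, 480, "image/jpeg")
hidpi = image_frame_header(2, 16_667, 1280, 960, "image/jpeg", pixel_ratio=2.0)
```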
Capability negotiation¶
worker: probeCapabilities() ─ hello{supported:[...], device_pixel_ratio}
            │
server: select_transport(supported, has_h264, has_nvenc) ─ BackendSelection
            │
server: build_encoder(selection) ─ config{transport, codec, width, height}
select_transport prefers H.264 when the client lists webcodecs/h264-annexb and
a video encoder exists (NVENC over libx264 when present), else the best shared
image format. has_nvenc already exists in the signature so an NVENC backend
changes no callers.
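The preference order described above might look roughly like the following. This is a sketch, not the library's select_transport: the Selection dataclass and its fields, the function name pick_transport, the assumption that image formats appear in supported as MIME strings, and the image preference order are all hypothetical. Only the webcodecs/h264-annexb token and the H.264-first, NVENC-over-libx264 rule come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumed preference order for the image fallback.
IMAGE_PREFERENCE = ("image/webp", "image/jpeg", "image/png")


@dataclass(frozen=True)
class Selection:
    transport: str            # "video" or "image"
    codec: Optional[str]      # "h264" or an image MIME type
    backend: str              # which encoder family would be built


def pick_transport(supported: Sequence[str], has_h264: bool,
                   has_nvenc: bool = False) -> Selection:
    """Prefer H.264 when both sides can do it, else the best shared image format."""
    if "webcodecs/h264-annexb" in supported and (has_h264 or has_nvenc):
        return Selection("video", "h264", "nvenc" if has_nvenc else "libx264")
    for mime in IMAGE_PREFERENCE:
        if mime in supported:
            return Selection("image", mime, "image")
    raise ValueError("no transport shared between client and server")


client = ["webcodecs/h264-annexb", "image/jpeg"]
cpu_choice = pick_transport(client, has_h264=True)
gpu_choice = pick_transport(client, has_h264=True, has_nvenc=True)
fallback = pick_transport(client, has_h264=False)
```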
The session loop¶
RfbSession.run() runs two coroutines under an asyncio.TaskGroup:
- recv_loop iterates inbound messages, dispatching via _handle_control: ack clears the in-flight set, request_keyframe arms a keyframe, event and set_viewport go to source.handle_event.
- encode_loop repeatedly runs _encode_step: pull the next frame, and if len(inflight) >= max_inflight drop it before encoding (and force the next sent frame to be a keyframe); otherwise rebuild the encoder if the frame size changed, encode in a worker thread, and send.
Key decisions:
- Encode off the event loop. encode() is CPU-bound and synchronous, so it runs via await asyncio.to_thread(...); the receive loop keeps draining ACKs and the in-flight set keeps moving.
- Latest-frame-wins, drop before encoding. Dropping already-encoded delta frames would strand the browser on references it never received; dropping pre-encode and forcing the next keyframe keeps the stream decodable. The first frame to every client is a keyframe.
- Fixed-resolution encoders. libx264 is configured for one size, so a frame whose dimensions changed triggers encoder_factory(w, h) and a forced keyframe; the browser re-configure()s its decoder on the new coded size.
- Clean shutdown. A client disconnect surfaces as ConnectionClosed; both loops swallow it and set closed, so the TaskGroup completes without noise.
Single-step helpers (_encode_step, _handle_control) exist so unit tests drive
the policy deterministically with a FakeWebSocket + FakeEncoder, with no socket
or thread scheduling.
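The drop and keyframe rules above can be modelled as a small state machine, in the same spirit as the single-step helpers. This is a toy model rather than the real _encode_step: the class name EncodeStepPolicy, its return tuples, and the per-seq ack are assumptions, and it does no encoding or I/O.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class EncodeStepPolicy:
    """Toy model of the drop/keyframe decisions made for each pulled frame."""
    max_inflight: int = 2
    inflight: Set[int] = field(default_factory=set)
    force_keyframe: bool = True                  # first frame to a client is a keyframe
    size: Optional[Tuple[int, int]] = None

    def step(self, seq: int, width: int, height: int) -> Tuple[str, int, bool]:
        """Return (action, seq, keyframe) for one pulled frame."""
        if len(self.inflight) >= self.max_inflight:
            # Drop before encoding, and make the next sent frame a keyframe.
            self.force_keyframe = True
            return ("drop", seq, False)
        if self.size != (width, height):
            # A size change means a rebuilt encoder, which starts on a keyframe.
            self.size = (width, height)
            self.force_keyframe = True
        keyframe, self.force_keyframe = self.force_keyframe, False
        self.inflight.add(seq)
        return ("send", seq, keyframe)

    def ack(self, seq: int) -> None:
        """Assumed ACK handling: release one sent-but-unacked frame."""
        self.inflight.discard(seq)


policy = EncodeStepPolicy(max_inflight=2)
log = [policy.step(1, 640, 480), policy.step(2, 640, 480), policy.step(3, 640, 480)]
policy.ack(1)
log.append(policy.step(4, 640, 480))   # first frame sent after a drop
policy.ack(2)
log.append(policy.step(5, 800, 600))   # frame size changed
```

Frame 3 is dropped because two frames are unacknowledged, so frame 4 goes out as a keyframe; frame 5 is a keyframe again because its dimensions differ.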
The H.264 path¶
H264CpuEncoder uses a bare av.CodecContext.create("libx264", "w"), which emits
Annex B with in-band SPS/PPS — exactly WebCodecs' Annex B mode (never route
through an mp4 muxer, which produces AVCC). The gaps in the original sketch are
fixed:
- Forced IDR: forced-idr=1 at creation and per-frame vf.pict_type = PictureType.I on a forced keyframe (a plain I frame without forced-idr can be a non-IDR frame that the browser treats as a delta).
- Pixel format: RGB is explicitly reformatted to yuv420p (PyAV does not auto-convert on encode); dimensions are even.
- Low latency: ultrafast / zerolatency, bframes=0, keyint=min-keyint=fps for a 1-second IDR cadence, annexb=1 / repeat-headers=1 to keep parameter sets in-band.
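One way to check that a forced keyframe really is an IDR with in-band parameter sets is to scan the access unit's NAL unit types. The helper names below are hypothetical and the sample bytes are hand-made placeholders, not encoder output; the start codes and type numbers (7 = SPS, 8 = PPS, 5 = IDR slice, 1 = non-IDR slice) follow the H.264 Annex B byte-stream format.

```python
from typing import Iterator, List

NAL_NON_IDR, NAL_IDR, NAL_SPS, NAL_PPS = 1, 5, 7, 8


def iter_nal_types(au: bytes) -> Iterator[int]:
    """Yield nal_unit_type for each NAL unit in an Annex B access unit."""
    pos = 0
    while True:
        # A 4-byte start code (00 00 00 01) ends with the 3-byte one, so one search covers both.
        start = au.find(b"\x00\x00\x01", pos)
        if start < 0 or start + 3 >= len(au):
            return
        yield au[start + 3] & 0x1F   # low 5 bits of the NAL header byte
        pos = start + 3


def is_self_contained_keyframe(au: bytes) -> bool:
    """True when the access unit carries SPS, PPS and an IDR slice."""
    types: List[int] = list(iter_nal_types(au))
    return NAL_SPS in types and NAL_PPS in types and NAL_IDR in types


# Placeholder access units: only the start codes and header bytes are meaningful.
keyframe_au = (b"\x00\x00\x00\x01\x67\x42"
               b"\x00\x00\x00\x01\x68\xce"
               b"\x00\x00\x01\x65\x88")
delta_au = b"\x00\x00\x01\x41\x9a"
keyframe_types = list(iter_nal_types(keyframe_au))
```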
On the browser, VideoDecoder.configure({codec, codedWidth, codedHeight}) omits
description (SPS/PPS are in-band). The KeyframeGate drops delta chunks until the
first keyframe after every connect/reconnect/reconfigure; a decoder error resets
the gate and sends request_keyframe.
Because there are no B-frames, decoder output order equals input order, so a
FIFO of queued seqs attributes each displayed frame for the displayed:true
ACK. (Enabling B-frames would break that assumption — it is documented in the
code.)
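The gate and the FIFO attribution are browser-side TypeScript, but the logic is small enough to model in Python. The sketch below is such a model under stated assumptions: the method names admit, on_decode and on_output are invented for the example, and the decoder is replaced by the rule that output order equals input order.

```python
from collections import deque
from typing import Deque, List, Tuple


class KeyframeGate:
    """Drops delta chunks until a keyframe has been seen since the last reset."""

    def __init__(self) -> None:
        self.open = False

    def reset(self) -> None:
        self.open = False

    def admit(self, keyframe: bool) -> bool:
        if keyframe:
            self.open = True
        return self.open


class SeqFifo:
    """Attributes decoder outputs to seqs, valid only while output order == input order."""

    def __init__(self) -> None:
        self._queued: Deque[int] = deque()

    def on_decode(self, seq: int) -> None:
        self._queued.append(seq)

    def on_output(self) -> int:
        return self._queued.popleft()


gate, fifo = KeyframeGate(), SeqFifo()
chunks: List[Tuple[int, bool]] = [(1, False), (2, True), (3, False)]  # (seq, keyframe)
for seq, keyframe in chunks:
    if gate.admit(keyframe):
        fifo.on_decode(seq)

displayed = [fifo.on_output(), fifo.on_output()]
gate.reset()                        # e.g. after a decoder error or reconnect
admitted_after_reset = gate.admit(False)
```

Chunk 1 arrives before any keyframe and is discarded, so the first displayed seq is 2.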
Pipelined encode (token-based seq attribution)¶
By default the encoders are synchronous 1-in-1-out: encode(frame_N) returns
frame N's AU, stamped seq=N. serve(encode_pipeline_depth=k>0) opts into a
pipelined path on backends that implement it (see
pipelined_encode.md). The design point is that the
session does not change — it already iterates payloads and books each
payload.seq (_encode_step lines ~242–243, send_payload). All the pipelining
lives below the EncoderBackend seam:
- Binding (packages/vtenc's VtEncoder): alongside the synchronous encode()→bytes, a submit(frame, seq, force_idr) issues VTCompressionSessionEncodeFrame without CompleteFrames, carrying seq as the per-frame sourceFrameRefCon. The output callback recovers it and pushes a seq-tagged access unit onto a queue; submit() drains whatever is ready and returns a list of (recovered_seq, annexb, keyframe) tuples (0..N). flush_pipeline() completes the in-flight tail. packages/nvenc's NvencEncoder implements the same submit()/flush_pipeline() shape at extra_output_delay=depth; NVIDIA's vendored NvEncoder helper owns inputTimeStamp, so it recovers seq from an in-order FIFO of the tags pushed at submit() instead, which is valid because frameIntervalP=1 forces output order == input order (see pipelined_encode.md).
- Wrapper (encoders/vtenc.py VideoToolboxEncoder, and encoders/nvenc_gpu_pdum.py NvencGpuPdumEncoder): with pipeline_depth>0, encode() calls submit() and stamps each payload with the recovered seq (not the call's frame.seq), looking the original timestamp_us up from a small {seq: timestamp_us} in-flight map. pipeline_depth=0 keeps the synchronous path byte-identical.
- Plumbing: build_encoder(..., pipeline_depth=) forwards it to the video factory; only backends that implement the pipelined path consume it (the rest drop it → a depth>0 request runs synchronously, not an error). serve()/add_stream() thread encode_pipeline_depth to the per-stream _StreamHost, whose factory passes it on.
This keeps the no-B-frames invariant (output order == input order); only the
delay between input and output grows. The drop/keyframe policy is unchanged — frames
are still dropped before submit, so the encoder pipeline only ever holds a valid
reference chain. Backpressure's max_inflight bounds the wire (sent-but-unacked),
independent of the encoder's in-flight depth. On VideoToolbox this path is correct but
not faster (low-latency RC is synchronous); it exists for the NVENC backend.
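To make the token-based attribution concrete, here is a sketch with a fake binding that releases each access unit a fixed number of submits late. FakePipelinedBinding and PipelinedWrapper are hypothetical stand-ins, not the vtenc or nvenc classes; only the submit() / flush_pipeline() shape, the (recovered_seq, annexb, keyframe) tuples, and the {seq: timestamp_us} map are taken from the description above.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Unit = Tuple[int, bytes, bool]   # (recovered_seq, annexb, keyframe)


class FakePipelinedBinding:
    """Stand-in encoder: holds `depth` frames back, preserving input order."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self._pending: Deque[Unit] = deque()

    def submit(self, frame: bytes, seq: int, force_idr: bool) -> List[Unit]:
        self._pending.append((seq, b"AU:" + frame, force_idr))
        ready = []
        while len(self._pending) > self.depth:
            ready.append(self._pending.popleft())
        return ready

    def flush_pipeline(self) -> List[Unit]:
        ready = list(self._pending)
        self._pending.clear()
        return ready


class PipelinedWrapper:
    """Stamps each payload with the recovered seq, not the seq of the current call."""

    def __init__(self, binding: FakePipelinedBinding) -> None:
        self._binding = binding
        self._timestamps: Dict[int, int] = {}   # in-flight {seq: timestamp_us}

    def _stamp(self, unit: Unit) -> dict:
        seq, annexb, keyframe = unit
        return {"seq": seq, "timestamp_us": self._timestamps.pop(seq),
                "keyframe": keyframe, "data": annexb}

    def encode(self, frame: bytes, seq: int, timestamp_us: int,
               force_idr: bool = False) -> List[dict]:
        self._timestamps[seq] = timestamp_us
        return [self._stamp(u) for u in self._binding.submit(frame, seq, force_idr)]

    def flush(self) -> List[dict]:
        return [self._stamp(u) for u in self._binding.flush_pipeline()]


enc = PipelinedWrapper(FakePipelinedBinding(depth=2))
first = enc.encode(b"a", seq=0, timestamp_us=100, force_idr=True)
second = enc.encode(b"b", seq=1, timestamp_us=200)
third = enc.encode(b"c", seq=2, timestamp_us=300)
tail = enc.flush()
```

The third call returns frame 0's payload, which is why the caller books payload.seq rather than the seq it just submitted.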
The worker¶
One unified worker handles both transports (selected per message by header
type) because there is exactly one WebSocket and one transferred OffscreenCanvas
per session, and the server may switch transport mid-session.
entry.ts               bootstrap; owns the WebSocket; routes control vs binary;
                       forwards main-thread messages (init/event/resize/capture/dispose)
connection (in entry)  hello after probeCapabilities; keyframe-gate reset on (re)open
renderer.ts            OffscreenCanvas 2D wrapper: draw / resize / readPixels / toBlob
imageDecode.ts         image_frame -> createImageBitmap -> draw -> bitmap.close()
videoDecode.ts         VideoPipeline: VideoDecoder lifecycle, gate, FIFO seq attribution
backpressure.ts        BackpressureController + KeyframeGate (pure, unit-tested)
Resource lifetime is explicit: every VideoFrame and ImageBitmap is close()d
immediately after drawing, or the decoder stalls within seconds. Payload views are
copied into fresh Uint8Arrays before handing them to Blob/EncodedVideoChunk.
transferControlToOffscreen() is one-way: after transfer the main thread must
never touch that canvas's bitmap, so all resize/DPR changes are messaged to the
worker, which sets OffscreenCanvas.width/height.
Main ↔ worker contract¶
Main → worker: init (transfers the canvas + url + options), event, resize,
capture, dispose. Worker → main: ready, state, stats, capture-result
(carries the ImageData/Blob and the lastDisplayedSeq it measured), error.
Event normalization + the coordinate contract¶
events.ts (main thread) maps DOM events to the
renderview vocabulary (shared by jupyter_rfb /
pygfx / fastplotlib) in CSS canvas coordinates, then posts them to the worker.
mapButton/mapButtons translate DOM button enums/bitmask to renderview's
0=none,1=left,2=right,3=middle (button) and pressed-button tuple (buttons).
computeBackingSize derives the backing store and the effective ratio reported in
set_viewport (logical width/height + physical pwidth/pheight + ratio).
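The button translation can be illustrated in Python, although the real mapButton / mapButtons live in TypeScript. The DOM side is standard (MouseEvent.button: 0 = left, 1 = middle, 2 = right; MouseEvent.buttons bits: 1 = left, 2 = right, 4 = middle). Mapping any other DOM button to 0 is an assumption of this sketch, as are the Python function names.

```python
from typing import Tuple

# DOM MouseEvent.button -> renderview button (1=left, 2=right, 3=middle).
DOM_BUTTON_TO_RENDERVIEW = {0: 1, 1: 3, 2: 2}

# (DOM MouseEvent.buttons bit, renderview button), in renderview order.
DOM_BUTTONS_BITS = ((1, 1), (2, 2), (4, 3))


def map_button(dom_button: int) -> int:
    """Translate the button that changed; 0 means none (assumed for unknown values)."""
    return DOM_BUTTON_TO_RENDERVIEW.get(dom_button, 0)


def map_buttons(dom_mask: int) -> Tuple[int, ...]:
    """Translate the pressed-buttons bitmask into a tuple of renderview buttons."""
    return tuple(rv for bit, rv in DOM_BUTTONS_BITS if dom_mask & bit)


middle_click = map_button(1)
left_and_middle_held = map_buttons(1 | 4)
```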
The frame-pixel contract. All frame↔canvas geometry lives in one pure, unit-tested
module, viewport.ts (frameDestRect / backingToFrame, for fill/contain/cover) —
used by both the renderer (drawing, letterboxed per fit) and the worker's event path.
Before sending, the worker remaps each pointer/wheel event CSS → backing (× DPR) → frame
pixels through the current fit, adding inside (false in letterbox padding) and
pixel_ratio (the frame's render DPR). So the publisher receives coordinates that index
its framebuffer directly, correct under any fit or DPR — this is the invariant the original
HiDPI off-by-DPR bug violated, and the reason drawing and hit-testing share one transform.
The frame's render DPR cancels out of the geometry (it scales frame and backing
symmetrically), so it rides only as the event echo. config.coords advertises the space
("frame-pixels", unconditional in this version).
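The shared transform can be sketched in a few lines. This is a Python model of the idea, not a port of viewport.ts: the function names, the centred placement of the drawn frame, and the half-open inside test are assumptions. It shows the CSS → backing (× DPR) → frame-pixel chain and why letterbox padding reports inside as false.

```python
from typing import Tuple


def frame_dest_rect(frame_w: float, frame_h: float, backing_w: float,
                    backing_h: float, fit: str = "contain") -> Tuple[float, float, float, float]:
    """Return (x, y, w, h) of the drawn frame inside the backing store."""
    sx, sy = backing_w / frame_w, backing_h / frame_h
    if fit == "contain":
        sx = sy = min(sx, sy)      # letterbox: whole frame visible
    elif fit == "cover":
        sx = sy = max(sx, sy)      # crop: backing fully covered
    elif fit != "fill":
        raise ValueError(f"unknown fit mode: {fit!r}")
    w, h = frame_w * sx, frame_h * sy
    return ((backing_w - w) / 2, (backing_h - h) / 2, w, h)   # assumed centred


def css_to_frame(css_x: float, css_y: float, dpr: float, frame_w: float, frame_h: float,
                 backing_w: float, backing_h: float,
                 fit: str = "contain") -> Tuple[float, float, bool]:
    """Map a CSS-pixel point to frame pixels; also report whether it hit the frame."""
    x, y, w, h = frame_dest_rect(frame_w, frame_h, backing_w, backing_h, fit)
    fx = (css_x * dpr - x) * frame_w / w
    fy = (css_y * dpr - y) * frame_h / h
    inside = 0 <= fx < frame_w and 0 <= fy < frame_h
    return fx, fy, inside


# A 100x50 frame in a 100x100 CSS canvas at DPR 2 (200x200 backing store).
centre = css_to_frame(50, 50, 2.0, 100, 50, 200, 200)
padding = css_to_frame(50, 10, 2.0, 100, 50, 200, 200)
```

With contain, the frame is drawn at 200x100 starting 50 backing pixels down, so the canvas centre lands on frame pixel (50, 25) and a point near the top edge falls in the padding.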
Module map¶
Push model. The public API is serve(width, height) -> Display; you call display.publish(ndarray) from your own loop and drain input with display.poll_events(). serve() runs the WebSocket server as a background task. Each connection gets its own RfbSession, fed by an internal per-connection _ClientFeed (the FrameSource the session pulls). The pull FrameSource classes in sources.py are internal-only now. RfbSession is unchanged: it sees a Channel (transport.py) and a feed, both satisfying its thin seams.
Two additive layers sit on top of that core without changing it: a hub (serve_server() → Server) fronts several named Displays on one port, routed by URL path; and an opt-in ASGI front-end (asgi.py) drives the same per-connection lifecycle (_StreamHost._serve_connection) over a Starlette WebSocket. Both reuse the identical session/encoder/backpressure path.
src/pdum/rfb/
  types.py              RawFrame, EncodedPayload, InputEvent, FrameSource/EncoderBackend protocols (dep-free)
  protocol.py           envelope, header builders, control parsing, select_transport
  session.py            RfbSession: loops, backpressure, keyframe policy
  display.py            Display (publish/poll_events/aclose) + internal _ClientFeed (per connection)
  auth.py               AuthContext / Authenticator / Principal (pluggable, no JWT dep)
  transport.py          Channel protocol + WebSocketTransport (the transport seam)
  asgi.py               opt-in [asgi] Starlette front-end: rfb_endpoint / rfb_hub_endpoint
                        + _AsgiConn (same core; drives _StreamHost._serve_connection)
  sources.py            BaseFrameSource, RenderCallbackSource, OnDemandFrameSource (internal now)
  gpu.py                zero-copy GPU helpers: rgb_to_nv12, cuda_frame, context sharing, probes
  metrics.py            SessionMetrics (encode_ms, bytes, RTT, fps, bitrate, ...)
  adaptive.py           AdaptiveQualityController (opt-in via serve(adaptive=True))
  benchmark.py          `python -m pdum.rfb.benchmark` — offline image vs H.264 w/ real PSNR
  cli.py                `pdum-rfb` CLI: doctor (probe encode paths) + benchmark
  server.py             serve()->Display, serve_server()->Server hub (named streams,
                        URL-path routing, /streams REST), _StreamHost (transport-neutral
                        _serve_connection), _WsConn adapter, `python -m` CLI
  encoders/
    base.py             registry + build_encoder (registers h264_cpu + nvenc_cpu + nvenc_gpu_pyav + nvenc_gpu_pdum)
    image.py            ImageEncoder (Pillow)
    h264_cpu.py         H264CpuEncoder + h264_cpu_available / self_test
    nvenc_cpu.py        NvencCpuEncoder (host-input GPU h264_nvenc) + nvenc_cpu_available
    nvenc_gpu_pyav.py   NvencGpuPyavEncoder (zero-copy CUDA NV12 -> h264_nvenc, PyAV >= 18)
    nvenc_gpu_pdum.py   NvencGpuPdumEncoder (PyAV-free; rides habemus-papadum-nvenc / pdum.nvenc)
  testing.py            SyntheticFrameSource, fakes, NAL/decode helpers, fixture gen
widgets/src/
  index.ts                   public exports
  RemoteFramebufferView.ts   main-thread controller (canvas, events, resize, capture, setFit)
  viewport.ts                pure frame<->canvas geometry (frameDestRect / backingToFrame; fit modes)
  protocol.ts events.ts eventTypes.ts capabilities.ts backpressure.ts types.ts
  workerFactory.ts           inline worker (?worker&inline)
  worker/{entry,renderer,imageDecode,videoDecode}.ts
Testing architecture¶
Three layers verify the system with no display and no manual clicking:
- Python (pytest). Protocol round-trips (+ golden fixtures for the JS side), image-encoder validity (re-decoded with Pillow), session invariants (max_inflight, keyframe-first, latest-frame-wins, forced-keyframe-on-drop, event delivery), negotiation, and, for H.264, decoding the produced Annex B bitstream back with PyAV to prove validity. One real loopback-socket test covers the handshake + HTTP side channel.
- JS unit (Vitest). The protocol unpacker is asserted byte-for-byte against the Python-generated fixtures; event-coordinate scaling and the backpressure/keyframe-gate logic are tested in isolation.
- Browser e2e (Playwright + headless Chromium). webServer boots the Python server (streaming the deterministic test_card pattern) and a production build of the demo. A spec decodes real frames, reads back canvas pixels via the capture hook, and checks they form a valid render_test_pattern(k) frame: the four palette colors in the correct spatial cycle, via matchedRotation (the TypeScript mirror of Python's render_test_pattern; flat quadrant colors keep lossy decode within tolerance). The check is on the frame's structure, not a specific seq: the browser-visible lastDisplayedSeq is a per-client wire counter, not the server's render counter, so the two need not match. A second spec injects real pointer/key/wheel events and asserts the server received the normalized versions via GET /recorded-events. The image path is unconditional; the H.264 path is gated on VideoDecoder.isConfigSupported and skipped-with-log where the browser lacks avc1.
Extension points¶
- Encoders. register_video_encoder(name, factory) + the has_nvenc flag in select_transport are the seam new backends slot into with no changes to the session or transport. Three NVENC backends already ride it: nvenc_cpu (host-input h264_nvenc), nvenc_gpu_pyav (zero-copy CUDA→NVENC via PyAV ≥ 18), and nvenc_gpu_pdum (PyAV-free, via the sibling pdum.nvenc package).
- Frames. Push ndarrays, CuPy/DLPack CUDA tensors, or RawFrames to Display.publish(); a CUDA tensor becomes a memory="cuda" frame for the GPU encoders. The internal per-connection _ClientFeed is the FrameSource the session pulls; BaseFrameSource remains for internal/test use.
- Transport. The session only needs an object with await send(...) and async iteration. The Transport/Channel abstraction wraps this without touching the encoder or source layers; the websockets listener and the ASGI/Starlette front-end both ride the same seam.
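The transport seam is narrow enough to satisfy with an in-memory object, which is roughly what a test fake would do. The sketch below assumes only the two requirements stated above (an awaitable send and async iteration); ChannelLike, InMemoryChannel and echo_session are hypothetical names for this example, not the package's Channel or session classes.

```python
import asyncio
from typing import AsyncIterator, Iterable, List, Protocol, Union

Message = Union[str, bytes]


class ChannelLike(Protocol):
    """The assumed seam: awaitable send plus async iteration over inbound messages."""

    async def send(self, data: Message) -> None: ...

    def __aiter__(self) -> AsyncIterator[Message]: ...


class InMemoryChannel:
    """Replays scripted inbound messages and records everything that is sent."""

    def __init__(self, inbound: Iterable[Message]) -> None:
        self._inbound = list(inbound)
        self.sent: List[Message] = []

    async def send(self, data: Message) -> None:
        self.sent.append(data)

    def __aiter__(self) -> AsyncIterator[Message]:
        return self._replay()

    async def _replay(self) -> AsyncIterator[Message]:
        for message in self._inbound:
            yield message


async def echo_session(channel: ChannelLike) -> None:
    """Stand-in for a session: it touches the channel only through the seam."""
    async for message in channel:
        await channel.send(message)


channel = InMemoryChannel(['{"type":"hello"}', b"\x00\x01"])
asyncio.run(echo_session(channel))
```

Any object with this shape, whether backed by a websockets connection or a Starlette WebSocket adapter, could be handed to the same session code.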