
Internals

How the pieces fit together, the wire protocol, and the design decisions behind the session loop, the H.264 path, and the worker. Read the Python and JavaScript guides first for the public API.

End-to-end data flow

This is the per-connection, session-internal view. The public entry point is display.publish(ndarray), which stores the latest frame; each connection's _ClientFeed is the FrameSource the session pulls from below, and input drained by display.poll_events() is what arrives as handle_event here.

 Python                                   Browser (main thread)        Worker
 ------                                   ---------------------        ------
 FrameSource.next_frame() ─ RawFrame
 EncoderBackend.encode() ─ EncodedPayload
        │  (image bytes | H.264 Annex B AU)
 RfbSession.encode_loop ── pack ──► WebSocket ───────────────────────► onmessage
        ▲                                                                  │
        │  ack / request_keyframe / event / set_viewport                   ▼
 RfbSession.recv_loop ◄──────────── WebSocket ◄── normalized events    unpack header
        │                                            ▲                     │
        ▼                                            │              image_frame│video_chunk
 FrameSource.handle_event                  DOM events (pointer/key/      │     │
                                            wheel/resize) normalized  createImageBitmap | VideoDecoder
                                            on the main thread            │     │
                                                                          ▼     ▼
                                                                   draw → OffscreenCanvas

Three concerns stay independent so one API can negotiate the best backend: FrameSource → EncoderBackend → transport.

Wire protocol

Two kinds of messages share one WebSocket.

Control (JSON text). Client → server: hello, ack, request_keyframe, set_viewport, event. Server → client: config (sent right after hello; carries coords:"frame-pixels" and, when non-default, pixel_ratio / color), plus optional set_quality / stats.

Payloads (binary). Each image or encoded video access unit is one binary message with a self-describing envelope:

uint32le header_byte_length | utf8 JSON header | raw payload bytes

# pdum/rfb/protocol.py
import json
import struct

def pack_binary_message(header: dict, payload: bytes) -> bytes:
    # 4-byte little-endian header length, then the compact JSON header, then the payload.
    h = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<I", len(h)) + h + bytes(payload)

// widgets/src/protocol.ts
export function unpackBinaryMessage(input: ArrayBuffer | Uint8Array): UnpackedMessage;

A single self-describing envelope is deliberately chosen over a two-message "JSON header, then binary payload" scheme: it is atomic (no pairing state, no "binary arrived before its header" race) and keeps ordering trivial. The Python packer and the TypeScript unpacker are kept byte-compatible by committed fixtures (widgets/tests/fixtures/protocol/*) generated from pack_binary_message and asserted in Vitest.
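To make the envelope layout concrete, here is a minimal round-trip sketch. pack_binary_message follows the packer shown in this section; unpack_binary_message is a hypothetical Python mirror of the TypeScript unpacker, written only for illustration and not part of the documented API.

```python
import json
import struct


def pack_binary_message(header: dict, payload: bytes) -> bytes:
    """Length prefix, compact JSON header, then the raw payload."""
    h = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<I", len(h)) + h + bytes(payload)


def unpack_binary_message(message: bytes) -> tuple[dict, bytes]:
    """Hypothetical inverse of pack_binary_message (illustration only)."""
    if len(message) < 4:
        raise ValueError("message shorter than the 4-byte length prefix")
    # The first four bytes are the little-endian header length.
    (header_len,) = struct.unpack_from("<I", message, 0)
    end = 4 + header_len
    if end > len(message):
        raise ValueError("header length exceeds message size")
    header = json.loads(message[4:end].decode("utf-8"))
    # Everything after the header is the payload, untouched.
    return header, bytes(message[end:])
```

Because the header length travels in the same message as the payload, the receiver never has to pair two messages or track which header belongs to which binary blob.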

Image header: {type:"image_frame", seq, timestamp_us, width, height, mime}. Video header: {type:"video_chunk", seq, timestamp_us, duration_us, width, height, codec, bitstream:"annexb", keyframe}. Both gain optional pixel_ratio (render DPR) and color (descriptor) fields, emitted only when non-default so committed fixtures and older clients are unaffected; the session stamps them from the source RawFrame after encode (so no encoder needs to know they exist). See Sizing, DPR & color.
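The "emitted only when non-default" rule can be sketched as a small header builder. The function name, the default pixel ratio of 1.0, and the use of None for "default color" are assumptions made for this example; the real header builders live in protocol.py and may differ.

```python
from typing import Any, Optional


def build_image_header(
    seq: int,
    timestamp_us: int,
    width: int,
    height: int,
    mime: str,
    pixel_ratio: float = 1.0,              # assumed default render DPR
    color: Optional[dict[str, Any]] = None,  # assumed: None means default color
) -> dict[str, Any]:
    """Hypothetical builder: optional fields appear only when non-default."""
    header: dict[str, Any] = {
        "type": "image_frame",
        "seq": seq,
        "timestamp_us": timestamp_us,
        "width": width,
        "height": height,
        "mime": mime,
    }
    # Omitting defaults keeps headers byte-identical to the older format.
    if pixel_ratio != 1.0:
        header["pixel_ratio"] = pixel_ratio
    if color is not None:
        header["color"] = color
    return header
```

With default arguments the header has exactly the six base keys, which is why existing fixtures and older clients see no change.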

Capability negotiation

worker: probeCapabilities() ─ hello{supported:[...], device_pixel_ratio}
server: select_transport(supported, has_h264, has_nvenc) ─ BackendSelection
server: build_encoder(selection)  ─ config{transport, codec, width, height}

select_transport prefers H.264 when the client lists webcodecs/h264-annexb and a video encoder exists (NVENC over libx264 when present), else the best shared image format. has_nvenc already exists in the signature so an NVENC backend changes no callers.
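The preference order described above might look like the following sketch. It is not the library's implementation: the BackendSelection fields, the encoder labels, and the image preference list are assumptions chosen for the example; only the webcodecs/h264-annexb capability string and the (supported, has_h264, has_nvenc) signature come from this page.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

H264_CAPABILITY = "webcodecs/h264-annexb"
# Hypothetical image preference order, best first.
IMAGE_PREFERENCE = ("image/webp", "image/jpeg", "image/png")


@dataclass(frozen=True)
class BackendSelection:
    """Hypothetical shape; the real class may carry different fields."""
    transport: str               # "video" or "image"
    encoder: str                 # illustrative label for the chosen backend
    mime: Optional[str] = None   # set only for the image transport


def select_transport(
    supported: Sequence[str], has_h264: bool, has_nvenc: bool = False
) -> BackendSelection:
    # Prefer H.264 when the client can decode it and any video encoder exists.
    if H264_CAPABILITY in supported and (has_h264 or has_nvenc):
        return BackendSelection("video", "nvenc" if has_nvenc else "libx264")
    # Otherwise fall back to the best image format both sides share.
    for mime in IMAGE_PREFERENCE:
        if mime in supported:
            return BackendSelection("image", "image", mime)
    raise ValueError("client and server share no transport")
```

Keeping has_nvenc in the signature from the start means a caller written before an NVENC backend existed keeps working unchanged once one is registered.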

The session loop

RfbSession.run() runs two coroutines under an asyncio.TaskGroup:

  • recv_loop iterates inbound messages, dispatching via _handle_control: ack clears the in-flight set, request_keyframe arms a keyframe, event and set_viewport go to source.handle_event.
  • encode_loop repeatedly runs _encode_step: pull the next frame, and if len(inflight) >= max_inflight drop it before encoding (and force the next sent frame to be a keyframe); otherwise rebuild the encoder if the frame size changed, encode in a worker thread, and send.

Key decisions:

  • Encode off the event loop. encode() is CPU-bound and synchronous, so it runs via await asyncio.to_thread(...); the receive loop keeps draining ACKs and the in-flight set keeps moving.
  • Latest-frame-wins, drop before encoding. Dropping already-encoded delta frames would strand the browser on references it never received; dropping pre-encode and forcing the next keyframe keeps the stream decodable. The first frame to every client is a keyframe.
  • Fixed-resolution encoders. libx264 is configured for one size, so a frame whose dimensions changed triggers encoder_factory(w, h) and a forced keyframe; the browser re-configure()s its decoder on the new coded size.
  • Clean shutdown. A client disconnect surfaces as ConnectionClosed; both loops swallow it and set closed, so the TaskGroup completes without noise.

Single-step helpers (_encode_step, _handle_control) exist so unit tests drive the policy deterministically with a FakeWebSocket + FakeEncoder, with no socket or thread scheduling.
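The drop and keyframe policy can be modeled without sockets or threads. The sketch below is a simplified, synchronous stand-in for _encode_step and the ack path; the class and field names are invented for illustration, the encoder is a fake, and it assumes an ack removes the acknowledged seq from the in-flight set.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Frame:
    seq: int
    width: int
    height: int


class FakeEncoder:
    """Fixed-size fake encoder; records whether a keyframe was forced."""

    def __init__(self, width: int, height: int) -> None:
        self.size = (width, height)

    def encode(self, frame: Frame, force_keyframe: bool) -> dict:
        return {"seq": frame.seq, "keyframe": force_keyframe}


@dataclass
class PolicyModel:
    max_inflight: int = 2
    inflight: set = field(default_factory=set)
    force_keyframe: bool = True          # the first frame is always a keyframe
    encoder: Optional[FakeEncoder] = None

    def encode_step(self, frame: Frame) -> Optional[dict]:
        # Drop before encoding when the client is behind, and make sure
        # the next frame that is sent starts a fresh reference chain.
        if len(self.inflight) >= self.max_inflight:
            self.force_keyframe = True
            return None
        # Encoders are fixed-resolution: rebuild on a size change.
        if self.encoder is None or self.encoder.size != (frame.width, frame.height):
            self.encoder = FakeEncoder(frame.width, frame.height)
            self.force_keyframe = True
        payload = self.encoder.encode(frame, self.force_keyframe)
        self.force_keyframe = False
        self.inflight.add(frame.seq)
        return payload

    def handle_ack(self, seq: int) -> None:
        self.inflight.discard(seq)
```

Driving the model step by step shows the invariants the unit tests check: the first frame is a keyframe, a frame arriving at the in-flight limit is dropped, and the next frame sent after a drop or a resize is a keyframe.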

The H.264 path

H264CpuEncoder uses a bare av.CodecContext.create("libx264", "w"), which emits Annex B with in-band SPS/PPS — exactly WebCodecs' Annex B mode (never route through an mp4 muxer, which produces AVCC). The gaps in the original sketch are fixed:

  • Forced IDR: forced-idr=1 at creation and per-frame vf.pict_type = PictureType.I on a forced keyframe (a plain I frame without forced-idr can be a non-IDR the browser treats as a delta).
  • Pixel format: RGB is explicitly reformatted to yuv420p (PyAV does not auto-convert on encode); dimensions are even.
  • Low latency: ultrafast / zerolatency, bframes=0, keyint=min-keyint=fps for a 1-second IDR cadence, annexb=1 / repeat-headers=1 to keep parameter sets in-band.

On the browser, VideoDecoder.configure({codec, codedWidth, codedHeight}) omits description (SPS/PPS are in-band). The KeyframeGate drops delta chunks until the first keyframe after every connect/reconnect/reconfigure; a decoder error resets the gate and sends request_keyframe.

Because there are no B-frames, decoder output order equals input order, so a FIFO of queued seqs attributes each displayed frame for the displayed:true ACK. (Enabling B-frames would break that assumption — it is documented in the code.)
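The browser-side gate and FIFO attribution are implemented in TypeScript; the following is a Python model of the same logic, for illustration only. Method names are assumptions, not the names used in backpressure.ts or videoDecode.ts.

```python
from collections import deque


class KeyframeGate:
    """Drops delta chunks until a keyframe has been seen since the last reset."""

    def __init__(self) -> None:
        self._open = False

    def reset(self) -> None:
        # Called on connect, reconnect, reconfigure, or decoder error.
        self._open = False

    def admit(self, keyframe: bool) -> bool:
        if keyframe:
            self._open = True
        return self._open


class SeqFifo:
    """Attributes decoder outputs to seqs, assuming output order == input order."""

    def __init__(self) -> None:
        self._pending: deque[int] = deque()

    def queued(self, seq: int) -> None:
        self._pending.append(seq)

    def displayed(self) -> int:
        # With no B-frames, the oldest queued seq is the frame just output.
        return self._pending.popleft()

    def clear(self) -> None:
        self._pending.clear()
```

A chunk is queued in the FIFO only after the gate admits it, so every seq popped on output corresponds to a chunk the decoder actually received.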

Pipelined encode (token-based seq attribution)

By default the encoders are synchronous 1-in-1-out: encode(frame_N) returns frame N's AU, stamped seq=N. serve(encode_pipeline_depth=k>0) opts into a pipelined path on backends that implement it (see pipelined_encode.md). The design point is that the session does not change — it already iterates payloads and books each payload.seq (_encode_step lines ~242–243, send_payload). All the pipelining lives below the EncoderBackend seam:

  • Binding (packages/vtenc's VtEncoder): alongside the synchronous encode()→bytes, a submit(frame, seq, force_idr) issues VTCompressionSessionEncodeFrame without CompleteFrames, carrying seq as the per-frame sourceFrameRefCon. The output callback recovers it and pushes a seq-tagged access unit onto a queue; submit() drains whatever is ready and returns a list of (recovered_seq, annexb, keyframe) tuples (0..N). flush_pipeline() completes the in-flight tail. packages/nvenc's NvencEncoder implements the same submit()/flush_pipeline() shape at extra_output_delay=depth; NVIDIA's vendored NvEncoder helper owns inputTimeStamp, so it recovers seq from an in-order FIFO of the tags pushed at submit() instead — valid because frameIntervalP=1 forces output order == input order (see pipelined_encode.md).
  • Wrapper (encoders/vtenc.py VideoToolboxEncoder, and encoders/nvenc_gpu_pdum.py NvencGpuPdumEncoder): with pipeline_depth>0, encode() calls submit() and stamps each payload with the recovered seq (not the call's frame.seq), looking the original timestamp_us up from a small {seq: timestamp_us} in-flight map. pipeline_depth=0 keeps the synchronous path byte-identical.
  • Plumbing: build_encoder(..., pipeline_depth=) forwards it to the video factory; only backends that implement the pipelined path consume it (the rest drop it → a depth>0 request runs synchronously, not an error). serve()/add_stream() thread encode_pipeline_depth to the per-stream _StreamHost, whose factory passes it on.

This keeps the no-B-frames invariant (output order == input order); only the delay between input and output grows. The drop/keyframe policy is unchanged — frames are still dropped before submit, so the encoder pipeline only ever holds a valid reference chain. Backpressure's max_inflight bounds the wire (sent-but-unacked), independent of the encoder's in-flight depth. On VideoToolbox this path is correct but not faster (low-latency RC is synchronous); it exists for the NVENC backend.
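Token-based seq attribution can be illustrated with a fake pipelined encoder. Nothing here calls VideoToolbox or NVENC: FakePipelinedEncoder merely delays output by depth frames, and PipelinedWrapper is a hypothetical stand-in for the wrapper layer, stamping each payload with the recovered seq and looking the timestamp up from an in-flight map.

```python
from collections import deque


class FakePipelinedEncoder:
    """Holds up to `depth` frames before releasing them, in input order."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self._queue: deque[tuple[int, bytes, bool]] = deque()

    def submit(self, frame: bytes, seq: int, force_idr: bool) -> list[tuple[int, bytes, bool]]:
        # The seq rides along with the frame, like a per-frame token.
        self._queue.append((seq, b"AU:" + frame, force_idr))
        ready = []
        while len(self._queue) > self.depth:
            ready.append(self._queue.popleft())
        return ready  # zero or more (recovered_seq, annexb, keyframe) tuples

    def flush_pipeline(self) -> list[tuple[int, bytes, bool]]:
        ready = list(self._queue)
        self._queue.clear()
        return ready


class PipelinedWrapper:
    def __init__(self, encoder: FakePipelinedEncoder) -> None:
        self._enc = encoder
        self._timestamps: dict[int, int] = {}  # {seq: timestamp_us} in flight

    def encode(self, seq: int, timestamp_us: int, pixels: bytes, force_keyframe: bool = False) -> list[dict]:
        self._timestamps[seq] = timestamp_us
        return [self._stamp(t) for t in self._enc.submit(pixels, seq, force_keyframe)]

    def flush(self) -> list[dict]:
        return [self._stamp(t) for t in self._enc.flush_pipeline()]

    def _stamp(self, item: tuple[int, bytes, bool]) -> dict:
        seq, annexb, keyframe = item
        # Stamp with the recovered seq, not the seq of the current call.
        return {"seq": seq, "timestamp_us": self._timestamps.pop(seq),
                "data": annexb, "keyframe": keyframe}
```

With depth=2, the call that submits seq 2 returns the payload for seq 0, which is why the session must book payload.seq rather than the seq of the frame it just passed in.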

The worker

One unified worker handles both transports (selected per message by header type) because there is exactly one WebSocket and one transferred OffscreenCanvas per session, and the server may switch transport mid-session.

entry.ts        bootstrap; owns the WebSocket; routes control vs binary;
                forwards main-thread messages (init/event/resize/capture/dispose)
connection      (in entry) hello after probeCapabilities; keyframe-gate reset on (re)open
renderer.ts     OffscreenCanvas 2D wrapper: draw / resize / readPixels / toBlob
imageDecode.ts  image_frame -> createImageBitmap -> draw -> bitmap.close()
videoDecode.ts  VideoPipeline: VideoDecoder lifecycle, gate, FIFO seq attribution
backpressure.ts BackpressureController + KeyframeGate (pure, unit-tested)

Resource lifetime is explicit: every VideoFrame and ImageBitmap is close()d immediately after drawing, or the decoder stalls within seconds. Payload views are copied into fresh Uint8Arrays before handing them to Blob/EncodedVideoChunk.

transferControlToOffscreen() is one-way: after transfer the main thread must never touch that canvas's bitmap, so all resize/DPR changes are messaged to the worker, which sets OffscreenCanvas.width/height.

Main ↔ worker contract

Main → worker: init (transfers the canvas + url + options), event, resize, capture, dispose. Worker → main: ready, state, stats, capture-result (carries the ImageData/Blob and the lastDisplayedSeq it measured), error.

Event normalization + the coordinate contract

events.ts (main thread) maps DOM events to the renderview vocabulary (shared by jupyter_rfb / pygfx / fastplotlib) in CSS canvas coordinates, then posts them to the worker. mapButton/mapButtons translate DOM button enums/bitmask to renderview's 0=none,1=left,2=right,3=middle (button) and pressed-button tuple (buttons). computeBackingSize derives the backing store and the effective ratio reported in set_viewport (logical width/height + physical pwidth/pheight + ratio).
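As an illustration of the button translation, here is a Python model of what mapButton/mapButtons do. The DOM values are the standard MouseEvent ones (button: 0 left, 1 middle, 2 right; buttons bitmask: 1 left, 2 right, 4 middle); returning 0 for any other DOM button is an assumption of this sketch, not confirmed behavior of events.ts.

```python
# DOM MouseEvent.button -> renderview button (1=left, 2=right, 3=middle).
_DOM_BUTTON_TO_RENDERVIEW = {0: 1, 1: 3, 2: 2}

# DOM MouseEvent.buttons bit -> renderview button, in renderview order.
_DOM_BUTTONS_BITS = ((1, 1), (2, 2), (4, 3))


def map_button(dom_button: int) -> int:
    """Translate a DOM button enum; unknown values (e.g. -1) map to 0=none."""
    return _DOM_BUTTON_TO_RENDERVIEW.get(dom_button, 0)


def map_buttons(dom_buttons: int) -> tuple[int, ...]:
    """Translate the DOM pressed-button bitmask into a renderview tuple."""
    return tuple(rv for bit, rv in _DOM_BUTTONS_BITS if dom_buttons & bit)
```

The swap between middle and right is the detail worth noting: DOM numbers the middle button 1 in button but gives it bit 4 in buttons, while renderview uses 3 for middle in both fields.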

The frame-pixel contract. All frame↔canvas geometry lives in one pure, unit-tested module, viewport.ts (frameDestRect / backingToFrame, for fill/contain/cover) — used by both the renderer (drawing, letterboxed per fit) and the worker's event path. Before sending, the worker remaps each pointer/wheel event CSS → backing (× DPR) → frame pixels through the current fit, adding inside (false in letterbox padding) and pixel_ratio (the frame's render DPR). So the publisher receives coordinates that index its framebuffer directly, correct under any fit or DPR — this is the invariant the original HiDPI off-by-DPR bug violated, and the reason drawing and hit-testing share one transform. The frame's render DPR cancels out of the geometry (it scales frame and backing symmetrically), so it rides only as the event echo. config.coords advertises the space ("frame-pixels", unconditional in this version).
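The CSS to backing to frame-pixel remap can be sketched as follows. These are hypothetical Python mirrors of frameDestRect and backingToFrame, assuming centered placement for contain and cover; the actual viewport.ts may differ in rounding and edge handling.

```python
def frame_dest_rect(frame_w, frame_h, backing_w, backing_h, fit="contain"):
    """Rectangle (x, y, w, h) in backing pixels where the frame is drawn."""
    if fit == "fill":
        return (0.0, 0.0, float(backing_w), float(backing_h))
    sx, sy = backing_w / frame_w, backing_h / frame_h
    if fit == "contain":
        scale = min(sx, sy)   # whole frame visible, letterboxed
    elif fit == "cover":
        scale = max(sx, sy)   # canvas fully covered, frame cropped
    else:
        raise ValueError(f"unknown fit mode: {fit!r}")
    w, h = frame_w * scale, frame_h * scale
    return ((backing_w - w) / 2, (backing_h - h) / 2, w, h)


def backing_to_frame(bx, by, rect, frame_w, frame_h):
    """Invert the draw transform; `inside` is False in letterbox padding."""
    x, y, w, h = rect
    fx = (bx - x) / w * frame_w
    fy = (by - y) / h * frame_h
    inside = 0 <= fx < frame_w and 0 <= fy < frame_h
    return fx, fy, inside


def css_to_frame(css_x, css_y, dpr, frame_size, backing_size, fit="contain"):
    """CSS pixels -> backing pixels (x DPR) -> frame pixels through the fit."""
    fw, fh = frame_size
    bw, bh = backing_size
    rect = frame_dest_rect(fw, fh, bw, bh, fit)
    return backing_to_frame(css_x * dpr, css_y * dpr, rect, fw, fh)
```

For example, a 200x100 frame shown with contain on a 400x400 backing store at DPR 2 is drawn at (0, 100, 400, 200); a pointer at CSS (100, 100) lands on frame pixel (100, 50), while CSS (10, 10) falls in the top letterbox band and reports inside=False. Using the same rectangle for drawing and for hit-testing is what keeps the two from drifting apart.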

Module map

Push model. The public API is serve(width, height) -> Display; you display.publish(ndarray) from your own loop and drain input with display.poll_events(). serve() runs the WebSocket server as a background task. Each connection gets its own RfbSession, fed by an internal per-connection _ClientFeed (the FrameSource the session pulls). The pull FrameSource classes in sources.py are internal-only now. RfbSession is unchanged — it sees a Channel (transport.py) and a feed, both satisfying its thin seams.

Two additive layers sit on top of that core without changing it: a hub (serve_server() -> Server) fronts several named Displays on one port, routed by URL path; and an opt-in ASGI front-end (asgi.py) drives the same per-connection lifecycle (_StreamHost._serve_connection) over a Starlette WebSocket. Both reuse the identical session/encoder/backpressure path.

src/pdum/rfb/
  types.py          RawFrame, EncodedPayload, InputEvent, FrameSource/EncoderBackend protocols (dep-free)
  protocol.py       envelope, header builders, control parsing, select_transport
  session.py        RfbSession: loops, backpressure, keyframe policy
  display.py        Display (publish/poll_events/aclose) + internal _ClientFeed (per connection)
  auth.py           AuthContext / Authenticator / Principal (pluggable, no JWT dep)
  transport.py      Channel protocol + WebSocketTransport (the transport seam)
  asgi.py           opt-in [asgi] Starlette front-end: rfb_endpoint / rfb_hub_endpoint
                    + _AsgiConn (same core; drives _StreamHost._serve_connection)
  sources.py        BaseFrameSource, RenderCallbackSource, OnDemandFrameSource (internal now)
  gpu.py            zero-copy GPU helpers: rgb_to_nv12, cuda_frame, context sharing, probes
  metrics.py        SessionMetrics (encode_ms, bytes, RTT, fps, bitrate, ...)
  adaptive.py       AdaptiveQualityController (opt-in via serve(adaptive=True))
  benchmark.py      `python -m pdum.rfb.benchmark` — offline image vs H.264 w/ real PSNR
  cli.py            `pdum-rfb` CLI: doctor (probe encode paths) + benchmark
  server.py         serve()->Display, serve_server()->Server hub (named streams,
                    URL-path routing, /streams REST), _StreamHost (transport-neutral
                    _serve_connection), _WsConn adapter, `python -m` CLI
  encoders/
    base.py         registry + build_encoder (registers h264_cpu + nvenc_cpu + nvenc_gpu_pyav + nvenc_gpu_pdum)
    image.py        ImageEncoder (Pillow)
    h264_cpu.py    H264CpuEncoder + h264_cpu_available / self_test
    nvenc_cpu.py        NvencCpuEncoder (host-input GPU h264_nvenc) + nvenc_cpu_available
    nvenc_gpu_pyav.py   NvencGpuPyavEncoder (zero-copy CUDA NV12 -> h264_nvenc, PyAV >= 18)
    nvenc_gpu_pdum.py    NvencGpuPdumEncoder (PyAV-free; rides habemus-papadum-nvenc / pdum.nvenc)
  testing.py        SyntheticFrameSource, fakes, NAL/decode helpers, fixture gen

widgets/src/
  index.ts                public exports
  RemoteFramebufferView.ts main-thread controller (canvas, events, resize, capture, setFit)
  viewport.ts             pure frame<->canvas geometry (frameDestRect / backingToFrame; fit modes)
  protocol.ts  events.ts  eventTypes.ts  capabilities.ts  backpressure.ts  types.ts
  workerFactory.ts        inline worker (?worker&inline)
  worker/{entry,renderer,imageDecode,videoDecode}.ts

Testing architecture

Three layers verify the system with no display and no manual clicking:

  1. Python (pytest). Protocol round-trips (+ golden fixtures for the JS side), image-encoder validity (re-decoded with Pillow), session invariants (max_inflight, keyframe-first, latest-frame-wins, forced-keyframe-on-drop, event delivery), negotiation, and — for H.264 — the produced Annex B bitstream is decoded back with PyAV to prove validity. One real loopback-socket test covers the handshake + HTTP side channel.
  2. JS unit (Vitest). The protocol unpacker is asserted byte-for-byte against the Python-generated fixtures; event-coordinate scaling and the backpressure/keyframe-gate logic are tested in isolation.
  3. Browser e2e (Playwright + headless Chromium). webServer boots the Python server (streaming the deterministic test_card pattern) and a production build of the demo. A spec decodes real frames, reads back canvas pixels via the capture hook, and checks they form a valid render_test_pattern(k) frame — the four palette colors in the correct spatial cycle, via matchedRotation (the TypeScript mirror of Python's render_test_pattern; flat quadrant colors keep lossy decode within tolerance). The check is on the frame's structure, not a specific seq: the browser-visible lastDisplayedSeq is a per-client wire counter, not the server's render counter, so the two need not match. A second spec injects real pointer/key/wheel events and asserts the server received the normalized versions via GET /recorded-events. The image path is unconditional; the H.264 path is gated on VideoDecoder.isConfigSupported and skipped-with-log where the browser lacks avc1.

Extension points

  • Encoders. register_video_encoder(name, factory) + the has_nvenc flag in select_transport are the seam new backends slot into with no changes to the session or transport. Three NVENC backends already ride it: nvenc_cpu (host-input h264_nvenc), nvenc_gpu_pyav (zero-copy CUDA→NVENC via PyAV ≥ 18), and nvenc_gpu_pdum (PyAV-free, via the sibling pdum.nvenc package).
  • Frames. Push ndarrays, CuPy/DLPack CUDA tensors, or RawFrames to Display.publish() — a CUDA tensor becomes a memory="cuda" frame for the GPU encoders. The internal per-connection _ClientFeed is the FrameSource the session pulls; BaseFrameSource remains for internal/test use.
  • Transport. The session only needs an object with await send(...) and async iteration. The Transport/Channel abstraction wraps this without touching the encoder or source layers — the websockets listener and the ASGI/Starlette front-end both ride the same seam.
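The transport seam described in the last bullet (an object with await send(...) and async iteration) can be expressed as a structural protocol. The sketch below is an assumption about its shape, not the definition in transport.py; InMemoryChannel and echo_session are invented here to show that anything satisfying the two methods can stand in for a socket.

```python
import asyncio
from typing import AsyncIterator, Iterable, Protocol, Union

Message = Union[str, bytes]  # JSON control text or binary envelopes


class Channel(Protocol):
    """Assumed shape of the seam: send plus async iteration."""

    async def send(self, message: Message) -> None: ...

    def __aiter__(self) -> AsyncIterator[Message]: ...


class InMemoryChannel:
    """Test double: replays scripted inbound messages, records outbound ones."""

    def __init__(self, inbound: Iterable[Message]) -> None:
        self._inbound = list(inbound)
        self.sent: list[Message] = []

    async def send(self, message: Message) -> None:
        self.sent.append(message)

    def __aiter__(self) -> AsyncIterator[Message]:
        return self._iterate()

    async def _iterate(self) -> AsyncIterator[Message]:
        for message in self._inbound:
            yield message


async def echo_session(channel: Channel) -> int:
    """Stand-in for a session: depends only on the two seam methods."""
    count = 0
    async for message in channel:
        await channel.send(message)
        count += 1
    return count
```

Running asyncio.run(echo_session(InMemoryChannel(["hello", b"\x00"]))) exercises the loop with no network at all, which is the same property that lets a websockets listener and an ASGI front-end share one session implementation.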