Zero-copy GPU encoding (CUDA → NVENC)¶
Stream a GPU-resident framebuffer straight to NVENC with no host copy: a
CUDA NV12 (or RGB) buffer — a CuPy / PyTorch / any __dlpack__ tensor — is handed
to h264_nvenc via PyAV's VideoFrame.from_dlpack, and the encoder reads device
memory directly. This is the GPU counterpart to the host
NvencCpuEncoder, which first reformats host rgb24 to yuv420p on the CPU and
then uploads the result.
For a render-on-GPU scientific pipeline this removes the CPU color conversion and the PCIe upload from every frame, leaving the CPU free for other work.
Measured payoff¶
Per-frame encode latency on an RTX 4090 Laptop GPU (moving gradient, vbr,
delay=0), CPU-origin (host rgb24 → CPU yuv420p reformat → upload → encode)
vs GPU zero-copy (device RGB → on-GPU NV12 → from_dlpack → encode):
| Resolution | RGB→NV12 (GPU) | GPU encode (zero-copy) | GPU total | GPU fps | CPU-origin | CPU fps | speed-up |
|---|---|---|---|---|---|---|---|
| 1280×720 | 0.009 ms | 1.36 ms | 1.37 ms | 730 | 3.25 ms | 307 | 2.4× |
| 1920×1080 | 0.014 ms | 2.49 ms | 2.50 ms | 400 | 7.26 ms | 138 | 2.9× |
| 2560×1440 | 0.021 ms | 3.52 ms | 3.54 ms | 282 | 12.73 ms | 79 | 3.6× |
| 3840×2160 | 0.057 ms | 7.02 ms | 7.08 ms | 141 | 30.53 ms | 33 | 4.3× |
The NVENC kernel itself is GPU-bound either way; the win is removing the CPU
rgb→yuv reformat (libswscale, single-threaded — brutal at 4K) and the per-frame
PCIe upload (~0.5 ms at 1080p, ~2.2 ms at 4K). Reproduce with
python -m pdum.rfb.benchmark --gpu (see Benchmark).
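The fps and speed-up columns follow directly from the latency columns. A minimal sketch that recomputes them from the "GPU total" and "CPU-origin" values in the table above (the names `LATENCY_MS` and `derived_columns` are illustrative and not part of pdum.rfb):

```python
# Latencies copied from the table above: (GPU total ms, CPU-origin ms).
LATENCY_MS = {
    "1280x720": (1.37, 3.25),
    "1920x1080": (2.50, 7.26),
    "2560x1440": (3.54, 12.73),
    "3840x2160": (7.08, 30.53),
}


def derived_columns(gpu_total_ms: float, cpu_origin_ms: float) -> tuple:
    """Return (GPU fps, speed-up) for one table row."""
    gpu_fps = round(1000.0 / gpu_total_ms)             # frames per second
    speed_up = round(cpu_origin_ms / gpu_total_ms, 1)  # CPU-origin vs GPU total
    return gpu_fps, speed_up


for resolution, (gpu_ms, cpu_ms) in LATENCY_MS.items():
    fps, speed_up = derived_columns(gpu_ms, cpu_ms)
    print(f"{resolution}: {fps} fps, {speed_up}x faster than CPU-origin")
```

The speed-up grows with resolution because the CPU-side reformat and upload scale with pixel count, while the on-GPU conversion stays well under a tenth of a millisecond.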
What NV12 is¶
NVENC's native input is NV12: YUV 4:2:0, 8-bit, semi-planar.
- Y (luma): full resolution, W×H bytes.
- UV (chroma): half resolution in both axes, one interleaved plane of U,V,U,V… ((H/2)×W bytes). Each chroma sample is shared by a 2×2 pixel block.
Total 1.5 bytes/pixel (vs 3 for RGB). "Semi-planar" = Y separate, U/V interleaved
(NV12), unlike yuv420p/I420's three separate planes. Critically, NVENC wants NV12
as one contiguous allocation — Y plane, then the UV plane at byte offset
pitch·height — because it reads UV relative to the Y base pointer. This module's
rgb_to_nv12 produces exactly that layout; nv12_planes slices it
back into the two DLPack planes.
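To make the layout concrete, here is a small sketch of the size and offset arithmetic. It assumes a tightly packed frame by default (pitch equal to width); a padded allocation would pass a larger pitch. `Nv12Layout` and `nv12_layout` are illustrative helpers, not pdum.rfb functions:

```python
from typing import NamedTuple, Optional


class Nv12Layout(NamedTuple):
    y_bytes: int      # luma plane size
    uv_bytes: int     # interleaved chroma plane size
    uv_offset: int    # byte offset of the UV plane from the Y base pointer
    total_bytes: int  # size of the single contiguous allocation


def nv12_layout(width: int, height: int, pitch: Optional[int] = None) -> Nv12Layout:
    """Compute plane sizes for a contiguous NV12 buffer (Y plane, then UV plane)."""
    if width % 2 or height % 2:
        raise ValueError("NV12 needs even width and height")
    pitch = width if pitch is None else pitch
    if pitch < width:
        raise ValueError("pitch must be at least the width")
    y_bytes = pitch * height            # one byte per luma sample
    uv_bytes = pitch * (height // 2)    # half the rows, U and V interleaved
    return Nv12Layout(y_bytes, uv_bytes, y_bytes, y_bytes + uv_bytes)


layout = nv12_layout(1920, 1080)
print(layout)
print(layout.total_bytes / (1920 * 1080))  # bytes per pixel
```

For 1920×1080 this gives a UV offset of 2,073,600 bytes and 1.5 bytes per pixel, matching the description above.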
Quick start¶
import asyncio, cupy as cp, pdum.rfb as rfb

# 1) BEFORE any framework's first CUDA op (CuPy/PyTorch/JAX): share the device
#    primary context with FFmpeg. This pokes the CUDA driver directly (no CuPy),
#    so it must run before anything activates the primary context; otherwise the
#    flags don't take effect.
rfb.enable_cuda_context_sharing()

async def main():
    # 2) gpu=True selects the zero-copy CUDA→NVENC encoder (validated at startup).
    display = await rfb.serve(1920, 1080, port=8765, gpu=True)
    try:
        while True:
            for ev in display.poll_events():
                ...  # handle input
            frame_rgb = render_on_gpu()   # a CuPy (H, W, 3) uint8 array
            display.publish(frame_rgb)    # zero-copy: stays on the GPU
            await asyncio.sleep(1 / 60)
    finally:
        await display.aclose()

asyncio.run(main())
Publish a CuPy (H, W, 3) array directly, or — to skip even the RGB→NV12 step —
publish an already-NV12 frame:
nv12 = rfb.gpu.rgb_to_nv12(frame_rgb) # contiguous (H+H//2, W) on GPU
display.publish(rfb.gpu.cuda_frame(nv12, pixel_format="nv12", height=1080))
Any framework works as long as the tensor exposes __cuda_array_interface__ or a
CUDA __dlpack__. CuPy, PyTorch, and JAX all do, and all run on the device
primary context, so the pointer FFmpeg/NVENC sees is valid (that's what
enable_cuda_context_sharing guarantees; it operates on the primary context
via the CUDA driver, not on any one library). One caveat: code that creates its
own non-primary context (calling cuCtxCreate rather than retaining the primary
context) produces pointers that live in a different context and can't be
registered, even after the call. CuPy, PyTorch, and JAX are not in that category.
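As a rough illustration of the protocol check described above, the sketch below tests whether an object advertises CUDA memory through either interface. `exposes_cuda_memory` and the two stand-in classes are hypothetical and not part of pdum.rfb. The check says nothing about which CUDA context owns the pointer, so it cannot catch the non-primary-context caveat.

```python
KDL_CPU = 1   # DLPack device-type code for host memory (kDLCPU in dlpack.h)
KDL_CUDA = 2  # DLPack device-type code for CUDA memory (kDLCUDA in dlpack.h)


def exposes_cuda_memory(obj) -> bool:
    """True if obj advertises device memory via CAI or a CUDA DLPack device."""
    if hasattr(obj, "__cuda_array_interface__"):
        return True
    if hasattr(obj, "__dlpack__") and hasattr(obj, "__dlpack_device__"):
        device_type, _device_id = obj.__dlpack_device__()
        return int(device_type) == KDL_CUDA
    return False


class FakeCudaTensor:
    """Stand-in for a GPU tensor; only the device query is implemented."""

    def __dlpack__(self, stream=None):
        raise NotImplementedError("illustration only")

    def __dlpack_device__(self):
        return (KDL_CUDA, 0)


class FakeHostTensor(FakeCudaTensor):
    """Stand-in for a host tensor that also speaks DLPack."""

    def __dlpack_device__(self):
        return (KDL_CPU, 0)


print(exposes_cuda_memory(FakeCudaTensor()))  # True
print(exposes_cuda_memory(FakeHostTensor()))  # False
print(exposes_cuda_memory(b"plain bytes"))    # False
```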
Requirements¶
pdum.rfb.gpu.cuda_zerocopy_available() returns True only when all hold (it
runs an actual one-frame encode to be sure, and caches the result):
- CuPy: cupy-cuda13x / cupy-cuda12x (cp314 wheels exist; works on 3.14).
- An NVENC-capable GPU + driver: same gate as the host NVENC backend (pdum.rfb.encoders.nvenc.nvenc_cpu_available()).
- PyAV that can encode CUDA frames: PyAV ≥ 18. from_dlpack (frame creation) is in 17.0, but feeding a CUDA frame to an encoder (adopting the frame's hw_frames_ctx before avcodec_open2) lands in 18.0 (PyAV #2199), unreleased at time of writing (the fix is on main).
On PyAV 17.x the encode raises avcodec_open2(...) returned 22; hw_frames_ctx must
be set when using GPU frames as input.
Installing it today (before PyAV 18.0 ships on PyPI)¶
CuPy installs normally (pip install habemus-papadum-rfb[gpu-cuda13], or
[gpu-cuda12] for CUDA 12). The only catch is PyAV ≥ 18, which isn't on PyPI
yet. Three paths, easiest first; all land in your active env (or $PYTHON):
1. Prebuilt self-contained wheel (recommended). A wheel with a CUDA-enabled ffmpeg bundled in — no system ffmpeg, no compiler, no env vars. Host it on a GitHub release (see Building & hosting below), then:
PYAV_WHEEL_URL=https://github.com/<owner>/<repo>/releases/download/<tag>/av-...whl \
./scripts/install-gpu.sh # installs the wheel + CuPy, then self-tests
# or directly: uv pip install <that-url> cupy-cuda13x
2. Build from source (one command). No prebuilt wheel needed — the script
fetches a CUDA ffmpeg (a BtbN LGPL shared build) and builds PyAV from a pinned
commit, baking an rpath so no LD_LIBRARY_PATH is needed at runtime:
./scripts/install-gpu.sh # ~1 min the first time; uv caches the build
# CUPY_PACKAGE=cupy-cuda12x ./scripts/install-gpu.sh # for a CUDA 12 toolkit
3. When PyAV 18.0 is released: add "av>=18" to the [gpu-cuda13]/[gpu-cuda12]
extras and it collapses to a one-step pip install habemus-papadum-rfb[gpu-cuda13]
— the 18.0 wheel bundles a CUDA-capable ffmpeg, so no build and no system ffmpeg.
Building & hosting the wheel (maintainers)¶
scripts/build-cuda-av-wheel.sh builds the self-contained wheel(s):
PYTHON_VERSIONS="3.12 3.13 3.14" ./scripts/build-cuda-av-wheel.sh # -> dist/cuda-wheels/
gh release create gpu-av18-<date> dist/cuda-wheels/av-*.whl \
--title "PyAV 18 (CUDA/NVENC) wheels" --notes "Self-contained; bundles LGPL ffmpeg."
It links PyAV against a BtbN LGPL ffmpeg (has h264_nvenc + the CUDA hwcontext,
--disable-libx264 ⇒ no GPL components) and runs auditwheel repair to bundle the
ffmpeg .sos (tagged manylinux_2_28 ⇒ installs on RHEL8 / Ubuntu 18.10+ and
newer). libcuda/libnvidia-encode are not bundled — they come from the host
driver, as they must. Licensing: the bundled ffmpeg is LGPL, so redistributing
the wheel carries LGPL obligations (offer the corresponding ffmpeg source / build
config). Hosting in this repo's GitHub Releases (not committed to the tree) is
the simplest option; a PEP 503 index on GitHub Pages is a later nicety.
Two gotchas the library handles for you¶
One shared CUDA context¶
CuPy uses the device primary context. FFmpeg's CUDA hwcontext (primary_ctx=1)
expects that context to have been created with CU_CTX_SCHED_BLOCKING_SYNC flags.
If CuPy activates it first with the default (auto) flags:
- primary_ctx=True fails with "Primary context already active with incompatible flags"; and
- a separate primary_ctx=False context can't register CuPy's pointers (NVENC "resource register failed (23)"), because a device pointer from one context isn't valid in another on the same device.
enable_cuda_context_sharing() pre-sets the flags (via the CUDA driver
cuDevicePrimaryCtxSetFlags). Call it once, before any CuPy/PyTorch CUDA op
(importing CuPy is fine; the first allocation/op is what activates the context).
serve(gpu=True) and the encoder call it defensively too, but if CuPy has already
run, it is too late for that process.
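For readers curious what "pre-setting the flags via the CUDA driver" can look like, here is a hedged ctypes sketch. It is an assumption about the general approach, not the library's actual implementation; `preset_primary_ctx_flags` is a hypothetical name, and the constants are the values from cuda.h.

```python
import ctypes
import ctypes.util

CUDA_SUCCESS = 0
CU_CTX_SCHED_MASK = 0x07           # scheduling bits of the context flags
CU_CTX_SCHED_BLOCKING_SYNC = 0x04  # the scheduling mode FFmpeg asks for


def preset_primary_ctx_flags(device_ordinal: int = 0):
    """Set BLOCKING_SYNC on the primary context before anything activates it.

    Returns True on success (or if the active context already matches),
    False if a driver call fails or the context is active with other flags,
    and None if no CUDA driver library can be loaded.
    """
    lib_name = ctypes.util.find_library("cuda")
    if lib_name is None:
        return None
    try:
        cuda = ctypes.CDLL(lib_name)
    except OSError:
        return None

    device = ctypes.c_int()
    if cuda.cuInit(0) != CUDA_SUCCESS:
        return False
    if cuda.cuDeviceGet(ctypes.byref(device), device_ordinal) != CUDA_SUCCESS:
        return False

    flags, active = ctypes.c_uint(), ctypes.c_int()
    status = cuda.cuDevicePrimaryCtxGetState(
        device, ctypes.byref(flags), ctypes.byref(active)
    )
    if status != CUDA_SUCCESS:
        return False
    if active.value:
        # Too late to change anything: report whether the live flags match.
        return (flags.value & CU_CTX_SCHED_MASK) == CU_CTX_SCHED_BLOCKING_SYNC

    # Newer drivers export a _v2 symbol; fall back to the original name.
    set_flags = getattr(cuda, "cuDevicePrimaryCtxSetFlags_v2", None)
    if set_flags is None:
        set_flags = cuda.cuDevicePrimaryCtxSetFlags
    return set_flags(device, CU_CTX_SCHED_BLOCKING_SYNC) == CUDA_SUCCESS


print(preset_primary_ctx_flags())  # None on a machine without a CUDA driver
```

In normal use you would call enable_cuda_context_sharing() instead; the sketch only shows why the call has to happen before the first CUDA operation in the process.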
NV12 must be one contiguous allocation¶
Pass NVENC two separate CuPy arrays for Y and UV and registration fails. Allocate
one buffer and slice views — which is what rgb_to_nv12 / nv12_planes do:
nv12 = cp.empty((H + H // 2, W), cp.uint8) # one allocation
y, uv = nv12[:H], nv12[H:] # views; uv at base + W*H
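The same one-allocation, two-views idea can be tried on the host without a GPU. In this sketch a bytearray and memoryview stand in for the CuPy buffer and its slices (an analogy only; device memory behaves the same way with respect to aliasing, but nothing here touches CUDA):

```python
W, H = 8, 4  # tiny even-sized frame for illustration

buffer = bytearray(W * H * 3 // 2)  # one allocation: Y plane followed by UV plane
view = memoryview(buffer)
y = view[: W * H]                   # luma view, no copy
uv = view[W * H:]                   # chroma view, starts at base + W*H

uv[0] = 128                         # write through the UV view...
print(buffer[W * H])                # ...and the shared buffer sees it: 128
print(len(y), len(uv))              # 32 16
```

Because both views alias one buffer, the UV plane is always exactly W*H bytes past the Y base, which is the relationship NVENC relies on.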
RGB → NV12 conversion options¶
NVENC needs YUV, so a GPU RGB buffer must be converted first. Cheapest-effort first:
- A custom CuPy RawKernel: what pdum.rfb.gpu.rgb_to_nv12 uses (BT.601 limited range). ~20 lines of CUDA C, no extra dependency, ~0.01 ms at 1080p. Recommended: the conversion is so cheap that nothing else buys anything.
- NPP (nppiRGBToNV12_*): NVIDIA's prebuilt image primitives, ships with CUDA; fast and battle-tested but adds an NPP binding.
- CV-CUDA / nvcv: cvcuda.cvtcolor; a heavier dependency, worthwhile only if you already use it.
- PyNvVideoCodec / VPF: bundle convert and encode, but have no cp314 wheel (see the NVENC-source route).
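Whichever option you pick, a slow CPU reference is handy for checking its output on small images. The sketch below is a pure-Python BT.601 limited-range conversion using the common 8-bit integer approximation. It is not the library's kernel: the rounding and the choice to average each 2×2 block before computing chroma are assumptions, so a GPU implementation may differ by a code value on non-uniform blocks. `rgb_to_nv12_reference` is a hypothetical name.

```python
def _bt601_limited(r: int, g: int, b: int) -> tuple:
    """Integer BT.601 limited-range RGB -> (Y, U, V); Y in 16..235, U/V in 16..240."""
    y = ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16
    u = ((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128
    v = ((112 * r - 94 * g - 18 * b + 128) >> 8) + 128
    return y, u, v


def rgb_to_nv12_reference(rgb, width: int, height: int) -> bytes:
    """rgb: rows of (r, g, b) tuples. Returns packed NV12 bytes (Y plane, then UV)."""
    if width % 2 or height % 2:
        raise ValueError("NV12 needs even width and height")
    out = bytearray(width * height * 3 // 2)
    # Luma: one sample per pixel.
    for row in range(height):
        for col in range(width):
            out[row * width + col] = _bt601_limited(*rgb[row][col])[0]
    # Chroma: one interleaved (U, V) pair per 2x2 block.
    uv_base = width * height
    for row in range(0, height, 2):
        for col in range(0, width, 2):
            block = [rgb[row + dy][col + dx] for dy in (0, 1) for dx in (0, 1)]
            # Rounded average of the block's RGB values (an assumption).
            r, g, b = ((sum(p[c] for p in block) + 2) // 4 for c in range(3))
            _, u, v = _bt601_limited(r, g, b)
            idx = uv_base + (row // 2) * width + col
            out[idx], out[idx + 1] = u, v
    return bytes(out)


white = [[(255, 255, 255)] * 2 for _ in range(2)]
print(list(rgb_to_nv12_reference(white, 2, 2)))  # [235, 235, 235, 235, 128, 128]
```

White maps to Y=235 with neutral chroma (128, 128), the expected limited-range result, which makes a quick sanity check before comparing full frames.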
Can we avoid building PyAV from source on < 18?¶
Short answer: no pure-Python monkey-patch exists; you must build PyAV from source (or wait for the 18.0 wheel). Investigated and ruled out:
- HWAccel (setting hw_device_ctx): PyAV can set the encoder's hw_device_ctx from Python via HWAccel, but NVENC explicitly rejects it for GPU input: "hw_frames_ctx must be set when using GPU frames as input". It needs hw_frames_ctx specifically.
- A ctypes poke at avctx->hw_frames_ctx: PyAV exposes no Python handle to the underlying AVCodecContext/AVFrame pointers, and Cython cdef-object offsets are not stable ABI. Not viable.
So < 18 needs a build. Good news: no custom FFmpeg is required — the stock
PyPI av wheel's bundled ffmpeg already has the CUDA hwcontext (it's auto-enabled
by the nv-codec-headers + nvenc dependency; it just isn't a separate
--enable-cuda token, which is why from_dlpack(primary_ctx=False) works on the
stock wheel today). You only need to rebuild PyAV against an ffmpeg dev tree.
This is what scripts/install-gpu.sh
automates (Option A). The manual forms, for reference:
Option A — build PyAV main / a pinned commit (the official fix)¶
# needs a CUDA ffmpeg dev tree on PKG_CONFIG_PATH (a BtbN LGPL/GPL "shared" release —
# no compiling ffmpeg yourself); LDFLAGS bakes an rpath so no LD_LIBRARY_PATH at runtime
PKG_CONFIG_PATH=/path/to/ffmpeg/lib/pkgconfig LDFLAGS="-Wl,-rpath,/path/to/ffmpeg/lib" \
uv pip install --no-cache --no-binary av "av @ git+https://github.com/PyAV-Org/PyAV@main"
Note: uv caches built wheels by git commit, not by the ffmpeg you link against, so use
--no-cache (or --refresh) when (re)building against a specific ffmpeg, or a stale wheel may be reused silently.
Option B — the minimal patch on 17.1.0 (pin to a known version)¶
Two edits to the PyAV sdist, then build from source. They are exactly what 18.0 does (#2199):
- include/avcodec.pxd: declare the field (the cdef struct omits it):

  ```diff
       AVHWAccel *hwaccel
       AVBufferRef *hw_device_ctx
  +    AVBufferRef *hw_frames_ctx
  ```

- av/video/codeccontext.py: adopt a hardware input frame's hw_frames_ctx before the encoder is opened:

      @cython.cfunc
      def _prepare_and_time_rebase_frames_for_encode(self, frame: Frame):
          if (not self.is_open and frame is not None
                  and frame.ptr.hw_frames_ctx and not self.ptr.hw_frames_ctx):
              self.ptr.hw_frames_ctx = lib.av_buffer_ref(frame.ptr.hw_frames_ctx)
          return CodecContext._prepare_and_time_rebase_frames_for_encode(self, frame)
Either way, cuda_zerocopy_available() flips to True and everything below works.
API¶
All of pdum.rfb.gpu lazy-imports CuPy, so importing it is always safe.
| Symbol | Purpose |
|---|---|
| enable_cuda_context_sharing(device_id=0) | Pre-set primary-ctx flags so CuPy + FFmpeg share one context. Call first. |
| cuda_zerocopy_available() | True iff the full stack works (cached; runs a real encode). |
| rgb_to_nv12(rgb, *, out=None) | Device (H,W,3) → contiguous NV12 (H+H//2, W) (custom kernel). |
| nv12_planes(packed) | Slice contiguous NV12 into (Y, UV) DLPack-ready views. |
| cuda_frame(array, *, pixel_format="auto", ...) | Wrap a device tensor as a CUDA RawFrame for publish(). |
| to_host_rgb(frame) | Download a CUDA frame to host rgb24 (used by the image fallback). |
| HostFrameAdapter(inner) | Wrap a host encoder so it tolerates CUDA frames (downloads first). |
| NvencGpuPyavEncoder | The EncoderBackend (registered as "nvenc_gpu_pyav"). |
publish() accepts a CuPy (H,W,3|4) tensor directly (or a cuda_frame for NV12),
and serve(gpu=True) selects NvencGpuPyavEncoder for every viewer.
Architecture & integration¶
- RawFrame.memory == "cuda" (the type already modelled this) carries the device tensor; Display.publish tags CuPy/DLPack tensors automatically.
- NvencGpuPyavEncoder (encoders/nvenc_gpu_pyav.py) subclasses the host H264CpuEncoder, swapping only the input handling: it accepts a CUDA nv12 frame (true zero-copy), a CUDA rgb24/rgba8 frame (on-GPU convert first), or a host frame (uploaded then converted, as a graceful fallback). It reuses one contiguous NV12 staging buffer (safe because delay=0 consumes each frame before the next), and one persistent CudaContext so every frame shares the encoder's hw_frames_ctx.
- Wire format, Annex B packing, forced-keyframe handling, and backpressure are inherited unchanged; the browser side needs nothing new.
- Image-only viewers on a GPU-publishing display still work: their image encoder is wrapped in HostFrameAdapter, which downloads each CUDA frame to host rgb24 (NV12 is converted on the GPU first). GPU mode otherwise targets WebCodecs (H.264) viewers.
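The three input cases the encoder distinguishes can be summarized as a small dispatch function. This is only an illustration of the decision described above: `FrameInfo` is a hypothetical stand-in for RawFrame, and the returned labels are invented for the example rather than taken from the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameInfo:
    """Hypothetical stand-in for the two RawFrame fields that matter here."""
    memory: str        # "cuda" or "host"
    pixel_format: str  # "nv12", "rgb24", or "rgba8"


def input_path(frame: FrameInfo) -> str:
    """Pick the input-handling path for one frame."""
    if frame.memory == "cuda" and frame.pixel_format == "nv12":
        return "zero-copy"            # hand the device buffer straight to NVENC
    if frame.memory == "cuda" and frame.pixel_format in ("rgb24", "rgba8"):
        return "convert-on-gpu"       # RGB -> NV12 into the staging buffer
    if frame.memory == "host":
        return "upload-then-convert"  # graceful fallback for host frames
    raise ValueError(f"unsupported frame: {frame}")


print(input_path(FrameInfo("cuda", "nv12")))   # zero-copy
print(input_path(FrameInfo("cuda", "rgb24")))  # convert-on-gpu
print(input_path(FrameInfo("host", "rgb24")))  # upload-then-convert
```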
Benchmark¶
# CPU-origin vs GPU zero-copy, per resolution, with CUDA-event-timed conversion
python -m pdum.rfb.benchmark --gpu
Reports, per resolution: the RGB→NV12 conversion cost (timed with
cupy.cuda.Event markers), the zero-copy encode latency, and the CPU-origin
latency for comparison. Requires the full stack above.
Alternative: the NVENC-source route¶
PyAV is the pragmatic backend (one dependency, no build once 18.0 ships). The other route is NVIDIA's own binding to the Video Codec SDK:
- PyNvVideoCodec / VPF: takes CUDA arrays directly (DLPack / CAI), bundles its own color conversion, and bypasses ffmpeg entirely. But: no cp314 wheel and no sdist on PyPI, so it can't pip install on 3.14. Building from the Video Codec SDK source is possible (CUDA + nvcc are present on a dev box) but needs the SDK headers (nvEncodeAPI.h) and is a heavier, NVIDIA-version-coupled dependency.
- A direct ctypes/cffi binding to libnvidia-encode: no build step (dlopen the driver lib), maximal control, but a large amount of NVENC-API plumbing to maintain.
Trade-off: PyAV reuses our existing Annex-B / decode-back test infrastructure and
adds zero new Python dependencies; the NVENC-SDK route removes the ffmpeg layer and
the PyAV-18 dependency but adds a build step and a hand-maintained binding. If the
SDK source is available, the most interesting evaluation is whether a thin
PyNvVideoCodec build (or a minimal cffi shim) can match the PyAV path's latency
while accepting the same DLPack frames gpu.cuda_frame already produces — in
which case it could slot in behind the same register_video_encoder("nvenc_gpu_pyav",
...) seam.
Caveats¶
- Consumer GPUs can transiently fail with EINVAL (or rarely hard-fault) on rapid NVENC session open/close churn. Production uses one long-lived encoder per connection and is unaffected; the test suite retries and GCs between encoders.
- Publish a fresh device buffer per frame: viewers share the reference and may read it asynchronously (same rule as the host path). Or opt into serve(gpu=True, own_frames=True): publish() then does a device-to-device copy into a recycled server-owned CuPy buffer, so you may reuse your own device tensor immediately (no reallocation, no release callback). See the frame ownership model.
- Even dimensions only (NV12), and width ≥ 160 (NVENC minimum).
- The encoder uses device 0 and the primary context; multi-GPU selection is a future extension.
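The size constraints are easy to check before a display is created. A minimal sketch, assuming only the two rules listed above (even dimensions, width of at least 160); `check_gpu_frame_size` is a hypothetical helper, not part of pdum.rfb:

```python
MIN_WIDTH = 160  # NVENC minimum width stated above


def check_gpu_frame_size(width: int, height: int) -> None:
    """Raise ValueError if the frame size cannot be used in GPU mode."""
    if width % 2 or height % 2:
        raise ValueError(f"NV12 needs even dimensions, got {width}x{height}")
    if width < MIN_WIDTH:
        raise ValueError(f"width must be at least {MIN_WIDTH}, got {width}")


check_gpu_frame_size(1920, 1080)  # passes silently

try:
    check_gpu_frame_size(1921, 1080)
except ValueError as exc:
    print(exc)  # NV12 needs even dimensions, got 1921x1080
```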