Performance¶
Per-frame encode numbers for every path, measured on one box with the built-in
benchmark. Use them to choose an encoder (see Installation) and to
sanity-check your own hardware with pdum-rfb benchmark.
Test box¶
| GPU | NVIDIA GeForce RTX 4090 Laptop GPU |
| Driver | 580.159.04 |
| OS / Python | Linux (Ubuntu 24.04) · CPython 3.14.6 |
| Pattern / frames | gradient, 120 frames, 30 fps, 10 Mbps target (H.264) |
| Tool | pdum-rfb benchmark (wraps python -m pdum.rfb.benchmark) |
"enc ms" is mean wall-clock per encode() call. For the GPU-resident rows
(nvenc-gpu-pyav, nvenc-gpu-pdum) it covers the on-GPU RGB→NV12 conversion and the
encode, with cudaDeviceSynchronize() on both sides — the realistic
"everything-on-GPU" cost. For nvenc-cpu (host) it covers the CPU rgb→yuv reformat +
PCIe upload + encode, i.e. what you pay when frames originate on the CPU. PSNR is
measured by decoding the bitstream back (Pillow / PyAV) and comparing to the source.
Headline — 1920×1080¶
| Path | enc ms | p95 ms | KB/frame | Mbps@30 | PSNR dB | Notes |
|---|---|---|---|---|---|---|
nvenc-gpu-pdum (SDK, GPU) |
2.02 | 2.21 | 32.7 | 8.03 | 44.30 | NVENC SDK; no PyAV |
nvenc-gpu-pyav (PyAV 18, GPU) |
3.08 | 3.30 | 34.4 | 8.45 | 43.61 | zero-copy via ffmpeg |
h264-cpu (libx264, CPU) |
5.39 | 6.19 | 40.9 | 10.06 | 44.18 | software |
nvenc-cpu (PyAV, host) |
9.20 | 9.12 | 34.4 | 8.44 | 43.66 | CPU reformat + upload |
jpeg q80 (image) |
3.63 | 3.94 | 91.2 | 22.41 | 34.31 | image-per-frame |
Both GPU-resident paths beat everything else; the SDK path is fastest (less
per-frame overhead than routing through ffmpeg's h264_nvenc). The host nvenc-cpu row
is slower than CPU libx264 here — that's the CPU rgb→yuv + PCIe upload tax, which
the GPU-resident paths skip entirely. Image-per-frame is fast to encode but an order
of magnitude larger on the wire at much lower quality.
Encode latency vs resolution (ms/frame)¶
| Path | 1280×720 | 1920×1080 | 2560×1440 | 3840×2160 |
|---|---|---|---|---|
nvenc-gpu-pdum (SDK, GPU) |
1.06 | 2.02 | 2.79 | 5.31 |
nvenc-gpu-pyav (PyAV 18, GPU) |
1.93 | 3.08 | 3.84 | 7.53 |
h264-cpu (libx264, CPU) |
3.71 | 5.39 | 9.07 | 16.22 |
jpeg q80 (image) |
2.02 | 3.63 | 6.68 | 15.88 |
nvenc-cpu (PyAV, host) |
6.16 | 9.20 | 14.40 | 30.29 |
The GPU-resident paths scale far better: at 4K the SDK path is 5.3 ms (≈188
fps headroom) versus 30 ms for host NVENC and 16 ms for CPU libx264. The host
path degrades fastest because the single-threaded libswscale rgb→yuv reformat and
the PCIe upload both grow with pixel count.
Takeaways¶
- Rendering on the GPU? Use a GPU-resident path and keep frames on the device.
The
nvenc-gpu-pdumpath is the fastest measured here and the easiest to install (a prebuilt wheel, no PyAV-18 build) — it's whatpdum-rfb doctorrecommends. nvenc-gpu-pyav(PyAV 18) reaches nearly the same speed if you prefer the PyAV/ffmpeg stack; the gap is per-frame wrapper overhead, not the encode itself.- Frames originate on the CPU?
h264-cpu(libx264) is the portable choice and often beats host NVENC once you count the reformat + upload. Reach for hostnvenc-cpumainly to offload the CPU, not for latency. - Image path is for stills/snapshots and the lossless-final still, not motion.
Apple Silicon — VideoToolbox (macOS)¶
Different box, different encoder — so these are not comparable to the RTX 4090
table above; read them on their own. Measured on an M-series Mac (macOS 26, MLX
0.31) with examples/mlx_vt_bench.py: MLX renders on the GPU, a Metal kernel converts
RGB→NV12 on the GPU, and Apple VideoToolbox encodes (serve(gpu=True) on macOS).
| Resolution | convert RGB→NV12 (GPU) | VideoToolbox encode | encode() total |
fps |
|---|---|---|---|---|
| 1280×720 | 0.36 ms | 5.61 ms | 5.67 ms | 142 |
| 1920×1080 | 0.44 ms | 5.83 ms | 5.97 ms | 134 |
| 2560×1440 | 0.52 ms | 9.33 ms | 9.55 ms | 89 |
| 3840×2160 | 0.63 ms | 18.92 ms | 19.35 ms | 47 |
- Encode is a near-flat ~5.6 ms floor at 720p and 1080p (VideoToolbox's synchronous
low-latency
CompleteFrameslatency dominates over pixel throughput) before compute takes over at 1440p/4K. - The GPU color conversion is the lever. Sub-millisecond (≈0.3–0.6 ms) on the GPU
vs ~6.6 ms for the numpy/CPU conversion at 1080p — a ~23× cut that also frees a
CPU core. This is what publishing an MLX
mx.array(over a plain numpy array) buys. - Input zero-copy and pipelining were measured to not help here (unified memory, synchronous RC). See the Apple Metal guide.
On a Mac, pdum-rfb benchmark auto-includes a vtenc row (no flag — the same
auto-detect as host NVENC on Linux): vtenc-gpu when MLX converts on the GPU, else
vtenc-cpu. That gives a single per-frame encode figure comparable to the other rows;
examples/mlx_vt_bench.py gives the render/convert/copy/encode breakdown above.
Reproduce¶
pip install 'habemus-papadum-rfb[cli]'
pdum-rfb doctor # what's available + the recommended path
pdum-rfb benchmark --sizes 1280x720,1920x1080,2560x1440,3840x2160 --bitrate 10M
doctor on the test box:
Component Status Detail
Python ✓ ok 3.14.6 (need ≥3.14)
h264-cpu — CPU H.264 (libx264) ✓ ok libx264 present
nvenc-cpu — host NVENC (PyAV h264_nvenc) ✓ ok available
nvenc-gpu-pyav — zero-copy CUDA→NVENC (PyAV≥18) ✓ ok available # PyAV-18 venv only
nvenc-gpu-pdum — NVENC SDK (pdum.nvenc) ✓ ok available (no PyAV needed)
→ Recommended: nvenc-gpu-pdum — NVENC SDK (pdum.nvenc): fastest GPU path, no PyAV
benchmark auto-detects what's installed: the nvenc-gpu-pyav row appears only with
PyAV ≥ 18, and nvenc-gpu-pdum only with the habemus-papadum-nvenc package.
Methodology notes & caveats¶
- The
nvenc-gpu-pyav(PyAV 18) row was measured in a separate, throwaway venv built withscripts/install-gpu.sh(PyAV 18.0.0rc0 from source). The project's own venv stays on PyAV 17.1 on purpose, so this number does not come from the dev env; it was produced with the identicalbenchmark_nvenc_gpu_pyavharness at the same settings and is directly comparable to the other rows. - Encoder configs are close but not byte-identical across paths (preset/tuning
differ between the SDK binding and PyAV's
h264_nvenc), so treat small PSNR/size differences as noise; the latency ranking is the robust result. - Consumer GPUs cap concurrent NVENC sessions and can transiently stall under rapid session open/close; production uses one long-lived encoder. Numbers are steady-state over 120 frames after a forced IDR on frame 0.
- These
nvenc-gpu-pdumfigures are the default zero-latency path (extra_output_delay=0: each frame's access unit returns from its ownencode(), no pipeline overlap), so they are honest end-to-end latency, not a pipelined best case. Measured ~2.3 ms at 1080p here — still the fastest path. Opting intoserve(encode_pipeline_depth=k>0)tradesk/fpsof latency for higher sustained throughput (≈1.2× at 1080p, up to ~1.5× at 720p; see the measured table in Pipelined encode). The default stays0for the interactive, latest-frame-wins use case. - Synthetic
gradientpattern; real scenes change bitrate/PSNR but not the latency ordering. Bitrate is a 10 Mbps VBR target. - See Zero-copy CUDA→NVENC and the NVENC SDK evaluation for the architecture behind the two GPU rows.