Performance¶

Per-frame encode numbers for every path, measured on one box with the built-in benchmark. Use them to choose an encoder (see Installation) and to sanity-check your own hardware with pdum-rfb benchmark.

Test box¶


GPU	NVIDIA GeForce RTX 4090 Laptop GPU
Driver	580.159.04
OS / Python	Linux (Ubuntu 24.04) · CPython 3.14.6
Pattern / frames	`gradient`, 120 frames, 30 fps, 10 Mbps target (H.264)
Tool	`pdum-rfb benchmark` (wraps `python -m pdum.rfb.benchmark`)

"enc ms" is mean wall-clock per encode() call. For the GPU-resident rows (nvenc-gpu-pyav, nvenc-gpu-pdum) it covers the on-GPU RGB→NV12 conversion and the encode, with cudaDeviceSynchronize() on both sides — the realistic "everything-on-GPU" cost. For nvenc-cpu (host) it covers the CPU rgb→yuv reformat + PCIe upload + encode, i.e. what you pay when frames originate on the CPU. PSNR is measured by decoding the bitstream back (Pillow / PyAV) and comparing to the source.

Headline — 1920×1080¶

Path	enc ms	p95 ms	KB/frame	Mbps@30	PSNR dB	Notes
`nvenc-gpu-pdum` (SDK, GPU)	2.02	2.21	32.7	8.03	44.30	NVENC SDK; no PyAV
`nvenc-gpu-pyav` (PyAV 18, GPU)	3.08	3.30	34.4	8.45	43.61	zero-copy via ffmpeg
`h264-cpu` (libx264, CPU)	5.39	6.19	40.9	10.06	44.18	software
`nvenc-cpu` (PyAV, host)	9.20	9.12	34.4	8.44	43.66	CPU reformat + upload
`jpeg q80` (image)	3.63	3.94	91.2	22.41	34.31	image-per-frame

Both GPU-resident paths beat everything else; the SDK path is fastest (less per-frame overhead than routing through ffmpeg's h264_nvenc). The host nvenc-cpu row is slower than CPU libx264 here — that's the CPU rgb→yuv + PCIe upload tax, which the GPU-resident paths skip entirely. Image-per-frame is fast to encode but an order of magnitude larger on the wire at much lower quality.

Encode latency vs resolution (ms/frame)¶

Path	1280×720	1920×1080	2560×1440	3840×2160
`nvenc-gpu-pdum` (SDK, GPU)	1.06	2.02	2.79	5.31
`nvenc-gpu-pyav` (PyAV 18, GPU)	1.93	3.08	3.84	7.53
`h264-cpu` (libx264, CPU)	3.71	5.39	9.07	16.22
`jpeg q80` (image)	2.02	3.63	6.68	15.88
`nvenc-cpu` (PyAV, host)	6.16	9.20	14.40	30.29

The GPU-resident paths scale far better: at 4K the SDK path is 5.3 ms (≈188 fps headroom) versus 30 ms for host NVENC and 16 ms for CPU libx264. The host path degrades fastest because the single-threaded libswscale rgb→yuv reformat and the PCIe upload both grow with pixel count.

Takeaways¶

Rendering on the GPU? Use a GPU-resident path and keep frames on the device. The nvenc-gpu-pdum path is the fastest measured here and the easiest to install (a prebuilt wheel, no PyAV-18 build) — it's what pdum-rfb doctor recommends.
nvenc-gpu-pyav (PyAV 18) reaches nearly the same speed if you prefer the PyAV/ffmpeg stack; the gap is per-frame wrapper overhead, not the encode itself.
Frames originate on the CPU? h264-cpu (libx264) is the portable choice and often beats host NVENC once you count the reformat + upload. Reach for host nvenc-cpu mainly to offload the CPU, not for latency.
Image path is for stills/snapshots and the lossless-final still, not motion.

Apple Silicon — VideoToolbox (macOS)¶

Different box, different encoder — so these are not comparable to the RTX 4090 table above; read them on their own. Measured on an M-series Mac (macOS 26, MLX 0.31) with examples/mlx_vt_bench.py: MLX renders on the GPU, a Metal kernel converts RGB→NV12 on the GPU, and Apple VideoToolbox encodes (serve(gpu=True) on macOS).

Resolution	convert RGB→NV12 (GPU)	VideoToolbox encode	`encode()` total	fps
1280×720	0.36 ms	5.61 ms	5.67 ms	142
1920×1080	0.44 ms	5.83 ms	5.97 ms	134
2560×1440	0.52 ms	9.33 ms	9.55 ms	89
3840×2160	0.63 ms	18.92 ms	19.35 ms	47

Encode is a near-flat ~5.6 ms floor at 720p and 1080p (VideoToolbox's synchronous low-latency CompleteFrames latency dominates over pixel throughput) before compute takes over at 1440p/4K.
The GPU color conversion is the lever. Sub-millisecond (≈0.3–0.6 ms) on the GPU vs ~6.6 ms for the numpy/CPU conversion at 1080p — a ~23× cut that also frees a CPU core. This is what publishing an MLX mx.array (over a plain numpy array) buys.
Input zero-copy and pipelining were measured to not help here (unified memory, synchronous RC). See the Apple Metal guide.

On a Mac, pdum-rfb benchmark auto-includes a vtenc row (no flag — the same auto-detect as host NVENC on Linux): vtenc-gpu when MLX converts on the GPU, else vtenc-cpu. That gives a single per-frame encode figure comparable to the other rows; examples/mlx_vt_bench.py gives the render/convert/copy/encode breakdown above.

Reproduce¶

pip install 'habemus-papadum-rfb[cli]'
pdum-rfb doctor                 # what's available + the recommended path
pdum-rfb benchmark --sizes 1280x720,1920x1080,2560x1440,3840x2160 --bitrate 10M

doctor on the test box:

 Component                                        Status   Detail
 Python                                           ✓ ok     3.14.6 (need ≥3.14)
 h264-cpu — CPU H.264 (libx264)                   ✓ ok     libx264 present
 nvenc-cpu — host NVENC (PyAV h264_nvenc)         ✓ ok     available
 nvenc-gpu-pyav — zero-copy CUDA→NVENC (PyAV≥18)  ✓ ok     available   # PyAV-18 venv only
 nvenc-gpu-pdum — NVENC SDK (pdum.nvenc)          ✓ ok     available (no PyAV needed)
 → Recommended: nvenc-gpu-pdum — NVENC SDK (pdum.nvenc): fastest GPU path, no PyAV

benchmark auto-detects what's installed: the nvenc-gpu-pyav row appears only with PyAV ≥ 18, and nvenc-gpu-pdum only with the habemus-papadum-nvenc package.

Methodology notes & caveats¶

The nvenc-gpu-pyav (PyAV 18) row was measured in a separate, throwaway venv built with scripts/install-gpu.sh (PyAV 18.0.0rc0 from source). The project's own venv stays on PyAV 17.1 on purpose, so this number does not come from the dev env; it was produced with the identical benchmark_nvenc_gpu_pyav harness at the same settings and is directly comparable to the other rows.
Encoder configs are close but not byte-identical across paths (preset/tuning differ between the SDK binding and PyAV's h264_nvenc), so treat small PSNR/size differences as noise; the latency ranking is the robust result.
Consumer GPUs cap concurrent NVENC sessions and can transiently stall under rapid session open/close; production uses one long-lived encoder. Numbers are steady-state over 120 frames after a forced IDR on frame 0.
These nvenc-gpu-pdum figures are the default zero-latency path (extra_output_delay=0: each frame's access unit returns from its own encode(), no pipeline overlap), so they are honest end-to-end latency, not a pipelined best case. Measured ~2.3 ms at 1080p here — still the fastest path. Opting into serve(encode_pipeline_depth=k>0) trades k/fps of latency for higher sustained throughput (≈1.2× at 1080p, up to ~1.5× at 720p; see the measured table in Pipelined encode). The default stays 0 for the interactive, latest-frame-wins use case.
Synthetic gradient pattern; real scenes change bitrate/PSNR but not the latency ordering. Bitrate is a 10 Mbps VBR target.
See Zero-copy CUDA→NVENC and the NVENC SDK evaluation for the architecture behind the two GPU rows.