28 August 2023

Proxmox performance vs Native for Machine Learning

by Mathieu Poliquin

As mentioned in my last post, I ran some performance tests with Proxmox, and you can find the results here. For those in a hurry: performance is very close to native, and Proxmox is now what I use on my home server.

Hardware specs:

Software specs:

Settings used

For the GPU passthrough I used the settings described in my previous blog post on how to set it up.

Windows 11 VM:

Ubuntu 22.04 VM:

As you may have noticed, the CPU type is set to host. This is important if you want to get results similar to the ones below.
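One quick way to confirm from inside the Ubuntu VM that the host CPU type actually took effect is to look at the model name the guest sees. The sketch below is only an illustration and not part of the original test setup: the helper names are hypothetical, and the list of generic model strings is an assumption based on what QEMU's emulated CPU types usually report, so it may not be exhaustive.

```python
from pathlib import Path

# Assumption: model names typically reported by generic emulated CPU types.
# With CPU type "host" the guest should show the real CPU name instead.
GENERIC_MODELS = ("QEMU Virtual CPU", "Common KVM processor")


def cpu_model_name(cpuinfo_text):
    """Return the first 'model name' value found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("model name"):
            return line.split(":", 1)[1].strip()
    return None


def looks_like_host_cpu(cpuinfo_text):
    """True if the guest reports a model name that is not a generic one."""
    model = cpu_model_name(cpuinfo_text)
    if model is None:
        return False
    return not any(generic in model for generic in GENERIC_MODELS)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # Linux guests only
        text = cpuinfo.read_text()
        print(cpu_model_name(text), looks_like_host_cpu(text))
```

If this prints a generic name, the VM is probably still using an emulated CPU model and the numbers below may not be reproducible.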

Cinebench r23

all cores

On native Windows the highest score I got for r23 was 22400, which represents a 1.8% difference.

For the test with no E-cores, I simply set the affinity to the first 16 logical cores (threads), since the last 4 are E-cores, rather than disabling them in the BIOS.
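As a rough illustration of what "affinity to the first 16 threads" means, an affinity setting is usually expressed as a bitmask with one bit per logical CPU. The sketch below builds that mask; the function name is hypothetical, and the assumption is that logical CPUs 0 to 15 are the P-core threads, as described above.

```python
def affinity_mask(logical_cpus):
    """Build a bitmask with one bit set for each logical CPU index."""
    mask = 0
    for cpu in logical_cpus:
        mask |= 1 << cpu
    return mask


# Assumption: logical CPUs 0-15 are the P-core threads, 16-19 the E-cores.
p_core_threads = range(16)
mask = affinity_mask(p_core_threads)
print(f"{mask:X}")  # prints FFFF
```

On Windows, a hex mask like this can be passed to `start /affinity` in cmd; on Linux, `os.sched_setaffinity(0, range(16))` has the same effect for the current process. Which method was used for the Cinebench run is not stated here, so treat both as examples.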

no e cores

Machine learning

python3 --env=Pong-Atari2600 --num_timesteps=10_000_000 --num_env=20
Test    Proxmox (fps)    Native (fps)
Pong    2062             2192
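To put the Pong numbers in perspective, here is a minimal sketch of the throughput overhead calculation, using the two values from the table above. The function name is hypothetical.

```python
def relative_slowdown(native, virtualized):
    """Percentage of throughput lost relative to native (higher fps is better)."""
    return (native - virtualized) / native * 100.0


# Values from the table above (frames per second).
pong_slowdown = relative_slowdown(native=2192, virtualized=2062)
print(f"Pong: {pong_slowdown:.1f}% slower under Proxmox")
```

This works out to roughly 5.9% for this particular run, which is a single measurement and will vary from run to run.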

PyTorch benchmark:

python [MODELNAME] -d cuda -t eval
Test             Proxmox (ms)    Native (ms)
llama            15.950          16.315
resnet50         33.554          32.152
resnet152        78.817          77.712
vgg16            8.983           9.153
hf_gpt2          30.571          29.933
hf_gpt2_large    148.251         147.398
hf_bert          10.905          11.136
yolov3           50.128          48.884
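Since some models come out slightly faster under Proxmox and others slightly slower, it can help to summarize the table with a per-model difference and a single aggregate. The sketch below uses the latencies from the table; the geometric mean of the Proxmox/native ratios is my choice of summary here, not something measured by the benchmark itself.

```python
from statistics import geometric_mean

# Latencies in ms from the table above: (Proxmox, native). Lower is better.
results = {
    "llama": (15.950, 16.315),
    "resnet50": (33.554, 32.152),
    "resnet152": (78.817, 77.712),
    "vgg16": (8.983, 9.153),
    "hf_gpt2": (30.571, 29.933),
    "hf_gpt2_large": (148.251, 147.398),
    "hf_bert": (10.905, 11.136),
    "yolov3": (50.128, 48.884),
}

# Positive means Proxmox is slower than native, negative means faster.
percent_diff = {
    model: (proxmox - native) / native * 100.0
    for model, (proxmox, native) in results.items()
}

# A ratio above 1.0 means Proxmox is slower on average.
mean_ratio = geometric_mean(
    proxmox / native for proxmox, native in results.values()
)

for model, diff in percent_diff.items():
    print(f"{model:<15} {diff:+.2f}%")
print(f"geometric mean ratio (Proxmox / native): {mean_ratio:.4f}")
```

Differences of a couple of percent in either direction are likely within run-to-run noise, which is consistent with the conclusion that the results are close to native.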


CUDA-Z returns similar results to native:

CUDA-Z Report
Version: 0.10.251 64 bit
OS Version: Linux 6.2.0-31-generic #31~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Aug 16 13:45:26 UTC 2 x86_64
Driver Version: 535.86.05
Driver Dll Version: 12.20 (535.86.05)
Runtime Dll Version: 6.50

Core Information
	Name: NVIDIA GeForce RTX 2060
	Compute Capability: 7.5
	Clock Rate: 1680 MHz
	PCI Location: 0:0:16
	Multiprocessors: 34
	Threads Per Multiproc.: 1024
	Warp Size: 32
	Regs Per Block: 65536
	Threads Per Block: 1024
	Threads Dimensions: 1024 x 1024 x 64
	Grid Dimensions: 2147483647 x 65535 x 65535
	Watchdog Enabled: Yes
	Integrated GPU: No
	Concurrent Kernels: Yes
	Compute Mode: Default
	Stream Priorities: Yes

Memory Information
	Total Global: 11.7401 GiB
	Bus Width: 192 bits
	Clock Rate: 7001 MHz
	Error Correction: No
	L2 Cache Size: 48 KiB
	Shared Per Block: 48 KiB
	Pitch: 2048 MiB
	Total Constant: 64 KiB
	Texture Alignment: 512 B
	Texture 1D Size: 131072
	Texture 2D Size: 131072 x 65536
	Texture 3D Size: 16384 x 16384 x 16384
	GPU Overlap: Yes
	Map Host Memory: Yes
	Unified Addressing: Yes
	Async Engine: No

Performance Information
Memory Copy
	Host Pinned to Device: 11.757 GiB/s
	Host Pageable to Device: 11.2587 GiB/s
	Device to Host Pinned: 12.2091 GiB/s
	Device to Host Pageable: 10.8422 GiB/s
	Device to Device: 127.455 GiB/s
GPU Core Performance
	Single-precision Float: 8515.46 Gflop/s
	Double-precision Float: 201.306 Gflop/s
	64-bit Integer: 1764.03 Giop/s
	32-bit Integer: 6277.72 Giop/s
	24-bit Integer: 6212.7 Giop/s


As stated in the intro, for my use cases Proxmox performance is quite close to native, and the small overhead is well worth all the benefits of virtualization. In a future update I plan to test SSD speed and network bandwidth, as well as a few more ML tasks, such as one with Stable Diffusion.

tags: Proxmox - performance - machine learning - Ubuntu - Windows - Cinebench r23