Benchmarks on the DLWSs
From Robin
Current revision as of 14:48, 14 December 2022
ROBIN GPU Benchmarking
We performed a benchmark on the deep learning workstations to assess their performance in terms of training throughput. We have documented the process for reproducibility.
Methodology
According to Lambda Labs, training throughput is a better measure of GPU performance for deep learning training than theoretical peak numbers such as FLOPS. Since we used image data, the unit is [images/sec].
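As an illustration, throughput is simply the number of images processed divided by the wall-clock time of the timed window. A minimal sketch (the variable values here are made up for illustration, not measured):

```python
# Throughput = images processed per second of wall-clock training time.
num_batches = 100       # batches processed during the timed window
batch_size = 112        # images per batch
elapsed_seconds = 21.1  # measured wall-clock time (illustrative value)

throughput = (num_batches * batch_size) / elapsed_seconds
print(f"{throughput:.2f} images/sec")
```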
Tested Environment
- OS: Red Hat Enterprise Linux release 8.7 (Ootpa)
- TensorFlow version: 2.4.1
- CUDA Version 11.7
- CUDNN Version 7.X.X
Benchmarking tool
We used a script that allows us to tweak several parameters:
- Toggle XLA
- Number of GPUs
- Number of Batches
- Number of Runs
- Model type (ResNet50, ResNet152, AlexNet, Inceptionv3, Inceptionv4, VGG-16)
- Precision (floating point 16 / 32)
- Inference / training mode
Dataset
The script can use either synthetic or real data. Synthetic data consists of images with random pixel values, generated directly on the GPU.
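To illustrate what such a synthetic batch looks like, here is a NumPy sketch with the standard ResNet50 input shape (224x224 RGB) and random ImageNet-style labels. In the real benchmark the tensors are generated on the GPU by TensorFlow; this sketch only shows the shapes and value ranges involved:

```python
import numpy as np

# Synthetic ImageNet-like batch: random pixel values instead of real photos.
batch_size = 112
images = np.random.uniform(0.0, 255.0,
                           size=(batch_size, 224, 224, 3)).astype(np.float32)
# Random class indices out of ImageNet's 1000 classes.
labels = np.random.randint(0, 1000, size=(batch_size,))

print(images.shape, labels.shape)
```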
Parameters for this benchmark
- XLA disabled
- 1 GPU
- 100 batches
- 1 run
- ResNet50 model
- fp32 precision
- training
- ImageNet (synthetic data)
Specifically, the command used is:
./batch_benchmark.sh 1 1 1 100 10 config/config_resnet50_replicated_fp32_train_syn
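The core of any such benchmark is a timed training loop with an untimed warm-up phase. The following is a framework-agnostic sketch of that idea, not the actual internals of batch_benchmark.sh; `train_step` and the warm-up count are placeholders:

```python
import time

def measure_throughput(train_step, num_batches, batch_size, warmup=10):
    """Run `warmup` untimed batches, then time `num_batches` batches
    and return throughput in images/sec."""
    for _ in range(warmup):          # let GPU clocks and caches settle
        train_step()
    start = time.perf_counter()
    for _ in range(num_batches):
        train_step()
    elapsed = time.perf_counter() - start
    return num_batches * batch_size / elapsed

# Usage with a dummy step that just sleeps for 1 ms:
tput = measure_throughput(lambda: time.sleep(0.001),
                          num_batches=100, batch_size=112)
print(f"{tput:.1f} images/sec")
```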
Results
| Workstation | GPU | GPU memory [MB] | Compute Capability | Batch size | Throughput [images/sec] |
|---|---|---|---|---|---|
| Rudolph | NVIDIA GeForce GTX 1080 Ti | 10216 | 6.1 | 56 / 112 | 211.78 / 213.38 |
| Dunder | NVIDIA GeForce RTX 3090 | 22378 | 8.6 | 112 | 532.09 |
| Dancer | NVIDIA GeForce RTX 3090 | 22373 | 8.6 | 112 | 531.04 |
| Vixen | NVIDIA GeForce RTX 3070 | 7144 | 8.6 | 40 / 112 | 285.59 / OOM* |
*OOM = out of memory error
The trend is clear: the more GPU memory and the higher the compute capability, the higher the training throughput. This matches our expectations.
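For a concrete comparison, the RTX 3090 machines deliver roughly 2.5x the ResNet50 throughput of the GTX 1080 Ti (values taken from the results table above, at batch size 112):

```python
# Throughput per workstation from the results table [images/sec].
rudolph = 213.38  # GTX 1080 Ti, batch size 112
dunder = 532.09   # RTX 3090, batch size 112

speedup = dunder / rudolph
print(f"{speedup:.2f}x")  # roughly 2.5x
```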
Notes
- The default batch size is automatically determined by the script, as a function of the available GPU RAM size and the number of tunable parameters in the chosen model. For more details about choosing batch sizes, see here
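A rough version of such a heuristic might subtract a fixed budget for model weights (plus gradient and optimizer copies) and a safety reserve from the available GPU memory, then divide by a per-image activation cost. This is entirely illustrative; the script's actual formula may differ, and the per-image cost and reserve below are assumed values:

```python
def default_batch_size(gpu_mem_mb, param_count, bytes_per_param=4,
                       mem_per_image_mb=80, reserve_mb=1024):
    """Illustrative heuristic: memory left after weights, gradients and
    momentum buffers plus a fixed reserve, divided by a per-image cost."""
    weights_mb = param_count * bytes_per_param * 3 / 1e6  # weights + grads + momentum
    free_mb = gpu_mem_mb - weights_mb - reserve_mb
    return max(1, int(free_mb // mem_per_image_mb))

# ResNet50 has ~25.6M parameters.
print(default_batch_size(10216, 25_600_000))  # GTX 1080 Ti memory
print(default_batch_size(22378, 25_600_000))  # RTX 3090 memory
```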
- It would be interesting to see how much the XLA option increases the training throughput.
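For a follow-up run, XLA auto-clustering can be enabled for any TensorFlow program via the standard `TF_XLA_FLAGS` environment variable (or in code with `tf.config.optimizer.set_jit(True)`), without modifying the benchmark script itself:

```shell
# Enable XLA auto-clustering (JIT compilation) for TensorFlow;
# --tf_xla_auto_jit=2 compiles all eligible clusters.
export TF_XLA_FLAGS=--tf_xla_auto_jit=2

# Then re-run the benchmark as before, e.g.:
#   ./batch_benchmark.sh 1 1 1 100 10 config/config_resnet50_replicated_fp32_train_syn
```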