Benchmarks on the DLWSs

ROBIN GPU Benchmarking

We benchmarked the deep learning workstations (DLWSs) to measure their training throughput. The process is documented here so that the results can be reproduced.

Methodology

Following Lambda Labs, we use training throughput rather than e.g. FLOPS as the performance measure, because it is the better indicator of GPU performance when training deep learning models. We used image data, so throughput is reported in images/sec.
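
As a rough illustration of the metric, throughput can be computed as the number of images processed divided by the elapsed time. The sketch below is not the benchmark script itself: train_step is a hypothetical stand-in for one training iteration, and unlike most real benchmarking tools it does not discard warm-up iterations.

```python
import time
from typing import Callable


def throughput(batch_size: int, num_batches: int, elapsed_seconds: float) -> float:
    """Images processed per second over a timed run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return batch_size * num_batches / elapsed_seconds


def measure(train_step: Callable[[], None], batch_size: int, num_batches: int) -> float:
    """Time num_batches calls to train_step and return images/sec.

    train_step is a hypothetical placeholder for one training iteration.
    """
    start = time.perf_counter()
    for _ in range(num_batches):
        train_step()
    elapsed = time.perf_counter() - start
    return throughput(batch_size, num_batches, elapsed)
```

For instance, 100 batches of 112 images that finish in 20 seconds give throughput(112, 100, 20.0), i.e. 560 images/sec.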

Tested Environment

  • OS: Red Hat Enterprise Linux release 8.7 (Ootpa)
  • TensorFlow version: 2.4.1
  • CUDA version: 11.7
  • cuDNN version: 7.X.X

Benchmarking tool

We used a script that allows us to tweak several parameters:

  • Toggle XLA
  • Number of GPUs
  • Number of Batches
  • Number of Runs
  • Model type (ResNet50, ResNet152, AlexNet, Inceptionv3, Inceptionv4, VGG-16)
  • Precision (fp16 / fp32)
  • Inference / training mode

Dataset

The script can run on either synthetic or real data. Synthetic data consists of images with random pixel values, generated directly on the GPU.
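
To make "images of random pixel values" concrete, here is a minimal CPU-side sketch using only the Python standard library. It is not how the benchmark script generates data (that happens on the GPU), and the 224 x 224 x 3 image shape is an assumption; the source does not state the image size.

```python
import random
from typing import List

# Assumed image shape (height, width, channels); not stated in the source.
HEIGHT, WIDTH, CHANNELS = 224, 224, 3


def synthetic_image(rng: random.Random) -> bytes:
    """One image as raw bytes, each channel value drawn uniformly from 0..255."""
    n = HEIGHT * WIDTH * CHANNELS
    return rng.getrandbits(8 * n).to_bytes(n, "little")


def synthetic_batch(batch_size: int, seed: int = 0) -> List[bytes]:
    """A batch of random images; the seed makes the batch reproducible."""
    rng = random.Random(seed)
    return [synthetic_image(rng) for _ in range(batch_size)]
```

Because no real images are read from disk, a benchmark on such data measures the GPU rather than the input pipeline.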

Parameters for this benchmark

  • XLA disabled
  • 1 GPU
  • 100 batches
  • 1 run
  • ResNet50 model
  • fp32 precision
  • training
  • ImageNet (synthetic data)

Specifically, the command used is:

./batch_benchmark.sh 1 1 1 100 10 config/config_resnet50_replicated_fp32_train_syn
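
The sketch below shows one way to assemble that command line from the parameters listed above. The mapping of the first four positional arguments (minimum GPUs, maximum GPUs, runs, batches) is an assumption read off the parameter list, the meaning of the fifth argument is not stated in the source and is passed through unchanged, and the config filename pattern is inferred from the single filename shown. BenchmarkRun is a hypothetical helper, not part of the benchmark script.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BenchmarkRun:
    min_gpus: int = 1       # assumed meaning of argument 1
    max_gpus: int = 1       # assumed meaning of argument 2
    num_runs: int = 1       # assumed meaning of argument 3
    num_batches: int = 100  # assumed meaning of argument 4
    fifth_arg: int = 10     # meaning not stated in the source; passed through as-is
    model: str = "resnet50"
    precision: str = "fp32"
    mode: str = "train"
    data: str = "syn"       # synthetic data

    def config_path(self) -> str:
        # Pattern inferred from one filename; "replicated" is copied verbatim.
        return (
            f"config/config_{self.model}_replicated_"
            f"{self.precision}_{self.mode}_{self.data}"
        )

    def argv(self) -> List[str]:
        """Argument list suitable for subprocess.run(run.argv())."""
        numbers = [self.min_gpus, self.max_gpus, self.num_runs,
                   self.num_batches, self.fifth_arg]
        return ["./batch_benchmark.sh", *map(str, numbers), self.config_path()]


print(" ".join(BenchmarkRun().argv()))
```

With the default values, the printed line matches the command above.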

Results

Workstation  GPU                         GPU RAM [MB]  Compute capability  Batch size  Throughput [images/sec]
Rudolph      NVIDIA GeForce GTX 1080 Ti  10216         6.1                 56          211.78
Rudolph      NVIDIA GeForce GTX 1080 Ti  10216         6.1                 112         213.38
Dunder       NVIDIA GeForce RTX 3090     22378         8.6                 112         532.09
Dancer       NVIDIA GeForce RTX 3090     22373         8.6                 112         531.04
Vixen        NVIDIA GeForce RTX 3070     7144          8.6                 40          285.59
Vixen        NVIDIA GeForce RTX 3070     7144          8.6                 112         OOM*

*OOM = out of memory error

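Overall, throughput increases with compute capability and GPU RAM, as expected. Dunder and Dancer (compute capability 8.6, about 22 GB of RAM) reach roughly 530 images/sec, about 2.5 times the throughput of Rudolph (compute capability 6.1, about 10 GB). Vixen has the same compute capability as Dunder and Dancer but only about 7 GB of RAM; it reaches 285.59 images/sec at batch size 40 and runs out of memory at batch size 112. RAM size alone does not determine throughput, however: Rudolph has more RAM than Vixen but is slower. On Rudolph, doubling the batch size from 56 to 112 changed throughput by less than 1%.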
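
The comparison can be made explicit by expressing each result relative to a baseline. The sketch below uses the throughput figures from the results table; choosing Rudolph at batch size 56 as the baseline is our own choice for illustration, and relative_throughput is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, int]  # (workstation, batch size)

# Throughput in images/sec from the results table; None marks the OOM run.
RESULTS: Dict[Key, Optional[float]] = {
    ("Rudolph", 56): 211.78,
    ("Rudolph", 112): 213.38,
    ("Dunder", 112): 532.09,
    ("Dancer", 112): 531.04,
    ("Vixen", 40): 285.59,
    ("Vixen", 112): None,
}


def relative_throughput(results: Dict[Key, Optional[float]],
                        baseline: Key) -> Dict[Key, float]:
    """Throughput of each successful run divided by the baseline run."""
    base = results[baseline]
    if base is None:
        raise ValueError("baseline run has no throughput (OOM)")
    return {key: round(value / base, 2)
            for key, value in results.items() if value is not None}


ratios = relative_throughput(RESULTS, baseline=("Rudolph", 56))
for (name, batch), ratio in ratios.items():
    print(f"{name:8s} batch {batch:3d}: {ratio:.2f}x")
```

Runs that ended in an out of memory error are skipped rather than reported as zero, so they do not distort the comparison.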

Notes

  • The script determines the default batch size automatically, as a function of the available GPU RAM and the number of trainable parameters in the chosen model. For more details about choosing batch sizes, see here.
  • It would be interesting to measure how much enabling XLA increases the training throughput.