Performance

Performance values for the Charmed HPC Benchmarks running on Microsoft Azure are provided on this page. Steps for reproducing these results are available in the benchmarking suite documentation. These results are intended to provide best-possible performance marks to help guide tuning and identify bottlenecks. Real-world performance may vary depending on factors such as cluster workload and resource contention.

Metrics

GPU performance and InfiniBand interconnect performance are measured on suitably equipped VMs to illustrate the ability of Charmed HPC to leverage the HPC capabilities of the cloud.

GPU performance is measured for both single precision workloads, where speed and larger scale are preferred over numerical accuracy, and double precision workloads, where additional accuracy is required.

Latency and bandwidth are measured for InfiniBand to determine how quickly a message can traverse the network, as well as the network throughput. These metrics govern the ability of the cluster to run MPI and other distributed-memory applications across nodes.

GPU flops

GPU single- and double-precision floating point operations per second (flops) are measured by the gpu_burn stress test (with code modifications by the ReFrame developers), built using the CUDA toolkit.
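
As a rough illustration of how such a measurement can be driven from ReFrame, the sketch below runs a pre-built gpu_burn binary and extracts the sustained Gflop/s it reports. The class name, executable path, command-line option, and output format are assumptions made for illustration; this is not the benchmark suite's own test.

# Minimal ReFrame run-only test sketch for a gpu_burn-style stress test.
# The executable location, the "60 seconds" option, and the output format
# are assumptions; adjust them to the actual build of gpu_burn in use.
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class GpuBurnSketch(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = './gpu_burn'      # assumed path to the CUDA-built binary
    executable_opts = ['60']       # assumed: stress the GPU for 60 seconds
    num_gpus_per_node = 1          # one Tesla T4 per Standard_NC4as_T4_v3

    @sanity_function
    def gpu_reported_ok(self):
        # gpu_burn prints an "OK" status per GPU on a clean run
        return sn.assert_found(r'OK', self.stdout)

    @performance_function('Gflop/s')
    def gflops(self):
        # assumes throughput lines of the form "(<value> Gflop/s)"
        return sn.max(sn.extractall(r'\((?P<gf>\S+) Gflop/s\)',
                                    self.stdout, 'gf', float))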

InfiniBand interconnect latency and bandwidth

InfiniBand RDMA latency and bandwidth are measured by the OSU Micro-Benchmarks for MPI and the Intel MPI Benchmarks (IMB), built with OpenMPI and GCC. All runs are performed on two cluster compute nodes, with one MPI process per node.

Point-to-point performance is measured with the OSU osu_bw and osu_latency benchmarks, as well as the IMB MPI-1 PingPong benchmark.

Collective performance is measured with the OSU osu_alltoall and osu_allreduce benchmarks, as well as the IMB MPI-1 AllReduce benchmark.
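
As an illustrative sketch of the two-node point-to-point setup, a ReFrame run-only test could look like the following. The class name and output-parsing regex are assumptions, and the sketch assumes the osu_latency binary is on the PATH of the compute nodes; it is not the benchmark suite's own test.

# Minimal ReFrame sketch for a two-node osu_latency run, one MPI rank per node.
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class OsuLatencySketch(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'osu_latency'   # assumes the OSU binary is on the PATH
    num_tasks = 2                # two MPI processes in total...
    num_tasks_per_node = 1       # ...placed on two separate compute nodes

    @sanity_function
    def has_one_byte_row(self):
        # osu_latency prints "<size> <latency>" rows; check the 1-byte row exists
        return sn.assert_found(r'^1\s+\S+', self.stdout)

    @performance_function('us')
    def latency_1byte(self):
        # extract the latency reported for the 1-byte message size
        return sn.extractsingle(r'^1\s+(?P<lat>\S+)', self.stdout, 'lat', float)

The bandwidth and collective measurements follow the same pattern, swapping in osu_bw, osu_alltoall, osu_allreduce, or the IMB executables and their respective output formats.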

Method

All performance tests run under the following base software:

  • Juju 3.6.3

  • Ubuntu 24.04

  • ReFrame 4.7.4

on a Charmed HPC cluster deployed on Microsoft Azure, with the following specifications:

application       VM size                  charm              channel     revision
----------------  -----------------------  -----------------  ----------  --------
login             1x Standard_D2as_v6      sackd              edge        13
slurmctld         1x Standard_D2as_v6      slurmctld          edge        95
slurmdbd          1x Standard_D2as_v6      slurmdbd           edge        87
mysql             1x Standard_D2as_v6      mysql              8.0/stable  313
nc4as-t4-v3       1x Standard_NC4as_T4_v3  slurmd             edge        116
hb120rs-v3        2x Standard_HB120rs_v3   slurmd             edge        116
nfs-share-client  N/A - subordinate charm  filesystem-client  edge        15
nfs-share-server  1x Standard_D2as_v6      nfs-server-proxy   edge        21

Where no specific VM features are required, VM instances are Standard_D2as_v6, the default size for Juju 3.6.3.
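
The deployment itself is normally driven with the Juju CLI. Purely as an illustration of how the application-to-charm mapping in the table above could be expressed programmatically, the following sketch uses the python-libjuju client (the juju Python package); the keyword arguments are assumptions about that client's deploy() call, and machine constraints selecting the VM size are left as a comment.

# Illustrative python-libjuju sketch (not part of the benchmark suite):
# deploy the slurmd charm from the edge channel as the "hb120rs-v3" application.
import asyncio

from juju.model import Model


async def main():
    model = Model()
    await model.connect()   # connect to the currently selected Juju model
    try:
        # In practice, machine constraints (e.g. an instance-type constraint
        # of Standard_HB120rs_v3) would also be set to pin the VM size.
        await model.deploy(
            'slurmd',
            application_name='hb120rs-v3',
            channel='edge',
            num_units=2,
        )
    finally:
        await model.disconnect()


asyncio.run(main())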

Tesla T4 GPU - Standard_NC4as_T4_v3 instances

GPU tests are built with the following Ubuntu software package:

  • nvidia-cuda-toolkit_12.0.140~12.0.1-4build4

Runs are performed on a single Standard_NC4as_T4_v3 instance, equipped with an NVIDIA Tesla T4 and deployed as Juju application nc4as-t4-v3 as described in the table above.

HDR InfiniBand - Standard_HB120rs_v3 instances

MPI tests are built and run with the following Ubuntu software packages:

  • openmpi-bin_4.1.6-7ubuntu2

  • libopenmpi-dev_4.1.6-7ubuntu2

  • gcc-13_13.3.0-6ubuntu2~24.04

Runs are performed on two Standard_HB120rs_v3 instances, equipped with 200 Gb/s HDR InfiniBand and deployed as Juju application hb120rs-v3 as described in the table above.

Results

The results presented are the best performance achieved for each metric across all test runs.

metric                     VM size                  result  unit     note
-------------------------  -----------------------  ------  -------  --------------------
Tesla T4 single precision  1x Standard_NC4as_T4_v3  4454    Gflop/s
Tesla T4 double precision  1x Standard_NC4as_T4_v3  252     Gflop/s
InfiniBand latency         2x Standard_HB120rs_v3   1.59    µs       1 byte transfer size
InfiniBand bandwidth       2x Standard_HB120rs_v3   196.06  Gb/s     4 MiB transfer size

For comparison with these InfiniBand results, Microsoft publishes performance data for the HBv3-series on the Azure website.
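
When comparing against those published figures, note that the OSU bandwidth benchmarks report MB/s (10^6 bytes per second), while the table above and the nominal 200 Gb/s HDR link speed use Gb/s. A small conversion helper (illustrative only) is shown below.

def osu_mbps_to_gbps(mb_per_s: float) -> float:
    """Convert an OSU-style MB/s figure (10**6 bytes/s) to Gb/s (10**9 bits/s)."""
    return mb_per_s * 8e6 / 1e9


# For example, an osu_bw figure of roughly 24507.5 MB/s corresponds to the
# 196.06 Gb/s reported above, close to the 200 Gb/s nominal HDR rate.
print(osu_mbps_to_gbps(24507.5))  # ~196.06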