# Performance
Performance values for the Charmed HPC Benchmarks running on Microsoft Azure are provided on this page. Steps for reproducing these results are available in the benchmarking suite documentation. These results are intended to provide best-case performance marks to help guide tuning and the identification of bottlenecks. Real-world performance may fluctuate depending on factors such as varying cluster workloads and resource contention.
## Metrics
GPU performance and InfiniBand interconnect performance are measured on suitably equipped VMs to illustrate the ability of Charmed HPC to leverage the HPC capabilities of the cloud.
GPU performance is measured for both single precision workloads, where speed and larger scale are preferred over numerical accuracy, and double precision workloads, where additional accuracy is required.
Latency and bandwidth are measured for InfiniBand to determine how quickly a message can traverse the network and what throughput the network sustains. These metrics govern the cluster's ability to run MPI and other distributed-memory applications across nodes.
### GPU flops
GPU single and double precision floating point operations per second (flops) are measured with the gpu_burn stress test, incorporating code modifications by the ReFrame developers and built using the CUDA toolkit.
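gpu_burn stresses the GPU with repeated large matrix multiplications and reports the sustained rate of floating point operations. The sketch below is not gpu_burn or any part of the benchmarking suite; it only illustrates the counting principle, assuming the CuPy library is available on a CUDA-capable host: time a known amount of floating point work and divide by the elapsed time.

```python
# Minimal sketch of the flops-counting principle (assumes CuPy is installed;
# not part of the benchmarking suite or of gpu_burn itself).
import time
import cupy as cp

n, iters = 8192, 20
a = cp.random.random((n, n)).astype(cp.float32)  # use cp.float64 for double precision
b = cp.random.random((n, n)).astype(cp.float32)

cp.matmul(a, b)                 # warm-up run: triggers allocation and kernel setup
cp.cuda.Device().synchronize()

start = time.perf_counter()
for _ in range(iters):
    c = cp.matmul(a, b)
cp.cuda.Device().synchronize()  # wait for all queued GPU work to complete
elapsed = time.perf_counter() - start

# An n x n matrix multiplication performs roughly 2 * n**3 floating point operations.
gflops = 2 * n**3 * iters / elapsed / 1e9
print(f"~{gflops:.0f} Gflops sustained")
```

The real stress test also checks the computed results for errors while the GPU is under load, a detail this sketch omits.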
### InfiniBand interconnect latency and bandwidth
InfiniBand RDMA latency and bandwidth are measured by the OSU Micro-Benchmarks for MPI and the Intel MPI Benchmarks (IMB), built with Open MPI and GCC. All runs are performed on two cluster compute nodes, with one MPI process per node.
Point-to-point performance is measured with the OSU osu_bw and osu_latency benchmarks, as well as the IMB MPI-1 PingPong benchmark.
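To illustrate what these point-to-point benchmarks measure, the following sketch times a ping-pong exchange between two ranks and derives one-way latency and bandwidth from it. It uses mpi4py and NumPy, which are not part of the suite, and Python call overhead means its absolute numbers will fall well short of the tuned C benchmarks; only the timing arithmetic is the point.

```python
# pingpong.py - minimal sketch of a ping-pong timing loop between two MPI ranks,
# in the spirit of osu_latency/osu_bw and IMB PingPong; the real benchmarks add
# warm-up iterations, windowed bandwidth runs and careful averaging.
# Run with, e.g.: mpirun -np 2 python3 pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "this sketch expects exactly two ranks"

def pingpong(nbytes, iters=1000):
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = MPI.Wtime() - start
    latency_us = elapsed / (2 * iters) * 1e6              # one-way time per message, in us
    bandwidth_gbps = (2 * iters * nbytes * 8) / elapsed / 1e9  # bits moved per second
    return latency_us, bandwidth_gbps

lat, _ = pingpong(1)              # 1 byte message, as in the latency result below
_, bw = pingpong(4 * 1024 ** 2)   # 4 MiB message, as in the bandwidth result below
if rank == 0:
    print(f"latency: {lat:.2f} us  bandwidth: {bw:.2f} Gb/s")
```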
Collective performance is measured with the OSU osu_alltoall and osu_allreduce benchmarks, as well as the IMB MPI-1 AllReduce benchmark.
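A collective measurement instead times the operation itself across all participating ranks. Below is a minimal mpi4py sketch of an Allreduce timing loop, again an illustration of the approach rather than the OSU or IMB implementation; the buffer size is an arbitrary choice for the example.

```python
# allreduce_sketch.py - minimal sketch of timing an MPI Allreduce across ranks;
# illustration only, not the osu_allreduce or IMB AllReduce code.
# Run with, e.g.: mpirun -np 2 python3 allreduce_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
count = 1024 * 1024            # number of float64 elements (8 MiB buffer; arbitrary)
sendbuf = np.ones(count)
recvbuf = np.empty(count)

iters = 100
comm.Barrier()
start = MPI.Wtime()
for _ in range(iters):
    comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
elapsed = MPI.Wtime() - start

if comm.Get_rank() == 0:
    print(f"mean Allreduce time: {elapsed / iters * 1e6:.2f} us")
```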
## Method
All performance tests run under the following base software:

- Juju 3.6.3
- Ubuntu 24.04
- ReFrame 4.7.4

on a Charmed HPC cluster deployed on Microsoft Azure, with the following specifications:
| application | VM size | charm | channel | revision |
|---|---|---|---|---|
|  | 1x Standard_D2as_v6 | sackd | edge | 13 |
|  | 1x Standard_D2as_v6 | slurmctld | edge | 95 |
|  | 1x Standard_D2as_v6 | slurmdbd | edge | 87 |
|  | 1x Standard_D2as_v6 | mysql | 8.0/stable | 313 |
| nc4as-t4-v3 | 1x Standard_NC4as_T4_v3 | slurmd | edge | 116 |
| hb120rs-v3 | 2x Standard_HB120rs_v3 | slurmd | edge | 116 |
|  | N/A - subordinate charm | filesystem-client | edge | 15 |
|  | 1x Standard_D2as_v6 | nfs-server-proxy | edge | 21 |
Where no specific VM features are required, VM instances are Standard_D2as_v6, the default size for Juju 3.6.3.
### Tesla T4 GPU - Standard_NC4as_T4_v3 instances
GPU tests are built with the following Ubuntu software package:

- nvidia-cuda-toolkit_12.0.140~12.0.1-4build4

Runs are performed on a single Standard_NC4as_T4_v3 instance, equipped with an NVIDIA Tesla T4 and deployed as Juju application nc4as-t4-v3 as described in the table above.
### HDR InfiniBand - Standard_HB120rs_v3 instances
MPI tests are built and run with the following Ubuntu software packages:

- openmpi-bin_4.1.6-7ubuntu2
- libopenmpi-dev_4.1.6-7ubuntu2
- gcc-13_13.3.0-6ubuntu2~24.04

Runs are performed on two Standard_HB120rs_v3 instances, equipped with 200 Gb/s HDR InfiniBand and deployed as Juju application hb120rs-v3 as described in the table above.
## Results
The results presented are the best performance achieved for each metric across all test runs.
| metric | VM size | result | unit | note |
|---|---|---|---|---|
| Tesla T4 single precision | 1x Standard_NC4as_T4_v3 | 4454 | Gflops | |
| Tesla T4 double precision | 1x Standard_NC4as_T4_v3 | 252 | Gflops | |
| InfiniBand latency | 2x Standard_HB120rs_v3 | 1.59 | us | 1 byte transfer size |
| InfiniBand bandwidth | 2x Standard_HB120rs_v3 | 196.06 | Gb/s | 4 MiB transfer size |
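The measured bandwidth of 196.06 Gb/s is roughly 98% of the instances' 200 Gb/s HDR link speed.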
For comparison of the InfiniBand results, Microsoft-published performance data for the HBv3-series is available on the Azure website.