Manage Slurm
Migrate a single slurmctld unit to high availability
To migrate a previously deployed single slurmctld unit to a high availability (HA) setup, a low-latency shared file system must be integrated to enable sharing of controller data across all slurmctld units. For guidance on choosing and deploying a shared file system, see the following sections:
Once a chosen shared file system has been deployed and made available via a proxy or other file system provider charm, run the following, substituting [filesystem-provider] with the name of the provider charm, to deploy a slurmctld HA setup with two units (a primary and a single backup):
Warning
This migration requires cluster downtime. The slurmctld service is stopped while Slurm data is copied from the unit's local /var/lib/slurm/checkpoint directory to the shared storage. The length of the downtime depends on the amount of data to be transferred and the transfer rate to the shared storage.
This is a one-time cluster downtime. Once the data migration is complete, no further downtime is necessary when adding or removing slurmctld units.
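To gauge how long this copy might take, one option is to check the size of the existing controller state before starting, for example (assuming the current controller unit is slurmctld/0, as in the status output later in this section):
# Approximate size of the controller state to be copied to shared storage
juju exec --unit slurmctld/0 -- du -sh /var/lib/slurm/checkpoint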
juju deploy filesystem-client --channel latest/edge
juju integrate filesystem-client:filesystem [filesystem-provider]:filesystem
juju integrate slurmctld:mount filesystem-client:mount
juju add-unit -n 1 slurmctld
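For example, if the shared file system is served through the cephfs-server-proxy charm (as in the status output later in this section), the [filesystem-provider] placeholder would be filled in as follows:
# Example only: assumes the provider application is named cephfs-server-proxy
juju integrate filesystem-client:filesystem cephfs-server-proxy:filesystem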
main.tf
module "filesystem-client" {
  source     = "git::https://github.com/charmed-hpc/filesystem-charms//charms/filesystem-client/terraform"
  model_name = juju_model.slurm.name
}

resource "juju_integration" "provider-to-filesystem" {
  model = juju_model.slurm.name

  application {
    name     = module.[filesystem-provider].app_name
    endpoint = module.[filesystem-provider].provides.filesystem
  }

  application {
    name     = module.filesystem-client.app_name
    endpoint = module.filesystem-client.requires.filesystem
  }
}

module "slurmctld" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
  model_name  = juju_model.slurm.name
  constraints = "arch=amd64 virt-type=virtual-machine"
  units       = 2
}

resource "juju_integration" "filesystem-to-slurmctld" {
  model = juju_model.slurm.name

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.provides.mount
  }

  application {
    name     = module.filesystem-client.app_name
    endpoint = module.filesystem-client.requires.mount
  }
}
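The plan above assumes that a juju_model resource named slurm and a module for the chosen file system provider are defined elsewhere in main.tf. With those in place, the deployment follows the usual Terraform workflow:
terraform init   # downloads the Juju provider and the referenced charm modules
terraform apply  # deploys and integrates the applications in the Juju model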
Once slurmctld is scaled up, the output of the juju status command should be similar to the following, varying with the choice of shared file system (CephFS in this example):
user@host:~$ juju status
Model  Controller   Cloud/Region         Version  SLA          Timestamp
slurm  charmed-hpc  localhost/localhost  3.6.0    unsupported  17:16:37Z

App                  Version          Status  Scale  Charm                Channel      Rev  Exposed  Message
cephfs-server-proxy                   active      1  cephfs-server-proxy  latest/edge   25  no
filesystem-client                     active      1  filesystem-client    latest/edge   20  no       Integrated with `cephfs` provider
mysql                8.0.39-0ubun...  active      1  mysql                8.0/stable   313  no
sackd                23.11.4-1.2u...  active      1  sackd                latest/edge    4  no
slurmctld            23.11.4-1.2u...  active      1  slurmctld            latest/edge   86  no       primary - UP
slurmd               23.11.4-1.2u...  active      1  slurmd               latest/edge  107  no
slurmdbd             23.11.4-1.2u...  active      1  slurmdbd             latest/edge   78  no
slurmrestd           23.11.4-1.2u...  active      1  slurmrestd           latest/edge   80  no

Unit                    Workload  Agent  Machine  Public address  Ports           Message
mysql/0*                active    idle   5        10.32.18.127    3306,33060/tcp  Primary
sackd/0*                active    idle   4        10.32.18.203
slurmctld/0*            active    idle   0        10.32.18.15                     primary - UP
  filesystem-client/0*  active    idle            10.32.18.15                     Mounted filesystem at `/srv/slurmctld-statefs`
slurmctld/1             active    idle   6        10.32.18.204                    backup - UP
  filesystem-client/1   active    idle            10.32.18.204                    Mounted filesystem at `/srv/slurmctld-statefs`
slurmd/0*               active    idle   1        10.32.18.207
slurmdbd/0*             active    idle   2        10.32.18.102
slurmrestd/0*           active    idle   3        10.32.18.9
Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.32.18.15   juju-d566c2-0  ubuntu@24.04      Running
1        started  10.32.18.207  juju-d566c2-1  ubuntu@24.04      Running
2        started  10.32.18.102  juju-d566c2-2  ubuntu@24.04      Running
3        started  10.32.18.9    juju-d566c2-3  ubuntu@24.04      Running
4        started  10.32.18.203  juju-d566c2-4  ubuntu@24.04      Running
5        started  10.32.18.127  juju-d566c2-5  ubuntu@24.04      Running
6        started  10.32.18.204  juju-d566c2-6  ubuntu@24.04      Running
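To verify that both controllers are reachable from the cluster, one option is to run scontrol ping on a compute node; it reports whether the primary and backup slurmctld hosts are up. For example (assuming the slurmd/0 unit shown above):
# Reports the UP/DOWN status of the primary and backup controllers
juju exec --unit slurmd/0 -- scontrol ping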