How to deploy Slurm

This how-to guide shows you how to deploy the Slurm workload manager as the resource management and job scheduling service of your Charmed HPC cluster. The deployment, management, and operations of Slurm are controlled by the Slurm charms.

Prerequisites

To successfully deploy Slurm in your Charmed HPC cluster, you will need, at a minimum:

Once you have verified that you have met the prerequisites above, proceed to the instructions below.

Deploy Slurm

You have two options for deploying Slurm:

  1. Using the Juju CLI client.

  2. Using the Juju Terraform client.

If you want to use Terraform to deploy Slurm, see the Install and manage the client (terraform juju) how-to in the Juju documentation for additional requirements.

If you are deploying Slurm on LXD, see Deploying Slurm on LXD for more information on additional constraints that must be passed to Juju.

To deploy Slurm using the Juju CLI client, first create the slurm model that will hold the deployment. The slurm model is the abstraction that will hold the resources — machines, integrations, network spaces, storage, etc. — that are provisioned as part of your Slurm deployment.

Run the following command to create the slurm model in your charmed-hpc machine cloud:

juju add-model slurm charmed-hpc
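
Optionally, you can confirm that the model was created and is now the active model before continuing. This quick check is not a required step of this guide:

# Optional: list your models and make slurm the active model.
juju models
juju switch slurm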

Now, with the slurm model created, run the following set of commands to deploy the Slurm daemons with MySQL as the storage back-end for slurmdbd:

juju deploy sackd --base "ubuntu@24.04" --channel "edge"
juju deploy slurmctld --base "ubuntu@24.04" --channel "edge"
juju deploy slurmd --base "ubuntu@24.04" --channel "edge"
juju deploy slurmdbd --base "ubuntu@24.04" --channel "edge"
juju deploy slurmrestd --base "ubuntu@24.04" --channel "edge"
juju deploy mysql --channel "8.0/stable"
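
This guide deploys a single unit of each charm. If you would like more than one compute node, you can scale out slurmd with additional units; the unit count below is only illustrative:

# Optional: add two more slurmd units for a total of three compute nodes.
juju add-unit slurmd --num-units 2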

juju deploy only deploys the Slurm charms. juju integrate connects the charms together, which triggers the necessary events for the Slurm daemons to reach active status. Run the following set of commands to integrate the Slurm daemons together:

juju integrate slurmctld sackd
juju integrate slurmctld slurmd
juju integrate slurmctld slurmdbd
juju integrate slurmctld slurmrestd
juju integrate slurmdbd mysql:database
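
While the integrations are being established, you can watch the deployment converge instead of polling manually; the refresh interval below is arbitrary:

# Optional: refresh the status output every five seconds until all applications report active.
juju status --watch 5s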

After a few minutes, your Slurm deployment will become active. The output of the juju status command should be similar to the following:

user@host:~$ juju status
Model  Controller   Cloud/Region         Version  SLA          Timestamp
slurm  charmed-hpc  localhost/localhost  3.6.0    unsupported  17:16:37Z

App         Version          Status  Scale  Charm       Channel      Rev  Exposed  Message
mysql       8.0.39-0ubun...  active      1  mysql       8.0/stable   313  no
sackd       23.11.4-1.2u...  active      1  sackd       latest/edge    4  no
slurmctld   23.11.4-1.2u...  active      1  slurmctld   latest/edge   86  no
slurmd      23.11.4-1.2u...  active      1  slurmd      latest/edge  107  no
slurmdbd    23.11.4-1.2u...  active      1  slurmdbd    latest/edge   78  no
slurmrestd  23.11.4-1.2u...  active      1  slurmrestd  latest/edge   80  no

Unit           Workload  Agent  Machine  Public address  Ports           Message
mysql/0*       active    idle   5        10.32.18.127    3306,33060/tcp  Primary
sackd/0*       active    idle   4        10.32.18.203
slurmctld/0*   active    idle   0        10.32.18.15
slurmd/0*      active    idle   1        10.32.18.207
slurmdbd/0*    active    idle   2        10.32.18.102
slurmrestd/0*  active    idle   3        10.32.18.9

Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.32.18.15   juju-d566c2-0  ubuntu@24.04      Running
1        started  10.32.18.207  juju-d566c2-1  ubuntu@24.04      Running
2        started  10.32.18.102  juju-d566c2-2  ubuntu@24.04      Running
3        started  10.32.18.9    juju-d566c2-3  ubuntu@24.04      Running
4        started  10.32.18.203  juju-d566c2-4  ubuntu@24.04      Running
5        started  10.32.18.127  juju-d566c2-5  ubuntu@22.04      Running

To deploy Slurm using the Juju Terraform client, first configure Terraform to use the Juju provider in your deployment plan.

main.tf
terraform {
  required_providers {
    juju = {
      source  = "juju/juju"
      version = ">= 0.16.0"
    }
  }
}
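
The juju provider also needs to reach a Juju controller. It can typically reuse the credentials of your locally bootstrapped controller; if you would rather pass the connection details explicitly, the provider reads them from environment variables. The variable names below follow the provider's documentation, and the values are placeholders only:

# Optional: point the juju Terraform provider at a controller explicitly.
# All values are placeholders; by default the provider can reuse your local Juju CLI credentials.
export JUJU_CONTROLLER_ADDRESSES="10.32.18.1:17070"
export JUJU_USERNAME="admin"
export JUJU_PASSWORD="<controller-password>"
export JUJU_CA_CERT="$(cat ~/juju-ca-cert.pem)"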

Now create the slurm model that will hold the deployment. The slurm model is the abstraction that will hold the resources — machines, integrations, network spaces, storage, etc. — that are provisioned as part of your Slurm deployment. This resource will direct Juju to create the model slurm:

main.tf
resource "juju_model" "slurm" {
  name = "slurm"

  cloud {
    name = "charmed-hpc"
  }
}

With the slurm juju_model resource defined, declare the following set of modules in your Terraform plan. These modules will direct Juju to deploy the Slurm daemons with MySQL as the storage back-end for slurmdbd:

main.tf
module "sackd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/sackd/terraform"
  model_name  = juju_model.slurm.name
}

module "slurmctld" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
  model_name  = juju_model.slurm.name
}

module "slurmd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmd/terraform"
  model_name  = juju_model.slurm.name
}

module "slurmdbd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmdbd/terraform"
  model_name  = juju_model.slurm.name
}

module "slurmrestd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmrestd/terraform"
  model_name  = juju_model.slurm.name
}

module "mysql" {
  source          = "git::https://github.com/canonical/mysql-operator//terraform"
  juju_model_name = juju_model.slurm.name
}

Declaring the modules only deploys the Slurm charms. Integrations are still required to trigger the necessary events for the Slurm daemons to reach active status. Declare the following set of resources in your deployment plan. These resources will direct Juju to integrate the Slurm daemons together:

main.tf
resource "juju_integration" "sackd-to-slurmctld" {
  model = juju_model.slurm.name

  application {
    name     = module.sackd.app_name
    endpoint = module.sackd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.login-node
  }
}

resource "juju_integration" "slurmd-to-slurmctld" {
  model = juju_model.slurm.name

  application {
    name     = module.slurmd.app_name
    endpoint = module.slurmd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmd
  }
}

resource "juju_integration" "slurmdbd-to-slurmctld" {
  model = juju_model.slurm.name

  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmdbd
  }
}

resource "juju_integration" "slurmrestd-to-slurmctld" {
  model = juju_model.slurm.name

  application {
    name     = module.slurmrestd.app_name
    endpoint = module.slurmrestd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmrestd
  }
}

resource "juju_integration" "slurmdbd-to-mysql" {
  model = juju_model.slurm.name

  application {
    name     = module.mysql.application_name
    endpoint = module.mysql.provides.database
  }

  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.requires.database
  }
}

With all the charm modules, juju_model, and juju_integration resources declared in your deployment plan, you are now ready to deploy Slurm. Expand the dropdown below to see the full deployment plan:

Full Slurm deployment plan
main.tf
terraform {
  required_providers {
    juju = {
      source  = "juju/juju"
      version = ">= 0.16.0"
    }
  }
}

resource "juju_model" "slurm" {
  name = "slurm"

  cloud {
    name = "charmed-hpc"
  }
}

module "sackd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/sackd/terraform"
  model_name  = juju_model.slurm.name
}

module "slurmctld" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
  model_name  = juju_model.slurm.name
}

module "slurmd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmd/terraform"
  model_name  = juju_model.slurm.name
}

module "slurmdbd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmdbd/terraform"
  model_name  = juju_model.slurm.name
}

module "slurmrestd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmrestd/terraform"
  model_name  = juju_model.slurm.name
}

module "mysql" {
  source          = "git::https://github.com/canonical/mysql-operator//terraform"
  juju_model_name = juju_model.slurm.name
}

resource "juju_integration" "sackd-to-slurmctld" {
  model = juju_model.slurm.name

  application {
    name     = module.sackd.app_name
    endpoint = module.sackd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.login-node
  }
}

resource "juju_integration" "slurmd-to-slurmctld" {
  model = juju_model.slurm.name

  application {
    name     = module.slurmd.app_name
    endpoint = module.slurmd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmd
  }
}

resource "juju_integration" "slurmdbd-to-slurmctld" {
  model = juju_model.slurm.name

  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmdbd
  }
}

resource "juju_integration" "slurmrestd-to-slurmctld" {
  model = juju_model.slurm.name

  application {
    name     = module.slurmrestd.app_name
    endpoint = module.slurmrestd.provides.slurmctld
  }

  application {
    name     = module.slurmctld.app_name
    endpoint = module.slurmctld.requires.slurmrestd
  }
}

resource "juju_integration" "slurmdbd-to-mysql" {
  model = juju_model.slurm.name

  application {
    name     = module.mysql.application_name
    endpoint = module.mysql.provides.database
  }

  application {
    name     = module.slurmdbd.app_name
    endpoint = module.slurmdbd.requires.database
  }
}

After verifying that your plan is correct, run the following set of commands to deploy Slurm using Terraform and the Juju provider:

terraform init
terraform apply -auto-approve

Tip

You can run terraform validate to validate your Slurm deployment plan before applying it. You can also run terraform plan to see the speculative execution plan that Terraform will follow to deploy the Slurm charms; note, however, that terraform plan does not actually execute the plan.
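
For example, a typical pre-flight check before applying the plan could look like the following; the plan file name is arbitrary:

# Optional: validate the configuration, then write the speculative plan to a file for review.
terraform validate
terraform plan -out=slurm.tfplan

You can then apply the reviewed plan with terraform apply slurm.tfplan.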

After a few minutes, your Slurm deployment will become active. The output of the juju status command should be similar to the following:

user@host:~$ juju status
Model  Controller   Cloud/Region         Version  SLA          Timestamp
slurm  charmed-hpc  localhost/localhost  3.6.0    unsupported  17:16:37Z

App         Version          Status  Scale  Charm       Channel      Rev  Exposed  Message
mysql       8.0.39-0ubun...  active      1  mysql       8.0/stable   313  no
sackd       23.11.4-1.2u...  active      1  sackd       latest/edge    4  no
slurmctld   23.11.4-1.2u...  active      1  slurmctld   latest/edge   86  no
slurmd      23.11.4-1.2u...  active      1  slurmd      latest/edge  107  no
slurmdbd    23.11.4-1.2u...  active      1  slurmdbd    latest/edge   78  no
slurmrestd  23.11.4-1.2u...  active      1  slurmrestd  latest/edge   80  no

Unit           Workload  Agent  Machine  Public address  Ports           Message
mysql/0*       active    idle   5        10.32.18.127    3306,33060/tcp  Primary
sackd/0*       active    idle   4        10.32.18.203
slurmctld/0*   active    idle   0        10.32.18.15
slurmd/0*      active    idle   1        10.32.18.207
slurmdbd/0*    active    idle   2        10.32.18.102
slurmrestd/0*  active    idle   3        10.32.18.9

Machine  State    Address       Inst id        Base          AZ  Message
0        started  10.32.18.15   juju-d566c2-0  ubuntu@24.04      Running
1        started  10.32.18.207  juju-d566c2-1  ubuntu@24.04      Running
2        started  10.32.18.102  juju-d566c2-2  ubuntu@24.04      Running
3        started  10.32.18.9    juju-d566c2-3  ubuntu@24.04      Running
4        started  10.32.18.203  juju-d566c2-4  ubuntu@24.04      Running
5        started  10.32.18.127  juju-d566c2-5  ubuntu@22.04      Running

Deploying Slurm on LXD

The Slurm charms can deploy, manage, and operate Slurm on any supported machine cloud; however, each cloud has its own particularities. On LXD, if you deploy the charms to system containers rather than virtual machines, Slurm cannot use the recommended process tracking plugin proctrack/cgroup, and additional modifications must be made to the default LXD profile.

To deploy the Slurm charms to virtual machines rather than system containers, pass the constraint "virt-type=virtual-machine" to Juju when deploying the charms:

juju deploy sackd --base "ubuntu@24.04" --channel "edge" --constraints="virt-type=virtual-machine"
juju deploy slurmctld --base "ubuntu@24.04" --channel "edge" --constraints="virt-type=virtual-machine"
juju deploy slurmd --base "ubuntu@24.04" --channel "edge" --constraints="virt-type=virtual-machine"
juju deploy slurmdbd --base "ubuntu@24.04" --channel "edge" --constraints="virt-type=virtual-machine"
juju deploy slurmrestd --base "ubuntu@24.04" --channel "edge" --constraints="virt-type=virtual-machine"
juju deploy mysql --channel "8.0/stable" --constraints="virt-type=virtual-machine"
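
As an alternative that is not part of the original instructions, you may be able to set the constraint once for the whole model instead of repeating it on every deploy command; check that this behaves as expected on your LXD cloud before relying on it:

# Optional alternative: make virtual machines the default for applications deployed in this model.
juju set-model-constraints virt-type=virtual-machine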
If you are deploying with the Juju Terraform client, set the constraints input on each module in your deployment plan instead:

main.tf
module "sackd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/sackd/terraform"
  model_name  = juju_model.slurm.name
  constraints = "arch=amd64 virt-type=virtual-machine"
}

module "slurmctld" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"
  model_name  = juju_model.slurm.name
  constraints = "arch=amd64 virt-type=virtual-machine"
}

module "slurmd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmd/terraform"
  model_name  = juju_model.slurm.name
  constraints = "arch=amd64 virt-type=virtual-machine"
}

module "slurmdbd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmdbd/terraform"
  model_name  = juju_model.slurm.name
  constraints = "arch=amd64 virt-type=virtual-machine"
}

module "slurmrestd" {
  source      = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmrestd/terraform"
  model_name  = juju_model.slurm.name
  constraints = "arch=amd64 virt-type=virtual-machine"
}

module "mysql" {
  source          = "git::https://github.com/canonical/mysql-operator//terraform"
  juju_model_name = juju_model.slurm.name
  constraints     = "arch=amd64 virt-type=virtual-machine"
}

Set compute nodes to IDLE

Compute nodes are initially enlisted with their state set to DOWN after your Slurm deployment becomes active. To set the compute nodes’ state to IDLE so that they can start having jobs scheduled on them, use juju run to run the resume action on the slurmctld leader unit:

juju run slurmctld/leader resume nodename="<machine-instance-id/hostname>"
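
For example, using the slurmd unit from the sample juju status output above, which runs on machine 1 with instance id juju-d566c2-1 (on LXD the hostname typically matches the instance id), the command would look something like:

# Example only: the node name is taken from the sample status output and will differ in your cluster.
juju run slurmctld/leader resume nodename="juju-d566c2-1"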

Tips

  1. You can get the hostnames of all your compute nodes with juju exec:

juju exec --application slurmd -- hostname -s

  2. The nodename parameter of the resume action also accepts node ranges for setting the state of compute nodes to IDLE in bulk (see the example after this list):

juju run slurmctld/leader resume nodename="<machine-instance-id/hostname>[range]"
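
As an illustration of that range syntax, and assuming compute node hostnames following the pattern from the sample status output above (juju-d566c2-1 through juju-d566c2-3 are hypothetical here), a bulk resume could look like:

# Example only: resumes three hypothetical compute nodes in a single action call.
juju run slurmctld/leader resume nodename="juju-d566c2-[1-3]"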

Verify compute nodes are IDLE

The sackd charm installs the Slurm client commands. To use sinfo to verify that a compute node’s state is IDLE, run the following command with juju exec on your sackd unit:

juju exec -u sackd/0 -- sinfo --nodes $(juju exec -u slurmd/0 -- hostname)

To verify that the entire partition is IDLE, run sinfo without the --nodes flag:

juju exec -u sackd/0 -- sinfo