Getting started with Charmed HPC¶
This tutorial takes you through multiple aspects of Charmed HPC, such as:
Building a small Charmed HPC cluster with a shared filesystem
Preparing and submitting a multi-node batch job to your Charmed HPC cluster’s workload scheduler
Creating and using a container image to provide the runtime environment for a submitted batch job
By the end of this tutorial, you will have worked with a variety of open source projects, such as:
Multipass
Juju
Charms
Apptainer
Ceph
Slurm
This tutorial assumes that you have had some exposure to high-performance computing concepts such as batch scheduling, but does not assume prior experience building HPC clusters. This tutorial also does not expect you to have any prior experience with Multipass, Juju, Apptainer, Ceph, or Slurm.
Using Charmed HPC in production
The Charmed HPC cluster built in this tutorial is for learning purposes and should not be used as the basis for a production HPC cluster. For more in-depth steps on how to deploy a fully operational Charmed HPC cluster, see Charmed HPC’s How-to guides.
Prerequisites¶
To successfully complete this tutorial, you will need:
At least 8 CPU cores, 16GB RAM, and 40GB storage available
An active internet connection
Create a virtual machine with Multipass¶
First, download a copy of the cloud initialization (cloud-init) file, charmed-hpc-tutorial-cloud-init.yml, that defines the underlying cloud infrastructure for the virtual machine. For this tutorial, the file includes instructions for creating and configuring your LXD machine cloud localhost with the charmed-hpc-controller Juju controller, and for creating workload and submit scripts for the example jobs. The cloud-init step is completed as part of the virtual machine launch and is not something you need to set up manually. You can expand the dropdown below to view the full cloud-init file before downloading it onto your local system:
charmed-hpc-tutorial-cloud-init.yml
#cloud-config

# Ensure VM is fully up-to-date; multipass does not support reboots.
# See: https://github.com/canonical/multipass/issues/4199
# Package management
package_reboot_if_required: false
package_update: true
package_upgrade: true

# Install prerequisites
snap:
  commands:
    00: snap install juju --channel=3/stable
    01: snap install lxd --channel=6/stable

# Configure and initialize prerequisites
lxd:
  init:
    storage_backend: dir

# Commands to run at the end of the cloud-init process
runcmd:
  - lxc network set lxdbr0 ipv6.address none
  - su ubuntu -c 'juju bootstrap localhost charmed-hpc-controller'

# Write files to the Multipass instance
write_files:
  # MPI workload dependencies
  - path: /home/ubuntu/mpi_hello_world.c
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    content: |
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char** argv) {
          // Initialize the MPI environment
          MPI_Init(NULL, NULL);

          // Get the number of nodes
          int size;
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          // Get the rank of the process
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          // Get the name of the node
          char node_name[MPI_MAX_PROCESSOR_NAME];
          int name_len;
          MPI_Get_processor_name(node_name, &name_len);

          // Print hello world message
          printf("Hello world from node %s, rank %d out of %d nodes\n",
                 node_name, rank, size);

          // Finalize the MPI environment.
          MPI_Finalize();
      }
  - path: /home/ubuntu/submit_hello.sh
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    content: |
      #!/usr/bin/env bash
      #SBATCH --job-name=hello_world
      #SBATCH --partition=tutorial-partition
      #SBATCH --nodes=2
      #SBATCH --error=error.txt
      #SBATCH --output=output.txt

      mpirun ./mpi_hello_world
  # Container workload dependencies
  - path: /home/ubuntu/generate.py
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    content: |
      #!/usr/bin/env python3

      """Generate example dataset for workload."""

      import argparse

      from faker import Faker
      from faker.providers import DynamicProvider
      from pandas import DataFrame


      faker = Faker()
      favorite_lts_mascot = DynamicProvider(
          provider_name="favorite_lts_mascot",
          elements=[
              "Dapper Drake",
              "Hardy Heron",
              "Lucid Lynx",
              "Precise Pangolin",
              "Trusty Tahr",
              "Xenial Xerus",
              "Bionic Beaver",
              "Focal Fossa",
              "Jammy Jellyfish",
              "Noble Numbat",
          ],
      )
      faker.add_provider(favorite_lts_mascot)


      def main(rows: int) -> None:
          df = DataFrame(
              [
                  [faker.email(), faker.country(), faker.favorite_lts_mascot()]
                  for _ in range(rows)
              ],
              columns=["email", "country", "favorite_lts_mascot"],
          )
          df.to_csv("favorite_lts_mascot.csv")


      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument(
              "--rows", type=int, default=1, help="Rows of fake data to generate"
          )
          args = parser.parse_args()

          main(rows=args.rows)
  - path: /home/ubuntu/workload.py
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    content: |
      #!/usr/bin/env python3

      """Plot the most popular Ubuntu LTS mascot."""

      import argparse
      import os

      import pandas as pd
      import plotext as plt

      def main(dataset: str | os.PathLike, file: str | os.PathLike) -> None:
          df = pd.read_csv(dataset)
          mascots = df["favorite_lts_mascot"].value_counts().sort_index()

          plt.simple_bar(
              mascots.index,
              mascots.values,
              title="Favorite LTS mascot",
              color="orange",
              width=150,
          )

          if file:
              plt.save_fig(
                  file if os.path.isabs(file) else f"{os.getcwd()}/{file}",
                  keep_colors=True
              )
          else:
              plt.show()

      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("dataset", type=str, help="Path to CSV dataset to plot")
          parser.add_argument(
              "-o",
              "--output",
              type=str,
              default="",
              help="Output file to save plotted graph",
          )
          args = parser.parse_args()

          main(args.dataset, args.output)
  - path: /home/ubuntu/workload.def
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    content: |
      bootstrap: docker
      from: ubuntu:24.04

      %files
          generate.py /usr/bin/generate
          workload.py /usr/bin/workload

      %environment
          export PATH=/usr/bin/venv/bin:${PATH}
          export PYTHONPATH=/usr/bin/venv:${PYTHONPATH}

      %post
          export DEBIAN_FRONTEND=noninteractive
          apt-get update -y
          apt-get install -y python3-dev python3-venv

          python3 -m venv /usr/bin/venv
          alias python3=/usr/bin/venv/bin/python3
          alias pip=/usr/bin/venv/bin/pip

          pip install -U faker
          pip install -U pandas
          pip install -U plotext

          chmod 755 /usr/bin/generate
          chmod 755 /usr/bin/workload

      %runscript
          exec workload "$@"
  - path: /home/ubuntu/submit_apptainer_mascot.sh
    owner: ubuntu:ubuntu
    permissions: !!str "0664"
    defer: true
    content: |
      #!/usr/bin/env bash
      #SBATCH --job-name=favorite-lts-mascot
      #SBATCH --partition=tutorial-partition
      #SBATCH --nodes=2
      #SBATCH --error=mascot_error.txt
      #SBATCH --output=mascot_output.txt

      apptainer exec workload.sif generate --rows 1000000
      apptainer run workload.sif favorite_lts_mascot.csv --output graph.out
From the local directory holding the cloud-init file, launch a virtual machine using Multipass:
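A launch command along the following lines works; the instance name charmed-hpc-tutorial and the resource flags (sized to match the prerequisites above) are assumptions for this sketch:

```shell
# Launch an Ubuntu 24.04 VM and apply the downloaded cloud-init file.
multipass launch 24.04 --name charmed-hpc-tutorial \
    --cpus 8 --memory 16G --disk 40G \
    --cloud-init charmed-hpc-tutorial-cloud-init.yml
```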
The virtual machine launch process should take five minutes or less to complete, but may take longer depending on network speed. Upon completion of the launch process, check the status of cloud-init to confirm that all processes completed successfully.
Enter the virtual machine:
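Assuming the virtual machine was launched with the name charmed-hpc-tutorial:

```shell
# Open an interactive shell inside the Multipass instance.
multipass shell charmed-hpc-tutorial
```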
Then check the cloud-init status:
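Inside the virtual machine, the cloud-init command-line tool reports the overall status:

```shell
# Block until cloud-init finishes, then print the final status.
cloud-init status --wait
```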
If the status shows done and there are no errors, then you are ready to move on to deploying the cluster charms.
Get compute nodes ready for jobs¶
Now that Slurm and the filesystem have been successfully deployed, the next step is to set up the compute nodes themselves. The compute nodes must be moved from the down state to the idle state so that jobs can run on them. First, check that the compute nodes are still down, which will show something similar to:
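One way to check the node state is to run sinfo on the login node (the sackd/0 unit seen later in this tutorial) through Juju:

```shell
# List partitions and node states; newly enrolled nodes report "down".
juju ssh sackd/0 -- sinfo
```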
Then, bring up the compute nodes:
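A sketch of one way to do this with scontrol; here <nodelist> is a placeholder for the node names reported by sinfo, and the command must run with Slurm administrator privileges:

```shell
# Return the listed down nodes to service so they can accept jobs.
scontrol update nodename=<nodelist> state=resume
```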
And verify that the STATE is now set to idle, which should now show:
Copy files onto cluster¶
The workload files that were created during the cloud initialization step now need to be copied onto the cluster filesystem from the virtual machine filesystem. First you will make the new example directories, then set appropriate permissions, and finally copy the files over:
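A sketch of these steps, run from inside the tutorial virtual machine; the subdirectory names under /scratch and the wide-open permissions are assumptions for this learning environment:

```shell
# Create the example directories on the shared filesystem via the login node.
juju ssh sackd/0 -- sudo mkdir -p /scratch/mpi_example /scratch/apptainer_example
juju ssh sackd/0 -- sudo chmod -R 777 /scratch

# Copy the workload files from the VM's home directory onto the cluster.
juju scp ~/mpi_hello_world.c sackd/0:/scratch/mpi_example/
juju scp ~/submit_hello.sh sackd/0:/scratch/mpi_example/
juju scp ~/generate.py sackd/0:/scratch/apptainer_example/
juju scp ~/workload.py sackd/0:/scratch/apptainer_example/
juju scp ~/workload.def sackd/0:/scratch/apptainer_example/
juju scp ~/submit_apptainer_mascot.sh sackd/0:/scratch/apptainer_example/
```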
The /scratch directory is mounted on the compute nodes and will be read from and written to during the batch jobs.
Run a batch job¶
In the following steps, you will compile a small Hello World MPI script and run it by submitting a batch job to Slurm.
Compile¶
First, SSH into the login node, sackd/0:
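Juju provides SSH access to the unit directly:

```shell
juju ssh sackd/0
```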
This will place you in your home directory, /home/ubuntu. Next, move to the /scratch/mpi_example directory, install the Open MPI libraries needed for compiling, and then compile the mpi_hello_world.c file by running the mpicc command:
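A sketch of the compile steps, assuming the Ubuntu archive's Open MPI packages:

```shell
cd /scratch/mpi_example

# Install the Open MPI compiler wrapper and runtime.
sudo apt-get update && sudo apt-get install -y openmpi-bin libopenmpi-dev

# Compile the example into an mpi_hello_world binary.
mpicc mpi_hello_world.c -o mpi_hello_world
```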
For quick referencing, the two files for the MPI Hello World example are provided in dropdowns here:
mpi_hello_world.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of nodes
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Get the rank of the process
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get the name of the node
    char node_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(node_name, &name_len);

    // Print hello world message
    printf("Hello world from node %s, rank %d out of %d nodes\n",
           node_name, rank, size);

    // Finalize the MPI environment.
    MPI_Finalize();
}
submit_hello.sh
#!/usr/bin/env bash
#SBATCH --job-name=hello_world
#SBATCH --partition=tutorial-partition
#SBATCH --nodes=2
#SBATCH --error=error.txt
#SBATCH --output=output.txt

mpirun ./mpi_hello_world
Submit batch job¶
Now, submit your batch job to the queue using sbatch:
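From the /scratch/mpi_example directory on the login node:

```shell
sbatch submit_hello.sh
```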
Your job will complete after a few seconds. The generated output.txt file will look similar to the following:
The batch job successfully spread the MPI job across two nodes that were able to report back their MPI rank to a shared output file.
Run a container job¶
Next, you will go through the steps to generate a random sample of Ubuntu mascot votes and plot the results. The process requires Python and a few specific libraries, so you will use Apptainer to build a container image and run the job on the cluster.
Set up Apptainer¶
Apptainer must be deployed and integrated with the existing Slurm deployment using Juju. These steps need to be completed from the charmed-hpc-tutorial environment; to return to that environment from within sackd/0, use the exit command.
Deploy and integrate Apptainer:
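A sketch of the Juju commands involved; the apptainer charm name and the integration target are assumptions here, so check the charm's documentation for the exact channel and relations:

```shell
# Deploy the apptainer charm and relate it to the compute nodes.
juju deploy apptainer
juju integrate apptainer slurmd
```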
After a few minutes, juju status should look similar to the following:
Build the container image using apptainer¶
Before you can submit your container workload to your Charmed HPC cluster, you must build the container image from the build recipe. The build recipe file workload.def defines the environment and libraries that will be in the container image.
To build the image, return to the cluster login node, move to the example directory, and call apptainer build:
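Assuming the workload files were copied to /scratch/apptainer_example earlier in the tutorial:

```shell
cd /scratch/apptainer_example

# Build the workload.sif image from the workload.def build recipe.
apptainer build workload.sif workload.def
```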
The files for the Apptainer Mascot Vote example are provided here for reference.
generate.py
#!/usr/bin/env python3

"""Generate example dataset for workload."""

import argparse

from faker import Faker
from faker.providers import DynamicProvider
from pandas import DataFrame


faker = Faker()
favorite_lts_mascot = DynamicProvider(
    provider_name="favorite_lts_mascot",
    elements=[
        "Dapper Drake",
        "Hardy Heron",
        "Lucid Lynx",
        "Precise Pangolin",
        "Trusty Tahr",
        "Xenial Xerus",
        "Bionic Beaver",
        "Focal Fossa",
        "Jammy Jellyfish",
        "Noble Numbat",
    ],
)
faker.add_provider(favorite_lts_mascot)


def main(rows: int) -> None:
    df = DataFrame(
        [
            [faker.email(), faker.country(), faker.favorite_lts_mascot()]
            for _ in range(rows)
        ],
        columns=["email", "country", "favorite_lts_mascot"],
    )
    df.to_csv("favorite_lts_mascot.csv")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--rows", type=int, default=1, help="Rows of fake data to generate"
    )
    args = parser.parse_args()

    main(rows=args.rows)
workload.py
#!/usr/bin/env python3

"""Plot the most popular Ubuntu LTS mascot."""

import argparse
import os

import pandas as pd
import plotext as plt

def main(dataset: str | os.PathLike, file: str | os.PathLike) -> None:
    df = pd.read_csv(dataset)
    mascots = df["favorite_lts_mascot"].value_counts().sort_index()

    plt.simple_bar(
        mascots.index,
        mascots.values,
        title="Favorite LTS mascot",
        color="orange",
        width=150,
    )

    if file:
        plt.save_fig(
            file if os.path.isabs(file) else f"{os.getcwd()}/{file}",
            keep_colors=True
        )
    else:
        plt.show()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("dataset", type=str, help="Path to CSV dataset to plot")
    parser.add_argument(
        "-o",
        "--output",
        type=str,
        default="",
        help="Output file to save plotted graph",
    )
    args = parser.parse_args()

    main(args.dataset, args.output)
workload.def
bootstrap: docker
from: ubuntu:24.04

%files
    generate.py /usr/bin/generate
    workload.py /usr/bin/workload

%environment
    export PATH=/usr/bin/venv/bin:${PATH}
    export PYTHONPATH=/usr/bin/venv:${PYTHONPATH}

%post
    export DEBIAN_FRONTEND=noninteractive
    apt-get update -y
    apt-get install -y python3-dev python3-venv

    python3 -m venv /usr/bin/venv
    alias python3=/usr/bin/venv/bin/python3
    alias pip=/usr/bin/venv/bin/pip

    pip install -U faker
    pip install -U pandas
    pip install -U plotext

    chmod 755 /usr/bin/generate
    chmod 755 /usr/bin/workload

%runscript
    exec workload "$@"
submit_apptainer_mascot.sh
#!/usr/bin/env bash
#SBATCH --job-name=favorite-lts-mascot
#SBATCH --partition=tutorial-partition
#SBATCH --nodes=2
#SBATCH --error=mascot_error.txt
#SBATCH --output=mascot_output.txt

apptainer exec workload.sif generate --rows 1000000
apptainer run workload.sif favorite_lts_mascot.csv --output graph.out
Use the image to run jobs¶
Now that you have built the container image, you can submit a job to the cluster that uses the new workload.sif image to generate one million lines in a table and then uses the resulting favorite_lts_mascot.csv to build the bar plot:
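From the /scratch/apptainer_example directory on the login node:

```shell
sbatch submit_apptainer_mascot.sh
```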
To view the status of the job while it is running, run squeue.
Once the job has completed, view the generated bar plot, which will look similar to the following:
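The plot can be printed in the terminal; assuming the job ran from /scratch/apptainer_example, graph.out is saved next to the job outputs:

```shell
cat /scratch/apptainer_example/graph.out
```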
Summary and clean up¶
In this tutorial, you:
Deployed and integrated Slurm and a shared filesystem
Launched an MPI batch job and saw results communicated across nodes
Built a container image with Apptainer and used it to run a batch job and generate a bar plot
Now that you have completed the tutorial, if you would like to completely remove the virtual machine, return to your local terminal and multipass delete the virtual machine as follows:
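Assuming the instance name charmed-hpc-tutorial used when launching the virtual machine:

```shell
# Delete the instance, then permanently remove all deleted instances.
multipass delete charmed-hpc-tutorial
multipass purge
```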
Next steps¶
Now that you have gotten started with Charmed HPC, check out the Explanation section for details on important concepts and the How-to guides for how to use more of Charmed HPC’s features.