Examples of Slurm scripts
Interactive mode
Warning
These sample scripts and examples are provided as a reference and SHOULD NOT be executed as they are. Before running any jobs, make sure that the scheduler options accurately reflect your requirements, e.g. with regard to grant ID, partition name, and execution time.
To run shell commands in interactive mode, use the following (sample) command:
This will run a job in the plgrid-testing partition, on a single node, using a single core.
Warning
plgrid-testing is a partition that may not be present on every supercomputer. Make sure you are using the proper partition. You can check the available partitions in this documentation or by running the sinfo command on a login node.
The srun command is typically used to execute the specified operation on the assigned resources.
However, if resources have not been allocated in advance, srun also takes care of performing the allocation.
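A minimal sketch of the interactive invocation described above, assuming the plgrid-testing partition exists on your cluster and using <grant_id> as a placeholder. The helper name interactive_cmd is hypothetical; it only prints the command so that it can be reviewed before being pasted on a login node.

```shell
# Placeholder values (assumptions): replace with your own grant and partition.
grant_id="<grant_id>"
partition="plgrid-testing"

# Print the srun command: one node, one task, with a pseudo-terminal
# attached to a login shell.
interactive_cmd() {
    printf 'srun -p %s -N 1 --ntasks-per-node=1 -A %s --pty /bin/bash -l\n' \
        "$partition" "$grant_id"
}

interactive_cmd
```

Because no allocation exists beforehand, srun would both allocate the resources and start the shell on them.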
Batch mode
To run a job in batch mode, use the sbatch command. Usage: sbatch script.sh.
All scheduler options should be preceded by the #SBATCH keyword (do not forget the #!).
For more information see: man sbatch and sbatch --help.
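The batch workflow can be sketched as follows. This is only an illustration: the option values are placeholders, the scratch directory and file name are assumptions, and the final submission step is left as a comment because it requires a Slurm login node.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
script="$workdir/script.sh"

# All option values below are placeholders (assumptions).
cat > "$script" <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=example
#SBATCH --time=00:10:00
#SBATCH --account=<grant_id>
#SBATCH --partition=plgrid-testing
hostname
EOF

# Count the scheduler directives found in the script.
n_directives=$(grep -c '^#SBATCH' "$script")
echo "directives: $n_directives"

# On a login node the script would then be submitted with:
#   sbatch "$script"
```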
Sample script with comments
#!/bin/bash -l
## Job name
#SBATCH -J ADFtestjob
## Number of allocated nodes
#SBATCH -N 1
## Number of tasks per node (by default this corresponds to the number of cores allocated per node)
#SBATCH --ntasks-per-node=1
## Memory allocated per core
#SBATCH --mem-per-cpu=1GB
## Max task execution time (format is HH:MM:SS)
#SBATCH --time=01:00:00
## Name of grant to which resource usage will be charged
#SBATCH -A <grant_id>
## Name of partition
#SBATCH -p plgrid-testing
## Name of file to which standard output will be redirected
#SBATCH --output="output.out"
## Name of file to which the standard error stream will be redirected
#SBATCH --error="error.err"
## change to sbatch working directory
cd "$SLURM_SUBMIT_DIR"
## load an application using Modules
module load plgrid/apps/adf/2014.07
## run a binary with an input file
adf input.adf
Single core job (on Ares)
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=3850M
# ... other configuration directives
In the above script, we allocate a single task with one core on a single node and 3850 MB of memory. These values are the defaults for jobs in the plgrid queue on Ares, so specifying them is optional, but it is good practice to state them explicitly when a job uses more than one core.
Half node job (on Ares)
If you are unsure whether your application can benefit from a whole node, allocating a fraction of a node is best. Half of a computing node on Ares has 24 cores and 92 GB of memory, which is a reasonable starting point. In a Slurm script, this looks like the following example:
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=92G
# ... other configuration directives
Note that the script explicitly requests 1 node, 1 task, and 24 cores for the task. The memory declaration requests half of the node's memory. Why not allocate 120 GB? Because that would be more than half of the node's total memory. In such a case, two jobs would not fit on one node, which makes scheduling more difficult and resource usage suboptimal.
Additionally, if the job exceeds the amount of memory proportional to 1 CPU (3.85 GB on Ares), the accounting system takes this into account and charges the grant as if the job had used more CPUs. For example, a job requesting 24 CPUs and 120 GB of memory is billed as if it used 32 cores.
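The billing effect can be illustrated with a small calculation. The exact accounting formula is not given here, so the rule below is an assumption: the job is charged for the larger of the requested CPU count and the requested memory divided by 3.85 GB, rounded up. The function name billed_cpus is hypothetical.

```shell
# Assumed billing rule (not stated in the text): a job is charged for
# max(requested CPUs, ceil(requested memory in GB / 3.85)).
billed_cpus() {
    cpus=$1
    mem_gb=$2
    # Integer arithmetic in hundredths of a GB: ceil(mem_gb * 100 / 385).
    by_mem=$(( (mem_gb * 100 + 384) / 385 ))
    if [ "$by_mem" -gt "$cpus" ]; then
        echo "$by_mem"
    else
        echo "$cpus"
    fi
}

billed_cpus 24 92    # half of an Ares node
billed_cpus 24 120   # memory exceeds the per-CPU ratio
```

Under this assumed rule the first call prints 24 and the second prints 32, which matches the example in the text.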
Multi node job (on Ares)
If your job can utilize many cores, allocating a whole node or multiple whole nodes is best. Avoid jobs where individual tasks are spread across several nodes. Such jobs usually perform poorly, as communication within a single node is much faster than communication between machines. A sample script for a multi-node job is shown below:
#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#SBATCH --mem=184G
# ... other configuration directives
# ... some MPI application
This script specifies that the job will use 4 nodes and 192 cores, with 184 GB of memory allocated on each node. Note that the job explicitly requests 4 whole nodes with 48 tasks on each node. This ensures that tasks are placed close to each other.
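The core count quoted above follows from multiplying the three allocation parameters. A minimal sketch, assuming the standard Slurm environment variables are set inside the job and falling back to the values of the example script elsewhere; the function name total_cores is hypothetical.

```shell
# total_cores NODES TASKS_PER_NODE CPUS_PER_TASK
total_cores() {
    echo $(( $1 * $2 * $3 ))
}

# Inside a job these variables are set by Slurm when the matching options
# were given; outside a job the defaults mirror the example script.
total_cores "${SLURM_JOB_NUM_NODES:-4}" \
            "${SLURM_NTASKS_PER_NODE:-48}" \
            "${SLURM_CPUS_PER_TASK:-1}"
```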
CPU job script
This is a simple script for submitting basic CPU jobs:
Simple CPU job
#!/bin/bash -l
#SBATCH --job-name=job_name
#SBATCH --time=01:00:00
#SBATCH --account=<grantname-cpu>
#SBATCH --partition=plgrid
module load python
srun python myapp.py
The job is named job_name, declares a run time of 1 hour, is charged to the grantname-cpu account, and is submitted to the plgrid partition (the default for CPU jobs). The job operates in the directory where the sbatch command was issued, loads a python module, and executes a Python application. The job's output is written to a file named slurm-<JOBID>.out in that directory. Placing srun before the python invocation is good practice, as in more complex cases srun allows for more precise control of the resources assigned to the application.
A more advanced job could look like the following example:
Advanced CPU job
#!/bin/bash -l
#SBATCH --job-name=job_name
#SBATCH --time=01:00:00
#SBATCH --account=grantname-cpu
#SBATCH --partition=plgrid
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#SBATCH --mem=180G
#SBATCH --output="joblog-%j.txt"
#SBATCH --error="joberr-%j.txt"
module load openmpi
mpiexec myapp.bin
Note the additional parameters and the MPI-enabled application. This job uses 2 nodes, with 48 tasks on each node, and each task uses 1 CPU. 180 GB of memory is allocated on each node. The job's stdout and stderr are redirected to the joblog-<JOBID>.txt and joberr-<JOBID>.txt files. In the example, the myapp.bin application uses MPI. The mpiexec command is responsible for spawning the additional application ranks (processes). For most MPI applications, the parameters of mpiexec are configured by the system, so there is no need to specify the -np argument explicitly. Note that using mpiexec allows us to omit the srun command, as it is used by mpiexec internally.
Parallel jobs
MPI jobs
When running MPI jobs, the -n and -np options should not be specified explicitly; instead, the system selects optimal values based on the allocation parameters expressed in the startup script. Usage of -n and -np should be restricted to exceptional cases which call for non-standard configurations (e.g. local threading).
#!/bin/bash -l
## Job name
#SBATCH -J MPITest
## Number of allocated nodes
#SBATCH -N 2
## Number of tasks per node (by default this corresponds to the number of cores allocated per node)
#SBATCH --ntasks-per-node=24
## Memory allocated per core
#SBATCH --mem-per-cpu=5GB
## Max task execution time (format is HH:MM:SS)
#SBATCH --time=01:00:00
## Name of grant to which resource usage will be charged
#SBATCH -A <grant_id>
## Name of partition
#SBATCH -p plgrid-testing
## Name of file to which standard output will be redirected
#SBATCH --output="output.out"
## Name of file to which the standard error stream will be redirected
#SBATCH --error="error.err"
## Host name
srun /bin/hostname
## Load the IntelMPI module
module add plgrid/tools/impi
## change to sbatch working directory
cd "$SLURM_SUBMIT_DIR"
mpiexec ./calcDiff 100 50
It is considered good practice to compile and execute applications in an environment comprising the same set of modules. We recommend using the mpiexec wrapper. Computational resource allocation - in this case 2 nodes, 24 cores each - results in 48 MPI tasks available for execution. It is permissible to run a series of MPI applications with a single script, but given their execution time (which is usually long) this approach is discouraged. Running multiple MPI applications in parallel within a single allocation is not allowed due to resource contention issues.
Simple parallelization of independent tasks
Parallel processing is a typical use case scenario for computing clusters. One representative example involves processing multiple images with a shared algorithm. SLURM provides a way to easily parallelize such tasks with the srun command. A sample script which runs multiple computations in parallel (depending on their specifications) is provided below.
#!/bin/bash -l
## Job name
#SBATCH -J testjob
## Number of allocated nodes
#SBATCH -N 2
## Number of tasks per node (by default this corresponds to the number of cores allocated per node)
#SBATCH --ntasks-per-node=24
## Memory allocated per core
#SBATCH --mem-per-cpu=5GB
## Max task execution time (format is HH:MM:SS)
#SBATCH --time=01:00:00
## Name of grant to which resource usage will be charged
#SBATCH -A <grant_id>
## Name of partition
#SBATCH -p plgrid-testing
## Name of file to which standard output will be redirected
#SBATCH --output="output.out"
## Name of file to which the standard error stream will be redirected
#SBATCH --error="error.err"
## Load application's module
module load plgrid/tools/imagemagick
## change to sbatch working directory
cd "$SLURM_SUBMIT_DIR"
ls *.tif | xargs -t -d "\n" -P ${SLURM_NTASKS} -n 1 srun -n 1 -N 1 --mem=5gb mogrify -format png
The above script runs the mogrify application for each *.tif file in the working directory, converting the image to the png format. We begin by listing all input files, and feed the output into the xargs command, which (through the -P flag) ensures parallel execution of each subsequent srun instance. Each instance is supplied with a single input parameter, with the maximum level of parallelism defined as ${SLURM_NTASKS}. Thus, each invocation of srun runs the mogrify application on a single core, with 5 GB of assigned memory.
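The xargs fan-out pattern can be tried outside of a job with a stand-in worker. The sketch below is an assumption-laden illustration: the file names are hypothetical, a plain cp replaces the srun and mogrify invocation, the GNU-only -d option is omitted because the sample names contain no whitespace, and the parallelism defaults to 2 when SLURM_NTASKS is not set.

```shell
# Scratch directory with three empty stand-in "images" (hypothetical names).
workdir=$(mktemp -d)
touch "$workdir/a.tif" "$workdir/b.tif" "$workdir/c.tif"

# Inside a job SLURM_NTASKS is set by the scheduler; default to 2 elsewhere.
max_parallel="${SLURM_NTASKS:-2}"

# Stand-in worker: copies each file to a .png name. In the real script this
# position is taken by `srun -n 1 -N 1 --mem=5gb mogrify -format png`.
ls "$workdir"/*.tif |
    xargs -P "$max_parallel" -n 1 sh -c 'cp "$1" "${1%.tif}.png"' _

# Count the results once xargs has waited for all workers to finish.
n_png=$(( $(ls "$workdir"/*.png | wc -l) ))
echo "converted: $n_png"
```

At most max_parallel workers run at any time, and each worker receives exactly one file name, which mirrors how the job script hands one image to each srun instance.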
Array jobs
Array jobs provide a way to schedule large numbers of similar computational tasks. Each instance of an array job is assigned an index available in the $SLURM_ARRAY_TASK_ID environment variable, which can be used to parameterize the given instance. Submitting array jobs is similar to submitting ordinary batch jobs. The following sample script submits an array job with the task ID range defined by the --array (or -a) scheduler option:
#!/bin/bash -l
## Job name
#SBATCH -J testjob
## Number of allocated nodes
#SBATCH -N 1
## Number of tasks per node (by default this corresponds to the number of cores allocated per node)
#SBATCH --ntasks-per-node=1
## Memory allocated per core
#SBATCH --mem-per-cpu=5GB
## Max task execution time (format is HH:MM:SS)
#SBATCH --time=01:00:00
## Name of grant to which resource usage will be charged
#SBATCH -A <grant_id>
## Name of partition
#SBATCH -p plgrid-testing
## Name of file to which standard output will be redirected (%A is the array job ID, %a is the task index)
#SBATCH --output="output-%A_%a.out"
## Name of file to which the standard error stream will be redirected
#SBATCH --error="error-%A_%a.err"
## Array index range
#SBATCH --array=0-100
## change to sbatch working directory
cd "$SLURM_SUBMIT_DIR"
myCalculations $SLURM_ARRAY_TASK_ID
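One common way to parameterize an array instance is to map its index to a line of a parameter file. The following is a sketch under stated assumptions: the file inputs.txt, its contents, and the helper input_for_task are all hypothetical, and the index defaults to 0 when the script is run outside an array job.

```shell
# Hypothetical parameter file: one input per line.
workdir=$(mktemp -d)
printf '%s\n' alpha.dat beta.dat gamma.dat > "$workdir/inputs.txt"

# Map a zero-based array index to the matching line of the file.
input_for_task() {
    sed -n "$(( $1 + 1 ))p" "$workdir/inputs.txt"
}

# Inside an array job the index comes from the scheduler; default to 0 here.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
echo "task $task_id processes $(input_for_task "$task_id")"
```

With this layout, the range given to --array should match the number of lines in the parameter file.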
GPU job script
The SLURM scheduler treats GPUs as generic resources (GRES) identified by the gpu codeword.
You can find out which nodes/partitions provide such resources by using the sinfo command:
sinfo -o '%P || %N || %G'
To submit a GPU job, add --partition=plgrid-gpu-v100 --gres=gpu[:count] to your list of scheduler options (this example targets the NVIDIA V100 GPUs on Ares). If the count argument is omitted, the system will default to allocating a single GPU adapter per processing node.
Following submission of your job, the scheduler automatically sets the $CUDA_VISIBLE_DEVICES environment variable and permits access to the allocated GPUs.
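A job script can check how many GPUs it was given by inspecting that variable. This is a minimal sketch, assuming the variable holds a comma-separated list of device indices such as 0,1; the helper name count_gpus is hypothetical.

```shell
# Count the entries in a comma-separated device list such as "0,1".
count_gpus() {
    if [ -z "$1" ]; then
        echo 0
    else
        echo "$1" | tr ',' '\n' | wc -l | tr -d ' '
    fi
}

# Inside a GPU job the variable is set by the scheduler; empty elsewhere.
echo "visible GPUs: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```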
Simple GPU job (on Ares)
A simple script for submitting GPU jobs:
#!/bin/bash -l
#SBATCH --job-name=job_name
#SBATCH --time=01:00:00
#SBATCH --account=grantname-gpu
#SBATCH --partition=plgrid-gpu-v100
#SBATCH --cpus-per-task=4
#SBATCH --mem=40G
#SBATCH --gres=gpu
module load cuda
srun ./myapp
Note the GPU-specific account name and partition. The job allocates one GPU with the --gres parameter. The whole GPU is allocated to the job; the --mem parameter refers to the system memory used by the job. More information on how to use GPUs can be found in the Slurm GRES documentation.
Interactive jobs (on Ares)
For example, to run a single-node interactive job with 2 GPUs, type:
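A sketch of what such an invocation might look like, assuming the plgrid-gpu-v100 partition from the GPU job example and using <grant_id> as a placeholder. The helper gpu_interactive_cmd is hypothetical and only prints the command for review.

```shell
# Placeholder values (assumptions): replace with your own grant and partition.
grant_id="<grant_id>"
partition="plgrid-gpu-v100"

# Print an srun command for a single-node shell with the given GPU count.
gpu_interactive_cmd() {
    printf 'srun -p %s -N 1 --ntasks-per-node=1 --gres=gpu:%s -A %s --pty /bin/bash -l\n' \
        "$partition" "$1" "$grant_id"
}

gpu_interactive_cmd 2
```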
Interactive MPI jobs
Warning
Running interactive MPI jobs (e.g. for testing) is a special case. Because SLURM treats GPUs differently than general-purpose processors, launching interactive computations calls for a nonstandard procedure.
For example, to run an interactive job on 2 nodes with 2 GPUs per node (4 GPUs total) for 1 hour, specify the following:
The salloc command allocates our job and returns its ID - for example, 1234.
Next, we use srun requesting 0 GPUs for the job whose ID was returned by salloc:
This gives us shell access on one of the computing nodes. Now, after loading the appropriate MPI module, we may run our actual application via mpirun or mpiexec. It is important to avoid exporting environment settings, e.g. when using IntelMPI, remember to specify the -genvnone flag. The application will retain access to all GPUs allocated via salloc regardless of the gpu:0 parameter in srun.
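The two-step procedure described above can be sketched as follows. The partition name and the <grant_id> placeholder are assumptions, the job ID 1234 is the illustrative value used in the text, and the helpers alloc_cmd and shell_cmd are hypothetical; they only print the commands so that they can be reviewed before use.

```shell
# Placeholder values (assumptions): replace with your own grant and partition.
grant_id="<grant_id>"
partition="plgrid-gpu-v100"

# Step 1: allocate 2 nodes with 2 GPUs each (4 GPUs total) for 1 hour.
alloc_cmd() {
    printf 'salloc -p %s -N 2 --gres=gpu:2 --time=01:00:00 -A %s\n' \
        "$partition" "$grant_id"
}

# Step 2: open a shell inside the allocation with the given job ID,
# requesting 0 GPUs for the shell step itself.
shell_cmd() {
    printf 'srun --jobid=%s --gres=gpu:0 --pty /bin/bash -l\n' "$1"
}

alloc_cmd
shell_cmd 1234
```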