
Login nodes

Once you have logged into Hamilton, the Linux commands you type in at the prompt are run on one of the service's two login nodes. Although these are relatively powerful computers, they are a resource shared between all the users using Hamilton and should not be used for running demanding programs. Light interactive work, downloading and compiling software, and short test runs using a few CPU cores are all acceptable.

Care should be taken not to overload the login nodes: we reserve the right to stop programs that interfere with other people's use of the service.

Running intensive computations

The majority of the CPU cores and RAM on Hamilton are in its compute nodes, which are accessed via the queuing system, Slurm. Most work on Hamilton is done as non-interactive batch jobs that are scheduled by Slurm to run when space becomes available. However, interactive work is also possible through Slurm.

Batch jobs

A batch job is typically written using a login node and is submitted to Slurm from there. It is composed as a script, written with a text editor such as nano, that contains two things:

  • instructions to Slurm describing the resources (CPU, memory, time, etc) needed for the job and any other Slurm settings
  • the commands the job will run, in sequence.

The Example job scripts page has sample scripts for various types of jobs, and the Software pages have additional advice on configuring jobs for certain applications. All batch jobs are submitted using the command:

sbatch <job script name>

Once a job has been submitted to the queuing system, it will be scheduled and run as resources become free.

When a job script is submitted using the sbatch command, the system will provide you with a job number, or job id. This number is how the system identifies the job; it can be used to see if the job has completed running yet, to cancel it, etc. If you need to contact us about a problem with a job, please include this number as it is essential when diagnosing problems.

Using the example job script for a serial job (see Example job scripts), and a fictional user account foobar22:

[foobar22@login1 ~]$ sbatch my_serial_job.sh
Submitted batch job 3141717

[foobar22@login1 ~]$ squeue -u foobar22
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
           3141717    shared my_seria   foobar22 PD       0:00      1 (Resources)

The fifth column (ST) shows what state the job is in: R means the job is running and PD means the job is pending, i.e. waiting for its turn in the queue. While it is pending, the NODELIST(REASON) column will show why it is not yet running, for example:

  • (Resources) - normal. The job is waiting for nodes to become free so that it can run
  • (Priority) - normal. The job is waiting in the queue as there are higher-priority jobs ahead of it
  • (PartitionNodeLimit) - job will not run. The job submission script has asked for more resources than the queue allows
When the job has started running, a file called slurm-<jobid>.out will be created. This contains any output printed by the commands in your job script. If the batch scheduler has to kill your job, for example because it tried to use more time or memory than requested, this will be noted at the bottom of this file.

Once the job has finished running, it will no longer appear in the output of squeue. Details about a finished job can be obtained from the command sacct -j <jobid>.
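
For example, using the fictional job id from earlier, a brief summary of a finished job's state and resource usage could be requested as follows (the field list here is only an illustration; the sacct man page describes all available fields):

[foobar22@login1 ~]$ sacct -j 3141717 -o jobid,jobname,state,elapsed,maxrss,exitcode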

Interactive jobs

Interactive jobs are useful when, for example, work needs to be done interactively but is too intensive for a login node, or when testing software's behaviour in a Slurm environment. The Slurm command srun will start an interactive job. For example, to start an interactive shell on a compute node, use:

srun --pty bash

Jobs run through srun are subject to the same controls as batch jobs. If you need extra resources, such as CPU cores, memory or time, request them in the same way as with sbatch (see Queueing system). For example:

srun --pty --mem=2G -c 2 -p test bash

Instead of starting an interactive shell on a compute node, other commands can also be run through srun, e.g:

srun --mem=2G -c 2 -p test <mycommand>

Queueing system

Useful commands

The core commands to interact with the Slurm scheduling system are:

  • sfree - show what resources are available
  • sinfo - summary of the system and status
  • sbatch <jobscript> - submit a job to the queue
  • squeue -u <username> - see the status of jobs in the queue
  • scancel <jobid> - remove jobs from the queue
  • sacct -j <jobid> - show details of a job that has finished
  • srun - start an interactive job
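
As an illustration, a typical sequence using these commands with the fictional user and job id from earlier might be:

[foobar22@login1 ~]$ squeue -u foobar22      # check the job's position and state in the queue
[foobar22@login1 ~]$ scancel 3141717         # remove the job from the queue if it is no longer needed
[foobar22@login1 ~]$ sacct -j 3141717        # inspect the job's details once it has finished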

Available queues and job limits

Compute nodes are organised into queues (also known as partitions). Hamilton currently has 5 queues:

Queue    Description                                             Node type     Node quantity   Job time limit
shared   Default queue, intended for jobs that can share nodes   Standard      119(*)          3 days
multi    For jobs requiring one or more whole nodes              Standard      119(*)          3 days
long     For jobs requiring >3 days to run                       Standard      1(*)            7 days
bigmem   For jobs requiring >250GB memory                        High-memory   2               3 days
test     For short test jobs                                     Standard      1               15 minutes

(*) The shared, multi and long queues share a single pool of 119 nodes.

Types of compute node:

  • Standard - 128 CPU cores, 400GB temporary disk space, 250GB RAM
  • High-memory - 128 CPU cores, 400GB temporary disk space, 1.95TB RAM

Most work on Hamilton is done in the form of batch jobs, but it is also possible to run interactive jobs via the srun command. Both types of job can be submitted to any queue.

Job resources and options

Unless you specify otherwise, jobs will be submitted to the shared queue and allocated the following resources:

  • 1 hour (15 minutes for the test queue)
  • 1 CPU core
  • 1GB memory
  • 1GB temporary disk space ($TMPDIR)

Further resources can be allocated using sbatch or srun options, which can be included either on the command line (e.g. sbatch -n 1 <job_script>) or by embedding them in your job script (e.g. adding the line #SBATCH -n 1). If both are done, the command line takes precedence. Useful options include:

Option                  Description
-p <QUEUE>              Submit job to <QUEUE> (queues are also known as partitions)
-t <TIME>               Run job for a maximum time of <TIME>, in the format dd-hh:mm:ss
-c <CORES>              For multi-core jobs: allocate <CORES> CPU cores to the job
-n <CORES>              For MPI jobs: allocate <CORES> CPU cores to the job
-N <NODES>              Allocate <NODES> compute nodes to the job
--mem=<MEM>             Allocate <MEM> RAM to the job, e.g. 1G
--gres=tmp:<TMPSPACE>   Allocate <TMPSPACE> temporary disk space on the compute node(s)
--array=<START>-<END>   Run the job several times, with task indexes <START> to <END>
--mail-user=<EMAIL>     Send job notifications to email address <EMAIL> (batch jobs only; not needed if notifications should go to the submitter's Durham address)
--mail-type=<TYPE>      Types of job notification to send, e.g. BEGIN, END, FAIL, ALL (recommended: END,FAIL). Batch jobs only.
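
As an illustrative sketch (the script name and values are placeholders), a 4-core job needing 8GB memory, 10GB temporary disk space and up to 6 hours of run time, with email notification on completion or failure, could be submitted to the shared queue with:

sbatch -p shared -t 0-06:00:00 -c 4 --mem=8G --gres=tmp:10G --mail-type=END,FAIL my_job_script.sh

Equivalently, the same options can be embedded in my_job_script.sh as #SBATCH lines.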

Environment variables

Slurm sets a number of environment variables that can be helpful, for example, for matching the behaviour of a job to its resource allocation. These are detailed on the sbatch and srun man pages.

The four additional environment variables below are set to match the value given in #SBATCH -c <number>, to help automate the behaviour of multi-threaded programs. This should be reasonable in most cases, but the values can be changed in job scripts if desired.

  • $OMP_NUM_THREADS
  • $OPENBLAS_NUM_THREADS
  • $MKL_NUM_THREADS
  • $BLIS_NUM_THREADS
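
As a minimal sketch (assuming the job was submitted with #SBATCH -c 8), a job script could confirm, or override, the thread count like this; the override line is only needed if fewer threads than allocated cores are wanted:

echo "Allocated $SLURM_CPUS_PER_TASK cores; OMP_NUM_THREADS is $OMP_NUM_THREADS"
# export OMP_NUM_THREADS=4   # optional: override the default set by Slurm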

Example jobs

1) Serial jobs (1 CPU core)

Programs that aren't parallel, which includes most programs, are known as serial or sequential programs. They only use one CPU core at a time, and so many can run at the same time on one of Hamilton's multi-core compute nodes.

An example job script to run a program called my_serial_program would be:

#!/bin/bash

# Request resources:
#SBATCH -c 1           # 1 CPU core
#SBATCH --mem=1G       # memory required, up to 250G on standard nodes.
#SBATCH --time=1:0:0   # time limit for job (format:  days-hours:minutes:seconds)
#SBATCH --gres=tmp:1G  # temporary disk space required on the compute node ($TMPDIR),
# up to 400G
# Run in the 'shared' queue (job may share node with other jobs)
#SBATCH -p shared

# Commands to be run:
module load my_module
./my_serial_program

If saved in a file called my_serial_job.sh, this can be submitted to the queue with the command sbatch my_serial_job.sh

2) Shared memory job (multiple CPU cores on one node)

Some programs can use more than one CPU core at a time, but are limited to a single compute node. These typically use programming techniques such as OpenMP or threading to achieve this. We call them shared memory programs, because the parallelisation requires that all CPU cores have access to the same RAM/memory.

An example job script to run a program called my_sharedmemory_program, would be:

#!/bin/bash

# Request resources:
#SBATCH -c 2          # number of CPU cores, one per thread, up to 128
#SBATCH --mem=1G      # memory required, up to 250G on standard nodes
#SBATCH --time=1:0:0  # time limit for job (format:  days-hours:minutes:seconds)
#SBATCH --gres=tmp:1G # temporary disk space required on the compute node ($TMPDIR),
# up to 400G
# Run in the 'shared' queue (job may share node with other jobs)
#SBATCH -p shared

# Commands to be run:
module load my_module
./my_sharedmemory_program

If saved in a file called my_shared_job.sh, this can be submitted to the queue with the command sbatch my_shared_job.sh

3) High memory job

Jobs that require >250GB memory (per node) should run in the bigmem queue. The nodes in this queue each have 1.95TB memory. An example job script my_bigmem_job.sh might be:

#!/bin/bash

# Request resources:
#SBATCH -c 1            # number of CPU cores, up to 128 for shared-memory programs
#SBATCH --mem=260G      # memory required, up to 1.95T
#SBATCH --time=1:0:0   # time limit for job (format:  days-hours:minutes:seconds)
#SBATCH --gres=tmp:1G   # temporary disk space required on the compute node ($TMPDIR),
# up to 400G
# Run in the bigmem queue (job may share node with other jobs)
#SBATCH -p bigmem

# Commands to be run:
module load my_module
./my_bigmem_program

If saved in a file called my_bigmem_job.sh, this can be submitted to the queue with the command sbatch my_bigmem_job.sh

4) Distributed memory job (multiple CPUs across one or more nodes)

Programs can be written to take advantage of CPU cores and memory spread across multiple compute nodes. They typically use the low-level library called MPI (Message Passing Interface) to allow communication between many copies of the same program, each with access to its own CPU core and memory. We call this a distributed memory programming model.

An example job script to run an MPI program called my_mpi_program would be:

#!/bin/bash

# Request resources:
#SBATCH -n 1           # number of MPI ranks (1 per CPU core)
#SBATCH --mem=1G       # memory required per node, in units M, G or T
#SBATCH --time=1:0:0   # time limit for job (format:  days-hours:minutes:seconds)
#SBATCH --gres=tmp:1G  # temporary disk space required on each compute node ($TMPDIR)
#SBATCH -N 1           # number of compute nodes. 

# Smaller jobs can run in the shared queue.  
# Larger jobs that will occupy one or more whole nodes should use the multi queue.
#SBATCH -p shared 

# Commands to be run.  
# Note that mpirun will automatically launch the number of ranks specified above 
module load my_module
mpirun ./my_mpi_program

If saved in a file called my_dist_job.sh, this can be submitted to the queue with the command sbatch my_dist_job.sh

5) Hybrid distributed and shared memory job (multiple CPUs across one or more nodes)

Writers of distributed memory programs have found that a mixed MPI/OpenMP model has its benefits (for example, reducing the memory and computation dedicated to halo exchanges between processes in grid-based codes).

For these codes, we recommend running one MPI rank per CPU socket (two MPI ranks per compute node on Hamilton). An example job script would be:

#!/bin/bash

# Request resources:
#SBATCH -n 1                    # number of MPI ranks
#SBATCH -c 1                    # number of threads per rank (one thread per CPU core)
#SBATCH --ntasks-per-socket=1   # number of MPI ranks per CPU socket
#SBATCH -N 1                    # number of compute nodes. 
#SBATCH --mem=1G                # memory required per node, in units M, G or T
#SBATCH --gres=tmp:1G           # temporary disk space on each compute node ($TMPDIR)
#SBATCH -t 1:0:0                # time limit for job (format: days-hours:minutes:seconds) 

# Smaller jobs can run in the shared queue. 
# Larger jobs that will occupy one or more whole nodes should use the multi queue.
#SBATCH -p shared 

# Commands to be run. 
# Note that mpirun will automatically launch the number of ranks specified above 
module load my_module
mpirun ./my_hybrid_program

If saved in a file called my_hybrid_job.sh, this can be submitted to the queue with the command sbatch my_hybrid_job.sh
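
As a further sketch, a hybrid job occupying one whole standard node (128 cores, two MPI ranks at one per socket, 64 threads each) would change the resource request along these lines and is better suited to the multi queue; combine these with the memory, time, module and mpirun lines from the example above:

#SBATCH -N 1                    # one whole compute node
#SBATCH -n 2                    # two MPI ranks in total
#SBATCH --ntasks-per-socket=1   # one rank per CPU socket
#SBATCH -c 64                   # 64 threads per rank (2 x 64 = 128 cores)
#SBATCH -p multi                # whole-node jobs should use the multi queue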

6) Job arrays

Sometimes it is necessary to run a large number of very similar jobs. To avoid having to write a job script for each of these jobs, the batch queue system provides a technique called job arrays, which allows a single job script to be run many times. Each run is called a task.

This feature can be combined with any of the above examples. Below is a serial job example that runs the command ./my_program 32 times, with the arguments input_file_1.txt to input_file_32.txt. 

Note: when using job arrays, individual tasks may not run correctly if they write to shared output files or temporary files. Use separate directories or uniquely-named output files, and make use of $TMPDIR for temporary files (see the variant after the example script below). Some applications may use default locations for temporary files; check the relevant Software pages for advice.

#!/bin/bash

# Request resources (per task):
#SBATCH -c 1           # 1 CPU core
#SBATCH --mem=1G       # 1 GB RAM
#SBATCH --time=1:0:0   # 1 hour (hours:minutes:seconds)

# Run on the shared queue
#SBATCH -p shared

# Specify the tasks to run:
#SBATCH --array=1-32   # Create 32 tasks, numbered 1 to 32

# Each separate task can be identified based on the SLURM_ARRAY_TASK_ID
# environment variable:

echo "I am task number $SLURM_ARRAY_TASK_ID"

# Run program:
module load my_module
./my_program input_file_${SLURM_ARRAY_TASK_ID}.txt
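
Following the note above about shared files, a slight variant of the final line (the output file names are placeholders) gives each task its own uniquely-named results file:

./my_program input_file_${SLURM_ARRAY_TASK_ID}.txt > results_${SLURM_ARRAY_TASK_ID}.txt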

Running long jobs on Hamilton

Most of the queues on Hamilton have a time limit of 3 days.  If you have a job that will need to run for longer, the long queue is an option, but this queue has very limited capacity so wait times can be long.   Another possibility is to checkpoint the job and restart it.

A long-running job has more to lose if it is interrupted, for example because it has reached its time limit or because of a system issue.  Checkpointing is a technique in which a program saves a copy of its state at intervals as it progresses, with the intention that it can be restarted from that point if execution is interrupted.   The process of checkpointing and restarting can be repeated until the program completes. This reduces the risk for long jobs and also allows the execution of jobs needing more time than the queues permit.  

The best checkpoint and restore option is one built into the program itself, and many popular applications have this capability.  It is sometimes also described as a 'restart' capability.  If your application has this capability, use that.  The rest of this page covers a feature installed on Hamilton that assists with cases that do not have built-in checkpointing.

Hamilton's Checkpoint/Restore feature

Hamilton has a feature called Checkpoint and Restore Jobs, which attempts to automatically save and restart a job when it reaches the queue's time limit, without the application knowing about it.  The technique can be used for serial and multi-threaded jobs, but not ones that use multiple compute nodes or Slurm tasks, e.g. MPI applications. 

Note:  a job can restore from a checkpoint only if its files are in the same state as at the time of the checkpoint.  For example, if a program modifies an output file after a checkpoint (e.g. because it updates the file very frequently) and then fails, it will not restart.  

Before you start using Checkpoint/Restore:

  1. If you would like to use the Checkpoint/Restore feature, please let us know so that we can give you access to it.  You will not be able to use it without this access.  You do not need to tell us if you are using your application's own checkpointing facility, only if you want to use the Hamilton feature.
  2. Check that your job is suitable.  It should not use multiple nodes or multiple slurm tasks (e.g. via MPI).
  3. Check the locations of files used by the job, including temporary files. Advice for different storage areas is outlined below.

Home directory 

Files used by the job cannot be held in your home directory and should be copied to /nobackup first. We provide a command migrate-to-nobackup to help with this.  Usage is: 

migrate-to-nobackup <file or directory> 

Advice for some common cases: 

R libraries 

If you have installed R libraries in your account, you can migrate your R library storage location to /nobackup, such that it can still be used by R: 

migrate-to-nobackup ~/R

WARNING - your R library directory will no longer be backed up. 

Python libraries 

If you have installed python libraries in your account using pip, you can migrate your pip library storage location to /nobackup, such that it can still be used by python: 

migrate-to-nobackup ~/.local/lib/python* 

WARNING - your pip python library directory will no longer be backed up. 

Other files in your home directory

We provide a tool to help identify any files in your home directory that are used by your jobs, including hidden files created automatically by your application: 

  1. Run a test job (without attempting to checkpoint it)
  2. While the job is running, type the following command on a login node:

    chkptproblems <JOBID>

    If the command reports a list of programs (under the "COMMAND" column) and files they have open (under the "NAME" column), then these files are stored in a location that will prevent your job from being successfully checkpointed. 
  3. Modify the job to use a different location and/or use the migrate-to-nobackup tool as above to move the files to /nobackup.  See Manual checkpoint/restart cycles below for further information on testing your application. 

Temporary files/TMPDIR 

All files used by a checkpoint/restore job must be available from all compute nodes, so a node's local storage should not be used. Because of this, Hamilton's Checkpoint and Restore solution sets the TMPDIR environment variable to /nobackup/$USER/tmp, so that jobs use this location for temporary files. Check that your /nobackup quota can accommodate any files you place in $TMPDIR.

Run the command chkptproblems <JOBID> on a login node while job <JOBID> is running, to report on any files used by the job that are located in a node’s local storage.   

A future development may allow you to stage data in/out of a job so that you can take advantage of the local disk space on a compute node. Please let us know if this is important to you. 

Running a job 

To submit a Checkpoint and Restore job (the chmod command is only needed the first time): 

module load chkpt 
chmod u+x my_job_script.sh 
sbatch $CHKPT_HOME/chkpt_job  ./my_job_script.sh 

Important: the contents of any #SBATCH lines in your job script will be ignored, so these need to be provided as flags to sbatch instead. For example, if your job script included the line #SBATCH -c 8 to request 8 CPU cores, you would need to submit using: 

sbatch -c 8 $CHKPT_HOME/chkpt_job ./my_job_script.sh  

Important: the files in use by the job when it is checkpointed need to be in /nobackup, otherwise checkpointing will fail.  This includes job output files. The job script may need to cd to somewhere in /nobackup to avoid running in your home directory. 
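
For example, a job script might begin by moving into a directory under /nobackup before running anything (a sketch; the project directory name and program are placeholders):

#!/bin/bash
# Work entirely under /nobackup so that checkpointing can succeed
cd /nobackup/$USER/my_project || exit 1

module load my_module
./my_serial_program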

By default, the job will have a time limit of 3 days, but will be checkpointed and requeued 1 hour before it reaches that limit. Other restrictions, such as the number of CPU cores and memory available to the job, will behave in the same way as they do for non-checkpoint/restart jobs. 

Jobs using the checkpoint and restore feature run a maximum of 10 times by default, although this can be increased using the -r <maxruns> flag to chkpt_job.
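
For example, to allow up to 20 runs (assuming, as with the -c flag described below, that chkpt_job options are placed between the chkpt_job path and the job script):

module load chkpt
sbatch $CHKPT_HOME/chkpt_job -r 20 ./my_job_script.sh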

Changes to output files 

A new file, chkpt-<jobid>.out, will contain the output from your job script. 

The usual Slurm output file, by default called slurm-<jobid>.out, will now contain the checkpoint/restore messages for the job, which we will find useful if you ask us to help you with a job that is not checkpointing or restarting correctly. 

Retrieving Slurm accounting information about a Checkpoint and Restore job 

A checkpointed job keeps its original jobID when it restarts. As the sacct command by default shows you only the most recent instance of a job, use the -D ("duplicates") flag to see the entire history of the job, including any restarts. For example, the following command shows some useful information: 

sacct -j <jobid> -D -o jobid,state,totalcpu,cputime,reqmem,maxrss --units=G 

Manual checkpoint/restart cycles 

When using a new application with checkpoint and restart, we recommend that you test it.  To force a job to checkpoint and restart, rather than waiting for it to run for three days, type: 

scancel --signal=USR1 --batch <jobid> 

To cancel a Checkpoint and Restore job completely: 

scancel <jobid> 

Restarting a previous Checkpoint and Restore job 

As long as jobs fulfill the requirements above, including that files must not have been changed since the last checkpoint was taken, checkpoint/restore jobs can be restarted manually if necessary.  If you find that a checkpoint and restore job has stopped restarting, e.g. because of a system failure or because the job has reached its maximum number of restarts, it can be restarted manually using: 

module load chkpt 
sbatch <sbatch_options> $CHKPT_HOME/chkpt_job -c <chkptid> 

where <sbatch_options> are the flags you originally supplied to sbatch, and <chkptid> is the ID of the checkpoint to resume.  <chkptid> takes the form of <jobid>__<runid>, where <runid> is incremented each time the job is restarted. 

Available checkpoint IDs can be listed using the command: 

ls /nobackup/chkpt/$USER 
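
As an illustration using the fictional job id from earlier, if this listing showed a checkpoint called 3141717__2 and the job was originally submitted with sbatch -c 8, it could be resumed with:

module load chkpt
sbatch -c 8 $CHKPT_HOME/chkpt_job -c 3141717__2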

Checkpoint retention

Note that old checkpoints will be automatically deleted if unused for 30 days.