Resource management

The cluster's resources are managed by the Simple Linux Utility for Resource Management (SLURM). Despite the name it is not so simple, but it is quite powerful and has excellent documentation; please refer to it if you would like to learn more.

Here we describe the most commonly used functionalities and useful commands for users of the FireUni cluster.

Partitions

The cluster is organized into partitions to make it easy to request the resources needed for a particular job. There are a total of six overlapping partitions configured across forty-eight nodes. The associations are presented below for all forty-eight nodes (see the specification for details).

| Nodes    | CPUs per node | Memory per CPU | common | four-core | six-core | four-core-ntt | four-core-ggb |
|----------|---------------|----------------|--------|-----------|----------|---------------|---------------|
| w[01-18] | 4             | 2.75 GB        | ✅     | ✅        |          | ✅            |               |
| w[19-23] | 4             | 2.75 GB        | ✅     | ✅        |          |               | ✅            |
| w[24-48] | 6             | 1.85 GB        | ✅     |           | ✅       |               |               |

All nodes from the selected partition are potential candidates for the job.

When using the common partition, the lower memory limit per CPU (1.85 GB) is applied to all nodes. You can avoid this by specifying two partitions in your submit script, --partition=four-core,six-core - each partition will then apply its own limit. Alternatively, you can raise the limit explicitly with --mem-per-cpu=[YOURLIMIT]. The latter solution, however, can lead to OOM issues for memory-hungry programs.
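
For example, a minimal submit-script sketch using the two-partition approach could look as follows (myjob, the task count, the time limit and my_program are placeholders to adjust to your case):

#!/bin/bash
# Minimal sketch: request CPUs from either four-core or six-core,
# letting each partition apply its own per-CPU memory limit.
#SBATCH --job-name=myjob
#SBATCH --partition=four-core,six-core
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

# Placeholder for the actual program you want to run.
srun ./my_program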

List the queue

In the cluster GUI, queue information is available in the Active Jobs app.

You can list all the jobs you have submitted that are running or waiting to be executed (pending).

squeue --me

The command returns a table with basic information: the job ID (JOBID), the job name (NAME), the submitting user (USER) and the partition (PARTITION) to which the job was submitted.

Moreover, it provides a status (ST), which in most cases will be one of the following:

  • R - running, the job is currently in progress,
  • CF - configuring, the job is waiting for resources to be configured (i.e. nodes to start),
  • PD - pending, the job is submitted to the queue and not yet running (e.g. due to lack of available resources),
  • CA - cancelled, the job has been cancelled (e.g. due to an error),
  • TO - timeout, the job has been cancelled because the wall time has been reached.

TIME is the total time the job has been running. It is compared with the wall time specified via the sbatch argument --time.

NODES is the number of computational nodes of the cluster allocated to the job, and NODELIST is the specific list of those nodes.

When the status is PD (or another non-running state), you will see REASON instead of NODELIST. It is exactly what it sounds like - e.g. (Resources) means that your job is pending due to a lack of available resources.
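
If the queue is long, you can narrow the listing down with a state filter, for example (an illustrative use of squeue's --states option, using the full state names):

squeue --me -t PENDING    # only your pending jobs, with the reasons they wait
squeue --me -t RUNNING    # only your jobs that are currently running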

If you would like to have a look at the current workload of the whole cluster, use the bare command:

squeue

List the nodes (partitions)

In the cluster GUI, node/core availability is presented in the System Status app.

You can have a look at the partitions available on the cluster.

sinfo -p [partition name]

The command prints the state of the nodes in the specified partition:

  • idle - free and waiting for a job,
  • alloc - allocated, already performing a task and fully loaded,
  • mix - partially allocated, partially free,
  • down - not responding, admin intervention required (inform us ASAP),
  • fail - unavailable due to hardware/software failure.

Nodes from the six-core partition work in a power-saving mode. They are turned off by default, and SLURM automatically brings them up when they are needed for a job. For nodes from this partition, an additional character (a flag) may be printed next to the state:

  • ~ - node is powered off,
  • # - node is powering up,
  • % - node is powering off,
  • * - node is not responding.

No flag indicates that the node is powered up.
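
For example (an illustrative sketch; the partition name is taken from the table above):

sinfo                     # summary of all partitions and their node states
sinfo -p six-core         # nodes of the six-core partition only
sinfo -N -p six-core      # the same information listed node by node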

Show job info

In the cluster GUI, detailed job info is available in the Active Jobs app.

Detailed job info can be printed with scontrol. You will need the job ID, so use squeue --me if you are not sure.

scontrol show job [job ID]

You can check the meaning of the output fields in the SLURM documentation or with man scontrol.
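
For example, to extract just a few fields of interest from the output (the job ID 12345 and the selected field names are placeholders):

scontrol show job 12345 | grep -E 'JobState|Reason|TimeLimit'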

Cancel a job

In the cluster GUI, you can cancel, pause or modify your job from the Job Composer app.

You can only cancel the jobs you have submitted yourself. You will need the job ID, so use squeue --me if you are not sure.

scancel [job ID]

To cancel only the jobs that are pending (all running jobs will remain unaffected):

scancel -u [username] -t pending
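
For example (12345 is a placeholder job ID; $USER expands to your own username):

scancel 12345                    # cancel a single job by its ID
scancel -u $USER -t pending      # cancel all of your pending jobs; running ones stay untouched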