The queue system

From hpcwiki

Jobs are managed by a queue manager called Torque, an implementation of PBS (Portable Batch System).

A batch job consists of a regular shell script containing resource requests (e.g. number of nodes, number of cores, amount of memory). When enough resources are available, your script is launched on the assigned node. The script itself is responsible for starting the application on the assigned nodes. How this is done is application specific and is usually explained in the application documentation (e.g. the Matlab parallel toolbox). Keep in mind that the queue does no magic: it only assigns nodes, launches the script and waits until the script finishes.

The three most used user commands in a PBS/Torque queue system are:

qsub

Usage: qsub <job script>

Submits the job described in the file <job script> to the queue system. This file is a shell script with extra PBS queue directives.

A simple example job looks like this:

#!/bin/sh
#
#PBS -N echo_test
#PBS -l nodes=1,walltime=01:00:00
#PBS -q guest
#PBS -M J.Smith@example.com
#PBS -o out.$PBS_JOBID
#PBS -e err.$PBS_JOBID
# Start echo_test example job
cd "$PBS_O_WORKDIR"
echo "hello"

This script changes to the directory from which the job was submitted, runs the shell command 'echo "hello"' on a compute node and exits.
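
To try the example, the script can be saved to a file and handed to qsub. The following is a minimal sketch, not a site-specific recipe: the file name echo_test.sh and the temporary directory are assumptions made for illustration, and the job body falls back to the current directory so it can also be run by hand. It relies on qsub printing the id of the new job on standard output; where qsub is not installed, it only reports where the job file was written.

```shell
#!/bin/sh
# Sketch: save the example job to a file and submit it.
# The file name echo_test.sh is an assumption made for illustration.

workdir=$(mktemp -d)            # throwaway directory for the demo job file
jobfile="$workdir/echo_test.sh"

# Quoted heredoc delimiter: variables are written literally, not expanded now.
cat > "$jobfile" <<'EOF'
#!/bin/sh
#PBS -N echo_test
#PBS -l nodes=1,walltime=01:00:00
#PBS -q guest
# Fall back to the current directory when run outside the queue.
cd "${PBS_O_WORKDIR:-.}" || exit 1
echo "hello"
EOF

if command -v qsub >/dev/null 2>&1; then
    # qsub prints the id of the new job on stdout.
    jobid=$(cd "$workdir" && qsub "$jobfile")
    echo "submitted as $jobid"
else
    echo "qsub not found; job file written to $jobfile"
fi
```

Because the #PBS lines are ordinary shell comments, the same file can be tested outside the queue with `sh echo_test.sh` before it is submitted.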

The #PBS lines are Torque directives which provide the following information to the queue system:

#PBS -N <name>

Name of the job in queue system

#PBS -l nodes=<x>,walltime=<hh:mm:ss> or
#PBS -l nodes=<x>:ppn=<c>,walltime=<hh:mm:ss>

Number of requested nodes <x>, processors (cores) per node <c> and, optionally, the estimated wallclock time <hh:mm:ss> (hours:minutes:seconds) the job will require

#PBS -q <queue>

Name of the queue where the job will be submitted

#PBS -M <email address>

Email address to send job status in case of a problem

#PBS -o <file>

Name of the file to which stdout is written

#PBS -e <file>

Name of the file to which stderr is written
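
As an illustration of the nodes/ppn form, the sketch below requests 2 nodes with 4 cores each for 30 minutes. The job name multicore_test and the numbers are made up for the example. The shell variables nodes and ppn only mirror the directive so the script can print the total; the queue system does not read them.

```shell
#!/bin/sh
#PBS -N multicore_test
#PBS -l nodes=2:ppn=4,walltime=00:30:00
#PBS -q guest
#PBS -o out.$PBS_JOBID
#PBS -e err.$PBS_JOBID

# These values repeat the -l request above by hand (illustration only).
nodes=2
ppn=4

# 2 nodes x 4 cores per node = 8 cores requested in total.
total=$((nodes * ppn))
echo "requested $total cores on $nodes nodes with a 30 minute walltime"
```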

You can use certain environment variables in the job script to pass specific data to programs or change directories. In fact, the example job script uses two variables: $PBS_JOBID and $PBS_O_WORKDIR

The most useful variables are:

$PBS_JOBNAME User specified job name
$PBS_JOBID Unique PBS/torque job id
$PBS_QUEUE Queue to which the job was submitted
$PBS_WALLTIME Total wallclock time in seconds
$PBS_O_WORKDIR Directory where qsub command was executed
$PBS_O_HOME Home directory of submitting user
$PBS_O_LOGNAME Name of submitting user
$PBS_O_SHELL Script shell
$PBS_O_HOST Host on which the qsub command was executed
$PBS_O_PATH Path variable used to locate executables within job script
$TMPDIR Local scratch directory on the node. Use this for storing temporary files. Since this is a local disk, access is much faster than the /home directory. The directory will be cleaned when the job exits.
$PBS_NUM_NODES Number of nodes allocated to the job
$PBS_NUM_PPN Number of procs(=cores) per node allocated to the job
$PBS_NP Number of total procs(=cores) allocated to the job (equal to $PBS_NUM_NODES * $PBS_NUM_PPN)
$PBS_NODEFILE File containing a line-delimited list of the nodes allocated to the job
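
A job script might use these variables roughly as follows. This is a sketch under stated assumptions: the helper count_hosts and the file name result.txt are hypothetical, and every variable gets a fallback value so the script also runs outside the queue. With a ppn request, Torque typically lists a host once per allocated core in $PBS_NODEFILE, which is why the helper counts distinct names.

```shell
#!/bin/sh
# Sketch: report the allocation and work in node-local scratch.
# Fallback values (after ":-") are only used outside the queue system.

# Count distinct hosts in a node file (one host name per line).
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

scratch="${TMPDIR:-/tmp}"
outdir="${PBS_O_WORKDIR:-$scratch}"

echo "job ${PBS_JOBNAME:-none} (${PBS_JOBID:-no id}) in queue ${PBS_QUEUE:-none}"
echo "cores: ${PBS_NP:-1}, scratch directory: $scratch"

if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "distinct hosts: $(count_hosts "$PBS_NODEFILE")"
fi

# Write intermediate data to scratch, then copy the result back,
# because the scratch directory is cleaned when the job exits.
result="$scratch/result.$$"
echo "done" > "$result"
cp "$result" "$outdir/result.txt"
rm -f "$result"
```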

qdel

Usage: qdel <jobid>

Deletes job with id <jobid> from the queue. If <jobid> is 'all', all user jobs will be deleted.
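
A small wrapper around qdel could look like the sketch below. The function names short_id and cancel_job are hypothetical, and the sketch assumes job ids of the form <number>.<server>, which is what qsub normally prints. The script does nothing unless a job id is passed as its first argument, and it only prints the command when qdel is not installed.

```shell
#!/bin/sh
# Sketch: cancel a job by id (helper names are made up for illustration).

# Strip the server suffix from a full job id, e.g. 12345.host -> 12345.
short_id() {
    echo "${1%%.*}"
}

# Delete the given job, or say what would be run if qdel is missing.
cancel_job() {
    if command -v qdel >/dev/null 2>&1; then
        qdel "$1"
    else
        echo "qdel not found; would run: qdel $1"
    fi
}

# Act only when a job id was given on the command line.
if [ "$#" -ge 1 ]; then
    echo "cancelling job number $(short_id "$1")"
    cancel_job "$1"
fi
```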

qstat

Usage: qstat [-a] [-n] [-q] [-Q]

Prints an overview of jobs with their respective owners, queues, queue times and status.

-a Displays jobs in the queue system in a long line format.

-n Like -a, but also lists the processor core(s) and node(s) used.

-q Displays queues and their status, number of jobs running, jobs queued, and total jobs allowed.

-Q Like -q but shows additional queue parameters with longer lines.
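
The options can be tried one at a time. The sketch below wraps them in a hypothetical helper, show_queue_info, which accepts only the four option letters described here and prints the command instead of running it when qstat is not installed.

```shell
#!/bin/sh
# Sketch: run qstat with one of the options a, n, q or Q.
# show_queue_info is a made-up helper name, not part of Torque.

show_queue_info() {
    # Reject anything other than the four documented option letters.
    case "$1" in
        a|n|q|Q) ;;
        *) echo "unknown option: $1" >&2; return 2 ;;
    esac
    if command -v qstat >/dev/null 2>&1; then
        qstat "-$1"
    else
        echo "qstat not found; would run: qstat -$1"
    fi
}

show_queue_info a   # jobs in long line format
show_queue_info q   # queues and their status
```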