The queue system
Jobs are managed by a queue manager called Torque, an implementation of PBS (Portable Batch System).
A batch job consists of a regular bash script containing resource requests (e.g. the number of nodes, cores or amount of memory). When enough resources are available, your script is launched on the assigned node. The script itself must make sure the application is started on the assigned nodes. How this is done is application specific and is usually explained in the application documentation (e.g. the Matlab parallel toolbox). Keep in mind that the queue does no magic: it only assigns nodes, launches the script and waits until the script finishes.
The three most used user commands in a PBS/Torque queue system are:
qsub
Usage: qsub <job script>
Submits the job specified in the file <job script> to the queue system. This is a shell command file with extra PBS queue directives.
A simple example job looks like this:
    #!/bin/sh
    #
    #PBS -N echo_test
    #PBS -l nodes=1,walltime=01:00:00
    #PBS -q guest
    #PBS -M J.Smith@example.com
    #PBS -o out.$PBS_JOBID
    #PBS -e err.$PBS_JOBID
    #
    # Start echo_test example job
    cd $PBS_O_WORKDIR
    echo "hello"
This script changes to the directory from which the job was submitted, runs the shell command 'echo "hello"' on a compute node and exits.
The #PBS lines are Torque directives which provide the following information to the queue system:
#PBS -N <name>
Name of the job in the queue system
#PBS -l nodes=<x>,walltime=<hh:mm:ss> or #PBS -l nodes=<x>:ppn=<c>,walltime=<hh:mm:ss>
Number of requested nodes <x>, procs/cores per node <c> and, optionally, the estimated wallclock time <hh:mm:ss> (hours:minutes:seconds) the job will require
#PBS -q <queue>
Name of the queue where the job will be submitted
#PBS -M <email address>
Email address to send job status in case of a problem
#PBS -o <file>
Name of output file to write stdout
#PBS -e <file>
Name of output file to write stderr
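The walltime in the -l directive is written as hh:mm:ss. When the expected runtime is known in seconds, a small helper can produce that form. The following is a minimal sketch; the function name to_walltime is hypothetical and not part of Torque.

```shell
#!/bin/sh
# to_walltime is a hypothetical helper, not a Torque command.
# It converts a number of seconds into the hh:mm:ss form used by "-l walltime=".
to_walltime() {
    secs=$1
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Print a resource request for 2 nodes with 4 cores each and 90 minutes of walltime.
echo "#PBS -l nodes=2:ppn=4,walltime=$(to_walltime 5400)"
```

The printed line can be pasted into a job script as its resource request.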
You can use certain environment variables in the job script to pass specific data to programs or change directories. In fact, the example job script uses two variables: $PBS_JOBID and $PBS_O_WORKDIR. The most useful variables are:
Variable | Description
$PBS_JOBNAME | User-specified job name
$PBS_JOBID | Unique PBS/Torque job id
$PBS_QUEUE | Queue to which the job was submitted
$PBS_WALLTIME | Total wallclock time in seconds
$PBS_O_WORKDIR | Directory where the qsub command was executed
$PBS_O_HOME | Home directory of the submitting user
$PBS_O_LOGNAME | Name of the submitting user
$PBS_O_SHELL | Script shell
$PBS_O_HOST | Host on which the qsub command was executed
$PBS_O_PATH | Path variable used to locate executables within the job script
$TMPDIR | Local scratch directory on the node. Use this for storing temporary files. Since this is a local disk, access is much faster than the /home directory. The directory will be cleaned when the job exits.
$PBS_NUM_NODES | Number of nodes allocated to the job
$PBS_NUM_PPN | Number of procs (=cores) per node allocated to the job
$PBS_NP | Total number of procs (=cores) allocated to the job (equal to $PBS_NUM_NODES * $PBS_NUM_PPN)
$PBS_NODEFILE | File containing a line-delimited list of the nodes allocated to the job
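The variables above can be combined in a job script along the following lines. This is a sketch only: the job name env_demo, the file name result.txt and the fallback values are assumptions made for illustration (the fallbacks allow a dry run outside the queue), and the node count assumes that $PBS_NODEFILE may list a node once per allocated core.

```shell
#!/bin/sh
#PBS -N env_demo
#PBS -l nodes=1:ppn=2,walltime=00:05:00

# Fallback values (an assumption for illustration) allow a dry run outside the queue.
workdir=${PBS_O_WORKDIR:-$PWD}
scratch=${TMPDIR:-/tmp}
nodefile=${PBS_NODEFILE:-/dev/null}

cd "$workdir" || exit 1

# Count the unique node names in a node file (a node may appear once per core).
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

echo "job ${PBS_JOBID:-none} (${PBS_JOBNAME:-unnamed}) in queue ${PBS_QUEUE:-none}"
echo "unique nodes: $(count_nodes "$nodefile"), total cores: ${PBS_NP:-1}"

# Write intermediate data to the local scratch directory, then copy the
# result back, because the scratch directory is cleaned when the job exits.
tmpfile="$scratch/result.$$"
echo "hello" > "$tmpfile"
cp "$tmpfile" "$workdir/result.txt"
rm -f "$tmpfile"
```

Working in $TMPDIR and copying only the final result back keeps heavy file traffic off the /home directory.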
qdel
Usage: qdel <jobid>
Deletes the job with id <jobid> from the queue. If <jobid> is 'all', all user jobs will be deleted.
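qsub prints the id of the new job on standard output, so a script can store it and pass it to qdel later. The sketch below assumes a job script called job.sh (a placeholder name) and a job id of the form <number>.<server>; the helper job_number is hypothetical.

```shell
#!/bin/sh
# job_number is a hypothetical helper: it strips the server suffix from a job id
# such as "1234.headnode" (id format assumed), leaving the numeric part.
job_number() {
    echo "${1%%.*}"
}

# Only talk to the queue when the Torque client commands are installed.
if command -v qsub >/dev/null 2>&1; then
    jobid=$(qsub job.sh)                 # job.sh is a placeholder job script
    echo "submitted job $(job_number "$jobid")"
    qdel "$jobid"                        # remove the job from the queue again
fi
```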
qstat
Usage: qstat [-a] [-n] [-q] [-Q]
Prints an overview of jobs with their respective owners, queues, queue times and status. The options are:
-a
Displays jobs in the queue system in a long line format.
-n
Like -a, but also lists the processor core(s) and node(s) used.
-q
Displays queues and their status, number of jobs running, jobs queued, and total jobs allowed.
-Q
Like -q, but shows additional queue parameters with longer lines.
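The output of qstat can also be processed in a script, for example to count jobs per state. The sketch below assumes the default qstat layout, with two header lines and the job state (such as R for running or Q for queued) in the fifth column; the helper count_state is hypothetical, and the layout should be checked against the local installation.

```shell
#!/bin/sh
# count_state is a hypothetical helper: it reads default "qstat" output on stdin
# and counts the jobs whose state column (5th field, assumed layout) equals $1.
# The first two lines are skipped as header lines.
count_state() {
    awk -v s="$1" 'NR > 2 && $5 == s { n++ } END { print n + 0 }'
}

# Only query the queue when the Torque client commands are installed.
if command -v qstat >/dev/null 2>&1; then
    echo "running: $(qstat | count_state R)"
    echo "queued:  $(qstat | count_state Q)"
fi
```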