More about queues and nodes


The different queues

The larger hpc clusters, most notably hpc03, hpc06, hpc11 and hpc12, are shared by two or more research groups. On those clusters every group has its own queue, sometimes more than one. These queues give full and exclusive access to a specific set of nodes.

There is also a guest queue on every hpc cluster that gives access to all nodes, but with restrictions: you cannot run non-rerunable or interactive jobs in it.

In most cases, access to a queue is based on group membership in the Active Directory. If your netid is not a member of the right group, the jobs you submit go to the guest queue by default. If you have access to the group and bulk network shares of your research group, you should also have access to the normal queue on the hpc cluster. If not, contact the secretary of your research group and ask them to arrange the group membership of your netid.

You can check your default queue by submitting a small test job and then looking at the job list with the qstat command.

[jsmith@hpc10 ~]$ echo "sleep 60" | qsub 
[jsmith@hpc10 ~]$ qstat -u jsmith

If you see anything other than guest in the third column, then you are all set.
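The same check can be scripted. The sketch below assumes the usual Torque layout of the qstat -u output, in which every job line has the job ID, the user name and the queue in its first three columns. The helper name job_queues is hypothetical.

```shell
# job_queues USER: read 'qstat -u USER' output on stdin and print
# "jobid queue" for every job line that belongs to USER.
# Assumption: user name in column 2, queue in column 3.
job_queues() {
    awk -v user="$1" '$2 == user { print $1, $3 }'
}

# Typical use on the master node:
#   qstat -u "$USER" | job_queues "$USER"
```

Header and separator lines are skipped automatically, because their second column never equals the user name.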

There are two ways to select the guest queue:

With the -q switch on the command line:

qsub -q guest job1

Or with a directive at the start of your job script:

#PBS -q guest

It is important to know that a job in the guest queue can be interrupted and resumed at any time. You should make sure that the application in your job saves the intermediate results at regular intervals and that it knows how to continue when your job is resumed. If you neglect this, your job in the guest queue will start all over again every time it is interrupted and resumed.
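How an application saves and resumes its work depends entirely on the application. As a minimal sketch of the pattern, assume a hypothetical workload that consists of numbered steps; the function run_steps and the checkpoint file are illustrative names, not part of any real program.

```shell
# run_steps TOTAL CHECKPOINT_FILE
# Hypothetical stand-in for a real application: performs steps
# 1..TOTAL and records the last finished step in CHECKPOINT_FILE,
# so that an interrupted run resumes where it stopped.
run_steps() {
    total=$1
    ckpt=$2
    last=0
    # Resume: pick up the last completed step if a checkpoint exists.
    if [ -f "$ckpt" ]; then
        last=$(cat "$ckpt")
    fi
    step=$((last + 1))
    while [ "$step" -le "$total" ]; do
        echo "working on step $step"      # real work would go here
        # Write to a temporary file first and rename it afterwards,
        # so an interruption never leaves a half-written checkpoint.
        echo "$step" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
        step=$((step + 1))
    done
}

# Typical use in a job script:
#   run_steps 100 "${PBS_O_WORKDIR}/checkpoint.txt"
```

If the job is interrupted after step 3 and started again, the loop continues with step 4 instead of step 1.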

The different nodes

On most hpc clusters you'll find that the worker nodes are not all identical: different series of nodes exist, purchased at different times and with different specifications. To distinguish between the series, nodes are labelled with properties like typea, typeb, typec, etc. On some hpc clusters, nodes have extra properties that show which queue they belong to or which additional features they have, like an infiniband network or extra memory compared to similar nodes.

A useful command that shows all nodes and how they are utilized is LOCALnodeload.pl. A typical output looks like this:

[jsmith@hpc10 ~]$ LOCALnodeload.pl
Node       Np State/jobs Load  Properties
---------- -- ---------- ----- ----------
n10-01     12 12         12.01 typea     
n10-02     12 free        0.00 typea     
n10-03     12 free        0.00 typea     
n10-04     12 free        0.00 typea     
n10-05     16 12         11.93 typeb     
n10-06     16 free        0.00 typeb     
n10-07     16 offline     0.00 typeb     
n10-08     16 down        0.00 typeb     

The first column (Node) shows the names of the nodes. The second column (Np) shows the total number of processors. The third column (State/jobs) shows the number of processors currently in use or the status of the node (free, offline or down). The fourth column (Load) shows the actual load on the nodes. In an ideal situation the load matches the number of processors in use. The last column (Properties) shows the properties as described above. As you can see in the example, typea nodes have 12 processors and typeb nodes have 16. Node n10-01 is fully occupied; node n10-05 is running one or more jobs but still has 4 processors free. Nodes n10-07 and n10-08 cannot be used.
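The number of free processors per node follows from the second and third columns. A minimal sketch, assuming the column layout of the example output above; the helper name free_procs is hypothetical.

```shell
# free_procs: read LOCALnodeload.pl output on stdin and print
# "node free_processors" for each node line.
# Offline and down nodes are reported with 0 usable processors.
free_procs() {
    awk '$2 ~ /^[0-9]+$/ {
        if ($3 == "free")          avail = $2
        else if ($3 ~ /^[0-9]+$/)  avail = $2 - $3
        else                       avail = 0
        print $1, avail
    }'
}

# Typical use on the master node:
#   LOCALnodeload.pl | free_procs
```

For the example output this reports 0 free processors for n10-01, 12 for n10-02 and 4 for n10-05.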

Selecting nodes

If you submit a job, the scheduler automatically selects a node to run it. By default a job gets one node and one processor. You can manually select the number of processors and nodes for your job by using the -l switch with the qsub command. You can also select nodes by property. The -l switch works like this:

qsub -l nodes=<x>:ppn=<c>:<property>:<property>...
  • <x> is either a number of nodes or the name(s) of the selected node(s)
  • <c> is number of processors per node
  • <property> is any of the properties you see in Properties column of the LOCALnodeload.pl command.

Examples:

qsub -l nodes=4                  # Request 4 nodes of any type
qsub -l nodes=n10-07+n10-08      # Request 2 specific nodes by hostname
qsub -l nodes=4:ppn=2            # Request 2 processors on each of four nodes
qsub -l nodes=1:ppn=4            # Request 4 processors on one node
qsub -l nodes=2:typea            # Request 2 nodes with the typea property

Instead of using the -l or -q switches on the command line when you submit your job with qsub, you can also add them as directives to your job script. For instance, if you add

#PBS -l nodes=1:ppn=4
#PBS -q guest

at the start of your script, you can just use

qsub job.sh

instead of

qsub -l nodes=1:ppn=4 -q guest job.sh

Avoid over- and underutilization

An important thing to consider when you create your own job script is matching the number of processors that you request with the number of processors that the software in your script will actually use. It is possible that you request only one processor and that your program will use all processors available on the nodes. This is called overutilization and is not very efficient when other jobs are already running on the same node and using the same processors.

It is also possible that you request several (or all) processors and that your program will only use one. This will leave the other processors you claimed unused (underutilization), which is also not very efficient because the unused processors you requested will not be used for other jobs.

How to avoid over- and underutilization? Many programs have options that will let them use only one thread (utilization of only one processor) or a specific number of threads.

For example, Ansys has the -np switch:

ansys -np N

and Fluent has the -t switch:

fluent -tN

where N matches the number of processors that you request in your job.

If your program does not have an option to limit the number of processors, you can try adding this line to your job script, just before the line where your program starts:

export OMP_NUM_THREADS=N

Of course, N must match the number of processors that you request in your job. Alternatively, you could also request an entire node (all processors) in your job and let your program use all available resources of that node.
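To keep the two numbers from drifting apart, the thread count can be taken from the job itself instead of being typed twice. The sketch below assumes that the local Torque version exports PBS_NUM_PPN (the ppn value of the job) into the job environment; the function name set_threads is hypothetical.

```shell
# set_threads: make OpenMP programs use exactly the number of
# processors that were requested per node.
# Assumption: Torque exports PBS_NUM_PPN inside a job. Outside a job,
# or if the variable is missing, fall back to a single thread.
set_threads() {
    OMP_NUM_THREADS="${PBS_NUM_PPN:-1}"
    export OMP_NUM_THREADS
}

set_threads
echo "Using ${OMP_NUM_THREADS} thread(s)"
```

Note that OMP_NUM_THREADS only affects programs that use OpenMP for their threading.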

Avoid excessive reads and writes on your homedir

Some programs read and write a lot of data to and from your home directory. This is not very efficient: on the nodes your home directory is a network share, so access is relatively slow and it keeps the master node unnecessarily busy. If you expect that your job will do a lot of reading and writing to disk, you can use the local disk on the node instead, which is mounted on /var/tmp on all nodes. You can do this by adding a few extra lines to your job script, right before the line that starts the program in your job, for example:

TMP=/var/tmp/${PBS_JOBID}
mkdir -p ${TMP}
/usr/bin/rsync -vax "${PBS_O_WORKDIR}/" ${TMP}/
cd ${TMP}

Once your program is done you can copy the results back to your home directory and clean up by adding these two lines at the end of your job script:

/usr/bin/rsync -vax ${TMP}/ "${PBS_O_WORKDIR}/"
[ $? -eq 0 ] && /bin/rm -rf ${TMP}

This usually works best if you create a separate directory in your homedir, move the necessary files and the job script to it and run your job from there. Otherwise you would end up copying your entire home directory to the node for no good reason.
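The copy steps can also be wrapped in small functions, which makes the order of operations explicit: the local directory is removed only after the results have been copied back successfully. This is a sketch with hypothetical function names; it falls back to cp -a when rsync is not installed.

```shell
# copy_tree SRC DST: copy the contents of SRC into DST.
# Uses rsync when it is installed and falls back to cp -a otherwise.
copy_tree() {
    if command -v rsync >/dev/null 2>&1; then
        rsync -ax "$1/" "$2/"
    else
        cp -a "$1/." "$2/"
    fi
}

# stage_in WORKDIR SCRATCH: create SCRATCH and fill it from WORKDIR.
stage_in() {
    mkdir -p "$2" && copy_tree "$1" "$2"
}

# stage_out SCRATCH WORKDIR: copy the results back and remove SCRATCH
# only if the copy succeeded, so no results are lost on failure.
stage_out() {
    copy_tree "$1" "$2" && rm -rf "$1"
}

# Typical use in a job script:
#   stage_in "$PBS_O_WORKDIR" "/var/tmp/$PBS_JOBID" && cd "/var/tmp/$PBS_JOBID"
#   ... run the program ...
#   stage_out "/var/tmp/$PBS_JOBID" "$PBS_O_WORKDIR"
```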

Access to nodes

All nodes are independent Linux machines, and you could be tempted to log in to one of them and work from there. This is forbidden, however, and any attempt to log in to a node will fail. There is one exception: you can log in to a node if you have a job running on it. This way you can check on the progress of your job and see if things are still working as intended. To check which node runs your job, type:

qstat -u $USER -n1

This will get you a list of all your jobs; the last column shows the nodes in use. If you log in to a node, please do not run any additional CPU intensive programs, to avoid overutilization.
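The entry in the last column usually lists one item per processor, so the same node can appear several times. A minimal sketch that reduces such an entry to unique node names, assuming the node/slot+node/slot layout (for example n10-05/0+n10-05/1); the exact layout depends on the local Torque version, and the helper name job_nodes is hypothetical.

```shell
# job_nodes: read an execution-host string such as
# "n10-05/0+n10-05/1+n10-06/0" on stdin and print each node name once.
# Assumption: entries are separated by '+' and the node name is the
# part before the '/'.
job_nodes() {
    tr '+' '\n' | cut -d/ -f1 | sort -u
}

# Typical use: feed it the last column of your own job lines.
#   qstat -u "$USER" -n1 | awk -v u="$USER" '$2 == u { print $NF }' | job_nodes
```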

If you must log in to a node in order to run software that can not be run from a script, you can start an interactive job. This is done using the -I switch with qsub, like this:

qsub -I

As soon as a node is assigned to you (this may take a while), you'll get a new command line prompt, as if you had just logged in with ssh. This reserves only one processor, so if you start a CPU intensive program, take care that it does not use more than one processor. If you need more processors or want to use a specific node, you can request this for your interactive job with the -l switch. For example, to request 8 processors on node n10-08:

qsub -I -l nodes=n10-08:ppn=8

If you want to run a program with a graphical interface on a node, you'll need to make sure that X forwarding works when you are logged in to the master node. Then you can use the -X switch to start your interactive job with X forwarding enabled:

qsub -I -X

It is important to know that an interactive job can only be run in the normal queues. You cannot run an interactive job in the guest queue!

MPI jobs

Some workloads need OpenMPI to run, typically on two or more nodes at once. For such a job your job script usually contains a line like this:

module load mpi/openmpi-1.8.8-gnu

And your actual workload would start with mpirun or mpiexec like this:

#PBS -l nodes=2:ppn=20
module load mpi/openmpi-1.8.8-gnu
cd $PBS_O_WORKDIR
mpirun -n $PBS_NP whatever_workload_there_is
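In this example PBS_NP should equal nodes times ppn, so 2 x 20 = 40 processes. As a cross-check, Torque also writes the allocated hosts to the file named by PBS_NODEFILE, one line per processor. The sketch below counts them; the helper names count_slots and count_hosts are hypothetical.

```shell
# count_slots FILE: number of processor slots, one per non-empty line.
count_slots() {
    grep -c . "$1"
}

# count_hosts FILE: number of distinct nodes listed in FILE.
count_hosts() {
    sort -u "$1" | grep -c .
}

# Typical use inside a job script:
#   echo "Nodes: $(count_hosts "$PBS_NODEFILE"), slots: $(count_slots "$PBS_NODEFILE")"
```

If the slot count differs from what you requested, the mpirun line will start the wrong number of processes.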

OpenMPI uses rsh or ssh under the hood to communicate between the assigned nodes. In some cases this leads to "Host key verification failed" errors and a premature termination of your job. To prevent this, you need to prepare a few files in your home directory. You only have to do this once, on the master node.

First of all, if you have never done this before on the master node, generate an ssh private/public keypair:

ssh-keygen

Do not enter a passphrase; just press the enter key three times.

Next type (or copy/paste) these two commands:

cat ${HOME}/.ssh/id_rsa.pub >> ${HOME}/.ssh/authorized_keys
chmod go-rwx ${HOME}/.ssh/authorized_keys

And finally type (or copy/paste) these two lines:

HPC=$(hostname | cut -c 4-5) ; \
printf "host n${HPC}-* hpc${HPC}*\n\tStrictHostKeyChecking no\n\tUserKnownHostsFile /dev/null\n\tLogLevel QUIET\n" >> ${HOME}/.ssh/config

After the first line you will get a temporary > prompt; this is normal behaviour.
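Running the commands above a second time appends the same settings to ${HOME}/.ssh/config again. The sketch below produces the same four lines but adds them only when they are not there yet. The function names are hypothetical; the cluster number is taken from characters 4-5 of the hostname, as in the original commands.

```shell
# ssh_stanza HOSTNAME: print the ssh client settings for the cluster
# that HOSTNAME belongs to (for example hpc10 gives n10-* and hpc10*).
ssh_stanza() {
    hpc=$(printf '%s\n' "$1" | cut -c 4-5)
    printf 'host n%s-* hpc%s*\n' "$hpc" "$hpc"
    printf '\tStrictHostKeyChecking no\n'
    printf '\tUserKnownHostsFile /dev/null\n'
    printf '\tLogLevel QUIET\n'
}

# add_stanza_once HOSTNAME CONFIG: append the settings to CONFIG unless
# their first line is already present, so repeated runs are harmless.
add_stanza_once() {
    first=$(ssh_stanza "$1" | head -n 1)
    if ! grep -qxF "$first" "$2" 2>/dev/null; then
        ssh_stanza "$1" >> "$2"
    fi
}

# Typical use on the master node:
#   add_stanza_once "$(hostname)" "${HOME}/.ssh/config"
```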

A generic example of a job script

The script below can be used as a starting point to create your own jobs. Feel free to copy, paste and modify it to your needs. Lines starting with # will not be executed and contain useful information.

#!/bin/bash
#
# Torque directives (#PBS) must always be at the start of a job script!
#
# Request nodes and processors per node
#
#PBS -l nodes=1:ppn=1
#
#
# Set the name of the job
#
#PBS -N name_of_job
#
#
# Set the mail options (type 'man qsub' for more information)
#
#PBS -m bea
#
#
# Set the email address where you want notifications sent to
# By default mail will be sent to your TU Delft mailbox
# (replace $USER with your own netid: #PBS lines are not expanded by the shell)
#
#PBS -M $USER@mailboxcluster.tudelft.net
#
#
# Set the rerunable flag, 'n' is not rerunable, default is 'y'
#
#PBS -r y

# Make sure I'm the only one that can read my output
umask 0077

# create a temporary directory in /var/tmp
TMP=/var/tmp/${PBS_JOBID}
mkdir -p ${TMP}
echo "Temporary work dir: ${TMP}"
if [ ! -d "${TMP}" ]; then
    echo "Cannot create temporary directory. Disk probably full."
    exit 1
fi

# copy the input files to ${TMP}
echo "Copying from ${PBS_O_WORKDIR}/ to ${TMP}/"
/usr/bin/rsync -vax "${PBS_O_WORKDIR}/" ${TMP}/

cd ${TMP}

# load the modules your application needs

module load application1
module load application2

export OMP_NUM_THREADS=1

# Here is where the application is started on the node


# job done, copy everything back
echo "Copying from ${TMP}/ to ${PBS_O_WORKDIR}/"
/usr/bin/rsync -vax ${TMP}/ "${PBS_O_WORKDIR}/"

# delete my temporary files
[ $? -eq 0 ] && /bin/rm -rf ${TMP}