More about queues and nodes
The different queues
The larger hpc clusters, most notably hpc03, hpc06, hpc11 and hpc12, are shared by two or more research groups. On those clusters each group has its own queue, sometimes even more than one. These queues give exclusive and full access to a specific set of nodes.
There is also a guest queue on every hpc cluster that gives access to all nodes, but with some restrictions: you will not be able to run non-rerunable or interactive jobs.
In most cases, access to one of the queues is based on group membership in the Active Directory. If your netid is not a member of the right group, your jobs default to the guest queue when you submit them. If you have access to the group and bulk network shares of your research group, you should also have access to the normal queue on the hpc cluster. If not, contact the secretary of your research group and ask him/her to arrange the group membership of your netid.
You can check your default queue by submitting a small test job and then looking at the job list with the qstat command.
<pre>
[jsmith@hpc10 ~]$ echo "sleep 60" | qsub
[jsmith@hpc10 ~]$ qstat -u jsmith
</pre>
If you see anything other than guest in the third column, then you are all set.
There are two ways to select the guest queue:
With the <code>-q</code> switch on the command line:
qsub -q guest job1
Or with a directive at the start of your job script:
#PBS -q guest
It is important to know that a job in the guest queue can be interrupted and resumed at any time. You should make sure that the application in your job saves the intermediate results at regular intervals and that it knows how to continue when your job is resumed. If you neglect this, your job in the guest queue will start all over again every time it is interrupted and resumed.
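For example, a guest-queue job can wrap its work in a simple checkpoint-aware step. The sketch below is only an illustration: the program name solver, its options and the file checkpoint.dat are hypothetical placeholders for whatever restart mechanism your own application offers.
<pre>
# minimal sketch of a restartable guest-queue job (hypothetical program and options)
cd ${PBS_O_WORKDIR}
if [ -f checkpoint.dat ]; then
    # a previous run was interrupted: continue from the last saved state
    ./solver --restart checkpoint.dat
else
    # first run: start from scratch and save intermediate results regularly
    ./solver --checkpoint-file checkpoint.dat
fi
</pre>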
The different nodes
On most hpc clusters you'll find that the worker nodes are not all identical: different series of nodes exist, purchased at different times and with different specifications. To distinguish between the series, the nodes are labelled with properties like typea, typeb, typec, etc. On some hpc clusters, nodes have extra properties showing to which queue they belong or indicating additional features, like an infiniband network or extra memory compared to similar nodes.
A useful command that shows all nodes and how they are utilized is <code>LOCALnodeload.pl</code>. A typical output looks like this:
<pre>
[jsmith@hpc10 ~]$ LOCALnodeload.pl
Node       Np State/jobs  Load Properties
---------- -- ---------- ----- ----------
n10-01     12 12         12.01 typea
n10-02     12 free        0.00 typea
n10-03     12 free        0.00 typea
n10-04     12 free        0.00 typea
n10-05     16 12         11.93 typeb
n10-06     16 free        0.00 typeb
n10-07     16 offline     0.00 typeb
n10-08     16 down        0.00 typeb
</pre>
The first column (Node) shows the names of the nodes. The second column (Np) shows the total number of processors. The third column (State/jobs) shows the number of processors currently in use or the status of the node (free, offline or down). The fourth column (Load) shows the actual load on the nodes. In an ideal situation the load matches the number of processors in use. The last column (Properties) shows the properties as described above. As you can see in the example, typea nodes have 12 processors and typeb nodes have 16. Node n10-01 is fully occupied; node n10-05 is running one or more jobs but still has 4 processors free. Nodes n10-07 and n10-08 cannot be used.
Selecting nodes
If you submit a job, the scheduler automatically selects a node to run it. By default a job gets one node and one processor. You can manually select the number of processors and nodes for your job by using the <code>-l</code> switch with the <code>qsub</code> command. You can also select nodes by property. The <code>-l</code> switch works like this:
qsub -l nodes=<x>:ppn=<c>:<property>:<property>...
- <x> is either a number of nodes or the name(s) of the selected node(s)
- <c> is the number of processors per node
- <property> is any of the properties you see in the Properties column of the LOCALnodeload.pl command.
Examples:
<pre>
qsub -l nodes=4                Request 4 nodes of any type
qsub -l nodes=n10-07+n10-08    Request 2 specific nodes by hostname
qsub -l nodes=4:ppn=2          Request 2 processors on each of four nodes
qsub -l nodes=1:ppn=4          Request 4 processors on one node
qsub -l nodes=2:typea          Request 2 nodes with the typea property
</pre>
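The node count, ppn and properties can also be combined in a single request. For example, to ask for two typeb nodes with 16 processors each (job.sh is just a placeholder name for your job script):
qsub -l nodes=2:ppn=16:typeb job.sh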
Instead of using the -l or the -q switches on the command line when you submit your job with qsub, you can also add them as directives to your job script. For instance, if you add
<pre>
#PBS -l nodes=1:ppn=4
#PBS -q guest
</pre>
at the start of your script, you can just use
qsub job.sh
instead of
qsub -l nodes=1:ppn=4 -q guest job.sh
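Putting this together, a minimal job script with directives could look like the sketch below; my_program is a hypothetical placeholder for your own application (a more complete template follows at the end of this page).
<pre>
#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -q guest
# change to the directory the job was submitted from
cd ${PBS_O_WORKDIR}
# start the actual workload (placeholder name)
./my_program
</pre>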
Avoid over- and underutilization
An important thing to consider when you create your own job script is matching the number of processors that you request with the number of processors that the software in your script will actually use. It is possible that you request only one processor while your program uses all processors available on the node. This is called overutilization and is not very efficient when other jobs are already running on the same node and using the same processors.
It is also possible that you request several (or all) processors while your program uses only one. This leaves the other processors you claimed unused (underutilization), which is also not very efficient, because the processors you requested cannot be used for other jobs.
How to avoid over- and underutilization? Many programs have options that will let them use only one thread (utilization of only one processor) or a specific number of threads.
For example, Ansys has the <code>-np</code> switch:
ansys -np N
and Fluent has the <code>-t</code> switch:
fluent -tN
where N matches the number of processors that you request in your job.
If your program does not have an option to limit the number of processors, you can try adding this line to your job script, just before the line where your program starts:
export OMP_NUM_THREADS=N
Of course, N must match the number of processors that you request in your job. Alternatively, you could also request an entire node (all processors) in your job and let your program use all available resources of that node.
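Instead of hard-coding N, you can often derive it inside the job script. The sketch below assumes your Torque/PBS environment provides the usual $PBS_NODEFILE variable (a file with one line per assigned processor); check with your cluster documentation if in doubt.
<pre>
# derive the number of requested processors from the Torque node file
# (assumes $PBS_NODEFILE is set, which is standard for Torque/PBS jobs)
NP=$(wc -l < "${PBS_NODEFILE}")
export OMP_NUM_THREADS=${NP}
</pre>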
Avoid excessive reads and writes on your homedir
Some programs read and write a lot of data to and from your home directory. This is not very efficient: on the nodes your home directory is a network share, so access is relatively slow and it keeps the master node unnecessarily busy. If you expect that your job will do a lot of reading and writing to disk, you can use the local disk of the node instead, which is mounted on /var/tmp on all nodes. You can do this by adding a few extra lines to your job script, right before the line that starts the program in your job, for example:
<pre>
TMP=/var/tmp/${PBS_JOBID}
mkdir -p ${TMP}
/usr/bin/rsync -vax "${PBS_O_WORKDIR}/" ${TMP}/
cd ${TMP}
</pre>
Once your program is done you can copy the results back to your home directory and clean up by adding these two lines at the end of your job script:
<pre>
/usr/bin/rsync -vax ${TMP}/ "${PBS_O_WORKDIR}/"
[ $? -eq 0 ] && /bin/rm -rf ${TMP}
</pre>
This usually works best if you create a separate directory in your homedir, move the necessary files and the job script to it, and run your job from there. Otherwise you would end up copying your entire home directory to the node for no good reason.
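A minimal sketch of that workflow, run on the master node; the directory and file names (myjob, input.dat, job.sh) are hypothetical placeholders:
<pre>
# create a dedicated job directory in your home directory
mkdir -p ${HOME}/myjob
# move only the files the job needs, plus the job script itself
mv input.dat job.sh ${HOME}/myjob/
# submit the job from inside that directory, so PBS_O_WORKDIR points there
cd ${HOME}/myjob
qsub job.sh
</pre>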
Access to nodes
All nodes are independent Linux machines and you could be tempted to log in to one of the nodes and work from there. This is however forbidden; any attempt to log in to a node will fail. There is one exception: you can log in to a node if you have a job running on it. This way you can check on the progress of your job and see if things are still working as intended. To check which node runs your job, type:
qstat -u $USER -n1
This will get you a list of all your jobs; the last column shows the nodes in use. If you log in to a node, please do not run any additional CPU-intensive programs, to avoid overutilization.
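For example, if the qstat output shows that your job runs on n10-05, you can inspect it from the master node like this (the node name is just an example taken from the listing above):
<pre>
# log in to the node that runs your job (only works while your job is running there)
ssh n10-05
# have a look at your own processes and their CPU usage, then log out again
top -u ${USER}
exit
</pre>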
If you must log in to a node in order to run software that cannot be run from a script, you can start an interactive job. This is done using the <code>-I</code> switch with qsub, like this:
qsub -I
As soon as a node is assigned to you (this may take a while), you'll get a new command line prompt, as if you just logged in with ssh. This will reserve only one processor, so if you start a CPU-intensive program, take care that it does not use more than one processor. If you need more processors or want to use a specific node, you can request this for your interactive job with the <code>-l</code> switch. For example, to request 8 processors on node n10-08:
qsub -I -l nodes=n10-08:ppn=8
If you want to run a program with a graphical interface on a node, you'll need to make sure that X forwarding works when you are logged in to the master node. Then you can use the <code>-X</code> switch to start your interactive job with X forwarding enabled:
qsub -I -X
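Getting X forwarding to the master node working typically means connecting from your own machine with ssh's X forwarding enabled; the username and hostname in the sketch below are placeholders, use those of your own cluster:
<pre>
# on your own workstation: connect to the master node with X forwarding
# (replace the hostname with that of your cluster's master node)
ssh -X jsmith@hpc10.example.net
# then, on the master node, start an interactive job with X forwarding
qsub -I -X
</pre>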
It is important to know that an interactive job can only be run in the normal queues; you cannot run an interactive job in the guest queue!
MPI jobs
Some workloads need OpenMPI to run, typically on two or more nodes at once. For such a job your job script usually contains a line like this:
module load mpi/openmpi-1.8.8-gnu
And your actual workload would start with <code>mpirun</code> or <code>mpiexec</code>, like this:
<pre>
#PBS -l nodes=2:ppn=20
module load mpi/openmpi-1.8.8-gnu
cd $PBS_O_WORKDIR
mpirun -n $PBS_NP whatever_workload_there_is
</pre>
OpenMPI uses rsh or ssh under the hood to communicate between the assigned nodes; in some cases this leads to <code>Host key verification failed</code> errors and a premature termination of your job. To prevent this, you need to prepare a few files in your home directory. You only have to do this once, on the master node.
First of all, if you have never done this before on the master node, generate an ssh private/public keypair:
ssh-keygen
Do not enter a passphrase; just press the Enter key three times.
Next type (or copy/paste) these two commands:
<pre>
cat ${HOME}/.ssh/id_rsa.pub >> ${HOME}/.ssh/authorized_keys
chmod go-rwx ${HOME}/.ssh/authorized_keys
</pre>
And finally type (or copy/paste) these two lines:
<pre>
HPC=$(hostname | cut -c 4-5) ; \
printf "host n${HPC}-* hpc${HPC}*\n\tStrictHostKeyChecking no\n\tUserKnownHostsFile /dev/null\n\tLogLevel QUIET\n" >> ${HOME}/.ssh/config
</pre>
The first line will give you a temporary <code>></code> prompt; this is normal behaviour.
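For reference, on hpc10 the printf command above appends a block like the following to ${HOME}/.ssh/config (the numbers are taken from the hostname, so they differ per cluster):
<pre>
host n10-* hpc10*
	StrictHostKeyChecking no
	UserKnownHostsFile /dev/null
	LogLevel QUIET
</pre>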
A generic example of a job script
The script below can be used as a starting point to create your own jobs. Feel free to copy, paste and modify it to your needs. Lines starting with # are not executed by the shell; they contain useful information and the #PBS directives that are read by Torque.
<pre>
#!/bin/bash
#
# Torque directives (#PBS) must always be at the start of a job script!
#
# Request nodes and processors per node
#
#PBS -l nodes=1:ppn=1
#
#
# Set the name of the job
#
#PBS -N name_of_job
#
#
# Set the mail options (type 'man qsub' for more information)
#
#PBS -m bea
#
#
# Set the email address where you want notifications sent to
# By default mail will be sent to your TU Delft mailbox
#
#PBS -M $USER@mailboxcluster.tudelft.net
#
#
# Set the rerunable flag, 'n' is not rerunable, default is 'y'
#
#PBS -r y
# Make sure I'm the only one that can read my output
umask 0077
# create a temporary directory in /var/tmp
TMP=/var/tmp/${PBS_JOBID}
mkdir -p ${TMP}
echo "Temporary work dir: ${TMP}"
if [ ! -d "${TMP}" ]; then
  echo "Cannot create temporary directory. Disk probably full."
  exit 1
fi
# copy the input files to ${TMP}
echo "Copying from ${PBS_O_WORKDIR}/ to ${TMP}/"
/usr/bin/rsync -vax "${PBS_O_WORKDIR}/" ${TMP}/
cd ${TMP}
#
module load application1
module load application2
export OMP_NUM_THREADS=1
# Here is where the application is started on the node
# job done, copy everything back
echo "Copying from ${TMP}/ to ${PBS_O_WORKDIR}/"
/usr/bin/rsync -vax ${TMP}/ "${PBS_O_WORKDIR}/"
# delete my temporary files
[ $? -eq 0 ] && /bin/rm -rf ${TMP}
</pre>
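To use this template, save it as a file (for example job.sh, a name chosen here only for illustration), replace the placeholder application lines with your own commands, and submit it from the directory that contains your input files:
<pre>
# submit the job script and note the job id that qsub prints
qsub job.sh
# follow the state of your job (Q = queued, R = running, C = completed)
qstat -u $USER
</pre>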