More about queues and nodes
The different queues
The larger HPC clusters, most notably hpc03, hpc06, hpc11 and hpc12, are shared by two or more research groups. On those clusters every group has its own queue, sometimes even more than one. These queues give exclusive and full access to a specific set of nodes.
There is also a guest queue on every HPC cluster that gives access to all nodes, but with some restrictions: you will not be able to run non-rerunnable or interactive jobs.
In most cases, access to one of the queues is based on group membership in the Active Directory. If your netid is not a member of the right group, any job you submit defaults to the guest queue. If you have access to the group and bulk network shares of your research group, you should also have access to the normal queue on the HPC cluster. If not, contact the secretary of your research group and ask them to arrange the group membership for your netid.
You can check your default queue by submitting a small test job and then looking at the job list with the qstat command:
[jsmith@hpc10 ~]$ echo "sleep 60" | qsub
[jsmith@hpc10 ~]$ qstat -u jsmith
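The exact layout of the qstat output depends on the Torque/PBS version on the cluster, but it will look roughly like this (the job ID and times below are made up for illustration):

Job ID               Username Queue    Jobname  SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- -------- ------ --- --- ------ ----- - -----
12345.hpc10          jsmith   guest    STDIN      --     1   1    --  01:00 Q   --

In this example the test job ended up in the guest queue, so this netid does not (yet) have access to a group queue.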
If you see anything other than guest in the third column, then you are all set.
There are two ways to select the guest queue:
With the -q switch on the command line:
qsub -q guest job1
Or with a directive at the start of your job script:
#PBS -q guest
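To show where such a directive goes, here is a minimal sketch of a complete job script that selects the guest queue (the job name and the program are placeholders, not anything provided by the cluster):

#!/bin/bash
#PBS -q guest            # run this job in the guest queue
#PBS -N testjob          # placeholder job name

cd $PBS_O_WORKDIR        # start in the directory the job was submitted from
./myprogram              # placeholder for your own application

With the directive in the script, the -q switch on the qsub command line is no longer needed.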
It is important to know that a job in the guest queue can be interrupted and resumed at any time. Make sure that the application in your job saves intermediate results at regular intervals and knows how to continue from them when your job is resumed. If you neglect this, your job in the guest queue will start all over again every time it is interrupted and resumed.
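How to do this depends entirely on your application, but the usual pattern looks something like the sketch below. The restart file name and the program options are hypothetical; check the manual of your application for its actual checkpoint/restart mechanism.

#!/bin/bash
#PBS -q guest

cd $PBS_O_WORKDIR

if [ -f restart.dat ]; then
    # A checkpoint from an earlier, interrupted run exists: continue from it.
    ./myprogram --restart restart.dat
else
    # First run: start from scratch and write a checkpoint at regular intervals.
    ./myprogram --checkpoint-file restart.dat
fi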
The different nodes
On most HPC clusters the worker nodes are not all identical: different series of nodes exist, purchased at different times and with different specifications. To distinguish between these series, the nodes are labelled with properties like typea, typeb, typec, etc. On some HPC clusters, nodes have extra properties showing to which queue they belong, or showing additional features such as an InfiniBand network or extra memory compared to similar nodes.
A useful command that shows all nodes and how they are utilized is LOCALnodeload.pl. A typical output looks like this:

[jsmith@hpc10 ~]$ LOCALnodeload.pl
Node       Np State/jobs Load  Properties
---------- -- ---------- ----- ----------
n10-01     12         12 12.01 typea
n10-02     12       free  0.00 typea
n10-03     12       free  0.00 typea
n10-04     12       free  0.00 typea
n10-05     16         12 11.93 typeb
n10-06     16       free  0.00 typeb
n10-07     16    offline  0.00 typeb
n10-08     16       down  0.00 typeb
The first column (Node) shows the names of the nodes. The second column (Np) shows the total number of CPU slots. The third column (State/jobs) shows the number of CPU slots currently in use, or the status of the node (free, offline or down). The fourth column (Load) shows the actual load on the node; in an ideal situation the load matches the number of CPU slots in use. The last column (Properties) shows the properties described above. As you can see in the example, typea nodes have 12 CPU slots and typeb nodes have 16. Node n10-01 is fully occupied, node n10-05 is running one or more jobs but still has 4 CPU slots free, and nodes n10-07 and n10-08 cannot be used.
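Because LOCALnodeload.pl writes plain text, you can combine it with standard tools to answer quick questions. For example, to list the free typeb nodes (assuming the output format shown above):

[jsmith@hpc10 ~]$ LOCALnodeload.pl | grep typeb | grep free
n10-06     16       free  0.00 typeb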
Selecting nodes
If you submit a job, the scheduler automatically selects a node to run it. You can manually select on which node your job runs by using the -l switch with the qsub command, like this:
qsub -l nodes=<x>:ppn=<c>:<property>:<property>...
- <x> is either the number of nodes or the name(s) of the selected node(s)
- <c> is the number of CPU slots per node
- <property> is any of the properties shown in the Properties column of the LOCALnodeload.pl output
Examples:
qsub -l nodes=4                  Request 4 nodes of any type
qsub -l nodes=n10-07+n10-08      Request 2 specific nodes by hostname
qsub -l nodes=4:ppn=2            Request 2 processors on each of 4 nodes
qsub -l nodes=1:ppn=4            Request 4 processors on one node
qsub -l nodes=2:typea            Request 2 nodes with the typea property
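Just like the queue, the node selection can also be placed in the job script as a directive instead of on the command line. A minimal sketch, where the requested node count, CPU slots and property are only examples:

#!/bin/bash
#PBS -q guest
#PBS -l nodes=1:ppn=4:typeb    # one typeb node, 4 CPU slots

cd $PBS_O_WORKDIR              # directory the job was submitted from
./myprogram                    # placeholder for your own application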