How to run a job

Latest revision as of 17:12, 1 March 2017

First example

We'll have a look at a detailed example of a very simple job. The job script for this example looks like this:

#!/bin/sh
#
#PBS -l nodes=1:ppn=1
#
cd $PBS_O_WORKDIR
sleep 120
echo "Hello world!"

This job does nothing for two minutes and then prints "Hello world!". The line starting with #PBS is a directive for Torque (the resource manager). In this example it tells Torque to use one CPU (ppn=1) on one node (nodes=1) for this job. A script can contain more than one directive, but they should always be at the start of the script. The remaining lines in this example are just commands that you could type on the command line.

Assuming that the name of the script is job1, it can be submitted like this:

qsub job1

The qsub command responds with a job id, which looks like this: 24.hpc10.hpc. The part before the first dot is a system-wide unique number that is increased by 1 for every new job that is submitted.
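
As a small illustration of the job id format, the sketch below splits an id at the first dot using standard shell parameter expansion. The variable names jobid, jobnum and server are made up for this example, the id is hard-coded to the sample value from the text, and the assumption that the part after the first dot names the server is a guess based on that sample.

```shell
#!/bin/sh
# Split a job id such as "24.hpc10.hpc" into its two parts.
# In a real session the id could be captured with: jobid=$(qsub job1)
jobid="24.hpc10.hpc"    # sample value, hard-coded for illustration

jobnum="${jobid%%.*}"   # remove everything from the first dot onward
server="${jobid#*.}"    # remove everything up to and including the first dot

echo "job number: $jobnum"
echo "server:     $server"
```

The job number obtained this way is the same number that later appears in the names of the output files.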

You can check whether your job is running with the qstat command. By itself, this command gives a list of all jobs; the job you just submitted will probably be near the bottom of the list. For a wider list with slightly more information, type qstat -a:

[jsmith@hpc10:~]$ qstat -a

hpc10.hpc: 
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
24.hpc10.hpc            jsmith      q1       job1              17395     1      1    --        --  R       -- 
[jsmith@hpc10:~]$ 

The second-to-last column (S) tells you the state of your job: a capital Q means that your job is waiting for a free worker node, and a capital R means that it is running. If you only want to see the job you just submitted, type qstat -a followed by the job id:
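
If you want to read the state from a script, the following is a minimal sketch that picks the second-to-last column out of a single job line with awk. The function name job_state is hypothetical, and the input is the sample line from the listing above rather than live qstat output, so the sketch runs without access to the cluster.

```shell
#!/bin/sh
# job_state: print the state letter (second-to-last column) of one
# job line from `qstat -a`, read on standard input.
job_state() {
    awk '{ print $(NF-1) }'
}

# Sample job line copied from the listing above. On the cluster the
# line would instead come from the output of: qstat -a <job id>
line='24.hpc10.hpc            jsmith      q1       job1              17395     1      1    --        --  R       --'

state=$(printf '%s\n' "$line" | job_state)

case "$state" in
    Q) echo "job is waiting for a free worker node" ;;
    R) echo "job is running" ;;
    *) echo "job is in state $state" ;;
esac
```

Counting from the end of the line keeps the sketch independent of how many columns precede the state, as long as the elapsed time is the only column after it.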

qstat -a 24.hpc10.hpc

Once the commands in your job script have finished, your job is terminated automatically. You can also terminate the job yourself with the qdel command, like this:

qdel 24.hpc10.hpc

Of course, a job that is terminated prematurely will not give you the results you were expecting.

As long as your job is running, there will be two temporary files in your home directory, such as 24.hpc10.hpc.OU and 24.hpc10.hpc.ER; these names correspond to the job id. The output and error messages that would normally be printed on the command line are redirected to these temporary files. Once your job has finished, these files are renamed and moved to the directory where you were when the job was submitted. The new names are the name of your job script, followed by a dot and the letter o or e, plus the first part of the job id. The example job script above would, if executed correctly, leave two files: job1.o24, which would contain the text 'Hello world!', and job1.e24, which would be empty.
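
The naming rule for the final files can be written down as a short shell function. This is only a sketch of the rule as described above; the function name output_names is made up for this example.

```shell
#!/bin/sh
# output_names SCRIPT JOBID: print the names of the final output and
# error files for a job script and its job id.
output_names() {
    script=$1
    jobnum="${2%%.*}"        # first part of the job id, e.g. 24
    echo "$script.o$jobnum"  # standard output of the job
    echo "$script.e$jobnum"  # error messages of the job
}

output_names job1 24.hpc10.hpc
```

For the example job this prints job1.o24 and job1.e24, one per line.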

Example with Matlab

Matlab (Matrix Laboratory) is an interactive numerical whiteboard and scripting tool for prototyping algorithms, performing numerical analysis, and visualizing data.

The cluster is most effectively used in batch mode. This requires disabling Matlab's default GUI features and display capabilities.

A simple example job for a single-core Matlab run in batch mode:

#!/bin/sh
#
#PBS -l nodes=1
#
module load matlab
cd $PBS_O_WORKDIR
matlab -nosplash -nodisplay -nojvm -singleCompThread < pseudoinv.m