Monitoring jobs

You’ll feel a bit helpless submitting a job and then just waiting for the output to appear, so it’s important to be able to track and monitor the jobs you’ve submitted to the scheduler. The most useful command for tracking the status of submitted jobs is qstat. Try running the following:

>
qsub sim 5
qstat

You’ll see output like the following:

(Output)
job-ID  prior   name       user         state submit/start at     queue  slots
------------------------------------------------------------------------------
<...more...>
1007632 0.50118 Ets1.sh    gaolong      qw    02/19/2014 17:17:38        4
1007633 0.50118 Mxi1.sh    gaolong      qw    02/19/2014 17:17:38        4
1022907 0.00000 sim        pbreheny     qw    02/20/2014 07:42:40        1

Assuming you ran qstat immediately after submitting your job, your job will be at the end of the list, and its ‘state’ will be ‘qw’, which stands for ‘queued and waiting’. This means that the batch scheduler is still deciding where to run the job and allocating resources for it. After a few seconds, your job will transition to a state of ‘r’, meaning ‘running’. For the most part, these are the only two states you will see, although you may happen to catch your job in a ‘t’ state, meaning that it is in the process of being transferred to a compute node. There are also various error states that your job could enter, such as ‘Eqw’. If you see anything other than ‘qw’, ‘r’, or ‘t’, it’s an indication that something has gone wrong.
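
The rule of thumb above (anything other than ‘qw’, ‘r’, or ‘t’ deserves a look) can be sketched as a small filter over qstat output. This is only an illustration: the function name unusual_jobs is made up, and it assumes the default layout shown above, with two header lines and the state in the fifth whitespace-separated column.

```shell
# Print "job-ID state" for every job whose state is not qw, r, or t.
# Reads qstat output on standard input. Assumes the default qstat layout:
# two header lines, then one line per job with the state in column 5.
# (unusual_jobs is a hypothetical helper name, not part of the scheduler.)
unusual_jobs() {
    awk 'NR > 2 && NF >= 5 && $5 != "qw" && $5 != "r" && $5 != "t" {
             print $1, $5
         }'
}

# Possible usage on the cluster:
#   qstat | unusual_jobs
```

If the function prints nothing, every listed job is queued, running, or being transferred.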

Running qstat with no options will tell you about all the jobs currently submitted to the entire Neon cluster. This is sometimes useful to see, but it is typically a bit of information overload. To see only your jobs, you can run:

>
qstat -u pbreheny

(Replace pbreheny with your own hawkID.) Perhaps more helpful, assuming you’re planning to use the TUKEY queue, is to see all the jobs submitted to that queue:

>
qstat -q TUKEY
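
Rather than rerunning qstat -u by hand, you may want a script to pause until your jobs have left the queue. The following is a minimal sketch, not a recommended tool: wait_for_jobs is a hypothetical name, the 30-second default interval is arbitrary, and it relies on the assumption that qstat -u prints nothing at all when the user has no jobs.

```shell
# Block until the given user has no jobs left in the queue.
# Arguments: user name, then an optional polling interval in seconds.
# Assumption: `qstat -u USER` produces no output when USER has no jobs.
# (wait_for_jobs is a hypothetical helper name.)
wait_for_jobs() {
    user=$1
    interval=${2:-30}
    while [ -n "$(qstat -u "$user")" ]; do
        sleep "$interval"
    done
}

# Possible usage on the cluster:
#   wait_for_jobs pbreheny 60 && echo "All jobs finished"
```

Keep the interval reasonably long, since every poll is a request to the scheduler.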

One final monitoring command that may be useful is qhost, which provides more information about the processor and memory usage of specific hosts. Obviously, the specific hosts we would be interested in are the machines belonging to the TUKEY queue, which can be queried as follows:

>
qhost -h neon-mm-compute-5-25
qhost -h neon-mm-compute-8-17

In particular, this command is useful as a way to see if TUKEY is running out of memory.

Deleting jobs: qdel

For various reasons, you may wish to delete a job from the queue (usually, because you realize there is a mistake in your code or your qsub command). By running qstat, you can learn the ID for the job you submitted (1022907 in the example above). To kill it, simply use the qdel command:

>
qdel 1022907

You will then get a message confirming that the job has been deleted.
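
If you have submitted many jobs under the same name, looking up each ID by hand gets tedious. As a sketch only, the lookup can be scripted by pulling the job IDs out of qstat output. The function name job_ids_named is made up, and it assumes the default layout shown earlier (two header lines, job ID in the first column, name in the third). Note that the name qstat displays may be truncated, so long names may not match exactly.

```shell
# Print the job IDs of all jobs with the given name.
# Reads qstat output on standard input. Assumes the default qstat layout:
# two header lines, job ID in column 1, job name in column 3.
# (job_ids_named is a hypothetical helper name.)
job_ids_named() {
    awk -v name="$1" 'NR > 2 && $3 == name { print $1 }'
}

# Possible usage on the cluster: review the IDs first, then delete them.
#   qstat -u pbreheny | job_ids_named sim
#   qstat -u pbreheny | job_ids_named sim | xargs qdel
```

Printing the IDs before piping them to qdel gives you a chance to check that only the intended jobs will be removed.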