Monitoring jobs
You’ll feel a bit helpless submitting a job and then just waiting for the output to appear, so it’s important to be able to track and monitor the jobs you’ve submitted to the scheduler. The most useful command for tracking the status of submitted jobs is qstat
. Try submitting the following:
qsub sim 5 qstat
You’ll see output like the following:
job-ID prior name user state submit/start at queue slots ------------------------------------------------------------------------------ <...more...> 1007632 0.50118 Ets1.sh gaolong qw 02/19/2014 17:17:38 4 1007633 0.50118 Mxi1.sh gaolong qw 02/19/2014 17:17:38 4 1022907 0.00000 sim pbreheny qw 02/20/2014 07:42:40 1
Assuming you submitted qstat
immediately after submitting your job, you’ll be at the end of the list, and your ‘state’ will be ‘qw’, which stands for ‘queued and waiting’. This means that the batch scheduler is still in the process of deciding where to run the job and allocating resources to do so. After a few seconds, your job will transition to a state of ‘r’, meaning ‘running’. For the most part, these are the only two states you will see, although you may happen to catch your job in a ‘t’ state, meaning that it is in the process of transitioning to a compute node. There are also various error states that your job could enter if something goes wrong, such as ‘Eqw’ – if you see anything other than ‘qw’, ‘r’, or ‘t’, it’s an indication that something has gone wrong.
Running qstat
with no options will tell you about all the jobs currently submitted to the entire Neon cluster. This is sometimes useful to see, but typically a bit of information overload. To see only your jobs, you can submit:
qstat -u pbreheny
(obviously, replacing pbreheny
with your own hawkID). Perhaps more helpful, assuming you’re planning to use the TUKEY queue, is to see all the jobs submitted to that queue:
qstat -q TUKEY
One final monitoring command that may be useful is qhost
, which provides more
information about the processor and memory usage of specific hosts. Obviously,
the specific hosts we would be interested in are the machines belonging to the
TUKEY queue, which can be queried as follows:
qhost -h neon-mm-compute-5-25 qhost -h neon-mm-compute-8-17
In particular, this command is useful as a way to see if TUKEY is running out of memory.
Deleting jobs: qdel
For various reasons, you may wish to delete a job from the queue (usually,
because you realize there is a mistake in your code or your qsub
command). By
running qstat
, you can learn the ID for the job you submitted (1022907 in the
example above). To kill it, simply use the qdel
command:
qdel 1022907
You will then get a confirmation message confirming that you have deleted the job.