Hello,
I have recently started using StarCluster to schedule my jobs on EC2.
I am using the following command to run my jobs:
qsub -N MultiMatlab -pe orte 8 -e /home/ubuntu/outputs/ -o /home/ubuntu/outputs/ -j y <job to run>
where <job to run> is a compiled MATLAB binary (generated using
http://www.mathworks.com/help/toolbox/compiler/mcc.html and run using
http://www.mathworks.com/products/compiler/mcr/index.html) that
internally uses the MATLAB 'parfor' construct
(http://www.mathworks.com/help/distcomp/parfor.html).
Although I am able to schedule my jobs successfully, many of them
crash after running for some time on the nodes. (I have made sure
that I am not exceeding the memory/CPU resources available on each
node.)
I have included the qstat output from before and after job 3 on node002
crashed. (In this case, I started four near-identical jobs, one on each
of the four nodes.)
The output of "qstat -explain a" (also included below) reports the
following error for node002: 'error: no value for "np_load_avg" because
execd is in unknown state'.
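As a rough illustration of how the affected queue instance could be picked out automatically, the sketch below filters "qstat -f" text for queue instances whose states column contains 'a' (alarm) or 'u' (unknown). The function name flag_bad_queues is hypothetical, and the filter assumes the six-column layout of the qstat output included further down in this message.

```shell
# Print "<queue instance> <states>" for every queue instance whose states
# column contains 'a' (alarm) or 'u' (unknown).
# Reads `qstat -f` text on stdin. Assumes the layout shown in this post:
#   queuename qtype resv/used/tot. load_avg arch [states]
# flag_bad_queues is a hypothetical helper name, not part of SGE.
flag_bad_queues() {
    awk '$3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ && NF >= 6 && $6 ~ /[au]/ { print $1, $6 }'
}

# Hypothetical usage on a live cluster:
#   qstat -f | flag_bad_queues
```

Job lines and the header are skipped because their third field is not a resv/used/tot triple such as 0/8/8.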
I have tried modifying the queue configuration using "qconf -mq"
(setting np_load_avg=11.75 instead of the default 1.75), but to no
avail. I have included the qconf output below.
Are there any suggestions for fixing this issue? Please let me know.
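For reference, np_load_avg is normally the host's load_avg divided by its number of processors. The sketch below works through that arithmetic for node001, assuming 8 cores per node. That core count is an assumption based on the 8 slots configured per node and is not confirmed anywhere in this message. The function name is hypothetical.

```shell
# np_load_avg = load_avg / number of processors.
# The processor count (second argument) is an assumption in this example.
np_load_avg() {
    awk -v load="$1" -v nproc="$2" 'BEGIN { printf "%.2f\n", load / nproc }'
}

# node001's load_avg from the qstat output, assuming 8 cores
np_load_avg 35.28 8
```

Under that assumption the normalized value is above the default threshold of 1.75 but below the modified threshold of 11.75.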
[21:55:49Fri Nov 09~]qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/8          35.28    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/8          31.54    linux-x64
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/8          37.81    linux-x64
      4 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:01     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/8          21.15    linux-x64
      5 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:16     8
[21:55:51Fri Nov 09~]qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/8          35.07    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/8          -NA-     linux-x64     au
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/8          38.70    linux-x64
      4 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:01     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/8          20.34    linux-x64
      5 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:16     8
[21:56:46Fri Nov 09~]qstat -explain a
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/8/8          35.07    linux-x64
      2 0.55500 MultiMatla ubuntu       r     11/09/2012 21:39:31     8
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/8/8          -NA-     linux-x64     au
 error: no value for "np_load_avg" because execd is in unknown state
      3 0.55500 MultiMatla ubuntu       r     11/09/2012 21:40:46     8
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/8/8          38.70    linux-x64
      4 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:01     8
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/8/8          20.34    linux-x64
      5 0.55500 MultiMatla ubuntu       r     11/09/2012 21:41:16     8
root@master:~# qconf -mq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=11.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make orte
rerun                 FALSE
slots                 1,[node001=8],[node002=8],[node003=8],[node004=8]
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
Thanks,
~Santosh
Received on Fri Nov 09 2012 - 17:49:06 EST