Thanks Ron, that was helpful. It looks like the NFS share of my EBS volume
isn't being set up correctly when using addnode. In my config file, I've
set MOUNT_PATH=/data, yet when I use addnode:
>>> Configuring NFS exports path(s):
/home
Sure enough, changes I make under /home/danp on the master node show up on
node001, whereas /data does not exist on node001 at all.
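For anyone who wants to run the same check, here is a rough Python sketch of how I'd compare the master's exports against the paths I expect. It assumes the master's /etc/exports reflects what addnode configured. The function names and the sample export line are hypothetical illustrations, not part of StarCluster, and the parsing ignores quoted paths and line continuations.

```python
def exported_paths(exports_text):
    """Return the directory paths listed in /etc/exports-style text."""
    paths = []
    for line in exports_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        paths.append(line.split()[0])  # first field is the exported directory
    return paths


def missing_exports(exports_text, expected):
    """Return the expected paths that do not appear among the exports."""
    exported = set(exported_paths(exports_text))
    return [path for path in expected if path not in exported]


if __name__ == "__main__":
    # Hypothetical sample line; on a real cluster, read /etc/exports on master.
    sample = "/home node001(async,no_root_squash,no_subtree_check,rw)\n"
    print(missing_exports(sample, ["/home", "/data"]))
```

With only /home exported, as in the addnode output above, /data is the path reported as missing.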
Should I create a bug report for this, or is this message sufficient?
Dan
On Mon, Nov 26, 2012 at 2:20 PM, Ron Chen <ron_chen_123_at_yahoo.com> wrote:
> Do you have the danp user on node001?
>
> Also, you should check the execd's messages file
> ($SGE_ROOT/default/spool/<host>/messages) to find out why the job caused
> errors.
>
> http://gridscheduler.sourceforge.net/howto/troubleshooting.html
>
>  -Ron
>
>
>
> ________________________________
> From: Daniel Polhamus <danp_at_metrumrg.com>
> To: starcluster_at_mit.edu
> Sent: Monday, November 26, 2012 1:55 PM
> Subject: [StarCluster] Addnode SGE problem
>
>
> Hi all,
>
> I've run into a problem with "addnode" that I'm having a difficult time
> diagnosing. Using the development version of StarCluster, when I issue a
> starcluster addnode, the nodes added to the resulting cluster are unusable
> -- they result in SGE errors. Jobs run on the master node, but any nodes
> I've added are broken. If, however, I start the cluster with multiple
> nodes, then the resulting nodes are all usable (so it's not a user code issue).
> I have a hunch that this is due to the fact that we have several users
> working under the same account (as different AWS IAM users) and we are not
> all on the same StarCluster version. To be clear, we are all on varying
> stages of the development version (0.9999). Where do I begin debugging
> this? The hosts file seems to be set up correctly (see output below).
>
> Thanks,
> Dan
>
> danp@master:~$ cat /etc/hosts
> 127.0.0.1 localhost
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
> 10.196.149.155 master
> 10.226.219.58 node001
>
> And here's what the errors look like:
>
> danp@master:~$ qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@master                   BIP   0/0/8          1.29     lx24-amd64
> ---------------------------------------------------------------------------------
> all.q@node001                  BIP   0/0/8          0.70     lx24-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>       1 0.55500 postList   danp         Eqw   11/26/2012 16:54:38     1
>       3 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>       5 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>       7 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>       8 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>       9 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      10 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      11 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      13 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      15 0.55500 postList   danp         Eqw   11/26/2012 16:54:41     1
>
-- 
Daniel G Polhamus, PhD
Metrum Research Group, LLC
2 Tunxis Rd, Suite 112
Tariffville, CT 06081
(888) 308-7049 ext 403
Received on Mon Nov 26 2012 - 14:45:12 EST
 