Hi Rayson & Justin,
Attached please find the crash report generated by loadbalance, along with
the output of qhost -xml run on the master node. Hopefully these provide a
clue as to what went wrong.
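As a side note, the ExpatError in the quoted traceback (syntax error at
line 1, column 0) is consistent with the parser being handed text that does
not start with XML, for example an error message printed when qhost exits
with status 1. A minimal Python sketch of a guard follows. The function name
parse_qhost_output and its arguments are hypothetical and not part of
StarCluster; it assumes the caller has both the exit status and the captured
output of the command available.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parse_qhost_output(exit_status, output):
    """Return a parsed DOM for `qhost -xml` output, or None if unusable.

    Hypothetical helper for illustration only, not StarCluster's own code.
    """
    # A non-zero exit status means qhost itself failed, so whatever it
    # printed is probably an error message rather than XML.
    if exit_status != 0:
        return None
    try:
        return xml.dom.minidom.parseString(output)
    except ExpatError:
        # The output was not well-formed XML; report that to the caller
        # instead of letting the exception propagate.
        return None
```

A caller could treat a None result as "no stats this round" and try again on
the next polling interval rather than crashing.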
Thanks for the help!
-Wei
On Wed, Jan 11, 2012 at 1:39 PM, Rayson Ho <raysonlogin_at_yahoo.com> wrote:
> The XML parser does not like the output of "qhost -xml". (We changed some
> minor XML output code in Grid Scheduler recently,
> but as you have encountered this before in earlier versions, looks like
> our changes are not the cause of this issue.)
>
>
> I just started a 1-node cluster and let the loadbalancer add another node,
> and it all seemed to work fine. From the error message
> in your email, qhost exited with 1, and a number of things can cause qhost
> to exit with code 1.
>
>
> Can you run the following command from an interactive shell on one of the
> nodes on EC2 when you encounter this problem again?
>
> % qhost -xml
>
> And then send us the output. It can be an issue related to how the XML is
> generated in Grid Engine/Grid Scheduler, or it can be
> something else in the XML parser.
>
> Rayson
>
> =================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
>
> ________________________________
> From: Wei Tao <wei.tao_at_tsibiocomputing.com>
> To: starcluster_at_mit.edu
> Sent: Wednesday, January 11, 2012 10:01 AM
> Subject: [StarCluster] loadbalance error
>
>
> Hi all,
>
> I was running loadbalance. After a while, I got the following error. Can
> someone shed some light on this? This happened before with earlier versions
> of StarCluster as well.
>
> >>> Loading full job history
> !!! ERROR - command 'source /etc/profile && qhost -xml' failed with status 1
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py", line 251, in main
>     sc.execute(args)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py", line 89, in execute
>     lb.run(cluster)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 583, in run
>     if self.get_stats() == -1:
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 529, in get_stats
>     self.stat.parse_qhost(qhostxml)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 49, in parse_qhost
>     doc = xml.dom.minidom.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
>     return expatbuilder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
>     return builder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
>     parser.Parse(string, True)
> ExpatError: syntax error: line 1, column 0
>
> ---------------------------------------------------------------------------
> MemoryError                               Traceback (most recent call last)
>
> /usr/local/bin/starcluster in <module>()
>       7 if __name__ == '__main__':
>       8     sys.exit(
> ----> 9         load_entry_point('StarCluster==0.93', 'console_scripts', 'starcluster')()
>      10     )
>      11
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main()
>     306     logger.configure_sc_logging()
>     307     warn_debug_file_moved()
> --> 308     StarClusterCLI().main()
>     309
>     310 if __name__ == '__main__':
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main(self)
>     283             log.debug(traceback.format_exc())
>     284             print
> --> 285             self.bug_found()
>     286
>     287
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in bug_found(self)
>     150         crashfile = open(static.CRASH_FILE, 'w')
>     151         crashfile.write(header % "CRASH DETAILS")
> --> 152         crashfile.write(session.stream.getvalue())
>     153         crashfile.write(header % "SYSTEM INFO")
>     154         crashfile.write("StarCluster: %s\n" % __version__)
>
> /usr/lib/python2.6/StringIO.pyc in getvalue(self)
>     268         """
>     269         if self.buflist:
> --> 270             self.buf += ''.join(self.buflist)
>     271             self.buflist = []
>     272         return self.buf
>
> MemoryError:
>
> Thanks!
>
> -Wei
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
-- 
Wei Tao, Ph.D.
TSI Biocomputing LLC
617-564-0934
Received on Tue Jan 17 2012 - 18:21:37 EST
 