That looks normal. Can you please send qstat, qacct, and qhost output from
when you're seeing the problem? Thanks.
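
If it helps, something like this run on the master should capture all three in
one file (just a rough sketch; the output file name is a placeholder, and
qacct -j with no job id dumps the whole accounting history, so it can take a
while on a long-running cluster):

    # run on the master node; the output file name is arbitrary
    {
      echo '===== qstat ====='; qstat
      echo '===== qhost ====='; qhost
      echo '===== qacct ====='; qacct -j
    } > sge_diagnostics.txt
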
On Wed, Sep 18, 2013 at 2:47 PM, Ryan Golhar <ngsbioinformatics_at_gmail.com> wrote:
> I've since terminated the cluster and am experimenting with a different
> setup, but here's the output from qstat and qhost:
>
> ec2-user@master:~$ qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>       4 0.55500 j1-00493-0 ec2-user     r     09/18/2013 17:38:44 all.q@node001                      8
>       6 0.55500 j1-00508-0 ec2-user     r     09/18/2013 17:45:44 all.q@node002                      8
>       7 0.55500 j1-00525-0 ec2-user     r     09/18/2013 17:46:29 all.q@node003                      8
>       8 0.55500 j1-00541-0 ec2-user     r     09/18/2013 17:54:59 all.q@node004                      8
>       9 0.55500 j1-00565-0 ec2-user     r     09/18/2013 17:55:44 all.q@node005                      8
>      10 0.55500 j1-00596-0 ec2-user     r     09/18/2013 17:58:59 all.q@node006                      8
>      11 0.55500 j1-00604-0 ec2-user     r     09/18/2013 18:05:14 all.q@node007                      8
>      12 0.55500 j1-00625-0 ec2-user     r     09/18/2013 18:05:14 all.q@node008                      8
>      13 0.55500 j1-00650-0 ec2-user     r     09/18/2013 18:05:14 all.q@node009                      8
>      18 0.55500 j1-00734-0 ec2-user     r     09/18/2013 18:07:29 all.q@node010                      8
>      19 0.55500 j1-00738-0 ec2-user     r     09/18/2013 18:16:59 all.q@node011                      8
>      20 0.55500 j1-00739-0 ec2-user     r     09/18/2013 18:16:59 all.q@node012                      8
>      21 0.55500 j1-00770   ec2-user     r     09/18/2013 18:16:59 all.q@node013                      8
>      22 0.55500 j1-00806-0 ec2-user     r     09/18/2013 18:16:59 all.q@node014                      8
>      23 0.55500 j1-00825-0 ec2-user     r     09/18/2013 18:16:59 all.q@node015                      8
>      24 0.55500 j1-00826-0 ec2-user     r     09/18/2013 18:16:59 all.q@node016                      8
>      25 0.55500 j1-00846-0 ec2-user     r     09/18/2013 18:16:59 all.q@node017                      8
>      26 0.55500 j1-00847-0 ec2-user     r     09/18/2013 18:16:59 all.q@node018                      8
>      27 0.55500 j1-00913   ec2-user     r     09/18/2013 18:16:59 all.q@node019                      8
>      28 0.55500 j1-00914-0 ec2-user     r     09/18/2013 18:16:59 all.q@node020                      8
>      29 0.55500 j1-00914   ec2-user     r     09/18/2013 18:26:29 all.q@node021                      8
>      30 0.55500 j1-00922   ec2-user     r     09/18/2013 18:26:29 all.q@node022                      8
>      31 0.55500 j1-00977   ec2-user     r     09/18/2013 18:26:29 all.q@node023                      8
>      32 0.55500 j1-00984-0 ec2-user     r     09/18/2013 18:26:29 all.q@node024                      8
>      33 0.55500 j1-00984   ec2-user     r     09/18/2013 18:26:29 all.q@node025                      8
>      34 0.55500 j1-00998-0 ec2-user     r     09/18/2013 18:26:29 all.q@node026                      8
>      35 0.55500 j1-01010-0 ec2-user     r     09/18/2013 18:26:29 all.q@node027                      8
>      36 0.55500 j1-01019-0 ec2-user     r     09/18/2013 18:26:29 all.q@node028                      8
>      37 0.55500 j1-01025-0 ec2-user     r     09/18/2013 18:26:29 all.q@node029                      8
>      38 0.55500 j1-01026-0 ec2-user     r     09/18/2013 18:26:29 all.q@node030                      8
>
> ec2-user@master:~$ qhost
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> node001                 linux-x64       8  7.74    6.8G    3.8G     0.0     0.0
> node002                 linux-x64       8  7.93    6.8G    3.7G     0.0     0.0
> node003                 linux-x64       8  7.68    6.8G    3.7G     0.0     0.0
> node004                 linux-x64       8  7.86    6.8G    3.8G     0.0     0.0
> node005                 linux-x64       8  7.87    6.8G    3.7G     0.0     0.0
> node006                 linux-x64       8  7.66    6.8G    3.7G     0.0     0.0
> node007                 linux-x64       8  0.01    6.8G  564.8M     0.0     0.0
> node008                 linux-x64       8  0.01    6.8G  493.6M     0.0     0.0
> node009                 linux-x64       8  0.02    6.8G  564.4M     0.0     0.0
> node010                 linux-x64       8  7.85    6.8G    3.7G     0.0     0.0
> node011                 linux-x64       8  7.53    6.8G    3.7G     0.0     0.0
> node012                 linux-x64       8  7.57    6.8G    3.6G     0.0     0.0
> node013                 linux-x64       8  7.71    6.8G    3.7G     0.0     0.0
> node014                 linux-x64       8  7.49    6.8G    3.7G     0.0     0.0
> node015                 linux-x64       8  7.51    6.8G    3.7G     0.0     0.0
> node016                 linux-x64       8  7.50    6.8G    3.6G     0.0     0.0
> node017                 linux-x64       8  7.89    6.8G    3.7G     0.0     0.0
> node018                 linux-x64       8  7.50    6.8G    3.7G     0.0     0.0
> node019                 linux-x64       8  7.52    6.8G    3.7G     0.0     0.0
> node020                 linux-x64       8  7.68    6.8G    3.6G     0.0     0.0
> node021                 linux-x64       8  7.16    6.8G    3.6G     0.0     0.0
> node022                 linux-x64       8  6.99    6.8G    3.6G     0.0     0.0
> node023                 linux-x64       8  6.80    6.8G    3.6G     0.0     0.0
> node024                 linux-x64       8  7.20    6.8G    3.6G     0.0     0.0
> node025                 linux-x64       8  6.86    6.8G    3.6G     0.0     0.0
> node026                 linux-x64       8  7.24    6.8G    3.6G     0.0     0.0
> node027                 linux-x64       8  6.88    6.8G    3.7G     0.0     0.0
> node028                 linux-x64       8  6.28    6.8G    3.6G     0.0     0.0
> node029                 linux-x64       8  7.42    6.8G    3.6G     0.0     0.0
> node030                 linux-x64       8  0.10    6.8G  390.4M     0.0     0.0
> node031                 linux-x64       8  0.06    6.8G  135.0M     0.0     0.0
> node032                 linux-x64       8  0.04    6.8G  135.3M     0.0     0.0
> node033                 linux-x64       8  0.07    6.8G  135.6M     0.0     0.0
> node034                 linux-x64       8  0.10    6.8G  134.9M     0.0     0.0
>
>
> I never saw anything unusual in the output.
>
>
> On Wed, Sep 18, 2013 at 10:40 AM, Rajat Banerjee <rajatb_at_post.harvard.edu> wrote:
>
>> Ryan,
>> Could you put the output of qhost and qstat into a text file and send it
>> back to the list? That's what feeds the load balancer those stats.
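>>
>> For the duration/wait averages specifically, if I remember right they are
>> built from the qacct job history (that's the '>>> Loading full job history'
>> step) rather than from qstat itself. Just to illustrate the idea (this is a
>> hand-rolled approximation, not the balancer's actual code), an average run
>> time can be pulled straight from the accounting data:
>>
>>   # illustration only: average ru_wallclock over all finished jobs
>>   qacct -j | awk '/^ru_wallclock/ {sum += $2; n++}
>>     END {if (n) printf "avg duration: %d secs over %d jobs\n", sum/n, n;
>>          else print "no job history found"}'
>>
>> If that history comes back empty (or qacct is slow to answer) on a given
>> poll, I would expect the averages to show up as 0 secs for that loop.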
>>
>> Thanks,
>> Rajat
>>
>>
>> On Tue, Sep 17, 2013 at 11:47 PM, Ryan Golhar <ngsbioinformatics_at_gmail.com> wrote:
>>
>>> I'm running a cluster with over 800 jobs queued, and I'm running
>>> loadbalance.  Every other query by loadbalance shows an avg job duration
>>> and wait time of 0 secs.  Why is this?  It hasn't caused a problem yet,
>>> but it seems odd.
>>>
>>> >>> Loading full job history
>>> Execution hosts: 19
>>> Queued jobs: 791
>>> Oldest queued job: 2013-09-17 22:19:23
>>> Avg job duration: 3559 secs
>>> Avg job wait time: 12389 secs
>>> Last cluster modification time: 2013-09-18 00:11:31
>>> >>> Not adding nodes: already at or above maximum (1)
>>> >>> Sleeping...(looping again in 60 secs)
>>>
>>> Execution hosts: 19
>>> Queued jobs: 791
>>> Oldest queued job: 2013-09-17 22:19:23
>>> Avg job duration: 0 secs
>>> Avg job wait time: 0 secs
>>> Last cluster modification time: 2013-09-18 00:11:31
>>> >>> Not adding nodes: already at or above maximum (1)
>>> >>> Sleeping...(looping again in 60 secs)
>>>
>>>
>>>
>>>
>>>
>>
>
>
>