Re:  Instances are not accepting jobs when the slots are available.
 
Here is a follow-up on my investigation of the unusually high CPU/core usage
on the EC2 instances.
In the last post, I reported my observations of 1. unusually high CPU/core
usage by the R processes on the EC2 instances, even though each process is
designed to use one core and does so on my local machine; and 2. an unusually
high percentage of kernel time in the CPU usage.
I looked further into the R processes using htop and found that a lot of
threads had been created in each of them, and that each thread is making a
huge number of sched_yield() system calls.
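For anyone who wants to reproduce this on a similar setup, commands along
these lines should show the same thing (the PID 12345 is just a placeholder
for one of the R processes):

  # number of threads the R process has spawned
  ps -o nlwp= -p 12345

  # count system calls made by the process and all of its threads
  # (Ctrl-C after a few seconds to see the summary table)
  strace -f -c -e trace=sched_yield -p 12345

My guess, and it is only a guess, is that a multithreaded BLAS or OpenMP
runtime linked into R is busy-waiting on all cores; if so, setting something
like OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 in the job environment
should make the extra threads disappear.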
Does this behavior with StarCluster on EC2 ring a bell for anyone?
Thanks!
Jin
On Thu, Jul 17, 2014 at 3:48 PM, Jin Yu <yujin2004_at_gmail.com> wrote:
> Hi Chris,
>
> Thanks for your prompt reply and for pointing me to the unusually high load
> on the instances! I have found something even more mysterious on the EC2
> instances (c3.8xlarge, to be specific):
>
> 1. Some of my jobs are using as much as 900% CPU, even though these jobs
> are designed to use only one core and behave that way on my local machine.
> This is what leads to the unexpectedly high load on the system. Below is an
> example snapshot of these processes.
>
> 2. While the 8 running jobs together take about 3000% CPU, which is close
> to the full capacity of the 32 cores, kernel time accounts for up to 70% of
> that CPU time.
>
> Are these problems related to the virtualized nature of the EC2 instances?
> Can you give me a hint on how to investigate them?
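>
> In case it is useful, commands along these lines should show where the time
> goes (the PID 12345 is just a placeholder for one of the job processes):
>
>   # per-core user vs. system (%sys) time, sampled every 5 seconds
>   mpstat -P ALL 5
>
>   # per-thread CPU breakdown for a single job process
>   pidstat -t -p 12345 5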
>
> Thanks!
> Jin
>
>
>
> [image: Inline image 1]
>
> On Thu, Jul 17, 2014 at 1:48 PM, Chris Dagdigian <dag_at_bioteam.net> wrote:
>
>>
>> Hi Jin,
>>
>> The cluster is not accepting jobs into those open slots because your
>> compute nodes are reporting alarm state "a"  - your first host has a
>> reported load average of 148!
>>
>> Alarm state 'a' means "load threshold alarm level reached". It basically
>> means that the server load is high enough that the nodes are refusing
>> new work until the load average goes down.
>>
>> All of those load alarm thresholds are configurable values within SGE, so
>> you can revise them upwards if you want.
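>>
>> For example (just a sketch; pick whatever threshold you are comfortable
>> with), you could edit the queue definition interactively:
>>
>>   qconf -mq all.q    # raise np_load_avg in the load_thresholds line
>>
>> or change the attribute directly:
>>
>>   # example only: raise the per-core load threshold on all.q to 4.0
>>   qconf -mattr queue load_thresholds np_load_avg=4.0 all.q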
>>
>> Regards,
>> Chris
>>
>>
>> Jin Yu wrote:
>> > Hello,
>> >
>> > I just started a cluster of 20 c3.8xlarge instances, each of which has
>> > 32 virtual cores. In my understanding, each instance should have 32
>> > slots available to run jobs by default. But after running for a while,
>> > I found that a lot of nodes are not running at full capacity.
>> >
>> > As an example, below you can see that node016 has only 13 jobs running
>> > and node017 has 9, while node018 has 32. I have another ~10000 jobs
>> > waiting in the queue, so it is not a matter of running out of jobs.
>> >
>> > Can anyone give me a hint as to what is going on here?
>> >
>> > Thanks!
>> > Jin
>> >
>> >
>> > all.q@node016                  BIP   0/13/32        148.35   linux-x64     a
>> >     784 0.55500 job.part.a sgeadmin     r     07/17/2014 11:25:59     1
>> >     982 0.55500 job.part.a sgeadmin     r     07/17/2014 14:43:59     1
>> >    1056 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
>> >    1057 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
>> >    1058 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:59     1
>> >    1121 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>> >    1122 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>> >    1123 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>> >    1124 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>> >    1125 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>> >    1126 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>> >    1127 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>> >    1128 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>> >
>> ---------------------------------------------------------------------------------
>> > all.q@node017                  BIP   0/9/32         83.86    linux-x64     a
>> >     568 0.55500 job.part.a sgeadmin     r     07/17/2014 04:01:14     1
>> >    1001 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
>> >    1002 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
>> >    1072 0.55500 job.part.a sgeadmin     r     07/17/2014 16:53:29     1
>> >    1116 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
>> >    1117 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
>> >    1118 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:44     1
>> >    1119 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
>> >    1120 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
>> >
>> ---------------------------------------------------------------------------------
>> > all.q@node018                  BIP   0/32/32        346.00   linux-x64     a
>> >
>> > _______________________________________________
>> > StarCluster mailing list
>> > StarCluster_at_mit.edu
>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>
>
Received on Thu Jul 17 2014 - 18:06:53 EDT
 