From: Andrew Kingston <andrew@peerless.co.uk>
Subject: jobs not picking up spare cpus
   Date: Fri, 25 May 2007 06:02:25 -0400
Msg# 1549
Hi

We came across a strange problem yesterday: we had spare cpus available for jobs, but certain ones were not picking up frames. We'd seen this before, but were never sure whether it was down to the way people had set up their jobs or something else. This time, however, I was able to check that all the jobs were set up correctly, and I also found these kinds of messages in the job server's log for these jobs:

ALERT Ignoring frmarb 'Run': task in unexpected state 'Idle' (expected Start|Busy) msg from ?@lafarm23:33523
(Lots of these are showing up for different farm machines.)

FAIL/LISTCPUS Fputs[2]: write failed: _SureWrite(): Broken pipe

ALERT Ignoring 'Idle': task in non-applicable state 'Start' for jobid lin2.928 from ?@lafarm19:

lafarm19 & lafarm23 were two of the machines not picking up frames.
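
To see which farm machines these alerts come from, the log can be tallied by sending host. This is only a minimal sketch: it assumes the log lines look like the samples above (an ALERT keyword and a `?@host:port` sender suffix), and the function name `count_alerts_by_host` and the `sample` list are hypothetical, not part of rush.

```python
import re
from collections import Counter

# Matches the "?@host:port" sender suffix seen in the sample ALERT lines.
# The port may be empty, as in "?@lafarm19:". (Format assumed from the samples.)
SENDER_RE = re.compile(r"\?@([A-Za-z0-9_.-]+):(\d*)")

def count_alerts_by_host(lines):
    """Count ALERT lines per sending host (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Real log lines may carry a timestamp prefix, so search for the keyword
        # anywhere in the line rather than only at the start.
        if "ALERT" not in line:
            continue
        match = SENDER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the messages quoted in this post.
sample = [
    "ALERT Ignoring frmarb 'Run': task in unexpected state 'Idle' (expected Start|Busy) msg from ?@lafarm23:33523",
    "FAIL/LISTCPUS Fputs[2]: write failed: _SureWrite(): Broken pipe",
    "ALERT Ignoring 'Idle': task in non-applicable state 'Start' for jobid lin2.928 from ?@lafarm19:",
    "ALERT Task 'CpuPass1' ignored for non-existant frame -99999 from ?@lafarm23:33566",
]

for host, n in count_alerts_by_host(sample).most_common():
    print(host, n)
```

Hosts that show up here can then be checked against the machines that are sitting idle.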

Also, I've just checked through the logs this morning and found quite a few messages of this type on that job server:

ALERT      Task 'CpuPass1' ignored for non-existant frame -99999 from ?@lafarm23:33566
  Prev=lin2            0       lin2.892,091_070_tiles_v04     -99999 100   2048 JobPass    Job state is 'Done'
  New=lin2             0       lin2.892,091_070_tiles_v04     -99999 100   2048 CpuPass2   Ram unavailable on lafarm23 (2048>0)

These only appeared between 4:00 and 4:15 am, and I'm fairly sure no one was here rendering then...

Any ideas?

Cheers
Andrew

   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: jobs not picking up spare cpus
   Date: Fri, 25 May 2007 12:40:41 -0400
Msg# 1550
Andrew Kingston wrote:
> ALERT      Ignoring frmarb 'Run': task in unexpected state 'Idle' 
> (expected Start|Busy) msg from ?@lafarm23:33523 * Lots of these showing 
> up for different farm machines
> 
> FAIL/LISTCPUS Fputs[2]: write failed: _SureWrite(): Broken pipe
> 
> ALERT	   Ignoring 'Idle': task in non-applicable state 'Start' for jobid 
> lin2.928 from ?@lafarm19:

	Can you send me some complete logs directly via email
	(i.e. not on the group)? I'm not sure whether these are
	really something to worry about.

	Regarding jobs not picking up spare cpus, focus on the
	'Cpus' report for the job (check the STATE and NOTES columns)
	and compare it to the 'All Cpus' report to see what's idle vs. in use.
	Send me those two reports if need be.

> lafarm19 & lafarm23 were two of the machines not picking up frames.
> 
> Also I've just checked through the logs this morning & found quite a few 
> of these types of messages on that job server:-
> 
> ALERT      Task 'CpuPass1' ignored for non-existant frame -99999 from ?@lafarm23:33566
>   Prev=lin2            0       lin2.892,091_070_tiles_v04     -99999 100   2048 JobPass    Job state is 'Done'
>   New=lin2             0       lin2.892,091_070_tiles_v04     -99999  100   2048 CpuPass2   Ram unavailable on lafarm23 (2048>0)
> 
> These only appeared between 4 & 4:15 am, and I'm fairly sure no one was 
> here rendering then...

	At 5am rush runs a cleanup operation (see 'taskcleanuphours' in rush.conf),
	but I'm not sure why it would show as 4am instead of 5am.
	Would need to see some logs to tell what's up there.

	Is there possibly a mix of different rush versions on the network?
	When sending the above (in separate email), include the output of:

		rush -ping +any -t 3

-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)