From: Patrick Boucher <patrickb@(email suppressed)>
Subject: Die frames
   Date: Tue, 08 Nov 2005 13:01:16 -0800
Msg# 1103
Hi all,

One of our sysadmins here ripped 4 computers from the farm without first doing a 'rush -getoff [computernames...]'

When the job that was rendering on these nodes was dumped, the frames wound up in the Die state. They are now stuck, and the job doesn't want to dump.

Is there any way to tell rush that it won't be getting a return code from the nodes and just forget about it?

Any help would be appreciated.

--
Patrick Boucher
TD - Coder - Resident geek
Buzz Image Group
Tel 514.848.0579
Fax 514.848.6371

www.buzzimage.com
www.xsi-blog.com

   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Die frames
   Date: Tue, 08 Nov 2005 13:13:31 -0800
Msg# 1105
Patrick Boucher wrote:
One of our sysadmins here ripped 4 computers from the farm without first doing a 'rush -getoff [computernames...]'

When the job that was rendering on these nodes was dumped, the frames wound up in the Die state. They are now stuck, and the job doesn't want to dump.

Is there any way to tell rush that it won't be getting a return code from the nodes and just forget about it?

	Yes; the 'Down' button in irush is precisely for this purpose.

		1) Click 'Frames'
		2) Highlight the 'Die' frames waiting for the remotes to come back
		3) Click the red 'Down' button

	For more info on the irush 'Down' button, right-click on it to view the help.

	The equivalent command line option is 'rush -down'; for more info on that, see:
	http://www.seriss.com/rush-current/rush/rush-command-line-options.html#-down

--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)

   From: Patrick Boucher <patrickb@(email suppressed)>
Subject: Re: Die frames
   Date: Tue, 08 Nov 2005 13:29:21 -0800
Msg# 1106
I'll be going to bed just a tiny bit smarter. I love it!

Thanks!


   From: Brent Hensarling <brenth@(email suppressed)>
Subject: Re: Die frames
   Date: Tue, 08 Nov 2005 13:50:20 -0800
Msg# 1107
It's funny, I never noticed that there :)

What about the case where a frame is hung on a machine that is not dead, but the frame will not requeue at all? We tend to get this all the time. What I do is restart the rush service on the machine that has the hung frame; it then requeues the frame, and I can turn the machine back online and it will start rendering again.

Will the down command just requeue the frame and, if the machine is still online, let it continue rendering?

_________________________________________________

Brent Hensarling

Luma Pictures

luma-pictures.com

_________________________________________________



   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Die frames
   Date: Tue, 08 Nov 2005 14:10:50 -0800
Msg# 1108
Brent Hensarling wrote:
It's funny, I never noticed that there :) What about the case where a frame is hung on a machine that is not dead, but the frame will not requeue at all?

	When you requeue a running frame, Rush uses the most powerful means
	it can to kill the process. Under Unix, it uses kill -9, and under
	Windows it uses a similar 'most deadly' means of killing the
	process.
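
	To illustrate what 'kill -9' amounts to, here is a minimal Python
	sketch (not Rush's own code; the function names are made up for
	this example). It sends SIGKILL to a process and then checks
	whether the process actually went away within a timeout:

```python
import os
import signal
import subprocess


def force_kill(pid):
    """Send SIGKILL, the signal behind 'kill -9', to the given pid."""
    os.kill(pid, signal.SIGKILL)


def kill_and_confirm(proc, timeout=5.0):
    """Force-kill a child process and report whether it really exited.

    `proc` is a subprocess.Popen object. Returns True if the process
    was reaped within `timeout` seconds, False if it is still around
    (for example, stuck inside a kernel call that never returns).
    """
    force_kill(proc.pid)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return True
```

	SIGKILL cannot be caught or ignored by the program, but it only
	takes effect once the kernel call the process is blocked in
	returns, which is why a False result here points at the OS or the
	file system rather than at the render itself.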

	If the kernel won't let the process go, it's most likely 'hung'
	in a way that is severe.

	Under Unix, the common cause is the file server is down, and the
	app is accessing a file through a hard mount. The process won't
	revive unless the remote server comes back, or the mount is fixed.

	In some cases, the problem is a buggy network file system client
	that is not compatible with the file server, causing the mount to
	hang indefinitely even though the remote server is healthy.
	(In such a case, the bug is probably in the client's NFS implementation.)

	Similar problems with other filesystems/platforms can cause this.
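
	One way to check for a hung mount without hanging your own shell
	is to probe the path from a child process with a timeout. A rough
	Python sketch, assuming a Unix box with 'ls' available (the
	function name and the 5 second default are arbitrary choices for
	this example):

```python
import subprocess


def path_responds(path, timeout=5.0):
    """Return True if the file system answers a lookup of `path` in time.

    The lookup runs in a child process ('ls -d'), so if the mount is
    hung only the child gets stuck, not the caller. Any answer counts
    as a response, including 'No such file or directory'.
    """
    proc = subprocess.Popen(
        ["ls", "-d", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a child blocked on a hard mount may not die
        # until the server comes back, so do not wait for it here.
        proc.kill()
        return False
    return True
```

	Calling path_responds() on the directory the render reads from
	gives a quick hint whether the file server or mount is the reason
	a frame cannot be killed.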

We tend to get this all the time, and basically what I do is a restart on the rush service for the machine that has the hung frame, and it then requeues the frame and then I can turn it back online and it will start rendering again.

	Probably what happens there is the process is orphaned but remains
	running.

	To solve this problem I would investigate deeply what is causing
	the processes to hang in an unkillable manner.

	For instance, can you kill the render with 'kill -9' (Unix)
	or from the task manager (Windows)? My guess is you won't be able to,
	indicating that your OS is at fault. The most likely cause is the
	mounting scheme (buggy NFS/SMB) or a buggy network lookup system
	(LDAP/NIS). I've seen both cause 'unkillable hanging problems'
	that you would experience both under the command line and rush.

Will the down command just requeue the frame and if the machine is still online, let it continue rendering?

	IIRC, down tells the job server to release the frame and requeue it,
	severing the association with the remote.

	However, I believe the remote will still not let it go, because
	the process is still running from its point of view. The only
	way to really let it go is to kill the render process somehow;
	otherwise you'd have to restart the rushd service, or better yet
	reboot the box, since unkillable processes, esp. under Windows,
	should not be happening. You need to identify the problem with the
	OS if a process is unkillable.

	I'm not familiar with the Windows-specific debugging tools, but
	under Unix I use 'strace -p [pid_of_render_process]' to determine
	what the process is doing, and/or the long reports from 'ps' to
	see what the WCHAN entry shows the process waiting for, and the
	'STAT' (or 'S') column showing what state the process is in (i.e.
	in an unkillable sleep, or some such, which means it's stuck on
	some I/O device).
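
	As a sketch of the 'ps' approach, the Python below picks out
	processes in uninterruptible sleep (state 'D') from 'ps' output.
	It assumes Linux procps-style output from
	'ps -eo pid,stat,wchan:32,args'; other Unix flavors may need
	different options. The sample text and the 'render_app' name are
	made up for illustration.

```python
import subprocess


def run_ps():
    """Run ps with pid, state, wait channel and command line columns."""
    return subprocess.check_output(
        ["ps", "-eo", "pid,stat,wchan:32,args"], text=True
    )


def uninterruptible(ps_output):
    """Return (pid, stat, wchan, command) for each process in state 'D'."""
    rows = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.strip().split(None, 3)
        if len(fields) < 4:
            continue
        pid, stat, wchan, command = fields
        if stat.startswith("D"):  # 'D' = uninterruptible sleep, usually I/O
            rows.append((int(pid), stat, wchan, command.strip()))
    return rows


# Made-up sample in the same layout, for trying the parser offline.
SAMPLE = """\
  PID STAT WCHAN                            COMMAND
 4242 D    rpc_wait_bit_killable            render_app -f 12
 4310 S    poll_schedule_timeout            render_app -f 13
"""
```

	On a live machine, uninterruptible(run_ps()) lists the stuck
	processes; the WCHAN value hints at which kernel routine (and so
	which device or file system) each one is waiting on.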

