From: Patrick Boucher <patrickb@(email surpressed)>
Subject: Die frames
Date: Tue, 08 Nov 2005 13:01:16 -0800
Msg# 1103
Hi all,

One of our sysadmins here ripped 4 computers from the farm without first doing a 'rush -getoff [computernames...]'.

When the job that was rendering on these nodes was dumped, the frames wound up in Die state and are stuck, and the job doesn't want to dump.

Is there any way to tell rush that it won't be getting a return code from the nodes and just forget about it?

Any help would be appreciated.

-- 
Patrick Boucher
TD - Coder - Resident geek
Buzz Image Group
Tel 514.848.0579
Fax 514.848.6371
www.buzzimage.com
www.xsi-blog.com
From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: Die frames
Date: Tue, 08 Nov 2005 13:13:31 -0800
Msg# 1105
Patrick Boucher wrote:
> One of our sysadmins here ripped 4 computers from the farm without first doing a 'rush -getoff [computernames...]'
> When the job that was rendering on these nodes was dumped the frames wound up in Die state and are stuck and the job doesn't want to dump.
> Is there any way to tell rush that it won't be getting a return code from the nodes and just forget about it?

Yes; the 'Down' button in irush is precisely for this purpose.

    1) Click 'Frames'
    2) Highlight the 'Die' frames waiting for the remotes to come back
    3) Click the red 'Down' button

For more info on the irush 'Down' button, right click on it to view the help.

The equivalent command line option is 'rush -down'; for more info on that, see:
http://www.seriss.com/rush-current/rush/rush-command-line-options.html#-down

-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)
From: Patrick Boucher <patrickb@(email surpressed)>
Subject: Re: Die frames
Date: Tue, 08 Nov 2005 13:29:21 -0800
Msg# 1106
Greg Ercolano wrote:
> Patrick Boucher wrote:
> > One of our sysadmins here ripped 4 computers from the farm without first doing a 'rush -getoff [computernames...]'
> > When the job that was rendering on these nodes was dumped the frames wound up in Die state and are stuck and the job doesn't want to dump.
> > Is there any way to tell rush that it won't be getting a return code from the nodes and just forget about it?
>
> Yes; the 'Down' button in irush is precisely for this purpose.
>
>     1) Click 'Frames'
>     2) Highlight the 'Die' frames waiting for the remotes to come back
>     3) Click the red 'Down' button
>
> For more info on the irush 'Down' button, right click on it to view the help.
>
> The equivalent command line option is 'rush -down'; for more info on that, see:
> http://www.seriss.com/rush-current/rush/rush-command-line-options.html#-down

I'll be going to bed just a tiny bit smarter. I love it! Thanks!

-- 
Patrick Boucher
TD - Coder - Resident geek
Buzz Image Group
Tel 514.848.0579
Fax 514.848.6371
www.buzzimage.com
www.xsi-blog.com
From: Brent Hensarling <brenth@(email surpressed)>
Subject: Re: Die frames
Date: Tue, 08 Nov 2005 13:50:20 -0800
Msg# 1107
From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: Die frames
Date: Tue, 08 Nov 2005 14:10:50 -0800
Msg# 1108
Brent Hensarling wrote:
> It's funny, I never noticed that there :)
> What about the case where a frame is hung on a machine that is not dead, but the frame will not requeue at all?

When you requeue a running frame, Rush uses the most powerful means it can to kill the process. Under Unix, it uses kill -9, and under Windows it uses a similar 'most deadly' means of killing the process.

If the kernel won't let the process go, it's most likely 'hung' in a way that is severe. Under Unix, the common cause is that the file server is down and the app is accessing a file through a hard mount. The process won't revive unless the remote server comes back, or the mount is fixed.

In some cases, the problem is a buggy network file system implementation that is not compatible with the file server, causing the mount to hang indefinitely even though the remote server is healthy. (In such a case, the bug is probably in the client's NFS implementation.) Similar problems with other filesystems/platforms can cause this.

> We tend to get this all the time, and basically what I do is restart the rush service on the machine that has the hung frame; it then requeues the frame, and then I can turn it back online and it will start rendering again.

Probably what happens there is the process is orphaned but remains running.

To solve this problem I would investigate deeply what is causing the processes to hang in an unkillable manner. For instance, can you kill the render with 'kill -9' (Unix) or from the Task Manager (Windows)? My guess is you won't be able to, indicating that your OS is at fault; the most likely cause is the mounting scheme (buggy NFS/SMB) or a buggy network lookup system (LDAP/NIS). I've seen both cause 'unkillable hanging problems' that you would experience both from the command line and under rush.

> Will the down command just requeue the frame and, if the machine is still online, let it continue rendering?

IIRC, down tells the job server to release the frame and requeue it, severing the association with the remote. However, I believe the remote will still not let it go, because the process is still running from its point of view. The only way to really let it go is to kill the render process somehow; otherwise you'd have to restart the rushd service, or better yet reboot the box, since unkillable processes, esp. under Windows, should not be happening.

You need to identify the problem with the OS if a process is unkillable. I'm not familiar with the Windows-specific debugging tools, but under Unix I use 'strace -p [pid_of_render_process]' to determine what the process is doing, and/or the long reports from 'ps' to see what the WCHAN entry shows the process waiting for, and the 'STAT' column showing what mode the process is in (i.e. in an unkillable sleep, or some such, which means it's stuck on some I/O device).

-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)
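As a rough illustration of the 'ps' check Greg describes, here is a minimal Python sketch that looks for processes in an uninterruptible sleep (state 'D'), which is the usual sign of a render stuck on I/O. It is not part of Rush. It assumes a Linux box with the procps 'ps', and the function names (parse_ps, uninterruptible, list_stuck_processes) are made up for this example.

```python
import subprocess

# Assumed ps invocation (Linux procps): PID, state, wait channel, command name.
PS_COMMAND = ["ps", "-eo", "pid,stat,wchan:32,comm"]


def parse_ps(ps_output):
    """Parse the text printed by PS_COMMAND into a list of dicts."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)  # comm is last and may contain spaces
        if len(parts) < 4:
            continue  # ignore malformed or truncated lines
        pid, stat, wchan, comm = parts
        rows.append({"pid": int(pid), "stat": stat,
                     "wchan": wchan, "comm": comm.strip()})
    return rows


def uninterruptible(rows):
    """Keep only processes whose state starts with 'D' (uninterruptible sleep)."""
    return [row for row in rows if row["stat"].startswith("D")]


def list_stuck_processes():
    """Run ps on the local machine and return the processes in 'D' state."""
    result = subprocess.run(PS_COMMAND, capture_output=True,
                            text=True, check=True)
    return uninterruptible(parse_ps(result.stdout))
```

Running list_stuck_processes() on the render node with the hung frame should list any process that 'kill -9' cannot remove right now, and the 'wchan' value hints at what the kernel is waiting on (an NFS-related wait channel would point at the mount). A process can pass through 'D' briefly during normal disk I/O, so only one that stays there across repeated checks is suspect.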