From: "Mr. Daniel Browne" <dbrowne@(email surpressed)> Subject: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 14:35:58 -0400 |
Msg# 2229 View Complete Thread (11 articles) | All Threads Last Next |
Hi Greg, We're using a Houdini submit script based on your Perl version, which I rebuilt in Python based on your Maya one. It seems that when a job is removed from a machine, whether by issuing a getoff, down, or fail, the job restarts on a new host but does not die or get removed from the first one properly. Is there something more within a Python render script that I have to do to handle job failures? -Dan ---------- Dan "Doc" Browne System Administrator Evil Eye Pictures dbrowne@(email surpressed) Office: (415) 777-0666 x105 |
From: "Mr. Daniel Browne" <dbrowne@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 14:40:43 -0400 |
Msg# 2230 |
The situation may also be complicated by the fact that these are running as a local user with forceuid and forcegid.

---------- Dan "Doc" Browne System Administrator Evil Eye Pictures dbrowne@(email surpressed) Office: (415) 777-0666 x105 |
From: Greg Ercolano <erco@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 15:12:54 -0400 |
Msg# 2232 |
On 05/04/12 11:40, Mr. Daniel Browne wrote:
> The situation may also be complicated by the fact that these are running
> as a local user with forceuid and forcegid.

I don't think that should be relevant; I doubt permissions are the issue. It's more likely the process hierarchy is being disconnected somehow. My guess is that when you requeue/dump/getoff, the python script is getting killed properly, but the renders remain running because they were somehow backgrounded.

-- Greg Ercolano, erco@(email surpressed) Seriss Corporation Rush Render Queue, http://seriss.com/rush/ Tel: (Tel# suppressed)ext.23 Fax: (Tel# suppressed) Cel: (Tel# suppressed) |
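[Editor's note] To illustrate the backgrounding theory above, here is a minimal unix-only sketch (not from the thread; `sleep` stands in for a real render) showing how a child detached into its own session escapes the process group a queue daemon would kill:

```python
import os
import subprocess
import time

# A child spawned normally stays in our process group, so a group-wide
# kill (what a render queue typically issues) takes it down with us.
attached = subprocess.Popen(["sleep", "5"])

# A child detached via setsid() gets its own session and process group,
# so a group-wide kill never reaches it -- it lives on as an orphan.
detached = subprocess.Popen(["sleep", "5"], start_new_session=True)
time.sleep(0.1)

my_pgid = os.getpgid(0)
same_group = (os.getpgid(attached.pid) == my_pgid)  # dies with our group
escaped = (os.getpgid(detached.pid) != my_pgid)     # survives a group kill

print("attached child shares our group:", same_group)
print("detached child escaped our group:", escaped)

# Clean up the demo children
for p in (attached, detached):
    p.terminate()
    p.wait()
```

Anything in a script that calls setsid(), double-forks, or uses `&` with job control can produce the "escaped" case.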
From: Greg Ercolano <erco@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 15:09:56 -0400 |
Msg# 2231 |
On 05/04/12 11:35, Mr. Daniel Browne wrote:
> We're using a Houdini submit based on your perl version that I rebuilt
> in Python based off of your one for Maya. It seems that when a job is
> removed from a machine, either by issuing a getoff, down, or fail, the
> job restarts on a new host but does not die or get removed from the
> first one properly.

Sounds like you're saying the script is killed, but the render remains running; is that the case? If so: are you sure the script isn't somehow 'backgrounding' the render, so that it disconnects from the process hierarchy? That's the only situation where I'd think this would be possible. Rush will kill the entire process hierarchy, but if the render process somehow disconnects from that hierarchy, then rush can't control it.

> Is there something more within a Python render
> script that I have to do to handle job fails?

The technique I'd recommend for running a render from Python, which I know works OK:

On unix:

    sys.stdout.flush()
    sys.stderr.flush()
    return os.system(cmd)

On windows:

    import subprocess
    exitcode = subprocess.call(cmd, shell=0)

I would change shell=0 to shell=1 if you plan on using redirection (<, >), pipes (|), or other shell-special characters like &&.

To advise based on what you're currently using, I'd need more info:

1) What does the process hierarchy look like while a frame is running? For instance, on linux:

    % ps fax
    [..]
    32325 ?  Ss   0:00 /usr/local/rush/bin/rushd
     6627 ?  SNs  0:00  \_ perl /eagle/net/cd/releases/tar/102.42d/irush/../examples/perl/submit-maya.pl -render 5 1 5 1 yes - 3 Fail Licpause+Retry //eag
     6628 ?  SN   0:00      \_ Render -r mr -proj /eagle/net/tmp -s 1 -e 5 -b 1 -v 5 -rt 1 /eagle/net/tmp/scenes/foo.ma
     6629 ?  SN   0:00          \_ /bin/csh -f /usr/autodesk/maya2012-x64/bin/maya -batch -file /eagle/net/tmp/scenes/foo.ma -script /var/tmp/.RUSH_TMP.15
     6648 ?  DN   0:00              \_ /usr/autodesk/maya2012-x64/bin/maya.bin -batch -file /eagle/net/tmp/scenes/foo.ma -script /var/tmp/.RUSH_TMP.15/AST
    [..]

..and then, what does the process hierarchy look like AFTER you requeue/getoff the frame? Is the python script still running? Or is it only the render processes, and if so, which ones?

2) What python code are you using to invoke Houdini during rendering? My guess is that how this is being invoked is what's causing the problem.

2a) Are you using os.system(), os.popen(), or subprocess.call()? Include any special command flags for these that you might be using.

2b) What exact command is being invoked? Perhaps there's something about the command that's causing the problem, like the presence of unix '&' or DOS's 'start'.

-- Greg Ercolano, erco@(email surpressed) Seriss Corporation Rush Render Queue, http://seriss.com/rush/ Tel: (Tel# suppressed)ext.23 Fax: (Tel# suppressed) Cel: (Tel# suppressed) |
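[Editor's note] The unix/windows invocations recommended above can be wrapped in one small helper; this is a sketch, with a trivial echo standing in for the actual Houdini/Maya render command line:

```python
import os
import subprocess
import sys

def run_render(cmd):
    """Run a render command in the foreground, keeping it inside this
    script's process hierarchy so the queue can kill it cleanly."""
    sys.stdout.flush()   # keep our buffered output ordered ahead of the render's
    sys.stderr.flush()
    if os.name == "nt":
        # Use shell=True only if cmd needs redirection, pipes, or &&
        return subprocess.call(cmd, shell=False)
    # os.system() blocks until the command exits and returns its status
    return os.system(cmd)

# A trivial command stands in for the real render line
if os.name == "nt":
    exitcode = run_render(["cmd", "/c", "echo frame done"])
else:
    exitcode = run_render("echo frame done")
print("exit code:", exitcode)
```

The important property is that both branches block until the child exits, so the render never detaches from the hierarchy rush is tracking.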
From: "Mr. Daniel Browne" <dbrowne@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 15:16:08 -0400 |
Msg# 2233 |
I'm executing using your (exitcode, errmsg) = Rush.RunCommand(command) call, version 82.

---------- Dan "Doc" Browne System Administrator Evil Eye Pictures dbrowne@(email surpressed) Office: (415) 777-0666 x105 |
From: Greg Ercolano <erco@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 15:35:16 -0400 |
Msg# 2234 |
On 05/04/12 12:16, Mr. Daniel Browne wrote:
> I'm executing using your
> (exitcode, errmsg) = Rush.RunCommand(command)
> call, version 82

OK, then I'll need that other info; (1), (2), (2a), (2b)..

-- Greg Ercolano, erco@(email surpressed) Seriss Corporation Rush Render Queue, http://seriss.com/rush/ Tel: (Tel# suppressed)ext.23 Fax: (Tel# suppressed) Cel: (Tel# suppressed) |
From: "Mr. Daniel Browne" <dbrowne@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 17:06:11 -0400 |
Msg# 2235 |
I'll let you know what I find next time it occurs. It seems to only happen with jobs running on a specific group of machines that are being used for GPU-accelerated rendering.

---------- Dan "Doc" Browne System Administrator Evil Eye Pictures dbrowne@(email surpressed) Office: (415) 777-0666 x105 |
From: Greg Ercolano <erco@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 17:53:51 -0400 |
Msg# 2236 |
On 05/04/12 14:06, Mr. Daniel Browne wrote:
> I'll let you know what I find next time it occurs. It seems to only
> happen with jobs running on a specific group of machines that are being
> used for GPU-accelerated rendering.

Mmm, I'd be surprised if that mattered; use of the GPU or otherwise shouldn't affect a process's execution.. unless, that is, it's running in an 'unkillable' state while interacting with the hardware.

Let's wait and see what you find.

-- Greg Ercolano, erco@(email surpressed) Seriss Corporation Rush Render Queue, http://seriss.com/rush/ Tel: (Tel# suppressed)ext.23 Fax: (Tel# suppressed) Cel: (Tel# suppressed) |
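[Editor's note] A quick way to check for the 'unkillable' scenario mentioned above: on linux, processes in uninterruptible sleep show a leading "D" in the ps STAT column, and they ignore all signals (even SIGKILL) until the kernel operation they're blocked on completes. A generic one-liner, not a rush tool:

```shell
# List processes stuck in uninterruptible sleep ("D" state); a render
# wedged here while talking to the GPU driver or NFS can't be killed
# until the blocked operation finishes.
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```

If the leftover render shows up here, the problem is hardware/driver I/O, not the queue.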
From: Greg Ercolano <erco@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 18:09:44 -0400 |
Msg# 2237 |
On 05/04/12 14:53, Greg Ercolano wrote:
> Mmm, I'd be surprised if that mattered; use of GPU or otherwise
> shouldn't affect a process's execution.. unless, that is, it's
> running in an 'unkillable' state while interacting with the hardware.
>
> Let's wait and see what you find.

Actually, you might try to replicate this from the command line: run your python script with the same command you're telling rush to run. (Just set any RUSH environment variables the script depends on, like RUSH_FRAME or RUSH_JOBID.) Then, while it's rendering, hit ^C to kill it, and check whether the render is somehow still running.

So for instance, if you're telling rush to run:

    python /your/script/foo.py -render -arg1 -arg2 -arg3

..then to test from the command line, use e.g.:

    ( export RUSH_FRAME=1; export RUSH_JOBID=yourjobid.123; python /your/script/foo.py -render -arg1 -arg2 -arg3 )

..before hitting ^C.

-- Greg Ercolano, erco@(email surpressed) Seriss Corporation Rush Render Queue, http://seriss.com/rush/ Tel: (Tel# suppressed)ext.23 Fax: (Tel# suppressed) Cel: (Tel# suppressed) |
From: "Mr. Daniel Browne" <dbrowne@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 18:22:17 -0400 |
Msg# 2238 |
I've been able to kill it from the terminal using the ps and kill commands, so I don't think that will show us much.

---------- Dan "Doc" Browne System Administrator Evil Eye Pictures dbrowne@(email surpressed) Office: (415) 777-0666 x105 |
From: Greg Ercolano <erco@(email surpressed)> Subject: Re: Jobs not being killed by "Getoff" Date: Fri, 04 May 2012 18:31:01 -0400 |
Msg# 2239 |
On 05/04/12 15:22, Mr. Daniel Browne wrote:
> I've been able to kill it from the terminal using the ps and kill
> commands, so I don't think that will show us much.

Mmm, but if rush killed the frame and requeued it elsewhere, that means rush got an exit code from the OS indicating the python script did actually stop running. So the only way the render could live on is if it disconnected from the process hierarchy.

But hmm, looking at your OP carefully:

> It seems that when a job is removed from a machine, either by issuing a getoff, *down*, or fail,
> the job restarts on a new host but does not die or get removed from the first one properly.

You mention the use of 'Down'. I'm guessing you might have meant 'Dump', but if you did actually use 'Down', that might explain it; you definitely wouldn't want to use 'Down' unless the remote machine was actually down and not responding.

-- Greg Ercolano, erco@(email surpressed) Seriss Corporation Rush Render Queue, http://seriss.com/rush/ Tel: (Tel# suppressed)ext.23 Fax: (Tel# suppressed) Cel: (Tel# suppressed) |