From: "Mr. Daniel Browne" <dbrowne@(email surpressed)>
Subject: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 14:35:58 -0400
Msg# 2229
Hi Greg,

We're using a Houdini submit script that I rebuilt in Python, based on your perl submit script for Maya. It seems that when a job is removed from a machine, either by issuing a getoff, down, or fail, the job restarts on a new host but does not die or get removed from the first one properly. Is there something more I have to do within a Python render script to handle job failures?

-Dan


----------
Dan "Doc" Browne
System Administrator
Evil Eye Pictures

dbrowne@(email suppressed)
Office: (415) 777-0666 x105


   From: "Mr. Daniel Browne" <dbrowne@(email surpressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 14:40:43 -0400
Msg# 2230
The situation may also be complicated by the fact that these are running as a local user with forceuid and forcegid.



----------
Dan "Doc" Browne
System Administrator
Evil Eye Pictures

dbrowne@(email suppressed)
Office: (415) 777-0666 x105


   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 15:12:54 -0400
Msg# 2232
On 05/04/12 11:40, Mr. Daniel Browne wrote:
> The situation may also be complicated by the fact that these are running
> as a local user with forceuid and forcegid.

	I don't think that should be relevant.

	I doubt permissions are the issue.

	It's more likely the process hierarchy is being disconnected somehow.

	I'm guessing that when you requeue/dump/getoff/etc,
	the python script is getting killed properly,
	but the renders remain running because they've somehow been backgrounded.

-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)


   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 15:09:56 -0400
Msg# 2231
On 05/04/12 11:35, Mr. Daniel Browne wrote:
> We're using a Houdini submit script that I rebuilt in Python, based on your
> perl submit script for Maya. It seems that when a job is removed from a
> machine, either by issuing a getoff, down, or fail, the job restarts on a
> new host but does not die or get removed from the first one properly.

	Sounds like you're saying the script is killed, but the render
	remains running, is that the case?

	If so: are you sure the script isn't somehow 'backgrounding'
	the render, so that it disconnects from the process hierarchy?

	That's the only situation where I'd think this would be possible.

	Rush will kill the entire process hierarchy, but if the render process
	somehow disconnects from the process hierarchy, then rush can't control it.

> Is there something more I have to do within a Python render
> script to handle job failures?

	The technique I'd recommend for running a render from Python that
	I know works OK:

	On unix:

	    sys.stdout.flush()
	    sys.stderr.flush()
	    return os.system(cmd)

	On windows:

		import subprocess
		exitcode = subprocess.call(cmd, shell=0)

	I would change the shell=0 to shell=1 if you plan on using
	redirection (<>) or pipes (|) or other shell-special chars
	like &&.
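
	For instance, a minimal self-contained sketch along those lines
	(the run_render() helper and the hython command in the comment are
	made-up placeholders, not anything from your actual submit script):

	    import sys
	    import subprocess

	    def run_render(cmd, use_shell=False):
	        # Flush our own output first so the script's messages and the
	        # render's output don't interleave out of order in the frame log.
	        sys.stdout.flush()
	        sys.stderr.flush()
	        # Run the render in the FOREGROUND: subprocess.call() blocks
	        # until the command exits, so the render stays a child of this
	        # script and dies when rush kills the process hierarchy.
	        # Pass use_shell=True (and cmd as a single string) only if the
	        # command needs redirection, pipes, or other shell-special chars.
	        return subprocess.call(cmd, shell=use_shell)

	    # Hypothetical usage (substitute your real houdini/hython command):
	    # exitcode = run_render(["hython", "/path/to/render.py", "-f", "1"])
	    # sys.exit(exitcode)

	(os.system() behaves the same way for this purpose on unix; the key
	is that the call blocks until the render exits.)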

	To know how to advise based on what you're currently
	using, I'd need more info:

	1) What does the process hierarchy look like when a frame
	   is running? For instance, on linux:

% ps fax
[..]
32325 ?        Ss     0:00 /usr/local/rush/bin/rushd
 6627 ?        SNs    0:00  \_ perl /eagle/net/cd/releases/tar/102.42d/irush/../examples/perl/submit-maya.pl -render 5 1 5 1 yes - 3 Fail Licpause+Retry //eag
 6628 ?        SN     0:00      \_ Render -r mr -proj /eagle/net/tmp -s 1 -e 5 -b 1 -v 5 -rt 1 /eagle/net/tmp/scenes/foo.ma
 6629 ?        SN     0:00          \_ /bin/csh -f /usr/autodesk/maya2012-x64/bin/maya -batch -file /eagle/net/tmp/scenes/foo.ma -script /var/tmp/.RUSH_TMP.15
 6648 ?        DN     0:00              \_ /usr/autodesk/maya2012-x64/bin/maya.bin -batch -file /eagle/net/tmp/scenes/foo.ma -script /var/tmp/.RUSH_TMP.15/AST
[..]

	   ..and then, what's the process hierarchy look like AFTER you requeue/getoff/whatever
	   the frame.. is the python script still running? Or is it only the render
	   processes, and if so, which ones?


	2) What python code are you using to invoke Houdini during rendering?
	   My guess is that the way it's being invoked is what's causing the problem.

		2a) Are you using os.system() or os.popen() or subprocess.call()..?
	            Include any special command flags for these that you might be using.

		2b) What is the exact command being invoked?
		    Perhaps there's something about the command that's causing the
		    problem, like the presence of unix '&' or DOS's 'start'.
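
	To illustrate (2b), a contrived example of the difference; the 'sleep'
	here just stands in for a render command, it's nothing from your setup:

	    import os

	    # BAD: the trailing '&' makes /bin/sh background the command and
	    # exit immediately, so the 'render' is reparented to init right
	    # away; it's no longer under this script in the process tree, and
	    # killing the hierarchy won't touch it.
	    os.system("sleep 300 &")

	    # GOOD: without the '&' the call blocks until the command finishes,
	    # so it stays a child of this script and dies with the hierarchy.
	    os.system("sleep 300")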

-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)


   From: "Mr. Daniel Browne" <dbrowne@(email surpressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 15:16:08 -0400
Msg# 2233
I'm executing using your 

    (exitcode, errmsg) = Rush.RunCommand(command)

call, version 82



----------
Dan "Doc" Browne
System Administrator
Evil Eye Pictures

dbrowne@(email suppressed)
Office: (415) 777-0666 x105


   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 15:35:16 -0400
Msg# 2234
On 05/04/12 12:16, Mr. Daniel Browne wrote:
> I'm executing using your
>     (exitcode, errmsg) = Rush.RunCommand(command)
> call, version 82

	OK, then I'll need that other info; (1), (2), (2a), (2b)..

-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)


   From: "Mr. Daniel Browne" <dbrowne@(email surpressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 17:06:11 -0400
Msg# 2235
I'll let you know what I find next time it occurs. It seems to only happen with jobs running on a specific group of machines that are being used for GPU-accelerated rendering.




----------
Dan "Doc" Browne
System Administrator
Evil Eye Pictures

dbrowne@(email suppressed)
Office: (415) 777-0666 x105


   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 17:53:51 -0400
Msg# 2236
On 05/04/12 14:06, Mr. Daniel Browne wrote:
> I'll let you know what I find next time it occurs. It seems to only
> happen with jobs running on a specific group of machines that are being
> used for GPU-accelerated rendering.

	Mmm, I'd be surprised if that mattered; use of the GPU or otherwise
	shouldn't affect a process's execution.. unless, that is, it's
	running in an 'unkillable' state while interacting with the hardware.
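
	If you want to check for that when it happens, here's a rough
	linux-only sketch (it just reads /proc; ps shows the same thing in
	its STAT column) that lists processes stuck in 'D' state:

	    import os

	    def d_state_procs():
	        # Report (pid, command) for any process in 'D' (uninterruptible
	        # sleep, usually I/O or a wedged driver); such processes ignore
	        # signals until the kernel releases them.
	        hung = []
	        for entry in os.listdir("/proc"):
	            if not entry.isdigit():
	                continue
	            try:
	                with open("/proc/%s/stat" % entry) as f:
	                    data = f.read()
	            except (IOError, OSError):
	                continue    # process exited while we were looking
	            rparen = data.rindex(")")
	            comm = data[data.index("(") + 1 : rparen]
	            state = data[rparen + 2]
	            if state == "D":
	                hung.append((int(entry), comm))
	        return hung

	    for pid, comm in d_state_procs():
	        print("%6d  %s" % (pid, comm))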

	Let's wait to see what you find.


-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)


   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 18:09:44 -0400
Msg# 2237
On 05/04/12 14:53, Greg Ercolano wrote:
> On 05/04/12 14:06, Mr. Daniel Browne wrote:
>> I'll let you know what I find next time it occurs. It seems to only
>> happen with jobs running on a specific group of machines that are being
>> used for GPU-accelerated rendering.
> 
> 	Mmm, I'd be surprised if that mattered; use of GPU or otherwise
> 	shouldn't affect a process's execution.. unless that is it's
> 	running in an 'unkillable' state while interacting with the hardware.
> 
> 	Let's wait to see what you find.


   Actually, you might try to replicate this from the command line;
   try running your python script with the same command you're
   telling rush to run. (Just set any RUSH environment variables
   the script depends on, like RUSH_FRAME or RUSH_JOBID).

   Then, while it's rendering, hit ^C to kill it,
   then check to see if somehow the render is still running.

   So for instance, if you're telling rush to run:

	python /your/script/foo.py -render -arg1 -arg2 -arg3

   ..then to test from the command line, use e.g.

	( export RUSH_FRAME=1; export RUSH_JOBID=yourjobid.123; python /your/script/foo.py -render -arg1 -arg2 -arg3 )

    ..before hitting ^C.


-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)


   From: "Mr. Daniel Browne" <dbrowne@(email surpressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 18:22:17 -0400
Msg# 2238
I've been able to kill it from the terminal using the ps and kill commands, so I don't think that will show us much.




----------
Dan "Doc" Browne
System Administrator
Evil Eye Pictures

dbrowne@(email suppressed)
Office: (415) 777-0666 x105


   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Jobs not being killed by "Getoff"
   Date: Fri, 04 May 2012 18:31:01 -0400
Msg# 2239
On 05/04/12 15:22, Mr. Daniel Browne wrote:
> I've been able to kill it from the terminal using the ps and kill
> commands, so I don't think that will show us much.

	Mmm, but if you say rush killed the frame and requeued it
	elsewhere, that means Rush got an exit code from the OS
	indicating the python script did actually stop running.

	So the only way that the render could live on is if it
	disconnected from the process hierarchy.
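
	A quick way to confirm that on linux is to check the leftover render's
	parent pid; if it got reparented to init, it was disconnected.
	A rough sketch, with the pid taken from 'ps':

	    def parent_pid(pid):
	        # Return the parent pid of 'pid' by reading /proc/<pid>/stat;
	        # a parent of 1 means the process was orphaned to init.
	        with open("/proc/%d/stat" % pid) as f:
	            data = f.read()
	        # The fields after the "(comm)" are: state, ppid, ...
	        return int(data[data.rindex(")") + 1:].split()[1])

	    # Hypothetical usage with a made-up pid:
	    # print(parent_pid(12345))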

	But hmm, looking at your OP carefully:

> It seems that when a job is removed from a machine (getoff, *down*, or fail),
> the job restarts on a new host but does not die or get removed from the first one properly.

	You mention the use of 'Down'.

	I'm guessing that might have meant "Dump", but if you did
	actually use 'Down', that might explain it; you definitely
	wouldn't want to use 'Down' unless the remote machine was
	actually down and not responding.



-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)