From: Lutz Paelike <lp@(email surpressed)>
Subject: How to detect and handle Frame MaxTime failures
   Date: Thu, 03 Nov 2011 14:03:54 -0400
Msg# 2142
Hi,

we sometimes get MaxTime failures in our rush queue, and the frames are then killed once MaxTime is reached. This is fine, but every remaining frame is still rendered, reaches MaxTime and is finally killed.
We would like to monitor a job, and if more than, let's say, 5 frames
are killed due to MaxTime, the job (or a series of jobs) should be skipped completely and no more frames should be rendered.

Because we usually chain several jobs together with the WaitFor command,
a single job with 100 frames reaching MaxTime blocks the renderfarm for several hours, which is mostly a problem at night when the farm is not watched.

A solution would be to have something like a TimeOutCommand, that
calls a script that can take appropriate action (This would be on a per frame basis), or even better a general StatusCommand that could be called for every frame, or for every job and additional information could be passed via environment variables.

Since the killing of the process is initiated by rush, my custom render
script can not detect that it was killed because it reached MaxTime.

The only solution I can think of right now is to go through every job in
the queue and parse the log files for any MAXTIME entry.
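
A rough sketch of that log-scanning idea in python, assuming each frame writes its own log file into one directory and that a killed frame's log contains the string MAXTIME somewhere. The directory layout and the function name find_maxtime_logs are assumptions for illustration, not something rush defines:

```python
import os

def find_maxtime_logs(logdir, marker="MAXTIME"):
    """Return the names of the files in logdir that contain the marker string."""
    hits = []
    for name in sorted(os.listdir(logdir)):
        path = os.path.join(logdir, name)
        if not os.path.isfile(path):
            continue
        # errors="replace" so stray binary output from a renderer can't abort the scan
        with open(path, "r", errors="replace") as f:
            if any(marker in line for line in f):
                hits.append(name)
    return hits
```

A watcher could call this periodically for each job's log directory and act once len(find_maxtime_logs(logdir)) goes above 5.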

Am I missing something here, or is there a better approach to this problem?

Cheers,

Lutz Paelike
Pipeline Supervisor
D-Facto-Motion GmbH


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: How to detect and handle Frame MaxTime failures
   Date: Thu, 03 Nov 2011 18:58:03 -0400
Msg# 2144
On 11/03/11 11:03, Lutz Paelike wrote:
> we have sometimes some MaxTime failures in our rush queue and the frames
> are then killed after MaxTime is reached. This is fine but still every
> frame is rendered, reaches MaxTime and is finally killed.
> We would like to monitor a job and if more than, let's say 5 frames,
> are killed due to MaxTime the job (or a series of jobs) should be
> skipped completely and no more frames should be rendered.

	I'd suggest instead of using MaxTime, to handle this
	specific set of circumstances, you'd probably want to
	instead put some logic in your script to handle the
	more specific behavior you want.

	For instance, I could see having your own "Render Max Time:"
	field in the submit form that passes the value to the
	render script, which in turn would take this value,
	fork() the render off as a child, and then monitor
	the execution time of the render.

	This way the script can decide if it should kill the render,
	and if so, implement its own logic to modify the job.

	For instance, I could see logic that adds a job remark
	(rush -jobremark) and frame notes (rush -notes) to tell the user
	what happened, and have the script then either pause the job (rush -pause)
	or have it fail all the Que frames (rush -fail que) so that the job
	simply fails itself quickly.
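
	To make the "more than 5 frames" rule concrete, here is a minimal
	python sketch, assuming all frames of a job can append to one shared
	counter file. The file, the function names and the default limit of 5
	are assumptions for illustration; only 'rush -fail que' comes from the
	text above, and it may need extra arguments in your setup.

```python
import os
import subprocess

def note_timeout(counter_file, frame):
    """Append one line for a timed-out frame; return how many are recorded so far."""
    # O_APPEND so small writes from different frames don't overwrite each other
    fd = os.open(counter_file, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o666)
    try:
        os.write(fd, (str(frame) + "\n").encode())
    finally:
        os.close(fd)
    with open(counter_file) as f:
        return sum(1 for _ in f)

def handle_timeout(counter_file, frame, limit=5, run=subprocess.call):
    """Record a timeout; once more than `limit` frames timed out, fail the Que frames."""
    count = note_timeout(counter_file, frame)
    if count > limit:
        # 'rush -fail que' as named above; any further arguments are not shown here
        run(["rush", "-fail", "que"])
    return count
```

	The render script would call handle_timeout() right after it kills a
	render that ran too long. On a network file system, appends from
	several machines may not be atomic, so treat the count as approximate.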

> Because we usually chain several jobs together with the WaitFor command,
> a single job with 100 frames reaching MaxTime blocks the renderfarm for
> several hours which is mostly a problem at night when the farm is not
> watched.

	If you used the above technique to 'Fail' all the Que frames,
	then the job would suddenly fail itself, allowing the
	other waitfor jobs to start running.

	Just curious though: are you using 'waitfor' to simulate
	a FIFO queue? If so, did you rule out using rush's FIFO
	scheduling? (eg. 'sched fifo' in the rush.conf file)
	Perhaps that's not what you need, but since it sounds like
	you want the other jobs to continue if this one keeps hanging,
	then I imagine the jobs really shouldn't be dependent on
	each other, and perhaps just FIFO scheduled..

> A solution would be to have something like a TimeOutCommand, that
> calls a script that can take appropriate action (This would be on a per
> frame basis), or even better a general StatusCommand that could be
> called for every frame, or for every job and additional information
> could be passed via environment variables.

	I think this kind of thing is best done as logic in the
	script itself; background the command, and monitor its
	execution time.. if it exceeds the max, the script can
	choose what to do.

> Since the killing of the process is initiated by rush, my custom render
> script can not detect that it was killed because it reached MaxTime.

	Right -- a good reason not to use it in this case,
	and use the above instead, I would think.

> The only solution i can think of right now is to go through every job in
> the queue and parse the log files if there is any MAXTIME entry.

	I once investigated trying to make a 'callback option'
	for maxtime so that when it expires, a script could be
	run to do post-kill logic.. but I soon realized there
	would need to be all kinds of options to do what someone
	would want; run the script BEFORE the kill occurs, or
	AFTER it occurs, or have the script decide whether to
	kill it or not, etc.

	Seemed best to implement such things in the script itself.


-- 
Greg Ercolano, erco@(email surpressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: How to detect and handle Frame MaxTime failures
   Date: Thu, 03 Nov 2011 20:54:18 -0400
Msg# 2147
On 11/03/11 15:58, Greg Ercolano wrote:
> 	I'd suggest instead of using MaxTime, to handle this
> 	specific set of circumstances, you'd probably want to
> 	instead put some logic in your script to handle the
> 	more specific behavior you want.
> 
> 	For instance, I could see having your own "Render Max Time:"
> 	field in the submit form that passes the value to the
> 	render script, which in turn would take this value,
> 	fork()s the render off as a child, and then monitors
> 	the execution time of the render.
> 
> 	This way the script can decide if it should kill the render,
> 	and if so, implement its own logic to modify the job.

    As an actual perl coding example, here's a unix-specific technique
    that defines a function called 'RunCommandMaxTime()' that takes
    two arguments: the command to run, and the max # seconds.

    So calling it is as simple as:

my $cmd     = "yourcommand -arg1 -arg2 ..";  # COMMAND TO RUN
my $maxsecs = 800;                           # HOW MANY SECONDS IS 'TOO LONG'..
RunCommandMaxTime($cmd, $maxsecs);

    What follows is the definition of that function, which you can customize
    to include whatever post-kill logic you want (see '# ADD POST KILL LOGIC HERE').

    You could add this function to the .common.pl file, so that any of the
    submit scripts could use it if you wanted.

--- snip
use POSIX;

# RUN A COMMAND WITH A MAXIMUM TIME
#    Unix only.
#    $1 -- command to run
#    $2 -- maximum number of seconds command should take before being killed
#
sub RunCommandMaxTime($$)
{
    my ($cmd, $maxtime) = @_;
    my $starttime = time();
    my $pid = fork();
    if ( !defined($pid) )       # perl's fork() returns undef on failure, not -1
    {
        # ERROR
        print "ERROR: fork() failed?! $!\n";
        exit(1);
    }
    elsif ( $pid == 0 )
    {
        # CHILD PROCESS
        POSIX::setsid();
        exec($cmd);
        print "ERROR: exec() failed: $!\n";
        exit(1);
    }
    else
    {
        # PARENT -- WATCH CHILD
        my $childpid   = $pid;
        my $exitstatus = 0;
        my $killed     = 0;
        while ( 1 )
        {
            # WATCH THE CHILD PROCESS
            #     See if it finished, and if so, reap.
            #     If it didn't, see if maxtime expired. If so, kill and reap.
            #     Otherwise, keep waiting..
            #
            my $kid = POSIX::waitpid($childpid, WNOHANG);       # see if child finished
            if ( $kid > 0 ) { $exitstatus = $?; last; } # finished? reap + break loop
            if ( $kid < 0 ) { print STDERR "ERROR: waitpid() failed: $!\n"; exit(1); } # no child to wait for? don't loop forever
            # SEE IF MAXTIME EXPIRED
            if ( !$killed && ( time() - $starttime ) > $maxtime )
            {
                print STDERR "\n--- MAXTIME EXPIRED! Killing child..\n";
                kill(-9, $childpid);   # negative signal (-9) sends SIGKILL to the child's *process group*
                $killed = 1;
                # ADD POST KILL LOGIC HERE
            }
            sleep(1);
        }

        # CHILD FINISHED
        if ( $killed ) { print STDERR "--- Render took too long and was killed.\n"; exit(1); }
        print STDERR "Child finished in time. EXITCODE=" . ($exitstatus >> 8) . " (status=$exitstatus)\n";
    }
}
--- snip


PS. If you're instead using windows, you'd have to replace the fork()/exec() stuff
    with the WIN32 equivalent, which in activestate perl is possible with
    'use Win32::Process;' and a combo of Win32::Process::Create() to background
    the child, and Wait() with some number of seconds, and GetExitCode().
    There's actually an example of this in .common.pl

    To handle killing the process, I would stay away from any of the win32 stuff,
    and simply call 'rush -fail $ENV{RUSH_FRAME}' to cause the script to
    commit suicide "cleanly", as the logic for getting that right is tricky
    to do from a script.

    If I decide on vacationing in the sixth circle of hell, I can follow up with
    the WIN32 equivalent code.


   From: Lutz Paelike <lp@(email surpressed)>
Subject: Re: How to detect and handle Frame MaxTime failures
   Date: Fri, 04 Nov 2011 08:31:12 -0400
Msg# 2149
Hey Greg,


> 	I'd suggest instead of using MaxTime, to handle this
> 	specific set of circumstances, you'd probably want to
> 	instead put some logic in your script to handle the
> 	more specific behavior you want.



Ok i will change my script as you suggested.


>    If I decide on vacationing in the sixth circle of hell, I can follow up with
>    the WIN32 equivalent code.

Thanks for your example script.
If i use perl i will join you on your vacation ;)

I will stick to python, these things are nicely encapsulated in the subprocess module.


Cheers,

Lutz Paelike
Pipeline Supervisor
D-Facto-Motion GmbH

   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: How to detect and handle Frame MaxTime failures
   Date: Fri, 04 Nov 2011 12:18:07 -0400
Msg# 2150
On 11/04/11 05:31, Lutz Paelike wrote:
>>    If I decide on vacationing in the sixth circle of hell, I can
>>    follow up with the WIN32 equivalent code.
> 
> Thanks for your example script.
> If i use perl i will join you on your vacation ;)

	Ha, I guess I should have asked you if you were
	using python.

> I will stick to python, these things are nicely encapsulated in the
> subprocess module.

	That's interesting; I guess you could do a non-blocking read
	on the subprocess.Popen() pipe, in which case that would probably
	work OK, because then your read loop wouldn't hang if the program
	stopped outputting data, so it can detect a timeout.

	If you can, post a simplified version of what you come up with.
	If I get a chance, I'll try to post some code that does what
	I describe above.
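
	One way that read loop could look in python, as a sketch only:
	rather than a truly non-blocking read, it lets a helper thread do
	the blocking reads and hands lines over through a queue, so the main
	loop can still check the clock when the program goes silent. The
	name run_with_output_watch and its arguments are made up for this
	example.

```python
import queue
import subprocess
import threading
import time

def run_with_output_watch(cmd, maxsecs, on_line=print):
    """Run cmd (an argument list), pass each output line to on_line, and
    return (exit_code, timed_out). The child is killed after maxsecs seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    lines = queue.Queue()

    def pump():
        # Blocking reads happen in this thread, so the main loop never hangs
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)              # end-of-output marker

    threading.Thread(target=pump, daemon=True).start()
    deadline = time.monotonic() + maxsecs
    timed_out = False
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            timed_out = True
            proc.kill()              # kills only the immediate child
            break
        try:
            line = lines.get(timeout=min(remaining, 1.0))
        except queue.Empty:
            continue                 # no output yet; re-check the deadline
        if line is None:
            break                    # child closed its output
        on_line(line.rstrip("\n"))   # parse render output here to catch errors early
    return proc.wait(), timed_out
```

	Passing your own on_line function lets the script scan the render
	output for errors while it runs. Note that if the child closes its
	output but keeps running, the final wait() is not covered by the
	deadline.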

	I think the above technique could have been done in perl,
	but I didn't investigate non-blocking reads, as I knew waitpid()
	would work.. but that might be easier. It also gives you the option
	to parse the output of the render while it runs, so you can catch
	errors as they happen.

	Be aware when you 'kill' the render, the renderer MIGHT have
	started children, so you want to use a process group to be sure
	to kill not only the immediate child, but all its children too.
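
	In python, a rough unix-only equivalent of the perl function might
	look like the sketch below. start_new_session=True plays the role of
	POSIX::setsid(), and os.killpg() signals the whole group. The function
	name and the (exit_code, killed) return value are choices made for
	this example.

```python
import os
import signal
import subprocess

def run_command_max_time(cmd, maxsecs):
    """Run cmd (an argument list) in its own session; if it runs longer than
    maxsecs seconds, SIGKILL its whole process group. Unix only.
    Returns (exit_code, killed)."""
    # The child calls setsid(), so it leads a new process group
    # whose id equals its pid
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=maxsecs), False
    except subprocess.TimeoutExpired:
        try:
            # Signals the child and any children it started in that group
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass                     # group already gone
        # ADD POST KILL LOGIC HERE
        return proc.wait(), True
```

	A renderer that moves its own children into yet another session or
	process group would escape this kill.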

	For sure 'rush -fail ' + os.environ["RUSH_FRAME"] would clean all
	this up for you, killing your own script as well as the render
	and any of its children. So if you're worried about using kill
	correctly, you could use that instead. (Just be sure that's the
	/last/ thing you do, as your script will probably be unceremoniously
	killed within the next fraction of a second.)

-- 
Greg Ercolano, erco@(email surpressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)