From: Dylan Penhale <dylan@(email surpressed).au>
Subject: Requeing failed frames from a batch
   Date: Thu, 09 Mar 2006 19:04:46 -0500
Msg# 1252
View Complete Thread (4 articles) | All Threads
I know this has been covered before but I don't seem able to find it for the life of me.

When submitting batches of frames to render in Maya, I understand that the batch is under the control of mayabatch, with Rush waiting only for an exit code. So if mayabatch fails a frame for whatever reason, Rush doesn't know which frame failed; it only knows there was a problem because of the non-zero exit code.* We can, however, assume that once a frame in the batch fails, all the frames after it fail too (it wouldn't fail a frame in the middle and then go on to render the remaining frames successfully).

*That said, we sometimes see Maya exit with a code of zero when no cameras are set in the scene and no image has been rendered. Technically there is no error, so we parse the log file for this condition and fail the batch.
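
A minimal sketch of that zero-exit-code check, in Python rather than the Perl of a typical Rush render script. The log pattern is a hypothetical placeholder; the real message text depends on the Maya version and renderer, so substitute whatever your logs actually show.

```python
import re

# Hypothetical pattern: replace with the message your Maya logs really print.
FATAL_LOG_PATTERNS = [
    re.compile(r"no renderable cameras", re.IGNORECASE),
]

def batch_failed(exit_code, log_text, patterns=FATAL_LOG_PATTERNS):
    """Return True if the batch should be treated as failed.

    A non-zero exit code always counts as a failure. A zero exit code
    still counts as a failure if the log contains a known fatal message.
    """
    if exit_code != 0:
        return True
    return any(p.search(log_text) for p in patterns)
```

The render script would call this after mayabatch exits and before deciding what exit code to hand back to Rush.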

I need to find a way to re-submit/re-queue failed frames when submitting in batches.

My thoughts:

o monitor the logs and append all successful frames to a temp file, then calculate the missing frames and resubmit them as a separate job. I'm not sure how this could be done, though. I thought perhaps a waitfor job submitted at the same time, but if the batch job fails then the waitfor job would never get to run.

o monitor the logs as they are being written and log failed frames to a temp file. Then append a "check" frame at the end of the job to do the re-queue function acting on that file. Sounds tricky too.

o upon completion of each batch, check the logs for output lines and do a file size check on each frame. If the number of frames in the log doesn't equal the number of frames in the batch, re-queue from the failed frame. This is roughly what we do already: we check the frames with an image size check, but we still need the re-queue step. This method only checks frames that are reported in the log file, though, so it doesn't know about possible missing frames. It also doesn't deal with a job that hangs.
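
One way to cover the "missing frames" gap in the last idea is to walk the expected frame range rather than the log, and test each image on disk. The sketch below is Python, not the Perl of a typical Rush render script, and the names `first_bad_frame`, `path_for_frame` and `min_bytes` are made up for illustration. It does not address hung jobs.

```python
import os

def first_bad_frame(sfrm, efrm, path_for_frame, min_bytes=1):
    """Return the first frame in sfrm..efrm whose image is missing or
    smaller than min_bytes, or None if every frame looks good.

    path_for_frame is a caller-supplied function mapping a frame
    number to the image path the renderer should have written.
    """
    for frame in range(sfrm, efrm + 1):
        path = path_for_frame(frame)
        try:
            size = os.path.getsize(path)
        except OSError:
            return frame          # image was never written
        if size < min_bytes:
            return frame          # image is suspiciously small
    return None
```

Because the batch is assumed to fail from one frame onwards, knowing the first bad frame is enough to know the whole range that needs re-rendering.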

It's easy to re-queue the whole batch, but if only one frame has failed that re-renders all the good frames too. I wonder if anyone is running similar checks on batches, or has a clever way of dealing with failed batches?


_________________________________________

Dylan Penhale
Systems Administrator
Fuel International




   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: Requeing failed frames from a batch
   Date: Thu, 09 Mar 2006 23:43:17 -0500
Msg# 1253
> I need to find a way to re-submit/re-queue failed frames when submitting in batches.

	I might be missing something.

	Let's say after rendering a batch of 20-29, frame 25 fails,
	and you have a way to detect this by grepping the logs.

	So you'll know for sure that 20-24 are OK, and 25-29 need to be
	re-rendered.

	If you just exit the script with exit(2), rush will try to re-render
	the entire batch of 20-29.

	Are you saying that you want it to:

		a) Assume the problem was intermittent (ie. maya crashed due to a bug),
		   and that 25-29 will probably render just fine if we re-invoke maya
		   on just that frame range?
	..or..

		b) Assume the problem was caused by the user (a bad scene file)
		   and we should just append frames 25-29 to a text file somewhere
		   that we can later use to submit a job to fix these frames
		   after the user has intervened to fix the scenefile.

	If a) then yes, just re-invoke maya with the new frame range,
	and the user won't even have to know there was a failure.

	If b) then I would think you could do any of a number of things:

		o Have the render script create 0000.fix frames in the image dir
		  that a later 'fix job' could look for, and just render those frames

		o Append the bad frames to a .txt file

		o Have a 'jobdonecommand' that looks for bad frames, and submits
		  a 'fix job' in the pause state, and emails the user to fix the
		  problem, then unpause the fix job to run the fix frames..
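
	A rough sketch of the first option (marker files), in Python rather than
	the Perl of a typical render script. It assumes the markers are named
	with the four-digit frame number plus ".fix" (e.g. 0025.fix); the
	function names are hypothetical.

```python
import os
import re

# Matches marker files such as "0025.fix" (assumed naming convention).
FIX_RE = re.compile(r"^(\d{4})\.fix$")

def mark_fix_frames(image_dir, frames):
    """Create an empty NNNN.fix marker in image_dir for each bad frame."""
    for frame in frames:
        open(os.path.join(image_dir, "%04d.fix" % frame), "w").close()

def find_fix_frames(image_dir):
    """Return the sorted frame numbers that have a .fix marker."""
    found = []
    for name in os.listdir(image_dir):
        m = FIX_RE.match(name)
        if m:
            found.append(int(m.group(1)))
    return sorted(found)
```

	The render script would call mark_fix_frames() when it detects a failure,
	and the later 'fix job' would call find_fix_frames() to build its frame list.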

> It's easy to re-queue the whole batch but if only one frame has failed it's re-rendering all the good frames again.

	If you have the logic in the render script to know which frames
	need to be re-rendered, and are confident that just re-rendering those
	is all that's needed to get them to render OK, then just re-invoke maya
	with just the fix range.

	Or, if you just want to tell the user which frames are bad, you can
	use 'rush -notes $ENV{RUSH_FRAME}:"BAD FRAMES: $bad_start - $bad_end"'
	so that the "Frames" report shows a message telling which frames were bad.
	(this is safe to do when errors occur. Using 'rush -notes' isn't recommended
	if run on EVERY frame.. that's too much load to the job server if there are
	100's of render nodes. But logging (uncommon) error conditions is OK..)
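
	For a script not written in Perl, the same 'rush -notes' call can be
	built as an argument list. This Python sketch only constructs the
	command; the function name is hypothetical, and RUSH_FRAME is read from
	the environment as in the one-liner above.

```python
import os

def bad_frames_note_cmd(bad_start, bad_end, env=os.environ):
    """Build the argv for 'rush -notes' flagging a range of bad frames.

    The note is attached to the frame named by RUSH_FRAME, which is
    expected to be set in the render script's environment.
    """
    frame = env["RUSH_FRAME"]
    note = "%s:BAD FRAMES: %d - %d" % (frame, bad_start, bad_end)
    return ["rush", "-notes", note]
```

	Pass the result to subprocess.call() on the error path only, for the
	load reasons given above.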

--
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)

   From: Dylan Penhale <dylanpenhale@(email surpressed)>
Subject: RE: Requeing failed frames from a batch
   Date: Fri, 10 Mar 2006 00:52:43 -0500
Msg# 1254
Cool 

I think option A is what we want. Generally we have found that if a batch
is going to fail from a scene file error, it will either not render at all
or start failing at a particular point in the frame range and keep failing
from then on. That being the case, we would just retry the batch from the
failed frame onwards.

The system we have come up with then is as follows. 

	O Rush submits the batch to mayabatch for rendering. 
	
	O Rush waits for the mayabatch exit and then checks the frames 
	(and frame sizes in our case) according to the log file.
	
	O If it finds that the number of frames written in the log does not
	equal the number of frames in the batch then reset the $opt{sfrm} value
	to the last good frame +1 and retry. This will effectively retry the
	batch from the failed frame till the end of the batch ($opt{efrm}).

I think this is all we need. We currently have the failures going out as
emails to sysadmins and job owners, so I think we could tweak that slightly
so they only get the error if the 3rd retry (or however many retries are
set) is reached.
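
The retry scheme above might look roughly like this. It is a Python sketch, not the Perl of our actual render script. `render_range` and `first_bad_frame` are hypothetical callables standing in for the mayabatch invocation and the frame check, and the return value of 1 assumes Rush treats exit code 1 as a failed frame.

```python
def render_with_retries(sfrm, efrm, render_range, first_bad_frame, max_tries=3):
    """Render sfrm..efrm, restarting from the first bad frame each time.

    render_range(start, end) runs the renderer on that range.
    first_bad_frame(start, end) returns the first bad frame or None.
    Returns 0 if all frames end up good, 1 if max_tries is exhausted.
    """
    start = sfrm
    for attempt in range(1, max_tries + 1):
        render_range(start, efrm)
        bad = first_bad_frame(start, efrm)
        if bad is None:
            return 0              # every frame checked out
        start = bad               # last good frame + 1
    return 1                      # still failing: report the error
```

Only the final non-zero return would trigger the email to sysadmins and job owners.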




   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: Requeing failed frames from a batch
   Date: Fri, 10 Mar 2006 07:56:38 -0500
Msg# 1255
Dylan Penhale wrote:
> The system we have come up with then is as follows.
>
> 	O Rush submits the batch to mayabatch for rendering.
>
> 	O Rush waits for the mayabatch exit and then checks the frames
> 	(and frame sizes in our case) according to the log file.
>
> 	O If it finds that the number of frames written in the log does not
> 	equal the number of frames in the batch then reset $opt{sfrm} value
> 	to the last good frame +1 and retry..

	Yes, exactly.

--
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)