From: Mat X <info@matx.ca>
Subject: Nuke GenArts Sapphire renders failing
   Date: Mon, 07 Feb 2011 22:53:30 -0800
Msg# 2010
I wanted to give a heads up to anyone else having issues with GenArts Sapphire (I know of at least one other facility) and failing frames in Nuke.

The solution, for those not wanting to read through a long rambling post, is to point the Nuke disk cache at the Rush temp dir, and everyone lives happily ever after.

For the long story version read on....

I ran into this really weird issue when we upgraded our farm to Mac OS X 10.6.4 and our GenArts Sapphire Nuke renders started failing.

We had upgraded to Sapphire v5 previously, so I didn't think that was the issue, but the errors were a mixture of "plugin not installed", "unknown plugin", "unknown command" and "corrupt nuke script".

Of course I contacted The Foundry and GenArts support, but they were stumped and could not really reproduce the errors (though The Foundry did release new Nuke versions that supposedly fixed some Sapphire issues, just not my failing frames).

What I tried:

I reinstalled Sapphire.

	- seemed to work, but would start failing again soon enough

I copied the Sapphire plugin bundle into the Nuke built-in plugins folder

	- seemed to work, but would start failing again soon enough

I set the SAPPHIRE_OFX_DIR and RLM_LICENSE variables in the submit script and moved the Sapphire bundle properly to our central plugin fileserver (a sketch of these settings follows below)

	- seemed to work, but would start failing again soon enough


In conclusion: most of my solutions "seemed to work", but would start failing again soon enough.
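
For reference, the environment settings mentioned in the third item might look something like this in the submit script. This is only a sketch; the path and the license server below are placeholders, not our actual values:

    # Placeholder values only -- point Sapphire at the central OFX bundle and
    # at the RLM license server ("port@host" format); adjust for your site.
    $ENV{SAPPHIRE_OFX_DIR} = '/Volumes/plugins/GenArts';    # central plugin fileserver (hypothetical path)
    $ENV{RLM_LICENSE}      = '5053@licenseserver';          # RLM default port @ hypothetical license host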

Then I noticed an artist cleaning out his local disk cache because his local renders were failing. So, on a hunch, I wrote a simple script to clear the local disk cache on the render nodes and set it up with a Rush submit-generic so the artists could get their renders to stop failing frames. And it worked: when a render started failing frames, they would run the submit-generic script and the renders would work again.

The problem was that it was a manual procedure, and the artists did not find it simple enough. Fair enough, it was a workaround. But I couldn't automate it, since if one person cleared the cache on a node while other renders were running, those renders would fail their frames too. I did not want to set it up as a pre- or post-render action for that reason.
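
A minimal sketch of that kind of cache-clearing script is below. The cache path is an assumption (use whatever your farm's Nuke disk cache is actually set to), and it assumes submit-generic runs it once per render node:

    # Sketch only -- clear the local Nuke disk cache on this render node.
    use File::Path qw(remove_tree);
    my $cache = '/var/tmp/nuke-diskcache';          # hypothetical local cache directory
    if ( -d $cache ) {
        remove_tree($cache, { keep_root => 1 });    # empty the cache but keep the directory itself
        print "Cleared Nuke disk cache on " . `hostname`;
    }
    exit(0);    # exit 0 so the submit-generic 'frame' reports Done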

The other solution was to go back to rendering as unique users, instead of forcing renders to run as one user (set in rush.conf). But I had gotten Linux, Windows and Mac renders all working as the same user, so I didn't want to change that now.

The best solution to all this was Greg Ercolano's idea to tie the Nuke temp directory to the Rush temp directory. Each frame Rush launches gets its own temp dir, so that is a handy place to stash the Nuke disk cache as well, and since Rush cleans it up after it's done, there is no need to run cleanup scripts afterwards.


   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Nuke GenArts Sapphire renders failing
   Date: Tue, 08 Feb 2011 10:43:10 -0500
Msg# 2011
Hi Mat,

    Right -- what this comes down to is adding this single line:

        if ( defined($ENV{RUSH_TMPDIR}) ) { $ENV{NUKE_TEMP_DIR} = $ENV{RUSH_TMPDIR}; }

    ..to the top of your submit-nuke.pl script.

    This tells nuke to use the temp directory rush creates for each frame
    (and removes after it's done) so that the render can have a unique directory
    for all temp files, including its caches.

    The next release of Rush will have this setting -- I advise it not just for
    Sapphire plugins, but for Nuke in general.

    I'd suggest putting the above setting below all the nuke environment variable settings, eg:


..
#########################################################
### NUKE SPECIFIC VARIABLES -- CUSTOMIZE AS NECESSARY ###
### See Nuke's documentation for more info.           ###
#########################################################
if ( $G::iswindows )
{
   ..stuff..
}
elsif ( $G::ismac )
{
   ..stuff..
}
elsif ( $G::islinux )
{
   ..stuff..
}

if ( defined($ENV{RUSH_TMPDIR}) ) { $ENV{NUKE_TEMP_DIR} = $ENV{RUSH_TMPDIR}; }    # <-- PUT IT HERE




-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed) ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Craig Allison <craigallison@(email suppressed)>
Subject: < MinTime then ReQueue
   Date: Wed, 16 Feb 2011 12:03:56 -0500
Msg# 2019
Hey Greg

For the last couple of years I've been using Rush to create standardised production QuickTimes from rendered frames, email the relevant groups, publish frames, move data around the network etc, and it's been working really well overall. But I'm getting a recurring problem with the initial render where it immediately moves to state "Done" without the render ever happening; when I click "Que" the job then renders properly without complaint.

As the time elapsed on these problem jobs shows as 00:00:00, I thought there might be a way of saying: if job time < 00:00:01 then ReQueue?

Regards

Craig

Craig Allison
Digital Systems & I/O Manager
The Senate Visual Effects
Twickenham Film Studios
St. Margarets
Middlesex
TW1 2AW

+44208 607 8866
craigallison@(email suppressed)
skype: craig_9000




   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: < MinTime then ReQueue
   Date: Wed, 16 Feb 2011 12:19:05 -0500
Msg# 2020
Craig Allison wrote:
> For the last couple of years I've been using Rush to create standardised
> production QuickTimes from rendered frames, email the relevant groups,
> publish frames, move data around the network etc, and it's been working
> really well overall. But I'm getting a recurring problem with the initial
> render where it immediately moves to state "Done" without the render ever
> happening; when I click "Que" the job then renders properly without complaint.

	Hmm, can you include some more info.. when this happens, paste me:

		1) The 'Frames' report for the first time it says 'Done' (but no render).
		   I'd like to see what machine it picked up on where it quickly became "done".

		1a) Is a frame log generated for that frame? If so, paste that here too.

		2) The 'Jobs Full' report for this job.

		3) The script that submits the job, and the script (if separate) that renders it.

		   It's possible the script is submitting the job in such a way
		   that the frame is forced to start in the 'Done' state on submit.
		   (It is possible to do this)

		4) The rushd.log from the machine that took the render the first time
		   (where the render time is 00:00:00)

> As the time elapsed on these problem jobs shows as 00:00:00, I thought
> there might be a way of saying: if job time < 00:00:01 then ReQueue?

	You could make a done command that checks this and requeues it,
	but before you try covering up the problem, let's first try to
	determine the cause.

-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed) ext.23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: < MinTime then ReQueue
   Date: Thu, 17 Feb 2011 12:33:34 -0800
Msg# 2022
Craig has followed up with me offline on this; seems it might just be
a small omission in the script he's working with.

The script writes out a shake file then submits a shake job to render it.

The problem just might be an issue of a missing close() in the script;
after writing the shake file it submits the job. But without the close(),
if the job picks up quickly, shake will see an empty file and finish
immediately, with the frame marked "Done".

The intermittent behavior seems due to how quickly the job picks up;
if it picks up quickly, shake sees an empty file. But if it takes
an extra second or two to pick up, the submit script finishes executing,
automatically closing the file, flushing it to disk.. then when the job
kicks in, shake reads the proper file.

This would also explain why re-que'ing the frame always renders successfully.

Adding a close() to the custom script will likely fix the problem.
Otherwise, if there's still trouble, I'd suggest experimenting with sync(1)
and/or fsync(2) to ensure the OS commits the file to the remote server before
continuing. (But that really shouldn't be necessary.)
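
For illustration, the pattern being described is roughly the following; the filename, the script text, and the submit command are placeholders, not the actual production script:

    # Write the shake file, close() it, THEN submit. Without the explicit
    # close(), the data can still be sitting in perl's output buffer when a
    # fast render node picks up the frame and reads an empty (or partial) file.
    my $shakefile  = '/jobs/show/shot/comp_v01.shk';     # placeholder path
    my $shake_text = "..generated shake commands..\n";   # placeholder for the real script text
    open(my $shk, '>', $shakefile) or die "can't write $shakefile: $!";
    print $shk $shake_text;
    close($shk) or die "close $shakefile failed: $!";    # <-- the missing close(); flushes the file to disk
    system("perl ./submit-shake.pl $shakefile") == 0     # submit only after the file is safely on disk
        or die "submit failed: $?";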

   From: Craig Allison <craigallison@(email suppressed)>
Subject: Re: < MinTime then ReQueue
   Date: Fri, 18 Feb 2011 07:46:14 -0500
Msg# 2023
I can confirm that the close() call has fixed the issue; I haven't had a problem since.

Awesome work once again Greg!

Thank you

Craig


Craig Allison
Digital Systems & I/O Manager
The Senate Visual Effects
Twickenham Film Studios
St. Margarets
Middlesex
TW1 2AW

+44208 607 8866
craigallison@(email suppressed)
skype: craig_9000



