Most people will want the newer 'Chaining Individual Frames' technique, which lets jobs depend on each other at the frame level using the new 'dependon' submit command.
This allows you to create a chain of dependencies, such that some jobs
render frames in parallel, while other jobs wait for individual frames
to finish.
You can do this by making a script that invokes several job submissions;
each time a job is submitted, the jobid(s) are saved, and used in the
'dependon' command in the NEXT submit script. This creates a frame
dependency chain between one job and others. Many jobs can be chained
in this manner.
Jobs can be submitted such that one job waits for the other to
dump. To do this, use the submit script 'waitfor' command to wait
for other jobids.
You can do this by making a script that invokes several job submissions;
each time a job is submitted, the jobid is saved, and used in the 'waitfor'
command in the NEXT submit script. This chains one job to the other. Many
can be chained together in this manner.
The Chaining Job Completion example below shows how the csh eval
command can be used to gather up a jobid for use in another submit script's
waitfor command.
I. Chaining Individual Frames
Jobs can be submitted such that one job's frames wait for the matching
frames of other jobs, using the submit script command 'dependon'.
Example
Here's a typical fg/bg/comp example: a submit script that starts three jobs.
Two renders (fg/bg) run in parallel, and a third (comp) waits, comping each
frame as the matching frames in the fg/bg jobs complete. Note how the csh eval
command is used to gather up jobids for the comp job's dependon command.
#!/bin/csh -f
### SUBMIT SCRIPT -- Create frame dependencies between jobs
# Job #1: FOREGROUND ELEMENT
eval `rush -submit` << EOF
title MYSHOW/FG
ram 250
frames 1-10
command $cwd/render-fg
cpus +any=10@100
logdir $cwd/logs-fg
EOF
if ( $status ) exit 1
set fgjobid = $RUSH_JOBID
echo " FG: setenv RUSH_JOBID $RUSH_JOBID"
# Job #2 -- BACKGROUND ELEMENT
# This job can run in parallel with the foreground,
# so no dependency is defined.
#
eval `rush -submit` << EOF
title MYSHOW/BG
ram 250
frames 1-10
command $cwd/render-bg
cpus +any=10@100
logdir $cwd/logs-bg
EOF
if ( $status ) exit 1
set bgjobid = $RUSH_JOBID
echo " BG: setenv RUSH_JOBID $RUSH_JOBID"
# Job #3 -- COMP
# This job waits for individual frames in FG and BG jobs
# to complete successfully before comping frames.
#
eval `rush -submit` << EOF
title MYSHOW/COMP
ram 250
frames 1-10
command $cwd/render-comp
cpus +any=10@100
logdir $cwd/logs-comp
dependon $fgjobid $bgjobid
EOF
if ( $status ) exit 1
echo "COMP: setenv RUSH_JOBID $RUSH_JOBID"
II. Chaining Job Completion
If you want to chain jobs at the frame level, you probably
want the Chaining Individual Frames section above.
However, if you want one job to wait for other jobs to COMPLETELY
finish before moving on to the next, read on.
#!/bin/csh -f
### SUBMIT SCRIPT -- Chaining Multiple Jobs
# Job #1
eval `rush -submit` << EOF
title MYSHOW/MYRENDER
ram 250
frames 1-10
command $cwd/render-script
cpus +any=10@100
cpus vaio=8@100
autodump donefail
logdir $cwd/logs-1
EOF
if ( $status ) exit 1
# (eval eats the setenv command, so we duplicate it here)
echo "setenv RUSH_JOBID $RUSH_JOBID"
# Job #2 -- this job will wait for the above job to finish
rush -submit << EOF
title MYSHOW/MYCOMP
ram 250
frames 1-10
command $cwd/comp-script
cpus +any=10@100
cpus vaio=8@100
logdir $cwd/logs-2
waitfor $RUSH_JOBID
EOF
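The csh scripts above rely on eval to pick up the jobid, because the submit output is a csh 'setenv' command. A submit script written in Bourne shell cannot eval that line, so it would have to parse it instead. The following is a minimal sketch under that assumption; the function name jobid_from_submit and the sample jobid "tahoe.123" are made up for illustration, and the exact output format of 'rush -submit' is assumed from the echo lines in the examples above.

```shell
#!/bin/sh
# jobid_from_submit: read submit output on stdin and print the jobid from
# the first line of the form "setenv RUSH_JOBID <jobid>".
# (Hypothetical helper; the output format is assumed, not confirmed.)
jobid_from_submit() {
    awk '$1 == "setenv" && $2 == "RUSH_JOBID" { print $3; exit }'
}

# Stand-in for real 'rush -submit' output; "tahoe.123" is a made-up jobid
echo 'setenv RUSH_JOBID tahoe.123' | jobid_from_submit
```

The printed jobid could then be placed in the next job's 'waitfor' or 'dependon' line, just as $RUSH_JOBID is in the csh versions.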
The benefit of batching comes mainly from avoiding the per-frame overhead of loading the entire scene (texture maps, animation files, etc) for each frame.
One technique is to tell the render queue to render on 'tens' (ie. 1-500,10) and have the render script fire off ten frames at a time, using $RUSH_FRAME as the start frame, and ($RUSH_FRAME + 9) as the end frame.
This involves two things: setting a step rate for the frame range in the submit script, and passing this step rate to the render script, so it knows how many frames to batch.
Batching Multiple Frames: Submit Script
#!/bin/csh -f
source $RUSH_DIR/etc/.submit
# NUMBER OF FRAMES TO BATCH
#     Change this value as needed
@ batch = 10
rush -submit << EOF
title BATCH_RENDER
ram 10
frames 1-100,$batch
command $cwd/render-batch $batch
logdir $cwd/logs
cpus va@100 how@100
EOF
exit $status
Batching Multiple Frames: Render Script
#!/bin/csh -f
source $RUSH_DIR/etc/.render
# START/END FRAME FOR BATCHING
@ sfrm = $RUSH_FRAME
@ efrm = ( $sfrm + $argv[1] - 1 )
echo "--- Working on frames $sfrm-$efrm - `date`"
myrender $sfrm,$efrm,1
if ( $status ) exit 1
exit 0
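The render script above computes the end frame as start + batch - 1, which runs past the end of the job whenever the frame count is not a multiple of the batch size (for example 'frames 1-95,10' would ask for frames 91-100 in the last batch). One way to guard against this is to clamp the end frame. The sketch below is written in Bourne shell rather than csh so it can be run on its own; batch_range is a hypothetical helper, and it assumes the submit script also passes the last frame of the job to the render script as an extra argument.

```shell
#!/bin/sh
# batch_range START BATCH LAST
# Prints "START-END" for one batch, clamping END so it never exceeds LAST.
# (Hypothetical helper; LAST must be supplied by the submit script.)
batch_range() {
    start=$1; batch=$2; last=$3
    end=$((start + batch - 1))
    if [ "$end" -gt "$last" ]; then
        end=$last
    fi
    echo "$start-$end"
}

# Example: a job submitted with 'frames 1-95,10'
batch_range 1 10 95     # first batch: prints 1-10
batch_range 91 10 95    # last batch, clamped: prints 91-95
```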
Whenever your render script returns an exit code of 2 (REQUEUE), the frame is requeued, the 'Try' count is incremented (shown in 'rush -lf'), and the frame is executed again.
Rush passes the try count to the render script as the environment variable $RUSH_TRY, which the script can use to act conditionally.
Retrying Frames: Render Script
#!/bin/csh -f
source $RUSH_DIR/etc/.render
echo "--- Working on frame $RUSH_FRAME - `date`"
render /job/MYJOB/MYSHOT/ribs/fg.$RUSH_PADFRAME.rib
if ( $status == 0 ) exit 0      # it worked
# FAILED? RETRY 3 TIMES
if ( $RUSH_TRY < 3 ) exit 2     # retry up to 3 times
exit 1                          # otherwise fail
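The retry decision in the script above can be isolated into a small function, which makes the three outcomes easier to see and to check by hand. This is a sketch in Bourne shell, not part of Rush; retry_exit_code is a hypothetical name, and the exit codes (0 done, 1 fail, 2 requeue) are the ones used in the render script above.

```shell
#!/bin/sh
# retry_exit_code RENDER_STATUS TRY MAX_TRIES
# Prints the exit code the render script should return:
#   0 = done, 1 = fail, 2 = requeue
# (Hypothetical helper; TRY would come from $RUSH_TRY in a real script.)
retry_exit_code() {
    render_status=$1; try=$2; max=$3
    if [ "$render_status" -eq 0 ]; then
        echo 0              # render worked
    elif [ "$try" -lt "$max" ]; then
        echo 2              # failed, but tries remain: requeue
    else
        echo 1              # failed and out of tries
    fi
}

retry_exit_code 0 0 3   # prints 0
retry_exit_code 1 1 3   # prints 2
retry_exit_code 1 3 3   # prints 1
```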
For instance, a 3rd party program may print error messages like 'cannot open file' or 'write error', yet always exit with a code of 0. A savvy render script programmer can use 'egrep' to detect the error message in the frame's log and report the failure back to rush.
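Detecting Render Problems with Grep
#!/bin/csh -f
### RENDER SCRIPT
my_render $RUSH_FRAME     # 'my_render' always returns an exit code of 0,
                          # so check the frame's log for error messages
egrep -s 'cannot open file|write error' $RUSH_LOGFILE
if ( $status == 0 ) exit 1      # egrep exits 0 on a match: report FAIL
exit 0                          # no error messages found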
Grep: An Advanced Example
#!/bin/csh -f
###############################
# R E N D E R   S C R I P T   #
###############################
echo "--- Working on frame $RUSH_FRAME - `date`"
### MAYA RENDER
Render30 -s $RUSH_FRAME -e $RUSH_FRAME -b 1 -proj $1 -rd /jobs/MYSHOW/MYSHOT/images $2
set err = $status
### GREP FOR ERROR MESSAGES
set msg = ""
if ( `grep -s "Texture file" $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Texture File"
if ( `grep -s "Failed to open IFF" $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "IFF Error"
if ( `grep -s "find destination plug" $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Plug Error"
if ( `grep -s "ESEC_J" $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "License Error"
if ( `grep -s "doesn" $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Missing File"
if ( `grep -s "TrenderTesselation" $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Tesselation Error"
if ( `grep -s "Memory exception" $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Memory Error"
if ( `grep -s "post-process stage" $RUSH_LOGFILE ; echo $status` == 0 ) set msg = "Post Process"
### FOUND ONE OF THE ABOVE?
if ( "$msg" != "" ) then
    # MAKE NOTE IN FRAMELIST FOR TD/RENDER WATCHER
    rush -notes ${RUSH_FRAME}:"$msg"
    switch ( "$msg" )
        ### NON-FATAL
        case "License Error":
        case "IFF Error":
        case "Plug Error":
        case "Tesselation Error":
        case "Memory Error":
        case "Post Process":
            echo -- REQUEUE
            exit 2
        ### FATAL
        case "Texture File":
        case "Missing File":
        default:
            echo -- FAIL
            exit 1
    endsw
endif
# NON-SPECIFIC ERROR?
if ( $err != 0 ) then
    rush -notes ${RUSH_FRAME}:"See Logs"
    echo -- FAIL
    exit 1
endif
# NO ERRORS
rush -notes ${RUSH_FRAME}:"OK"
echo -- OK
exit 0
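The classification step in the advanced example (scan the log, then decide between requeue and fail) can be tried out without a renderer. The sketch below does this in Bourne shell against a throwaway log file. classify_log is a hypothetical helper, it uses only a subset of the messages above, and it makes one different design choice: fatal messages are checked first and win, whereas the csh example keeps whichever message matched last.

```shell
#!/bin/sh
# classify_log LOGFILE
# Prints "fail", "requeue" or "ok" depending on which messages appear.
# (Hypothetical helper; message lists are a subset of the example above.)
classify_log() {
    log=$1
    # Fatal messages: retrying will not help
    if grep -q -e 'Texture file' -e 'doesn' "$log"; then
        echo fail
    # Transient messages: worth a requeue
    elif grep -q -e 'Failed to open IFF' -e 'ESEC_J' -e 'Memory exception' "$log"; then
        echo requeue
    else
        echo ok
    fi
}

# Demonstrate on a throwaway log file
tmplog=$(mktemp)
echo 'Error: Memory exception in shader' > "$tmplog"
classify_log "$tmplog"      # prints requeue
echo 'Warning: Texture file missing' >> "$tmplog"
classify_log "$tmplog"      # prints fail
rm -f "$tmplog"
```

A render script could map the three results to exit codes 1, 2 and 0 respectively.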
You can embed 'rush -notes' commands into your render script to alter the 'notes' field for the rendering frame, eg:
rush -notes ${RUSH_FRAME}:'Missing file'
Frame notes are cleared each time a frame begins rendering, so there's no need to run a rush command to clear the frame notes in your render script. In fact, doing so is discouraged, for the reason given in the following warning.
Warning: Each execution of 'rush -notes' opens a TCP connection
to the job server daemon. Invoking 'rush' commands on a per frame
basis is unwise (except under error conditions): if many connections
occur all at once, they impose a large TCP load on the job server
daemon, slowing the daemon's response critically.
This happens especially if your render times are short and you are rendering on many cpus. Embed 'rush' commands in render scripts only under error conditions (ie. infrequently), so as to lessen the possibility of multiple concurrent TCP connections.
Here's an example showing a render script that makes use of the NOTES field to report helpful errors to the user:
% cat render_me
#!/bin/csh -f
echo "--- Working on frame $RUSH_FRAME - `date`"
### YOUR RENDER COMMAND(S) HERE
particle $DATA/files/stars-$RUSH_PADFRAME.par
set err = $status
### CHECK FOR MISSING FILES
### (egrep exits 0 when it finds a match)
egrep -i -s no.such.file.or.directory $RUSH_LOGFILE
if ( $status == 0 ) rush -notes ${RUSH_FRAME}:'Missing file'
### CHECK FOR CORE DUMPS
egrep -i -s core.dumped $RUSH_LOGFILE
if ( $status == 0 ) rush -notes ${RUSH_FRAME}:'Core dumped'
### CHECK FOR LICENSE ERRORS
egrep -i -s no.available.licenses $RUSH_LOGFILE
if ( $status == 0 ) then
    rush -notes ${RUSH_FRAME}:'License error'
    sleep 10
endif
### NON-SPECIFIC ERRORS
if ( $err ) then
    rush -notes ${RUSH_FRAME}:'?'
    exit 1
endif
exit 0
% rush -lf
STAT FRAME TRY HOSTNAME       PID   START          ELAPSED  NOTES
Fail 0030  2   vaio           20338 02/27,14:41:22 00:01:03 Missing file
Fail 0031  2   vaio           20339 02/27,14:41:22 00:01:03 Missing file
Fail 0032  2   vaio           20340 02/27,14:41:22 00:01:03 Missing file
Run  0033  9   vaio           20365 02/27,14:55:25 00:00:45 License error
Done 0034  9   vaio           20367 02/27,14:41:25 00:01:04 -
Done 0035  8   vaio           20369 02/27,14:41:25 00:01:04 -
Done 0036  8   tahoe          20389 02/27,14:41:29 00:01:03 -
Done 0037  8   tahoe          20394 02/27,14:41:29 00:01:03 -
Done 0038  8   tahoe          20396 02/27,14:41:29 00:01:03 -
Done 0039  8   superior       20413 02/27,14:41:32 00:01:03 -
Done 0040  8   superior       20423 02/27,14:41:32 00:01:03 -
Done 0041  8   erie           20425 02/27,14:41:32 00:00:08 Core dumped
Done 0042  8   rotwang.erco.c 12662 02/27,14:41:32 00:01:06 -
Done 0043  8   rotwang.erco.c 12663 02/27,14:41:32 00:01:06 -
Fail 0044  8   rotwang.erco.c 12664 02/27,14:55:35 00:00:55 Missing file
Fail 0045  8   ontario        20434 02/27,14:55:35 00:00:55 Missing file
Fail 0046  8   ontario        20441 02/27,14:55:35 00:00:55 Missing file
To pause the job for a short period, add the following to your
render script after detecting a license error:
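# LICENSE ERROR? PAUSE JOB FOR 5 MINUTES
#    ($license_error is a placeholder: set it to 1 earlier in the script
#     when your license check finds an error, otherwise to 0)
if ( $license_error ) then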
# If not already paused, pause and restart in 5 mins.
rush -lj | egrep '^Pause.*'${RUSH_JOBID}
if ( $status != 0 ) then
rush -pause
rush -notes ${RUSH_FRAME}:'Pause job/lic error'
( sleep 300 ; rush -cont ) >& /dev/null < /dev/null &
endif
exit 2
endif
There is a possibility of a race condition between the check for
a paused job, and the backgrounding of the 'rush -cont'. But in such
a case both timers will expire at about the same time, so it shouldn't
be of much concern.