You can do this by making a script that invokes several job submissions; each time a job is submitted, the jobid is saved, and used in the 'waitfor' command in the NEXT submit script. This chains one job to the other. Many can be chained together in this manner.
The following simple example shows how the csh eval command can be used to capture a jobid for use in another submit script's waitfor command.
#!/bin/csh -f

### SUBMIT SCRIPT -- Chaining Multiple Jobs

# Job #1
eval `rush -submit` << EOF
title      MYSHOW/RENDER
ram        250
frames     1-10
command    $cwd/render-script
cpus       +any=10@100
cpus       vaio=8@100
autodump   donefail
logdir     $cwd/logs-1
EOF
if ( $status ) exit 1

# (eval eats the setenv command, so we duplicate it here)
echo "setenv RUSH_JOBID $RUSH_JOBID"

# Job #2 -- this job will wait for the above job to finish
rush -submit << EOF
title      MYSHOW/COMP
ram        250
frames     1-10
command    $cwd/comp-script
cpus       +any=10@100
cpus       vaio=8@100
logdir     $cwd/logs-2
waitfor    $RUSH_JOBID
EOF
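The capture step in the script above can be tried on its own. This is a minimal sketch, assuming only what the script's own comment implies: that 'rush -submit' prints a setenv command, which eval then executes in the current shell. Here 'echo' stands in for 'rush -submit', and the jobid value is made up.

```csh
#!/bin/csh -f
# Sketch only: 'echo' stands in for 'rush -submit', and the jobid
# value is made up for illustration.
eval `echo "setenv RUSH_JOBID example.123"`

# The setenv ran in this shell, so the variable is now available
# for a later job's 'waitfor' line.
echo "waitfor    $RUSH_JOBID"
```

Without the eval, the setenv text would only be printed, and $RUSH_JOBID would remain unset for the second submission.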
The benefit comes mainly from avoiding the per-frame overhead of loading the entire scene (texture maps, animation files, etc.) for each frame.
One technique is to tell the render queue to render on 'tens' (i.e. 1-500,10) and have the render script fire off ten frames at a time, using $RUSH_FRAME as the start frame and ($RUSH_FRAME + 9) as the end frame.
This involves two things: setting a step rate for the frame range in the submit script, and passing this step rate to the render script so it knows how many frames to batch.
Batching Multiple Frames: Submit Script

#!/bin/csh -f
source $RUSH_DIR/etc/.submit

# NUMBER OF FRAMES TO BATCH
#    Change this value as needed
@ batch = 10

rush -submit << EOF
title      BATCH_RENDER
ram        10
frames     1-100,$batch
command    $cwd/render-batch $batch
logdir     $cwd/logs
cpus       va@100 how@100
EOF
exit $status
Batching Multiple Frames: Render Script

#!/bin/csh -f
source $RUSH_DIR/etc/.render

# START/END FRAME FOR BATCHING
@ sfrm = $RUSH_FRAME
@ efrm = ( $sfrm + $argv[1] - 1 )

echo "--- Working on frames $sfrm-$efrm - `date`"
myrender $sfrm,$efrm,1
if ( $status ) exit 1
exit 0
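One caveat with the end-frame arithmetic: if the frame range is not an even multiple of the batch size (e.g. 1-95,10), the last batch starts at frame 91 and the computed end frame of 100 runs past the end of the range. A minimal sketch of clamping the end frame follows. The values are hard-coded so the fragment runs on its own; in a real render script the start frame would come from $RUSH_FRAME, and the last frame would have to be supplied somehow, for instance as an extra argument on the submit script's 'command' line (that argument is an assumption, not part of the example above).

```csh
#!/bin/csh -f
# Sketch only: values are hard-coded for illustration.
# sfrm  = start frame of this batch (normally $RUSH_FRAME)
# batch = number of frames per batch
# last  = last frame in the job's range (hypothetical extra argument)
@ sfrm  = 91
@ batch = 10
@ last  = 95

# Unclamped end frame would be 100, which overshoots the range
@ efrm = ( $sfrm + $batch - 1 )

# Clamp to the last frame of the range
if ( $efrm > $last ) @ efrm = $last

echo "--- Working on frames $sfrm-$efrm"
```

Whether clamping is needed depends on the renderer: some tolerate being asked for frames beyond the end of the scene, others report an error.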
Whenever your render script returns an exit code of 2 (REQUEUE), the frame is requeued, the 'Try' count is incremented (shown in 'rush -lf'), and the frame is executed again.
Rush passes the retry count to the render script in the environment variable $RUSH_TRY, which the script can use to act conditionally.
Retrying Frames: Render Script

#!/bin/csh -f
source $RUSH_DIR/etc/.render

echo "--- Working on frame $RUSH_FRAME - `date`"
render /job/MYJOB/MYSHOT/ribs/fg.$RUSH_PADFRAME.rib
if ( $status == 0 ) exit 0      # it worked

# FAILED? RETRY 3 TIMES
if ( $RUSH_TRY < 3 ) exit 2     # retry up to 3 times
exit 1                          # otherwise fail
For instance, consider a situation where a third-party program outputs error messages like 'cannot open file' or 'write error', but always returns an exit code of 0. A savvy render script programmer can use 'egrep' to detect the error messages in the log and report the failure back to Rush.
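Detecting Render Problems with Grep

#!/bin/csh -f
# RENDER SCRIPT
my_render $RUSH_FRAME     # 'my_render' always returns an exit code of 0,
                          # so search the log for error messages instead
egrep -s 'cannot open file|write error' $RUSH_LOGFILE
if ( $status == 0 ) exit 1    # egrep found an error message: fail the frame
exit 0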
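This technique depends on egrep's exit status, which is easy to get backwards: egrep exits 0 when at least one line matches and 1 when nothing matches. The sketch below demonstrates both cases. It is illustrative only; a temporary file stands in for $RUSH_LOGFILE, and the log messages are made up.

```csh
#!/bin/csh -f
# Sketch only: a temporary file stands in for $RUSH_LOGFILE.
set log = /tmp/grep-sketch.$$

# A log containing one of the error messages: egrep exits 0
echo "my_render: cannot open file" > $log
egrep 'cannot open file|write error' $log >& /dev/null
set bad = $status

# A clean log: egrep exits 1
echo "my_render: frame written" > $log
egrep 'cannot open file|write error' $log >& /dev/null
set clean = $status

echo "log with error: status=$bad, clean log: status=$clean"
rm -f $log
```

A render script should therefore treat a status of 0 from egrep as the failure case.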
You can embed 'rush -notes' commands into your render script to alter the 'notes' field for the rendering frame, e.g. rush -notes ${RUSH_FRAME}:'Missing file'
Frame notes are cleared each time a frame begins rendering, so there's no need to add a rush command to your render script to clear them. In fact, doing so is discouraged for the reason given in the following warning.
Warning: Each execution of 'rush -notes' opens a TCP connection to the job server daemon. Invoking 'rush' commands on a per-frame basis is unwise (except under error conditions), because many connections arriving at once impose a large TCP load on the job server daemon and can slow its response critically.
This happens especially if your render times are short and you are rendering on many cpus. You are therefore encouraged to embed 'rush' commands in render scripts only under error conditions (i.e. infrequently), so as to lessen the possibility of multiple concurrent TCP connections.
Here's an example of a render script that uses the NOTES field to report helpful errors to the user:
% cat render_me
#!/bin/csh -f
echo "--- Working on frame $RUSH_FRAME - `date`"
### YOUR RENDER COMMAND(S) HERE
particle $DATA/files/stars-$RUSH_PADFRAME.par
set err = $status
### CHECK FOR MISSING FILES
###    (egrep returns 0 when the pattern is found)
egrep -i -s no.such.file.or.directory $RUSH_LOGFILE
if ( $status == 0 ) rush -notes ${RUSH_FRAME}:'Missing file'
### CHECK FOR CORE DUMPS
egrep -i -s core.dumped $RUSH_LOGFILE
if ( $status == 0 ) rush -notes ${RUSH_FRAME}:'Core dumped'
### CHECK FOR LICENSE ERRORS
egrep -i -s no.available.licenses $RUSH_LOGFILE
if ( $status == 0 ) then
    rush -notes ${RUSH_FRAME}:'License error'
    sleep 10
endif
### NON-SPECIFIC ERRORS
if ( $err ) then
    rush -notes ${RUSH_FRAME}:'?'
    exit 1
endif
exit 0
% rush -lf
STAT FRAME TRY HOSTNAME       PID   START          ELAPSED  NOTES
Fail 0030  2   vaio           20338 02/27,14:41:22 00:01:03 Missing file
Fail 0031  2   vaio           20339 02/27,14:41:22 00:01:03 Missing file
Fail 0032  2   vaio           20340 02/27,14:41:22 00:01:03 Missing file
Run  0033  9   vaio           20365 02/27,14:55:25 00:00:45 License error
Done 0034  9   vaio           20367 02/27,14:41:25 00:01:04 -
Done 0035  8   vaio           20369 02/27,14:41:25 00:01:04 -
Done 0036  8   tahoe          20389 02/27,14:41:29 00:01:03 -
Done 0037  8   tahoe          20394 02/27,14:41:29 00:01:03 -
Done 0038  8   tahoe          20396 02/27,14:41:29 00:01:03 -
Done 0039  8   superior       20413 02/27,14:41:32 00:01:03 -
Done 0040  8   superior       20423 02/27,14:41:32 00:01:03 -
Done 0041  8   erie           20425 02/27,14:41:32 00:00:08 Core dumped
Done 0042  8   rotwang.erco.c 12662 02/27,14:41:32 00:01:06 -
Done 0043  8   rotwang.erco.c 12663 02/27,14:41:32 00:01:06 -
Fail 0044  8   rotwang.erco.c 12664 02/27,14:55:35 00:00:55 Missing file
Fail 0045  8   ontario        20434 02/27,14:55:35 00:00:55 Missing file
Fail 0046  8   ontario        20441 02/27,14:55:35 00:00:55 Missing file
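The render script above can invoke 'rush -notes' more than once for a single frame when several patterns match. In keeping with the warning about TCP load, one option is to collect the note text in a variable and make at most one 'rush -notes' call. The following is a minimal sketch of that idea, not a tested Rush recipe: a temporary file stands in for $RUSH_LOGFILE, the log message is made up, and an 'echo' stands in for the actual 'rush -notes' call.

```csh
#!/bin/csh -f
# Sketch only: a temporary file stands in for $RUSH_LOGFILE, and the
# final 'echo' stands in for the single 'rush -notes' call.
set log = /tmp/notes-sketch.$$
echo "particle: No such file or directory" > $log

# Check each pattern, remembering only one note
set note = ""
egrep -i core.dumped $log >& /dev/null
if ( $status == 0 ) set note = "Core dumped"
egrep -i no.such.file.or.directory $log >& /dev/null
if ( $status == 0 ) set note = "Missing file"

# One connection at most, and only when an error was found.
# In a real render script: rush -notes ${RUSH_FRAME}:"$note"
if ( "$note" != "" ) echo "note: $note"
rm -f $log
```

When several patterns match, the last assignment wins, so the checks should be ordered from least to most important.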
To pause the job for a short period, add the following to your
render script after detecting a license error:
# LICENSE ERROR? PAUSE JOB FOR 5 MINUTES
egrep -i -s no.available.licenses $RUSH_LOGFILE
if ( $status == 0 ) then
    # If not already paused, pause and restart in 5 mins.
    rush -lj | egrep '^Pause.*'${RUSH_JOBID}
    if ( $status != 0 ) then
        rush -pause
        rush -notes ${RUSH_FRAME}:'Pause job/lic error'
        ( sleep 300 ; rush -cont ) >& /dev/null < /dev/null &
    endif
    exit 2
endif
There is a possibility of a race condition between the check for a paused job and the backgrounding of the 'rush -cont': two frames may both find the job unpaused and each start a timer. In such a case both timers will expire at about the same time, so it shouldn't be of much concern.