AutoDump      Dump job on completion
Criteria      Criteria for matching hosts
Command       Render script to execute
Cpus          Hosts (or hostgroups) to use for rendering
DependOn
DoneCommand   Command to run when job done
DoneMail      Send mail when job done
Frames        Frame ranges to render
LogDir        Directory for log files
LogFlags      Controls logfile behavior
NeverCpus     Cpus to never use for rendering
Notes         Job notes
Priority      Default priority
Ram           Ram job expects to use (max)
State         Initial state for job
Title         Title for job
WaitFor       Wait for other jobs to complete
AutoDump (rush -autodump)
autodump off         # Don't autodump; job remains when frames are done.
autodump done        # Job dumps itself when all frames are DONE
autodump donefail    # Job dumps itself if all frames are DONE or FAIL
Command (rush -command)
Usually, this is an absolute NFS path to the Render Script.
It can, however, be an absolute path to any executable or script, provided it returns rush exit codes (0,1,2), and knows how to access RUSH_FRAME to determine which frame it's working on.
command /job/MARINER/DIVE/rush/render-script 640 480
WinNT Note - Use UNC paths for the absolute path
to the render script. This prevents problems with inconsistently mapped
drive letters. A UNC example:
Cpus (rush -ac/-rc)
When specifying a cpu, you are telling rush at least three things:
The number of cpus defaults to 1 if unspecified.
If unspecified, the priority value defaults to the Priority value for the job.
Priority is a value between 1 and 999, with 999 being highest priority, 1 being lowest. Priority values can be followed by optional flags 'k' and/or 'a'. See Priority Description for a full description of how the priority mechanism works.
cpus pabst                # 1 cpu on pabst, default priority
cpus pabst=4              # 4 cpus on pabst, default priority
cpus pabst=4@900          # 4 cpus at 900 priority
cpus pabst=4@900,2@500    # 4 cpus at 900 priority, 2 cpus at 500
cpus +any=10@1            # use up to 10 cpus on 'any available machine'
cpus +farm=50@1           # use up to 50 machines on the 'farm' host group
Host Groups are configured by your sysadmin in the Hosts file.
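The single-part cpu specifications above follow the shape host[=count][@priority[flags]]. As a rough illustration only (this is not rush's own parser), the sketch below splits such a spec in Python and applies the defaults described above. The names CpuSpec and parse_cpuspec are hypothetical, and multi-part specs such as pabst=4@900,2@500 are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional

class CpuSpec(NamedTuple):
    host: str                # hostname, or +hostgroup
    cpus: int                # defaults to 1 if unspecified
    priority: Optional[int]  # None means "use the job's Priority value"
    flags: str               # optional 'k' and/or 'a' flags

# Hypothetical grammar: host[=count][@priority[flags]]
_SPEC = re.compile(
    r"^(?P<host>\+?[\w.-]+)(?:=(?P<cpus>\d+))?(?:@(?P<pri>\d+)(?P<flags>[ka]*))?$"
)

def parse_cpuspec(spec: str) -> CpuSpec:
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError(f"unrecognized cpu spec: {spec!r}")
    priority = int(m["pri"]) if m["pri"] is not None else None
    if priority is not None and not 1 <= priority <= 999:
        raise ValueError(f"priority must be between 1 and 999: {spec!r}")
    return CpuSpec(m["host"], int(m["cpus"] or 1), priority, m["flags"] or "")
```

Leaving the priority as None rather than filling in a number keeps the "defaults to the job's Priority value" rule visible to the caller.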
Criteria (rush -criteria)
[erco@howland] % rush -lac
IP            Hostname  Ram  Cpus  Pri  Criteria
192.168.10.3  rotwang   100  2     0    +any,linux,linux6.0,intel,+dante
192.168.10.2  how       256  2     0    +any,sgi,irix,irix6.2
192.168.10.1  nt        256  1     0    +any,winnt,+dante
When you specify hosts to render, any Criteria you specify will
limit which machines your renders will run on; if the criteria you specify
don't match a particular host, even if the host is specifically requested
by a Cpus command, frames will be turned away from
rendering on that machine.
For instance, if your job depends on using only linux machines or sgis
running IRIX 6.2, you might submit your job with a criteria line that reads:
criteria ( linux | irix6.2 )
The above presumes your sysadmin uses 'linux' and 'irix6.2' as qualifiers
in the host list. If you need new criteria strings configured, ask your
sysadmin to add them to the rush system's hosts
file.
Only one Criteria command should appear in a submit script; multiple instances of the command are not cumulative.
Here are some more examples:
criteria ( linux | ( irix6 & octane ) )    # Use linux machines OR irix6 octanes.
criteria ( linux | irix6.2 )               # Only linux machines OR irix6.2 machines.
criteria ( linux & !alpha )                # Use only linux machines that are NOT dec-alphas.
criteria ( linux & alpha & carrera )       # Use only linux dec-alphas built by Carrera.
criteria ( +any )                          # Use all available machines
criteria ( !intel )                        # Use all machines that are NOT intel based machines.
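To make the matching rules concrete, here is a minimal Python sketch of how such an expression could be evaluated against one host's criteria strings. It is an illustration, not rush's implementation: the function name criteria_matches is hypothetical, and the operator precedence ('!' binds tightest, then '&', then '|') is an assumption, since the examples above always use parentheses.

```python
import re

def criteria_matches(expr, host_criteria):
    """Evaluate a criteria expression against one host's criteria strings."""
    # Tokens are operators/parentheses, or words such as 'linux', 'irix6.2', '+any'.
    tokens = re.findall(r"[()&|!]|[+\w.-]+", expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()   # always parse the right side, even if value is True
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        tok = take()
        if tok == "!":
            return not parse_not()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing ')'")
            return value
        if tok is None or tok in {")", "&", "|"}:
            raise ValueError(f"unexpected token: {tok!r}")
        return tok in host_criteria

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result
```

Called with the criteria column of a 'rush -lac' report, split on commas into a set, the function returns whether frames would be accepted by that host under these assumptions.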
When a job is submitted with 'dependon', all the frames in the job enter the 'Hold' state. (Frames in the 'Hold' state will not be rendered until switched back to the 'Que' state.) When a particular frame in the depended-on jobs enters the 'Done' state (ie. finishes rendering successfully), the corresponding frame in the current job switches from Hold to Que, allowing that frame to begin rendering as resources become available.
You can have a job depend on several others, the only stipulations being that the dependon jobs:
Frames will *not* switch from Hold -> Que until *all* the jobs depended on have their corresponding frames in the Done state. Otherwise those frames will remain in the Hold state.
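The release rule described above can be sketched in a few lines of Python. This is only a model of the stated behavior, not rush code: the function name and the dictionaries mapping frame numbers to states are hypothetical, and treating a frame that is absent from a depended-on job as "not yet Done" is an assumption.

```python
def releasable_frames(held_frames, depended_jobs):
    """Return the held frames whose counterpart is Done in every depended-on job.

    held_frames   -- frame numbers currently in Hold in the dependent job
    depended_jobs -- one {frame_number: state} dict per depended-on job
    """
    return [
        frame
        for frame in held_frames
        # Assumption: a frame missing from a depended-on job keeps the hold in place.
        if all(job.get(frame) == "Done" for job in depended_jobs)
    ]
```

With two depended-on jobs, frame 2 stays in Hold until both jobs report frame 2 as Done, even if frame 1 has already been released.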
See Chaining Jobs for scripting techniques to do this. Example usage:
DoneCommand (rush -donecommand)
This command is executed only once, on the job server.
If the done command is set to '-', or if donecommand is not specified at all, it will be disabled.
For the DoneCommand to be executed, the job must dump. For automatic invocation, the AutoDump command must be enabled so that the job dumps when all the frames are done. If AutoDump is disabled, the DoneCommand will only execute if someone manually invokes 'rush -dump'.
DoneCommand scripts are passed the jobid in the RUSH_JOBID environment variable, so it's possible for the script to use rush(1) commands to query the job. Exit codes are currently ignored. The stdout and stderr output from the DoneCommand is written to a file called 'done.log' in the LogDir.
#!/bin/csh -f
# EXAMPLE 'DoneCommand' SCRIPT
set wwwreport = /somewhere/MYPROJECT/html/`logname`/jobreport.html

# CREATE A CUSTOMIZED WEBPAGE REPORT
set logdir = `dirname $RUSH_LOGFILE`
cat $logdir/framelist | \
    my_report_generator > $wwwreport

# MAIL THE REPORT TO SOMEONE
Mail -s "$RUSH_JOBID Html Report" `logname` < $wwwreport
The DoneCommand should avoid doing anything to the job that might make it continue running. Though possible, this would confuse someone manually trying to dump the job, only to find it requeuing itself.
donecommand -               # Disable done commands
donecommand $cwd/cleanup    # Setup script to run before job dumps itself
DoneMail (rush -donemail)
Arguments should all be valid email addresses. If more than one address needs to be specified, separate with commas. There should be no spaces in the list of addresses. Use '-' to disable sending completion mail (default). Some possible settings for DoneMail:
donemail -                  # Mail disabled
donemail erco@3dsite.com    # Send mail to erco@3dsite.com
donemail fred,jack          # Send mail to fred and jack
Frames (rush -af/-rf)
frames 1-10           # Frames 1 thru 10
frames 100-150,2      # Frames 100 thru 150 on twos
frames 500 507 615    # Frames 500, 507 and 615
frames 1-5=Done       # Frames 1 thru 5 in DONE state
frames 6-10=Fail      # Frames 6 thru 10 in FAIL state
frames 11-15=Hold     # Frames 11 thru 15 in HOLD state
frames 16-20=Queue    # Frames 16 thru 20 in QUEUE state (default)
frames 1-10:Black            # Notes for frames 1 thru 10 is "Black"
frames 11:Fade_up_on_sc17    # Frame 11 has note "Fade_up_on_sc17"
[erco@howland]% rush -lf
STAT FRAME TRY HOSTNAME PID START          ELAPSED  NOTES
Que  0001  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0002  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0003  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0004  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0005  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0006  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0007  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0008  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0009  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0010  0   -        0   00/00,00:00:00 00:00:00 Black
Que  0011  0   -        0   00/00,00:00:00 00:00:00 Fade_up_on_sc17

States and Notes. Frame states and notes specifications can appear together, e.g.:
frames 1-5=Done:This_is_a_test
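Taken together, the examples in this section suggest that one frame specification has the shape range[,step][=State][:Notes]. The Python sketch below expands a single specification under that reading. The helper name parse_frame_spec is hypothetical, and the handling shown (for instance, storing the default state as "Queue") is an assumption rather than documented rush behavior.

```python
import re

# Assumed shape of one spec: N or N-M, optional ',step', optional '=State', optional ':Notes'
_FRAME_SPEC = re.compile(r"^(\d+)(?:-(\d+))?(?:,(\d+))?(?:=(\w+))?(?::(.*))?$")

def parse_frame_spec(spec):
    """Expand one frame spec into a list of (frame, state, notes) tuples."""
    m = _FRAME_SPEC.match(spec)
    if m is None:
        raise ValueError(f"unrecognized frame spec: {spec!r}")
    first = int(m[1])
    last = int(m[2]) if m[2] is not None else first
    step = int(m[3]) if m[3] is not None else 1
    state = m[4] or "Queue"   # the section above names Queue as the default state
    notes = m[5] or ""
    return [(frame, state, notes) for frame in range(first, last + 1, step)]
```

A line such as 'frames 500 507 615' would be handled by splitting on white space and expanding each word in turn.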
When a job is dumped, two other files appear in this directory;
The directory must exist relative to both the job server and all machines participating in rendering, and the directory must be read/writable by the user submitting the job.
logdir /jobs/myjob/logs    # Logs are dumped in this directory
logdir -                   # Disable log files
WinNT Note - Use a UNC path for the absolute path to the logdir.
This prevents problems with inconsistently mapped drive letters. A UNC example:
The default behavior is to overwrite frame logfiles, each time a frame renders.
KeepLast tells the system to always keep the previous logfile, if there is one. It does this by renaming the previous log to a ".old" file, before creating the new log for a running frame, similar to running the command:
mv logs/0055 logs/0055.old
KeepAll is similar to KeepLast, with the additional behavior that all 'previous' logs are kept; before a framelog is overwritten, it is concatenated to the .old file, similar to running:
cat logs/0055 >> logs/0055.old
Beware; if your logfiles are long, KeepAll will create significant use of disk space, since the logs will accumulate. A good reason to use KeepLast instead.
logflags -           # Logs are overwritten (Default)
logflags keepall     # Keep all logs; concatenate old logs in 0000.old
logflags keeplast    # Like 'keepall', but only keeps last log (don't concatenate)
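The difference between the three settings can be modelled with a small Python sketch that mirrors the mv and cat commands quoted above. The helper name rotate_frame_log is hypothetical; it only illustrates what would happen to an existing frame log just before a new log is created, and is not how rush itself is written.

```python
import os

def rotate_frame_log(logfile, logflags="-"):
    """Deal with an existing frame log before a new render overwrites it."""
    if not os.path.exists(logfile):
        return
    old = logfile + ".old"
    if logflags == "keeplast":
        # Like: mv logs/0055 logs/0055.old  (any earlier .old file is replaced)
        os.replace(logfile, old)
    elif logflags == "keepall":
        # Like: cat logs/0055 >> logs/0055.old  (the .old file keeps growing)
        with open(logfile, "rb") as src, open(old, "ab") as dst:
            dst.write(src.read())
    # Default ('-'): nothing is saved; the new log simply overwrites the old one.
```

The sketch makes the disk-space warning easy to see: under keepall the .old file only ever grows, while keeplast holds exactly one previous log.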
NeverCpus (rush -an/-rn)
nevercpus tahoe rotwang # Never use tahoe or rotwang for rendering
Notes (rush -notes)
..
notes Please don't dump this job until you have visually
notes verified the matte transition at frames 205-219.
notes Call me at home if there are problems! -fred
..
Notes for the job appear in the 'rush -ljf' reports:
[erco@howland]% rush -ljf
    :
    :
  Elapsed: 00:00:00
   Frames: 22
     Cpus: rotwang=2@100k
     Cpus: how=3@100k
 Notes[0]: Please don't dump this job until you have visually
 Notes[1]: verified the matte transition at frames 205-219.
 Notes[2]: Call me at home if there are problems! -fred
See Priority Description for a description of how priority values work.
Ram (rush -ram)
While the job is running, the configured Ram value is compared against the
available ram on the remote processors. If the amount of ram your job wants
is more than the remote machine has available, then the frame will not
be started. This behavior prevents swapping on the remote machines.
ram 128 # Only run on machines that have at least 128MB of ram available
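The check can be written out as simple arithmetic. The Python sketch below uses the figures from the hosts file description of the Ram field (a host configured with 512, two frames already rendering at 200 each); the function names are hypothetical and the code only models the rule as documented.

```python
def ram_available(host_ram, running_frame_rams):
    """Ram left on a host after subtracting the ram values of frames already rendering."""
    return host_ram - sum(running_frame_rams)

def frame_can_start(host_ram, running_frame_rams, job_ram):
    """A frame is turned away if it wants more ram than the host has available."""
    return job_ram <= ram_available(host_ram, running_frame_rams)

# Host configured with Ram 512, two frames rendering with ram 200 each:
#   512 - (200 x 2) = 112 left for the remaining processors,
#   so a job submitted with 'ram 128' would not be started there.
```

Raising a job's ram value therefore narrows the set of hosts that will accept its frames at any given moment.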
    :
state Pause    # Submit job in paused state
    :
After you submit the job, the job will be in the paused state:
[erco@howland]% rush -lj
STATUS JOBID       TITLE        OWNER    %DONE BUSY NOTES
------ ----------- ------------ -------- ----- ---- -----------
Pause  how-857     THX/LOGO     erco     %0    0    Job paused.
Title (rush -title)
title THX/LOGO
Here's an example report showing where the title might show up:
[erco@howland]% rush -lj
STATUS JOBID       TITLE        OWNER    %DONE BUSY NOTES
------ ----------- ------------ -------- ----- ---- -----------
Run    how-857     THX/LOGO     erco     %0    0    00:00:05
Titles cannot contain spaces, and should be short (<15 characters) to prevent reports from losing columnar formatting.
% rush -ac tahoe@300 +rfarm=10@100k +any=10@10
See the Cpus submit command for more info.
% rush -af 10-15                      # Add frames 10 thru 15 to the current job
% rush -af 20-25=Hold                 # Add frames 20 thru 25 in Hold state
% rush -af 30-35:"To be done"         # Add frames 30 thru 35, setting Notes field to "To be done"
% rush -af 40-45=Done:"To be done"    # Add frames 40 thru 45 in Done state with Notes field set
See Frames submit command for more info.
% rush -an tahoe +rfarm
See Nevercpus submit command for more info.
#!/bin/csh -f
###
### Script to edit the rush.conf file, and rdist it out
###
set TMPFILE=/usr/tmp/rush.conf.$$
cp /usr/local/rush/etc/rush.conf $TMPFILE
vi $TMPFILE
rush -checkconf $TMPFILE
if ( $? ) then
    echo You lose, game over.
    set err=1
else
    foreach i ( tahoe superior erie )
        rdist -c $TMPFILE ${i}:/usr/local/rush/etc/rush.conf
    end
    set err=0
endif
rm -f $TMPFILE
exit $err
Caveat: for reasons strange and unusual, all daemon 'catlog' output has a '>' prepended to each line.
Only root and the rush adminuser can use this command.
Read the examples carefully to avoid confusing the jobid(s) to affect with the new jobids for the dependon command; remember to separate all 'dependon' jobids with commas instead of spaces.
Setting dependon to '-' will disable it, but you will have to manually un-Hold any frames.
All 'dependon' jobids must have the same hostname as the job being modified.
To differentiate between the new value for 'dependon', and the jobid(s) to be affected, the new value for 'dependon' must be a comma separated list of jobids with no spaces. Examples:
rush -dependon -                              # Disable all dependencies
rush -dependon tahoe-37,tahoe-38,             # Job now depends on 'tahoe-37' and 'tahoe-38'
rush tahoe-40 -dependon tahoe-37,tahoe-38,    # tahoe-40 now depends on 'tahoe-37' and 'tahoe-38'
rush tahoe-40 -dependon tahoe-37,             # tahoe-40 now depends on job 'tahoe-37'
                                              # Trailing comma on tahoe-37 is important!
Only root and the rush adminuser can use this command.
These are flags that can be used with rush(1)'s -dlog flag, rushd(8)'s -d flag, and the rush.conf file's LogFlags <logflags>. These flags can be combined to accumulate logging verbosity. All flags can be enabled by specifying 'a'.
a - all
b - bump mechanism logging
d - log duplicate/redundant receipt of packet drops
e - events (time oriented, async)
f - fork
h - hostname lookups
j - Log job submissions
k - Log bumped/killed/usurped tasks
l - Logical string evaluations
o - connect()/open()/close()/bind()/socket() (low level)
p - parse command line arguments, submit scripts
m - memory calculations (RAM) during priority battles, etc
n - network commands (udp/tcp)
r - reboot management/transactions
s - signals
t - tcp
u - udp
w - 'waitfor' checks
y - yp lookups
C - class ToWords/FromWords
F - File loading line-by-line debugging
E - Errors not normally displayed (benign, but suspect)
T - task/taskack transactions
U - update (scheduling, priority mechanism, idle cpu management)
R - Reaper msgs
S - Server/Client context switches
X - Random UDP message dropping -- TESTING ONLY!!
    ('a' does not affect this option, it must be specified)
The frames to affect can either be specified as a frame range (ie. 1-100) or as a frame state, for which all frames matching that state will be changed. Examples:
% rush -done 1-100       # 'Done' frames 1 through 100
% rush -done fail        # 'Done' all frames currently Fail
% rush -done fail que    # 'Done' all frames currently Fail or Que
Setting the done command to '-' will disable it. Examples:
rush -donecommand "/bin/sleep 30"    # set donecommand to '/bin/sleep 30'
rush -donecommand "-"                # disable donecommand
If no arguments are specified, the RUSH_JOBID variable is used to determine which job to dump.
If jobid(s) are specified, those jobs will be dumped.
If a user's name is specified, all jobs owned by that user on the local machine will be dumped. (eg. 'rush -dump fred')
If 'user@host' is used, all jobs owned by 'user' on 'host' will be dumped. (eg. 'rush -dump fred@tahoe')
The frames to affect can either be specified as a frame range (ie. 1-100) or as a frame state, for which all matching frames will be affected. Examples:
% rush -fail 1-100        # Fail frames 1 through 100
% rush -fail done         # Fail all frames currently Done
% rush -fail done hold    # Fail all frames currently Done or Hold
For instance, if you want to control another person's job, you might get an error, eg:
% rush -an vaio va-229             # Attempt to add 'vaio' as a neverhost to someone's job
rush: va-229: you're not owner!    # Fails because you're not the job's owner
% rush -an vaio va-229 -fu         # Same command with -fu to force it..
Add Neverhosts vaio                # ..now it works!
See also the RUSH_FU environment variable.
1 This acronym is rumored to have an alternate, pejorative expansion.
Offlines the local [remote] processors, kills/requeues any running frames, causing them to start elsewhere.
When a frame is in the Hold state, the frame will not be rendered, and a job will not autodump if there are any Hold frames in the frame list.
The frames to affect can either be specified as a frame range (ie. 1-100) or as a frame state, for which all matching frames will be affected. Examples:
% rush -hold 1-100        # Hold frames 1 through 100
% rush -hold fail         # Hold all frames currently Fail
% rush -hold fail done    # Hold all frames currently Fail or Done
If 'rush -lac hostname' is used, the information comes from the cache of the daemon running on the specified host; useful in determining hostname caching problems.
% rush -lfi vaio-139
Jobid        State Total Perc Average    Average ETA
------------ ----- ----- ---- ---------- ------------------------
vaio-139     Que   7     %69  -          -
vaio-139     Run   1     %10  -          -
vaio-139     Done  2     %20  00:32:16   Thu Sep 28 00:41:43 2000
vaio-139     Fail  0     %0   -          -
vaio-139     Hold  0     %0   -          -
Some description of the average columns:
oldest_start   = "The oldest start time for all Done frames"
recent_end     = "Most recent end time for all Done frames"
time_spent     = ( recent_end - oldest_start )
time_per_frame = time_spent / total_done_frames
time_to_go     = time_per_frame * ( total_que + total_hold )
Warning: the ETA is meant for ball park estimates only, and is not meant to be taken literally.
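The formulas above translate directly into code. The Python sketch below follows them term by term; the function name and the choice of (start, end) pairs in seconds as input are assumptions made for the example, not part of rush.

```python
def estimate_time_to_go(done_frames, total_que, total_hold):
    """Estimate remaining seconds from the Done frames of a job.

    done_frames -- list of (start_time, end_time) pairs in seconds, one per Done frame
    Returns None when no frame is Done yet, since no average can be computed.
    """
    if not done_frames:
        return None
    oldest_start = min(start for start, _ in done_frames)
    recent_end = max(end for _, end in done_frames)
    time_spent = recent_end - oldest_start
    time_per_frame = time_spent / len(done_frames)
    return time_per_frame * (total_que + total_hold)
```

Because time_spent is measured from the oldest start to the most recent end, frames that rendered in parallel lower the per-frame figure, which is one reason the estimate is only a ball park.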
Examples:
rush -notes 155:"license error"
rush -notes 200-250:"redo later"
The frames to affect can either be specified as a frame range (ie. 1-100) or as a frame state, for which all frames matching that state will be changed. Examples:
% rush -que 1-100        # Que frames 1 through 100
% rush -que fail         # Que all frames currently Fail
% rush -que fail done    # Que all frames currently Fail or Done
Cpus can be removed in one of several ways:
To remove a cpu via 'JOBTID' values (eg. 'rush -rc .334') you must precede each value with a period. When you delete by JOBTID, you are deleting single cpus from the 'rush -lc' report. Note this report includes the JOBTID values so you can see which values to delete. If the cpu you delete is part of a larger specification (eg. tahoe=4@12), the cpu count in that spec will be decremented (eg. tahoe=3@12).
If you remove a hostname (eg. 'rush -rc tahoe') then all cpu specifications that have that host name (eg. tahoe=3@100) will be removed. Also, any host groups that expand to include that host will have that host removed from the expansion (eg. +any=3@100, which includes tahoe).
If you remove a cpu specification (eg. 'rush -rc +any=3@100'), it must match character-for-character the entry shown in the 'rush -lc' report for the job:
% rush -ac tahoe@100                     # Add a cpu.
% rush -rc tahoe@100                     # Now try to remove it
'tahoe@100' no such cpu specification    # FAILED: need to use spec shown in 'rush -lc'
% rush -lc                               # Look at 'rush -lc' report
CPUSPEC      STATE FRM  PID   ELAPSED ..
tahoe=1@100  Run   0002 26747 00:00:11 ..    # More complete specification in report.
% rush -rc tahoe=1@100                   # Remove using spec shown in report
'tahoe=1@100' removed.                   # It works
Reservation jobs are like any other: to remove a reservation, dump the job; to change the cpu specification, use the usual rush commands like -ac/-rc to modify the job, or use the GUI.
'cpuspec' indicates which processors are to be reserved, and at what priority. It is an error not to include a priority value in the cpuspec. A priority of '500' will reserve the processor from jobs with a priority of 500 or lower, but yields to jobs higher than 500.
To reserve the cpus on your workstation, while still allowing you to submit your own jobs to it, reserve your cpus at a priority of 998, and submit jobs to use your workstation at a priority of 999.
To prevent all jobs from rendering on your workstation, reserve cpus with a priority of 999. Or just 'rush -offline' your machine.
The optional 'ram value' allows one to reserve an amount of ram per processor. This is useful on multiproc machines where you want to prevent large jobs from rendering on the few processors left unreserved. Increasing the ram value you reserve will decrease the size of jobs rush will allow to run on the unused processors.
If 'cpuspec' doesn't include the number of processors, one processor is assumed. If 'ramval' is unspecified, a value of '1' is assumed.
rush -reserve tahoe@998          # Reserve 1 cpu on tahoe@998
rush -reserve tahoe=2@998 128    # Reserve 2 cpus @998, 128MB of ram to each
rush -reserve tahoe=2@500 128    # Reserve 2 cpus @500, 128MB of ram to each
Changing the order of the framelist affects the order frames are rendered, since frames are issued from the top of the list, down.
Example. If you have a job with 10 frames in the frame list in normal 1 thru 10 order, you can use 'rush -reorder' to get different orderings, such as these:
rush -reorder 10-1             # becomes 10 9 8 7 6 5 4 3 2 1
rush -reorder 1-10,2 2-10,2    # becomes 1 3 5 7 9 2 4 6 8 10
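The two orderings above can be reproduced with a short Python sketch that expands each range argument in turn. The function name expand_order is hypothetical, and the sketch assumes non-negative frame numbers and the range[,step] form shown in the examples.

```python
def expand_order(specs):
    """Expand white space separated range specs (eg. '1-10,2 2-10,2') into a frame order."""
    frames = []
    for spec in specs.split():
        frame_range, _, step = spec.partition(",")
        first, _, last = frame_range.partition("-")
        first = int(first)
        last = int(last) if last else first
        step = int(step) if step else 1
        direction = 1 if last >= first else -1   # '10-1' counts downward
        frames.extend(range(first, last + direction, direction * step))
    return frames

print(expand_order("10-1"))            # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(expand_order("1-10,2 2-10,2"))   # [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]
```

Since frames are issued from the top of the list down, the second ordering renders every other frame first and then fills in the gaps.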
Logs can be automatically rotated with the LogRotateHour command in the rush.conf file.
Only root and the rush adminuser can use this command.
'rush -status' quickly reports the status of jobs and renders on the local [remote] hosts.
Optionally, a continuously updating report can be generated, where [-c count] specifies the number of updates, and [-s secs] specifies the seconds delay between each report. If -c is zero, updating is continuous.
The output contains several records of information for each host, one record per line. Host records start with an 'h' record, and terminate with a line of '---'.
4 different types of data records are possible. Data record types are defined by the first character in each record line, and can be one of:
h - hostname header. Leads off records for the specified host.
d - daemon information. Info about the running daemon.
j - job information. One line per job.
p - processor status. One line per processor.

Each record has its own fields:

h <hostname>
d <sequence-id> <daemon> <version> <PID=pid> <online state> <jobs> <busy procs> <total procs>
j <owner> <jobid> <job title> <job state> <elapsed> <percent done> <percent fail> <# frms busy>
p <owner> <jobid> <job title> <frame> <priority> <pid> <elapsed>
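Since the record type is given by the first character of each line, the report lends itself to simple line-oriented parsing. The Python sketch below groups records by host. It is an illustration only: the function name and the returned dictionary layout are hypothetical, fields are assumed to be separated by white space, and no sample of real 'rush -status' output is available here to check it against.

```python
def parse_status(text):
    """Group 'rush -status' style records by host.

    Returns a list of dicts such as {'host': ..., 'd': [...], 'j': [...], 'p': [...]},
    where each list holds the white space split fields of one record line.
    """
    hosts = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("---"):       # terminator line closes the current host
            current = None
            continue
        kind, *fields = line.split()
        if kind == "h":                  # hostname header starts a new host
            current = {"host": fields[0], "d": [], "j": [], "p": []}
            hosts.append(current)
        elif kind in ("d", "j", "p") and current is not None:
            current[kind].append(fields)
    return hosts
```

Keeping the fields as plain lists avoids guessing at types; a caller can index them by the positions given in the field list above.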
A 'try count' is maintained for each frame in a job's frame list. This value is shown in the 'Try' column of 'rush -lf' reports, and is passed to render scripts via the $RUSH_TRY environment variable.
% rush -lf
STAT FRAME TRY HOSTNAME PID  START          ELAPSED  NOTES
Done 0001  4   superior 4144 10/04,03:29:10 00:00:18
Done 0002  4   tahoe    4160 10/04,03:29:30 00:00:16
Done 0003  4   placid   4163 10/04,03:29:47 00:00:16
Done 0004  6   huron    4168 10/04,03:30:04 00:00:16
Done 0005  4   finger   4171 10/04,03:30:21 00:00:16
% rush -try 0 1-5
0001: tries was 4, now 0
0002: tries was 4, now 0
0003: tries was 4, now 0
0004: tries was 6, now 0
0005: tries was 4, now 0
% rush -lf
STAT FRAME TRY HOSTNAME PID  START          ELAPSED  NOTES
Done 0001  0   superior 4144 10/04,03:29:10 00:00:18
Done 0002  0   tahoe    4160 10/04,03:29:30 00:00:16
Done 0003  0   placid   4163 10/04,03:29:47 00:00:16
Done 0004  0   huron    4168 10/04,03:30:04 00:00:16
Done 0005  0   finger   4171 10/04,03:30:21 00:00:16
Read the examples carefully to avoid confusing the jobid(s) to affect with the new jobids for the waitfor command; remember to separate all 'waitfor' jobids with commas instead of spaces.
Setting waitfor to '-' will disable the job from waiting for anything.
All 'waitfor' jobids must have the same hostname as the job being modified.
To differentiate between the new value for 'waitfor', and the jobid(s) to be affected, the new value for 'waitfor' must be a comma separated list of jobids with no spaces. Examples:
rush -waitfor -                              # Disable all waitfor's
rush -waitfor tahoe-37,tahoe-38,             # Job waits for 'tahoe-37' and 'tahoe-38' to dump
rush tahoe-40 -waitfor tahoe-37,tahoe-38,    # tahoe-40 now waits for 'tahoe-37' and 'tahoe-38' to dump
rush tahoe-40 -waitfor tahoe-37,             # tahoe-40 now waits for 'tahoe-37' to dump
                                             # Trailing comma on tahoe-37 is important!
The rush.conf file can be updated on the fly; simply edit a copy, make
changes, then rdist(1) the copy to all the machines, and the daemons
will pick up your changes within one minute.
To make changes to this file and update this to the network,
use these commands.
Flags can be combined to enable multiple debugging features.
LogFlags affect both the daemon AND user applications. To affect only
the daemon, specify flags on daemon's command line, or use 'rush -dlog
<flags>'.
See Logging Flags Table for a complete
list of all the one letter log flags.

When a task on a remote cpu becomes IDLE, it tries to convince a job
to use its cpu. If the job 'passes' on this request (no more frames to
render, etc), the remote task enters a JOBPASS state, to avoid contacting
the job again for a while. After the timeout period, the task re-enters
an IDLE state to see if maybe the job had a FAIL frame, and has more frames
to render after all.

Options can be none, demand or boot:
Since the NT version of rush doesn't know how to map the name 'ntrush'
to the equivalent uid value, NtRushUid is used to resolve it.
Basically, this value should be the same as the uid value for the
Unix user 'ntrush'.
Since the NT version of rush doesn't know how to map the name 'ntrush'
to the equivalent gid value, NtRushGid is used to resolve it.
Basically, this value should be the same as the gid value for the Unix user
'ntrush'.
When a job is submitted, if the user's uid value is outside the range
specified here, an error message is printed, and the job will not be submitted.

Though unnecessary for proper operation of the render queue, you should
register the ServerPort value in your /etc/services
file, e.g.:
Set to '-' to disable generation of cpu accounting data.
Set to 'root' if there is no special rush administrative login.
If disablefu is set to 1, users can't control each other's jobs;
only root and adminuser can do this.
Normally, users should be able to control each other's jobs,
allowing local policies, peer pressure (and auditing daemon logs)
to prevent pandemonium.

This user is allowed to use the RUSH_USER environment variable to pose
as other users for the purpose of cgi-bin scripts being able to submit
jobs as the user on the other end of Netscape.
Set to "root" to disable this feature (default).
The hosts file can be updated on the fly; simply edit a copy, make changes,
then rdist(1) the copy to all the machines, and the daemons will pick up
your changes within one minute.
To make changes to this file and update this to the network,
use these commands.
The format of the hosts file is single lines of 5 white space separated
fields, one line per host:
Blank lines and lines starting with '#' are ignored:
This is the name that will be used in jobids and other cpu reports,
so it is best if short names are used (10 chars or less). Longer names
are ok, but will misalign columnar reports. Avoid using FQDN hostnames
(e.g. foo.domain.com).
As of version 102.13, you can optionally specify an alternate network
interface, other than the default. Just append to the hostname a ':'
followed by the name of the interface, eg:
tahoe:tahoe-eth
This says 'tahoe' is the actual name of the machine (ie. hostname(1)),
but rush should use tahoe's 'tahoe-eth' network interface for all
communications.
'0' is an acceptable value that essentially disables the machine from
participating in rendering, while allowing the host to be specified in
submit scripts.

On multiprocessor machines, this value is a total from which rendering
frames subtract their estimated ram use. For instance, if a 4 cpu machine
is configured with a Ram value of 512, and 2 frames are currently rendering
each with ram values 200, then only 112 will be left for rendering on the
other two processors (112 = 512 - ( 200 x 2 ) ).

The <Criteria> field might be set to:

Host Group names are configured in this field
too. To add a host group called +servers to the above example:
Configuration File
$RUSH_DIR/etc/rush.conf
The configuration file should be customized by the systems administrator.
Most settings are used only for fine tuning, but some control important
security settings (uidrange/gidrange/forceuid/forcegid), and process auditing/logging
(cpuacctpath).
Command
Description
Example
LogFlags
Not to be confused with the submit script LogFlags command.
Configures daemon logging features. Most are debugging flags used to track
operation of the system.
logflags jE
UdpTimeout
The number of seconds between udp re-transmissions.
udptimeout 8
UdpMaxRetries
The number of re-transmissions until 'retry time-out' occurs
udpmaxretries 5
UdpRestTimeOut
How many secs to rest before recovering from a 'retry time-out'
udpresttimeout 40
InMaxMsgs
(Version 101.83+) How many messages (tcp/udp) are received from the input queue at a time,
before re-checking output service routines. eg.
for ( t=0; t < inmaxmsgs; t++ ) {
    select(..)
    if ( no data ) break;
}
inmaxmsgs 30
LogRotateHour
(Version 101.81+)
Sets the hour (0-23) that the logs automatically rotate. A value of -1 disables
automatic log rotation.
logrotatehour 0
JobUpdateThrottle
Don't advertise jobs' cpus more often than once every 'jobupdatethrottle'
seconds. The daemon will re-advertise cpus that haven't been acknowledged
by the remotes at about this rate.
jobupdatethrottle 10
JobPassTimeout
The 'jobpasstimeout' value configures how many seconds the
task will remain in the JOBPASS state before re-entering an IDLE state
by itself.
jobpasstimeout 150
DaemonHostCache
rushd(8)'s hostname caching options. Only
affects the way the daemon caches information.
daemonhostcache boot
AppHostCache
rush(1)'s host caching option; only affects the rush(1) client application's
method of host caching. Can be none or demand.
apphostcache demand
NtRushUid
The uid used if an NT submitted job is to run on unix machines.
ntrushuid 100
NtRushGid
The gid used if an NT submitted job is to run on unix machines.
ntrushgid 100
UidRange
Disallow render queue to run processes with a uid outside this range.
First value is a minimum, second value is a maximum.
uidrange 100 65000
GidRange
Controls gid values the same way UidRange controls uid values.
gidrange 100 65000
ForceUid
Forces all user processes to run as this uid. Default is -1, allowing
user processes to run as the UID of the user who submitted the job.
forceuid -1
forceuid 100
ForceGid
Same as ForceUid for GID values.
forcegid -1
forcegid 100
ServerPort
Sets the rushd(8) server daemon's port number for UDP/TCP connections.
rushd 696/tcp # rush server
rushd 696/udp # rush server
serverport 696
ClientPort
ClientPort is a vestige from the past that is now obsolete
and unrecognized. Please remove from all rush.conf files.
CpuAcctPath
Path to cpu accounting file.
cpuacctpath /var/logs/cpu.acct
AdminUser
Sets login name for user allowed to administer the rush daemons. Commands
such as 'rush -dexit', 'rush -dlog a' and others are
limited to root and this user.
adminuser root
DisableFu
Allows administrator to control whether users can use
'rush -fu' and $RUSH_FU to control other people's jobs.
disablefu 0
WebUser
Sets login name for user the httpd daemon runs as, in cases where rush
is being controlled by web interfaces.
webuser guest
Hosts File
$RUSH_DIR/etc/hosts
The $RUSH_DIR/etc/hosts file must contain the names
of all hosts that participate in rendering.
<Hostname> <#Cpus> <Ram> <Minimum Priority> <Criteria>
<Hostname>
This is the name of the host, and should be the shortest name
possible (e.g. host aliases can be used here).
<#Cpus>
tahoe:tahoe-eth
This should be the number of cpus the host has. This is how
many processes the host will run at the same time. This value can be larger
or smaller than the actual number of physical cpus the machine has.
<Ram>
This is the amount of ram the machine has. This value can be
less or more than the actual ram the machine has; usually this value takes
into account some percentage of the host's swap space as well. This value
is used when accepting frames to render; a frame that asks for more ram
than the machine has will be turned away.
<Minimum Priority>
This prevents frames from rendering on this machine if their
priority is lower than this value. If zero, all frames will be accepted.
<Criteria>
This is a list of comma separated strings that define platform
or operating system specific features for the host. These can be arbitrary
alpha-numeric strings that may also contain dashes, underbars and periods,
but must not contain any white space. '+' characters have the special
purpose of leading off a Host Group specification.
+any,linux,linux6.1,prman3.7
These strings can then be used in TD's submit scripts to limit which hosts
will render their frames. See the Criteria Submit Script command for more info.
All hosts should have a criteria entry that at least contains +any.
+any,linux,linux6.1,prman3.7,+servers
# RUSH HOSTS
#
# The 'Host' field should contain short names for hosts (aliases are ok),
# and must be unique.
#
# The 'Criteria' field must *NOT* contain white space, and words are
# comma delimited. All hosts should contain '+any' in the criteria field.
#
#Host     Cpus  Ram   MinPri  Criteria
#-----    ----  ----  ------  -----------
tahoe     2     256   0       +any,+work,sgi,irix,irix6.5
superior  2     256   0       +any,+work,sgi,irix,irix6.2
ontario   1     128   0       +any,+work,linux,linux6.0,intel
erie      1     128   0       +any,+work,sgi,irix,irix6.4
rf1       1     512   0       +any,+rfarm,linux
rf2       1     512   0       +any,+rfarm,linux
rf3       1     512   0       +any,+rfarm,linux
rf4       1     512   0       +any,+rfarm,linux
rf5       1     512   0       +any,+rfarm,linux
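For site scripts that need to read this file, the five-column layout described above can be parsed in a few lines. The following is a minimal Python sketch and not part of rush itself: the names Host, parse_hosts and accepts are hypothetical, and the acceptance test only mirrors the Ram and Minimum Priority rules described above, assuming priorities are plain integers without 'k' or 'a' flags.

```python
from typing import List, NamedTuple


class Host(NamedTuple):
    name: str            # short hostname, possibly 'host:interface'
    cpus: int            # number of processes the host runs at once
    ram: int             # ram value as configured in the hosts file
    min_priority: int    # 0 means all frames are accepted
    criteria: List[str]  # comma separated criteria, split into a list


def parse_hosts(text: str) -> List[Host]:
    """Parse hosts-file text, skipping blank lines and '#' comments."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Criteria must not contain white space, so a plain split is enough.
        name, cpus, ram, minpri, criteria = line.split()
        hosts.append(Host(name, int(cpus), int(ram), int(minpri),
                          criteria.split(",")))
    return hosts


def accepts(host: Host, frame_ram: int, frame_priority: int) -> bool:
    """Hypothetical check mirroring the Ram and Minimum Priority rules."""
    if frame_ram > host.ram:
        return False  # frame asks for more ram than the host has
    if host.min_priority and frame_priority < host.min_priority:
        return False  # frame priority is below the host's minimum
    return True


SAMPLE = """\
#Host     Cpus  Ram   MinPri  Criteria
tahoe     2     256   0       +any,+work,sgi,irix,irix6.5
rf1       1     512   0       +any,+rfarm,linux
"""

hosts = parse_hosts(SAMPLE)
farm = [h.name for h in hosts if "+rfarm" in h.criteria]
print(farm)
```

The criteria list is kept as plain strings so that host group names such as +rfarm can be tested with a simple membership check; the real daemon's matching rules may be richer than this.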
The cpu accounting file is configured with the rush.conf file's CpuAcctPath command. Each time a frame finishes executing, a new entry is created in the Cpu Accounting file logging the name of the job, how long the frame ran, etc.
Cpu Accounting File Example
Process Entries
p 948242783 tahoe-798 WERNER/C33 erco 0106 superior 100k 122 0 0 0
p 948242783 tahoe-798 WERNER/C33 erco 0107 superior 100k 122 0 0 0
p 948242865 tahoe-797 KILLER     erco 0504 superior 200  121 0 0 0
Fields, from left to right:
1. 'p' indicates 'process entry'
2. time(2) process started
3. Jobid
4. Title of job
5. User
6. Frame that ran
7. Host that ran the process
8. Priority
9. #Secs Wall Clock Time
10. #Secs System Time
11. #Secs User Time
12. Exit code
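A process entry can be split into its twelve fields with very little code. The following is a minimal Python sketch, not a tool shipped with rush: ProcessEntry and parse_entry are hypothetical names, the field order is taken from the example above, and the sketch assumes that job titles contain no white space and that the three time fields are whole seconds.

```python
from typing import NamedTuple


class ProcessEntry(NamedTuple):
    started: int    # time(2) value when the process started
    jobid: str
    title: str
    user: str
    frame: str      # kept as text to preserve padding, e.g. '0106'
    host: str       # host that ran the process
    priority: str   # kept as text; may carry flags, e.g. '100k'
    wall_secs: int
    sys_secs: int
    user_secs: int
    exit_code: int  # negative when the process was signaled


def parse_entry(line: str) -> ProcessEntry:
    """Parse one 'p' line from the cpu accounting file."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != "p":
        raise ValueError(f"not a process entry: {line!r}")
    (_, started, jobid, title, user, frame,
     host, priority, wall, sys_t, usr_t, code) = fields
    return ProcessEntry(int(started), jobid, title, user, frame, host,
                        priority, int(wall), int(sys_t), int(usr_t), int(code))


entry = parse_entry(
    "p 948242865 tahoe-797 KILLER erco 0504 superior 200 121 0 0 0")
print(entry.jobid, entry.frame, entry.wall_secs)
```

Frame and priority are deliberately left as strings, since frames may be zero padded and priorities may carry a trailing flag.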
CAVEATS
'Exit code' is normally a positive number representing the actual exit code of the process. This value will be negative if the process was signaled; the value being the signal number. If the value is negative, this usually means the process was killed, segfaulted, or was bumped by a higher priority process.
Commonly, the 'Exit code' will be one of:
-15 - process killed with SIGTERM; someone probably manually killed it
 -9 - process killed with SIGKILL; probably bumped in a priority battle
 -2 - process killed with SIGINT; someone sent it a ^C
  0 - process did an exit(0); frame Done
  1 - process did an exit(1); frame Fail
  2 - process did an exit(2); frame Requeue
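When summarizing an accounting file, the sign convention described above can be turned into a readable label. The following is a minimal Python sketch under stated assumptions: describe_exit is a hypothetical helper, and signal names are looked up with the numbering of the machine running the script, which may differ from the machine that ran the frame.

```python
import signal


def describe_exit(code: int) -> str:
    """Turn an 'Exit code' field into a short human readable label."""
    if code < 0:
        # Negative values mean the process was signaled; -code is the signal.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"  # number unknown on this platform
        return f"signaled ({name})"
    # Rush exit codes: 0 = Done, 1 = Fail, 2 = Requeue.
    states = {0: "Done", 1: "Fail", 2: "Requeue"}
    return states.get(code, f"exit({code})")


for code in (0, 1, 2, -9, -15):
    print(code, describe_exit(code))
```

Positive values other than 0, 1 and 2 are reported verbatim, since the render script is only expected to return the three rush exit codes.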
Although tempting, it is not recommended to use process execution times for cpu billing purposes. Wall clock time includes time the process may have spent waiting on network load. User and System times report only the time spent in the Render Script itself, not in its sub-processes (e.g. the renderer). To properly bill for cpu time, you would need either to enable full unix process accounting to obtain accumulated cpu time for all sub-processes in the user's render script, or to create wrapper scripts that use programs like timex(1) to monitor the execution time of the critical render/compositor binaries.
Tools like timex(1) indicate in their documentation that unix process accounting must be enabled for them to show sub-process totals. This is usually prohibitive on production machines, due to the disk resources used by the unix process accounting system.
TD Questions
How can I use padded frame numbers (0000) in my render script?
Use $RUSH_PADFRAME; it is created for you automatically with 4 digit padding. To do the padding yourself:
set padframe = `perl -e 'printf("%04d",$ENV{RUSH_FRAME});'`
To use different padding widths, just change the '4' (in '%04d') to a different number.
My renders are coming up 'FAIL'. How do I figure out what's wrong?
Check the frame logs being generated by your render script.
How do I have rush automatically retry frames? How do I set the number of retries?
See Retrying Frames.
My job isn't starting renders on my cpus. What's going on?
Use 'rush -lc' and check the Notes column for messages.
[erco@howland]% rush -lc
CPUSPEC[HOST]   STATE  FRM  PID  JOBTID  ELAPSED   NOTES
placid=3@100k   Idle   -    -    1       00:04:37  Job state is 'Pause'
tahoe=1@1       Idle   -    -    2       00:02:08  No more frames
superior=1@1    Idle   -    -    3       00:02:08  Not enough ram
waccubuc=1@1    Idle   -    -    4       00:02:08  This is a 'neverhost'
ontario=1@1     Idle   -    -    5       00:02:08  Failed 'criteria' check
How do I setup my submit script to only render on certain platforms or operating systems?
Use the Criteria submit script command.
How can I render several frames in one process using rush?
With clever scripting. See Batching Multiple Frames for how to render several frames at a time.
My job has its 'k' flag set; why isn't it bumping off other jobs' frames?
For a job to bump another off a cpu, these things must be true:
Is there an easier way to set the RUSH_JOBID environment variable?
You can use eval `submit` to automatically set it, or a simple alias to set it manually. However, cutting and pasting the setenv command is not so hard.
alias jid 'setenv RUSH_JOBID "\!*"'
Then you can use it on the command line to set one or more jobids:
If you want to have the RUSH_JOBID variable set automatically in your shell whenever you invoke your submit script, then use 'eval':
The shell automatically parses the 'setenv RUSH_JOBID' command that rush prints on stdout when a job is successfully submitted. Error messages are not affected by 'eval', so you don't have to worry about losing error messages when using this technique.
Systems Administrator Questions
What's the best way to verify all the daemons are running?
This 'pings' all the daemons in the $RUSH_DIR/etc/hosts file with a TCP message. If the daemon isn't running, tail(1) the daemon's log file in $RUSH_DIR/var/rushd.log.
How do I stop/start the daemons? (Unix/NT)
All the daemons can be stopped via:
Is there an example boot script I can use to invoke rush?
Is there a way to run 'rush -online' automatically when someone logs out?
A literal example of what should be added to these files would be:
logger -t RUSH "Rush online (user logout)"
Use of logger(1) is optional; it leaves an audit trail in the syslog. Include the full path to logger(1) if security is an issue.
Is there a way to run 'rush -online' automatically when someone's screensaver pops on?
If you have any suggestions on how to do it on various platforms, please send me email.
How do I update changes to the rush hosts file (or rush.conf file) to the network?
# SEND A NEW rush.conf
foreach i ( `awk '/^[a-z]/{print $1}' /usr/local/rush/etc/hosts` )
    rdist -c /usr/tmp/newconf ${i}:/usr/local/rush/etc/rush.conf
end
# SEND A NEW RUSH hosts
foreach i ( `awk '/^[a-z]/{print $1}' /usr/tmp/newhosts` )
    rdist -c /usr/tmp/newhosts ${i}:/usr/local/rush/etc/hosts
end
NOTE: When sending out new files, you must use rdist(1), and not cp(1) or rcp(1). rdist(1) uses a special 'tmp-file/rename' technique that prevents the daemon from parsing the file before it has finished being written.
Is there a way to track whose jobs are bumping whose?
Is there a way to track who's changing other people's jobs?
Can rush be told to use a different network interface, other than the machine's hostname?
The name on the left of the ':' is the familiar hostname(1) of the machine, and the name that follows the ':' is the alternate network interface you want to use. See also the Hosts File, section on the hostname field.