From: Antoine Durr <antoine@(email surpressed)>
Subject: cpu balancing script
   Date: Fri, 01 Jun 2007 14:48:35 -0400
Msg# 1568
I'm under the impression that to balance cpus per user requires an external script that dynamically resets priorities based on who should have more or fewer cpus than they already have. Does anyone have an example of such a script?

Thanks,

-- Antoine

Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Fri, 01 Jun 2007 15:17:44 -0400
Msg# 1569
Antoine Durr wrote:
> I'm under the impression that to balance cpus per user requires an 
> external script that dynamically resets priorities based on who should 
> have more or fewer cpus than they already have.  Does anyone have an 
> example of such a script?

	I'm happy to see any responses.

	Just please remember not to post proprietary or confidential code.
	(ie. code with restrictive rights banners)

	For more info, please refer to this newsgroup's FAQ:
	http://seriss.com/cgi-bin/rush/newsgroup-threaded.cgi?-view+1

   From: Antoine Durr <antoine@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Tue, 05 Jun 2007 20:50:51 -0400
Msg# 1576
On 2007-06-01 12:17:44 -0700, Greg Ercolano <erco@(email surpressed)> said:

> Antoine Durr wrote:
>> I'm under the impression that to balance cpus per user requires an
>> external script that dynamically resets priorities based on who should
>> have more or fewer cpus than they already have.  Does anyone have an
>> example of such a script?
>
> 	I'm happy to see any responses.
>
> 	Just please remember not to post proprietary or confidential code.
> 	(ie. code with restrictive rights banners)
>
> 	For more info, please refer to this newsgroup's FAQ:
> 	http://seriss.com/cgi-bin/rush/newsgroup-threaded.cgi?-view+1

So I whipped up something this afternoon. It's pretty simplistic, and I'm curious as to where it might fall flat on its nose.


#!/usr/bin/perl -w

#
# rush_rebalance.pl
#
# A cpu priority rebalancing script for the Rush render queue
# (c) 2007 Antoine Durr, Floq FX Inc, use at your own risk, YMMV.
#
# For each iteration, get a list of all the running cpus (rush -laj).  Place
# these in a job:cpu_count hash, and sort them descending by how many cpus they
# have running.  Thus, the highest-running job gets priority 1, the next
# highest gets priority 2, and so on.  Now for each job, get a list of the
# jobtaskids (rush -lc), and reset them to the priority.
#
# This is a pretty simplistic setup, and has no provisions for user-desired
# prioritization.  I have no idea how well or poorly this will fare in a large
# environment, or whom it will piss off in the process.
#

while (1)
{
   # generate "jobid cpucount" pairs for running and paused jobs
   # (note: newer GNU coreutils spell 'tail +3' as 'tail -n +3')
   $runlist = `rush -laj | grep -v Done | grep -v RESERVE | tail +3 | awk '{print \$2" " \$7}'`;
   chomp $runlist;

   # set up jobid=>cpu_count hash, e.g. tahoe.123 => 12, tahoe.125 => 4
   %runlist = split(/\s+/, $runlist);

   $priority = 1; # reset priorities
   # sort numerically based on hash value
   foreach $job (sort { $runlist{$b} <=> $runlist{$a} } keys %runlist) {
       # print "job: $job, cpus: $runlist{$job} newpriority: $priority\n";

       # grab the list of jobtid's
       $jobtidlist = `rush -lc $job | awk '{print \$5}' | tail +2`;
       chomp $jobtidlist;

       @jobtidlist = split(/\s+/, $jobtidlist);
       $jobtidlist = join(" .", @jobtidlist); # '.' before each id; leading '.' added below

       print "rush -fu -cp $job .$jobtidlist \@$priority\n";

       # uncomment the next line to have this script actually do something!
       # `rush -fu -cp $job .$jobtidlist \@$priority\n`;

       $priority++;
   }

   sleep 5;
}

--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Tue, 05 Jun 2007 21:29:07 -0400
Msg# 1577
Antoine Durr wrote:
> So I whipped up something this afternoon.   It's pretty simplistic, and 
> I'm curious as to where it might fall flat on its nose.

	You might find it 'beats up' rush too much by having it change
	things every 5 seconds.

	Also, try to avoid running 'rush -cp' commands if it's not actually
	changing anything.

	The script I'm writing counts the number of available cpus,
	and splits them up to each user. So if there are 15 procs
	and 3 different users, each user gets 5 cpus assigned to their
	jobs. If a user has multiple jobs, each job gets a few cpus,
	but no more than 5 total for all their jobs.

	I'm getting hung up on the latter condition of what to do
	when a user has 10 jobs with all their priorities equal,
	but only 5 cpus allocated to them; just assign 1 cpu to
	their first 5 jobs, and let the others languish until
	more procs free up or jobs finish?
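
	A minimal sketch of that even split in Perl, including the
	"one proc per job until the quota runs out" answer to the
	dilemma above. The %user_jobs hash and the proc count are
	illustrative; a real watcher would build them from 'rush -laj':

#!/usr/bin/perl -w
# Sketch: divide available procs evenly among users, then deal each
# user's quota out one proc at a time across their jobs, so 10
# equal-priority jobs with a quota of 5 get 1 proc each for the first
# 5 jobs; the rest wait for procs to free up.
use strict;

sub split_procs {
    my ($avail, %user_jobs) = @_;
    my %alloc;                              # jobid => assigned procs
    my @users = sort keys %user_jobs;
    return %alloc unless @users;
    my $quota = int($avail / @users);       # e.g. 15 procs / 3 users = 5
    for my $user (@users) {
        my @jobs = @{ $user_jobs{$user} || [] };
        next unless @jobs;
        $alloc{ $jobs[$_ % @jobs] }++ for 0 .. $quota - 1;
    }
    return %alloc;
}

my %alloc = split_procs(15,
    mary => ['tahoe.100'],
    joe  => ['tahoe.101', 'tahoe.102'],
    sam  => ['tahoe.103', 'tahoe.104', 'tahoe.105', 'tahoe.106',
             'tahoe.107', 'tahoe.108', 'tahoe.109'],
);
print "$_ => $alloc{$_} procs\n" for sort keys %alloc;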

	I'm using 'rush -status +any -c 0 -s 20' to get the job info;
	it will poll forever until killed, and is low bandwidth.
	I'm caching the data, looking for changes in job states
	or jobs either added or removed.

	I count 'available cpus' as cpus that are 'online' and
	don't have RESERVE jobs assigned to them.
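
	And a sketch of that poll-and-cache pattern: leave 'rush -status'
	running as a pipe and only react when a cached line changes,
	instead of rescheduling on a fixed timer. The per-line parse is
	a placeholder, since the real field layout isn't shown here:

#!/usr/bin/perl -w
# Sketch: watch 'rush -status' output for changes (assumed line format).
use strict;

open(my $status, '-|', 'rush -status +any -c 0 -s 20')
    or die "rush -status: $!";

my %cache;                           # first field => last line seen
while (my $line = <$status>) {
    chomp $line;
    my ($key) = split ' ', $line;    # placeholder: key on first field
    next unless defined $key && length $key;
    next if defined $cache{$key} && $cache{$key} eq $line;
    $cache{$key} = $line;
    # something changed: recompute allocations here
}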

	Anyway, it's taking a while to write, to handle the weird
	situations.

	But definitely you must avoid the tendency to keep changing
	things in rush to the point where all it's doing is rescheduling
	things. Only tell rush to change things when there's something
	worth changing, and try not to 'oscillate' (ie. jumping jobs
	up and down one or two procs every iteration, due to rounding)

-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Antoine Durr <antoine@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 14:44:37 -0400
Msg# 1579
On 2007-06-05 18:29:07 -0700, Greg Ercolano <erco@(email surpressed)> said:

> Antoine Durr wrote:
>> So I whipped up something this afternoon.  It's pretty simplistic, and
>> I'm curious as to where it might fall flat on its nose.
>
> 	You might find it 'beats up' rush too much by having it change
> 	things every 5 seconds.
>
> 	Also, try to avoid running 'rush -cp' commands if it's not actually
> 	changing anything.
>
> 	The script I'm writing counts the number of available cpus,
> 	and splits them up to each user. So if there are 15 procs
> 	and 3 different users, each user gets 5 cpus assigned to their
> 	jobs. If a user has multiple jobs, each job gets a few cpus,
> 	but no more than 5 total for all their jobs.

Yeah, that's really the right way to go, as it should be per-user.


> 	I'm getting hung up on the latter condition of what to do
> 	when a user has 10 jobs with all their priorities equal,
> 	but only 5 cpus allocated to them; just assign 1 cpu to
> 	their first 5 jobs, and let the others languish until
> 	more procs free up or jobs finish?

Well, the problem with changing the # of cpus allocated to a job is that you stomp on information that could be critical to the job, e.g. I'm running a comp, but don't want to run with more than 4 cpus, or I'll flood the IO bandwidth of the drives. Or I've only got two Houdini licenses, thus can only run two jobs. Thus, the thing that should be tweaked is priority, so that a person deficient in cpus gets a higher priority, and therefore a greater chance of picking up the next available proc. Of course, then there's no longer a user-settable priority system. This isn't too bad, as (IMO) all users should have the same priority, and it's up to show management to tweak that.
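
A sketch of that deficiency-to-priority idea: rank users by how far below their fair share they are, and hand the most deficient user's jobs the highest priority. The %fair_share and %user_cpus numbers here are made up; a watcher would compute them from the queue state:

#!/usr/bin/perl -w
# Sketch: map per-user cpu deficit to job priority (hypothetical data).
use strict;

my %fair_share = (mary => 5, joe => 5, sam => 5);   # target cpus per user
my %user_cpus  = (mary => 9, joe => 4, sam => 2);   # cpus currently held

my %deficit = map { $_ => $fair_share{$_} - ($user_cpus{$_} || 0) }
              keys %fair_share;

my $pri = 999;
for my $user (sort { $deficit{$b} <=> $deficit{$a} } keys %deficit) {
    print "set ${user}'s jobs to priority $pri\n";  # would run 'rush -cp' here
    $pri -= 10;                                     # step down per user
}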


> 	I'm using 'rush -status +any -c 0 -s 20' to get the job info;
> 	it will poll forever until killed, and is low bandwidth.
> 	I'm caching the data, looking for changes in job states
> 	or jobs either added or removed.
>
> 	I count 'available cpus' as cpus that are 'online' and
> 	don't have RESERVE jobs assigned to them.
>
> 	Anyway, it's taking a while to write, to handle the weird
> 	situations.
>
> 	But definitely you must avoid the tendency to keep changing
> 	things in rush to the point where all it's doing is rescheduling
> 	things. Only tell rush to change things when there's something
> 	worth changing, and try not to 'oscillate' (ie. jumping jobs
> 	up and down one or two procs every iteration, due to rounding)

Ideally, the priority scheduling should be revised on every cpu assignment and every done frame, so that the next assignment makes the distribution more balanced. The challenge then becomes dealing with fast frames, as you then spend an inordinate amount of time rebalancing. Thus, every 5 or 10 seconds should be plenty. However, if a whole slew of cpus can be assigned in that time, the queue could very quickly become out of balance.

This process is, IMO, one of the top requirements of a render queue.

-- Antoine

--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 15:03:29 -0400
Msg# 1580
Antoine Durr wrote:
> Well, the problem with changing the # of cpus allocated to a job is 
> that you stomp on information that could be critical to the job, e.g. 

	I see; one way to communicate that to the watcher would be to
	have the job specify 'maxcpus 4' (rush submit command), and the
	watcher could notice that and honor it as a 'given limit' for
	that particular job, and not assign more cpus than that.

> I'm running a comp, but don't want to run with more than 4 cpus, or 
> I'll flood the IO bandwidth of the drives.

	Heh; if 4 procs are flooding the fileserver bandwidth,
	time for a new file server ;)

> Or I've only got two Houdini licenses, thus only run two jobs.

	I see; in such cases either maxcpus or even just submitting
	to two named machines might help. eg:

maxcpus 2
cpus    tahoe=1@999 ontario=1@999

> Thus, the thing that should 
> be tweaked is priority, so that a person deficient in cpus gets a higher 
> priority, and therefore a greater chance of picking up the next 
> available proc.  Of course, then there's no longer a user-settable 
> priority system.  This isn't too bad, as (IMO) all users should have 
> the same priority, and it's up to show management to tweak that.

	Ya, this is kinda why I leave it up to customers to implement
	their own watcher scripts, cause folks want to schedule things
	their own ways.

	You should be able to slap rush around to follow your own rules,
	just be sure not to slap it around too often, or it'll spend more
	time rescheduling things than it will keeping things running.

	I hope that my 'working example', whenever I get it working,
	will be a good starting point, as I intend to show good practices
	in how to monitor/adjust rush at a decent rate, without overloading
	the daemons.

	Trouble is, I've been busy on some other stuff, but I intend
	to have something in the not too distant future.

> Ideally, the priority scheduling should be revised on every cpu 
> assignment and every done frame, so that the next assignment makes the 
> distribution more balanced.

	My take on it is that there are so many problems that come
	with production, that no one scheduler can handle them all
	efficiently.. it's just too complicated.

	And when the scheduler is so complicated that only a few
	zen gurus can understand it, folks will curse it constantly.

	So rush's view is to just keep procs busy, and let cpucaps and
	priority manage things. If things are taking too long,
	bump a few of the high-pri procs ('staircasing') up a little more,
	maybe even make them 'k' (kill) priority.

> The challenge then becomes dealing with 
> fast frames, as you then spend an inordinate amount of time 
> rebalancing.  Thus, every 5 or 10 seconds should be plenty.  However, 
> if a whole slew of cpus can be assigned in that time, the queue could 
> very quickly become out of balance.

	I think if you look at it from the point of view where the
	caps prevent things from getting too crazy between samples,
	you'll find stability.

	Oscillation in a scheduler is a common problem, and I avoid
	all that by having static scheduling to fit the distributed
	nature of the queue.

	Rush has a different approach from the centralized schedulers;
	it takes some time to understand its approach, and not try
	to 'force fit' a scheduling algorithm that's too opposite to
	its design. The idea behind rush's design is to prevent the
	need for micromanaged processing; if you have a comp you want
	to sneak by on a few procs, just give the job a few high-pri
	procs, and the rest at low, eg:

cpus +any=2@900k
cpus +any=10@1

	Works best if everyone is using that same technique, so that
	all are guaranteed a few procs, and the rest round robin..
	if there's nothing else going on, they get as many procs
	as they can handle.


-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Robert Minsk <rminsk@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 15:17:49 -0400
Msg# 1581
On Wednesday 06 June 2007 12:03, Greg Ercolano wrote:
> [posted to rush.general]
>
> Antoine Durr wrote:
> > Well, the problem with changing the # of cpus allocated to a job is
> > that you stomp on information that could be critical to the job, e.g.
>
>  I see; one way to communicate that to the watcher would be to
>  have the job specify 'maxcpus 4' (rush submit command), and the
>  watcher could notice that and honor it as a 'given limit' for
>  that particular job, and not assign more cpus than that.

In our system we communicate scheduling information like this using the 
"Notes" field of the job submission.  In fact, from Greg's docs: "Each job has a 
free form 'notes' field, which can be used for various purposes, such as 
passing informational notes to other programs..."
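
A sketch of that convention: pack watcher hints into the notes field as key=value pairs at submit time, then parse them back out in the watcher. The "maxcpus=4 lic=houdini" encoding is just one possible scheme:

#!/usr/bin/perl -w
# Sketch: parse scheduling hints out of a job's free-form notes field.
use strict;

my $notes = 'maxcpus=4 lic=houdini';    # as read back from the job's notes
my %hint  = map { split /=/, $_, 2 } split ' ', $notes;
print "cap job at $hint{maxcpus} cpus\n" if exists $hint{maxcpus};
print "job needs a '$hint{lic}' license\n" if exists $hint{lic};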

   From: Antoine Durr <antoine@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 15:48:57 -0400
Msg# 1582
On 2007-06-06 12:03:29 -0700, Greg Ercolano <erco@(email surpressed)> said:

> Antoine Durr wrote:
>> Well, the problem with changing the # of cpus allocated to a job is
>> that you stomp on information that could be critical to the job, e.g.
>
> 	I see; one way to communicate that to the watcher would be to
> 	have the job specify 'maxcpus 4' (rush submit command), and the
> 	watcher could notice that and honor it as a 'given limit' for
> 	that particular job, and not assign more cpus than that.

>> I'm running a comp, but don't want to run with more than 4 cpus, or
>> I'll flood the IO bandwidth of the drives.
>
> 	Heh; if 4 procs are flooding the fileserver bandwidth,
> 	time for a new file server ;)

That was just a number. But with gig-e and Xeon 5355s running 4K comps, you can get I/O limited very quickly.


>> Or I've only got two Houdini licenses, thus only run two jobs.
>
> 	I see; in such cases either maxcpus or even just submitting
> 	to two named machines might help. eg:
>
> maxcpus 2
> cpus    tahoe=1@999 ontario=1@999

Beyond their own machine (which is important), I really dislike the notion of a user having to know or choose what machines their stuff lands on. The users don't (and shouldn't) have control of arbitrary machines. And what if one of those goes down? What if a new host gets added in the middle of the night? A job should not have any innate knowledge about specific machines; it should know about pools of machines only.


>> Thus, the thing that should
>> be tweaked is priority, so that a person deficient in cpus gets a higher
>> priority, and therefore a greater chance of picking up the next
>> available proc.  Of course, then there's no longer a user-settable
>> priority system.  This isn't too bad, as (IMO) all users should have
>> the same priority, and it's up to show management to tweak that.
>
> 	Ya, this is kinda why I leave it up to customers to implement
> 	their own watcher scripts, cause folks want to schedule things
> 	their own ways.

I'm curious as to the different ways that exist. What kinds of things are important to people?


> 	You should be able to slap rush around to follow your own rules,
> 	just be sure not to slap it around too often, or it'll spend more
> 	time rescheduling things than it will keeping things running.
>
> 	I hope that my 'working example', whenever I get it working,
> 	will be a good starting point, as I intend to show good practices
> 	in how to monitor/adjust rush at a decent rate, without overloading
> 	the daemons.
>
> 	Trouble is, I've been busy on some other stuff, but I intend
> 	to have something in the not too distant future.

>> Ideally, the priority scheduling should be revised on every cpu
>> assignment and every done frame, so that the next assignment makes the
>> distribution more balanced.
>
> 	My take on it is that there are so many problems that come
> 	with production, that no one scheduler can handle them all
> 	efficiently.. it's just too complicated.

I think a good way to do this is to keep the load balancing out of the assignment loop (which is inherently the way it is right now). Give the users a few different balancing scripts, with different knobs. The caveat, of course, is that these balancing scripts must be able to get enough low-level information, and run frequently enough, to be useful.


> 	And when the scheduler is so complicated that only a few
> 	zen gurus can understand it, folks will curse it constantly.

I think folks will curse it if it doesn't do what it says it will do. In the meantime, Rush wins by having superb assignment consistency, at the expense of balancing fairness.


> 	So rush's view is to just keep procs busy, and let cpucaps and
> 	priority manage things. If things are taking too long,
> 	bump a few of the high-pri procs ('staircasing') up a little more,
> 	maybe even make them 'k' (kill) priority.

The only place I think kill priority is warranted is on your own machine. Killing jobs that are mid-way is a great way to waste resources. I love that Rush can be set up so that your own machine is "yours", which is of great comfort to users. Yes, if I want to use my machine, I should be able to kill whatever's on there (since I can do that already with -getoff). But you wouldn't want me to have that power over *your* machine.

I don't think the users should be tasked with determining the most advantageous priorities themselves just to get their frames run. Also, when you get to the point of having 10x as many jobs as there are cpus, the tendency to favor your own can easily outweigh the needs of the studio. *That's* where users start to curse. So yeah, it's worth having something in there so that you get at least 1 cpu to get your stuff going. At R&H, we had the notion that single-frame jobs got higher priority than multi-frame jobs, so that you could run your single test frames and get them through the queue. And what did I do? I wrote a submission script that submitted all my frames as single-frame jobs! ;-)


>> The challenge then becomes dealing with
>> fast frames, as you then spend an inordinate amount of time
>> rebalancing.  Thus, every 5 or 10 seconds should be plenty.  However,
>> if a whole slew of cpus can be assigned in that time, the queue could
>> very quickly become out of balance.
>
> 	I think if you look at it from the point of view where the
> 	caps prevent things from getting too crazy between samples,
> 	you'll find stability.
>
> 	Oscillation in a scheduler is a common problem, and I avoid
> 	all that by having static scheduling to fit the distributed
> 	nature of the queue.
>
> 	Rush has a different approach from the centralized schedulers;
> 	it takes some time to understand its approach, and not try
> 	to 'force fit' a scheduling algorithm that's too opposite to
> 	its design. The idea behind rush's design is to prevent the
> 	need for micromanaged processing; if you have a comp you want
> 	to sneak by on a few procs, just give the job a few high-pri
> 	procs, and the rest at low, eg:
>
> cpus +any=2@900k
> cpus +any=10@1
>
> 	Works best if everyone is using that same technique, so that
> 	all are guaranteed a few procs, and the rest round robin..
> 	if there's nothing else going on, they get as many procs
> 	as they can handle.

The round robin'ing still fails when you have a mix of long renders and short frames. The long frames get all the cpus.

-- Antoine



--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 16:15:34 -0400
Msg# 1583
Antoine Durr wrote:
>> maxcpus 2
>> cpus    tahoe=1@999 ontario=1@999
> 
> Beyond their own machine (which is important) I really dislike the 
> notion of a user having to know or choose what machines their stuff 
> lands on.

	Agreed; but if you have e.g. node locked licenses,
	it at least ensures only those boxes get the renders.

	If you have floaters, then yes, you wouldn't specify
	hostnames, just a cpu cap.

	Thing is, if you submit two jobs that are both houdini
	jobs, and you only have two floating lics, then you
	need some 'centralized' way to count the lics; this is
	where a watcher script might come in, and realize there
	are two houdini jobs and limit the procs, or tweak the
	priorities to prevent a situation where more than two
	are rendering at once.

	Rush itself doesn't maintain a centralized perspective
	of things, so it can't do centralized stuff like counting.

> I'm curious as to the different ways that exist.  What kinds of things 
> are important to people?

	I've seen a variety of requests too numerous to mention.
	I think Esc had the most complex of all the scheduling
	algorithms I'd come across for The Matrix III. They had
	all kinds of stuff in there; taking ram into account,
	render times that change over time, I think they were
	even polling ram use as the renders ran. The guy who was
	writing it had a lot of high level goals.

	Trouble with their implementation was they had >500 boxes
	on their farm, and were throwing all the jobs submitted
	to one box..! I warned that was a bad, bad idea from the get go,
	but they locked into that for some reason. The whole point
	of rush's decentralized design is to distribute the job
	load, so focusing all jobs on a single box on a large net
	really hinders it. This made it tough for them, because
	the central box became overloaded fast, in addition to their
	watcher constantly rescheduling it.

	That was a while ago, 2002/2003 IIRC, when boxes and
	networks were slower, and rush has had some optimizations
	since then.

-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Antoine Durr <antoine@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 16:30:46 -0400
Msg# 1584
On 2007-06-06 13:15:34 -0700, Greg Ercolano <erco@(email surpressed)> said:

> Antoine Durr wrote:
>>> maxcpus 2
>>> cpus    tahoe=1@999 ontario=1@999
>>
>> Beyond their own machine (which is important), I really dislike the
>> notion of a user having to know or choose what machines their stuff
>> lands on.
>
> 	Agreed; but if you have e.g. node locked licenses,
> 	it at least ensures only those boxes get the renders.

Again, an appropriately named pool would take care of that.


> 	If you have floaters, then yes, you wouldn't specify
> 	hostnames, just a cpu cap.
>
> 	Thing is, if you submit two jobs that are both houdini
> 	jobs, and you only have two floating lics, then you
> 	need some 'centralized' way to count the lics; this is
> 	where a watcher script might come in, and realize there
> 	are two houdini jobs and limit the procs, or tweak the
> 	priorities to prevent a situation where more than two
> 	are rendering at once.

Jobs should have the notion of "requirements", one of which is a particular license type. Admittedly, this is tricky because users use the licenses w/out notifying the queue! So the queue has to figure out what's left, figure out how many it's using, and only allow so many more after that.


> 	Rush itself doesn't maintain a centralized perspective
> 	of things, so it can't do centralized stuff like counting.
>
>> I'm curious as to the different ways that exist.  What kinds of things
>> are important to people?
>
> 	I've seen a variety of requests too numerous to mention.
> 	I think Esc had the most complex of all the scheduling
> 	algorithms I'd come across for The Matrix III. They had
> 	all kinds of stuff in there; taking ram into account,
> 	render times that change over time, I think they were
> 	even polling ram use as the renders ran. The guy who was

Funny, I'm doing the ram-usage polling right now. I simultaneously launch a memory watcher script which, given a PID, finds all the child PIDs, adds up their memory consumption, and writes the total to a file in the logdir. When the process completes, it tails the last line of that file and puts it into the per-frame notes field, so that the users see what kind of memory footprint their job had. This is a pretty critical feature, IMO, as you *really* want to avoid going into swap on a box!
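
A sketch of that watcher in Perl: total the resident memory of a root PID and all its descendants from one ps(1) snapshot. The -o fields are reasonably portable ('rss' is KB on Linux); adjust per platform:

#!/usr/bin/perl -w
# Sketch: sum RSS over a process tree, given the root PID.
use strict;

sub tree_rss_kb {
    my ($root) = @_;
    my (%kids, %rss);
    for (`ps -e -o pid= -o ppid= -o rss=`) {
        my ($pid, $ppid, $rss) = split;
        next unless defined $rss;
        push @{ $kids{$ppid} }, $pid;
        $rss{$pid} = $rss;
    }
    my $total = 0;
    my @stack = ($root);
    while (@stack) {
        my $pid = pop @stack;
        $total += $rss{$pid} || 0;
        push @stack, @{ $kids{$pid} || [] };
    }
    return $total;
}

my $pid = shift @ARGV or die "usage: $0 pid\n";
printf "%d KB\n", tree_rss_kb($pid);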

Ideally, I should be able to submit with a requirement of a certain amount of ram, and have the frames only run on machines that have that much ram left. Yes, that is not failure-free, as doing only spot-checks doesn't tell you that a particular job on a host might suddenly chew up another gig. But at least it should try.

-- Antoine

> 	writing it had a lot of high level goals.
>
> 	Trouble with their implementation was they had >500 boxes
> 	on their farm, and were throwing all the jobs submitted
> 	to one box..! I warned that was a bad, bad idea from the get go,
> 	but they locked into that for some reason. The whole point
> 	of rush's decentralized design is to distribute the job
> 	load, so focusing all jobs on a single box on a large net
> 	really hinders it. This made it tough for them, because
> 	the central box became overloaded fast, in addition to their
> 	watcher constantly rescheduling it.
>
> 	That was a while ago, 2002/2003 IIRC, when boxes and
> 	networks were slower, and rush has had some optimizations
> 	since then.


--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 18:08:41 -0400
Msg# 1585
Antoine Durr wrote:
>> 	Agreed; but if you have e.g. node locked licenses,
>> 	it at least ensures only those boxes get the renders.
> 
> Again, an appropriately named pool would take care of that.

	Yes, submitting to +houdini=2 is nicer, so even if the
	node locks are moved around or changed, users don't have to
	change their 'Cpus' specs.

> Jobs should have the notion of "requirements", one of which is a 
> particular license type.   Admittedly, this is tricky because users use 
> the licenses w/out notifying the queue!  So the queue has to figure out 
> what's left, figure out how many it's using, and only allow so many 
> more after that.

	Yes.. some companies make license counting 'wrappers' for their
	renders, so that interactive use can be tracked and be 'predicted'
	as part of a larger, 'reservation' oriented system.

	Others use their 'watcher' to interrogate the third party
	license managers, to see how many lics are available, and
	modify the cpu allocations to keep that balanced.
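
	A sketch of that license interrogation for FLEXlm-managed
	software. The lmstat output format varies between vendors and
	versions, so the regex below is an assumption:

#!/usr/bin/perl -w
# Sketch: count free FLEXlm seats so a watcher can cap concurrent jobs.
use strict;

my $out = `lmutil lmstat -f houdini 2>/dev/null` || '';
my ($issued, $used) =
    $out =~ /Total of (\d+) licenses? issued.*?Total of (\d+) licenses? in use/s;
my $free = defined $issued ? $issued - $used : 0;
print "houdini seats free: $free\n";   # watcher would cap procs at $free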

	Honestly though, most folks with large nets just buy or rent
	the licenses they need so they can make use of the whole farm,
	and not have to juggle that stuff, because even when it's done
	right, there are race conditions with the license counting,
	unless you have some kind of reservation system embedded in
	the license system, or a wrapper that does this.

> When the process completes, it tails the last line of that 
> file, and puts it into the per-frame notes field, so that the users see 
> what kind of memory footprint their job had.

	Yes, that is very useful info, and it's best to simply advertise
	it to the user, so they can submit their job with a correct
	'Ram' value based on their impression of the numbers.

	Often the renderers print those numbers in the log for you,
	so you can just grep them out; maya and mental ray both
	do this.

	Trouble is the format of these messages can change from one
	rev of the renderer to another. And, depending on the job,
	sometimes there are several of these messages per frame,
	such as when renders are batching, or worse, when complex jobs
	render multiple images per frame ('levels', 'passes', 'comps' etc)
	So it gets a little tricky to implement that in a way that
	works for all situations.

	It's a good idea to show that info in the rush 'Frames' reports,
	just watch out that on large networks, running the 'rush -notes'
	command from within the render script /every frame/ may
	"DoS attack" your jobserver esp if the render times are short.

	For folks with large networks, though, I recommend /against/ running
	commands like 'rush -notes', 'rush -ljf' in the render scripts every
	frame, for that reason. 'rush -notes' is a high latency TCP command
	that isn't really meant to be run on 100's of machines at once all
	to the same server. 'rush -notes' is only recommended for advertising
	error conditions.

	On a small net like yours, though, it shouldn't be a problem.
	It's only when you get above 50 render nodes or so that this
	would become a problem; depends on how fast render times are.

	A few weeks ago I implemented a much lower latency 'rush -exitnotes'
	command which handles sending back 'per frame' messages to the
	jobserver in a reasonable manner by connecting to the 'render node'
	instead of the job server, setting things up so the note is passed
	back to the job server as part of the UDP message that delivers the
	exit code back to the server. It'll be in version 102.43.

	I'd like to add the 'ram usage indicator' to the submit scripts
	as an option, once the 'rush -exitnotes' is fully released.

> This is a pretty critical 
> feature, IMO, as you *really* want to avoid going into swap on a box!

	Yes, definitely.

	Gets tricky to detect swap though, as often when a box looks like
	it's swapping, it's actually just paging out old junk to make room
	in ram that it should have cleared out long ago.

	Best thing is to just know in advance how much ram the job will
	tend to need, and submit with that ram value set. (e.g. the 'Ram:'
	prompt in the submit forms)

	'rushtop' is handy for seeing if a job is using a lot of ram.
	Just render your job on a box, and watch the ram use as the
	render runs to get a feel for how nasty it is.. then submit
	with the 'Ram' field set accordingly.

	Rushtop is the only thing in rush that actually polls ram use,
	and that's global ram use, not process hierarchy ram use.. the
	only thing about ram all the OS's deliver coherently for our
	purposes.

	But the numbers the renderers spit out are the best ones
	by far. Beats polling.

> Ideally, I should be able to submit with a requirement of a certain 
> amount of ram, and have the frames only run on machines that have that 
> much ram left.

	Yes, the 'Ram' submit value can be used for this; the value
	your job submits with, say '10', tells rush each frame uses
	'10' units of ram. The 'RAM' column in the rush/etc/hosts file
	indicates how many units of ram rush thinks each machine has,
	so when it tries to start a frame rendering, it subtracts the
	job's 'ram' value from the total to see if there's room to run it.

	This is all management of static values.. rush doesn't actually
	poll the machine's ram use. Rush assumes if it can use the machine,
	it can use all of it. You can reserve cpus (and ram) using the
	'rush -reserve' command.
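
	A sketch of that static bookkeeping, with made-up numbers: a
	frame may start on a host only if the job's 'Ram' units fit in
	what the host has left:

#!/usr/bin/perl -w
# Sketch: static ram-unit accounting (illustrative, not rush internals).
use strict;

my %ram_left = (tahoe => 32, ontario => 16);   # units, as per rush/etc/hosts
my $job_ram  = 10;                             # job's submit-time 'Ram' value

for my $host (sort keys %ram_left) {
    next if $ram_left{$host} < $job_ram;       # not enough room on this host
    $ram_left{$host} -= $job_ram;              # reserve it for the frame
    print "frame can start on $host ($ram_left{$host} units left)\n";
}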

> Yes, that is not failure-free, as doing only 
> spot-checks doesn't tell you that a particular job on a host might 
> suddenly chew up another gig.  But at least it should try.

	My goal was to avoid any features in rush that had too much
	of a 'fuzzy' aspect to it, such as polling ram use.

	Even kernel mailing lists argue endlessly on how free ram
	should be determined, and it often changes from release
	to release. I've had to tweak rushtop several times to take
	into account changes in the different OS's ram calculations.

	It's too bad that most of the OS's (esp unix!) don't let
	a parent program get back the memory use and usr/sys time
	of the accumulated process tree. The counters are there
	in the structures for the accumulated times, but they're
	all zeroed out. All you can get back is the ram/cpu time
	of the immediate process reliably, and that's usually just
	the perl script, which is useless. The only way to get unix
	to show process hierarchy data that I've seen is to have
	system wide process accounting (acct(2)) turned on.
	And whenever that's on, the system bogs down because
	process accounting makes giant logs quickly. But it helps
	the OS tally accumulated process tree info internally.

	Surprisingly, Windows is the only OS that seems to tally
	child processes correctly, both ram and cpu ticks, with
	their new job objects stuff.

	I only recently discovered that, and will try some experiments
	to see if it /actually/ works and isn't just a place holder.
	This way the cpu.acct file can finally log the memory use
	and usr/sys time info instead of being all zeroes, as well
	as providing the info for the frames report..!

> Funny, I'm doing the ram-usage polling right now.  I simultaneously 
> launch a memory watcher script, which given a PID, finds all the child 
> PIDs and adds up their memory consumption, writes to a file in the 
> logdir.

	Trouble I've found with snapshotting the proc table (did it at DD)
	is that you run into a few real world problems, enough that it
	can often cause more trouble than it's worth.

	When polling the process hierarchy, you can end up with wild
	snapshots when processes fork, showing double memory use during
	that time. You can try to smooth those out as aberrant data,
	but some renders fork frequently, causing the data to sometimes
	appear valid, throwing the job into a stall.

	Also, sometimes a single frame would go bananas on ram, causing
	the queue to think the job was going into a phase of high memory
	use. Or sometimes a scene will simply go from a black frame to
	a sudden high memory use, enough to swap. An automated mechanism
	that tries to use this wild data to handle scheduling almost always
	stalls the job, causing folks to simply turn off the feature;
	they'd rather have their render crash a few machines on the
	few frames that go bananas instead of having the job completely
	stall in the middle of the night.

	In rushtop I added an experimental 'paging activity' indicator
	(the 'orange bar') which watches for 'excessive' paging activity,
	and bumps the bar when that happens. This limit was determined
	empirically.. when the orange bar appears, chances are you can
	'feel' the slowness if you're on that box in the form of an
	unresponsive mouse, or similar.

-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Antoine Durr <antoine@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 18:26:03 -0400
Msg# 1586
On 2007-06-06 15:08:41 -0700, Greg Ercolano <erco@(email surpressed)> said:

> Antoine Durr wrote:
>
> 	Yes.. some companies make license counting 'wrappers' for their
> 	renders, so that interactive use can be tracked and be 'predicted'
> 	as part of a larger, 'reservation' oriented system.
>
> 	Others use their 'watcher' to interrogate the third party
> 	license managers, to see how many lics are available, and
> 	modify the cpu allocations to keep that balanced.
>
> 	Honestly though, most folks with large nets just buy or rent
> 	the licenses they need so they can make use of the whole farm,

Hmm, not sure that's feasible for 1000+ node networks, when the packages are multiple thousands of dollars per seat. A site license would do it, but may not be cost effective either.

> 	and not have to juggle that stuff, because even when it's done
> 	right, there are race conditions with the license counting,
> 	unless you have some kind of reservation system embedded in
> 	the license system, or a wrapper that does this.
>
>> When the process completes, it tails the last line of that
>> file, and puts it into the per-frame notes field, so that the users see
>> what kind of memory footprint their job had.

> 	Yes, that is very useful info, and it's best to simply advertise
> 	it to the user, so they can submit their job with a correct
> 	'Ram' value based on their impression of the numbers.
>
> 	Often the renderers print those numbers in the log for you,
> 	so you can just grep them out; maya and mental ray both
> 	do this.
>
> 	Trouble is the format of these messages can change from one
> 	rev of the renderer to another. And, depending on the job,
> 	sometimes there are several of these messages per frame,
> 	such as when renders are batching, or worse, when complex jobs
> 	render multiple images per frame ('levels', 'passes', 'comps' etc)
> 	So it gets a little tricky to implement that in a way that
> 	works for all situations.

That's why having the queue monitor what your process is doing is really the only solution. Reports by the software are what *it* thinks, not what the OS thought.

> 	A few weeks ago I implemented a much lower latency 'rush -exitnotes'
> 	command which handles sending back 'per frame' messages to the
> 	jobserver in a reasonable manner by connecting to the 'render node'
> 	instead of the job server, setting things up so the note is passed
> 	back to the job server as part of the UDP message that delivers the
> 	exit code back to the server. It'll be in version 102.43.

I'll definitely migrate to that. My "exitnotes" also show cpu efficiency (I run my commands via /usr/bin/time), so that compositors can get a general sense of whether their jobs are cpu or i/o bound, i.e. low efficiency most likely indicates mostly waiting for disk. For a renderer, low efficiency could indicate heavy texture-map access or undue swapping.
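
A sketch of that efficiency calculation: run the command under '/usr/bin/time -p' (which reports real/user/sys on stderr) and compute (user+sys)/real. 'render_cmd' is a stand-in for the real command line:

#!/usr/bin/perl -w
# Sketch: cpu efficiency from /usr/bin/time -p output (stderr only).
use strict;

my $out = `/usr/bin/time -p render_cmd frame.0001 2>&1 1>/dev/null`;
my %t   = $out =~ /^(real|user|sys)\s+([\d.]+)/mg;
printf "cpu efficiency: %.0f%%\n", 100 * ($t{user} + $t{sys}) / $t{real}
    if $t{real};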


> 	I'd like to add the 'ram usage indicator' to the submit scripts
> 	as an option, once the 'rush -exitnotes' is fully released.
>
>> This is a pretty critical
>> feature, IMO, as you *really* want to avoid going into swap on a box!
>
> 	Yes, definitely.
>
> 	Gets tricky to detect swap though, as often when a box looks like
> 	it's swapping, it's actually just paging out old junk to make room
> 	in ram that it should have cleared out long ago.
>
> 	Best thing is to just know in advance how much ram the job will
> 	tend to need, and submit with that ram value set. (e.g. the 'Ram:'
> 	prompt in the submit forms)

Hence the need for frame notes with what happened!


> 	'rushtop' is handy for seeing if a job is using a lot of ram.
> 	Just render your job on a box, and watch the ram use as the
> 	render runs to get a feel for how nasty it is.. then submit
> 	with the 'Ram' field set accordingly.

Doesn't really work when all your renderfarm computers are 8-proc nodes.

> 	My goal was to avoid any features in rush that had too much
> 	of a 'fuzzy' aspect to it, such as polling ram use.

I can get that. What might be useful is having hooks for a bunch of these things, and users can install them if they feel they're needed. A user (me!) shouldn't really have to go and write a memory checking script on their own.


> 	Even kernel mailing lists argue endlessly on how free ram
> 	should be determined, and it often changes from release
> 	to release. I've had to tweak rushtop several times to take
> 	into account changes in the different OS's ram calculations.
>
> 	It's too bad that most of the OS's (esp unix!) don't let
> 	a parent program get back the memory use and usr/sys time

Yeah, I was floored by that when I found out just how bad memory accounting is! The structures are there, for Pete's sake!

> 	of the accumulated process tree. The counters are there
> 	in the structures for the accumulated times, but they're
> 	all zeroed out. All you can get back is the ram/cpu time
> 	of the immediate process reliably, and that's usually just
> 	the perl script, which is useless. The only way to get unix
> 	to show process hierarchy data that I've seen is to have
> 	system wide process accounting (acct(2)) turned on.
> 	And whenever that's on, the system bogs down because
> 	process accounting makes giant logs quickly. But it helps
> 	the OS tally accumulated process tree info internally.


>> Funny, I'm doing the ram-usage polling right now.  I simultaneously
>> launch a memory watcher script, which given a PID, finds all the child
>> PIDs and adds up their memory consumption, writes to a file in the
>> logdir.
>
> 	Trouble I've found with snapshotting the proc table (did it at DD)
> 	is that you run into a few real world problems, enough that it
> 	can often cause more trouble than it's worth.
>
> 	When polling the process hierarchy, you can end up with wild
> 	snapshots when processes fork, showing double memory use during
> 	that time. You can try to smooth those out as aberrant data,
> 	but some renders fork frequently, causing the data to sometimes
> 	appear valid, throwing the job into a stall.

My script has a ramp-down of polling frequency: for the first 10 seconds, it polls every 2 seconds, then every 5 seconds for another 30, eventually down to once a minute for the life of the process. This has worked pretty darned well, as it captures the fast frames decently. It does miss out on last-second memory surges (Shake tends to do that once in a while, it seems). But for the most part, the frame-to-frame correlation is pretty strong. Oddly enough, I haven't seen the data-doublings due to forks, maybe because the polling is infrequent.
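
The schedule itself is tiny; something along these lines, with the breakpoints and the sample/alive checks as stand-ins:

#!/usr/bin/perl -w
# Sketch: polling interval ramp-down (2s for 10s, 5s to 40s, then 60s).
use strict;

sub next_interval {
    my ($elapsed) = @_;
    return 2 if $elapsed < 10;
    return 5 if $elapsed < 40;
    return 60;
}

my $elapsed = 0;
while (process_alive()) {      # stand-in: is the render still running?
    sample_ram();              # stand-in: one /proc or ps(1) snapshot
    my $dt = next_interval($elapsed);
    sleep $dt;
    $elapsed += $dt;
}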


> 	Also, sometimes a single frame would go bananas on ram, causing
> 	the queue to think the job was going into a phase of high memory
> 	use. Or sometimes a scene will simply go from a black frame to
> 	a sudden high memory use, enough to swap. An automated mechanism
> 	that tries to use this wild data to handle scheduling almost always
> 	stalls the job, causing folks to simply turn off the feature;
> 	they'd rather have their render crash a few machines on the
> 	few frames that go bananas instead of having the job completely
> 	stall in the middle of the night.
>
> 	In rushtop I added an experimental 'paging activity' indicator
> 	(the 'orange bar') which watches for 'excessive' paging activity,
> 	and bumps the bar when that happens. This limit was determined
> 	empirically.. when the orange bar appears, chances are you can
> 	'feel' the slowness if you're on that box in the form of an
> 	unresponsive mouse, or similar.

That's great stuff.  It would be a useful bit of info on an exitnote.

-- Antoine



--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Wed, 06 Jun 2007 20:07:30 -0400
Msg# 1587
Antoine Durr wrote:
> Hmm, not sure that's feasible for 1000+ node networks, when the 
> packages are multiple thousands of dollars per seat.

	With deals that large, e.g. $100k ~ $500k and up,
	special deals are arranged with the big vendors.

> That's why having the queue monitor what your process is doing is 
> really the only solution.  Reports by the software are what *it* 
> thinks, not what the OS thought.

	Yeah, I wonder what's up with getrusage() being so poorly
	supported across all unix platforms.

	You'd think keeping count of ram use wouldn't be such
	a big deal.. we know the kernel does it for ps(1) reports and
	/proc/<pid>/stat (In DD's 'race', I used the latter)

> I'll definitely migrate to that.  My "exitnotes" also show cpu 
> efficiency (I run my commands via /usr/bin/time), so that compositors 
> can get a general sense of whether their jobs are cpu or i/o bound, 
> i.e. low efficiency most likely indicates mostly waiting for disk.  
> Renderer low efficiency could be many texture maps access or undue 
> swapping.

	Right.. it's been on my TODO list for a really long time..
	Laika finally got me off my ass on that one a few weeks ago
	(thanks John!) Turned out to be easy to add.

> I can get that.  What might be useful is having hooks for a bunch of 
> these things, and users can install them if they feel they're needed.  
> A user (me!) shouldn't really have to go and write a memory checking 
> script on their own.

	The 'hook' is that the render scripts are highly hackable.
	So you can, for instance, in perl do a fork() to start the
	renderer, and meanwhile monitor the fork()'ed PID's tree
	via /proc, and come up with the totals.

	I'd supply the process tree monitoring code if I thought
	it worked well, but as I've said I don't trust the code.

	My feeling about all this is that until the OS's support getrusage(),
	I won't bother. Too much hackery for a multiplatform system.

	Back in 2001 I thought 'well, they'll figure this out soon,
	how hard can it be', but here it is 2007 and Linux is still
	arguing over what probably amounts to a 20 line kernel patch:
	http://groups.google.com/group/fa.linux.kernel/tree/browse_frm/thread/9126646bb52e36ae/b2e5a16ae19dad02?rnum=1&hl=en&q=linux+getrusage&_done=%2Fgroup%2Ffa.linux.kernel%2Fbrowse_frm%2Fthread%2F9126646bb52e36ae%2F25a450f891fea844%3Flnk%3Dst%26q%3Dlinux%2Bgetrusage%26rnum%3D8%26hl%3Den%26#doc_b2e5a16ae19dad02

	I mean, even with frigging Alan Cox on the thread saying
	this looks OK. I agree 100% with the subject of that article. >;)

>> 	It's too bad that most of the OS's (esp unix!) doesn't let
>> 	a parent program get back the memory use and usr/sys time
> 
> Yeah, I was floored by that when I found out just how bad memory 
> accounting is!  The structures are there, for Pete's sake!

	Yeah, and not even that.. apparently a process still
	can't even get memory use for *itself*, let alone children!!

	I just did a test now on fedora3; a giant malloc() and memset(),
	and getrusage(RUSAGE_SELF) has all zeros for memory; pathetic!
	

            ru_utime:0/1347     user time used (secs/usecs)
            ru_stime:0/4963     system time used (secs/usecs)
           ru_maxrss:0          <--
            ru_ixrss:0          <-- integral shared text memory size
            ru_idrss:0          <-- integral unshared data size
            ru_isrss:0          <-- integral unshared stack size

	/Same/ results on OSX 10.4.9, too. So both linux and OSX
	have brain dead getrusage(2).

	To show I'm not crazy, compiled and ran the same code
	on my FreeBSD webserver, and it /worked/:

            ru_utime:0/3989     user time used (secs/usecs)
            ru_stime:0/0        system time used (secs/usecs)
           ru_maxrss:604        <--
            ru_ixrss:4          <-- integral shared text memory size
            ru_idrss:988        <-- integral unshared data size
            ru_isrss:128        <-- integral unshared stack size

	(shrug)

> My script has a ramp-down of polling frequency: for the first 10 
> seconds, it polls every 2 seconds, then every 5 seconds for another 30, 
> eventually down to once a minute for the life of the process.  This has 
> worked pretty darned well, as it captures the fast frames decently.  It 
> does miss out on last-second memory surges (Shake tends to do that once 
> in a while, it seems).

	Often those last surges are due to 'exit()' being called
	instead of _exit().

	This is something new due to C++; when you call exit(), all
	the destructors are called, and so, often this creates a lot
	of activity pulling memory pages in so they can be free()d.

	Just calling _exit() bypasses all the destructors, so it
	just frees everything, and the process exits instantly.

	One runs into this with vfork() too.

> But for the most part, the frame-to-frame 
> correlation is pretty strong.  Oddly enough, I haven't seen the 
> data-doublings due to forks, maybe because the polling is infrequent.

	Ya, most likely. Also, possibly shake isn't forking off child
	processes, and likely hasn't got a very large memory footprint.

	Tell me how it goes with ram heavy renders that fork children..
	At DD I was backing off the sample rate too, but then you'd
	get that double memory use fork() snapshot towards the end
	when the display driver would fire. To 'solve it', I tried to
	increase my sample rate a bit when I saw something that looked
	like a double ramuse to see if it was an aberration. That sorta
	worked, but there were other reasons it would get wild values,
	so I just didn't bother with it in Rush, waiting for the OS
	to deliver the goods.

-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Tue, 05 Jun 2007 22:47:01 -0400
Msg# 1578
Antoine Durr wrote:
> So I whipped up something this afternoon.   It's pretty simplistic, and 
> I'm curious as to where it might fall flat on its nose.

	BTW, if all you're looking to do is have jobs just be FIFO,
	you can set up the submit scripts so that each time a job
	is submitted, it submits with a successively lower priority.

	I've helped some folks do this by keeping a running priority
	count in a little text file on the network drive, and have the
	submit scripts grab the value from the file, decrement its contents,
	and use that as the job's submit priority.

	Each day at 5am, a crontab resets the value in the file to 999, eg:

echo 999 > /your/server/rushfiles/next-priority.txt

	..this way each morning when folks get in, the priority for
	new jobs starts at the top again.

	This assumes jobs aren't still running from the day before,
	which is usually the case. And even if they were, they'd simply
	move aside to allow the new jobs of the day to run, and 'last
	night's jobs' would get cpus when there are idle procs.

	This would avoid the need for a "watcher" script.
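
	A sketch of the submit-side counter: take the next priority from
	the shared file and decrement it for the following job, with
	flock() so two submits don't race. (flock over NFS can be
	unreliable; a lock file or directory on the server may be safer.)

#!/usr/bin/perl -w
# Sketch: grab-and-decrement the shared next-priority file.
use strict;
use Fcntl qw(:flock O_RDWR);

sub next_priority {
    my $file = '/your/server/rushfiles/next-priority.txt';
    sysopen(my $fh, $file, O_RDWR) or die "$file: $!";
    flock($fh, LOCK_EX)            or die "flock $file: $!";
    my $pri = <$fh>;
    chomp($pri = defined $pri ? $pri : 999);
    seek($fh, 0, 0);
    truncate($fh, 0);
    printf $fh "%d\n", $pri - 1;   # next job submits one lower
    close($fh);                    # also releases the lock
    return $pri;
}

print "submitting at priority ", next_priority(), "\n";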

-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Fri, 01 Jun 2007 20:11:39 -0400
Msg# 1570
Antoine Durr wrote:
> I'm under the impression that to balance cpus per user requires an 
> external script that dynamically resets priorities based on who should 
> have more or fewer cpus than they already have.  Does anyone have an 
> example of such a script?

	I'll see if I can whip up a public example for you
	and the group here.

	Perl or python?

-- 
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Antoine Durr <antoine@(email surpressed)>
Subject: Re: cpu balancing script
   Date: Fri, 01 Jun 2007 20:24:32 -0400
Msg# 1571
On 2007-06-01 17:11:39 -0700, Greg Ercolano <erco@(email surpressed)> said:

> Antoine Durr wrote:
>> I'm under the impression that to balance cpus per user requires an
>> external script that dynamically resets priorities based on who should
>> have more or fewer cpus than they already have.  Does anyone have an
>> example of such a script?
>
> 	I'll see if I can whip up a public example for you
> 	and the group here.
>
> 	Perl or python?

Perl, please.

Thanks,

-- Antoine

--
Floq FX Inc.
10839 Washington Blvd.
Culver City, CA 90232
310/430-2473