From: Luke Cole <luke@(email suppressed).au>
Subject: rush slowness/timeouts
   Date: Mon, 02 Jan 2006 20:11:50 -0500
Msg# 1160
Hi rush.general,

We've been noticing rush (or at least what appears to be rush) slowing down as we have increased the number of connected hosts and running jobs. What we see is that it can take a very long time for rush to contact some of the hosts, and as a result, applications like irush will report hosts as being down even when that is not the case - they are up and happily rendering away. For example:

manta:~ lrcole$ time rush -ping loaner5
loaner5: RUSHD 102.42 PID=776 Boot=12/29/05,11:29:52 Online, 0 jobs, 1 procs, 544 tasks, dlog=-, nfd=4

real    0m8.753s
user    0m0.041s
sys     0m0.017s

And another:

manta:~ lrcole$ time rush -ping loaner16
  loaner16: read error: 40 second timeout from loaner16

real    0m40.125s
user    0m0.041s
sys     0m0.018s
manta:~ lrcole$

We presently have 250 running jobs, and I think nearly 100 machines on the farm (some of these are workstations, so not all are rendering all the time).

I imagine that these apps are just timing out while trying to query some of the machines, and as a result just assume that they are unavailable. Has anyone else experienced problems like this before, and does anyone have suggestions on how we could address the issue?

Our rush license server is often very heavily loaded up (it also serves files) - could this be a factor?

Thank you,
---
Luke Cole
Systems Administrator / TD

FUEL International
65 King St., Newtown, Sydney NSW, Australia 2042



   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: rush slowness/timeouts
   Date: Mon, 02 Jan 2006 20:27:31 -0500
Msg# 1161
Hi Luke,

Need some more info: regarding the unresponsive machines 'loaner5'
and 'loaner16', do you think rush is slow because the rush daemon is busy,
or because the machine is thrashing due to rendering?

It's important to determine if the rushd is busy, or if the machine
is busy due to rendering.

When a machine is not being responsive to 'rush -ping', try ssh/rsh'ing
over to that machine and look at 'top' and/or the output of eg. 'vmstat 3'.
Is rushd using up all the cpu, or is it a render? Is the machine swapping due
to unavailable ram? Does rsh/ssh not even respond when trying to connect
to the machine? If so, the renders may be using too much in the way of
ram resources, swapping the machine to death.
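
For example (a rough sketch; assuming you can ssh to the box, and using the host from your example):

	ssh loaner16
	top         # is rushd at the top of the list, or is it a render?
	vmstat 3    # linux: watch the 'si'/'so' columns for swap activity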

Or, possibly rush is being kept busy; what is the output of
'rush -tasklist loaner16'? If the list is huge, possibly users are submitting
with too many +any specifications. For instance, if there are 250 jobs each asking for:

	+any=3@200 +any=5@150 +any=10@100 +any=20@50

..that will make four entries on each host, multiplying the complexity
to rush by 4 (4 specs per job * 250 jobs = 1000 active tasks)

..consider instead using just a two-tier submission:

	+any=3@200 +any=20@50
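
..that cuts it to two entries per host (2 specs per job * 250 jobs = 500 active tasks).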

> Our rush license server is often very heavily loaded up (it also serves files) - could this be a factor?

Not likely, as the rushd daemons only communicate with the license
server on boot.

Unless, that is, your license server is also acting as a job server for
jobs (ie. submitting jobs to the license server, such that jobs have jobids
with the license server's hostname in them)



--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)

   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: rush slowness/timeouts
   Date: Mon, 02 Jan 2006 20:34:55 -0500
Msg# 1162
Greg Ercolano wrote:
> over to that machine and look at 'top' and/or the output of eg. 'vmstat 3'

	BTW, I should add the 'vmstat 3' command is linux specific.
	The equiv on a Mac would be, IIRC, "vm_stat 3".

	These commands are useful if you suspect renders are taking up too much
	cpu by thrashing the box with eg. swapping/paging activity.

	If, however, rushd is busy (ie. is taking up >90% of the cpu, and showing
	at the top of the top(1) report), then don't bother using vmstat/vm_stat.
	In that case let us know, and we can debug from there to determine why
	rushd is so busy. (Maybe jobs are invoking 'rush' commands during rendering
	that are beating up the server. A common problem is when custom render scripts
	are used that invoke commands like 'rush -lf' or 'rush -notes' commands
	before or after every rendered frame, invoking excessive TCP connections
	back to the job server, causing the job servers to appear 'slow')
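
	(For illustration only; the render command and its arguments below are made up.
	A hypothetical per-frame wrapper along these lines would open two extra TCP
	connections back to the job server for every single frame:

		rush -notes ...             # query back to the job server before the frame
		Render -s $1 -e $1 scene.mb # render one frame
		rush -lf ...                # another query back to the job server afterwards

	..the fix is usually to avoid making such per-frame calls.)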

	If rushd is busy, do you have job serving distributed to several machines (good),
	or are all jobs being submitted to a single machine (bad)?

--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)

   From: Luke Cole <luke@(email suppressed).au>
Subject: Re: rush slowness/timeouts
   Date: Mon, 02 Jan 2006 20:54:46 -0500
Msg# 1164
Hi Greg,

>> over to that machine and look at 'top' and/or the output of eg. 'vmstat 3'
>
>     BTW, I should add the 'vmstat 3' command is linux specific.
>     The equiv on a Mac would be, IIRC, "vm_stat 3".
>
>     These commands are useful if you suspect renders are taking up too much
>     cpu by thrashing the box with eg. swapping/paging activity.
>
>     If, however, rushd is busy (ie. is taking up >90% of the cpu, and showing
>     at the top of the top(1) report), then don't bother using vmstat/vm_stat.
>     In that case let us know, and we can debug from there to determine why
>     rushd is so busy. (Maybe jobs are invoking 'rush' commands during rendering
>     that are beating up the server. A common problem is when custom render scripts
>     are used that invoke commands like 'rush -lf' or 'rush -notes' commands
>     before or after every rendered frame, invoking excessive TCP connections
>     back to the job server, causing the job servers to appear 'slow')
>
>     If rushd is busy, do you have job serving distributed to several machines (good),
>     or are all jobs being submitted to a single machine (bad)?

We have job serving distributed between 3 or 4 machines; the 250 jobs are split roughly equally between the job hosts. I'll check a few of the other machines - it looks like the hosts are being hammered by the renders rather than by rush.

Silly of me to think rush could be the cause!  :)

Thanks for your excellent assistance,

---
Luke Cole
Systems Administrator / TD

FUEL International
65 King St., Newtown, Sydney NSW, Australia 2042



   From: Luke Cole <luke@(email suppressed).au>
Subject: Re: rush slowness/timeouts
   Date: Mon, 02 Jan 2006 20:50:17 -0500
Msg# 1163
Hi Greg,

> Need some more info: regarding the unresponsive machines 'loaner5'
> and 'loaner16', do you think rush is slow because the rush daemon is busy,
> or because the machine is thrashing due to rendering?
>
> It's important to determine if the rushd is busy, or if the machine
> is busy due to rendering.

Ah yes. It would be getting hammered due to rendering - of course - if the machine is maxed out rendering, it will be slow to respond to rush! :)

I checked loaner16 - it is reporting 99% of the CPU as being utilised by mayabatch.exe, so it's definitely running slowly because of that.

> When a machine is not being responsive to 'rush -ping', try ssh/rsh'ing
> over to that machine and look at 'top' and/or the output of eg. 'vmstat 3'.
> Is rushd using up all the cpu, or is it a render? Is the machine swapping due
> to unavailable ram? Does rsh/ssh not even respond when trying to connect
> to the machine? If so, the renders may be using too much in the way of
> ram resources, swapping the machine to death.

I didn't think to check machine performance as it hasn't really been a problem before - most of our dedicated render machines are dual-cpu hosts, though, which could be why they are a bit more responsive (loaner16 is a rental single-cpu host). Thanks for pointing that out - I hadn't thought to check load on the troublesome hosts themselves.

> Or, possibly rush is being kept busy; what is the output of
> 'rush -tasklist loaner16'? If the list is huge, possibly users are submitting
> with too many +any specifications. For instance, if there are 250 jobs each asking for:
>
>     +any=3@200 +any=5@150 +any=10@100 +any=20@50
>
> ..that will make four entries on each host, multiplying the complexity
> to rush by 4 (4 specs per job * 250 jobs = 1000 active tasks)
>
> ..consider instead using just a two-tier submission:
>
>     +any=3@200 +any=20@50

When I run the rush -tasklist command on loaner16 I get a list with almost 420 entries. Does that sound reasonable?

>> Our rush license server is often very heavily loaded up (it also serves files) - could this be a factor?


> Not likely, as the rushd daemons only communicate with the license
> server on boot.
>
> Unless, that is, your license server is also acting as a job server for
> jobs (ie. submitting jobs to the license server, such that jobs have jobids
> with the license server's hostname in them)

That should be OK then - the server is a license host only - we host the jobs on some other machines.

Thank you for your help!

---
Luke Cole
Systems Administrator / TD

FUEL International
65 King St., Newtown, Sydney NSW, Australia 2042



   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: rush slowness/timeouts
   Date: Mon, 02 Jan 2006 21:41:22 -0500
Msg# 1165
Luke Cole wrote:
> When I run the rush -tasklist command on loaner16 I get a list with almost 420 entries. Does that sound reasonable?

	Yes, that sounds normal for a dual proc machine, or for two-tier
	submissions.

> I checked loaner16 - it is reporting 99% of the CPU as being
> utilised by mayabatch.exe, so it's definitely running slowly because of that.

	Sounds like you should dig a little deeper, since it's normal
	for renders to use 99% of the cpu; rushd shouldn't act unresponsive
	for 40 seconds unless something very extreme is going on.

	Is the rushd.log file complaining on these machines? I'd expect
	to see errors like 'connection reset by peer' due to the unresponsiveness,
	but I'm wondering if there are other errors that might indicate rushd
	is stuck doing eg. bad hostname lookups, or some such. For instance,
	does running 'rush -lah' on the slow-to-respond host show ???'s for
	any entries in the report? That might indicate bad or unresponsive DNS.
	Does the 'rush -lah' report take a long time to come up? (might be
	symptomatic of a slow to respond name lookup system, or if OSX,
	possibly you have the .local Rendezvous/Bonjour Multicast DNS disease,
	in which case make sure you have HOSTNAME=<name_of_host> and not HOSTNAME=-AUTOMATIC-
	in the /etc/hostconfig)
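
	(A quick way to check that on an OSX box, for example:

		grep HOSTNAME /etc/hostconfig

	..and make sure it reports HOSTNAME=<name_of_host> rather than HOSTNAME=-AUTOMATIC-)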

	rushd being unresponsive due to rendering sounds like it /could/ be
	the cause IF the machine's resources are being taxed by the render
	(ie. the renders are using too much ram, or are starting too many
	threads for the number of cpus the machine has)

	If maya is involved and your machines are dual procs (ie. each box is
	configured with '2' for CPUS in the rush/etc/hosts file), rush will try to
	start two invocations of maya per box, in which case you better make sure
	the maya renders have the '-n 1' flag set, to prevent /each/ instance of maya
	from trying to use /both/ processors..! (causing the renders to step on each
	other, and the rest of the machine, including rushd)

	Or, make sure the ram isn't being overused, causing the box to thrash,
	as that will steal cpu from everything, including the renders.

	It's normal for renders to use 99% of the cpu; the unix scheduler
	should still be able to yield cpu to rushd under those conditions.
	The only situations I can think of where rushd wouldn't be getting
	enough cpu would be:

		a) The renders are running at a higher system priority (lower niceness)
		   than rushd

		b) The kernel is using some kind of 'decaying' scheduling, giving rushd
		   a lower priority than it should. Possibly you can fix this by adjusting
		   rushd's system priority with 'renice'

		c) The machine is swapping due to the renders using more ram than they
		   should be, causing other processes (like rushd) to swap out

	Note that Rush's priority values (+any@200) have nothing to do with what I'm
	calling the system priority (PRI column in 'ps -lax'). The rush priority is
	used by rush only to determine which user's render should be started next,
	and has nothing to do with the priority of processes once they're running.

	The only value in rush that affects the system priority of running processes
	is the 'nice' value the job is submitted as, which is the 'niceness' the renders
	will run as.

	Regarding a/b, possibly you might want to experiment with adjusting the
	niceness level of the renders down, so that they don't bog the box down.
	(Won't help if your problem is 'c')
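
	For example, a rough sketch (the PIDs here are hypothetical, and raising a
	priority usually requires root):

		ps -lax                   # compare the PRI/NI columns for rushd vs. the renders
		renice -n -5 -p 12345     # hypothetical: raise rushd's priority (12345 = rushd's PID)
		renice -n 10 -p 23456     # hypothetical: lower a render's priority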

	If the problem is 'c', then you might have to get more ram for your boxes,
	or tell rush not to render more than one render per box, so that only
	one render runs per box. (Or if it's one user's particular job using all
	the ram, have them try to optimize the job so that it doesn't use so much ram)

--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)

   From: Luke Cole <luke@(email suppressed).au>
Subject: Re: rush slowness/timeouts
   Date: Mon, 02 Jan 2006 22:39:16 -0500
Msg# 1166
Hi Greg,

OK - I will investigate further and get back to you with more details - in our case the machines are Windows XP hosts, so I'm not sure how much control over process nice-ness that allows.

Thanks,

Luke

---
Luke Cole
Systems Administrator / TD

FUEL International
65 King St., Newtown, Sydney NSW, Australia 2042



   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: rush slowness/timeouts
   Date: Mon, 02 Jan 2006 23:24:34 -0800
Msg# 1167
Luke Cole wrote:
> OK - I will investigate further and get back to you with more details - in our case the machines are Windows XP hosts, so I'm not sure how much control over process nice-ness that allows.

	In that case, in the task manager's "Processes" tab, enable
	"View | Select Columns | Base Priority"
	..that lets you view the priority, and you can right click a process
	and use "Set Priority" to change it.

	Also, you can use the task manager's graph under the
	"Performance" tab to see if the renders are taking up
	all the ram.