From: Luke Cole <luke@(email suppressed).au>
Subject: rush slowness/timeouts
Date: Mon, 02 Jan 2006 20:11:50 -0500
Msg# 1160
Hi rush.general,

We've been noticing rush (or what appears to be rush) slowing down as we've increased the number of connected hosts and running jobs. It can take a very long time for rush to contact some of the hosts, and as a result applications like irush will report hosts as being down even when that isn't the case -- they are up and happily rendering away. For example:

    manta:~ lrcole$ time rush -ping loaner5
    loaner5: RUSHD 102.42 PID=776 Boot=12/29/05,11:29:52 Online, 0 jobs, 1 procs, 544 tasks, dlog=-, nfd=4

    real    0m8.753s
    user    0m0.041s
    sys     0m0.017s

And another:

    manta:~ lrcole$ time rush -ping loaner16
    loaner16: read error: 40 second timeout from loaner16

    real    0m40.125s
    user    0m0.041s
    sys     0m0.018s

We presently have 250 running jobs, and I think nearly 100 machines on the farm (some of these are workstations, so not all are rendering all the time). I imagine these apps are just timing out while trying to query some of the machines, and as a result assume they are unavailable.

Has anyone else experienced problems like this, and do you have suggestions on how we could address the issue? Our rush license server is often very heavily loaded (it also serves files) -- could this be a factor?

Thank you,

---
Luke Cole
Systems Administrator / TD
FUEL International
65 King St., Newtown, Sydney NSW, Australia 2042
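[A minimal sketch of surveying the farm this way: time a ping against each host to map out which ones are slow. 'rush -ping' is the command from this thread; the host names are the thread's examples, and PING is overridable so the loop can be dry-run on a box without rush installed.]

```shell
#!/bin/sh
# Time a ping against each farm host to find the slow responders.
# PING defaults to the thread's "rush -ping"; override for a dry run.
PING=${PING:-"rush -ping"}
RESULTS=""
for host in loaner5 loaner16; do   # substitute your real host list
    start=$(date +%s)
    $PING "$host" >/dev/null 2>&1
    RESULTS="$RESULTS$host=$(( $(date +%s) - start ))s "
done
echo "$RESULTS"
```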
From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: rush slowness/timeouts
Date: Mon, 02 Jan 2006 20:27:31 -0500
Msg# 1161
Hi Luke,

Need some more info: regarding the unresponsive machines 'loaner5' and 'loaner16', do you think rush is slow because the rush daemon is busy, or because the machine is thrashing due to rendering? It's important to determine which.

When a machine is not being responsive to 'rush -ping', try ssh/rsh'ing over to that machine and look at 'top' and/or the output of eg. 'vmstat 3'. Is rushd using up all the cpu, or is a render? Is the machine swapping due to unavailable ram? Does rsh/ssh not even respond when trying to connect to the machine? If so, the renders may be using too much ram, swapping the machine to death.

Or, possibly rush is being kept busy; what is the output of 'rush -tasklist loaner16'? If the list is huge, possibly users are submitting with too many +any specifications. For instance, if there are 250 jobs each asking for:

    +any=3@200 +any=5@150 +any=10@100 +any=20@50

..that will make four entries on each host, multiplying the complexity to rush by 4 (4 specs per job * 250 jobs = 1000 active tasks). Consider instead using just a two-tier submission:

    +any=3@200 +any=20@50

> Our rush license server is often very heavily loaded up (it also
> serves files) - could this be a factor?

Not likely, as the rushd daemons only communicate with the license server on boot. Unless, that is, your license server is also acting as a job server (ie. jobs are being submitted to the license server, such that jobs have jobids with the license server's hostname in them).

--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed) Cel: (Tel# suppressed) Fax: (Tel# suppressed)
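[The bookkeeping arithmetic above, spelled out as a sketch: each +any tier in a job's submit spec adds one entry per host, so total entries scale as tiers x jobs.]

```shell
#!/bin/sh
# Per-host task entries scale with the number of +any tiers per job.
JOBS=250
FOUR_TIER=$(( 4 * JOBS ))   # +any=3@200 +any=5@150 +any=10@100 +any=20@50
TWO_TIER=$((  2 * JOBS ))   # +any=3@200 +any=20@50
echo "four-tier entries: $FOUR_TIER"
echo "two-tier entries:  $TWO_TIER"
```

Halving the tiers halves the bookkeeping rush has to churn through on every scheduling pass.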
From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: rush slowness/timeouts
Date: Mon, 02 Jan 2006 20:34:55 -0500
Msg# 1162
Greg Ercolano wrote:
> over to that machine and look at 'top' and/or the output of eg. 'vmstat 3'

BTW, I should add the 'vmstat 3' command is linux specific. The equivalent on a Mac would be, IIRC, "vm_stat 3". These commands are useful if you suspect renders are taking up too much cpu by thrashing the box with eg. swapping/paging activity.

If, however, rushd is busy (ie. is taking up >90% of the cpu, and showing at the top of the top(1) report), then don't bother using vmstat/vm_stat. In that case let us know, and we can debug from there to determine why rushd is so busy. (Maybe jobs are invoking 'rush' commands during rendering that are beating up the server. A common problem is custom render scripts that invoke commands like 'rush -lf' or 'rush -notes' before or after every rendered frame, making excessive TCP connections back to the job server and causing the job servers to appear 'slow'.)

If rushd is busy, do you have job serving distributed to several machines (good), or are all jobs being submitted to a single machine (bad)?

--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed) Cel: (Tel# suppressed) Fax: (Tel# suppressed)
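[A quick triage sketch for a slow host, following the top/vmstat advice above: list the top CPU consumers so you can see at a glance whether rushd or a render (eg. mayabatch) is pegging the box. The ps format keywords used here work on both Linux and Mac OS X.]

```shell
#!/bin/sh
# Show the five heaviest processes by %CPU (with resident memory in KB),
# so "is rushd busy, or is a render?" can be answered at a glance.
TOP5=$(ps axo pcpu,rss,comm | sort -rn | head -5)
echo "$TOP5"
```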
From: Luke Cole <luke@(email suppressed).au>
Subject: Re: rush slowness/timeouts
Date: Mon, 02 Jan 2006 20:54:46 -0500
Msg# 1164
Hi Greg,

> If rushd is busy, do you have job serving distributed to several
> machines (good), or are all jobs being submitted to a single machine (bad)?

We have job serving distributed between 3 or 4 machines; the 250 jobs are split roughly equally between the job hosts.

I'll check a few of the other machines -- it looks like the hosts are being hammered by the renders rather than by rush. Silly of me to think rush could be the cause! :)

Thanks for your excellent assistance,

---
Luke Cole
Systems Administrator / TD
FUEL International
65 King St., Newtown, Sydney NSW, Australia 2042
From: Luke Cole <luke@(email suppressed).au>
Subject: Re: rush slowness/timeouts
Date: Mon, 02 Jan 2006 20:50:17 -0500
Msg# 1163
Hi Greg,

> Need some more info: regarding the unresponsive machines 'loaner5'
> and 'loaner16', do you think rush is slow because the rush daemon is
> busy, or because the machine is thrashing due to rendering?

Ah yes. It would be getting hammered due to rendering -- of course -- if the machine is maxed out rendering, it will be slow to respond to rush! :) I checked loaner16: it is reporting 99% of the CPU being utilised by mayabatch.exe, so it's definitely running slowly because of that.

> When a machine is not being responsive to 'rush -ping', try ssh/rsh'ing
> over to that machine and look at 'top' and/or the output of eg. 'vmstat 3'.

I didn't think to check machine performance, as it hasn't really been a problem before. Most of our dedicated render machines are dual-cpu hosts, though, which could be why they are a bit more responsive (loaner16 is a rental single-cpu host). Thanks for pointing that out -- I didn't think to check load on the troublesome hosts themselves.

> Or, possibly rush is being kept busy; what is the output of
> 'rush -tasklist loaner16'?

When I run the rush -tasklist command on loaner16 I get a list with almost 420 entries. Does that sound reasonable?

> > Our rush license server is often very heavily loaded up (it also
> > serves files) - could this be a factor?
>
> Not likely, as the rushd daemons only communicate with the license
> server on boot.

That should be OK then -- the server is a license host only; we host the jobs on some other machines.

Thank you for your help!

---
Luke Cole
Systems Administrator / TD
FUEL International
65 King St., Newtown, Sydney NSW, Australia 2042
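[A sketch of how the entry count above can be pulled out for comparison against the tiers-times-jobs arithmetic. 'rush -tasklist' is the command from this thread; LIST_CMD is overridable so the pipeline can be dry-run on a box without rush installed.]

```shell
#!/bin/sh
# Count per-host task entries; compare against (tiers * jobs) expectations.
LIST_CMD=${LIST_CMD:-"rush -tasklist"}
COUNT=$($LIST_CMD loaner16 2>/dev/null | wc -l | tr -d ' ')
echo "task entries on loaner16: $COUNT"
```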
From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: rush slowness/timeouts
Date: Mon, 02 Jan 2006 21:41:22 -0500
Msg# 1165
Luke Cole wrote:
> When I run the rush -tasklist command on loaner16 I get a list with
> almost 420 entries. Does that sound reasonable?

Yes, that sounds normal for a dual proc machine, or for two-tier submissions.

> I checked loaner16: it is reporting 99% of the CPU being utilised by
> mayabatch.exe, so it's definitely running slowly because of that.

Sounds like you should dig a little deeper, since it's normal for renders to use 99% of the cpu; rushd shouldn't act unresponsive for 40 seconds unless something very extreme is going on.

Is the rushd.log file complaining on these machines? I'd expect to see errors like 'connection reset by peer' due to the unresponsiveness, but I'm wondering if there are other errors that might indicate rushd is stuck doing eg. bad hostname lookups, or some such.

For instance, does running 'rush -lah' on the slow-to-respond host show ???'s for any entries in the report? That might indicate bad or unresponsive DNS. Does the 'rush -lah' report take a long time to come up? (That might be symptomatic of a slow-to-respond name lookup system; or, if OSX, possibly you have the .local Rendezvous/Bonjour Multicast DNS disease, in which case make sure you have HOSTNAME=<name_of_host> and not HOSTNAME=-AUTOMATIC- in /etc/hostconfig.)

rushd being unresponsive due to rendering sounds like it /could/ be the cause IF the machine's resources are being taxed by the render (ie. the renders are using too much ram, or are starting too many threads for the number of cpus the machine has).

If maya is involved: when your machines are dual procs and each box is configured with '2' for CPUS in the rush/etc/hosts file, rush will try to start two invocations of maya per box. In that case you'd better make sure the maya renders have the '-n 1' flag set, to prevent /each/ instance of maya from trying to use /both/ processors (causing the renders to step on each other, and the rest of the machine, including rushd).

Also, make sure ram isn't being overused, causing the box to thrash, as that will steal cpu from everything, including the renders.

It's normal for renders to use 99% of the cpu; the unix scheduler should still be able to yield cpu to rushd under those conditions. The only situations I can think of where rushd wouldn't be getting enough cpu would be:

a) The renders are running at a higher system priority (lower niceness) than rushd

b) The kernel is using some kind of 'decaying' scheduling, giving rushd a lower priority than it should. Possibly you can fix this by adjusting rushd's system priority with 'renice'

c) The machine is swapping due to the renders using more ram than they should, causing other processes (like rushd) to swap out

Note that Rush's priority values (+any@200) have nothing to do with what I'm calling the system priority (the PRI column in 'ps -lax'). The rush priority is used by rush only to determine which user's render should be started next, and has nothing to do with the priority of processes once they're running. The only value in rush that affects the system priority of running processes is the 'nice' value the job is submitted with, which is the 'niceness' the renders will run as.

Regarding a/b, you might want to experiment with adjusting the niceness level of the renders down, so that they don't bog the box down. (Won't help if your problem is 'c'.)

If the problem is 'c', then you might have to get more ram for your boxes, or tell rush not to run more than one render per box. (Or, if it's one particular user's job using all the ram, have them try to optimize the job so that it doesn't use so much ram.)

--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed) Cel: (Tel# suppressed) Fax: (Tel# suppressed)
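[A sketch of options (a)/(b) above: adjusting scheduling priority with renice. Raising rushd's priority (a negative nice value) needs root, so this demo renices a throwaway process instead; on a real host you'd target rushd's PID (found via eg. pgrep rushd).]

```shell
#!/bin/sh
# Demonstrate renice on a disposable background process.
sleep 30 &
PID=$!
renice 10 -p "$PID" >/dev/null      # make the process "nicer" (lower priority)
NICE=$(ps -o nice= -p "$PID" | tr -d ' ')
echo "demo process niceness: $NICE"
kill "$PID"
```

The same 'ps -o nice=' check is a quick way to confirm what niceness the renders and rushd are actually running at.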
From: Luke Cole <luke@(email suppressed).au>
Subject: Re: rush slowness/timeouts
Date: Mon, 02 Jan 2006 22:39:16 -0500
Msg# 1166
Hi Greg,

OK -- I will investigate further and get back to you with more details. In our case the machines are Windows XP hosts, so I'm not sure how much control over process niceness they will allow.

Thanks,
Luke

On 03/01/2006, at 1:41 PM, Greg Ercolano wrote:
> [posted to rush.general]
> [...]

---
Luke Cole
Systems Administrator / TD
FUEL International
65 King St., Newtown, Sydney NSW, Australia 2042
From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: rush slowness/timeouts
Date: Mon, 02 Jan 2006 23:24:34 -0800
Msg# 1167
Luke Cole wrote:
> OK - I will investigate further and get back to you with more details -
> in our case the machines are Windows XP hosts, so I'm not sure how much
> control over process nice-ness that will allow.

In that case, in the Task Manager's "Processes" tab, enabling:

    View | Select Columns | Base Priority

..lets you view each process's priority, and right-clicking a process and choosing "Set Priority" lets you change it. Also, you can use the graph under the Task Manager's "Performance" tab to see if the renders are taking up all the ram.