From: Ron Griswold <ron.griswold@(email surpressed)>
Subject: cpu.acct log
Date: Fri, 19 Jan 2007 19:03:06 -0500
Msg# 1456
Hi,

Long story short, we lost most of our log files from our previous show, making any stats we calculate a little inaccurate. I ran across the cpu.acct file using the 'rush -catlog cpu.acct host' command. It has info recorded dating back to before I started working here! Woo hoo!

But... I think I'm missing something. According to the docs, the 9th column in the 'p' report is the "#Secs Wall Clock Time". Both the "system time" and "user time" are 0. I'm assuming (perhaps naively) that the wall clock time is the elapsed time of the render script's execution, which is exactly what I need. The values, however, aren't even remotely close to what I would expect: for a render that took around 2h, the time is recorded as 79s.

Is there something wrong, or am I just not getting the meaning of this value?

Thanks!

Ron Griswold
Programmer/TD
Riot Pictures
From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu.acct log
Date: Sat, 20 Jan 2007 04:13:45 -0500
Msg# 1457
Ron Griswold wrote:
> I ran across the cpu.acct file using the 'rush -catlog cpu.acct host'
> command. It has info recorded dating back before I started working here!
> Woo hoo!

Yes, that's the place to look, and yes, you want the wall clock time (which is the field that immediately follows the priority; the priority immediately follows the hostname).

> I think I'm missing something. According to the docs the 9th column in
> the 'p' report is the "#Secs Wall Clock Time".

Correct; that value is in seconds, and it is the number you want.

> Both the "system time" and "user time" are 0.

Yes, they are usually zero, because those are the system/user times for the immediate child (usually the perl script itself), which has little or no impact on the cpu. AFAIK, to this day even modern OSes still do not provide accumulated system/user time for children (e.g. the render processes themselves), so rush can't display them either. You might ask why rush doesn't poll the process tree and try to accumulate the user/system times itself; I tried, and I can tell you polling sucks and is prone to error. Only the OS can do these calculations properly, and from what I've seen, they do not.

> The values however aren't even remotely close to what I would expect
> them to be. For a render that took around 2h, the time is recorded
> as 79s.

Hmm, unless there's a bug I don't know about, you should be seeing a value of ~7200 in that field for a 2h render. Try the test below to check it. What platform were you on for the cpu.acct files you were looking at?

> Is there something wrong, or am I just not getting the meaning of this
> value?

I'm certainly hoping you haven't found a bug, and that you're just not looking at the right data somehow. If you have proof of a bug, show a few lines of excerpt from the cpu.acct log, and then show what you know to be the matching 'rush -lf' report for those entries to demonstrate the discrepancy; I'll be able to tell. Make sure you're not looking at requeued frames.
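To illustrate the column layout described above, here's a minimal sketch (not part of rush itself) of pulling the wall-clock seconds out of a cpu.acct 'p' record. It assumes the records are whitespace-delimited with the columns shown in the transcript below (p, start, jobid, title, owner, frame, hostname, pri, wall, sys, usr, exit, pid); job titles containing spaces would break this naive split, so treat it as a sketch only.

```python
def wall_seconds(line):
    """Return the wall-clock seconds from a cpu.acct 'p' record, else None.

    Assumes whitespace-delimited columns:
    p start jobid title owner frame hostname pri wall sys usr exit pid
    """
    fields = line.split()
    if len(fields) >= 13 and fields[0] == "p":
        return int(fields[8])   # 9th column: "#Secs Wall Clock Time"
    return None

# Example record taken from the transcript below:
record = "p 1169282036 ontario.8 foo erco 0001 ontario 1 37 0 0 0 11011"
print(wall_seconds(record))  # -> 37
```

For a 2h render you'd expect this function to return something near 7200, which is the check being discussed.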
Here's a quick test I just ran (linux) showing how to check the consistency between cpu.acct and 'rush -lf':

[erco@ontario] % rush -submit                ## SUBMIT TEST JOB
title   foo
frames  1-10
command sleep 35                             ## EACH FRM SLEEPS 35 SECS
cpus    ontario=2
^D
setenv RUSH_JOBID ontario.8

[erco@ontario] % tail -f /usr/local/rush/var/cpu.acct   ## WATCH FRMS GET DONE
[..]
  Start      Jobid     Title Owner Frame Hostname Pri Wall Sys Usr Exit  PID
p 1169282036 ontario.8 foo   erco  0001  ontario   1   37   0   0    0  11011
p 1169282037 ontario.8 foo   erco  0002  ontario   1   36   0   0    0  11013
p 1169282074 ontario.8 foo   erco  0003  ontario   1   36   0   0    0  11019
p 1169282074 ontario.8 foo   erco  0004  ontario   1   36   0   0    0  11021
p 1169282112 ontario.8 foo   erco  0005  ontario   1   36   0   0    0  11027
p 1169282112 ontario.8 foo   erco  0006  ontario   1   36   0   0    0  11029
p 1169282149 ontario.8 foo   erco  0007  ontario   1   36   0   0    0  11085
p 1169282149 ontario.8 foo   erco  0008  ontario   1   36   0   0    0  11087
p 1169282186 ontario.8 foo   erco  0009  ontario   1   37   0   0    0  11093
p 1169282187 ontario.8 foo   erco  0010  ontario   1   36   0   0    0  11095
^C

(Note the PID column in the cpu.acct report above matches the PID column in the 'rush -lf' report below, frame for frame.)

[erco@ontario] % rush -lf ontario.8          ## VIEW RUSH FRAMES
STAT FRAME TRY HOSTNAME  PID   JOBID     START          ELAPSED  NOTES
Done 0001   1  ontario  11011  ontario.8 01/20,00:33:56 00:00:37
Done 0002   1  ontario  11013  ontario.8 01/20,00:33:57 00:00:36
Done 0003   1  ontario  11019  ontario.8 01/20,00:34:34 00:00:36
Done 0004   1  ontario  11021  ontario.8 01/20,00:34:34 00:00:36
Done 0005   1  ontario  11027  ontario.8 01/20,00:35:12 00:00:37
Done 0006   1  ontario  11029  ontario.8 01/20,00:35:12 00:00:37
Done 0007   1  ontario  11085  ontario.8 01/20,00:35:49 00:00:36
Done 0008   1  ontario  11087  ontario.8 01/20,00:35:49 00:00:36
Done 0009   1  ontario  11093  ontario.8 01/20,00:36:27 00:00:37
Done 0010   1  ontario  11095  ontario.8 01/20,00:36:27 00:00:37

The +/-1 second variance is due to rush's own overhead in switching a frame from the Run to the Done state; 'rush -lf' includes this overhead, cpu.acct does not.
--
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/