From: Ron Griswold <ron.griswold@(email surpressed)>
Subject: cpu.acct log
Date: Fri, 19 Jan 2007 19:03:06 -0500
Msg# 1456
Hi,

Long story short, we lost most of our log files from our previous show, making any stats we calculate a little inaccurate. I ran across the cpu.acct file using the 'rush -catlog cpu.acct host' command. It has info recorded dating back to before I started working here! Woo hoo!

But... I think I'm missing something. According to the docs, the 9th column in the 'p' report is the "#Secs Wall Clock Time". Both the "system time" and "user time" are 0. I'm assuming (perhaps naively) that the wall clock time is the elapsed time of the render script's execution, which is exactly what I need. The values, however, aren't even remotely close to what I would expect: for a render that took around 2h, the time is recorded as 79s.

Is there something wrong, or am I just not getting the meaning of this value?

Thanks!

Ron Griswold
Programmer/TD
Riot Pictures
From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: cpu.acct log
Date: Sat, 20 Jan 2007 04:13:45 -0500
Msg# 1457
Ron Griswold wrote:
> I ran across the cpu.acct file using the 'rush -catlog cpu.acct host'
> command. It has info recorded dating back before I started working here!
> Woo hoo!

Yes, that's the place to look, and yes, you want the wall clock time (which is the field that immediately follows the priority; the priority immediately follows the hostname).

> I think I'm missing something. According to the docs the 9th column in
> the 'p' report is the "#Secs Wall Clock Time".

Correct; that value is in seconds, and it is the number you want.

> Both the "system time" and "user time" are 0.

Yes, they are usually zero, because those are the system/user times for the immediate child (usually the perl script itself), which has little or no impact on the cpu. AFAIK, to this day even modern OSes still do not provide accumulated system/user time for children (e.g. the render processes themselves), so rush can't display them either. You might ask why rush doesn't poll the process tree and try to accumulate the user/system times itself; I tried, and I can tell you polling sucks and is prone to error. Only the OS can do these calculations properly, and from what I've seen, they do not.

> The values however aren't even remotely close to what I would expect
> them to be. For a render that took around 2h, the time is recorded
> as 79s.

Hmm, unless there's a bug I don't know about, you should be seeing a value of ~7200 in that field for a 2h render. Try the test below to check it. What platform were you on for the cpu.acct files you were looking at?

> Is there something wrong, or am I just not getting the meaning of this
> value?

I'm certainly hoping you haven't found a bug, and that you're just not looking at the right data somehow. If you have proof of a bug, show a few lines of excerpt from the cpu.acct log, and then show what you know to be the matching 'rush -lf' report for those entries to demonstrate the discrepancy; I'll be able to tell. Make sure you're not looking at requeued frames.
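To illustrate the column layout described above, here's a minimal sketch (not part of rush itself) of pulling the wall-clock seconds out of a cpu.acct 'p' record. It assumes the records are whitespace-delimited with the columns shown in the transcript below (p, start, jobid, title, owner, frame, hostname, pri, wall, sys, usr, exit, pid); job titles containing spaces would break this naive split, so treat it as a sketch only.

```python
def wall_seconds(line):
    """Return the wall-clock seconds from a cpu.acct 'p' record, else None.

    Assumes whitespace-delimited columns:
    p start jobid title owner frame hostname pri wall sys usr exit pid
    """
    fields = line.split()
    if len(fields) >= 13 and fields[0] == "p":
        return int(fields[8])   # 9th column: "#Secs Wall Clock Time"
    return None

# Example record taken from the transcript below:
record = "p 1169282036 ontario.8 foo erco 0001 ontario 1 37 0 0 0 11011"
print(wall_seconds(record))  # -> 37
```

For a 2h render you'd expect this function to return something near 7200, which is the check being discussed.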
Here's a quick test I just ran (linux) showing how to check the consistency between cpu.acct and 'rush -lf':

[erco@ontario] % rush -submit                ## SUBMIT TEST JOB
title   foo
frames  1-10
command sleep 35                             ## EACH FRM SLEEPS 35 SECS
cpus    ontario=2
^D
setenv RUSH_JOBID ontario.8

[erco@ontario] % tail -f /usr/local/rush/var/cpu.acct   ## WATCH FRMS GET DONE
[..]
  Start      Jobid     Title Owner Frame Hostname Pri Wall Sys Usr Exit  PID
p 1169282036 ontario.8 foo   erco  0001  ontario   1   37   0   0    0  11011
p 1169282037 ontario.8 foo   erco  0002  ontario   1   36   0   0    0  11013
p 1169282074 ontario.8 foo   erco  0003  ontario   1   36   0   0    0  11019
p 1169282074 ontario.8 foo   erco  0004  ontario   1   36   0   0    0  11021
p 1169282112 ontario.8 foo   erco  0005  ontario   1   36   0   0    0  11027
p 1169282112 ontario.8 foo   erco  0006  ontario   1   36   0   0    0  11029
p 1169282149 ontario.8 foo   erco  0007  ontario   1   36   0   0    0  11085
p 1169282149 ontario.8 foo   erco  0008  ontario   1   36   0   0    0  11087
p 1169282186 ontario.8 foo   erco  0009  ontario   1   37   0   0    0  11093
p 1169282187 ontario.8 foo   erco  0010  ontario   1   36   0   0    0  11095
^C

(Note the PID column in the cpu.acct report above matches the PID column in the 'rush -lf' report below, frame for frame.)

[erco@ontario] % rush -lf ontario.8          ## VIEW RUSH FRAMES
STAT FRAME TRY HOSTNAME  PID   JOBID     START          ELAPSED  NOTES
Done 0001   1  ontario  11011  ontario.8 01/20,00:33:56 00:00:37
Done 0002   1  ontario  11013  ontario.8 01/20,00:33:57 00:00:36
Done 0003   1  ontario  11019  ontario.8 01/20,00:34:34 00:00:36
Done 0004   1  ontario  11021  ontario.8 01/20,00:34:34 00:00:36
Done 0005   1  ontario  11027  ontario.8 01/20,00:35:12 00:00:37
Done 0006   1  ontario  11029  ontario.8 01/20,00:35:12 00:00:37
Done 0007   1  ontario  11085  ontario.8 01/20,00:35:49 00:00:36
Done 0008   1  ontario  11087  ontario.8 01/20,00:35:49 00:00:36
Done 0009   1  ontario  11093  ontario.8 01/20,00:36:27 00:00:37
Done 0010   1  ontario  11095  ontario.8 01/20,00:36:27 00:00:37

The +/-1 second variance is due to rush's own overhead in switching a frame from the Run to the Done state; 'rush -lf' includes this overhead, cpu.acct does not.
--
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/