From: Dylan Penhale <dylan@(email surpressed)>
Subject: OSX hidden process using 50% sys.
   Date: Wed, 16 Sep 2009 23:04:54 -0400
Msg# 1887
View Complete Thread (5 articles) | All Threads
Last Next
Title: OSX hidden process using 50% sys.
Hi

We have a tricky little problem here, I wonder if you have seen anything like it.  

We are submitting a Maya mentalray render from a PC to a bunch of PC’s and Macs. The job server is running XP. On one Mac render node, sometimes more than one, we seemingly randomly see the CPU pegged at 50% system. Top does not show what process is using that 50%, it appears to be hidden. A little digging reveals that there are two processes that appear to be “holding” log files open on the rush log server. Interestingly these commands are both trying to access the same log file and they are NOT the log file that the machine has rendered, nor been assigned to render (from what we can tell).

We are unable to read the log file from the problem render node, but the file is readable from all other hosts.
We think that the kernel of this render node is stuck trying to access this file, which in turn is causing the high cpu sys load.

When we lsof we can see that the command that is accessing the log file is Render.

Render    19841      netrender    1w      REG     26,10     47549 37261653 /Volumes/atlantic/rushlogs/3d/tjn_se_0210_c004rs_a033as_l014hh_seal_matteSealA.log/0008
Render    19841      netrender    2w      REG     26,10     47549 37261653 /Volumes/atlantic/rushlogs/3d/tjn_se_0210_c004rs_a033as_l014hh_seal_matteSealA.log/0008

Note: This not did not try to render frame 0008 from what we can tell.

Should Render be writing the log file directly, or should the STDERR from Render be passed to Rush to do the writing?

The PID’s listed do not show up in the process list (top or ps) so we assume the kernel is experiencing some sort of bug.

We also thought that perhaps this render node was assigned to render this frame, and at the last minute informed that another node was already doing it so told to abort it’s render. It’s hard to tell from the cpu.acct file what is happening as I think this only gets updated at the end of each successful frame.

We are not able to umount the NFS share that the logs are written to, which again suggests the logs are somehow the cause. In fact the only way to get the CPU to behave normally is to restart the box, which can only be done by physically holding the power button in.

In this case the render node is running OSX 10.5.4, and is an Intel Core 2 Duo.

Any ideas?

DYLAN PENHALE
IT MANAGER

FUEL
65 KING STREET
NEWTOWN  SYDNEY
NSW  2042  AUSTRALIA

T. 61 2 9557 7799
F. 61 2 9557 7882
M. 0424 655 320

WWW.FUELVFX.COM


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: OSX hidden process using 50% sys.
   Date: Thu, 17 Sep 2009 00:31:30 -0400
Msg# 1888
View Complete Thread (5 articles) | All Threads
Last Next
Dylan Penhale wrote:
> We are submitting a Maya mentalray render from a PC to a bunch of PC’s
> and Macs. The job server is running XP. On one Mac render node,
> sometimes more than one, we seemingly randomly see the CPU pegged at 50%
> system. Top does not show what process is using that 50%, it appears to
> be hidden.

	The kernel can take CPU of its own to process IO, and that
	won't show up associated with any process. For instance, in
	rushtop this would show up in red.

> A little digging reveals that there are two processes that
> appear to be “holding” log files open on the rush log server.

> Interestingly these commands are both trying to access the same log file
> and they are NOT the log file that the machine has rendered, nor been
> assigned to render (from what we can tell).

> We are unable to read the log file from the problem render node, but the
> file is readable from all other hosts.

	When you try to read the log, what happens, and with what
	command/technique are you trying to look at the file?
	(rush -log, more, type, cat, text editor, etc)

> We think that the kernel of this render node is stuck trying to access
> this file, which in turn is causing the high cpu sys load.

	Yes, most likely, if the cpu is not attributed to a process.

> When we lsof we can see that the command that is accessing the log file
> is Render.
> 
> Render    19841      netrender    1w      REG     26,10     47549
> 37261653
> /Volumes/atlantic/rushlogs/3d/tjn_se_0210_c004rs_a033as_l014hh_seal_matteSealA.log/0008
> Render    19841      netrender    2w      REG     26,10     47549
> 37261653
> /Volumes/atlantic/rushlogs/3d/tjn_se_0210_c004rs_a033as_l014hh_seal_matteSealA.log/0008

	The process ids are the same (19841), so it's one process.

	There are two entries in lsof, one for the stdout file descriptor (1w)
	and one for the stderr file descriptor (2w) which are both 'write only' (w).

	This is normal behavior during rendering, because the processes
	are generating output that is being written to the file server.

	You might check to see what's going on in the log, as maybe it's generating
	a large amount of output due to a high verbosity setting.

	Often you can mitigate the cpu/network overhead of render's verbose output
	by setting the 'maxlogsize' to some very high number (like 10000000, ie. 10M)
	so that the render output is piped through 'logtrim', which will line buffer
	the render's output so that verbose output is buffered.

	BTW, what version of rush are you running?
	I don't think it matters in this case, as the io has to do with the render
	and likely its verbosity setting. But it may help to know on this end.


-- 
Greg Ercolano, erco@(email surpressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Dylan Penhale <dylanpenhale@(email surpressed)>
Subject: Re: OSX hidden process using 50% sys.
   Date: Thu, 17 Sep 2009 02:10:51 -0400
Msg# 1889
View Complete Thread (5 articles) | All Threads
Last Next


> From: Greg Ercolano <erco@(email surpressed)>
> Reply-To: <rush decimal general at seriss decimal com>
> Date: 17 Sep 2009 04:31:30 -0000
> To: <void@(email surpressed)>
> Subject: Re: OSX hidden process using 50% sys.
> 
> [posted to rush.general]
> 
> Dylan Penhale wrote:
>> We are submitting a Maya mentalray render from a PC to a bunch of PC=92=
> s
>> and Macs. The job server is running XP. On one Mac render node,
>> sometimes more than one, we seemingly randomly see the CPU pegged at 50=
> %
>> system. Top does not show what process is using that 50%, it appears to
>> be hidden.
> 
> The kernel can take CPU of its own to process IO, and that
> won't show up associated with any process. For instance, in
> rushtop this would show up in red.

Yes, rushtop shows lots of red :)

> 
>> A little digging reveals that there are two processes that
>> appear to be =93holding=94 log files open on the rush log server.
> 
>> Interestingly these commands are both trying to access the same log fil=
> e
>> and they are NOT the log file that the machine has rendered, nor been
>> assigned to render (from what we can tell).
> 
>> We are unable to read the log file from the problem render node, but th=
> e
>> file is readable from all other hosts.
> 
> When you try to read the log, what happens, and with what
> command/technique are you trying to look at the file?
> (rush -log, more, type, cat, text editor, etc)

Cat, more, tail - all return no output.

As I say, the log file is ok because we can view it from other machines. I
am curios why this box is even trying to write these files as this node
isn't even supposed to be rendering this frame. I think this might be the
clue to the issue.

> 
>> We think that the kernel of this render node is stuck trying to access
>> this file, which in turn is causing the high cpu sys load.
> 
> Yes, most likely, if the cpu is not attributed to a process.
> 
>> When we lsof we can see that the command that is accessing the log file
>> is Render.
>> =20
>> Render    19841      netrender    1w      REG     26,10     47549
>> 37261653
>> /Volumes/atlantic/rushlogs/3d/tjn_se_0210_c004rs_a033as_l014hh_seal_mat=
> teSealA.log/0008
>> Render    19841      netrender    2w      REG     26,10     47549
>> 37261653
>> /Volumes/atlantic/rushlogs/3d/tjn_se_0210_c004rs_a033as_l014hh_seal_mat=
> teSealA.log/0008
> 
> The process ids are the same (19841), so it's one process.
> 
> There are two entries in lsof, one for the stdout file descriptor (1w)
> and one for the stderr file descriptor (2w) which are both 'write only' =
> (w).
> 
> This is normal behavior during rendering, because the processes
> are generating output that is being written to the file server.
> 
> You might check to see what's going on in the log, as maybe it's generat=
> ing
> a large amount of output due to a high verbosity setting.
> 

The render log looks the same as the others, nothing to suggest an error.
Log size is 46K.

> Often you can mitigate the cpu/network overhead of render's verbose outp=
> ut
> by setting the 'maxlogsize' to some very high number (like 10000000, ie.=
>  10M)
> so that the render output is piped through 'logtrim', which will line bu=
> ffer
> the render's output so that verbose output is buffered.
> 
> BTW, what version of rush are you running?
> I don't think it matters in this case, as the io has to do with the rend=
> er
> and likely its verbosity setting. But it may help to know on this end.

RUSHD 102.42a9

> 
> 
> --=20
> Greg Ercolano, erco@(email surpressed)
> Seriss Corporation
> Rush Render Queue, http://seriss.com/rush/
> Tel: 626-795-5922x23
> Fax: 626-795-5947
> Cel: 310-266-8906



   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: OSX hidden process using 50% sys.
   Date: Thu, 17 Sep 2009 03:56:27 -0400
Msg# 1890
View Complete Thread (5 articles) | All Threads
Last Next
Dylan Penhale wrote:
>>> We are unable to read the log file from the problem render node,
>>> but the file is readable from all other hosts.
>> When you try to read the log, what happens, and with what
>> command/technique are you trying to look at the file?
>> (rush -log, more, type, cat, text editor, etc)
> 
> Cat, more, tail - all return no output.
> 
> As I say, the log file is ok because we can view it from other machines. I
> am curios why this box is even trying to write these files as this node
> isn't even supposed to be rendering this frame. I think this might be the
> clue to the issue.

	This sounds messed up at the file system level;
	I'd suspect the network file system as the problem,
	probably at the client side.

	Hopefully other folks can chime in here if they've seen
	this too.

	What kind of machine is the file server at the other end
	of the NFS mount? Possibly there's an NFS incompatibility
	there.. only thing I can think of other than the client OS.

	I haven't heard of such a problem in OSX's nfs; in my
	experience NFS has been pretty much problem free mounting
	the linux NFS server here.

	BTW, you mention you're unable to umount the file server,
	is this true even after you kill the Render process, and
	lsof verifies there are no other mounts? Or is the Render
	process unkillable because it's 'hung accessing the mount',
	which would all the more point to an NFS client fault.

	Did you try 'umount -f'?

>>> We think that the kernel of this render node is stuck trying to access
>>> this file, which in turn is causing the high cpu sys load.

>> Yes, most likely, if the cpu is not attributed to a process.

> The render log looks the same as the others, nothing to suggest an error.
> Log size is 46K.

	Try yanking the network cable to see if it affects the 50% cpu use.
	My guess is it won't, which means the kernel is spinning doing
	'something bad'.. hard to say what.

	Is the whole mount in a bad state (ie can you view other files OK?),
	or is it just this file and all else works fine?

	All the behavior sounds quite odd from an OS point of view.
	I'd say the client OS is at fault here, with only a slim
	possibility the NFS server might be triggering it.

	Once rush starts the Render process, its up to the OS and Render
	to do its thing; rush has no part of the rendering process once
	it starts until the process exits; then rush steps in and returns
	the exit code.

	I think if you want to move forward, get out the nfs debugging
	tools; tcpdump, possibly even ktrace (or whatever it's called
	on Leopard). If you suspect maybe the server is at fault,
	contact the vendor if you're on support.

> RUSHD 102.42a9

	That sounds fine.

	That release is a little over a year old (June 08),
	and there's only been 2 minor maintenance releases since then..
	but nothing you'd need.

-- 
Greg Ercolano, erco@(email surpressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Dylan Penhale <dylanpenhale@(email surpressed)>
Subject: Re: OSX hidden process using 50% sys.
   Date: Fri, 18 Sep 2009 02:54:39 -0400
Msg# 1891
View Complete Thread (5 articles) | All Threads
Last Next


> From: Greg Ercolano <erco@(email surpressed)>
> Reply-To: <rush decimal general at seriss decimal com>
> Date: 17 Sep 2009 07:56:27 -0000
> To: <void@(email surpressed)>
> Subject: Re: OSX hidden process using 50% sys.
> 
> [posted to rush.general]
> 
> Dylan Penhale wrote:
>>>> We are unable to read the log file from the problem render node,
>>>> but the file is readable from all other hosts.
>>> When you try to read the log, what happens, and with what
>>> command/technique are you trying to look at the file?
>>> (rush -log, more, type, cat, text editor, etc)
>> 
>> Cat, more, tail - all return no output.
>> 
>> As I say, the log file is ok because we can view it from other machines. I
>> am curios why this box is even trying to write these files as this node
>> isn't even supposed to be rendering this frame. I think this might be the
>> clue to the issue.
> 
> This sounds messed up at the file system level;
> I'd suspect the network file system as the problem,
> probably at the client side.
> 
> Hopefully other folks can chime in here if they've seen
> this too.
> 
> What kind of machine is the file server at the other end
> of the NFS mount? Possibly there's an NFS incompatibility
> there.. only thing I can think of other than the client OS.
> 
> I haven't heard of such a problem in OSX's nfs; in my
> experience NFS has been pretty much problem free mounting
> the linux NFS server here.
> 
> BTW, you mention you're unable to umount the file server,
> is this true even after you kill the Render process, and
> lsof verifies there are no other mounts? Or is the Render
> process unkillable because it's 'hung accessing the mount',
> which would all the more point to an NFS client fault.
> 
> Did you try 'umount -f'?

Yes, this doesn't work though, it just hangs the shell. Certainly looks like
a kernel error. Once in this state you can't cancel the umount command, you
have to create a new ssh session or power cycle the machine.

The mount options are "intr,bg,resvport,vers=3" which I would think would be
pretty safe. The log server in this case is an OSX server (it's pretty much
only used for rush logs). The server doesn't give anything away in it's
logs.

> 
>>>> We think that the kernel of this render node is stuck trying to access
>>>> this file, which in turn is causing the high cpu sys load.
> 
>>> Yes, most likely, if the cpu is not attributed to a process.
> 
>> The render log looks the same as the others, nothing to suggest an error.
>> Log size is 46K.
> 
> Try yanking the network cable to see if it affects the 50% cpu use.
> My guess is it won't, which means the kernel is spinning doing
> 'something bad'.. hard to say what.
> 
> Is the whole mount in a bad state (ie can you view other files OK?),
> or is it just this file and all else works fine?

Yes we can view other files, it's just this one file we have problems with.

> 
> All the behavior sounds quite odd from an OS point of view.
> I'd say the client OS is at fault here, with only a slim
> possibility the NFS server might be triggering it.
> 
> Once rush starts the Render process, its up to the OS and Render
> to do its thing; rush has no part of the rendering process once
> it starts until the process exits; then rush steps in and returns
> the exit code.
> 
> I think if you want to move forward, get out the nfs debugging
> tools; tcpdump, possibly even ktrace (or whatever it's called
> on Leopard). If you suspect maybe the server is at fault,
> contact the vendor if you're on support.
> 
>> RUSHD 102.42a9
> 
> That sounds fine.
> 
> That release is a little over a year old (June 08),
> and there's only been 2 minor maintenance releases since then..
> but nothing you'd need.
> 
> -- 
> Greg Ercolano, erco@(email surpressed)
> Seriss Corporation
> Rush Render Queue, http://seriss.com/rush/
> Tel: 626-795-5922x23
> Fax: 626-795-5947
> Cel: 310-266-8906