From: Greg Ercolano <erco@(email surpressed)>
Subject: [Q+A] How do I determine what caused a render node to crash during
   Date: Tue, 04 Apr 2006 15:18:01 -0400
Msg# 1269
> We're working on some large projects, and sometimes our renders crash
> some of our render farm nodes, causing them to hang or freeze up.
> Can you recommend any techniques for determining the cause of the crash?

    A completely frozen box usually means the kernel panicked,
    or a mount to the file server froze up (did you put mounts in
    your root directory? Or make symlinks in the root directory to
    mount points? Bad!), or the box is thrashing to death.

    The most common cause is the render process using so much ram
    that it causes the box to swap to death. Unix boxes often don't behave
    well when a process uses too much memory.

    The next common cause is a buggy OS; the render is somehow triggering
    a bug in a device driver or kernel causing it to crash. Sometimes
    heavy rendering can trigger subtle threading bugs in the kernel
    between eg. the network card and/or disk driver, the two most heavily
    used device drivers during rendering.

    Or possibly there's a hardware problem if the problem is specific
    to a machine (ie. a burned out cpu fan, bad ram, dust problems
    on the mobo, bad drive)

    The first place to look would be the log file for the frame..
    quite possibly there's enough info there to indicate something
    was wrong during the render.

    Next place to look would be the machine's own system log:

	LINUX: /var/log/messages
	OSX: /var/log/system.log
	WINDOWS: Check the "Event Viewer" (My Computer|Manage|Event Viewer)

    For example, under linux, look in /var/log/messages for error messages
    and/or look at the console's last dying messages before rebooting the box.
    I would leave the render nodes in text mode (ie. no X windows),
    and when a box hangs in this way, inspect the console (VGA) for any
    last messages, and become familiar with the magic "SysRq" key sequences
    to inspect the kernel's status (if it's still alive enough to do so), eg:
    http://www.tldp.org/HOWTO/Keyboard-and-Console-HOWTO-8.html

    This implies having an actual keyboard and monitor attached to
    the render node during diagnosis -- I assume you have a KVM or at very
    least, a way to manually patch a keyboard/monitor to the blade. Very
    important to have access to the machine's console to do diagnosis
    of hung boxes, before whacking the reset button.
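
    For the system log check above, a quick filter can save some scrolling
    once the box is back up. This is only a minimal sketch: the function name
    and the list of patterns are my own assumptions, the exact wording of
    kernel messages varies between kernel versions and distros, and it will
    happily match unrelated lines too.

```shell
#!/bin/sh
# Sketch: pull likely crash clues out of a syslog-style file.
# ASSUMPTIONS: scan_syslog and its pattern list are illustrative only;
# adjust the patterns to whatever your kernel actually logs.

scan_syslog() {
    log=${1:-/var/log/messages}     # log file to search

    # Case-insensitive search for memory exhaustion, kernel trouble,
    # unresponsive servers and disk errors.
    grep -iE 'out of memory|oom-killer|kernel panic|oops|not responding|i/o error' "$log"
}
```

    Run it as 'scan_syslog' for the default log, or pass another file as the
    first argument (eg. a rotated log). It prints nothing, and returns
    non-zero, when no line matches.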

    If you suspect a ram issue with the renders, you might want to
    leave 'vmstat 3' running to monitor ram/virt memory/io while renders
    are running.

    You could implement a boot script that logs the output of vmstat
    to a log file (eg. /var/log/vmstat.log). It can be as simple as just
    backgrounding a 'vmstat 3 >> /var/log/vmstat.log' command from a boot
    script, or more robust as the following perl script, which includes
    date stamps and [optionally] periodic ps(1) reports:

==========================================================================

#!/usr/bin/perl -w
#
# vmstat-logger -- watch vmstat statistics with date stamps
# erco 1.0 04/04/06
#
#     Run this program at boot in the background to keep a log
#     history of virtual memory use and disk i/o. eg:
#
#           /etc/LOCAL-vmstat-logger < /dev/null > /var/log/vmstat-errors.log 2>&1 &
#

use strict;
use POSIX qw(ctime);				# ctime(); ctime.pl is no longer shipped with recent perls

# CUSTOMIZABLE VALUES
my $logfile = "/var/log/vmstat.log";		# log we append to
my $vmsecs   = 5;				# vmstat sample rate (seconds)
my $datesecs = 300;				# date stamp rate (must be > $vmsecs)
my $pssecs   = 60;				# ps(1) report rate (must be > $vmsecs)

# OPEN LOG FOR APPENDING
unless ( open(LOG, ">>$logfile") ) {
    print STDERR "$0: $logfile: $!\n";
    exit(1);
}
select(LOG); $|=1; select(STDOUT);		# unbuffered i/o to log

# OPEN CONTINUOUS VMSTAT FOR READING
unless ( open(VMSTAT, "vmstat $vmsecs|") ) {
    print STDERR "$0: vmstat $vmsecs: (could not execute): $!\n";
    exit(1);
}

# LOG CONTINUOUS OUTPUT OF VMSTAT WITH DATESTAMPS
my $count;
my $vmstatout;
for ( $count=0; defined($vmstatout = <VMSTAT>); $count++ ) {

    # TIME STAMP
    if ( ( $count % ($datesecs/$vmsecs)) == 0 ) {	# log time stamp
        print LOG "DATE: --- " . ctime(time());
    }

#   # PS REPORT
#   if ( ( $count % ($pssecs/$vmsecs)) == 0 ) {		# [OPTIONAL] log ps report
#       print LOG "--- PS REPORT:\n";
#       unless ( open(PS, "ps faux|") ) {
#           print LOG "$0: ps faux: $!\n";
#       } else {
#           while (<PS>) {
#               if ( ! /^root/ ) {			# log all non-root processes
#                   print LOG "PS: $_";
#               }
#           }
#           close(PS);
#       }
#   }

    # VMSTAT REPORT
    print LOG "VMSTAT: $vmstatout";				# log vmstat output
}

# EOF
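
    For comparison, the "simple" variant mentioned earlier (just backgrounding
    vmstat from a boot script) can still get date stamps with a few lines of
    shell. A minimal sketch, not a replacement for the perl script: the
    function name is made up for this example, and like the perl version it
    relies on vmstat writing each sample out as it is produced.

```shell
#!/bin/sh
# Sketch: prefix every line read from stdin with a date stamp.
# ASSUMPTION: stamp_lines is a hypothetical helper name.

stamp_lines() {
    while IFS= read -r line; do
        # One date(1) call per line; fine at vmstat's sample rate
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Typical use from a boot script (left commented out so that sourcing
# this file has no side effects):
#
#     vmstat 3 | stamp_lines >> /var/log/vmstat.log &
```

    Stamping every line costs one extra process per sample, which is why the
    perl script above only logs a date every $datesecs seconds instead.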

   From: Victor DiMichina <victor@(email surpressed)>
Subject: Re: [Q+A] How do I determine what caused a render node to crash during
   Date: Thu, 06 Apr 2006 20:33:47 -0400
Msg# 1270

Greg Ercolano wrote:
> (did you put mounts in
> your root directory? Or make symlinks in the root directory to
> mount points? Bad!),

These are bad? What alternatives can we use so that Linux and OSX systems can
use UNC names properly? Without symlinks to the mount points, our render
machines have no idea where //server/drive is, as they only know
/Volumes/drive. We just have symlinks that point <servername> --> / and
<drivename> --> /Volumes/<drivename>.
Vic


   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: [Q+A] How do I determine what caused a render node to crash during
   Date: Thu, 06 Apr 2006 21:08:56 -0400
Msg# 1271
Victor DiMichina wrote:
> > your root directory? Or make symlinks in the root directory to
> > mount points? Bad!),
>
> These are bad? What alternatives can we use so that Linux and OSX systems
> can use UNC names properly.

       [[this message was edited 08/01/06 for clarification]]

	I knew someone would take the bait.

	It's good you ask, because yes, putting symlinks in root that link
	to mounts is as bad as putting your mount points in the root directory.
	(Basically, it causes the same problem for the OS.)

	When the OS opens any file on the file system (eg. any files in /bin,
	/usr, /lib, etc), the OS has to walk the root directory until
	it finds the directory entry for the file you're requesting.

	Let's say someone logs in as root and tries to run /bin/ls.
	If your symlink in the root dir comes before /lib or /bin during
	the OS search, and the mount is hung, the shell will hang
	when it tries to walk over the symlink (it touches the link,
	and thus the mount, and hangs).

	Remember: the order of directory entries on the disk may not
	be alphabetical, and may be hashed (randomized).

	The idea is to never put mount points (or symlinks to mount points)
	in the root dir; put them *one level down*, so the OS doesn't have
	to step over them.

	EXAMPLE:
	This is bad, because it effectively puts a symlink called /mountpoint
	in the root dir:

mount somehost:/somedir /some/mountpoint
ln -s /some/mountpoint /mountpoint		<-- BAD

	..but this is good:

mount somehost:/somedir /some/mountpoint
mkdir /net					<-- GOOD
cd /net; ln -s /some/mountpoint mountpoint	<-- GOOD

	..because "/net" is a regular directory, with the symlink to the mount
	safely /inside/ of it. Programs walking the root
	directory won't have any trouble stepping over the /net directory,
	as it won't resolve to a mount.
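
	To audit an existing box for the "bad" layout, something along
	these lines could work. It is only a sketch under some assumptions:
	a Linux-style mount table in /proc/mounts (OSX doesn't have one),
	a made-up function name, and absolute symlink targets. It compares
	strings only (the link text against the mount table), so it
	shouldn't touch the mounts themselves while checking. Dot files
	in the directory are not examined.

```shell
#!/bin/sh
# Sketch: list entries in a directory (default: /) that are NFS mount
# points, or symlinks whose link text points at or below an NFS mount.
# ASSUMPTIONS: find_risky_root_entries is a hypothetical name; the mount
# table uses the /proc/mounts layout "device mountpoint fstype ...".

find_risky_root_entries() {
    dir=${1:-/}                     # directory to inspect
    mounts=${2:-/proc/mounts}       # mount table to read

    # NFS mount points (fstype nfs, nfs4, ..), one per line
    nfs_points=$(awk '$3 ~ /^nfs/ { print $2 }' "$mounts")
    [ -n "$nfs_points" ] || return 0

    for entry in "${dir%/}"/*; do
        # Read the link text only; the link target is never followed.
        # Fails (leaving $link empty) for anything that isn't a symlink.
        link=$(readlink "$entry" 2>/dev/null) || link=

        printf '%s\n' "$nfs_points" | while IFS= read -r mp; do
            if [ "$entry" = "$mp" ]; then
                echo "MOUNT   $entry"
            else
                case "$link" in
                    "$mp"|"$mp"/*) echo "SYMLINK $entry -> $link" ;;
                esac
            fi
        done
    done
}
```

	With the "bad" example above in place, running
	'find_risky_root_entries /' would be expected to report the
	/mountpoint symlink, while the "good" /net layout reports nothing
	because /net itself is a plain directory.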

	This allows root to login to the machine (your root login should
	not have anything in its path that refers to NFS drives), and
	be able to do admin tasks, even if all the mounts are hung.

	On my system, when I have shell scripts that live on a file server
	I want root to be able to access, instead of adding these dirs to
	root's path, I make *aliases* in root's login to refer to them, eg:

alias phone /net/mountpoint/scripts/phone		# phone list script

	This way, there's no chance of my accidentally triggering a hung
	mount just by typing innocuous commands like ls(1) that will hang
	up in the PATH. The hung mount would only hang my shell if I
	actually invoked the above alias.

	Under most situations, you may not encounter this problem,
	even if you have mounts and/or symlinks in the root dir, because
	often you'll be lucky that the dirs you create are installed *after*
	the critical /lib, /bin and /usr dirs.

	But even though you might be safe.. Don't Tempt Fate!
	You don't want to be in a situation where a hung mount
	causes all root logins to hang because that mount's directory
	is in root's PATH, preventing you from logging in to fix the
	problem..!

--
Greg Ercolano, erco@(email surpressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)

   From: Greg Ercolano <erco@(email surpressed)>
Subject: Re: [Q+A] How do I determine what caused a render node to crash during
   Date: Thu, 06 Apr 2006 21:18:32 -0400
Msg# 1272
	One quick clarification [IN CAPS]:

    It's good you ask, because yes, putting symlinks TO A MOUNT POINT in root
    is as bad as putting your mount points in the root directory.

	Symlinks in / are OK, as long as they're not pointing to NFS dirs,
	including mount points.

	Basically mount points (and any dirs below them) are potential 'traps'
	for mission critical processes, like emergency root logins.

	You need to be careful that root logins don't cross paths with NFS directories,
	so that one can do emergency administration on client machines that have hung
	NFS mounts [without hanging up].

	From a sysadmin's point of view, I've learned to think of "hard" mounted
	NFS directories as the "kiss of death" that can prevent one from even
	logging in as root. If anything in the root login's process [or shell]
	touches an NFS drive, the root login is toast when there's an NFS outage.

	To ensure root logins 'always work', you have to really be vigilant
	in preventing NFS paths from getting into root logins.
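
	One way to keep an eye on this is to check root's PATH against
	the mount table now and then. A rough sketch, assuming a
	Linux-style /proc/mounts and a made-up function name. It only
	compares strings, so it won't hang on a dead mount, but for the
	same reason it won't notice a PATH entry that reaches NFS through
	a symlink (eg. something under /net).

```shell
#!/bin/sh
# Sketch: print each directory in a PATH-style list that sits on or
# below an NFS mount point, judged by string comparison only.
# ASSUMPTIONS: nfs_dirs_in_path is a hypothetical name; the mount table
# uses the /proc/mounts layout "device mountpoint fstype ...".

nfs_dirs_in_path() {
    path=${1:-$PATH}                # colon separated directory list
    mounts=${2:-/proc/mounts}       # mount table to read

    # NFS mount points (fstype nfs, nfs4, ..), one per line
    nfs_points=$(awk '$3 ~ /^nfs/ { print $2 }' "$mounts")
    [ -n "$nfs_points" ] || return 0

    # Split the list on ':' and test each directory in turn
    printf '%s\n' "$path" | tr ':' '\n' | while IFS= read -r d; do
        [ -n "$d" ] || continue
        printf '%s\n' "$nfs_points" | while IFS= read -r mp; do
            case "$d" in
                "$mp"|"$mp"/*) echo "$d" ;;
            esac
        done
    done
}
```

	Run as root with no arguments, it checks the current PATH; any
	output at all means a root login could hang during an NFS outage.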
