From: Greg Ercolano <erco@(email suppressed)>
Subject: [Q+A] How do I determine what caused a render node to crash during
Date: Tue, 04 Apr 2006 15:18:01 -0400
Msg# 1269
> We're working on some large projects, and sometimes our renders crash
> some of our render farm nodes, causing them to hang or freeze up.
> Can you recommend any techniques for determining the cause of the crash?

A completely frozen box usually means the kernel panicked, or a mount
to the file server froze up (did you put mounts in your root directory?
Or make symlinks in the root directory to mount points? Bad!), or the
box is thrashing to death.

The most common cause is that the render process is using so much ram
that it's causing the box to swap to death. Unix boxes often don't
behave well when a process uses too much memory.

The next most common cause is a buggy OS; the render is somehow
triggering a bug in a device driver or the kernel, causing it to crash.
Sometimes heavy rendering can trigger subtle threading bugs in the
kernel between eg. the network card and/or disk driver, the two most
heavily used device drivers during rendering.

Or possibly there's a hardware problem, if the problem is specific to
one machine (ie. a burned out cpu fan, bad ram, dust problems on the
mobo, bad drive).

The first place to look would be the log file for the frame; quite
possibly there's enough info there to indicate something was wrong
during the render.

The next place to look would be the machine's own system log:

    LINUX:   /var/log/messages
    OSX:     /var/log/system.log
    WINDOWS: Check the "Event Viewer" (My Computer|Manage|Event Viewer)

For example, under linux, look in /var/log/messages for error messages
and/or look at the console's last dying messages before rebooting the
box.

I would leave the render nodes in text mode (ie. no X windows), and
when a box hangs in this way, inspect the console (VGA) for any last
messages, and become familiar with the magic "SysRq" key sequences to
inspect the kernel's status (if it's still alive enough to do so), eg:

    http://www.tldp.org/HOWTO/Keyboard-and-Console-HOWTO-8.html

This implies having an actual keyboard and monitor attached to the
render node during diagnosis -- I assume you have a KVM or at the very
least, a way to manually patch a keyboard/monitor to the blade. It's
very important to have access to the machine's console to diagnose hung
boxes, before whacking the reset button.

If you suspect a ram issue with the renders, you might want to leave
'vmstat 3' running to monitor ram/virt memory/io while renders are
running.

You could implement a boot script that logs the output of vmstat to a
log file (eg. /var/log/vmstat.log). It can be as simple as just
backgrounding a 'vmstat 3 >> /var/log/vmstat.log' command from a boot
script, or as robust as the following perl script, which includes date
stamps and [optionally] periodic ps(1) reports:

==========================================================================
#!/usr/bin/perl -w
#
# vmstat-logger -- watch vmstat statistics with date stamps
# erco 1.0 04/04/06
#
# Run this program at boot in the background to keep a log
# history of virtual memory use and disk i/o. eg:
#
#    /etc/LOCAL-vmstat-logger < /dev/null > /var/log/vmstat-errors.log 2>&1 &
#
use strict;

# CUSTOMIZABLE VALUES
my $logfile  = "/var/log/vmstat.log";   # log we append to
my $vmsecs   = 5;                       # vmstat sample rate (seconds)
my $datesecs = 300;                     # date stamp rate (must be > $vmsecs)
my $pssecs   = 60;                      # ps(1) report rate (must be > $vmsecs)

# OPEN LOG FOR APPENDING
#    Uses $logfile, so changing the value above actually takes effect.
unless ( open(LOG, ">>$logfile") ) {
    print STDERR "$0: $logfile: $!\n";
    exit(1);
}
select(LOG); $|=1; select(STDOUT);      # unbuffered i/o to log

# OPEN CONTINUOUS VMSTAT FOR READING
unless ( open(VMSTAT, "vmstat $vmsecs|") ) {
    print STDERR "$0: vmstat $vmsecs: (could not execute): $!\n";
    exit(1);
}

# LOG CONTINUOUS OUTPUT OF VMSTAT WITH DATESTAMPS
my $count;
my $vmstatout;
for ( $count=0; $vmstatout = <VMSTAT>; $count++ ) {
    # TIME STAMP
    #    scalar(localtime()) stands in for ctime() from the old ctime.pl,
    #    which current perl releases no longer ship.
    if ( ( $count % ($datesecs/$vmsecs)) == 0 ) {       # log time stamp
        print LOG "DATE: --- " . scalar(localtime()) . "\n";
    }

    # # PS REPORT
    # if ( ( $count % ($pssecs/$vmsecs)) == 0 ) {       # [OPTIONAL] log ps report
    #     print LOG "--- PS REPORT:\n";
    #     unless ( open(PS, "ps faux|") ) {
    #         print LOG "$0: ps faux: $!\n";
    #     } else {
    #         while (<PS>) {
    #             if ( ! /^root/ ) {                    # log all non-root processes
    #                 print LOG "PS: $_";
    #             }
    #         }
    #         close(PS);
    #     }
    # }

    # VMSTAT REPORT
    print LOG "VMSTAT: $vmstatout";                     # log vmstat output
}
# EOF
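As a rough sketch of how the resulting log might be checked after a crash, the shell function below prints only the "VMSTAT:" samples that show swap activity. It assumes the Linux vmstat layout, where the header row names the swap-in and swap-out columns "si" and "so"; the function name swap_lines is made up for this example and is not part of the script above.

```shell
#!/bin/sh
# swap_lines LOGFILE   (hypothetical helper, not part of vmstat-logger)
#
# Print every "VMSTAT:" sample line whose swap-in (si) or swap-out (so)
# column is nonzero. The column positions are read from the vmstat header
# row in the log, so the extra "VMSTAT:" prefix is accounted for.
# ASSUMPTION: the header row names the columns "si" and "so" (Linux vmstat).
swap_lines() {
    awk '
        # Ignore DATE: and PS: lines
        $1 != "VMSTAT:" { next }
        {
            # Header row: remember which fields hold si and so, then skip it
            header = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "si") { si = i; header = 1 }
                if ($i == "so") { so = i; header = 1 }
            }
            if (header) next
        }
        # Data row (first column numeric) with any swap traffic: print it
        si && so && $2 ~ /^[0-9]+$/ && ($si + 0 > 0 || $so + 0 > 0)
    ' "$1"
}

# Example use:
#     swap_lines /var/log/vmstat.log
```

Sustained nonzero si/so values in the minutes before a hang would be consistent with the "swapping to death" case described above; the nearest preceding "DATE:" line in the log gives the approximate time.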
From: Victor DiMichina <victor@(email suppressed)>
Subject: Re: [Q+A] How do I determine what caused a render node to crash during
Date: Thu, 06 Apr 2006 20:33:47 -0400
Msg# 1270
Greg Ercolano wrote:
> your root directory? Or make symlinks in the root directory to
> mount points? Bad!),

These are bad? What alternatives can we use so that Linux and OSX
systems can use UNC names properly? Without symlinks to the mount
points, our render machines have no idea where //server/drive is, as
they only know /Volumes/drive. We just have symlinks that point
<servername> --> / and <drivename> --> /Volumes/<drivename>.

Vic
From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: [Q+A] How do I determine what caused a render node to crash during
Date: Thu, 06 Apr 2006 21:08:56 -0400
Msg# 1271
>> your root directory? Or make symlinks in the root directory to
>> mount points? Bad!),
>
> These are bad? What alternatives can we use so that Linux and OSX
> systems can use UNC names properly?

[[this message was edited 08/01/06 for clarification]]

I knew someone would take the bait.

It's good you ask, because yes, putting symlinks in root that link to
mounts is as bad as putting your mount points in the root directory.
(Basically, it causes the same problem for the OS)

When the OS opens any file on the file system (eg. any files in /bin,
/usr, /lib, etc), the OS has to walk the root directory until it finds
the directory entry for the file you're requesting.

Let's say someone logs in as root and tries to run /bin/ls. If your
symlink in the root dir comes before /lib or /bin during the OS search,
and the mount is hung, the shell will hang when it tries to walk over
the symlink. (it touches the link, and thus the mount, and hangs)

Remember: the order of directory entries on the disk may not be
alphabetical, and may be hashed (randomized).

The idea is to never put mount points (or symlinks to mount points) in
the root dir; put them *one level down*, so the OS doesn't have to step
over them.

EXAMPLE: This is bad, because it effectively puts a symlink called
/mountpoint in the root dir:

    mount somehost:/somedir /some/mountpoint
    ln -s /some/mountpoint /mountpoint            <-- BAD

..but this is good:

    mount somehost:/somedir /some/mountpoint
    mkdir /net                                    <-- GOOD
    cd /net; ln -s /some/mountpoint mountpoint    <-- GOOD

..because "/net" is a regular directory, with the symlink to the mount
safely /inside/ of it. Programs walking the root directory won't have
any trouble stepping over the /net directory, as it won't resolve to a
mount.

This allows root to log in to the machine (your root login should not
have anything in its path that refers to NFS drives), and be able to do
admin tasks, even if all the mounts are hung.
On my system, when I have shell scripts that live on a file server I
want root to be able to access, instead of adding these dirs to root's
path, I make *aliases* in root's login to refer to them, eg:

    alias phone /net/mountpoint/scripts/phone    # phone list script

This way, there's no chance of my accidentally triggering a hung mount
just by typing innocuous commands like ls(1), which would otherwise
hang up during the PATH search. The hung mount would only hang my shell
if I actually invoked the above alias.

Under most situations, you may not encounter this problem, even if you
have mounts and/or symlinks in the root dir, because often you'll be
lucky that the dirs you create are installed *after* the critical /lib,
/bin and /usr dirs. But even though you might be safe.. Don't Tempt Fate!

You don't want to be in a situation where a hung mount causes all root
logins to hang because that mount's directory is in root's PATH,
preventing you from logging in to fix the problem..!

--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)
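One way to audit a machine for the "BAD" layout described in this message is sketched below. It lists symlinks sitting directly in a directory (normally /) whose targets are mount points. The function name risky_links is hypothetical, and the mounts file is assumed to use the Linux /proc/mounts layout (device, mount point, fs type, ...). Only readlink(1) and string comparison are used, so the check itself should not touch a hung mount; relative link targets are not handled.

```shell
#!/bin/sh
# risky_links DIR MOUNTS_FILE   (hypothetical helper)
#
# Print each symlink directly inside DIR whose target is, or lies below,
# a mount point listed in MOUNTS_FILE.
# ASSUMPTION: MOUNTS_FILE has the Linux /proc/mounts layout, ie.
#             "device mountpoint fstype options ...", one mount per line.
# The link target is read with readlink(1) and compared as a string;
# it is never opened or followed. Relative targets are not matched.
risky_links() {
    dir=$1
    mounts=$2
    for f in "${dir%/}"/*; do
        [ -L "$f" ] || continue              # only look at symlinks
        target=$(readlink "$f")              # read the link, do not follow it
        while read -r _dev mnt _rest; do
            [ "$mnt" = "/" ] && continue     # everything is below "/"
            case "$target/" in
                "$mnt"/*) echo "$f -> $target (mount: $mnt)"; break ;;
            esac
        done < "$mounts"
    done
}

# Example use on Linux:
#     risky_links / /proc/mounts
```

With the example from this message, /mountpoint would be reported, while /net would not, because /net is a regular directory and the symlink lives one level down inside it.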
From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: [Q+A] How do I determine what caused a render node to crash during
Date: Thu, 06 Apr 2006 21:18:32 -0400
Msg# 1272
One quick clarification [IN CAPS]:

    It's good you ask, because yes, putting symlinks TO A MOUNT POINT
    in root is as bad as putting your mount points in the root directory.

Symlinks in / are OK, as long as they're not pointing to NFS dirs,
including mount points.

Basically, mount points (and any dirs below them) are potential 'traps'
for mission critical processes, like emergency root logins.

You need to be careful that root logins don't cross paths with NFS
directories, so that one can do emergency administration on client
machines that have hung NFS mounts [without hanging up].

From a sysadmin's point of view, I've learned to think of "hard" mounted
NFS directories as the "kiss of death" that can prevent one from even
logging in as root. If anything in the root login's process [or shell]
touches an NFS drive, the root login is toast when there's an NFS outage.

To ensure root logins 'always work', you have to be really vigilant in
preventing NFS paths from getting into root logins.
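A minimal sketch of the kind of vigilance described here: the function below reports PATH entries that sit on an NFS mount. The name nfs_path_entries is made up for this example, and the mounts file is assumed to follow the Linux /proc/mounts layout with an fs type beginning with "nfs". The comparison is purely textual, so nothing on the mount is touched, but it also means PATH entries that reach NFS through a symlink are not detected.

```shell
#!/bin/sh
# nfs_path_entries PATH_STRING MOUNTS_FILE   (hypothetical helper)
#
# Print each directory in the colon-separated PATH_STRING that is, or
# lies below, a mount point whose fs type starts with "nfs".
# ASSUMPTION: MOUNTS_FILE has the Linux /proc/mounts layout, ie.
#             "device mountpoint fstype options ...", one mount per line.
# Paths are compared as strings only; symlinks are not resolved, since
# resolving them could itself hang on a dead mount.
nfs_path_entries() {
    printf '%s\n' "$1" | tr ':' '\n' | while read -r p; do
        [ -n "$p" ] || continue                        # skip empty entries
        while read -r _dev mnt type _rest; do
            case "$type" in nfs*) ;; *) continue ;; esac   # NFS mounts only
            case "$p/" in
                "$mnt"/*) echo "$p (NFS mount: $mnt)"; break ;;
            esac
        done < "$2"
    done
}

# Example use from root's login on Linux:
#     nfs_path_entries "$PATH" /proc/mounts
```

If this prints anything for root's PATH, those entries are candidates for the alias approach described earlier in the thread.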