From: Abraham Schneider <aschneider@(email suppressed)>
Subject: strange retries
Date: Mon, 26 Jun 2006 05:47:19 -0400
Msg# 1317
Hi!

We have some trouble with our network/pipeline (Linux-based Ethernet servers connected to SAN storage): black rendered frames from Shake, corrupted rendered frames, and several other problems. We are trying to figure out which part of the pipeline causes these problems, but that's not easy because they can't be reproduced reliably.

There is one strange thing that I get when rendering with Rush and Shake: when I look in iRush->Frames, I sometimes see retries for some packets. The strange thing is that the log of such a packet looks as if there hasn't been any problem or retry. For example, I have "retry #2 of 5" in the notes of a packet, but the log looks like this:

    ###
    ### lion.700: 0006
    ###
    --------------- Rush 102.42a --------------
    -- Host: scarecrow
    -- Pid: 14415
    -- Title: servertest1
    -- Jobid: lion.717
    -- Frame: 0006
    -- Tries: 0
    -- Owner: aschneid (1054/2001)
    -- RunningAs: aschneid (1054/2001)
    -- Priority: 81
    -- Nice: 10
    -- Tmpdir: /var/tmp/.RUSH_TMP.142
    -- LogFile: /mnt/frozone/projects/servertest/servertest1.shk.log/0006
    -- Command: perl /mnt/libs/rushlib/submit-shake.pl -render /mnt/frozone/projects/servertest/servertest1.shk 5 300 5 AddNever+Requeue 60000 off -v -motion 1.0 1 -cpus 4 -proxyscale Base
    -- Started: Sat Jun 24 04:47:00 2006
    ------------------------------------------
    SHAKEPATH: /mnt/frozone/projects/servertest/servertest1.shk
    RENDERFLAGS: -v -motion 1.0 1 -cpus 4 -proxyscale Base
    BATCHFRAMES: 5 (6-10)
    RETRIES: 5 (AddNever+Requeue after 5 retries)
    MAXLOGSIZE: 60000
    PATH: /usr/nreal/shake/bin:/usr/local/rush/bin:/usr/local/rush/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
    Executing: logtrim -s 60000 -c shake -exec /mnt/frozone/projects/servertest/servertest1.shk -t 6-10 -v -motion 1.0 1 -cpus 4 -proxyscale Base
    info: rendering frame 6
    info: frame 6 rendered in 26.94s
    info: rendering frame 7
    info: frame 7 rendered in 29.38s
    info: rendering frame 8
    info: frame 8 rendered in 27.35s
    info: rendering frame 9
    info: frame 9 rendered in 23.67s
    info: rendering frame 10
    info: frame 10 rendered in 26.74s
    --- SHAKE SUCCEEDS: EXITCODE=0

Any idea why there was a retry? How could I check what forced the retry? Any other general ideas on how to test my pipeline to find the problematic part?

Thanks for any help.

Abraham

--
Abraham Schneider
VFX Compositor
ARRI Film & TV Services GmbH
Tuerkenstr. 89
D-80799 Muenchen
Phone: +49 89 3809-1269
Mobile: +49 173 5719842
Email: aschneider@(email suppressed)
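
One way to cross-check the retry notes against the logs, as asked above, might be to pull the "Tries" field and the Shake exit code out of each frame log and compare them with what iRush reports. The following is only a minimal sketch: it assumes the log layout shown in the post (header fields such as "-- Tries: 0" and a final "EXITCODE=0"), and the function name summarize_frame_log is hypothetical, not part of Rush.

```python
import re

def summarize_frame_log(text):
    """Extract a few header fields and the exit code from one frame log.

    Assumes the layout shown in the post: header fields written as
    '-- Name: value' and a final 'EXITCODE=<n>' marker.
    """
    summary = {}
    for key in ("Host", "Jobid", "Frame", "Tries"):
        match = re.search(r"--\s*%s:\s*(\S+)" % key, text)
        if match:
            summary[key.lower()] = match.group(1)
    if "tries" in summary:
        summary["tries"] = int(summary["tries"])
    match = re.search(r"EXITCODE=(-?\d+)", text)
    if match:
        summary["exitcode"] = int(match.group(1))
    return summary
```

Running this over every file in the job's log directory and listing the frames whose log says "Tries: 0" with exit code 0, while iRush shows a retry note, would at least narrow down which hosts and frames are affected.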
From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: strange retries
Date: Mon, 26 Jun 2006 06:09:56 -0400
Msg# 1318
Hi Abraham,

Check for more than one "rushd" daemon running on the job server for that job (the jobid is "lion.717", so check "lion" for two rushd's).

Under normal circumstances there are sometimes several rushd processes, which are children of the main daemon and are usually gone within 10 seconds or so. The regular parent daemon is the one with a PPID of 1. The child daemons have a PPID (parent PID) equal to the PID of the parent daemon.

Look for a child rushd (one with a PPID greater than 1) that has been around for over a minute (e.g. check the 'START' column of the 'ps aux' report).

The cause of the extra daemon is a bug in the 'rush -ljf <jobid>' command, which left an extra daemon behind if <jobid> didn't exist at the time the 'rush -ljf' command was issued (or if you hit 'Jobs Full' in irush for a job that didn't exist).

This bug was fixed in a recent release (102.42a6) which came out a few weeks ago. The upgrade is free; installing the new 102.42a6 release solves this problem permanently.

For the short term, identifying the rogue child rushd and killing it will stop the retries.

Abraham Schneider wrote:
> Any idea why there was a retry? How could I check what forced the retry?
> Any other general ideas how to test my pipeline to find problematic part?

--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)(new)
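
The check described in this reply (a rushd whose PPID is not 1 and which has been alive for more than a minute) can be sketched as a small script. This is only an illustration under stated assumptions: it reads the output of 'ps -eo pid,ppid,etime,comm' rather than 'ps aux' so that the PPID and elapsed time appear as columns, and the function names and the 60-second threshold are hypothetical choices, not anything shipped with Rush.

```python
import subprocess

def parse_etime(etime):
    """Convert a ps 'etime' value ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:          # pad missing hours with 0
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def find_stale_children(ps_output, name="rushd", min_age=60):
    """Return (pid, ppid, age_seconds) for each process called `name`
    whose parent is not PID 1 and which is older than `min_age` seconds."""
    stale = []
    for line in ps_output.splitlines()[1:]:   # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, etime, comm = fields
        if comm.strip() != name:
            continue
        age = parse_etime(etime)
        if int(ppid) != 1 and age > min_age:
            stale.append((int(pid), int(ppid), age))
    return stale

def scan_this_host():
    """Run ps on the local machine and report suspicious rushd children."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,etime,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_stale_children(out)
```

Calling scan_this_host() on the job server ("lion" in this thread) would list candidate processes; whether to kill one is still a manual decision, since a child that happens to be busy for a little over a minute is not necessarily the rogue daemon.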
From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: strange retries
Date: Mon, 26 Jun 2006 06:12:35 -0400
Msg# 1319
Greg Ercolano wrote:
> This bug was fixed in a recent release (102.42a6) which came out
> a few weeks ago. The upgrade is free; installing the new 102.42a6
> release solves this problem permanently.

BTW, email me directly from your business email address, and I'll send you the free 102.42a6 upgrade info.

--
Greg Ercolano, erco@(email suppressed)
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Cel: (Tel# suppressed)
Fax: (Tel# suppressed)(new)
From: Abraham Schneider <aschneider@(email suppressed)>
Subject: Re: strange retries
Date: Mon, 26 Jun 2006 06:16:54 -0400
Msg# 1320
Thanks for your quick reply. I'm already writing from my business
email address, so it would be really nice if you could send the upgrade
info to this address.

Thanks in advance,
Abraham

Greg Ercolano wrote:
> BTW, email me directly from your business email address, and I'll
> send you the free 102.42a6 upgrade info.

--
Abraham Schneider
VFX Compositor
ARRI Film & TV Services GmbH
Tuerkenstr. 89
D-80799 Muenchen
Phone: +49 89 3809-1269
Mobile: +49 173 5719842
Email: aschneider@(email suppressed)