From: "Abraham Schneider" <aschneider@(email suppressed)>
Subject: several strange Rush behaviours
   Date: Thu, 19 Jul 2012 04:51:07 -0400

Msg# 2257

Hi there!

I wanted to ask about two strange behaviours that occur from time to time on our Rush render farm and for which I have no plausible explanation. Our farm is a mix of Mac and Linux machines with Rush 102.42a9c/d installed, rendering Nuke 6.3v8.

1. problem:
At irregular, seemingly random intervals we get a situation like this:

Done  rind10.12202  N_NORM_019_060_comp_v04_as    aschneid    %100  %0   0  16:52:52
Done  rind10.12204  N_NORM_002_010_comp_v16_dl    dlaubsch    %100  %0   0  16:32:02
Done  rind10.12206  N_LOW_048_060_redLog_v00a_mw  mwarlimont  %100  %0   0  16:28:48
Done  rind10.12213  N_NORM_001_050_comp_v38_ts    tstern      %100  %0   0  16:06:51
Done  rind10.12215  N_LOW_048_080_redLog_v00a_mw  mwarlimont  %100  %0   0  16:05:03
Done  rind10.12218  N_NORM_019_060_comp_v04_as    aschneid    %100  %0   0  15:08:02
Fail  rind10.12221  N_NORM_001_010_comp_v01_st    stischne     %99  %1   0  00:58:31
Done  rind10.12223  N_NORM_001_010_comp_v01_st    stischne    %100  %0   0  00:55:04
Run   rind10.12225  N_NORM_001_050_comp_v38_ts    tstern       %81  %0   2  00:41:35
Run   rind10.12226  N_NORM_055_010_comp_v01_mt    mwarlimont    %0  %0   0  00:36:58
Run   rind10.12227  N_NORM_100_020_comp_v21_mt    mwarlimont    %0  %0   0  00:34:11
Done  rind10.12228  N_NORM_001_010_comp_v01_st    stischne    %100  %0   0  00:23:08
Run   rind10.12230  N_NORM_103_010_comp_v14_mt    mwarlimont   %36  %0  17  00:19:49
Run   rind10.12231  N_NORM_022_cfd0046_comp_v102  ppoetsch      %6  %0   0  00:19:37

Rush is configured to work "first in, first out". All of these jobs were submitted from inside Nuke via a slightly modified submit_nuke.pl script, all with the same priority '+nuke=42@500'; as far as I can see, there was no difference in how they were submitted. Nothing changed on the farm either: no machines were added or removed, switched online or offline, etc.

Most of the time everything works just fine, and the jobs are rendered one after the other in order of submission time/job ID. But sometimes something like the above happens: job 12225 starts rendering on all online machines, but halfway through the render it either stops or its number of CPUs drops significantly, and all the other machines continue rendering a much newer job (in this case job 12228), skipping the unfinished frames of job 12225 as well as the jobs submitted next, 12226 and 12227. Job 12228 was rendered completely, and instead of returning to 12225/12226/12227, most of the machines (except for one machine with two CPUs, which kept rendering 12225) continued with 12230. I tried pausing 12230 while most of the machines were rendering it. The result was that the machines continued with 12231.

Is there any reason why Rush stops following 'first in, first out' at random from time to time, and is there a solution? This problem is hard to debug because I haven't found a way to reproduce the behaviour.
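
To make the symptom easier to spot when it happens, here is a minimal Python sketch that scans a job listing like the one above and reports every older running job that has fewer busy CPUs than a newer running job. It is only an illustration, not part of Rush: the function names are hypothetical, and it assumes the eight columns are status, job ID, title, owner, percent done, percent failed, busy CPUs and elapsed time, which is my reading of the listing.

```python
from typing import List, NamedTuple, Tuple


class Job(NamedTuple):
    status: str
    jobid: str
    title: str
    owner: str
    done_pct: int
    fail_pct: int
    busy: int       # assumed: number of CPUs currently busy on this job
    elapsed: str


def parse_jobs(listing: str) -> List[Job]:
    """Parse whitespace-separated job lines; lines without 8 fields are skipped."""
    jobs = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue
        status, jobid, title, owner, done, fail, busy, elapsed = fields
        jobs.append(Job(status, jobid, title, owner,
                        int(done.lstrip("%")), int(fail.lstrip("%")),
                        int(busy), elapsed))
    return jobs


def job_number(job: Job) -> int:
    """Numeric part of a job ID such as 'rind10.12225' (assumed to grow with submit order)."""
    return int(job.jobid.rsplit(".", 1)[1])


def fifo_violations(jobs: List[Job]) -> List[Tuple[str, str]]:
    """Return (older, newer) job ID pairs where the newer running job has more busy CPUs."""
    running = sorted((j for j in jobs if j.status == "Run"), key=job_number)
    violations = []
    for i, older in enumerate(running):
        for newer in running[i + 1:]:
            if newer.busy > older.busy:
                violations.append((older.jobid, newer.jobid))
    return violations


# The running jobs from the listing above, plus one finished job.
LISTING = """\
Run rind10.12225 N_NORM_001_050_comp_v38_ts tstern %81 %0 2 00:41:35
Run rind10.12226 N_NORM_055_010_comp_v01_mt mwarlimont %0 %0 0 00:36:58
Run rind10.12227 N_NORM_100_020_comp_v21_mt mwarlimont %0 %0 0 00:34:11
Done rind10.12228 N_NORM_001_010_comp_v01_st stischne %100 %0 0 00:23:08
Run rind10.12230 N_NORM_103_010_comp_v14_mt mwarlimont %36 %0 17 00:19:49
Run rind10.12231 N_NORM_022_cfd0046_comp_v102 ppoetsch %6 %0 0 00:19:37
"""

if __name__ == "__main__":
    for older, newer in fifo_violations(parse_jobs(LISTING)):
        print(f"{older} is waiting while newer job {newer} has more busy CPUs")
```

For the listing above, this flags 12225, 12226 and 12227 as each being overtaken by 12230. Run periodically against the live job list, something like this could at least record when the order breaks, even if it does not explain why.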

2. problem:
Most of the time, when a machine/workstation is switched from offline to online, it takes anywhere from many seconds to several minutes before it picks up a frame and starts rendering. The machine is shown as 'online' instantly, but it just won't start rendering a frame; it is listed as 'online' and 'idle' for several minutes. This happens on all of our machines, regardless of whether they are Macs or Linux.

Any explanation for that?

Thanks, Abraham

Abraham Schneider
Senior VFX Compositor

ARRI Film & TV Services GmbH
Tuerkenstr. 89
D-80799 Muenchen / Germany

Phone (Tel# suppressed)

EMail aschneider@(email suppressed)
www.arri.de/filmtv
________________________________

ARRI Film & TV Services GmbH
Registered office: München; Register court: Amtsgericht München
Commercial register number: HRB 69396
Managing directors: Franz Kraus, Dr. Martin Prillmann, Josef Reidinger
