On 07/19/12 01:51, Abraham Schneider wrote:
> Done rind10.12202 N_NORM_019_060_comp_v04_as aschneid %100 %0 0 16:52:52
> Done rind10.12204 N_NORM_002_010_comp_v16_dl dlaubsch %100 %0 0 16:32:02
> Done rind10.12206 N_LOW_048_060_redLog_v00a_mw mwarlimont %100 %0 0 16:28:48
> Done rind10.12213 N_NORM_001_050_comp_v38_ts tstern %100 %0 0 16:06:51
> Done rind10.12215 N_LOW_048_080_redLog_v00a_mw mwarlimont %100 %0 0 16:05:03
> Done rind10.12218 N_NORM_019_060_comp_v04_as aschneid %100 %0 0 15:08:02
> Fail rind10.12221 N_NORM_001_010_comp_v01_st stischne %99 %1 0 00:58:31
> Done rind10.12223 N_NORM_001_010_comp_v01_st stischne %100 %0 0 00:55:04
> Run rind10.12225 N_NORM_001_050_comp_v38_ts tstern %81 %0 2 00:41:35
> Run rind10.12226 N_NORM_055_010_comp_v01_mt mwarlimont %0 %0 0 00:36:58
> Run rind10.12227 N_NORM_100_020_comp_v21_mt mwarlimont %0 %0 0 00:34:11
> Done rind10.12228 N_NORM_001_010_comp_v01_st stischne %100 %0 0 00:23:08
> Run rind10.12230 N_NORM_103_010_comp_v14_mt mwarlimont %36 %0 17 00:19:49
> Run rind10.12231 N_NORM_022_cfd0046_comp_v102 ppoetsch %6 %0 0 00:19:37
>
> ..sometimes something like the above happens: job 12225 starts rendering
> on all online machines. But halfway through the render it just stops,
> or the number of CPUs drops significantly, and all the other machines
> continue rendering a much newer job.
Hi Abraham,
Wow, you have large jobids! You must have the jobidmax value
cranked up. Be careful with that (see below).
Can I see the reports for the rind10.12225/6/7 jobs? eg:
rush -lf rind10.12225 rind10.12226 rind10.12227
rush -lc rind10.12225 rind10.12226 rind10.12227
The '26 and '27 jobs appear to be getting completely skipped over;
they stand out to me.
All the rest seem like they could be OK; I need to see the reports
to know more.
The 12225 job doesn't worry me too much, as it's 81% done with
2 busy frames, so if those are the last two frames in the job,
that would make sense. But if there are still available frames
in the Que state with a TRY count of zero, that would be puzzling.
As you probably know, if a job is rendering its last few frames,
newly idle cpus will go to the next jobs down. If someone requeues
all the frames in one of the higher up jobs, then that could bring
them back down to 0% done, and they'd have to wait for available procs.
I'll be able to tell from the 'Frames' report; the TRY column will show
if a frame has already been run before.
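If it helps, here's a quick way to pull already-tried frames out of a saved
report. The data below is a made-up stand-in for a real 'Frames' report, and
treating TRY as the 4th field is my assumption; match the awk field number to
the actual TRY column in your output:

```shell
# Stand-in for a 'Frames' report; fields here are frame, state, host, TRY.
# This column layout is an assumption -- check your real report's header.
report="0001 Que host1 0
0002 Que host2 3
0003 Done host1 1"

# Print only frames that have been run at least once (TRY > 0)
echo "$report" | awk '$4 > 0 { print $1, "tries:", $4 }'
```

Frames showing a nonzero TRY while sitting in the Que state would confirm
the requeue scenario described above.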
> 2. problem:
> Most of the time, switching a machine/workstation from offline to
> online, it takes from many seconds to several minutes for this machine
> to pick up a frame and start rendering. The machine is shown as 'online'
> instantly, but it just won't start rendering a frame. It's listed as
> 'online' and 'idle' for several minutes. This happens for all of our
> machines, doesn't matter if they are Macs or Linux.
Can you send me the tasklist for the machine in question, ie:
rush -tasklist SLOWHOST
..I want to see how large that report is. If it's really large,
that might be the reason.
That report will show the list of jobs it is considering
to give the idle cpus in the order it wants to check.
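A rough way to gauge that report's size is a simple line count; the sample
below is invented stand-in data for real 'rush -tasklist' output:

```shell
# Three made-up lines standing in for 'rush -tasklist SLOWHOST' output.
sample="rind10.12225 tstern
rind10.12226 mwarlimont
rind10.12227 mwarlimont"

# Count the entries; on a real host you'd pipe 'rush -tasklist SLOWHOST'
# into wc -l instead of this sample.
echo "$sample" | wc -l
```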
One situation might be if there's a bunch of jobs at the
top of its list that are being managed by a machine that
is currently down. In that case rush will try to contact
that machine to get the job started, and will keep trying
until a timeout of about a minute or so, then it will give up
and move to the next jobs in the list that are not on that
unresponsive machine.
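Back-of-envelope, those timeouts compound: each unresponsive job server near
the top of the list can cost roughly a minute before rushd moves on. The
numbers below are illustrative assumptions, not measured values:

```shell
# Illustrative only: 3 downed job servers at ~60s timeout each.
down_servers=3
timeout_secs=60
echo "worst-case startup delay: $(( down_servers * timeout_secs )) seconds"
# -> worst-case startup delay: 180 seconds
```

That would be consistent with the "several minutes of idle" symptom you
describe.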
Another possibility is if machines reboot to new IP addresses
(eg. DHCP assigned machines), that might cause rushd to not
be able to reach job servers to establish jobs, causing the
above situation.
It might be good if you send me the rushd.log from machines
that act this way; I might be able to tell from that if
there's a problem.
> Any explanation for that?
Those large jobids might be the culprit; I'm not sure.
When you have jobidmax set high, this can mean thousands of jobs can remain
in the queue, causing the system to work extra hard to find jobs that are
available.
The large max should be OK /as long as/ the 'Jobs' reports are kept trim,
ie. dump old jobs. You don't want to leave old jobs in the queue; they take
up memory and make the daemon work harder internally, since it must keep
considering those jobs in case they've been requeued.
And if you have several job servers each with very large queues, that
would exacerbate the problem.
The reason rush comes with 999 as the max for jobids is to force
folks to dump old jobs so that the queue doesn't get artificially large.
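As a sketch of the kind of housekeeping I mean: the '-dump' flag and the
jobids below are assumptions for illustration (check 'rush -help' on your
install for the actual job-removal option), and the loop echoes the commands
as a dry run rather than executing them:

```shell
# Dry run: print the cleanup commands instead of executing them.
# 'rush -dump' and these jobids are assumptions for illustration only.
for jobid in rind10.11950 rind10.11973 rind10.12002; do
    echo rush -dump "$jobid"
done
```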
--
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed) ext. 23
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)