From: Craig Allison <craig@(email suppressed)>
Subject: Job Done Command Not Working On Some Boxes
   Date: Mon, 13 Jul 2009 12:17:23 -0400
Msg# 1865
Hello there

I'm having a problem with certain boxes not running the JobDone command on completion of a render.  I'm not getting any logs from the faulty boxes to even suggest the command is being initiated.

Most of my boxes execute the perl script without a problem, but a few don't, and I can't see any difference between the ones that do and the ones that don't.

Running Rush 102.42a9 on Leopard and Tiger, running a perl script on JobDone.

perl /mnt/vfxxserve3/Scripts/Senate_Donemail.pl

Any ideas?

Cheers

Craig


Craig Allison
Digital Systems & Data Manager
The Senate Visual Effects
Twickenham Film Studios
St.Margarets
Twickenham
TW1 2AW

t: 0208 607 8866
skype: craig_9000
www.senatevfx.com

   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Job Done Command Not Working On Some Boxes
   Date: Mon, 13 Jul 2009 15:19:35 -0400
Msg# 1866
Craig Allison wrote:
> I'm having a problem with certain boxes not running the JobDone command
> on completion of a render.  I'm not getting any logs from the faulty boxes
> to even suggest the command is being initiated.

Hi Craig,

	I'd suggest looking first in the rushd.log on the machine acting
	as the job server for the job in question, i.e. log in to the machine
	whose hostname is in the jobid, and look at the rush/var/rushd.log
	file for errors about the jobdonecommand.
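
A minimal sketch of that check, in Python. It assumes the log lives at
/usr/local/rush/var/rushd.log (inferred from the rush/var/rushd.log and
/usr/local/rush/etc/rush.conf paths mentioned in this thread; adjust for
your install) and that a failure is logged on a single line containing both
'jobdonecommand' and 'jobid=<id>'. The function names are hypothetical
helpers for illustration, not part of Rush.

```python
import re
from pathlib import Path
from typing import List

# Assumed install location; adjust for your site.
RUSHD_LOG = Path("/usr/local/rush/var/rushd.log")


def jobserver_from_jobid(jobid: str) -> str:
    """Return the hostname part of a jobid such as 'senatemac89.326'."""
    host, sep, _number = jobid.rpartition(".")
    if not sep or not host:
        raise ValueError(f"unexpected jobid format: {jobid!r}")
    return host


def jobdonecommand_lines(log_text: str, jobid: str) -> List[str]:
    """Return log lines mentioning 'jobdonecommand' for exactly this jobid."""
    # (?!\d) keeps 'leopard.2' from also matching 'leopard.21'.
    pattern = re.compile(r"jobid=" + re.escape(jobid) + r"(?!\d)")
    return [
        line
        for line in log_text.splitlines()
        if "jobdonecommand" in line and pattern.search(line)
    ]


if __name__ == "__main__":
    jobid = "senatemac89.326"
    print("Log in to job server:", jobserver_from_jobid(jobid))
    # Run this on the job server itself, where the log is expected to be.
    if RUSHD_LOG.exists():
        text = RUSHD_LOG.read_text(errors="replace")
        for line in jobdonecommand_lines(text, jobid):
            print(line)
```

Run on the job server, this prints only the lines relevant to the one job,
which is usually easier than reading the whole day's log by eye.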

> Running Rush 102.42a9 on Leopard and Tiger, running a perl script on
> JobDone.
> 
> perl /mnt/vfxxserve3/Scripts/Senate_Donemail.pl
> 
> Any ideas?

	Try sending me via private email:

		1) the 'rush -ljf' info for this job
		2) the contents of your rushd.log

	..Be sure the job got 'done' since midnight, so that any errors
	would likely be in "today's" rushd.log.

	I'll follow up to the group on what we determine is the problem.

	Certainly you should be seeing a 'jobdonecommand.log' in the job's
	log directory. The only reasons you wouldn't are: a) logs are disabled,
	b) the -nolog flag was specified for the jobdonecommand, c) the log
	directory was unwritable for some reason by the job server, or d) a bug
	I don't know about, as 102.42a9 is very recent.

-- 
Greg Ercolano, erco@(email suppressed)
Seriss Corporation
Rush Render Queue, http://seriss.com/rush/
Tel: (Tel# suppressed)
Fax: (Tel# suppressed)
Cel: (Tel# suppressed)

   From: Craig Allison <craig@(email suppressed)>
Subject: Re: Job Done Command Not Working On Some Boxes
   Date: Mon, 13 Jul 2009 16:22:52 -0400
Msg# 1867
Hey Greg

I'm working late tonight so we're kind of on the same time for now : )

Checked the log on the host machine, and for the failed jobs I'm getting the following:

jobdonecommand 'perl /mnt/vfxxserve3/Scripts/Senate_DoneMail.pl' failed for jobid=senatemac89.326: RUSH: uid '0' outside configured range 100-65000

I've got forceuid/gid set to 501 on the user's machine in rush.conf. Where is it picking up uid 0 from?


Cheers


Craig

   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Job Done Command Not Working On Some Boxes
   Date: Mon, 13 Jul 2009 16:31:44 -0400
Msg# 1868
Craig Allison wrote:
> Hey Greg
> 
> I'm working late tonight so we're kind of on the same time for now : )
> 
> Checked the log on the host machine, and for the failed jobs I'm getting
> the following:
> 
> jobdonecommand 'perl /mnt/vfxxserve3/Scripts/Senate_DoneMail.pl' failed
> for jobid=senatemac89.326: RUSH: uid '0' outside configured range 100-65000
> 
> I've got forceuid/gid set to 501 on the user's machine in rush.conf.
> Where is it picking up uid 0 from?

Hi Craig,

	Hmm.. apparently it thinks it should be trying to run the
	command as root on that machine (senatemac89).

	From that machine, can you send me (via private email):

		a) The contents of: /usr/local/rush/etc/rush.conf
		b) The output of: rush -ljf senatemac89.326

	It sounds like maybe there's a forceuid of zero on that machine,
	or the job's owner is root, or there's a bug I need to look into..

   From: Greg Ercolano <erco@(email suppressed)>
Subject: Re: Job Done Command Not Working On Some Boxes
   Date: Fri, 17 Jul 2009 23:20:37 -0400
Msg# 1870
Newsgroup follow up.

OK, after working with Craig earlier this week, I was able to determine
the exact combo of events it took to replicate this.

Turned out these specifics were needed to make this problem show up:

    o Rush version 102.42a9 on both machines
    o Submit from a Tiger workstation with "Submit Host:" pointing at a Leopard jobserver
    o Submitting user has no account on the jobserver, just their workstation
    o Rush configured to force all renders to run as "render" user
    o "render" user has a valid account on all machines (workstation and jobserver)
    o Job submitted with a jobdonecommand

With this combo, the renders run ok, but the /jobdonecommand/ fails to run on
the leopard machine with this series of errors in the log:

>07/17,20:05:03 SECURITY   RUSH: uid '0' outside configured range 100-65000
>07/17,20:05:03 INFO       jobdonecommand 'ls' failed for jobid=leopard.2: RUSH: uid '0' outside configured range 100-65000
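
The SECURITY line above carries three numbers: the uid the command was about
to run as, and the low and high ends of the configured uid range. A minimal
Python sketch for pulling them out of a log line follows. The regex is based
only on the two log lines shown in this thread, treating the range bounds as
inclusive is an assumption, and the function names are hypothetical, not
part of Rush.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the log lines quoted in this thread (an assumption
# about the general message format, not a documented Rush log grammar).
UID_RANGE_RE = re.compile(r"uid '(\d+)' outside configured range (\d+)-(\d+)")


def parse_uid_range_error(line: str) -> Optional[Tuple[int, int, int]]:
    """Return (uid, low, high) from a uid-range error line, or None."""
    match = UID_RANGE_RE.search(line)
    if match is None:
        return None
    uid, low, high = (int(group) for group in match.groups())
    return uid, low, high


def uid_in_range(uid: int, low: int, high: int) -> bool:
    """Check a uid against the range, assuming both bounds are inclusive."""
    return low <= uid <= high


if __name__ == "__main__":
    line = ("07/17,20:05:03 SECURITY   RUSH: uid '0' outside "
            "configured range 100-65000")
    parsed = parse_uid_range_error(line)
    if parsed is not None:
        uid, low, high = parsed
        print(f"uid={uid} range={low}-{high} allowed={uid_in_range(uid, low, high)}")
```

A uid of 0 (root) fails the check, while the forced "render" uid would pass
if it sits inside the range, which is consistent with the renders running
fine and only the jobdonecommand being rejected.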

Now that I've got the problem canned, and replicated here at my office,
I should have it solved soon -- pretty sure now it's a bug in Rush.