[Oscar-users] PBS problem - jobs stuck in 'E' state

Discussion:

Bruce Becker

2002-11-07 07:57:03 UTC

Hi OSCAR amigos

We recently had to change our switch, while PBS jobs were running. These
jobs went from state 'R' to state 'E' and have stayed there ever since.
qdel gives the error:

qdel: Request invalid for state of job 390.qgp3.phy.uct.ac.za

Try as I might, I cannot get the nodes to release the jobs and become
available for the queue again. I have tried restarting the PBS processes
(pbs_server and pbs_mom), with no luck. We are using OSCAR-1.4 with
RedHat 7.3 and PBSPro-5.2.2.

Any clues ?

Bruce Becker, PhD student - Department of Physics
University of Cape Town

Room 405, R.W. James Building, UCT
University Avenue North
Private Bag RONDEBOSCH
7700

tel (w) +27 21 650 3356
tel (m) +27 82 537 9425
fax +27 21 650 3342

http://hep.phy.uct.ac.za/~becker

Jeremy Enos

2002-11-07 18:07:06 UTC

Sounds like a PBS config problem... I don't have any idea how PBSPro-5.2.2
is configured though. Do you use epilogue scripts? Does the job have
output which would indicate whether or not it ran? Is PBS built to use ssh
or rsh?
thx-

Jeremy

_______________________________________________
Oscar-users mailing list
https://lists.sourceforge.net/lists/listinfo/oscar-users

Jenn Sturm

2002-11-07 18:28:03 UTC

It's a bug with PBS. To resolve, do the following (substitute
appropriate paths as needed):

# qterm -t quick          <-- shuts down pbs_server but leaves jobs running
# cd $PBS_HOME/server_priv/jobs
# rm jobid.*              <-- substitute the actual IDs of the jobs stuck in E state for "jobid"
# /usr/sbin/pbs_server    <-- restarts the PBS server

Deleting the job files for the jobs that are stuck in E will delete
them from the queue once the pbs_server is restarted.
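
Before removing anything, it may help to look at exactly which files match a given job. The following is a minimal dry-run sketch in shell, not part of PBS: the function name list_job_files is made up for illustration, and it assumes the files in server_priv/jobs start with the job's sequence number followed by a dot (the part of the job ID before the first dot), as the "jobid.*" pattern above suggests. It only prints file names and deletes nothing.

```shell
#!/bin/sh
# Dry-run helper: print the server-side files that belong to one job so they
# can be reviewed before anything is removed. Nothing is deleted here.
# list_job_files is a hypothetical name used only for this sketch.
list_job_files() {
    jobs_dir=$1
    job_id=$2
    seq=${job_id%%.*}    # "390.qgp3.phy.uct.ac.za" -> "390"
    for f in "$jobs_dir/$seq".*; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example (as root, after "qterm -t quick" and before restarting pbs_server):
# list_job_files "$PBS_HOME/server_priv/jobs" 390.qgp3.phy.uct.ac.za
```

If the printed list contains only files for the stuck job, those are the files the rm step above would remove.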

-Jenn Sturm

___________
Jennifer Sturm
System Administrator and Research Support Specialist
Chemistry Department
Hamilton College
198 College Hill Road
Clinton, NY 13323

tel: 315-859-4745
fax: 315-859-4744

***@hamilton.edu

http://www.chem.hamilton.edu/
http://mars.chem.hamilton.edu/

Bill Nitzberg

2002-11-07 23:37:01 UTC

Hi,

Actually, to get rid of a "stuck" job, you can simply:

qdel -Wforce jobid
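
When several jobs are stuck, their IDs can be collected from the qstat listing first. The sketch below is a hedged helper, not part of PBS: the function name stuck_job_ids is made up for illustration, and it assumes the default qstat layout in which the job ID is the first column and the one-letter state is the fifth. It only prints IDs, exactly as qstat shows them, so they can be reviewed before being passed to qdel.

```shell
#!/bin/sh
# Print the IDs of jobs whose state column is "E".
# Reads a qstat-style listing on standard input.
# Assumption: column 1 is the job ID and column 5 is the job state.
# stuck_job_ids is a hypothetical name used only for this sketch.
stuck_job_ids() {
    awk '$5 == "E" { print $1 }'
}

# Example: list the candidates, then force-delete each one after review.
# qstat | stuck_job_ids
# qdel -Wforce jobid
```

Header and separator lines are skipped naturally, because their fifth field is never the single letter E.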

Best regards,

- bill

Bill Nitzberg, PhD
General Manager, Veridian, PBS Products Dept.

VINOD

2002-11-08 05:29:02 UTC

Hi all,
Please let me know whether I can build a Beowulf cluster with 128 diskless nodes plus 1 master node.
Is this possible with OSCAR?
If not, is there another open-source (GPL) community where I should look?

Benoit des Ligneris

2002-11-08 09:52:04 UTC

Hello,

We have a home-made 64-node diskless cluster and a 36-node diskless cluster (running thin-OSCAR).
The next 200 nodes will arrive soon and be up by the end of December.
Our experience with diskless nodes is that you need one server per 64
nodes, especially at boot time (we were not using tftp-ha, so the limit
may be higher than that), or if your users do a lot of NFS file I/O and
parallel computation (MPI, PVM) at the same time on the same network.

The thin-OSCAR workgroup (thin-oscar.sf.net) has been started to address
the specific issue of supporting diskless/systemless compute nodes
within OSCAR. I'm currently packaging it for OSCAR. Should be ready within a
week and/or for SC2002.

Ben

The config file is here (you need to compile a new kernel at this time):
http://mike.si.usherb.ca/~mike/oscar/ramdisk.cfg

The oscar2thin shell script is here:
http://mike.si.usherb.ca/~mike/oscar/oscar2thin

Then run the following, where oscarimage is the name of the image you built with SIS/OSCAR:
./oscar2thin 80 /var/lib/systemimager/images/oscarimage thinimage.img

It will make many of the necessary changes. Some are manual and are
explained once the script finishes.

Do not use it the first time on a production cluster ;-)) If you can
wait, the OSCAR package will be more human-friendly and all will be
automated...

Ben

--
Benoit des Ligneris Etudiant au Doctorat -- Ph. D. Student
Web : http://benoit.des.ligneris.net/
Vice-President du GULUS vice-president http://www.gulus.org/
Mydynaweb Developpe(u)r: http://mydynaweb.net/
GPG/PGP Key http://benoit.des.ligneris.net/linux/gpg.txt

VINOD

2002-11-09 10:48:02 UTC

Dear All,
For my 128-node cluster, I am trying to put the nodes in 5 different subnets. Is this kind of configuration possible for an OSCAR installation? I am especially unsure about the installation process. Please let me know whether it is possible to run the installation on clients spread across multiple switches.
Please advise,

regar

Sean Dague

2002-11-11 01:35:02 UTC

If the DHCP packets from your headnode are forwarded by the routers to your
5 subnets, it will work fine. The clients will boot, get their ip (which
will include their default route) and then they'll start their install
happily.

-Sean

--
_______________________________________________________________________

Sean Dague ***@dague.net http://dague.net

There is no silver bullet. Plus, werewolves make better neighbors than
zombies, and they tend to keep the vampire population down.
_______________________________________________________________________

Jason B.

2002-11-09 12:15:02 UTC

You can add clients in any subnet you wish, as long as they are accessible
to the head node. The only complication is that you would have to install
some clients, then add the next group (from another subnet), and repeat
for each of the 5 subnets.

Jason