Discussion:
[Oscar-users] Jobs not running on reconfigured cluster
Richard Young
2016-05-30 23:25:43 UTC
I was hoping somebody would be able to help me with the following problem.

Recently I applied updates and did some reconfiguration on a RHEL 6.8 cluster running OSCAR. The major change was a new IP address for the oscar_server, which was required because of changes to the network structure. The new IP address has been applied to /etc/hosts, ssh keys, nagios/nrpe, gmetad/gmond, etc. However, I have missed something, because no jobs will now run on the cluster. The jobs simply sit in the queue and are then cancelled when they hit their walltime.

Has anybody come across this problem before, and can you offer some insight into how to fix it?

Thanks

---------------------------------------------------------------------
Richard A. Young
ICT Services
HPC Systems Engineer
University of Southern Queensland
Toowoomba, Queensland 4350
Australia
Email: ***@usq.edu.au Phone: (07) 46315557
Mob: 0437544370 Fax: (07) 46312798
---------------------------------------------------------------------



_____________________________________________________________
This email (including any attached files) is confidential and is for the intended recipient(s) only. If you received this email by mistake, please, as a courtesy, tell the sender, then delete this email.

The views and opinions are the originator's and do not necessarily reflect those of the University of Southern Queensland. Although all reasonable precautions were taken to ensure that this email contained no viruses at the time it was sent we accept no liability for any losses arising from its receipt.

The University of Southern Queensland is a registered provider of education with the Australian Government.
(CRICOS Institution Code QLD 00244B / NSW 02225M, TEQSA PRV12081 )
Kim, DongInn
2016-05-31 02:04:47 UTC
Hi Richard,

I would double-check the following items if I were you.

1. /etc/hosts, ssh keys, nagios/nrpe, gmetad/gmond are all synced across all the nodes.
2. Make sure that the root user can ssh into all the nodes, back and forth, without a password.
3. All the daemons for job submission are running on all the nodes:
(torque-server and torque-mom on the head node, torque-mom on the client nodes, and maui on the head node)
I assume that you are using torque as the RM and maui as the scheduler.

Regards,

--
- DongInn
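
The first check in the list above (that /etc/hosts is identical everywhere) could be scripted along the following lines. This is only a minimal sketch: it assumes the file contents have already been gathered from each node by some other means (for example over ssh), and the function names and node names are hypothetical, not part of OSCAR.

```python
import hashlib


def hosts_digest(text):
    """Hash the meaningful lines of a hosts file.

    Comments and blank lines are dropped and runs of whitespace are
    collapsed, so purely cosmetic differences do not count as a mismatch.
    """
    lines = []
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if fields:
            lines.append(" ".join(fields))
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()


def out_of_sync(reference, per_node):
    """Return the sorted names of nodes whose hosts file differs from the reference.

    `reference` is the head node's file content; `per_node` maps a
    (hypothetical) node name to that node's file content.
    """
    want = hosts_digest(reference)
    return sorted(name for name, text in per_node.items()
                  if hosts_digest(text) != want)
```

Any node that is reported still carries a different mapping, which after an address change usually means it kept the old entry for the head node.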



> _______________________________________________
> Oscar-users mailing list
> Oscar-***@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oscar-users
Richard Young
2016-05-31 04:29:54 UTC
DongInn
I did check these before, but I re-checked as below:
1. /etc/hosts is the same across the cluster.
2. I can ssh to a node and back without any problems or password. The known_hosts file has been updated and copied across the cluster.
3. I checked nagios/nrpe and it is set up to allow the admin node to collect details.
4. ganglia/gmond is set up to talk to the admin node.
5. pbs_server and maui on the admin node have been restarted with no reported errors in the log files.
6. pbs_mom on the nodes has been restarted with no reported errors in the log files.
7. A search through /etc and /var/lib/torque for the old IP address of the server doesn't find anything other than old log entries.
8. /etc/dhcp/dhcpd.conf has been updated.
9. /etc/ntp.conf has been updated across the cluster.

Thanks
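
The search described in item 7 could be repeated in a scripted form roughly like this. It is a sketch under stated assumptions: the function name is hypothetical, the log directory names skipped by default (`server_logs`, `mom_logs`) are assumed to match the local TORQUE layout, and the match is a plain substring test, so an address such as 10.0.0.1 also matches 10.0.0.10. TORQUE usually refers to the server by name (for example in `server_name` and in the `$pbsserver` line of `mom_priv/config`), so it may be worth running the same search for the host name as well as the address.

```python
import os


def find_references(roots, needle, skip_dirs=("server_logs", "mom_logs")):
    """Yield paths of regular files under `roots` whose bytes contain `needle`.

    Files are read as bytes so that binary state files are searched too.
    Directories named in `skip_dirs` are pruned to avoid old log entries.
    """
    target = needle.encode()
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune skipped directories in place so os.walk does not descend.
            dirnames[:] = [d for d in dirnames if d not in skip_dirs]
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                try:
                    with open(path, "rb") as fh:
                        if target in fh.read():
                            yield path
                except OSError:
                    # Unreadable file: skip it rather than abort the scan.
                    continue
```

For example, `list(find_references(["/etc", "/var/lib/torque"], old_ip))`, where `old_ip` is the previous address of the server, lists every non-log file that still mentions it.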

LAHAYE Olivier
2016-05-31 08:11:43 UTC
Hi Richard,

- Can you see the ganglia web interface?
- Are you using DNS for your cluster?
- Are the firewalld / iptables services stopped?
- Has the nscd cache been reset?
- Is munge running?
- I'm not using torque/maui anymore, so I can't check on my side whether there is some specific configuration to look at...
- Were the torque / maui packages updated during the process?

Olivier.
--
Olivier LAHAYE
CEA DRT/LIST/DIR
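
The DNS question above can be turned into a small forward and reverse lookup test, run on the head node and on a compute node. The sketch below is an illustration only: `check_name` is a hypothetical helper, the lookups go through whatever the system resolver is configured to use (/etc/hosts, DNS, and so on), and the comparison is exact, so a short name and a fully qualified name are treated as different unless one appears among the aliases.

```python
import socket


def check_name(name, expected_ip,
               forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a list of problems found resolving `name` to `expected_ip` and back.

    `forward` and `reverse` default to the standard resolver calls and can
    be replaced with stand-ins for testing.
    """
    problems = []
    try:
        ip = forward(name)
    except OSError as exc:
        return ["%s: forward lookup failed (%s)" % (name, exc)]
    if ip != expected_ip:
        problems.append("%s: resolves to %s, expected %s" % (name, ip, expected_ip))
    try:
        canonical, aliases, _addresses = reverse(expected_ip)
    except OSError as exc:
        problems.append("%s: reverse lookup failed (%s)" % (expected_ip, exc))
    else:
        if name not in [canonical] + list(aliases):
            problems.append("%s: reverse lookup gives %s, not %s"
                            % (expected_ip, canonical, name))
    return problems
```

An empty list means the name and the new address agree in both directions on that machine; a stale answer on only some nodes would point at a cache or an unsynced file.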

Richard Young
2016-06-01 00:40:43 UTC
Lahaye
- No, I can't see the ganglia web interface on either the public or private interfaces; it says "you have no permission".
- The admin node is set up as a forwarding DNS server and lookups seem to work correctly.
- The firewall/iptables services have been stopped, with only an iptables rule set from the command line to forward and NAT traffic.
- The nscd cache has been turned off.
- munge is running.
- The torque/maui packages did get updated; the configurations have been checked to make certain they are the same as before the update.

Thanks

Kim, DongInn
2016-06-01 02:50:56 UTC
Hi Richard,

I think that this is a torque+maui configuration issue on your cluster.

Can you please make sure that your torque and maui configurations are set up properly?
I hope that you can find the torque and maui admin manuals on Google.

One thing that I would like to try is to see what log messages are generated on the server and client sides when a new job is submitted.
That should give many hints about your problem.

Regards,

--
- DongInn
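
To follow the suggestion above, the server-side and node-side logs can be filtered for the id of a freshly submitted job. The following is a rough sketch, not a TORQUE tool: the function names are hypothetical, the log file paths are whatever the local installation uses (commonly the `server_logs` and `mom_logs` directories under the TORQUE home), and no particular log line format is assumed. TORQUE also ships a `tracejob` utility that serves a similar purpose.

```python
import re


def lines_for_job(log_lines, job_id):
    """Return (line_number, text) pairs for the lines that mention `job_id`.

    A leading digit is excluded so that job 42 does not also match job 142.
    """
    pattern = re.compile(r"(?<!\d)" + re.escape(job_id))
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(log_lines, 1)
            if pattern.search(line)]


def collect(paths, job_id):
    """Map each log file in `paths` to its matching lines, omitting files with none."""
    found = {}
    for path in paths:
        with open(path, errors="replace") as fh:
            hits = lines_for_job(fh, job_id)
        if hits:
            found[path] = hits
    return found
```

If the job id shows up in the server log but never in any node's log, the job is not being dispatched to the nodes, which narrows the search to the scheduler and to the server's view of the nodes.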


