=== Hardware ===

Some information about the old cluster can be found [http://eagle.phys.gwu.edu/~fxlee/cluster/ here].

The cluster has been updated: it now has 16 nodes, each with 2 GPUs, for a total of 32 GPUs (GTX 480s). These cards have 1.5 GB of memory each and deliver, for dslash, about 50 GFlops/s in double precision and 125 GFlops/s in single precision (see Ben's [[Performance]] benchmarks).

Each node has 12 GB of memory (6 x 2 GB PC3-10600 ECC unbuffered DDR3 1333 MHz) and two quad-core Intel Xeon E5620 processors running at 2.4 GHz with 12 MB of cache. The chassis is a Supermicro product, model number 1026GT-TF-FM207, and the motherboard is an X8DTG-DF, based on the Intel 5520 (Tylersburg) chipset. Each node has a 320 GB HDD.

The interconnects are still 4x DDR InfiniBand, which provides 5 Gb/s x 4 lanes = 20 Gb/s (signaling rate); due to 8b/10b encoding this translates to 16 Gb/s = 2 GB/s of data in each direction (for more InfiniBand details, consult the [http://en.wikipedia.org/wiki/Infiniband Wikipedia page]).

=== Configuration ===

If you plan to use openmpi or mvapich2 to run your parallel jobs, don't forget to load the proper modules in your .cshrc file (for example, for openmpi you need '''module load openmpi/gnu'''). Furthermore, if you plan to use cuda v3.1, load the cuda31 module (we also have cuda v3.2 installed, but our codes are not yet compatible with it).
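
A minimal .cshrc fragment might look like this (a sketch; the module names are the ones mentioned above, so adjust them to the versions your code actually needs):

 # in ~/.cshrc: make the MPI and CUDA environments available in every shell
 module load openmpi/gnu
 module load cuda31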

The scheduling system on the cluster is Sun Grid Engine v6.2u5 ([http://gridengine.sunsource.net/ sge]). For InfiniBand we have installed OFED, openmpi (v1.5), and mvapich2 (v1.5.1). We prefer openmpi since it provides tight integration with the scheduler (the MPI processes are killed when you delete a job from the scheduler). A simple job script is given below:

 <nowiki>
 #$ -S /bin/csh
 #$ -cwd
 #$ -j y
 #$ -o job_output.log$JOB_ID
 ##$ -q all.q
 #$ -q gpu.q
 #$ -l gpu_count=1
 #$ -pe openmpi 8
 #$ -l h_rt=01:30:00
 
 mpirun -n $NSLOTS --mca btl_openib_flags 1 test_dslash_multi_gpu < check.in >& output_np${NSLOTS}.log
 </nowiki>

Note that if your job uses GPUs you should request '''-l gpu_count=1''' (one GPU per process), and you should also include the '''--mca btl_openib_flags 1''' flag to make sure that openmpi and cuda work together. If you only run CPU codes, the flag is not necessary. Note that the new nodes are in the gpu.q queue: use gpu.q if you want the new nodes and all.q if you want the old ones.

'''CUDA 4.0 note:''' For cuda 4.0 and later we can set the environment variable ''CUDA_NIC_INTEROP=1'' to force cudaMallocHost to use an alternate code path that doesn't conflict with openmpi. When this variable is set, the ''--mca btl_openib_flags 1'' flag is no longer needed.
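
For example, if the script above is saved as run_dslash.csh (the file name is only illustrative), it is submitted with qsub; with cuda 4.0 or later you can set the interop variable inside the csh script instead of passing the openib flag:

 # submit the job to the scheduler
 qsub run_dslash.csh
 
 # inside the job script, for cuda >= 4.0, instead of --mca btl_openib_flags 1:
 setenv CUDA_NIC_INTEROP 1
 mpirun -n $NSLOTS test_dslash_multi_gpu < check.in >& output_np${NSLOTS}.log
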
=== Queue admin ===

==== Disable/enable nodes ====

If you want to reboot a node while there are pending jobs in the queue, you should first disable it in the scheduler:

 qmod -d \*@gpu05

This disables node gpu05 in all queues. Note that we need to escape the * so that the shell doesn't expand it. Once the node is rebooted, re-enable it with

 qmod -e \*@gpu05
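
You can verify the state with '''qstat -f''', which lists every queue instance; a disabled instance shows a ''d'' in its states column. A sketch (host name and numbers are illustrative):

 % qstat -f
 queuename                      qtype resv/used/tot. load_avg arch          states
 ---------------------------------------------------------------------------------
 gpu.q@gpu05                    BIP   0/0/8          0.02     lx24-amd64    d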

==== Setting up parallel environments ====

To add/remove/modify a parallel environment (pe), use qconf. For example, '''qconf -spl''' shows all available pe's and '''qconf -sp openmpi''' shows the configuration of the openmpi pe. To modify it, use '''qconf -mp openmpi'''. The important parameters to configure are '''start_proc_args''' and '''stop_proc_args''', which let you run a pre/post-execution setup, and '''allocation_rule''', which can be either '''<nowiki>$</nowiki>round_robin''' or '''<nowiki>$</nowiki>fill_up'''.
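
For illustration, a tightly integrated openmpi pe might look roughly like this (the values are illustrative, not necessarily our actual settings); '''control_slaves TRUE''' is what allows the scheduler to track and kill the MPI processes:

 % qconf -sp openmpi
 pe_name            openmpi
 slots              128
 user_lists         NONE
 xuser_lists        NONE
 start_proc_args    /bin/true
 stop_proc_args     /bin/true
 allocation_rule    $fill_up
 control_slaves     TRUE
 job_is_first_task  FALSE
 urgency_slots      min
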
=== Cluster administration ===

The new nodes on the cluster have support for the Intelligent Platform Management Interface ([http://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface ipmi]). We can use it to restart and monitor the nodes remotely, even when we can no longer log in to them. The ipmi service runs on all nodes and is configured to listen for packets on the 192.168.2.1xx network, where xx corresponds to node gpuxx. To configure node gpu01, the commands I had to issue were the following:

 ipmitool lan set 1 ipsrc static
 ipmitool lan set 1 ipaddr 192.168.2.101

To check the configuration, run '''ipmitool lan print 1'''. Similar commands were issued for the other nodes. With this configuration we can monitor each individual node without logging in to it. For example, to get sensor information from node gpu01 we would type the following on the head node:

 ipmitool -H 192.168.2.101 -I lanplus -U user_name -P password sensor
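
The same interface can also power-cycle a hung node remotely, for example (same hypothetical credentials as above):

 ipmitool -H 192.168.2.101 -I lanplus -U user_name -P password chassis power status
 ipmitool -H 192.168.2.101 -I lanplus -U user_name -P password chassis power cycle
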
==== System Administration Manual ====

The system administration manual from Advanced Clustering can be found [[Media:Apex_Cluster_Manual-2.2.pdf | here]]. Note that on this cluster the act_* tools are called beo_*.

==== Adding a user ====

To add a user, you need to do the following steps (an example session is sketched after the list):

* Use sudo /usr/sbin/adduser to add the account on the head node.
* Use sudo passwd -l username to disable password logins for the account.
* Add an ssh key to the user's authorized_keys file to allow them to log in to the machine.
* Propagate the change to the compute nodes using sudo /act/bin/beo_authsync -g gpus (if you only want to give access to the gpu nodes).
* Add the user to the gpuusers access list using qconf -mu gpuusers.
* Make sure to change the permissions on the .ssh/authorized_keys file so that it is not writable by anyone but the user.
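
A sketch of the whole procedure for a hypothetical user ''jdoe'' (the user name and key file name are made up; the commands roughly follow the steps above):

 sudo /usr/sbin/adduser jdoe
 sudo passwd -l jdoe
 sudo mkdir -p /home/jdoe/.ssh
 sudo cp jdoe_id_rsa.pub /home/jdoe/.ssh/authorized_keys
 sudo chown -R jdoe:jdoe /home/jdoe/.ssh
 sudo chmod 700 /home/jdoe/.ssh
 sudo chmod 600 /home/jdoe/.ssh/authorized_keys
 sudo /act/bin/beo_authsync -g gpus
 qconf -mu gpuusers
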
=== Write buffering issues ===

At the moment there are some thermal issues related to write buffering. The RAID card has a write buffer which greatly improves performance; it has a battery backup that prevents data loss in the event of a power failure. The card is programmed to fail safe and shut down the write buffer if there is anything wrong with the battery. In practice, this is what is going on when it takes an inordinately long time to delete (in particular) a file.

Recently the battery has been overheating. We're still trying to figure out what is going on and how to fix it, but for now you can monitor the status of the battery.

As an ordinary user, you can run

 % dmesg | grep AEN

to check the kernel log for messages from the RAID controller. Unfortunately, those are not timestamped. The controller only reports three temperature states: Normal, High, and Too High (at which point it will shut down the battery and the write buffering).

These messages are also written to /var/log/messages (and the files rotated with it: messages.1, etc.), where they are timestamped, but you will need to be root to examine them. If you have root, you will find lines like:

 May 15 00:13:34 qcd kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x004B): Battery temperature is high:.
 May 15 00:13:34 qcd kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0008): Unclean shutdown detected:unit=0.
 May 15 00:13:41 qcd kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
 May 15 00:13:41 qcd kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x004D): Battery temperature is too high:.
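
As root, something like the following pulls out just the battery messages without paging through the whole log (a plain grep; the pattern is only an example):

 # grep 'Battery temperature' /var/log/messages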

You can also query the controller directly:

 # /usr/local/bin/tw_cli /c0/bbu show
 Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
 ---------------------------------------------------------------------------
 bbu   On           Yes       OK        OK       High     0      xx-xxx-xxxx

or, more generally, ask it about everything it knows:

 # /usr/local/bin/tw_cli /c0 show
 Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
 ------------------------------------------------------------------------------
 u0    RAID-6    VERIFYING      -       31%(A)  64K     11175.8   ON     OFF
 
 Port   Status           Unit   Size        Blocks        Serial
 ---------------------------------------------------------------
 p0     OK               u0     1.82 TB     3907029168    WD-WCAVY4029115
 p1     OK               u0     1.82 TB     3907029168    WD-WCAVY4022616
 p2     OK               u0     1.82 TB     3907029168    WD-WCAVY3899073
 p3     OK               u0     1.82 TB     3907029168    WD-WCAVY3996530
 p4     OK               u0     1.82 TB     3907029168    WD-WCAVY3924154
 p5     OK               u0     1.82 TB     3907029168    WD-WCAVY3898411
 p6     OK               u0     1.82 TB     3907029168    WD-WCAVY3751982
 p7     OK               u0     1.82 TB     3907029168    WD-WCAVY3917383
 
 Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
 ---------------------------------------------------------------------------
 bbu   On           Yes       OK        OK       High     0      xx-xxx-xxx

(It will say "Fault" under "Cache" if the buffering has been disabled.)

If you want to re-enable the cache once the battery has cooled down, run

 # /usr/local/bin/tw_cli /c0/u0 set cache=on

I don't think this will work if the temperature is still "Too High" (and in that case you would be enabling the cache without the battery backup).
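
While waiting for the battery to cool down, you can re-run the BBU query periodically, for example with watch (here every 600 seconds):

 # watch -n 600 /usr/local/bin/tw_cli /c0/bbu show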
  
CERN has a writeup on this problem at https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskPrbTwBbuFault .
