GWU cluster (Corcoran Hall)
Latest revision as of 10:12, 4 September 2014
Hardware
Some information about the old cluster can be found at http://eagle.phys.gwu.edu/~fxlee/cluster/.
The cluster has been upgraded: it now has 16 nodes, each with 2 GPUs, for a total of 32 GPUs (GTX 480s). Each card has 1.5 GB of memory and delivers, for dslash, about 50 GFlops in double precision and 125 GFlops in single precision (see Ben's Performance benchmarks).
Each node has 12 GB of memory (6 x 2 GB PC3-10600 ECC unbuffered DDR3 at 1333 MHz) and two quad-core Intel Xeon E5620 processors running at 2.4 GHz with 12 MB of cache. The chassis is a Supermicro 1026GT-TF-FM207 and the motherboard is an X8DTG-DF, which is based on the Intel 5520 chipset (Tylersburg). Each node has a 320 GB HDD.
The interconnect is still 4x DDR InfiniBand, which provides 5 Gb/s x 4 lanes = 20 Gb/s (signaling). Because of 8b/10b encoding this translates to 16 Gb/s = 2 GB/s of data in each direction (for more InfiniBand details consult the Wikipedia page at http://en.wikipedia.org/wiki/Infiniband).
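The bandwidth arithmetic above can be checked with a few lines of shell. This is only a sketch; the variable names are made up for illustration.

```shell
#!/bin/sh
# 4x DDR InfiniBand: 4 lanes at 5 Gb/s signaling each
lanes=4
lane_gbps=5
signal_gbps=$(( lanes * lane_gbps ))     # raw signaling rate in Gb/s
# 8b/10b encoding carries 8 data bits in every 10 signal bits
data_gbps=$(( signal_gbps * 8 / 10 ))
data_gbytes=$(( data_gbps / 8 ))         # 8 bits per byte
echo "signaling: ${signal_gbps} Gb/s, data: ${data_gbps} Gb/s = ${data_gbytes} GB/s per direction"
```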
Configuration
If you plan to use openmpi or mvapich2 to run your parallel jobs, don't forget to load the proper modules in your .cshrc file (for example for openmpi you need to have module load openmpi/gnu). Furthermore, if you plan to use cuda v3.1 load module cuda31 (we also have cuda v3.2 installed but our codes are not yet compatible with it).
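As a minimal sketch, the corresponding lines in ~/.cshrc could look like the fragment below. The module names are the ones mentioned above; running module avail shows what is actually installed.

```csh
# ~/.cshrc (fragment)
module load openmpi/gnu   # openmpi
module load cuda31        # cuda v3.1
```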
The scheduling system on the cluster is Sun Grid Engine v6.2u5 (SGE, http://gridengine.sunsource.net/). For InfiniBand we have installed OFED together with openmpi (v1.5) and mvapich2 (v1.5.1). We prefer openmpi since it is tightly integrated with the scheduler (the mpi processes get killed when you delete a job from the scheduler). A simple script is given below:
    #$ -S /bin/csh
    #$ -cwd
    #$ -j y
    #$ -o job_output.log$JOB_ID
    ##$ -q all.q
    #$ -q gpu.q
    #$ -l gpu_count=1
    #$ -pe openmpi 8
    #$ -l h_rt=01:30:00
    mpirun -n $NSLOTS --mca btl_openib_flags 1 test_dslash_multi_gpu < check.in >& output_np${NSLOTS}.log
Note that if your job uses GPUs you should request -l gpu_count=1 (one GPU per process) and you should also pass the --mca btl_openib_flags 1 flag to mpirun to make sure that openmpi and cuda work together. If you only run CPU codes, the flag is not necessary. The new nodes are in the gpu.q queue: use gpu.q if you want to run on them, or all.q if you want to use the old nodes.
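To make the GPU and CPU variants of the script concrete, here is a small sketch that prints one or the other. The helper make_job_script and the program name my_cpu_program are hypothetical, introduced only for illustration; the directives themselves are copied from the script above.

```shell
#!/bin/sh
# make_job_script QUEUE NSLOTS GPU_COUNT
# Prints an SGE job script modelled on the example above.
# With GPU_COUNT=0 the gpu_count request and the openib flag are left out.
make_job_script() {
    queue=$1
    nslots=$2
    gpus=$3
    printf '%s\n' '#$ -S /bin/csh' '#$ -cwd' '#$ -j y' '#$ -o job_output.log$JOB_ID'
    printf '#$ -q %s\n' "$queue"
    if [ "$gpus" -gt 0 ]; then
        printf '#$ -l gpu_count=%s\n' "$gpus"
    fi
    printf '#$ -pe openmpi %s\n' "$nslots"
    printf '%s\n' '#$ -l h_rt=01:30:00'
    if [ "$gpus" -gt 0 ]; then
        # GPU jobs pass the openib flag so that openmpi and cuda work together
        printf '%s\n' 'mpirun -n $NSLOTS --mca btl_openib_flags 1 test_dslash_multi_gpu < check.in >& output_np${NSLOTS}.log'
    else
        # my_cpu_program is a placeholder for a CPU-only executable
        printf '%s\n' 'mpirun -n $NSLOTS my_cpu_program < check.in >& output_np${NSLOTS}.log'
    fi
}

# Print a CPU-only script for the old nodes
make_job_script all.q 4 0
```

Redirecting the output to a file and passing that file to qsub would submit the job.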
CUDA 4.0 note: For CUDA versions 4.0 and later we can set the environment variable CUDA_NIC_INTEROP=1 to force cudaMallocHost to use an alternate code path that doesn't conflict with openmpi. When this variable is set, the --mca btl_openib_flags 1 flag is no longer needed.
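A sketch of how this might look in a csh job script follows. It assumes a CUDA version that honours the variable and that the variable is not already set in .cshrc on every node; -x is Open MPI's mpirun option for exporting an environment variable to the processes it launches.

```csh
# In the job script (csh)
setenv CUDA_NIC_INTEROP 1
# -x forwards the variable to the launched mpi processes
mpirun -n $NSLOTS -x CUDA_NIC_INTEROP test_dslash_multi_gpu < check.in >& output_np${NSLOTS}.log
```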
Queue admin
Disable/enable nodes
If you want to reboot a node while there are pending jobs in the queue, you should first disable it in the queue:
    qmod -d \*@gpu05
This disables node gpu05 in all the queues. Note that we need to escape the * so that the shell doesn't expand it. Once the node has rebooted, re-enable it using
    qmod -e \*@gpu05
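The effect of the escaping can be seen without touching the scheduler. The sketch below runs in a scratch directory and uses echo in place of qmod; the file name all.q@gpu05 is made up so that the pattern has something to match.

```shell
#!/bin/sh
# Work in a scratch directory so the demonstration touches nothing else.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch all.q@gpu05              # a file whose name happens to match the pattern

unescaped=$(echo *@gpu05)      # the shell expands the glob before the command runs
escaped=$(echo \*@gpu05)       # the backslash keeps the * literal
quoted=$(echo '*@gpu05')       # single quotes have the same effect

echo "unescaped: $unescaped"
echo "escaped:   $escaped"
echo "quoted:    $quoted"
```

In sh an unmatched pattern is passed through unchanged, so the unescaped form often appears to work; csh instead stops with a "No match" error.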
Setting up parallel environments
To add/remove/modify a parallel environment (pe) use qconf. For example qconf -spl shows you all available pe's and qconf -sp openmpi shows you the configuration of openmpi pe. To modify it, use qconf -mp openmpi. The important parameters to configure are start_proc_args and end_proc_args which allow you to do a pre/post execution setup and allocation_rule which can be either $round_robin or $fill_up.
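The difference between the two allocation rules can be illustrated with a toy model. This is not how the scheduler is implemented, only a sketch of the outcome: $fill_up exhausts one host before moving to the next, while $round_robin hands out slots one host at a time. The function name allocate and the figure of 8 slots per host are assumptions made for the example.

```shell
#!/bin/sh
# allocate RULE NSLOTS SLOTS_PER_HOST HOST...
# Prints "host=count" pairs for a toy model of the two allocation rules.
allocate() {
    rule=$1; want=$2; per_host=$3
    shift 3
    nhosts=$#
    if [ "$want" -gt $(( nhosts * per_host )) ]; then
        echo "request does not fit" >&2
        return 1
    fi
    remaining=$want
    index=0
    out=""
    for host in "$@"; do
        if [ "$rule" = fill_up ]; then
            # take as many slots as this host can still offer
            take=$remaining
            if [ "$take" -gt "$per_host" ]; then take=$per_host; fi
        else
            # round robin ends up with an even split; the first hosts get the remainder
            take=$(( want / nhosts ))
            if [ "$index" -lt $(( want % nhosts )) ]; then take=$(( take + 1 )); fi
        fi
        remaining=$(( remaining - take ))
        index=$(( index + 1 ))
        out="$out$host=$take "
    done
    echo "${out% }"
}

allocate fill_up 8 8 gpu01 gpu02
allocate round_robin 8 8 gpu01 gpu02
```

For a job with one GPU per process the distinction matters, since each node has only 2 GPUs.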
Cluster administration
The new nodes on the cluster support the Intelligent Platform Management Interface (IPMI, http://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface). We can use it to restart and monitor the nodes remotely even when we can no longer log in to them. The ipmi service is running on all nodes and is configured to listen on the 192.168.2.1xx addresses, where xx stands for the gpuxx node. To configure node gpu01, the commands I had to issue were the following:
    ipmitool lan set 1 ipsrc static
    ipmitool lan set 1 ipaddr 192.168.2.101
To check the configuration, you need to run ipmitool lan print 1. For the other nodes similar commands were issued. With this configuration we can monitor each individual node without logging on the node. For example to get sensor information on node gpu01 we would type on the head node:
ipmitool -H 192.168.2.101 -I lanplus -U user_name -P password sensor
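The same pattern extends to the other nodes and to power control. The sketch below only prints the command lines rather than running them; ipmi_addr and ipmi_cmd are hypothetical helpers, user_name and password are the placeholders used above, and chassis power status and chassis power cycle are standard ipmitool subcommands.

```shell
#!/bin/sh
# ipmi_addr NODE: map gpuXX to its IPMI address 192.168.2.1XX
ipmi_addr() {
    node=$1
    num=${node#gpu}            # strip the "gpu" prefix, leaving e.g. "05"
    echo "192.168.2.1$num"
}

# ipmi_cmd NODE ARGS...: print (not run) the ipmitool command line for NODE
ipmi_cmd() {
    node=$1
    shift
    echo "ipmitool -H $(ipmi_addr "$node") -I lanplus -U user_name -P password $*"
}

ipmi_cmd gpu05 chassis power status
ipmi_cmd gpu05 chassis power cycle
```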
System Administration Manual
The system administration manual from Advanced Clustering is available on this wiki as Apex_Cluster_Manual-2.2.pdf. Note that on this cluster the act_* tools are called beo_*.
Adding a user
To add a user, you need to do the following steps:
- Use sudo /usr/sbin/adduser to add the user to the head node
- Use sudo passwd -l username to disable password login
- Add an SSH key to the user's .ssh/authorized_keys file to allow the user to log in to the machine
- Propagate the change to the compute nodes using sudo /act/bin/beo_authsync -g gpus (if you only want to give access to the gpu nodes)
- Add the user to the gpuusers list using qconf -mu gpuusers
- Make sure to change the permissions on the .ssh/authorized_keys file so that it is not writable by anyone but the user
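The steps above can be sketched as a dry run that prints the commands in order. Nothing is executed; the helper add_user_plan, the example name jdoe, the /home/USERNAME home directory and the 600 mode are assumptions made for illustration.

```shell
#!/bin/sh
# add_user_plan USERNAME: print, in order, the commands for the steps above.
add_user_plan() {
    user=$1
    home=/home/$user           # assumed location of the home directory
    echo "sudo /usr/sbin/adduser $user"
    echo "sudo passwd -l $user"
    echo "# append the user's public key to $home/.ssh/authorized_keys"
    echo "sudo /act/bin/beo_authsync -g gpus"
    echo "qconf -mu gpuusers   # opens an editor; add $user to the list"
    echo "sudo chmod 600 $home/.ssh/authorized_keys"
}

add_user_plan jdoe
```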
Write buffering issues
At the moment there are some thermal issues related to the write buffering. The RAID card has a write buffer which greatly improves performance; it has a battery backup that prevents data loss in the event of a power failure. It's programmed to fail safe and shut down the write buffer if there is anything wrong with the battery. A symptom that this has happened is that file operations, deleting a file in particular, take an inordinately long time.
Recently the battery has been overheating. We're still trying to figure out what is going on and how to fix it, but for now you can monitor the status of the battery.
As an ordinary user, you can run
% dmesg | grep AEN
to check the system log for messages from the RAID controller. Unfortunately, those are not timestamped. The controller only reports three temperature states: Normal, High, and Too High (at which point it will shut down the battery and buffering).
These messages are written to /var/log/messages (and files rotated with it; messages.1, etc...), and are timestamped, but you will need to be root to examine them. If you have root, you can look at them, and find lines like:
    May 15 00:13:34 qcd kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x004B): Battery temperature is high:.
    May 15 00:13:34 qcd kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0008): Unclean shutdown detected:unit=0.
    May 15 00:13:41 qcd kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
    May 15 00:13:41 qcd kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x004D): Battery temperature is too high:.
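To pull just the battery temperature events and their timestamps out of such a log, a small filter along these lines may help. It is a sketch that assumes the syslog layout shown above (the severity is the ninth whitespace-separated field); battery_events is a hypothetical name. The dmesg output has no timestamp prefix, so the field positions there would differ.

```shell
#!/bin/sh
# battery_events: read syslog lines on stdin and print "month day time SEVERITY"
# for every battery temperature message from the RAID controller.
battery_events() {
    grep 'AEN:' | grep 'Battery temperature' | awk '{ print $1, $2, $3, $9 }'
}

# Demonstration on three of the sample lines shown above
battery_events <<'EOF'
May 15 00:13:34 qcd kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x004B): Battery temperature is high:.
May 15 00:13:41 qcd kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
May 15 00:13:41 qcd kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x004D): Battery temperature is too high:.
EOF
```

As root, battery_events < /var/log/messages applies the same filter to the live log.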
You can also query the controller directly:
    # /usr/local/bin/tw_cli /c0/bbu show
    Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
    ---------------------------------------------------------------------------
    bbu   On           Yes       OK      OK    High  0      xx-xxx-xxxx
or, more generally, ask about everything it knows:
# /usr/local/bin/tw_cli /c0 show
    Unit  UnitType  Status     %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
    ------------------------------------------------------------------------------
    u0    RAID-6    VERIFYING  -       31%(A)  64K     11175.8   ON     OFF

    Port  Status  Unit  Size     Blocks      Serial
    ---------------------------------------------------------------
    p0    OK      u0    1.82 TB  3907029168  WD-WCAVY4029115
    p1    OK      u0    1.82 TB  3907029168  WD-WCAVY4022616
    p2    OK      u0    1.82 TB  3907029168  WD-WCAVY3899073
    p3    OK      u0    1.82 TB  3907029168  WD-WCAVY3996530
    p4    OK      u0    1.82 TB  3907029168  WD-WCAVY3924154
    p5    OK      u0    1.82 TB  3907029168  WD-WCAVY3898411
    p6    OK      u0    1.82 TB  3907029168  WD-WCAVY3751982
    p7    OK      u0    1.82 TB  3907029168  WD-WCAVY3917383

    Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
    ---------------------------------------------------------------------------
    bbu   On           Yes       OK      OK    High  0      xx-xxx-xxx
(It will say "Fault" under "Cache" if the buffering has been disabled.)
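For a quick check from a script, the two interesting columns can be picked out of the controller output with awk. This is a sketch that assumes the column positions of the sample output above and a single-word temperature state; unit_cache and bbu_temp are hypothetical helper names.

```shell
#!/bin/sh
# unit_cache UNIT: read "tw_cli /c0 show" output on stdin, print the Cache column of UNIT
unit_cache() {
    awk -v unit="$1" '$1 == unit { print $8 }'
}

# bbu_temp: read tw_cli output on stdin, print the Temp column of the bbu line
bbu_temp() {
    awk '$1 == "bbu" { print $6 }'
}

# Demonstration on lines taken from the sample output above
echo 'u0 RAID-6 VERIFYING - 31%(A) 64K 11175.8 ON OFF' | unit_cache u0
echo 'bbu On Yes OK OK High 0 xx-xxx-xxxx' | bbu_temp
```

As root, /usr/local/bin/tw_cli /c0 show | unit_cache u0 would report the live cache state.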
If you want to reenable the cache once the battery has cooled down, do
# /usr/local/bin/tw_cli /c0/u0 set cache=on
I don't think this will work if the temperature is still "Too High" (in that case, you are enabling it without the battery backup).
CERN has a writeup on this at https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskPrbTwBbuFault .