Difference between revisions of "GWU cluster (Corcoran Hall)"

From Gw-qcd-wiki

Revision as of 12:54, 17 June 2011

Hardware

Some information about the old cluster can be found here.

The cluster has been upgraded: it now has 16 nodes, each with 2 GPUs, for a total of 32 GPUs (GTX 480s). Each card has 1.5 GB of memory and delivers, for dslash, about 50 GFlops in double precision and 125 GFlops in single precision (see Ben's Performance benchmarks).

Each node has 12 GB of memory (6 x 2 GB PC3-10600 ECC unbuffered DDR3 1333 MHz) and 2 Intel quad-core Xeon E5620 processors running at 2.4 GHz with 12 MB of cache. The chassis is a Supermicro 1026GT-TF-FM207 and the motherboard is an X8DTG-DF, which is based on the Intel 5520 chipset (Tylersburg). Each node has a 320 GB HDD.

The interconnect is still 4x DDR InfiniBand, which provides 5 Gb/s x 4 lanes = 20 Gb/s of signaling bandwidth; because of the 8b/10b encoding this translates to 16 Gb/s = 2 GB/s of data in each direction (for more InfiniBand details, consult the Wikipedia page).
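The bandwidth arithmetic above can be checked with a few lines of shell. This is only a sketch of the calculation; the variable names are made up for illustration.

```shell
#!/bin/sh
# Link parameters quoted in the text for 4x DDR InfiniBand.
lanes=4
gbps_per_lane=5                              # signaling rate per lane, Gb/s

signaling_gbps=$((lanes * gbps_per_lane))    # raw signaling rate
data_gbps=$((signaling_gbps * 8 / 10))       # 8b/10b: 8 data bits per 10 line bits
data_gbytes=$((data_gbps / 8))               # 8 bits per byte

echo "signaling: ${signaling_gbps} Gb/s, data: ${data_gbps} Gb/s = ${data_gbytes} GB/s"
```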


Configuration

If you plan to use openmpi or mvapich2 to run parallel jobs, don't forget to load the proper modules in your .cshrc file (for example, for openmpi you need module load openmpi/gnu). Furthermore, if you plan to use cuda v3.1, load the cuda31 module (cuda v3.2 is also installed, but our codes are not yet compatible with it).

The scheduling system on the cluster is Sun Grid Engine v6.2u5 (sge). For infiniband we have installed OFED, openmpi (v1.5), and mvapich2 (v1.5.1). We prefer openmpi since it provides tight integration with the scheduler (the mpi processes get killed when you delete a job from the scheduler). A simple job script is given below:

#$ -S /bin/csh
#$ -cwd
#$ -j y
#$ -o job_output.log$JOB_ID
##$ -q all.q
#$ -q gpu.q
#$ -l gpu_count=1
#$ -pe openmpi 8
#$ -l h_rt=01:30:00
 
 mpirun -n $NSLOTS --mca btl_openib_flags 1 test_dslash_multi_gpu < check.in >& output_np${NSLOTS}.log


Note that if your job uses gpus you should request -l gpu_count=1 (one gpu per process) and also pass the --mca btl_openib_flags 1 flag to mpirun to make sure that openmpi and cuda work together. If you only run cpu codes, the flag is not necessary. The new nodes are in the gpu.q queue: use gpu.q to run on them, or all.q to run on the old nodes.
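If you run the same gpu job with different process counts, the script can be generated instead of edited by hand. The sketch below is not part of the cluster setup: make_gpu_job is a hypothetical helper that prints the script shown earlier with the slot count filled in, assuming the same executable (test_dslash_multi_gpu) and input file (check.in).

```shell
#!/bin/sh
# Hypothetical helper: print an SGE job script for the gpu.q queue.
# $1 = number of slots requested from the openmpi parallel environment.
make_gpu_job() {
    nslots=$1
    # In an unquoted here-document, \$ produces a literal $ in the output.
    cat <<EOF
#\$ -S /bin/csh
#\$ -cwd
#\$ -j y
#\$ -o job_output.log\$JOB_ID
#\$ -q gpu.q
#\$ -l gpu_count=1
#\$ -pe openmpi ${nslots}
#\$ -l h_rt=01:30:00

mpirun -n \$NSLOTS --mca btl_openib_flags 1 test_dslash_multi_gpu < check.in >& output_np\${NSLOTS}.log
EOF
}
```

For example, make_gpu_job 16 > job.csh writes a 16-process script, which can then be submitted with qsub job.csh.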

Queue admin

Disable/enable nodes

If you want to reboot a node while there are pending jobs in the queue, you should disable it from the queue first:

qmod -d \*@gpu05

This disables node gpu05 in all the queues. Note that we need to escape * so that the shell doesn't expand it. Once the node is rebooted, we need to re-enable it using

qmod -e \*@gpu05
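When several nodes need to be taken out of the queues, the two qmod commands can be wrapped in small functions. This is a sketch: disable_node, enable_node, and the QMOD variable are hypothetical names, and setting QMOD to "echo qmod" gives a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical helpers around "qmod -d" and "qmod -e".
# Override QMOD (e.g. QMOD="echo qmod") to print commands instead of running them.
QMOD=${QMOD:-qmod}

disable_node() {
    # Quoting "*@node" keeps the shell from expanding the wildcard,
    # which has the same effect as escaping it with a backslash.
    $QMOD -d "*@$1"
}

enable_node() {
    $QMOD -e "*@$1"
}
```

A loop such as for n in gpu05 gpu06; do disable_node "$n"; done then disables both nodes before the reboot.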

Setting up parallel environments

To add, remove, or modify a parallel environment (pe), use qconf. For example, qconf -spl lists all available pe's and qconf -sp openmpi shows the configuration of the openmpi pe. To modify it, use qconf -mp openmpi. The important parameters to configure are start_proc_args and stop_proc_args, which allow you to do pre/post execution setup, and allocation_rule, which is typically either $round_robin or $fill_up.
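To check a single setting without reading the whole configuration, the output of qconf -sp can be filtered. The sketch below assumes the output is a list of "name value" lines; pe_param is a hypothetical helper, not an sge command.

```shell
#!/bin/sh
# Hypothetical helper: read "qconf -sp <pe>" style output on stdin and print
# the value of the parameter named in $1 (everything after the first field).
pe_param() {
    awk -v key="$1" '$1 == key { sub(/^[^ \t]+[ \t]+/, ""); print; exit }'
}
```

For example, qconf -sp openmpi | pe_param allocation_rule prints the allocation rule currently set for the openmpi pe.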

Cluster administration

The new nodes on the cluster support the Intelligent Platform Management Interface (IPMI). We can use it to restart and monitor the nodes remotely even when we can no longer log in to them. The ipmi service is running on all nodes and is configured to listen on the addresses 192.168.2.1xx, where xx is the number of node gpuxx. To configure node gpu01, I had to issue the following commands:

ipmitool lan set 1 ipsrc static
ipmitool lan set 1 ipaddr 192.168.2.101

To check the configuration, run ipmitool lan print 1. Similar commands were issued for the other nodes. With this configuration we can monitor each individual node without logging in to it. For example, to get sensor information for node gpu01 we would type on the head node:

 ipmitool -H 192.168.2.101 -I lanplus -U user_name -P password sensor
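Since the address follows from the node name, the lookup can be scripted. The sketch below assumes the 192.168.2.1xx convention described above; ipmi_addr, ipmi_node, and the IPMITOOL, IPMI_USER, and IPMI_PASS variables are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for running ipmitool against a node by name.
# Override IPMITOOL (e.g. IPMITOOL="echo ipmitool") for a dry run.
IPMITOOL=${IPMITOOL:-ipmitool}

ipmi_addr() {
    # gpu01 -> 192.168.2.101, gpu16 -> 192.168.2.116
    echo "192.168.2.1${1#gpu}"
}

ipmi_node() {
    # $1 = node name; remaining arguments are passed on to ipmitool.
    node=$1
    shift
    $IPMITOOL -H "$(ipmi_addr "$node")" -I lanplus -U "$IPMI_USER" -P "$IPMI_PASS" "$@"
}
```

With IPMI_USER and IPMI_PASS set, ipmi_node gpu01 sensor is equivalent to the command above, and ipmi_node gpu01 chassis power status reports whether the node is powered on.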

The system administration manual from Advanced Clustering can be found here. Note that on this cluster the act_* tools are called beo_*.