Communication
To parallelize our codes we need a communication infrastructure. The basic idea is to divide the lattice equally between processes that run in parallel and manage their own resources: CPU memory, GPU memory etc. The processes are mostly equivalent: the calculations and memory access patterns are very similar. This allows us to write codes that look very much like serial codes. Most of the code path is run redundantly by all processes; it is only the lattice-wide operations that benefit from the multi-processor hardware. However, since the computational time is dominated by these routines, the codes scale well.
While the I/O resources are also managed independently by each process, in most of our codes, the I/O is handled by rank 0 process. This is mainly due to the fact that we cannot guarantee that the disk is mounted by all compute nodes and, even if it is, most I/O libraries do not handle well multi-process access to a single file.
Code organization
The communication is organized in three layers:
- low level: these routines send/receive unstructured data from one process to another
- intermediate level: these routines handle data movement between lattice data structures
- high level: these routines implement shifts
Low level routines
These routines are simple wrappers that sit on top of the communication library (most likely MPI). Their purpose is to insulate our codes from the communication layer. Currently, our code provides an implementation that uses MPI and a vanilla implementation that is used for single node architectures. Here is a list of the low level routines in comm_low.h with a short description:
void init_machine(int &argc, char **&argv, bool single_node = false)
- This function needs to be called if the communication library is linked, to initialize the communication layer. The parameters argc and argv are the ones passed to main routine by the operating system. The last parameter indicates whether the code is only meant to run on single node machines.
void shutdown_machine()
- comm library cleanup. call before exiting main program.
int get_num_nodes()
- returns the number of processes in the current program.
rank_t get_node_rank()
- returns the rank (number) of the current process. rank_t is defined to be unsigned int in layout.h.
void synchronize()
- inserts a communication barrier. This is block the execution in all processes until all of them reach this point.
void broadcast(char* buffer, size_t sz)
- broadcast the buffer of size sz bytes to all nodes, i.e., copy it from node of rank 0 to all other nodes.
void send(char* buffer, size_t sz, rank_t dest, int tag=1) void recv(char* buffer, size_t size, rank_t src, int tag=1);
- send/receive a buffer of size sz bytes from/to the current node to/from the node of rank dest and add a numeric tag to it (if necessary). This is a blocking call, that will return when a matching receive/send is executed on node dest.
struct comm_link
- this structure implement an asynchronous communication link. start_send/start_recv are non-blocking calls that initiate the data transfer. To wait for the call to complete, call finish_comm method, which blocks until the data transfer is complete.
template <typename T> void global_sum(T& v) template <typename R> void global_sum(complex<R>& v) template <typename T> void global_sum(T* v, int len)
- these routines sum v over all processes, when v is any arithmetic type (T), a complex type or an array of values.
GPU low level layer
Codes running on GPUs store relevant data in GPU memory. In the current implementation, communications between GPUs is staged: first data is copied into a CPU buffer, then low level communication routines are used to transfer data between the CPU buffers.
Here are the routines that make up the low level GPU layer (the relevant header file is comm_low_cuda.h):
void initialize_cuda_devices()
- This routine is called by init_machine when the GPU layer is enabled. NOTE: the CUDA API doesn't report whether a device is already in use. Thus, when running in multi-GPU scenario, you need to make sure that the all the GPUs are available on the node where your code is run, otherwise the GPUs will be over-allocated and they will either have very poor performance or the code will fail outright if the GPUs are allocated exclusively to another process.
void device_send(void* buf, int size, int tonode) void device_recv(void* buf, int size, int fromnode)
- Blocking communication calls. These routine allocate the CPU buffers as needed and remove them when the communication is complete.
class device_comm_link
- This structure is used for the non-blocking communication routines. The main difference from the comm_link is that it own a CPU buffer and a reference to the GPU buffer. Depending on the communication strategy, the CPU buffer can be pinned to allow for faster transfer between the GPU and CPU buffer (this is also required if the data transfer is done asynchronously over the PCI bus). The various strategies are controlled by flags defined in comm/comm_low_cuda.cu.
struct buffered_device_comm_link
- buffered_device_comm_link is similar to a device_comm_link, except that it also manages the GPU buffer, not only the CPU buffer; this allows for greater flexibility and it is especially useful for repeated calls. As the device_comm_link, the CPU buffer can be pinned, to improve copy speed and allow for async calls. This is controlled by flags set in comm/comm_low_cuda.cu. Moreover, if the CPU memory is pinned and mapped in the device memory too, you can use the same pointer on the device. This allows you to have kernels that write and read directly from CPU memory. There are scenarios where this could lead to improved performance.
void device_send(buffered_device_comm_link &buf, int size, int tonode) void device_recv(buffered_device_comm_link &buf, int size, int fromnode)
- These routines implement blocking calls for buffered_device_comm_link.
This is found in the comm folder, it handles the communication between nodes.