Talk Overview: Graphics Processing Units (GPUs) are critical to modern HPC (high-performance computing) and ML/DL (machine learning/deep learning) workloads. The requirements of engineers and scientists can easily scale to petaflops, whereas state-of-the-art single-GPU performance is in the teraflops range. Continuing in the tradition of cluster computing, GPUs are scaled to petaflops performance using established technologies such as MPI (Message Passing Interface), high-performance interconnects (such as InfiniBand and RoCE), and RDMA (remote direct memory access). The presentation will explore the challenges involved in multi-node scaling and how containerization is helping manage the software complexity of running workloads on clusters. It will give an overview of how to orchestrate multi-node workflows that combine GPU hardware, MPI, and containers. The container technology focus will be on Docker, Singularity, HPC resource schedulers such as SLURM and PBS, and container orchestration platforms such as Kubernetes.
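As a minimal sketch of the pattern the talk describes — assuming a cluster running SLURM with Singularity installed on the compute nodes and a host-side MPI launcher (srun) that the containerized application can interoperate with — a batch script might look like the following; the image name `app.sif`, the binary `./mpi_app`, and all resource counts are placeholders, not anything specified in the talk:

```shell
#!/bin/bash
#SBATCH --job-name=mpi-gpu-demo     # hypothetical job name
#SBATCH --nodes=4                   # spread the job across 4 nodes
#SBATCH --ntasks-per-node=1        # one MPI rank per node (one per GPU)
#SBATCH --gres=gpu:1               # request one GPU on each node

# srun launches one container instance per MPI rank; the host scheduler
# handles placement and rank wire-up, while the container image supplies
# the application's software stack. The --nv flag exposes the host's
# NVIDIA GPU driver and devices inside the container.
srun singularity exec --nv app.sif ./mpi_app
```

The design point this illustrates is the hybrid model common in HPC: the scheduler and interconnect-aware launcher stay on the host, and only the application environment is containerized, so RDMA-capable fabrics and GPUs remain usable from inside the container.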
From the CentOS Dojo at ORNL - https://wiki.centos.org/Events/Dojo/ORNL2019