Distributed Deep Learning

Distributed Deep Learning with FPGA Ring Allreduce

In this work, we propose a new In-Network Computing system that supports Ring Allreduce. To minimize communication overhead, we apply layer-based computation/communication overlap and optimize it for the proposed In-Network Computing system.
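
For reference, the sketch below simulates the standard Ring Allreduce collective that the system accelerates, in a single Python process. All names (ring_allreduce, grads, bounds) are illustrative, and the software loops stand in for what the In-Network Computing hardware performs across the ring; this is the textbook algorithm, not the paper's implementation.

    # Minimal single-process simulation of Ring Allreduce: each of p workers
    # holds a length-n vector; after the call, every worker holds the
    # element-wise sum across all workers.

    def ring_allreduce(grads):
        p = len(grads)                                # number of workers in the ring
        n = len(grads[0])
        bounds = [n * i // p for i in range(p + 1)]   # chunk c spans [bounds[c], bounds[c+1])

        # Reduce-scatter: in step s, worker w sends chunk (w - s) mod p to its
        # successor, which accumulates it. After p - 1 steps, worker w holds
        # the fully reduced chunk (w + 1) mod p.
        for s in range(p - 1):
            for w in range(p):
                c = (w - s) % p
                lo, hi = bounds[c], bounds[c + 1]
                dst = (w + 1) % p
                for i in range(lo, hi):
                    grads[dst][i] += grads[w][i]

        # Allgather: circulate the completed chunks around the ring so that
        # every worker ends up with the full reduced vector.
        for s in range(p - 1):
            for w in range(p):
                c = (w + 1 - s) % p
                lo, hi = bounds[c], bounds[c + 1]
                grads[(w + 1) % p][lo:hi] = grads[w][lo:hi]

    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    ring_allreduce(grads)
    print(grads)  # every worker now holds [9.0, 12.0]

Each worker sends and receives only one chunk per step, which is what makes the ring pattern bandwidth-optimal and a natural target for per-hop in-network reduction.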

Large-Message Size Allreduce at Wire Speed for Distributed Deep Learning

To reduce latency, we devised a dataflow architecture with an Allreduce-specific hardware accelerator that performs data aggregation and reduction while the data is being transferred. The accelerator is designed to start the Allreduce operation immediately, before the entire message has been received.
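
The sketch below illustrates the dataflow principle in plain Python; the names and fragment layout are our assumptions, not the accelerator's interface. Each fragment is reduced and emitted as soon as it arrives, so the reduction of early fragments overlaps with the transfer of later ones rather than waiting for the whole message.

    # Software sketch of streaming reduction: reduce and forward each arriving
    # fragment immediately instead of buffering the entire message first.

    def streaming_reduce(local_fragments, incoming_fragments):
        """Yield reduced fragments one by one as they 'arrive'.

        In the hardware, incoming_fragments would be the packet stream from
        the upstream ring neighbor, and each summed fragment would be
        forwarded downstream at once.
        """
        for mine, theirs in zip(local_fragments, incoming_fragments):
            yield [a + b for a, b in zip(mine, theirs)]

    local = [[1.0, 2.0], [3.0, 4.0]]       # this node's message, in two fragments
    upstream = [[0.5, 0.5], [0.5, 0.5]]    # fragments arriving from the neighbor
    for reduced in streaming_reduce(local, upstream):
        print(reduced)                     # emitted before later fragments arrive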