sc | NGO HUY CU

Large-Message Size Allreduce at Wire Speed for Distributed Deep Learning

To reduce latency, we devised a dataflow architecture with an Allreduce-specific hardware accelerator that aggregates and reduces data while it is being transferred. The accelerator is designed to start the Allreduce operation immediately, before an entire message has been received.
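
As a rough software analogy only (the work described here is a hardware accelerator, and this is not its design), the idea of reducing data before the full message has arrived can be sketched in Python. The function names `chunked` and `streaming_allreduce_sum`, the fixed `chunk_size`, the use of summation as the reduction, and the assumption that every node sends a message of equal length are all illustrative choices, not details from the source.

```python
from typing import Iterator, List, Sequence


def chunked(message: Sequence[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive chunks of a message, standing in for packets on the wire."""
    for start in range(0, len(message), chunk_size):
        yield list(message[start:start + chunk_size])


def streaming_allreduce_sum(
    messages: Sequence[Sequence[float]], chunk_size: int
) -> Iterator[List[float]]:
    """Reduce (sum) chunk by chunk across all nodes.

    Each reduced chunk is emitted as soon as every node's copy of that
    chunk is available, without waiting for the rest of the message.
    Assumes all messages have the same length (hypothetical simplification).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # One lazy chunk stream per node; nothing is read until it is needed.
    streams = [chunked(m, chunk_size) for m in messages]
    for chunks in zip(*streams):
        # Element-wise sum over the same chunk from every node.
        yield [sum(values) for values in zip(*chunks)]
```

Because the function is a generator, a caller can forward the first reduced chunk while later chunks are still being produced, which is the property the paragraph above describes. A non-streaming version would instead buffer each complete message before summing, so the reduction time would be added on top of the full transfer time.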