To reduce the latency, we devised a dataflow architecture with an Allreduce-specific hardware accelerator that performs data aggregation and reduction while data is being transferred. The accelerator is designed to immediately start Allreduce operation before an entire message is recived.