DirectReduce: A Scalable Ring AllReduce Offloading Architecture for . . . Based on this insight, we propose DirectReduce, a fully offloaded ring all-reduce architecture that comprises three components: 1) the GateKeeper module, responsible for evaluating outgoing data and deciding its progression, either directing it to the Protocol Engine for packetization or intercepting it for reduction (e.g., sum, maximum); 2) . . .
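The snippet only names the GateKeeper's role, so the following is a minimal sketch of that forward-or-reduce decision; the Message and GateKeeper types, field names, and return strings are illustrative assumptions, not DirectReduce's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical outgoing message: one chunk of a ring all-reduce step,
# tagged with its collective id, chunk index, and requested reduction op.
@dataclass
class Message:
    collective_id: int
    chunk_index: int
    op: str                    # "sum", "max", or "none" (plain forwarding)
    payload: List[float]

@dataclass
class GateKeeper:
    # Per-(collective, chunk) accumulation buffers held on the NIC.
    buffers: Dict[Tuple[int, int], List[float]] = field(default_factory=dict)

    def handle_outgoing(self, msg: Message) -> str:
        """Decide whether a message is packetized or intercepted for reduction."""
        if msg.op == "none":
            return "forward-to-protocol-engine"          # plain packetization
        key = (msg.collective_id, msg.chunk_index)
        acc = self.buffers.get(key)
        if acc is None:
            # First contribution for this chunk: start an accumulator.
            self.buffers[key] = list(msg.payload)
            return "intercepted-awaiting-peer-data"
        # Reduce element-wise with the buffered partial result.
        if msg.op == "sum":
            self.buffers[key] = [a + b for a, b in zip(acc, msg.payload)]
        elif msg.op == "max":
            self.buffers[key] = [max(a, b) for a, b in zip(acc, msg.payload)]
        return "reduced-then-forward"

if __name__ == "__main__":
    gk = GateKeeper()
    print(gk.handle_outgoing(Message(1, 0, "sum", [1.0, 2.0])))
    print(gk.handle_outgoing(Message(1, 0, "sum", [3.0, 4.0])))
    print(gk.buffers[(1, 0)])   # [4.0, 6.0]
```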
Optimizing Allreduce Operations for Heterogeneous Architectures with . . . Figure 8: speedups of the standard and multi-lane algorithms (with MPI_Allreduce and ring all-reduce) combined with the multiple-processes-per-GPU scheme increase with node count, reaching 2.1x at 32 nodes. Figure 9: a standard MPI_Allreduce with multiple processes per GPU. Figure 10: the multi-lane all-reduce with multiple processes per GPU.
Swing: Short-cutting Rings for Higher Bandwidth Allreduce - arXiv.org By analyzing the results, we observe that Swing outperforms all the other allreduce algorithms for vectors ranging from 32B to 32MiB, thanks to its lower latency deficiency compared to the ring and bucket algorithms and its lower bandwidth deficiency compared to the latency-optimal recursive doubling algorithm.
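The "latency deficiency" versus "bandwidth deficiency" trade-off the snippet refers to can be made concrete with the textbook alpha-beta cost models for ring and recursive-doubling allreduce; this is the standard model, not Swing's own analysis, with p processes, n bytes per vector, per-step latency alpha, and per-byte transfer time beta.

```latex
% Standard alpha-beta cost models (not Swing's exact analysis):
% p = number of processes, n = vector size in bytes,
% alpha = per-step latency, beta = per-byte transfer time.
\[
T_{\text{ring}} = 2(p-1)\,\alpha + 2\,\frac{p-1}{p}\,n\,\beta ,
\qquad
T_{\text{rec-dbl}} = \log_2(p)\,\alpha + \log_2(p)\,n\,\beta .
\]
% Ring pays O(p) latency terms (its "latency deficiency") but transfers only
% the bandwidth-optimal 2(p-1)/p * n bytes per node; recursive doubling is
% latency-optimal at log2(p) steps but sends the full vector in every step
% (its "bandwidth deficiency").
```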
Short-circuiting Rings for Low-Latency AllReduce - arXiv.org For physical ring network topologies with negligible fixed startup delays, this implies that the Ring AllReduce algorithm is indeed optimal across all message sizes. This begs the question: can we improve AllReduce completion times beyond the Ring algorithm in ring-based GPU-to-GPU topologies?
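For reference, the baseline these papers compare against is ring all-reduce: a reduce-scatter phase followed by an all-gather phase, 2(p-1) sequential ring steps in total. Below is a minimal single-process simulation of that algorithm; it is a pedagogical sketch under our own naming, not code from either paper.

```python
import numpy as np

def ring_allreduce(chunks_per_rank):
    """Simulate ring all-reduce (reduce-scatter + all-gather) in one process.

    chunks_per_rank: list of length p; element r is rank r's vector split
    into p equal chunks (a list of numpy arrays). Returns the data each rank
    holds after 2*(p-1) sequential ring steps (all fully reduced chunks).
    """
    p = len(chunks_per_rank)
    data = [[c.copy() for c in rank] for rank in chunks_per_rank]

    # Reduce-scatter: p-1 steps; in step s, rank r sends chunk (r - s) mod p
    # to its right neighbour, which accumulates it.
    for step in range(p - 1):
        for r in range(p):
            send_idx = (r - step) % p
            dst = (r + 1) % p
            data[dst][send_idx] = data[dst][send_idx] + data[r][send_idx]

    # All-gather: p-1 steps; the fully reduced chunk owned by each rank
    # travels around the ring until every rank holds every reduced chunk.
    for step in range(p - 1):
        for r in range(p):
            send_idx = (r + 1 - step) % p
            dst = (r + 1) % p
            data[dst][send_idx] = data[r][send_idx].copy()
    return data

if __name__ == "__main__":
    p = 4
    rng = np.random.default_rng(0)
    full = [rng.standard_normal(8) for _ in range(p)]       # each rank's vector
    chunks = [np.array_split(v, p) for v in full]           # split into p chunks
    out = ring_allreduce(chunks)
    expected = sum(full)
    print("all ranks agree:",
          all(np.allclose(np.concatenate(out[r]), expected) for r in range(p)))
```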
SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale . . . Here, we derive a theoretical lower bound on the expected completion time for the Allreduce operation, highlighting the accumulated impact of reliability costs. Consider N datacenters participating in the Allreduce. The ring Allreduce algorithm involves 2N − 2 sequential rounds of P2P communication steps [45].
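The snippet stops before the bound itself; as a simplified illustration (not the paper's exact result), assuming the 2N − 2 rounds execute strictly sequentially with per-round durations t_i that include expected retransmission and reliability overheads, linearity of expectation already shows how those overheads accumulate.

```latex
% Simplified illustration, not SDR-RDMA's exact bound: T is the Allreduce
% completion time and t_i the duration of the i-th sequential ring round,
% including its expected reliability (retransmission) cost.
\[
\mathbb{E}[T] \;=\; \sum_{i=1}^{2N-2} \mathbb{E}[t_i]
\;\ge\; (2N-2)\,\min_i \mathbb{E}[t_i],
\]
% so per-round reliability overheads add up linearly over the 2N-2 rounds
% rather than being hidden by overlap between rounds.
```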
OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs - arXiv.org Abstract: Multi-tenancy is essential for unleashing SmartNICs' potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and I/O resources. Compared to standard NIC data paths with a . . .
An RDMA-First Object Storage System with SmartNIC Offload AI training and inference impose sustained, fine-grained I/O that stresses host-mediated, TCP-based storage paths. Motivated by kernel-bypass networking and user-space storage stacks, we revisit POSIX-compatible object storage for GPU-centric pipelines. We present ROS2, an RDMA-first object storage system design that offloads the DAOS client to an NVIDIA BlueField-3 SmartNIC while leaving the . . .
Characterizing Off-path SmartNIC for Accelerating Distributed Systems SmartNICs have recently emerged as an appealing device for accelerating distributed systems. However, there has not been a comprehensive characterization of SmartNICs, and existing designs typically leverage only a single communication path for workload offloading. This paper presents the first holistic study of a representative off-path SmartNIC, specifically the Bluefield-2, from a . . .
FlexiNS: A SmartNIC-Centric, Line-Rate and Flexible Network Stack Offloading the network stack to an off-path SmartNIC seems promising for providing high flexibility; however, throughput remains constrained by inherent SmartNIC architectural limitations. To this end, we design FlexiNS, a SmartNIC-centric network stack with software transport programmability and line-rate packet processing capabilities.