SiP-ML: high-bandwidth optical network interconnects for machine learning training
This paper proposes optical network interconnects as a key enabler for building high-
bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML …
Software-defined “hardware” infrastructures: A survey on enabling technologies and open research directions
This paper provides an overview of software-defined “hardware” infrastructures (SDHI).
SDHI builds upon the concept of hardware (HW) resource disaggregation. HW resource …
Topologies in distributed machine learning: Comprehensive survey, recommendations and future directions
With the widespread use of distributed machine learning (DML), many IT companies have
established networks dedicated to DML. Different communication architectures of DML have …
Orion: A distributed file system for Non-Volatile main memory and RDMA-Capable networks
High-performance, byte-addressable non-volatile main memories (NVMMs) force system
designers to rethink trade-offs throughout the system stack, often leading to dramatic …
Aquila: A unified, low-latency fabric for datacenter networks
D Gibson, H Hariharan, E Lance, M McLaren… - … USENIX Symposium on …, 2022 - usenix.org
Datacenter workloads have evolved from the data intensive, loosely-coupled workloads of
the past decade to more tightly coupled ones, wherein ultra-low latency communication is …
Thymesisflow: A software-defined, hw/sw co-designed interconnect stack for rack-scale memory disaggregation
With cloud providers constantly seeking the best infrastructure trade-off between
performance delivered to customers and overall energy/utilization efficiency of their data …
Shoal: A network architecture for disaggregated racks
Disaggregated racks comprise a dense cluster of separate pools of compute, memory and
storage blades, all inter-connected through an internal network within a single rack …
Shale: A practical, scalable oblivious reconfigurable network
Circuit-switched technologies have long been proposed for handling high-throughput traffic
in datacenter networks, but recent developments in nanosecond-scale reconfiguration have …
Uniform-Cost Multi-Path Routing for Reconfigurable Data Center Networks
Reconfigurable data center networks (RDCNs) are arising as a promising data center
network (DCN) design in the post-Moore's law era. However, the constantly reconfigured …
A tale of two topologies: Exploring convertible data center network architectures with flat-tree
This paper promotes convertible data center network architectures, which can dynamically
change the network topology to combine the benefits of multiple architectures. We propose …