OuterSPACE: An outer product based sparse matrix multiplication accelerator
Sparse matrices are widely used in graph and data analytics, machine learning, engineering
and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator …
Gamma: Leveraging Gustavson's algorithm to accelerate sparse matrix multiplication
Sparse matrix-sparse matrix multiplication (spMspM) is at the heart of a wide range of
scientific and machine learning applications. spMspM is inefficient on general-purpose …
Co-designing accelerators and SoC interfaces using gem5-Aladdin
Increasing demand for power-efficient, high-performance computing has spurred a growing
number and diversity of hardware accelerators in mobile and server Systems on Chip …
Buffets: An efficient and composable storage idiom for explicit decoupled data orchestration
Accelerators spend significant area and effort on custom on-chip buffering. Unfortunately,
these solutions are strongly tied to particular designs, hampering re-usability across other …
Zorua: A holistic approach to resource virtualization in GPUs
This paper introduces a new resource virtualization framework, Zorua, that decouples the
programmer-specified resource usage of a GPU application from the actual allocation in the …
Capstan: A vector RDA for sparsity
This paper proposes Capstan: a scalable, parallel-patterns-based, reconfigurable dataflow
accelerator (RDA) for sparse and dense tensor applications. Instead of designing for one …
Efficient GPU synchronization without scopes: Saying no to complex consistency models
As GPUs have become increasingly general purpose, applications with more general
sharing patterns and fine-grained synchronization have started to emerge. Unfortunately …
SparseAdapt: Runtime control for sparse linear algebra on a reconfigurable accelerator
Dynamic adaptation is a post-silicon optimization technique that adapts the hardware to
workload phases. However, current adaptive approaches are oblivious to implicit phases …
Whirlpool: Improving dynamic cache management with static data classification
Cache hierarchies are increasingly non-uniform and difficult to manage. Several techniques,
such as scratchpads or reuse hints, use static information about how programs access data …
Morpheus: Extending the last level cache capacity in GPU systems using idle GPU core resources
Graphics Processing Units (GPUs) are widely-used accelerators for data-parallel
applications. In many GPU applications, GPU memory bandwidth bottlenecks performance …