A full spectrum of computing-in-memory technologies
Computing in memory (CIM) could be used to overcome the von Neumann bottleneck and to
provide sustainable improvements in computing throughput and energy efficiency …
A Comprehensive Review of Processing-in-Memory Architectures for Deep Neural Networks
This comprehensive review explores the advancements in processing-in-memory (PIM)
techniques and chiplet-based architectures for deep neural networks (DNNs). It addresses …
Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology
Processing-in-memory (PIM) has been explored for decades by computer architects, yet it
has never seen the light of day in real-world products due to its high design overheads and …
Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis
Processing-using-DRAM (PuD) is an emerging paradigm that leverages the analog
operational properties of DRAM circuitry to enable massively parallel in-DRAM computation …
pLUTo: Enabling massively parallel computation in DRAM via lookup tables
Data movement between the main memory and the processor is a key contributor to
execution time and energy consumption in memory-intensive applications. This data …
MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data …
Processing-using-DRAM (PUD) is a processing-in-memory (PIM) approach that uses a
DRAM array's massive internal parallelism to execute very-wide (e.g., 16,384-262,144-bit …
Near-optimal wafer-scale reduce
Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-
performance computing (HPC) applications. We present the first systematic investigation of …
NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inferencing
Modern transformer-based Large Language Models (LLMs) are constructed with a series of
decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi …
Evaluating Homomorphic Operations on a Real-World Processing-In-Memory System
Computing on encrypted data is a promising approach to reduce data security and privacy
risks, with homomorphic encryption serving as a facilitator in achieving this goal. In this work …
Simultaneous Many-Row Activation in Off-the-Shelf DRAM Chips: Experimental Characterization and Analysis
We experimentally analyze the computational capability of commercial off-the-shelf (COTS)
DRAM chips and the robustness of these capabilities under various timing delays between …