SHARP: A short-word hierarchical accelerator for robust and practical fully homomorphic encryption

J Kim, S Kim, J Choi, J Park, D Kim… - Proceedings of the 50th …, 2023 - dl.acm.org
Fully homomorphic encryption (FHE) is an emerging cryptographic technology that
guarantees the privacy of sensitive user data by enabling direct computations on encrypted …

Evolution of the graphics processing unit (GPU)

WJ Dally, SW Keckler, DB Kirk - IEEE Micro, 2021 - ieeexplore.ieee.org
Graphics processing units (GPUs) power today's fastest supercomputers, are the dominant
platform for deep learning, and provide the intelligence for devices ranging from self-driving …

MIMD programs execution support on SIMD machines: a holistic survey

D Mustafa, R Alkhasawneh, F Obeidat… - IEEE Access, 2024 - ieeexplore.ieee.org
The Single Instruction Multiple Data (SIMD) architecture, supported by various
high-performance computing platforms, efficiently utilizes data-level parallelism. The SIMD model …

Analyzing CUDA workloads using a detailed GPU simulator

A Bakhoda, GL Yuan, WWL Fung… - … analysis of systems …, 2009 - ieeexplore.ieee.org
Modern Graphic Processing Units (GPUs) provide sufficiently flexible programming models
that understanding their performance can provide insight in designing tomorrow's manycore …

PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation

A Klöckner, N Pinto, Y Lee, B Catanzaro, P Ivanov… - Parallel computing, 2012 - Elsevier
High-performance computing has recently seen a surge of interest in heterogeneous
systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices …

Brook for GPUs: stream computing on graphics hardware

I Buck, T Foley, D Horn, J Sugerman… - ACM transactions on …, 2004 - dl.acm.org
In this paper, we present Brook for GPUs, a system for general-purpose computation on
programmable graphics hardware. Brook extends C to include simple data-parallel …

Conservation cores: reducing the energy of mature computations

G Venkatesh, J Sampson, N Goulding, S Garcia… - ACM Sigplan …, 2010 - dl.acm.org
Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are
currently conspiring to create a utilization wall that limits the fraction of a chip that can run at …

Think fast: A tensor streaming processor (TSP) for accelerating deep learning workloads

D Abts, J Ross, J Sparling… - 2020 ACM/IEEE 47th …, 2020 - ieeexplore.ieee.org
In this paper, we introduce the Tensor Streaming Processor (TSP) architecture, a
functionally-sliced microarchitecture with memory units interleaved with vector and matrix deep learning …

Dynamic warp formation and scheduling for efficient GPU control flow

WWL Fung, I Sham, G Yuan… - 40th Annual IEEE/ACM …, 2007 - ieeexplore.ieee.org
Recent advances in graphics processing units (GPUs) have resulted in massively parallel
hardware that is easily programmable and widely available in commodity desktop computer …

Sequoia: Programming the memory hierarchy

K Fatahalian, DR Horn, TJ Knight, L Leem… - Proceedings of the …, 2006 - dl.acm.org
We present Sequoia, a programming language designed to facilitate the development of
memory hierarchy aware parallel programs that remain portable across modern machines …