Massively parallel processing core with plural chains of processing elements and respective smart memory storing select data received from each chain
An accelerator system is shown that includes a plurality of processing cores. Each
processing core includes a plurality of processing chains configured to perform parallel …
Why on-chip cache coherence is here to stay
Communications of the ACM | July 2012 | Vol. 55 | No. 7 | p. 78 | Contributed articles
Shared memory is the dominant low-level communication …
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
GF Diamos, AR Kerr, S Yalamanchili… - Proceedings of the 19th …, 2010 - dl.acm.org
Ocelot is a dynamic compilation framework designed to map the explicitly data parallel
execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms …
An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth
DH Woo, NH Seong, DL Lewis… - HPCA-16 2010 The …, 2010 - ieeexplore.ieee.org
Memory bandwidth has become a major performance bottleneck as more and more cores
are integrated onto a single die, demanding more and more data from the system memory …
Relax: An architectural framework for software recovery of hardware faults
As technology scales ever further, device unreliability is creating excessive complexity for
hardware to maintain the illusion of perfect operation. In this paper, we consider whether …
Thread block compaction for efficient SIMT control flow
Manycore accelerators such as graphics processor units (GPUs) organize processing units
into single-instruction, multiple data “cores” to improve throughput per unit hardware cost …
DeNovo: Rethinking the memory hierarchy for disciplined parallelism
For parallelism to become tractable for mass programmers, shared-memory languages and
environments must evolve to enforce disciplined practices that ban "wild shared-memory …
Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces
B Pichai, L Hsu, A Bhattacharjee - ACM SIGARCH Computer Architecture …, 2014 - dl.acm.org
The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent
example, necessitates a manageable programming model to ensure widespread adoption …
An asymmetric distributed shared memory model for heterogeneous parallel systems
Heterogeneous computing combines general purpose CPUs with accelerators to efficiently
execute both sequential control-intensive and data-parallel phases of applications. Existing …
Goldmine: Automatic assertion generation using data mining and static analysis
S Vasudevan, D Sheridan, S Patel… - … , Automation & Test …, 2010 - ieeexplore.ieee.org
We present GOLDMINE, a methodology for generating assertions automatically. Our method
involves a combination of data mining and static analysis of the Register Transfer Level …