Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach

H Kwon, P Chatarasi, M Pellauer, A Parashar… - Proceedings of the …, 2019 - dl.acm.org
The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse
and perform staging are known as dataflow, which directly impacts the performance and …

A survey of cache simulators

H Brais, R Kalayappan, PR Panda - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Computer architecture simulation tools are essential for implementing and evaluating new
ideas in the domain and can be useful for understanding the behavior of programs and …

Analytical characterization and design space exploration for optimization of CNNs

R Li, Y Xu, A Sukumaran-Rajam, A Rountev… - Proceedings of the 26th …, 2021 - dl.acm.org
Moving data through the memory hierarchy is a fundamental bottleneck that can limit the
performance of core algorithms of machine learning, such as convolutional neural networks …

Incremental flattening for nested data parallelism

T Henriksen, F Thorøe, M Elsman… - Proceedings of the 24th …, 2019 - dl.acm.org
Compilation techniques for nested-parallel applications that can adapt to hardware and
dataset characteristics are vital for unlocking the power of modern hardware. This paper …

PolyDL: Polyhedral optimizations for creation of high-performance DL primitives

S Tavarageri, A Heinecke, S Avancha, B Kaul… - ACM Transactions on …, 2021 - dl.acm.org
Deep Neural Networks (DNNs) have revolutionized many aspects of our lives. The use of
DNNs is becoming ubiquitous, including in software for image recognition, speech …

A fast analytical model of fully associative caches

T Gysi, T Grosser, L Brandner, T Hoefler - Proceedings of the 40th ACM …, 2019 - dl.acm.org
While the cost of computation is an easy to understand local property, the cost of data
movement on cached architectures depends on global state, does not compose, and is hard …

Fast and exact analysis for LRU caches

V Touzeau, C Maïza, D Monniaux… - Proceedings of the ACM on …, 2019 - dl.acm.org
For applications in worst-case execution time analysis and in security, it is desirable to
statically classify memory accesses into those that result in cache hits, and those that result …

Falcon: A scalable analytical cache model

A Pitchanathan, K Grover, T Grosser - Proceedings of the ACM on …, 2024 - dl.acm.org
Compilers often use performance models to decide how to optimize code. This is often
preferred over using hardware performance measurements, since hardware measurements …

A methodology for efficient tile size selection for affine loop kernels

V Kelefouras, K Djemame, G Keramidas… - International Journal of …, 2022 - Springer
Reducing the number of data accesses in memory hierarchy is of paramount importance on
modern computer systems. One of the key optimizations addressing this problem is loop …

Parallel Loop Locality Analysis for Symbolic Thread Counts

F Liu, Y Zhu, S Sun, C Ding, W Smith… - Proceedings of the 2024 …, 2024 - dl.acm.org
Data movement limits program performance. This bottleneck is more significant in
multi-thread programs but more difficult to analyze, especially for multiple thread counts. For …