Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions

N Vasilache, O Zinenko, T Theodoridis, P Goyal… - arXiv preprint arXiv …, 2018 - arxiv.org
Deep learning models with convolutional and recurrent networks are now ubiquitous and
analyze massive amounts of audio, image, video, text and graph data, with applications in …

Tiramisu: A polyhedral compiler for expressing fast and portable code

R Baghdadi, J Ray, MB Romdhane… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org
This paper introduces Tiramisu, a polyhedral framework designed to generate high
performance code for multiple platforms including multicores, GPUs, and distributed …

DNNFusion: accelerating deep neural networks execution with advanced operator fusion

W Niu, J Guan, Y Wang, G Agrawal, B Ren - Proceedings of the 42nd …, 2021 - dl.acm.org
Deep Neural Networks (DNNs) have emerged as the core enabler of many major
applications on mobile devices. To achieve high accuracy, DNN models have become …

Futhark: purely functional GPU-programming with nested parallelism and in-place array updates

T Henriksen, NGW Serup, M Elsman… - Proceedings of the 38th …, 2017 - dl.acm.org
Futhark is a purely functional data-parallel array language that offers a machine-neutral
programming model and an optimising compiler that generates OpenCL code for GPUs …

HASCO: Towards agile hardware and software co-design for tensor computation

Q Xiao, S Zheng, B Wu, P Xu, X Qian… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
Tensor computations overwhelm traditional general-purpose computing devices due to the
large amounts of data and operations of the computations. They call for a holistic solution …

When polyhedral transformations meet SIMD code generation

M Kong, R Veras, K Stock, F Franchetti… - Proceedings of the 34th …, 2013 - dl.acm.org
Data locality and parallelism are critical optimization objectives for performance on modern
multi-core machines. Both coarse-grain parallelism (e.g., multi-core) and fine-grain …

Optimising purely functional GPU programs

TL McDonell, MMT Chakravarty, G Keller… - ACM SIGPLAN …, 2013 - dl.acm.org
Purely functional, embedded array programs are a good match for SIMD hardware, such as
GPUs. However, the naive compilation of such programs quickly leads to both code …

The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated GPU kernels, automatically

N Vasilache, O Zinenko, T Theodoridis… - ACM Transactions on …, 2019 - dl.acm.org
Deep learning frameworks automate the deployment, distribution, synchronization, memory
allocation, and hardware acceleration of models represented as graphs of computational …

Optimizing for parallelism and data locality

K Kennedy, KS McKinley - … of the 6th international conference on …, 1992 - dl.acm.org
Previous research has used program transformation to introduce parallelism and to exploit
data locality. Unfortunately, these two objectives have usually been considered …

Generating configurable hardware from parallel patterns

R Prabhakar, D Koeplinger, KJ Brown, HJ Lee… - ACM SIGPLAN …, 2016 - dl.acm.org
In recent years the computing landscape has seen an increasing shift towards specialized
accelerators. Field programmable gate arrays (FPGAs) are particularly promising for the …