Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions
Deep learning models with convolutional and recurrent networks are now ubiquitous and
analyze massive amounts of audio, image, video, text and graph data, with applications in …
Tiramisu: A polyhedral compiler for expressing fast and portable code
R Baghdadi, J Ray, MB Romdhane… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org
This paper introduces Tiramisu, a polyhedral framework designed to generate high
performance code for multiple platforms including multicores, GPUs, and distributed …
Dnnfusion: accelerating deep neural networks execution with advanced operator fusion
Deep Neural Networks (DNNs) have emerged as the core enabler of many major
applications on mobile devices. To achieve high accuracy, DNN models have become …
Futhark: purely functional GPU-programming with nested parallelism and in-place array updates
Futhark is a purely functional data-parallel array language that offers a machine-neutral
programming model and an optimising compiler that generates OpenCL code for GPUs …
Hasco: Towards agile hardware and software co-design for tensor computation
Tensor computations overwhelm traditional general-purpose computing devices due to the
large amounts of data and operations of the computations. They call for a holistic solution …
When polyhedral transformations meet SIMD code generation
Data locality and parallelism are critical optimization objectives for performance on modern
multi-core machines. Both coarse-grain parallelism (e.g., multi-core) and fine-grain …
Optimising purely functional GPU programs
Purely functional, embedded array programs are a good match for SIMD hardware, such as
GPUs. However, the naive compilation of such programs quickly leads to both code …
The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated GPU kernels, automatically
Deep learning frameworks automate the deployment, distribution, synchronization, memory
allocation, and hardware acceleration of models represented as graphs of computational …
Optimizing for parallelism and data locality
K Kennedy, KS McKinley - … of the 6th international conference on …, 1992 - dl.acm.org
Previous research has used program transformation to introduce parallelism and to exploit
data locality. Unfortunately, these two objectives have usually been considered …
Generating configurable hardware from parallel patterns
In recent years the computing landscape has seen an increasing shift towards specialized
accelerators. Field programmable gate arrays (FPGAs) are particularly promising for the …