MLIR: Scaling compiler infrastructure for domain specific computation
This work presents MLIR, a novel approach to building reusable and extensible compiler
infrastructure. MLIR addresses software fragmentation, compilation for heterogeneous …
MLIR: A compiler infrastructure for the end of Moore's law
This work presents MLIR, a novel approach to building reusable and extensible compiler
infrastructure. MLIR aims to address software fragmentation, improve compilation for …
Kernel operations on the GPU, with autodiff, without memory overflows
The KeOps library provides fast and memory-efficient GPU support for tensors whose
entries are given by a mathematical formula, such as kernel and distance matrices. KeOps …
GraphIt: A high-performance graph DSL
The performance bottlenecks of graph applications depend not only on the algorithm and
the underlying hardware, but also on the size and structure of the input graph. As a result …
Exocompilation for productive programming of hardware accelerators
High-performance kernel libraries are critical to exploiting accelerators and specialized
instructions in many applications. Because compilers are difficult to extend to support …
DietCode: Automatic optimization for dynamic tensor programs
Achieving high performance for compute-intensive operators in machine learning (ML)
workloads is a crucial but challenging task. Many ML and system practitioners rely on …
Accelerating reduction and scan using tensor core units
Driven by deep learning, there has been a surge of specialized processors for matrix
multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of …
Domain-specific architectures: Research problems and promising approaches
Process technology-driven performance and energy efficiency improvements have slowed
down as we approach physical design limits. General-purpose manycore architectures …
Optimizing tensor programs on flexible storage
Tensor programs often need to process large tensors (vectors, matrices, or higher order
tensors) that require a specialized storage format for their memory layout. Several such …
Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies
Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for
many applications. The predominantly used imperative languages, like C or OpenCL, force …