Designing efficient sorting algorithms for manycore GPUs

N Satish, M Harris, M Garland - 2009 IEEE International …, 2009 - ieeexplore.ieee.org
We describe the design of high-performance parallel radix sort and merge sort routines for
manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix …

A comprehensive performance comparison of CUDA and OpenCL

J Fang, AL Varbanescu, H Sips - … International Conference on …, 2011 - ieeexplore.ieee.org
This paper presents a comprehensive performance comparison between CUDA and
OpenCL. We have selected 16 benchmarks ranging from synthetic applications to real-world …

Relational joins on graphics processors

B He, K Yang, R Fang, M Lu, N Govindaraju… - Proceedings of the …, 2008 - dl.acm.org
We present a novel design and implementation of relational join algorithms for new-
generation graphics processing units (GPUs). The most recent GPU features include support …

GPUTeraSort: high performance graphics co-processor sorting for large database management

N Govindaraju, J Gray, R Kumar… - Proceedings of the 2006 …, 2006 - dl.acm.org
We present a novel external sorting algorithm using graphics processors (GPUs) on large
databases composed of billions of records and wide keys. Our algorithm uses the data …

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

N Satish, C Kim, J Chhugani, AD Nguyen… - Proceedings of the …, 2010 - dl.acm.org
Sort is a fundamental kernel used in many database operations. In-memory sorts are now
feasible; sort performance is limited by compute flops and main memory bandwidth rather …

A comparison of sorting algorithms for the connection machine CM-2

GE Blelloch, CE Leiserson, BM Maggs… - Proceedings of the third …, 1991 - dl.acm.org
We have implemented three parallel sorting algorithms on the Connection Machine
Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort …

High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing

D Merrill, A Grimshaw - Parallel Processing Letters, 2011 - World Scientific
The need to rank and order data is pervasive, and many algorithms are fundamentally
dependent upon sorting and partitioning operations. Prior to this work, GPU stream …

Revisiting sorting for GPGPU stream architectures

DG Merrill, AS Grimshaw - … of the 19th international conference on …, 2010 - dl.acm.org
This poster presents efficient strategies for sorting large sequences of fixed-length keys (and
values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting …

[BOOK] Vector microprocessors

K Asanovic - 1998 - search.proquest.com
Most previous research into vector architectures has concentrated on supercomputing
applications and small enhancements to existing vector supercomputer implementations …

Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators

Y Lee, R Avizienis, A Bishara, R Xia… - Proceedings of the 38th …, 2011 - dl.acm.org
We present a taxonomy and modular implementation approach for data-parallel
accelerators, including the MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT) …