GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models

P Zhao, X Yuan - arXiv preprint arXiv:2501.12956, 2025 - arxiv.org
Large Language Models (LLMs) face significant deployment challenges due to their
substantial resource requirements. While low-bit quantized weights can reduce memory …
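The snippet only hints at why low-bit weights reduce memory. As a rough, generic illustration (not the GANQ algorithm itself; the k-means codebook fitting and all names below are assumptions for the sketch), the following Python fragment maps an FP16 weight matrix onto a 16-entry learned codebook addressed by 4-bit indices and compares the resulting storage footprint to the original FP16 weights.

    # Illustrative sketch of non-uniform (lookup-table) weight quantization.
    # Not the GANQ method; codebook fitting via 1-D k-means is an assumption.
    import numpy as np

    def quantize_nonuniform(weights, bits=4, iters=20):
        """Map float weights to a small learned codebook of 2**bits entries."""
        flat = weights.reshape(-1).astype(np.float32)
        k = 2 ** bits
        # Initialize centroids at evenly spaced quantiles of the weight distribution.
        centroids = np.quantile(flat, np.linspace(0.0, 1.0, k))
        for _ in range(iters):  # plain Lloyd's k-means in 1-D
            idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
            for j in range(k):
                if np.any(idx == j):
                    centroids[j] = flat[idx == j].mean()
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        return idx.astype(np.uint8).reshape(weights.shape), centroids.astype(np.float16)

    def dequantize(indices, codebook):
        return codebook[indices]

    if __name__ == "__main__":
        w = np.random.randn(1024, 1024).astype(np.float16)
        idx, codebook = quantize_nonuniform(w, bits=4)
        # Footprint: 2 bytes/weight in FP16 vs. two packed 4-bit codes per byte
        # plus a 16-entry FP16 codebook (indices are kept unpacked here for clarity).
        fp16_bytes = w.size * 2
        q4_bytes = w.size // 2 + codebook.size * 2
        print(f"FP16: {fp16_bytes / 2**20:.1f} MiB, 4-bit LUT: {q4_bytes / 2**20:.2f} MiB")
        err = np.abs(dequantize(idx, codebook).astype(np.float32) - w.astype(np.float32)).mean()
        print(f"mean abs quantization error: {err:.4f}")

On random Gaussian weights this yields roughly a 4x reduction in storage relative to FP16; the non-uniform codebook places levels where the weight distribution is dense, which is the general motivation behind lookup-table quantization schemes of this kind.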

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

Y Zhang, M Wang, L Zou, W Liu, HL Zhen… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models (LLMs) have achieved remarkable success as
model sizes continue to grow, yet their deployment remains challenging due to significant …