GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models
Large Language Models (LLMs) face significant deployment challenges due to their
substantial resource requirements. While low-bit quantized weights can reduce memory …
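As a minimal sketch of the memory argument behind low-bit weight quantization (this is a generic codebook/lookup-table illustration, not the GANQ algorithm; the per-row 1-D k-means codebook and the numpy-based packing estimate are illustrative assumptions):

```python
# Minimal sketch (not the GANQ algorithm): non-uniform 4-bit weight
# quantization with a per-row codebook, illustrating the memory saving
# low-bit quantized weights give over FP16 storage.
import numpy as np

def quantize_row_nonuniform(w, bits=4, iters=20):
    """Quantize one weight row to 2**bits codebook entries (1-D k-means)."""
    k = 2 ** bits
    # Initialize centroids on evenly spaced quantiles of the weights.
    centroids = np.quantile(w, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign each weight to its nearest centroid (this is the 4-bit index).
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = w[idx == c].mean()
    return idx.astype(np.uint8), centroids.astype(np.float16)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4096)).astype(np.float16)   # toy weight matrix
rows = [quantize_row_nonuniform(W[i]) for i in range(W.shape[0])]

fp16_bytes = W.size * 2                              # 16 bits per weight
# 4-bit indices packed two per byte, plus one small FP16 codebook per row.
q_bytes = sum(idx.size // 2 + cb.size * 2 for idx, cb in rows)
err = np.mean([np.abs(cb[idx] - W[i]).mean() for i, (idx, cb) in enumerate(rows)])
print(f"FP16: {fp16_bytes} B, 4-bit non-uniform: {q_bytes} B, mean |error|: {err:.4f}")
```

The non-uniform codebook places quantization levels where the weights actually concentrate, which is the general idea the abstract alludes to; GANQ's GPU-adaptive formulation itself is not reproduced here.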
MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
Transformer-based large language models (LLMs) have achieved remarkable success as
model sizes continue to grow, yet their deployment remains challenging due to significant …