Orca: A distributed serving system for Transformer-Based generative models

GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …

Evaluating large language models for radiology natural language processing

Z Liu, T Zhong, Y Li, Y Zhang, Y Pan, Z Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
The rise of large language models (LLMs) has marked a pivotal shift in the field of natural
language processing (NLP). LLMs have revolutionized a multitude of domains, and they …

Achieving Peak Performance for Large Language Models: A Systematic Review

ZRK Rostam, S Szénási, G Kertész - IEEE Access, 2024 - ieeexplore.ieee.org
In recent years, large language models (LLMs) have achieved remarkable success in
natural language processing (NLP). LLMs require an extreme amount of parameters to …

Transformer uncertainty estimation with hierarchical stochastic attention

J Pei, C Wang, G Szarvas - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
Transformers are state-of-the-art in a wide range of NLP tasks and have also been applied
to many real-world products. Understanding the reliability and certainty of transformer …

Influential recommender system

H Zhu, H Ge, X Gu, P Zhao… - 2023 IEEE 39th …, 2023 - ieeexplore.ieee.org
Traditional recommender systems are typically passive in that they try to adapt their
recommendations to the user's historical interests. However, it is highly desirable for …

HPipe: Large Language Model Pipeline Parallelism for Long Context on Heterogeneous Cost-effective Devices

R Ma, X Yang, J Wang, Q Qi, H Sun… - Proceedings of the …, 2024 - aclanthology.org
Micro-enterprises and individual developers have emerging demands for long-sequence
analysis with powerful Large Language Models (LLMs). They try to deploy the LLMs locally, but only …

TCP: A Tensor Contraction Processor for AI Workloads (Industrial Product)

H Kim, Y Choi, J Park, B Bae, H Jeong… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
We introduce a novel tensor contraction processor (TCP) architecture that offers a paradigm
shift from traditional architectures that rely on fixed-size matrix multiplications. TCP aims at …

iServe: An Intent-based Serving System for LLMs

D Liakopoulos, T Hu, P Sinha… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Models (LLMs) are becoming ubiquitous across industries, where
applications demand they fulfill diverse user intents. However, developers currently face the …

Dynamic batching for inference system for transformer-based generation tasks

YU Gyeongin, G Kim, JS Jeong, S Kim… - US Patent …, 2022 - Google Patents
An inference system applies a machine-learning transformer model to a batch of requests
with variable input length or variable target length or variable internal state length by …

Selective batching for inference system for transformer-based generation tasks

YU Gyeongin, G Kim, JS Jeong, S Kim… - US Patent …, 2024 - Google Patents
An inference system applies a machine-learning transformer model to a batch of requests
with variable input length or variable target length or variable internal state length by …
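The two patent entries above, like the Orca paper, concern batching transformer inference requests whose input, target, and internal-state lengths differ. As a rough illustration of that general idea (not the patented method), the sketch below shows iteration-level batching: every active request advances by one decoding step per iteration, and finished requests are dropped so new ones can join the batch. The `Request`, `step_batch`, and `decode_one_token` names are hypothetical.

```python
# Minimal sketch of iteration-level batching over variable-length requests.
# This is an illustrative assumption, not the implementation described in
# the patents or the Orca paper.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: list            # token ids, variable length per request
    max_new_tokens: int     # variable target length per request
    generated: list = field(default_factory=list)

    def done(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


def step_batch(active, decode_one_token):
    """Run one decoding iteration for every active request, then return
    only the unfinished requests so freed slots can admit new arrivals."""
    for req in active:
        next_token = decode_one_token(req.prompt + req.generated)
        req.generated.append(next_token)
    return [r for r in active if not r.done()]
```

A serving loop would call `step_batch` repeatedly, refilling the batch from a queue between iterations; this avoids padding every request to the longest sequence and waiting for the slowest request before returning results.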