A survey on evaluation of large language models

Y Chang, X Wang, J Wang, Y Wu, L Yang… - ACM Transactions on …, 2024 - dl.acm.org
Large language models (LLMs) are gaining increasing popularity in both academia and
industry, owing to their unprecedented performance in various applications. As LLMs …

Gemma 2: Improving open language models at a practical size

G Team, M Riviere, S Pathak, PG Sessa… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-
of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new …

AI generates covertly racist decisions about people based on their dialect

V Hofmann, PR Kalluri, D Jurafsky, S King - Nature, 2024 - nature.com
Hundreds of millions of people now interact with language models, with uses ranging from
help with writing, to informing hiring decisions. However, these language models are known …

Larger and more instructable language models become less reliable

L Zhou, W Schellaert, F Martínez-Plumed… - Nature, 2024 - nature.com
The prevailing methods to make large language models more powerful and amenable have
been based on continuous scaling up (that is, increasing their size, data volume and …

Fairness in serving large language models

Y Sheng, S Cao, D Li, B Zhu, Z Li, D Zhuo… - … USENIX Symposium on …, 2024 - usenix.org
High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of
requests from short chat conversations to long document reading. To ensure that all client …

Online speculative decoding

X Liu, L Hu, P Bailis, A Cheung, Z Deng, I Stoica… - arXiv preprint arXiv …, 2023 - arxiv.org
Speculative decoding is a pivotal technique to accelerate the inference of large language
models (LLMs) by employing a smaller draft model to predict the target model's outputs …

Introducing v0.5 of the AI Safety Benchmark from MLCommons

B Vidgen, A Agrawal, AM Ahmed, V Akinwande… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the
MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to …

Power-aware Deep Learning Model Serving with μ-Serve

H Qiu, W Mao, A Patke, S Cui, S Jha, C Wang… - 2024 USENIX Annual …, 2024 - usenix.org
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …

AgentOhana: Design unified data and training pipeline for effective agent learning

J Zhang, T Lan, R Murthy, Z Liu, W Yao, M Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Autonomous agents powered by large language models (LLMs) have garnered significant
research attention. However, fully harnessing the potential of LLMs for agent-based tasks …

Generative language models exhibit social identity biases

T Hu, Y Kyrychenko, S Rathje, N Collier… - Nature Computational …, 2024 - nature.com
Social identity biases, particularly the tendency to favor one's own group (ingroup solidarity)
and derogate other groups (outgroup hostility), are deeply rooted in human psychology and …