A survey of AI-generated content (AIGC)

Y Cao, S Li, Y Liu, Z Yan, Y Dai, P Yu, L Sun - ACM Computing Surveys, 2025 - dl.acm.org
Recently, Artificial Intelligence Generated Content (AIGC) has gained significant attention
from society, especially with the rise of Generative AI (GAI) techniques such as ChatGPT …

A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

OctoPack: Instruction tuning code large language models

N Muennighoff, Q Liu, A Zebaze, Q Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Finetuning large language models (LLMs) on instructions leads to vast performance
improvements on natural language tasks. We apply instruction tuning using code …

Embers of autoregression show how large language models are shaped by the problem they are trained to solve

RT McCoy, S Yao, D Friedman, MD Hardy… - Proceedings of the …, 2024 - pnas.org
The widespread adoption of large language models (LLMs) makes it important to recognize
their strengths and limitations. We argue that to develop a holistic understanding of these …

Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models

M Deitke, C Clark, S Lee, R Tripathi, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …

Language models scale reliably with over-training and on downstream tasks

SY Gadre, G Smyrnis, V Shankar, S Gururangan… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws are useful guides for derisking expensive training runs, as they predict
performance of large models using cheaper, small-scale experiments. However, there …

Consent in crisis: The rapid decline of the AI data commons

S Longpre, R Mahari, A Lee, C Lund, H Oderinwale… - NeurIPS, 2024 - hal.science
General-purpose artificial intelligence (AI) systems are built on massive swathes of public
web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge …

Leave no context behind: Efficient infinite context transformers with Infini-attention

T Munkhdalai, M Faruqui, S Gopal - arXiv preprint arXiv:2404.07143, 2024 - arxiv.org
This work introduces an efficient method to scale Transformer-based Large Language
Models (LLMs) to infinitely long inputs with bounded memory and computation. A key …

Generative language models exhibit social identity biases

T Hu, Y Kyrychenko, S Rathje, N Collier… - Nature Computational …, 2024 - nature.com
Social identity biases, particularly the tendency to favor one's own group (ingroup solidarity)
and derogate other groups (outgroup hostility), are deeply rooted in human psychology and …

Generalization vs Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

X Wang, A Antoniades, Y Elazar, A Amayuelas… - arXiv preprint arXiv …, 2024 - arxiv.org
The impressive capabilities of large language models (LLMs) have sparked debate over
whether these models genuinely generalize to unseen tasks or predominantly rely on …