A survey of AI-generated content (AIGC)
Recently, Artificial Intelligence Generated Content (AIGC) has gained significant attention
from society, especially with the rise of Generative AI (GAI) techniques such as ChatGPT …
A survey on data selection for language models
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …
OctoPack: Instruction tuning code large language models
Finetuning large language models (LLMs) on instructions leads to vast performance
improvements on natural language tasks. We apply instruction tuning using code …
Embers of autoregression show how large language models are shaped by the problem they are trained to solve
The widespread adoption of large language models (LLMs) makes it important to recognize
their strengths and limitations. We argue that to develop a holistic understanding of these …
Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …
Language models scale reliably with over-training and on downstream tasks
Scaling laws are useful guides for derisking expensive training runs, as they predict
performance of large models using cheaper, small-scale experiments. However, there …
Consent in crisis: The rapid decline of the AI data commons
General-purpose artificial intelligence (AI) systems are built on massive swathes of public
web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge …
Leave no context behind: Efficient infinite context transformers with Infini-attention
This work introduces an efficient method to scale Transformer-based Large Language
Models (LLMs) to infinitely long inputs with bounded memory and computation. A key …
Generative language models exhibit social identity biases
Social identity biases, particularly the tendency to favor one's own group (ingroup solidarity)
and derogate other groups (outgroup hostility), are deeply rooted in human psychology and …
Generalization vs. memorization: Tracing language models' capabilities back to pretraining data
The impressive capabilities of large language models (LLMs) have sparked debate over
whether these models genuinely generalize to unseen tasks or predominantly rely on …