Security and privacy on generative data in AIGC: A survey

T Wang, Y Zhang, S Qi, R Zhao, Z Xia… - ACM Computing Surveys, 2024 - dl.acm.org
The advent of artificial intelligence-generated content (AIGC) represents a pivotal moment in
the evolution of information technology. With AIGC, it can be effortless to generate high …

The life cycle of large language models in education: A framework for understanding sources of bias

J Lee, Y Hicke, R Yu, C Brooks… - British Journal of …, 2024 - Wiley Online Library
Large language models (LLMs) are increasingly adopted in educational contexts to provide
personalized support to students and teachers. The unprecedented capacity of LLM‐based …

Dolma: An open corpus of three trillion tokens for language model pretraining research

L Soldaini, R Kinney, A Bhagia, D Schwenk… - arXiv preprint arXiv …, 2024 - arxiv.org
Information about pretraining corpora used to train the current best-performing language
models is seldom discussed: commercial models rarely detail their data, and even open …

Black-box access is insufficient for rigorous AI audits

S Casper, C Ezell, C Siegmann, N Kolt… - Proceedings of the …, 2024 - dl.acm.org
External audits of AI systems are increasingly recognized as a key mechanism for AI
governance. The effectiveness of an audit, however, depends on the degree of access …

Rethinking open source generative AI: open-washing and the EU AI Act

A Liesenfeld, M Dingemanse - … of the 2024 ACM Conference on …, 2024 - dl.acm.org
The past year has seen a steep rise in generative AI systems that claim to be open. But how
open are they really? The question of what counts as open source in generative AI is poised …

[PDF] Consent in crisis: The rapid decline of the AI data commons

S Longpre, R Mahari, A Lee, C Lund… - …, 2024 - proceedings.neurips.cc
General-purpose artificial intelligence (AI) systems are built on massive swathes of public
web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge …

No "zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance

V Udandarao, A Prabhu, A Ghosh… - The Thirty-eighth …, 2024 - openreview.net
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation
performance of multimodal models, such as CLIP for classification and Stable-Diffusion for …

Participation in the age of foundation models

H Suresh, E Tseng, M Young, M Gray… - Proceedings of the …, 2024 - dl.acm.org
Growing interest and investment in the capabilities of foundation models has positioned
such systems to impact a wide array of services, from banking to healthcare. Alongside …

LLAVAGUARD: VLM-based Safeguard for Vision Dataset Curation and Safety Assessment

L Helff, F Friedrich, M Brack… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce LlavaGuard, a family of multimodal safeguard models based on Llava, offering
a robust framework for evaluating the safety compliance of vision datasets and models. Our …

Open problems in technical AI governance

A Reuel, B Bucknall, S Casper, T Fist, L Soder… - arXiv preprint arXiv …, 2024 - arxiv.org
AI progress is creating a growing range of risks and opportunities, but it is often unclear how
they should be navigated. In many cases, the barriers and uncertainties faced are at least …