Red teaming language models with language models

E Perez, S Huang, F Song, T Cai, R Ring… - arXiv preprint arXiv …, 2022 - arxiv.org
Language Models (LMs) often cannot be deployed because of their potential to harm users
in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using …

Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned

D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai… - arXiv preprint arXiv …, 2022 - arxiv.org
We describe our early efforts to red team language models in order to simultaneously
discover, measure, and attempt to reduce their potentially harmful outputs. We make three …

On evaluating adversarial robustness of large vision-language models

Y Zhao, T Pang, C Du, X Yang, C Li… - Advances in …, 2024 - proceedings.neurips.cc
Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented
performance in response generation, especially with visual inputs, enabling more creative …

Jailbreaking black box large language models in twenty queries

P Chao, A Robey, E Dobriban, H Hassani… - arXiv preprint arXiv …, 2023 - arxiv.org
There is growing interest in ensuring that large language models (LLMs) align with human
values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which …

Explore, establish, exploit: Red teaming language models from scratch

S Casper, J Lin, J Kwon, G Culp… - arXiv preprint arXiv …, 2023 - arxiv.org
Deploying large language models (LLMs) can pose hazards from harmful outputs such as
toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order …

WANLI: Worker and AI collaboration for natural language inference dataset creation

A Liu, S Swayamdipta, NA Smith, Y Choi - arXiv preprint arXiv:2201.05955, 2022 - arxiv.org
A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often
rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We …

Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora

A Warstadt, A Mueller, L Choshen… - … of the BabyLM …, 2023 - research-collection.ethz.ch
Children can acquire language from less than 100 million words of input. Large language
models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data …

State-of-the-art generalisation research in NLP: a taxonomy and review

D Hupkes, M Giulianelli, V Dankers, M Artetxe… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to generalise well is one of the primary desiderata of natural language
processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is …

Can LLMs augment low-resource reading comprehension datasets? Opportunities and challenges

V Samuel, H Aynaou, AG Chowdhury… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have demonstrated impressive zero-shot performance on a
wide range of NLP tasks, demonstrating the ability to reason and apply commonsense. A …

On robustness of prompt-based semantic parsing with large pre-trained language model: An empirical study on Codex

TY Zhuo, Z Li, Y Huang, F Shiri, W Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Semantic parsing is a technique aimed at constructing a structured representation of the
meaning of a natural-language question. Recent advancements in few-shot language …