" I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset

EM Smith, M Hall, M Kambadur, E Presani… - arXiv preprint arXiv …, 2022 - arxiv.org
As language models grow in popularity, it becomes increasingly important to clearly
measure all possible markers of demographic identity in order to avoid perpetuating existing …

Designing responsible AI: Adaptations of UX practice to meet responsible AI challenges

Q Wang, M Madaio, S Kane, S Kapania… - Proceedings of the …, 2023 - dl.acm.org
Technology companies continue to invest in efforts to incorporate responsibility in their
Artificial Intelligence (AI) advancements, while efforts to audit and regulate AI systems …

QuALITY: Question answering with long input texts, yes!

RY Pang, A Parrish, N Joshi, N Nangia, J Phang… - arXiv preprint arXiv …, 2021 - arxiv.org
To enable building and testing models on long-document comprehension, we introduce
QuALITY, a multiple-choice QA dataset with context passages in English that have an …

WANLI: Worker and AI collaboration for natural language inference dataset creation

A Liu, S Swayamdipta, NA Smith, Y Choi - arXiv preprint arXiv:2201.05955, 2022 - arxiv.org
A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often
rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We …

WebQA: Multihop and multimodal QA

Y Chang, M Narang, H Suzuki… - Proceedings of the …, 2022 - openaccess.thecvf.com
Scaling Visual Question Answering (VQA) to the open-domain and multi-hop nature
of web searches requires fundamental advances in visual representation learning …

Don't blame the annotator: Bias already starts in the annotation instructions

M Parmar, S Mishra, M Geva, C Baral - arXiv preprint arXiv:2205.00415, 2022 - arxiv.org
In recent years, progress in NLU has been driven by benchmarks. These benchmarks are
typically collected by crowdsourcing, where annotators write examples based on annotation …

CREAK: A dataset for commonsense reasoning over entity knowledge

Y Onoe, MJQ Zhang, E Choi, G Durrett - arXiv preprint arXiv:2109.01653, 2021 - arxiv.org
Most benchmark datasets targeting commonsense reasoning focus on everyday scenarios:
physical knowledge like knowing that you could fill a cup under a waterfall [Talmor et al …

Multimodal large language models for inclusive collaboration learning tasks

A Lewis - Proceedings of the 2022 Conference of the North …, 2022 - aclanthology.org
This PhD project leverages advancements in multimodal large language models to build an
inclusive collaboration feedback loop, in order to facilitate the automated detection …

Analyzing dynamic adversarial training data in the limit

E Wallace, A Williams, R Jia, D Kiela - arXiv preprint arXiv:2110.08514, 2021 - arxiv.org
To create models that are robust across a wide range of test inputs, training datasets should
include diverse examples that span numerous phenomena. Dynamic adversarial data …

A chain-of-thought is as strong as its weakest link: A benchmark for verifiers of reasoning chains

A Jacovi, Y Bitton, B Bohnet, J Herzig… - arXiv preprint arXiv …, 2024 - arxiv.org
Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the
prominent approach for complex reasoning tasks, where more accurate reasoning chains …