" My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

X Wang, B Ma, C Hu, L Weber-Genzel… - arXiv preprint arXiv …, 2024 - arxiv.org
The open-ended nature of language generation makes the evaluation of autoregressive
large language models (LLMs) challenging. One common evaluation approach uses …

CVQA: Culturally-diverse multilingual visual question answering benchmark

D Romero, C Lyu, HA Wibowo, T Lynn, I Hamed… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used
to test the ability of vision-language models to understand and reason on knowledge …

Are Large Language Models Consistent over Value-laden Questions?

J Moore, T Deshpande, D Yang - arXiv preprint arXiv:2407.02996, 2024 - arxiv.org
Large language models (LLMs) appear to bias their survey answers toward certain values.
Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are …

Take care of your prompt bias! Investigating and mitigating prompt bias in factual knowledge extraction

Z Xu, K Peng, L Ding, D Tao, X Lu - arXiv preprint arXiv:2403.09963, 2024 - arxiv.org
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias"
in factual knowledge extraction, i.e., prompts tend to introduce biases toward specific labels …

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

R Schaeffer, H Schoelkopf, B Miranda… - arXiv preprint arXiv …, 2024 - arxiv.org
Predictable behavior from scaling advanced AI systems is an extremely desirable property.
Although a well-established literature exists on how pretraining performance scales, the …

Look at the text: Instruction-tuned language models are more robust multiple choice selectors than you think

X Wang, C Hu, B Ma, P Röttger, B Plank - arXiv preprint arXiv:2404.08382, 2024 - arxiv.org
Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large
language models (LLMs). One common way to evaluate the model response is to rank the …

Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?

L Parcalabescu, A Frank - arXiv preprint arXiv:2404.18624, 2024 - arxiv.org
Vision and language model (VLM) decoders are currently the best-performing architectures
on multimodal tasks. In addition to answers, they are able to produce natural language …

A Study on Large Language Models' Limitations in Multiple-Choice Question Answering

A Khatun, DG Brown - arXiv preprint arXiv:2401.07955, 2024 - arxiv.org
The adoption of Large Language Models (LLMs) has become widespread,
particularly with the emergence of open-source models. More importantly, smaller models …

(Perhaps) beyond human translation: Harnessing multi-agent collaboration for translating ultra-long literary texts

M Wu, Y Yuan, G Haffari, L Wang - arXiv preprint arXiv:2405.11804, 2024 - arxiv.org
Recent advancements in machine translation (MT) have significantly enhanced translation
quality across various domains. However, the translation of literary texts remains a …

Benchmarking Distributional Alignment of Large Language Models

N Meister, C Guestrin, T Hashimoto - arXiv preprint arXiv:2411.05403, 2024 - arxiv.org
Language models (LMs) are increasingly used as simulacra for people, yet their ability to
match the distribution of views of a specific demographic group and be distributionally …