Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

TT Wang, J Hughes, H Sleight, R Schaeffer… - arXiv preprint arXiv …, 2024 - arxiv.org
Defending large language models against jailbreaks so that they never engage in a
broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the …

STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

N Raman, T Lundy, T Amin, J Perla… - arXiv preprint arXiv …, 2025 - arxiv.org
How should one judge whether a given large language model (LLM) can reliably perform
economic reasoning? Most existing LLM benchmarks focus on specific applications and fail …

On Adversarial Robustness and Out-of-Distribution Robustness of Large Language Models

A Yang, J Tab, P Shah, P Kotchavong - arXiv preprint arXiv:2412.10535, 2024 - arxiv.org
The increasing reliance on large language models (LLMs) for diverse applications
necessitates a thorough understanding of their robustness to adversarial perturbations and …

Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers

TT Wang, J Hughes, H Sleight, R Schaeffer… - The Third Workshop on … - openreview.net
Defending large language models against jailbreaks so that they never engage in a broad
set of forbidden behaviors is an open problem. In this paper, we study if jailbreak-defense is …

Large Language Models for Explainability in Machine Learning

D Beamish, G Exarchakis - openreview.net
We investigate the potential of large language models (LLMs) in explainable artificial
intelligence (XAI) by examining their ability to generate understandable explanations for …