Google Scholar

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

TT Wang, J Hughes, H Sleight, R Schaeffer… - arxiv preprint arxiv …, 2024 - arxiv.org

Defending large language models against jailbreaks so that they never engage in a broadly-defined
set of forbidden behaviors is an open problem. In this paper, we investigate the …

Cited by 3. All versions: 2.

[PDF] arxiv.org

STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

N Raman, T Lundy, T Amin, J Perla… - arxiv preprint arxiv …, 2025 - arxiv.org

How should one judge whether a given large language model (LLM) can reliably perform
economic reasoning? Most existing LLM benchmarks focus on specific applications and fail …

[PDF] arxiv.org

On Adversarial Robustness and Out-of-Distribution Robustness of Large Language Models

A Yang, J Tab, P Shah, P Kotchavong - arxiv preprint arxiv:2412.10535, 2024 - arxiv.org

The increasing reliance on large language models (LLMs) for diverse applications
necessitates a thorough understanding of their robustness to adversarial perturbations and …

All versions: 3.

[PDF] openreview.net

Jailbreak Defense in a Narrow Domain: Failures of existing methods and Improving Transcript-Based Classifiers

TT Wang, J Hughes, H Sleight, R Schaeffer… - The Third Workshop on … - openreview.net

Defending large language models against jailbreaks so that they never engage in a broad
set of forbidden behaviors is an open problem. In this paper, we study if jailbreak-defense is …

All versions: 2.

[PDF] openreview.net

Large Language Models for Explainability in Machine Learning

D Beamish, G Exarchakis - openreview.net

We investigate the potential of large language models (LLMs) in explainable artificial
intelligence (XAI) by examining their ability to generate understandable explanations for …

Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts,...
