Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Prompting a pretrained transformer can be a universal approximator

A Petrov, PHS Torr, A Bibi - arXiv preprint arXiv:2402.14753, 2024 - arxiv.org
Despite the widespread adoption of prompting, prompt tuning and prefix-tuning of
transformer models, our theoretical understanding of these fine-tuning methods remains …

Pseudorandom error-correcting codes

M Christ, S Gunn - Annual International Cryptology Conference, 2024 - Springer
We construct pseudorandom error-correcting codes (or simply pseudorandom codes), which
are error-correcting codes with the property that any polynomial number of codewords are …

Advancing beyond identification: Multi-bit watermark for large language models

KY Yoo, W Ahn, N Kwak - arXiv preprint arXiv:2308.00221, 2023 - arxiv.org
We show the viability of tackling misuses of large language models beyond the identification
of machine-generated text. While existing zero-bit watermark methods focus on detection …

Injecting Undetectable Backdoors in Obfuscated Neural Networks and Language Models

A Kalavasis, A Karbasi, A Oikonomou… - Advances in …, 2025 - proceedings.neurips.cc
As ML models become increasingly complex and integral to high-stakes domains such as
finance and healthcare, they also become more susceptible to sophisticated adversarial …

Provably secure public-key steganography based on elliptic curve cryptography

X Zhang, K Chen, J Ding, Y Yang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Steganography is the technique of hiding secret messages within seemingly harmless
covers to elude examination by censors. Despite having been proposed several decades …

Exploring the relevance of data privacy-enhancing technologies for AI governance use cases

E Bluemke, T Collins, B Garfinkel, A Trask - arXiv preprint arXiv …, 2023 - arxiv.org
The development of privacy-enhancing technologies has made immense progress in
reducing trade-offs between privacy and performance in data exchange and analysis …

Minimum-entropy coupling approximation guarantees beyond the majorization barrier

S Compton, D Katz, B Qi… - International …, 2023 - proceedings.mlr.press
Given a set of discrete probability distributions, the minimum entropy coupling is the
minimum entropy joint distribution that has the input distributions as its marginals. This has …

Excuse me, sir? Your language model is leaking (information)

O Zamir - arXiv preprint arXiv:2401.10360, 2024 - arxiv.org
We introduce a cryptographic method to hide an arbitrary secret payload in the response of
a Large Language Model (LLM). A secret key is required to extract the payload from the …

Hidden in plain text: Emergence & mitigation of steganographic collusion in LLMs

Y Mathew, O Matthews, R McCarthy, J Velja… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid proliferation of frontier model agents promises significant societal advances but
also raises concerns about systemic risks arising from unsafe interactions. Collusion to the …