SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment

Q Liu, F Wang, C Xiao, M Chen - arXiv preprint arXiv:2410.14676, 2024 - arxiv.org
Existing preference alignment is a one-size-fits-all alignment mechanism, where the part of
the large language model (LLM) parametric knowledge with non-preferred features is …

Universal Black-Box Reward Poisoning Attack against Offline Reinforcement Learning

Y Xu, R Gumaste, G Singh - arXiv preprint arXiv:2402.09695, 2024 - arxiv.org
We study the problem of universal black-box reward poisoning attacks against general
offline reinforcement learning with deep neural networks. We consider a black-box threat …