SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment

Q Liu, F Wang, C Xiao, M Chen - arXiv preprint arXiv:2410.14676, 2024 - arxiv.org
Existing preference alignment is a one-size-fits-all alignment mechanism, where the part of
the large language model (LLM) parametric knowledge with non-preferred features is …

Universal Black-Box Reward Poisoning Attack against Offline Reinforcement Learning

Y Xu, R Gumaste, G Singh - arXiv preprint arXiv:2402.09695, 2024 - arxiv.org
We study the problem of universal black-box reward poisoning attacks against general
offline reinforcement learning with deep neural networks. We consider a black-box threat …