Jailbreak and guard aligned language models with only few in-context demonstrations
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their
safety and the risk of generating harmful content remain pressing concerns. In this paper, we …
Can LLM-generated misinformation be detected?
The advent of Large Language Models (LLMs) has made a transformative impact. However,
the potential that LLMs such as ChatGPT can be exploited to generate misinformation has …
WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs
We introduce WildGuard--an open, light-weight moderation tool for LLM safety that achieves
three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model …
An adversarial perspective on machine unlearning for AI safety
Large language models are finetuned to refuse questions about hazardous knowledge, but
these protections can often be bypassed. Unlearning methods aim at completely removing …
Generative language models exhibit social identity biases
Social identity biases, particularly the tendency to favor one's own group (ingroup solidarity)
and derogate other groups (outgroup hostility), are deeply rooted in human psychology and …
Can Editing LLMs Inject Harm?
Knowledge editing has been increasingly adopted to correct the false or outdated
knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored …
International Scientific Report on the Safety of Advanced AI (Interim Report)
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …
The art of saying no: Contextual noncompliance in language models
Chat-based language models are designed to be helpful, yet they should not comply with
every user request. While most existing work primarily focuses on refusal of "unsafe" …
What makes and breaks safety fine-tuning? A mechanistic study
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …
Safety cases for frontier AI
As frontier artificial intelligence (AI) systems become more capable, it becomes more
important that developers can explain why their systems are sufficiently safe. One way to do …