Jailbroken: How does LLM safety training fail?

A Wei, N Haghtalab… - Advances in Neural …, 2023 - proceedings.neurips.cc
Large language models trained for safety and harmlessness remain susceptible to
adversarial misuse, as evidenced by the prevalence of “jailbreak” attacks on early releases …

[PDF] DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models.

B Wang, W Chen, H Pei, C Xie, M Kang, C Zhang, C Xu… - NeurIPS, 2023 - blogs.qub.ac.uk
Generative Pre-trained Transformer (GPT) models have exhibited exciting progress
in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the …

[PDF] TrustLLM: Trustworthiness in large language models

L Sun, Y Huang, H Wang, S Wu, Q Zhang… - arXiv preprint arXiv …, 2024 - mosis.eecs.utk.edu
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …

Pretraining language models with human preferences

T Korbak, K Shi, A Chen, RV Bhalerao… - International …, 2023 - proceedings.mlr.press
Language models (LMs) are pretrained to imitate text from large and diverse
datasets that contain content that would violate human preferences if generated by an LM …

GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher

Y Yuan, W Jiao, W Wang, J Huang, P He, S Shi… - arXiv preprint arXiv …, 2023 - arxiv.org
Safety lies at the core of the development of Large Language Models (LLMs). There is
ample work on aligning LLMs with human ethics and preferences, including data filtering in …

[HTML] Contemporary approaches in evolving language models

D Oralbekova, O Mamyrbayev, M Othman… - Applied Sciences, 2023 - mdpi.com
This article provides a comprehensive survey of contemporary language modeling
approaches within the realm of natural language processing (NLP) tasks. This paper …

DeepInception: Hypnotize large language model to be jailbreaker

X Li, Z Zhou, J Zhu, J Yao, T Liu, B Han - arXiv preprint arXiv:2311.03191, 2023 - arxiv.org
Despite remarkable success in various applications, large language models (LLMs) are
vulnerable to adversarial jailbreaks that make the safety guardrails void. However, previous …

[HTML] Position: TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu… - International …, 2024 - proceedings.mlr.press
Large language models (LLMs) have gained considerable attention for their excellent
natural language processing capabilities. Nonetheless, these LLMs present many …

Factuality enhanced language models for open-ended text generation

N Lee, W Ping, P Xu, M Patwary… - Advances in …, 2022 - proceedings.neurips.cc
Pretrained language models (LMs) are susceptible to generating text with nonfactual
information. In this work, we measure and improve the factual accuracy of large-scale LMs …

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

P Ding, J Kuang, D Ma, X Cao, Y Xian, J Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide
useful and safe responses. However, adversarial prompts known as 'jailbreaks' can …