Chatbot Arena: An open platform for evaluating LLMs by human preference

WL Chiang, L Zheng, Y Sheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have unlocked new capabilities and applications; however,
evaluating the alignment with human preferences still poses significant challenges. To …

A survey of attacks on large vision-language models: Resources, advances, and future trends

D Liu, M Yang, X Qu, P Zhou, Y Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org
With the significant development of large models in recent years, Large Vision-Language
Models (LVLMs) have demonstrated remarkable capabilities across a wide range of …

Introducing v0.5 of the AI Safety Benchmark from MLCommons

B Vidgen, A Agrawal, AM Ahmed, V Akinwande… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the
MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to …

SORRY-Bench: Systematically evaluating large language model safety refusal behaviors

T Xie, X Qi, Y Zeng, Y Huang, UM Sehwag… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user
requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts …

SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models

L Li, B Dong, R Wang, X Hu, W Zuo, D Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety
measures is paramount. To meet this crucial need, we propose SALAD-Bench, a …

WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs

S Han, K Rao, A Ettinger, L Jiang, BY Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce WildGuard--an open, light-weight moderation tool for LLM safety that achieves
three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model …

A Survey on LLM-as-a-Judge

J Gu, X Jiang, Z Shi, H Tan, X Zhai, C Xu, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …

ShieldGemma: Generative AI content moderation based on Gemma

W Zeng, Y Liu, R Mullins, L Peran, J Fernandez… - arXiv preprint arXiv …, 2024 - arxiv.org
We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation
models built upon Gemma2. These models provide robust, state-of-the-art predictions of …

R-Judge: Benchmarking safety risk awareness for LLM agents

T Yuan, Z He, L Dong, Y Wang, R Zhao, T Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have exhibited great potential in autonomously completing
tasks across real-world applications. Despite this, these LLM agents introduce unexpected …

RigorLLM: Resilient guardrails for large language models against undesired content

Z Yuan, Z Xiong, Y Zeng, N Yu, R Jia, D Song… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Large Language Models (LLMs) have showcased remarkable
capabilities across various tasks in different domains. However, the emergence of biases …