Can Editing LLMs Inject Harm?

C Chen, B Huang, Z Li, Z Chen, S Lai, X Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge editing has been increasingly adopted to correct the false or outdated
knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored …

An Overview of Trustworthy AI: Advances in IP Protection, Privacy-preserving Federated Learning, Security Verification, and GAI Safety Alignment

Y Zheng, CH Chang, SH Huang… - IEEE Journal on …, 2024 - ieeexplore.ieee.org
AI has undergone a remarkable evolutionary journey marked by groundbreaking milestones.
Like any powerful tool, it can be turned into a weapon for devastation in the wrong hands …

On evaluating the durability of safeguards for open-weight LLMs

X Qi, B Wei, N Carlini, Y Huang, T Xie, L He… - arXiv preprint arXiv …, 2024 - arxiv.org
Stakeholders, from model developers to policymakers, seek to minimize the dual-use risks
of large language models (LLMs). An open challenge to this goal is whether technical …

Differentially private kernel density estimation

E Liu, JYC Hu, A Reneau, Z Song, H Liu - arXiv preprint arXiv:2409.01688, 2024 - arxiv.org
We introduce a refined differentially private (DP) data structure for kernel density estimation
(KDE), offering not only an improved privacy-utility tradeoff but also better efficiency over prior …
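
For readers unfamiliar with DP KDE, the sketch below shows the standard output-perturbation baseline: a Gaussian-kernel density estimate with Laplace noise calibrated to its sensitivity. It is only an illustration of the textbook mechanism, not the refined data structure this paper proposes; the function name dp_gaussian_kde and all parameter choices are assumptions made for the example.

```python
import numpy as np

def dp_gaussian_kde(data, queries, bandwidth, epsilon, rng=None):
    """Output-perturbation (Laplace mechanism) sketch of epsilon-DP KDE.

    `data` and `queries` are 1-D NumPy arrays. This is a baseline
    illustration only, not the data structure from the paper above.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = data.shape[0]
    m = queries.shape[0]
    # Gaussian-kernel density estimate at each query point:
    # f_hat(q) = (1/(n*h)) * sum_i K((q - x_i)/h), K(u) = exp(-u^2/2)/sqrt(2*pi)
    u = (queries[:, None] - data[None, :]) / bandwidth
    kde = np.exp(-0.5 * u**2).sum(axis=1) / (n * bandwidth * np.sqrt(2.0 * np.pi))
    # Replacing one record shifts each estimate by at most
    # max(K) / (n*h) = 1 / (n*h*sqrt(2*pi)); over m queries the L1
    # sensitivity is m times that (naive sequential composition).
    l1_sensitivity = m / (n * bandwidth * np.sqrt(2.0 * np.pi))
    noise = rng.laplace(scale=l1_sensitivity / epsilon, size=m)
    return kde + noise

# Example: density of standard-normal samples at a few query points.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    q = np.array([-1.0, 0.0, 1.0])
    print(dp_gaussian_kde(x, q, bandwidth=0.2, epsilon=1.0, rng=rng))
```

In this baseline the noise scale grows with the number of queries and shrinks with the dataset size and bandwidth, which is exactly the privacy-utility tradeoff that refined DP-KDE constructions aim to improve.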

A Realistic Threat Model for Large Language Model Jailbreaks

V Boreiko, A Panfilov, V Voracek, M Hein… - arXiv preprint arXiv …, 2024 - arxiv.org
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from
safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing …

Data-Aware Training Quality Monitoring and Certification for Reliable Deep Learning

F Yeganegi, A Eamaz, M Soltanalian - arXiv preprint arXiv:2410.10984, 2024 - arxiv.org
Deep learning models excel at capturing complex representations through sequential layers
of linear and non-linear transformations, yet their inherent black-box nature and multi-modal …

AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models

Y Wang, J Chen, Q Li, X Yang, S Ji - arXiv preprint arXiv:2412.18123, 2024 - arxiv.org
As text-to-image (T2I) models continue to advance and gain widespread adoption, their
associated safety issues are becoming increasingly prominent. Malicious users often exploit …

Position: We Need An Adaptive Interpretation of Helpful, Honest, and Harmless Principles

Y Huang, C Gao, Y Zhou, K Guo, X Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
The Helpful, Honest, and Harmless (HHH) principle is a foundational framework for aligning
AI systems with human values. However, existing interpretations of the HHH principle often …

SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach

R Sun, J Chang, H Pearce, C Xiao, B Li, Q Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal foundation models (MFMs) represent a significant advancement in artificial
intelligence, combining diverse data modalities to enhance learning and understanding …

Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond

S Han - arXiv preprint arXiv:2410.18114, 2024 - arxiv.org
The advancements in generative AI inevitably raise concerns about their risks and safety
implications, which, in turn, catalyze significant progress in AI safety. However, as this …