Can Editing LLMs Inject Harm?
Knowledge editing has been increasingly adopted to correct the false or outdated
knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored …
An Overview of Trustworthy AI: Advances in IP Protection, Privacy-preserving Federated Learning, Security Verification, and GAI Safety Alignment
AI has undergone a remarkable evolution journey marked by groundbreaking milestones.
Like any powerful tool, it can be turned into a weapon for devastation in the wrong hands …
On Evaluating the Durability of Safeguards for Open-Weight LLMs
Stakeholders — from model developers to policymakers — seek to minimize the dual-use risks
of large language models (LLMs). An open challenge to this goal is whether technical …
Differentially private kernel density estimation
We introduce a refined differentially private (DP) data structure for kernel density estimation
(KDE), offering not only improved privacy-utility tradeoff but also better efficiency over prior …
A Realistic Threat Model for Large Language Model Jailbreaks
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from
safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing …
Data-Aware Training Quality Monitoring and Certification for Reliable Deep Learning
Deep learning models excel at capturing complex representations through sequential layers
of linear and non-linear transformations, yet their inherent black-box nature and multi-modal …
AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models
As text-to-image (T2I) models continue to advance and gain widespread adoption, their
associated safety issues are becoming increasingly prominent. Malicious users often exploit …
Position: We Need An Adaptive Interpretation of Helpful, Honest, and Harmless Principles
The Helpful, Honest, and Harmless (HHH) principle is a foundational framework for aligning
AI systems with human values. However, existing interpretations of the HHH principle often …
SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach
Multimodal foundation models (MFMs) represent a significant advancement in artificial
intelligence, combining diverse data modalities to enhance learning and understanding …
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
S. Han - arXiv preprint arXiv:2410.18114, 2024 - arxiv.org
The advancements in generative AI inevitably raise concerns about their risks and safety
implications, which, in turn, catalyzes significant progress in AI safety. However, as this …