Red-Teaming for Generative AI: Silver Bullet or Security Theater?

M Feffer, A Sinha, WH Deng, ZC Lipton… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

Zero shot VLMs for hate meme detection: Are we there yet?

N Rizwan, P Bhaskar, M Das, SS Majhi, P Saha… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimedia content on social media is rapidly evolving, with memes gaining prominence as
a distinctive form. Unfortunately, some malicious users exploit memes to target individuals or …

Hate Speech Detection using Large Language Models: A Comprehensive Review

A Albladi, M Islam, A Das, M Bigonah, Z Zhang… - IEEE …, 2025 - ieeexplore.ieee.org
The widespread use of social media and other online platforms has facilitated
unprecedented communication and information exchange. However, it has also led to the …

MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation

L Wang, X Xu, L Zhang, J Lu, Y Xu, H Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Automatic detection of multimodal misinformation has gained widespread attention
recently. However, the potential of powerful Large Language Models (LLMs) for multimodal …

DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection

H Wan, S Feng, Z Tan, H Wang, Y Tsvetkov… - arXiv preprint arXiv …, 2024 - arxiv.org
Challenges in factuality and hallucination prevent large language models from being
employed directly off-the-shelf to judge the veracity of news articles, where factual …

Hate Personified: Investigating the role of LLMs in content moderation

S Masud, S Singh, V Hangya, A Fraser… - arXiv preprint arXiv …, 2024 - arxiv.org
For subjective tasks such as hate detection, where people perceive hate differently, the
ability of Large Language Models (LLMs) to represent diverse groups is unclear. By including …

DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion

Y Li, H Jiang, C Gong, Z Wei - arXiv preprint arXiv:2404.10464, 2024 - arxiv.org
Despite the remarkable achievements of language models (LMs) across a broad spectrum
of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current …

A Survey on Online Aggression: Content Detection and Behavioural Analysis on Social Media Platforms

S Mane, S Kundu, R Sharma - ACM Computing Surveys, 2023 - dl.acm.org
The proliferation of social media has increased cyber-aggressive behavior carried out
under the cover of free speech, posing societal risks ranging from online anonymity to real-world consequences …

What Does the Bot Say? Opportunities and Risks of Large Language Models in Social Media Bot Detection

S Feng, H Wan, N Wang, Z Tan, M Luo… - arXiv preprint arXiv …, 2024 - arxiv.org
Social media bot detection has always been an arms race between advancements in
machine learning bot detectors and adversarial bot strategies to evade detection. In this …

Decoding Hate: Exploring Language Models' Reactions to Hate Speech

P Piot, J Parapar - arXiv preprint arXiv:2410.00775, 2024 - arxiv.org
Hate speech is a harmful form of online expression, often manifesting as derogatory posts,
and it poses a significant risk in digital environments. With the rise of Large Language Models (LLMs) …