XSTest: A test suite for identifying exaggerated safety behaviours in large language models

P Röttger, HR Kirk, B Vidgen, G Attanasio… - arXiv preprint arXiv …, 2023 - arxiv.org
Without proper safeguards, large language models will readily follow malicious instructions
and generate toxic content. This risk motivates safety efforts such as red-teaming and large …

ROBBIE: Robust bias evaluation of large generative language models

D Esiobu, X Tan, S Hosseini, M Ung, Y Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
As generative large language models (LLMs) grow more performant and prevalent, we must
develop sufficiently comprehensive tools to measure and improve their fairness. Different …

Hate speech detection: A comprehensive review of recent works

A Gandhi, P Ahir, K Adhvaryu, P Shah… - Expert …, 2024 - Wiley Online Library
There has been a surge in the usage of the Internet and social media platforms, which has
led to a rise in online hate speech targeted at individuals or groups. In recent years, hate …

Recent advances in hate speech moderation: Multimodality and the role of large models

MS Hee, S Sharma, R Cao, P Nandi… - arXiv preprint arXiv …, 2024 - arxiv.org
In the evolving landscape of online communication, moderating hate speech (HS) presents
an intricate challenge, compounded by the multimodal nature of digital content. This …

CultureLLM: Incorporating cultural differences into large language models

C Li, M Chen, J Wang, S Sitaram, X Xie - arXiv preprint arXiv:2402.10946, 2024 - arxiv.org
Large language models (LLMs) are reported to be partial to certain cultures owing to the
dominance of English corpora in their training data. Since multilingual cultural data are often …

Evaluating ChatGPT's performance for multilingual and emoji-based hate speech detection

M Das, SK Pandey, A Mukherjee - arXiv preprint arXiv:2305.13276, 2023 - arxiv.org
Hate speech is a severe issue that affects many online platforms. So far, several studies
have been performed to develop robust hate speech detection systems. Large language …

Validating multimedia content moderation software via semantic fusion

W Wang, J Huang, C Chen, J Gu, J Zhang… - Proceedings of the …, 2023 - dl.acm.org
The exponential growth of social media platforms, such as Facebook, Instagram, YouTube,
and TikTok, has revolutionized communication and content publication in human society …

Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore

J Haber, B Vidgen, M Chapman… - Proceedings of the …, 2023 - aclanthology.org
Toxic content is a global problem, but most resources for detecting toxic content are in
English. When datasets are created in other languages, they often focus exclusively on one …

Exploring Amharic hate speech data collection and classification approaches

AA Ayele, SM Yimam, TD Belay, T Asfaw… - Proceedings of the …, 2023 - aclanthology.org
In this paper, we present a study of efficient data selection and annotation strategies for
Amharic hate speech. We also build various classification models and investigate the …

JailbreakHunter: A visual analytics approach for jailbreak prompts discovery from large-scale human-LLM conversational datasets

Z Jin, S Liu, H Li, X Zhao, H Qu - arXiv preprint arXiv:2407.03045, 2024 - arxiv.org
Large Language Models (LLMs) have gained significant attention but also raised concerns
due to the risk of misuse. Jailbreak prompts, a popular type of adversarial attack towards …