Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H Inan, K Upasani, J Chi, R Rungta, K Iyer… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Llama Guard, an LLM-based input-output safeguard model geared towards
Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a …

On the rise of fear speech in online social media

P Saha, K Garimella, NK Kalyan… - Proceedings of the …, 2023 - National Acad Sciences
Recently, social media platforms have been heavily moderated to prevent the spread of online hate
speech, which is usually rife with toxic words and is directed toward an individual or a …

Just say no: Analyzing the stance of neural dialogue generation in offensive contexts

A Baheti, M Sap, A Ritter, M Riedl - arXiv preprint arXiv:2108.11830, 2021 - arxiv.org
Dialogue models trained on human conversations inadvertently learn to generate toxic
responses. In addition to producing explicitly offensive utterances, these models can also …

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

S Han, K Rao, A Ettinger, L Jiang, BY Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce WildGuard, an open, lightweight moderation tool for LLM safety that achieves
three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model …

Multilingual Abusive Comment Detection at Scale for Indic Languages

V Gupta, S Roychowdhury, M Das… - Advances in …, 2022 - proceedings.neurips.cc
Social media platforms were conceived to act as online 'town squares' where people could
get together, share information, and communicate with each other peacefully. However …

When Do Annotator Demographics Matter? Measuring the Influence of Annotator Demographics with the POPQUORN Dataset

J Pei, D Jurgens - arXiv preprint arXiv:2306.06826, 2023 - arxiv.org
Annotators are not fungible. Their demographics, life experiences, and backgrounds all
contribute to how they label data. However, NLP has only recently considered how …

Position: Measure Dataset Diversity, Don't Just Claim It

D Zhao, JTA Andrews, O Papakyriakopoulos… - arXiv preprint arXiv …, 2024 - arxiv.org
Machine learning (ML) datasets, often perceived as neutral, inherently encapsulate abstract
and disputed social constructs. Dataset curators frequently employ value-laden terms such …

BERT-based Approach to Arabic Hate Speech and Offensive Language Detection in Twitter: Exploiting Emojis and Sentiment Analysis

MJ Althobaiti - … Journal of Advanced Computer Science and …, 2022 - search.proquest.com
User-generated content on the internet, including that on social media, may contain
offensive language and hate speech, which negatively affect the mental health of the whole …

"Fifty Shades of Bias": Normative Ratings of Gender Bias in GPT Generated English Text

R Hada, A Seth, H Diddee, K Bali - arXiv preprint arXiv:2310.17428, 2023 - arxiv.org
Language serves as a powerful tool for the manifestation of societal belief systems. In doing
so, it also perpetuates the prevalent biases in our society. Gender bias is one of the most …

The Unseen Targets of Hate: A Systematic Review of Hateful Communication Datasets

Z Yu, I Sen, D Assenmacher… - Social Science …, 2024 - journals.sagepub.com
Machine learning (ML)-based content moderation tools are essential to keep online spaces
free from hateful communication. Yet ML tools can only be as capable as the quality of the …