Towards generalisable hate speech detection: a review on obstacles and solutions

W Yin, A Zubiaga - PeerJ Computer Science, 2021 - peerj.com
Hate speech is one type of harmful online content which directly attacks or promotes hate
towards a group or an individual member based on their actual or perceived aspects of …

Handling bias in toxic speech detection: A survey

T Garg, S Masud, T Suresh, T Chakraborty - ACM Computing Surveys, 2023 - dl.acm.org
Detecting online toxicity has always been a challenge due to its inherent subjectivity. Factors
such as the context, geography, socio-political climate, and background of the producers …

XSTest: A test suite for identifying exaggerated safety behaviours in large language models

P Röttger, HR Kirk, B Vidgen, G Attanasio… - arXiv preprint arXiv …, 2023 - arxiv.org
Without proper safeguards, large language models will readily follow malicious instructions
and generate toxic content. This risk motivates safety efforts such as red-teaming and large …

Five sources of bias in natural language processing

D Hovy, S Prabhumoye - Language and linguistics compass, 2021 - Wiley Online Library
Recently, there has been an increased interest in demographically grounded bias in natural
language processing (NLP) applications. Much of the recent work has focused on describing …

Nationality bias in text generation

PN Venkit, S Gautam, R Panchanadikar… - arXiv preprint arXiv …, 2023 - arxiv.org
Little attention has been paid to analyzing nationality bias in language models, especially when
nationality is widely used as a factor to improve the performance of social NLP models …

HateCheck: Functional tests for hate speech detection models

P Röttger, B Vidgen, D Nguyen, Z Waseem… - arXiv preprint arXiv …, 2020 - arxiv.org
Detecting online hate is a difficult task that even state-of-the-art models struggle with.
Typically, hate speech detection models are evaluated by measuring their performance on …

Learning from the worst: Dynamically generated datasets to improve online hate detection

B Vidgen, T Thrush, Z Waseem, D Kiela - arXiv preprint arXiv:2012.15761, 2020 - arxiv.org
We present a human-and-model-in-the-loop process for dynamically generating datasets
and training better-performing and more robust hate detection models. We provide a new …

HONEST: Measuring hurtful sentence completion in language models

D Nozza, F Bianchi, D Hovy - … of the 2021 conference of the …, 2021 - iris.unibocconi.it
Language models have revolutionized the field of NLP. However, language models
capture and proliferate hurtful stereotypes, especially in text generation. Our results show …

Hate speech classifiers learn normative social stereotypes

AM Davani, M Atari, B Kennedy… - Transactions of the …, 2023 - direct.mit.edu
Social stereotypes negatively impact individuals' judgments about different groups and may
have a critical role in understanding language directed toward marginalized groups. Here …

A survey on gender bias in natural language processing

K Stanczak, I Augenstein - arXiv preprint arXiv:2112.14168, 2021 - arxiv.org
Language can be used as a means of reproducing and enforcing harmful stereotypes and
biases and has been analysed as such in numerous studies. In this paper, we present a …