# Introduction

• Goal of hate detection: make the internet less toxic.
• ML toxicity models are generally trained on text-only annotations and have no information about the speaker.
• Current datasets ignore the social context of speech, such as the identity of the speaker and the dialect of English used.
• Ignoring these nuances risks harming minority populations by suppressing inoffensive speech.

# Problem

• Severe racial bias in hate speech detection.
• Do ML models acquire this racial bias from datasets?
• Can annotation task design affect these racial biases?

# How the Paper Addresses It

The paper empirically characterizes the racial bias present in several widely used Twitter corpora annotated for toxic content and quantifies how this bias propagates through models trained on them. The authors find strong associations between AAE markers (e.g., “n*ggas”, “ass”) and toxicity annotations, and show that models acquire and replicate this bias: in other corpora, tweets inferred to be in AAE and tweets from self-identifying African American users are more likely to be classified as offensive.

Dialect is used as a proxy for racial identity; the paper focuses on African American English (AAE). The lexical detector of Blodgett et al. (2016) was used to infer the presence of AAE. To find out whether ML models pick up the racial bias in their training data, classifiers were trained and tested on two corpora used in hate detection systems (TWT-HATEBASE, TWT-BOOTSTRAP) to predict the toxicity label of a tweet. Their performance was assessed on a held-out set broken down by dialect group, by counting the mistakes made in each group. Each classifier is trained by minimizing the cross-entropy of the annotated class conditional on the text $x$:

\begin{align*} p(\text{class} \mid x) \propto \exp(\mathbf{W}_o \mathbf{h} + \mathbf{b}_o) \end{align*}

with $\mathbf{h} = f(x)$, where $f$ is a BiLSTM with attention followed by a projection layer, which encodes the tweet into an $H$-dimensional vector.
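As a minimal sketch of the output layer described above, the snippet below turns an encoded tweet into class probabilities and computes the cross-entropy for one example. It does not implement the BiLSTM encoder $f$: the vector `h` is a hand-picked stand-in, and the weights, bias, and function names are all hypothetical values chosen for illustration, not taken from the paper.

```python
import math


def class_probabilities(W_o, h, b_o):
    """Softmax over the logits W_o h + b_o, i.e. p(class | x)."""
    logits = [
        sum(w * x for w, x in zip(row, h)) + b
        for row, b in zip(W_o, b_o)
    ]
    # Subtract the max logit for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, gold):
    """Negative log-probability of the annotated (gold) class index."""
    return -math.log(probs[gold])


# Hypothetical encoded tweet with H = 3, standing in for h = f(x).
h = [0.5, -1.0, 2.0]
# Hypothetical output layer for two classes.
W_o = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
b_o = [0.0, 0.1]

probs = class_probabilities(W_o, h, b_o)
loss = cross_entropy(probs, gold=1)
```

Training would average this loss over the annotated tweets and minimize it with respect to the encoder and output-layer parameters.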

Predictions from both classifiers were biased against AAE tweets, as the following results show:

46% of non-offensive AAE tweets were mistaken for offensive, compared with 9% of White tweets.

26% of non-abusive AAE tweets were mistaken for abusive, compared with 5% of White tweets.

Tweets in AAE and tweets by self-identified African American users were more often flagged as toxic, and this racial bias generalizes to other Twitter corpora.
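The per-group error counts behind figures like these can be sketched as a false positive rate computed separately for each dialect group. The function name, the tuple layout, and the toy held-out set below are hypothetical and only illustrate the bookkeeping; they are not the paper's data or code.

```python
def false_positive_rate(examples, target_label):
    """Per-group share of tweets NOT annotated as target_label
    that the classifier nevertheless predicted as target_label.

    examples: iterable of (group, gold_label, predicted_label) tuples.
    """
    counts = {}  # group -> [number of negatives, number of false positives]
    for group, gold, pred in examples:
        if gold == target_label:
            continue  # only tweets whose gold label differs can be false positives
        stats = counts.setdefault(group, [0, 0])
        stats[0] += 1
        if pred == target_label:
            stats[1] += 1
    return {group: fp / n for group, (n, fp) in counts.items()}


# Hypothetical held-out predictions: (dialect group, gold label, predicted label).
held_out = [
    ("AAE", "none", "offensive"),
    ("AAE", "none", "none"),
    ("AAE", "offensive", "offensive"),
    ("White", "none", "none"),
    ("White", "none", "none"),
    ("White", "none", "offensive"),
    ("White", "none", "none"),
    ("White", "offensive", "offensive"),
]

fpr = false_positive_rate(held_out, target_label="offensive")
```

On this toy set, one of two non-offensive AAE tweets and one of four non-offensive White tweets are flagged, so the rates differ by group even though the classifier is right on most examples.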

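How can the bias be reduced?

The hypothesis is that making annotators aware of a tweet's dialect, and of the likely race of its author, reduces the bias in their labels. To test this hypothesis, 350 AAE tweets stratified by dataset labels were given to Amazon Mechanical Turk workers.

The annotation was done under three conditions: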

1. Text only (55% were labelled as offensive).
2. Text + dialect information (44% were labelled as offensive).
3. Text + dialect + race information (44% were labelled as offensive).

Annotators are substantially more likely to rate a tweet as offensive to someone than to rate it as offensive to themselves, suggesting that people recognize the subjectivity of offensive language.

# Results

While both models achieve high accuracy, the false positive rates (FPR) differ across groups for several toxicity labels. The DWMW17 classifier predicts almost 50% of non-offensive AAE tweets as being offensive, and FDCL18 classifier shows higher FPR for the “Abusive” and “Hateful” categories for AAE tweets. Additionally, both classifiers show strong tendencies to label White tweets as “none”. These discrepancies in FPR across groups violate the equality of opportunity criterion, indicating discriminatory impact.
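One way to write the criterion down, as a sketch rather than the paper's own notation: let $y$ be the annotated label, $\hat{y}$ the predicted label, $d$ the inferred dialect group, and $c$ a toxicity label, and treat "not being flagged as $c$" as the favorable outcome (an assumption made here for illustration). Equality of opportunity then asks for equal false positive rates across groups:

\begin{align*}
\mathrm{FPR}_g(c) &= P(\hat{y} = c \mid y \neq c,\ d = g) \\
\mathrm{FPR}_{\text{AAE}}(c) &= \mathrm{FPR}_{\text{White}}(c)
\end{align*}

Using the error rates quoted earlier, the gaps are far from zero:

\begin{align*}
c = \text{offensive}: &\quad 0.46 - 0.09 = 0.37 \\
c = \text{abusive}: &\quad 0.26 - 0.05 = 0.21
\end{align*}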

# Takeaways

• Evidence of racial bias in existing hate detection corpora.
• Bias will propagate downstream through ML models.
• Priming annotators with dialect and race information influences their offensiveness labels.
• In general, hate speech is highly subjective and contextual. Factors such as slang, dialect, and slurs must be taken into consideration.

# References

• Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. (2019). The Risk of Racial Bias in Hate Speech Detection. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–1678. https://doi.org/10.18653/v1/P19-1163