- Some languages encode grammatical gender (Spanish, Italian, Russian, etc.): one word form for male referents, another for female.
- Other languages are largely gender-neutral (English, Turkish, Finnish): a single word form covers both.
- When translating from a gender-neutral language into a gendered one, the translator must infer gender from context and make a decision.
- Prior examples include gender bias in visual Semantic Role Labeling (SRL): cooking is stereotypically done by women, construction work by men.
- Can we quantitatively evaluate gender translation in MT?
- How much does MT rely on gender stereotypes vs. meaningful context?
- Can we reduce gender bias by rephrasing source texts?
How It Works
The Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018) datasets, built on the Winograd schema, were used. Together they comprise 3,888 English sentences designed to test gender bias in coreference resolution.
These are very useful for evaluating gender bias in MT!
- Equally split between stereotypical and non-stereotypical role assignments (for every female doctor, there is also a female nurse).
- Gold annotations for gender.
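A single challenge-set instance could be modeled as below. This is an illustrative sketch only: the field names are ours, not the datasets' actual schema.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    """One challenge-set sentence with its gold gender annotation (illustrative schema)."""
    sentence: str        # English source sentence
    entity_index: int    # token index of the annotated profession
    gold_gender: str     # "male" or "female"
    stereotypical: bool  # does the role assignment match the gender stereotype?

example = Instance(
    sentence="The doctor asked the nurse to help her in the operation.",
    entity_index=1,       # "doctor"
    gold_gender="female",
    stereotypical=False,  # a female doctor is anti-stereotypical here
)
```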
Six widely used MT models, representing the state of the art in both commercial and academic research, were used:
- Google Translate
- Microsoft Translator
- Amazon Translate
- SYSTRAN
- the model of Ott et al. (2018), which recently achieved the best performance on English-to-French translation on the WMT'14 test set.
- the model of Edunov et al. (2018), the WMT'18 winner on English-to-German translation.
Input: MT model + target language
Output: Accuracy score for gender translation
- Translate the coreference bias datasets into target languages with grammatical gender.
- Align between source and target using fast_align (Dyer et al., 2013).
- Identify gender in the target language using off-the-shelf morphological analyzers or simple heuristics.
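The steps above can be sketched as a single evaluation loop. Here `translate`, `align`, and `predict_gender` stand in for an MT system call, fast_align, and a target-language morphological analyzer respectively; all three are placeholder assumptions, not the paper's actual code.

```python
def evaluate(instances, translate, align, predict_gender):
    """Fraction of instances whose entity gender is translated correctly.

    instances: iterable of (sentence, entity_index, gold_gender) tuples.
    translate, align, predict_gender: caller-supplied stand-ins for the
    MT system, the word aligner, and the morphological analyzer.
    """
    correct = 0
    total = 0
    for sentence, entity_index, gold_gender in instances:
        target = translate(sentence)              # 1. translate to a gendered language
        alignment = align(sentence, target)       # 2. source->target word alignment
        tgt_index = alignment[entity_index]       # locate the translated entity
        predicted = predict_gender(target, tgt_index)  # 3. morphological gender
        correct += (predicted == gold_gender)
        total += 1
    return correct / total
```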
To estimate noise, a sample of gender predictions was shown to native speakers of the target languages.
Estimated accuracy was above 85%, compared with 90% inter-annotator agreement (IAA).
Do gendered adjectives affect translation?
For the sentence:
“The doctor asked the nurse to help her in the operation.”
Black-box injection of gendered adjectives was performed:
“The pretty doctor asked the nurse to help her in the operation.” [Since doctor is female]
“The handsome nurse asked the doctor to help him in the operation.” [Since nurse is male]
Here, “pretty” and “handsome” are stereotypical adjectives for feminine and masculine genders respectively.
This approach improved performance for most tested languages and models. [mean +8.6%]
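The injection itself is purely textual and needs no access to the MT system's internals. A minimal sketch, mirroring the "pretty"/"handsome" example above (the helper name and the simple lowercase string match are our assumptions; capitalization handling is omitted for brevity):

```python
# Stereotypically gendered adjectives, as in the paper's example.
ADJECTIVE = {"female": "pretty", "male": "handsome"}

def inject_adjective(sentence: str, entity: str, gold_gender: str) -> str:
    """Prepend a gender-stereotypical adjective to the target entity (first match only)."""
    adjective = ADJECTIVE[gold_gender]
    return sentence.replace(f"the {entity}", f"the {adjective} {entity}", 1)

injected = inject_adjective(
    "The doctor asked the nurse to help her in the operation.",
    entity="nurse",
    gold_gender="male",  # the nurse is male in this example
)
# -> "The doctor asked the handsome nurse to help her in the operation."
```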
This metric shows that all tested systems perform significantly and consistently better on pro-stereotypical assignments (e.g., a female nurse), while their performance deteriorates on anti-stereotypical roles (e.g., a male receptionist).
Limitations and Future Work
- Artificially created dataset.
- Allows for controlled experiment.
- Yet, it might introduce its own annotation biases.
- Medium Size
- Easy to overfit.
Conclusions
- First quantitative automatic evaluation of gender bias in MT.
- 6 SOTA MT models on 8 diverse target languages.
- Doesn’t require reference translations.
- Significant gender bias found in all models in all tested languages.
- Easily extensible with more languages and MT models.
- Stanovsky, G., Smith, N. A., & Zettlemoyer, L. (2019). Evaluating Gender Bias in Machine Translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1679–1684. https://doi.org/10.18653/v1/P19-1164
Abstract: We present the first challenge set and evaluation protocol for the analysis of gender bias in machine translation (MT). Our approach uses two recent coreference resolution datasets composed of English sentences which cast participants into non-stereotypical gender roles (e.g., “The doctor asked the nurse to help her in the operation”). We devise an automatic gender bias evaluation method for eight target languages with grammatical gender, based on morphological analysis (e.g., the use of female inflection for the word “doctor”). Our analyses show that four popular industrial MT systems and two recent state-of-the-art academic MT models are significantly prone to gender-biased translation errors for all tested target languages. Our data and code are publicly available at https://github.com/gabrielStanovsky/mt_gender.