
Introduction

  • Some languages encode grammatical gender, e.g., Spanish, Italian, and Russian: one word form for a male referent and another for a female one.
  • Other languages are gender-neutral, e.g., English, Turkish, and Finnish: a single word form covers both male and female.
  • When translating from a genderless language into a gendered one, the translator has to take gender into account and make decisions.
[Figure: An example of gender bias in machine translation from English (top) to Spanish (bottom).]
  • Similar gender biases have been shown in visual Semantic Role Labeling (SRL): cooking is stereotypically done by women, construction work by men.

Problem

  • Can we quantitatively evaluate gender translation in MT?
  • How much does MT rely on gender stereotypes vs. meaningful context?
  • Can we reduce gender bias by rephrasing source texts?

How it Solves

The Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018) datasets, which follow the Winograd schema, were used. Together they comprise 3,888 English sentences designed to test gender bias in coreference resolution.

[Figure: For every female doctor in the example, there is a male doctor as a counterexample.]

These are very useful for evaluating gender bias in MT!

  • Equally split between stereotypical and non-stereotypical role assignments (for every female doctor, there is also a female nurse).
  • Gold gender annotations for each entity (see the sketch below).
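
To make the structure concrete, here is a minimal sketch of what such entries could look like once combined for MT evaluation. The field names and exact sentences are illustrative, not the datasets' actual schema.

```python
# Illustrative WinoMT-style entries (hypothetical field names): each example
# records the English sentence, the entity whose gender is tested, its gold
# gender, and whether the role assignment matches the stereotype.
examples = [
    {   # anti-stereotypical: the pronoun marks the doctor as female
        "sentence": "The doctor asked the nurse to help her in the operation.",
        "entity": "doctor",
        "gold_gender": "female",
        "stereotypical": False,
    },
    {   # stereotypical counterpart: the pronoun marks the doctor as male
        "sentence": "The doctor asked the nurse to help him in the operation.",
        "entity": "doctor",
        "gold_gender": "male",
        "stereotypical": True,
    },
]
```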

Six widely used MT models, representing the state of the art in both commercial and academic research, were evaluated:

  1. Google Translate
  2. Microsoft Translator
  3. Amazon Translate
  4. SYSTRAN
  5. The model of Ott et al. (2018), which recently achieved the best performance on English-to-French translation on the WMT'14 test set.
  6. The model of Edunov et al. (2018), the WMT'18 winner on English-to-German translation.

Implementation

Input: MT model + target language
Output: accuracy score for gender translation

  1. Translate the coreference bias datasets.
    • Into target languages with grammatical gender.
  2. Align the source and target sentences.
  3. Identify the gender in the target language.
    • Using off-the-shelf morphological analyzers or simple heuristics for the target languages (see the sketch below).
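
A minimal sketch of steps 2–3 and the final accuracy computation, assuming translations and source–target word alignments are already available (e.g., from an MT system and a word aligner). The Spanish gender heuristic below is a deliberate simplification of the morphological analysis the paper relies on.

```python
# Sketch of gender identification and accuracy scoring, assuming the aligned
# target-language phrase for each annotated entity has already been extracted.

def guess_gender_es(entity_phrase: str) -> str:
    """Rough gender guess for a Spanish noun phrase such as 'la doctora'."""
    tokens = entity_phrase.lower().split()
    if tokens[0] in {"la", "una"}:
        return "female"
    if tokens[0] in {"el", "un"}:
        return "male"
    # Fall back to a common (and imperfect) suffix heuristic.
    return "female" if tokens[-1].endswith("a") else "male"

def gender_accuracy(aligned_examples) -> float:
    """aligned_examples: list of (target_entity_phrase, gold_gender) pairs."""
    correct = sum(guess_gender_es(phrase) == gold
                  for phrase, gold in aligned_examples)
    return correct / len(aligned_examples)

# Toy usage with hand-aligned phrases from hypothetical Spanish translations.
aligned = [
    ("la doctora", "female"),    # correct: female doctor kept female
    ("el doctor", "female"),     # error: female doctor translated as male
    ("la enfermera", "female"),  # correct: stereotypical female nurse
]
print(f"Gender translation accuracy: {gender_accuracy(aligned):.0%}")
```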

To estimate the noise of this process, a sample of the automatic gender predictions was shown to native speakers of the target languages.

The estimated quality was above 85%, compared with roughly 90% inter-annotator agreement (IAA) between human annotators.

To test whether gendered adjectives affect translation, consider the sentence:

“The doctor asked the nurse to help her in the operation.”

Gendered adjectives were injected in a black-box manner (without any access to the models' internals):

“The pretty doctor asked the nurse to help her in the operation.” [since the doctor is female]
“The handsome nurse asked the doctor to help him in the operation.” [since the nurse is male]

Here, “pretty” and “handsome” are stereotypical adjectives for the feminine and masculine genders, respectively.
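
A minimal sketch of what such an injection could look like, assuming each example carries the entity's surface form and its gold gender. The adjectives follow the paper's example, but the helper name and signature are illustrative, not the paper's code.

```python
# Sketch of black-box adjective injection: prepend a stereotypically gendered
# adjective to the entity mention, based on the gold gender annotation.

ADJECTIVE = {"female": "pretty", "male": "handsome"}

def inject_adjective(sentence: str, entity: str, gold_gender: str) -> str:
    """Insert a gendered adjective before the first mention of `entity`."""
    return sentence.replace(f"The {entity}",
                            f"The {ADJECTIVE[gold_gender]} {entity}", 1)

original = "The doctor asked the nurse to help her in the operation."
print(inject_adjective(original, "doctor", "female"))
# -> The pretty doctor asked the nurse to help her in the operation.
```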

This approach improved performance for most tested languages and models (a mean improvement of +8.6%).

Results

This metric shows that all tested systems perform significantly and consistently better when presented with pro-stereotypical assignments (e.g., a female nurse), while their performance deteriorates when translating anti-stereotypical roles (e.g., a male receptionist).
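
One simple way to summarize this gap, sketched below, is to score the pro- and anti-stereotypical portions separately and take the difference (the paper reports a closely related per-language statistic). The function and toy numbers here are illustrative only.

```python
# Sketch: compare gender-translation accuracy on the pro-stereotypical and
# anti-stereotypical halves of the dataset. The toy numbers are made up.

def stereotype_gap(results) -> float:
    """results: list of (is_stereotypical, is_correct) boolean pairs."""
    pro = [ok for stereo, ok in results if stereo]
    anti = [ok for stereo, ok in results if not stereo]
    return sum(pro) / len(pro) - sum(anti) / len(anti)

# Toy system: 9/10 correct on stereotypical, 6/10 on anti-stereotypical.
toy = [(True, i < 9) for i in range(10)] + [(False, i < 6) for i in range(10)]
print(f"Stereotype gap: {stereotype_gap(toy):.0%}")  # -> 30%
```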

[Figure: Google Translate's performance on gender translation across the tested languages. The performance on the stereotypical portion of WinoMT is consistently better than that on the non-stereotypical portion. The other MT systems we tested display similar trends.]

Limitations and Future Work

  • Artificially created dataset.
    • Allows for a controlled experiment.
    • Yet it might introduce its own annotation biases.
  • Medium size.
    • Easy to overfit.

Takeaways

  • First quantitative automatic evaluation of gender bias in MT.
    • 6 SOTA MT models on 8 diverse target languages.
    • Doesn’t require reference translations.
  • Significant gender bias found in all models in all tested languages.
  • Easily extensible with more languages and MT models.

References

  1. Stanovsky, G., Smith, N. A., & Zettlemoyer, L. (2019). Evaluating Gender Bias in Machine Translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1679–1684. https://doi.org/10.18653/v1/P19-1164


