
Introduction

Speaker commitment prediction is the task of determining to what extent the speaker of a sentence is committed to an event being actual, non-actual, or uncertain. This matters for downstream NLP applications such as information extraction and question answering.

For example: “Florence is the most beautiful city I’ve ever seen.” (The speaker is committed to the fact.)
Compare: “Florence might be the most beautiful city I’ve ever seen.” (The speaker is less committed to the fact.)


Speaker is committed - “Do you know that Florence is packed with visitors?”
This is mainly because “know” is a factive verb, which tends to suggest that the embedded clause is true.

Speaker is not so committed - “Do you think that Florence is packed with visitors?”
This is mainly because “think” is a nonfactive verb, which tends to suggest that the embedded clause is neither true nor false.

Speaker is not committed - “I don’t know that Florence is packed with visitors.”
This is mainly because of the negation in “don’t know” (a neg-raising environment), which tends to suggest that the embedded clause is false.

Problem

Evaluating state-of-the-art (SOTA) models of speaker commitment.

Approach

The authors explore the hypothesis that linguistic deficits drive the error patterns of speaker commitment models by analyzing the linguistic correlates of model errors on a challenging naturalistic dataset.

Dataset

The CommitmentBank corpus is used which consists of 1,200 naturally occurring items involving clause-embedding verbs under four entailment-canceling environments (negations, modals, questions, conditionals).

For each item, speaker commitment judgments were gathered on Mechanical Turk from at least eight native English speakers. Participants judged whether the speaker is certain that the content of the complement of the target sentence is true, on a Likert scale labeled at three points (+3: the speaker is certain that the complement is true; 0: the speaker is not certain whether it is true or false; −3: the speaker is certain that it is false). The mean annotation for each item is taken as the gold speaker commitment score.
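The gold label computation is straightforward; here is a minimal Python sketch using made-up annotator ratings for a single hypothetical item:

```python
import statistics

# Hypothetical Likert ratings in [-3, 3] for one CommitmentBank item,
# from at least eight annotators (values are illustrative).
ratings = [3, 3, 2, 3, 1, 2, 3, 2]

# The gold speaker commitment score is the mean of the annotations.
gold_score = statistics.mean(ratings)
print(f"gold commitment score: {gold_score:.2f}")  # 2.38
```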

Models

1. Rule-based TruthTeller (Stanovsky et al., 2017)

It uses a top-down approach on a dependency tree and predicts a speaker commitment score in $[-3, 3]$ according to the implicative signatures of the predicates, and whether the predicates are under the scope of negation or uncertainty modifiers. For example, refuse $p$ entails $\neg p$, so the factuality of its complement $p$ is flipped whenever refuse is encountered.
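The core mechanism (predicate signatures plus polarity flipping under negation) can be illustrated with a toy recursive scorer. The signature values and clause representation below are made up for illustration and are not TruthTeller’s actual lexicon or implementation:

```python
# Toy illustration of signature-based factuality scoring (not TruthTeller itself).
# Each clause-embedding predicate carries a signature: +1 preserves the polarity
# of its complement, -1 flips it (e.g. "refuse p" entails not-p).
SIGNATURES = {"know": +1, "manage": +1, "refuse": -1, "fail": -1}

def score(clause, polarity=+1):
    """Walk a (predicate, negated, complement) chain top-down, flipping polarity
    at negations and negative-signature predicates, then map the final polarity
    onto the [-3, 3] commitment scale."""
    if isinstance(clause, str):          # reached the embedded proposition
        return 3 * polarity
    predicate, negated, complement = clause
    if negated:                          # e.g. "did not refuse to ..."
        polarity = -polarity
    polarity *= SIGNATURES.get(predicate, +1)
    return score(complement, polarity)

# "Kim refused to leave" -> speaker committed that Kim did not leave.
print(score(("refuse", False, "Kim leave")))   # -3
# "Kim did not refuse to leave" -> the polarity is flipped twice.
print(score(("refuse", True, "Kim leave")))    # +3
```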

2. BiLSTM-based models (Rudinger et al., 2018)

  • Linear: encodes the sentence left-to-right and right-to-left.
  • Tree: encodes the dependency tree of the sentence top-down and bottom-up.
  • Hybrid: combination of the two (a minimal sketch of such a BiLSTM regressor follows this list).
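A minimal PyTorch sketch of the linear (sequence-encoding) variant, assuming pre-indexed tokens and a known position for the clause-embedding predicate; the layer sizes and names are illustrative, not those of Rudinger et al. (2018):

```python
import torch
import torch.nn as nn

class BiLSTMCommitmentRegressor(nn.Module):
    """Minimal bidirectional LSTM regressor: encodes the sentence left-to-right
    and right-to-left, then predicts a commitment score in [-3, 3] for the
    embedded event (illustrative architecture, not the original model)."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids, predicate_index):
        states, _ = self.encoder(self.embed(token_ids))      # (B, T, 2H)
        # Take the hidden state at the clause-embedding predicate's position.
        pred_state = states[torch.arange(token_ids.size(0)), predicate_index]
        # tanh squashes to (-1, 1); scaling by 3 gives the [-3, 3] range.
        return 3.0 * torch.tanh(self.head(pred_state)).squeeze(-1)

# Usage on a dummy batch of two sentences of length six:
model = BiLSTMCommitmentRegressor(vocab_size=1000)
token_ids = torch.randint(0, 1000, (2, 6))
scores = model(token_ids, predicate_index=torch.tensor([2, 3]))
print(scores)  # two values in (-3, 3)
```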

Metrics

The task is to predict a graded commitment score $\in [-3, 3]$.

  • Pearson correlation (r): captures how well the model tracks the variability in the gold judgments (higher is better).
  • Mean Absolute Error (MAE): measures the absolute fit of the model’s predictions (lower is better); a small computation sketch follows this list.
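Both metrics are cheap to compute; a small sketch with NumPy and SciPy on made-up gold scores and predictions:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical gold commitment scores and model predictions, both in [-3, 3].
gold = np.array([2.4, -1.5, 0.3, 3.0, -2.8])
pred = np.array([1.9, -0.2, 0.8, 2.5, -1.0])

r, _ = pearsonr(gold, pred)            # Pearson correlation: higher is better
mae = np.mean(np.abs(gold - pred))     # mean absolute error: lower is better
print(f"Pearson r = {r:.2f}, MAE = {mae:.2f}")
```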

Results

[Figure: model predictions plotted against gold CommitmentBank scores]

Hybrid model predictions are mostly positive, whereas the rule-based model predictions are clustered at −3 and +3. This suggests that the rule-based model cannot capture the gradience present in commitment judgments, while the hybrid model struggles to recognize negative commitments. The rule-based model predicts $+$ by default unless it has clear evidence (e.g., negation) for negative commitment; this behavior is reflected in the high precision for $-$. Both models perform well on $+$ and $-$, but neither is able to identify no commitment ($0$).
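The precision-by-class analysis requires binning the graded scores into $+$, $0$, and $-$ classes. A small sketch of such a binning, with a made-up threshold and made-up scores (the paper’s exact binning may differ):

```python
import numpy as np

def polarity(scores, threshold=1.0):
    """Bin graded scores into '+' / '0' / '-' classes; the threshold of 1.0
    is an illustrative choice, not necessarily the paper's binning."""
    return np.where(scores > threshold, "+",
                    np.where(scores < -threshold, "-", "0"))

gold = np.array([2.4, -2.5, 0.3, 2.8, -1.6])   # hypothetical gold scores
pred = np.array([2.0, -3.0, 2.1, 3.0, 0.4])    # hypothetical model predictions

g, p = polarity(gold), polarity(pred)
is_neg_prediction = (p == "-")
precision_neg = (g[is_neg_prediction] == "-").mean()  # precision for the '-' class
print(f"precision(-) = {precision_neg:.2f}")
```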

  • Embedding environment

The rule-based model can only capture inferences involving negation ($r$ = 0.45), while the hybrid model performs more consistently across negation, modals, and questions ($r$ ∼ 0.25). Neither model can handle inferences with conditionals. (A small sketch of this per-environment breakdown appears after this list of factors.)

  • Genre

Both models achieve the best correlation on dialog (Switchboard), and the worst on newswire (WSJ). The good performance of the rule-based model on dialog could be due to the fact that 70% of the items in dialog are in a negation environment with a nonfactive verb.

  • Factive embedding verb

Both models get better MAE on factives, but better correlation on nonfactives. The improved MAE of the rule-based model might be due to its use of factive/implicative signatures. However, the poor correlations suggest that neither model can robustly capture the variability in inference which exists in sentences involving factive/nonfactive verbs.

  • Neg-raising

There is almost no correlation between either model’s predictions and the gold judgments, suggesting that the models are not able to capture neg-raising inferences.
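A sketch of the per-environment breakdown referenced above, grouping hypothetical per-item results by entailment-canceling environment and computing Pearson r within each group (the column names and values are made up for illustration):

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical per-item results: environment, gold score, model prediction.
df = pd.DataFrame({
    "environment": ["negation"] * 3 + ["modal"] * 3
                 + ["question"] * 3 + ["conditional"] * 3,
    "gold":       [2.5, -2.0, 0.5,  1.0, -0.5, 2.0,  2.8, -1.2, 0.0,  0.4, -0.8, 1.5],
    "prediction": [2.0, -1.0, 0.8,  0.3,  0.1, 1.5,  2.1,  0.2, 0.5,  1.5,  1.2, 1.0],
})

# Pearson r between gold scores and predictions within each environment.
per_env = df.groupby("environment")[["gold", "prediction"]].apply(
    lambda g: pearsonr(g["gold"], g["prediction"])[0])
print(per_env)
```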

Limitations and Future Work

  • Models are not able to generalize to other linguistic environments such as conditionals, modals, and neg-raising, which display inference patterns that are important for information extraction.
  • Both models are able to identify the polarity of commitment, but cannot capture its gradience.
  • To perform robust language understanding, models will need to incorporate more linguistic knowledge and be able to generalize to a wider range of linguistic constructions.

Takeaways

  • Conditionals, factive verbs, and neg-raising are hard for models.
  • Models can identify polarity, but not gradience.
  • Linguistically motivated models scale more successfully.

References

  1. Jiang, N., & de Marneffe, M.-C. (2019). Do You Know That Florence Is Packed with Visitors? Evaluating State-of-the-art Models of Speaker Commitment. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4208–4213. https://doi.org/10.18653/v1/P19-1412


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
