One of the things I really enjoyed about this paper was seeing the precise assumptions that (we think) underlie dependent measures. It’s important to understand them — and understand the linking story more generally — if you’re going to connect model output (which typically is about some knowledge state that’s achieved/learned) to behavioral results (which involve using that knowledge to generate the observed behavior).
Meanwhile, I was just as surprised as the authors that the most natural of the three behavioral tasks they used (sentence interpretation, i.e., what did the speaker mean by this?) was the one that seemed to wash away the pragmatic effects. I would have thought that pragmatic reasoning is what we use to understand how utterances are used in conversation (i.e., to figure out what the speaker meant in context). So pragmatic effects ought to show up more strongly in this kind of task than in the more metalinguistic truth-value-ish (Expt 1) or what’s-the-speaker-going-to-say (Expt 2) tasks. But clearly they didn’t.
D&G2014 offer a potential explanation based on an RSA model that treats the interpretation task as involving a pragmatic listener (who reasons about a speaker informing a naive listener). In contrast, the truth-value and speaker-production tasks involve imagining a speaker’s productions. According to D&G, the pragmatic effects disappear in the interpretation task because they get washed away by the pragmatic listener’s reasoning. I’d like to understand this a bit better (i.e., why exactly this is true, given the equation they provide). Is it because the pragmatic effects are only in play for certain utterances, and the relevant world states have really low priors, so this yields no effect at the pragmatic listener level? (More specifically, using equation 1 notation: Is it that P_speaker(w | b, QUD) shows the pragmatic effect for certain box world-states b, and these are the ones with low prior P(b)?)
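To make that question concrete, here is a toy sketch (mine, not D&G’s) of the pragmatic listener step. It assumes the listener has the standard RSA form, P_listener(b | w, QUD) ∝ P_speaker(w | b, QUD) · P(b), with the QUD held fixed and dropped from the notation. All the numbers are made up for illustration, and the function and variable names are hypothetical. The idea is to take a strong and a weak version of the speaker-level effect and see how much of that difference survives at the listener level under two different priors.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


def pragmatic_listener(speaker_prob, prior):
    """Posterior over world states b for one fixed utterance w.

    Assumed form: P_listener(b | w) is proportional to P_speaker(w | b) * P(b).
    """
    return normalize({b: speaker_prob[b] * prior[b] for b in prior})


# Hypothetical P_speaker("some" | b), where b = number of marbles found out of 4.
# These values are invented for illustration; they are not from D&G2014.
strong_effect = {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.05}  # "some" strongly avoided when all were found
weak_effect = {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.45}    # "some" barely avoided when all were found

uniform_prior = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
skewed_prior = {1: 0.32, 2: 0.32, 3: 0.32, 4: 0.04}  # the "all found" state is a priori unlikely


def listener_level_effect(prior):
    """How much the speaker-level difference changes the listener's belief in b = 4."""
    strong = pragmatic_listener(strong_effect, prior)[4]
    weak = pragmatic_listener(weak_effect, prior)[4]
    return weak - strong


print("uniform prior:", round(listener_level_effect(uniform_prior), 3))
print("skewed prior: ", round(listener_level_effect(skewed_prior), 3))
```

With these made-up numbers, the same speaker-level difference moves the listener’s belief less under the skewed prior than under the uniform one, which is the kind of washing out I’m asking about. Whether D&G’s actual priors and speaker probabilities behave this way is exactly what I’d want to check.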
Some additional thoughts:
(1) Expt 2, predicting the probability of a speaker’s word choice, Figure 2: It seems funny that participants give any probability to answers besides “all” and the exact number when shown the complete set of marbles for the utterance “I found X of the marbles.” Even when the QUD is “Did she find all of them?”, we see some probability on “some” (for the “4” set, it’s not that much different from 0, but for the “16” set, it’s up there at 20%). Maybe this is really D&G’s note about people not wanting to be bothered with counting if they can’t subitize? (That is, putting probability on “some” is a hedge because the participant is too lazy to count when there are sixteen marbles present.)
(2) Expt 3 and what it means for the truth-value judgment tasks (TVJTs) that we often use with kids to assess interpretations: Maybe we should back off from truth judgments and go for more naturalistic “which of these did the speaker mean” judgments? For example, we give them an utterance and do some sort of eyetracking thing where they look at one of two pictures that correspond to possible utterance interpretations. Based on the adult results, this would seem to factor out some of the pragmatic interference. I guess the main response from the TVJT people is that they want to know when children allow a certain interpretation, even if it’s very much a minority one. The setup of the TVJT is typically that children will only answer “no” if they really can’t get the interpretation in question, period. But maybe you can get at this with more indirect measures like eye gaze, too. That is, even if children would consciously say “no” in a TVJT, their eye gaze between two pictures could indicate that they considered the relevant interpretation at some point during processing.