Monday, November 20, 2017

Some thoughts on Stevens et al. 2017

It’s really nice to see an RSA model engaging with pretty technical aspects of linguistic theory, as S&al2017 do here. In these kinds of problems, there tend to be a lot of links to follow in the chain of reasoning, and it’s definitely not easy to adequately communicate them in such a limited space. (Side note: I forget how disorienting it can be to not know specific linguistics terms until I try to read them all at once in an abstract without a concrete example. This is a good reminder to those of us who work in more technical areas: Make sure to have concrete examples handy. The same thing is true for walking through the empirical details with the prosodic realizations as S&al2017 have here -- I found the concrete examples super-helpful.)

Specific thoughts:

(1) For S&al2017, “information structure” = inferring the QUD probabilistically from prosodic cues?
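If so, then the picture I have in my head is just Bayes’ rule over QUDs given the observed melody. Here’s a tiny sketch of that picture, with a made-up QUD inventory and made-up numbers (hypothetical values; presumably in the actual model these likelihoods would come out of the speaker function rather than being stipulated):

```python
# Toy sketch of "inferring the QUD probabilistically from prosodic cues":
# P(QUD | melody) ~ P(melody | QUD) * P(QUD). All numbers are made up.

QUD_PRIOR = {"Who ran quickly?": 0.5, "How did Masha run?": 0.5}

# How likely each melody is if the speaker is addressing a given QUD
# (hypothetical values, chosen to reflect focus-QUD congruence).
MELODY_LIKELIHOOD = {
    "Who ran quickly?":   {"LH on MASHA": 0.8, "LH on QUICKLY": 0.2},
    "How did Masha run?": {"LH on MASHA": 0.2, "LH on QUICKLY": 0.8},
}

def qud_posterior(melody):
    """P(QUD | melody) via Bayes' rule over the toy QUD inventory."""
    unnorm = {q: QUD_PRIOR[q] * MELODY_LIKELIHOOD[q][melody] for q in QUD_PRIOR}
    total = sum(unnorm.values())
    return {q: p / total for q, p in unnorm.items()}

print(qud_posterior("LH on QUICKLY"))
# -> "How did Masha run?" comes out most probable under these toy numbers.
```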

(2) I think the technical linguistic material is worth going over, as it connects to the RSA model. For instance, I’m struggling a bit to understand the QUD implications of having an incomplete answer vs. a complete answer, especially as it relates to a QUD’s compatibility with a given melody.

For example, when we hear “Masha didn’t run QUICKLY”, the QUD is something like “How did Masha run?”. That’s an example of an incomplete answer. What’s a complete answer version of this scenario, and how does this impact the QUD? Once I get this, I think it will make complete sense to use the utility function defined in equation (10).
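To make the complete vs. incomplete distinction concrete for myself, here’s a toy version (my own construction, not S&al2017’s formalism): treat a QUD as a partition over a small set of worlds, and call an answer complete if the worlds it leaves open fall inside a single cell, incomplete if they straddle more than one cell.

```python
# Toy illustration of complete vs. incomplete answers (my construction,
# not the paper's equation (10)): a QUD partitions the worlds; an answer
# is the set of worlds it leaves open.

WORLDS = {"ran_quickly", "ran_slowly", "didnt_run"}

# QUDs as partitions of WORLDS. (I'm glossing over the presupposition that
# "How did Masha run?" assumes Masha ran, by just giving it a didnt_run cell.)
QUDS = {
    "Did Masha run?": [{"ran_quickly", "ran_slowly"}, {"didnt_run"}],
    "How did Masha run?": [{"ran_quickly"}, {"ran_slowly"}, {"didnt_run"}],
}

# Answers as the worlds compatible with them.
ANSWERS = {
    "Masha didn't run QUICKLY": {"ran_slowly", "didnt_run"},
    "Masha ran slowly":         {"ran_slowly"},
    "Masha didn't run at all":  {"didnt_run"},
}

def answer_status(answer_worlds, qud_cells):
    """Complete if the answer's worlds land in exactly one QUD cell;
    incomplete if they straddle more than one."""
    touched = [cell for cell in qud_cells if cell & answer_worlds]
    return "complete" if len(touched) == 1 else "incomplete"

for qud, cells in QUDS.items():
    for answer, worlds in ANSWERS.items():
        print(f"{qud:22s} | {answer:26s} -> {answer_status(worlds, cells)}")
```

So “Masha didn’t run QUICKLY” rules out a cell of “How did Masha run?” without settling it, which is what I take the incomplete answer case to be, while “Masha ran slowly” settles it. If that’s the right reading, then a utility function keyed to resolving the QUD, which is how I’m currently reading their (10), seems very natural.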

(3) I was struck by S&al2017’s notational trick, where they step out of the recursive social reasoning loop of literal listener to speaker to pragmatic listener and instead describe it as utility function to speaker to hearer. Is that because they’re trying to deemphasize the social reasoning aspect? Or did they just think it made more sense described this way?
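For my own bookkeeping, here’s the standard RSA chain on a minimal toy (two meanings, two utterances) with the utility function pulled out explicitly. As far as I can tell, utility-to-speaker-to-hearer is the same math as literal-listener-to-speaker-to-pragmatic-listener; the literal listener just gets tucked inside the utility. (This is a generic toy, not their prosody model; the meanings, utterances, and numbers are all mine.)

```python
import math

# Generic RSA sketch (not S&al2017's model): two meanings, two utterances.
MEANINGS = ["quickly", "slowly"]
UTTERANCES = ["quickly", "somehow"]          # "somehow" is literally true of both
LITERAL = {("quickly", "quickly"): 1, ("quickly", "slowly"): 0,
           ("somehow", "quickly"): 1, ("somehow", "slowly"): 1}

def L0(u):
    """Literal listener: renormalize the literal semantics over meanings."""
    scores = {m: LITERAL[(u, m)] for m in MEANINGS}
    z = sum(scores.values())
    return {m: s / z for m, s in scores.items()}

def utility(u, m, cost=0.0):
    """Speaker utility: informativity minus cost -- the literal listener is
    tucked inside this function, which I take to be the relabeling at issue."""
    return math.log(L0(u)[m] + 1e-10) - cost

def S1(m, alpha=1.0):
    """Speaker: softmax over utilities."""
    scores = {u: math.exp(alpha * utility(u, m)) for u in UTTERANCES}
    z = sum(scores.values())
    return {u: s / z for u, s in scores.items()}

def L1(u, alpha=1.0):
    """Pragmatic listener / 'hearer': Bayes over the speaker, uniform prior."""
    scores = {m: S1(m, alpha)[u] for m in MEANINGS}
    z = sum(scores.values())
    return {m: s / z for m, s in scores.items()}

print(S1("quickly", alpha=1.0))   # speaker with the 'quickly' meaning in mind
print(L1("somehow", alpha=1.0))   # hearer hearing the underinformative utterance
```

So my read is that the relabeling doesn’t change the recursion at all; it just foregrounds the utility and backgrounds the social-reasoning story.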

(4) About those results:
Figure 2: It’s nice to see modelers investigating the effect of the rationality (softmax) parameter in the speaker function. From the look of Figure 2, speakers need to be pretty darned rational indeed (i.e., really exaggerating endpoint behavior) in order to get any separation in commitment certainty predictions.
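Just to convince myself about that endpoint-exaggeration intuition, here’s what a softmax does to a small utility gap as the rationality parameter grows (toy utilities, nothing from their model):

```python
import math

def softmax(utilities, alpha):
    """Standard softmax speaker rule: P(u) proportional to exp(alpha * U(u))."""
    exps = [math.exp(alpha * u) for u in utilities]
    z = sum(exps)
    return [e / z for e in exps]

# Two utterances whose utilities differ only slightly (made-up numbers).
utilities = [0.0, -0.2]
for alpha in [1, 2, 5, 10, 20]:
    probs = softmax(utilities, alpha)
    print(f"alpha={alpha:>2}: P(better option) = {probs[0]:.3f}")
```

With a small gap in utilities, the speaker’s choice probabilities only really pull apart once alpha gets large, which matches what Figure 2 looks like to me.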

Thinking about this intuitively, we should expect the LH Name condition (MASHA didn’t run quickly) to continue to be ambivalent about commitment to Masha running at all. That definitely shows up. I think. (Actually, I wonder if it might have been more helpful to ask participants to rate things on a scale from 1 (No, certainly not) to 7 (Yes, certainly so). That seems like it would make a 4 score easier to interpret (4 = maybe yes, maybe no). Here, I’m a little unsure how participants were interpreting the middle of the scale. I would have thought “No, not certain” would be the “maybe yes, maybe no” option, and so we would expect scores of 1. This is something of an issue when we come to the quantitative fit of the model results to the experimental results: is the behavioral difference shallow just because of the way humans were asked to give their answers? The way the model probability is calculated in (16) suggests that the model is operating more under the 1 = “no, certainly not” version, if I’m interpreting it correctly -- the “certainly yes” option is contrasted with the “certainly not” option.)
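Here’s the kind of toy calculation I have in mind for why the scale reading matters (entirely my own construction; I’m not claiming this is their equation (16)): take some model posterior that Masha ran, and map it onto a 1-7 rating under the two readings of the scale.

```python
# Toy mapping from a model probability to a 1-7 rating under two readings
# of the response scale (my construction, not the paper's equation (16)).

def rating_bipolar(p_ran):
    """Reading A: 1 = 'no, certainly not', 7 = 'yes, certainly so',
    so 4 means 'maybe yes, maybe no'."""
    return 1 + 6 * p_ran

def rating_certainty(p_ran):
    """Reading B: the scale measures certainty of commitment only,
    so maximal uncertainty should bottom out near 1."""
    certainty = abs(p_ran - 0.5) * 2        # 0 when ambivalent, 1 when sure
    return 1 + 6 * certainty

for p in [0.5, 0.7, 0.9]:
    print(f"P(ran)={p}: bipolar={rating_bipolar(p):.1f}, "
          f"certainty={rating_certainty(p):.1f}")
```

Under the bipolar reading an ambivalent model sits at 4; under the certainty reading it sits at 1. So the same posterior predicts rather different-looking curves depending on how participants read the scale, which is all I mean about the shallow behavioral difference possibly being partly a response-scale issue.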


Clearly, however, we see a shift up in human responses in Figure 3 for the LH Adverb condition (Masha didn’t run QUICKLY), which does accord with my intuitions. And we get that same shift from the model in Figure 2, as long as that rationality parameter is turned way up. (Side note: I’m a little unclear about how to interpret the rationality parameter, though. We always hedge about it in our simulation results. It seems to be treated as a noise parameter, i.e., humans are noisy, so let’s use this to capture some messy bits of their behavior. In that case, maybe it doesn’t mean much of anything that it has to be turned up so high here.)

Monday, November 6, 2017

Thoughts on Orita et al. 2015

I really appreciated how O&al2015 used the RSA modeling framework to make a theory (in this case, about discourse salience) concrete enough to implement and then evaluate against observable behavior. As always, this is the kind of thing I think modeling is particularly good at, so the more that we as modelers emphasize that, the better.

Some more targeted thoughts:

(1) The Uniform Information Density (UID) Hypothesis assumes that receiving information in chunks of approximately the same size is better for communication. I was trying to get the intuition of that down -- is it that new information is easier to integrate if the amount of hypothesis adjustment needed based on that new information is always the same? (And if so, why should that be, exactly? Some kind of processing thing?)
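The way I usually picture UID is in terms of per-word surprisal: the same total information can arrive smoothly or in one big spike, and UID says speakers prefer the smooth version. A toy calculation with invented word probabilities (nothing from O&al2015):

```python
import math

def surprisals(word_probs):
    """Per-word surprisal in bits: -log2 P(word | context)."""
    return [-math.log2(p) for p in word_probs]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

# Two hypothetical renderings of the same content: each word's probability
# in context (invented numbers). Both carry the same total information.
smooth = [0.25, 0.25, 0.25]          # ~2 bits per word
spiky  = [0.5, 0.5, 0.0625]          # easy, easy, then a 4-bit spike

for name, probs in [("smooth", smooth), ("spiky", spiky)]:
    s = surprisals(probs)
    print(f"{name}: total={sum(s):.1f} bits, variance={variance(s):.2f}")
```

Both renderings carry 6 bits total, but the second packs 4 of them into a single word; if there’s some per-word processing bottleneck, that spike is where comprehension presumably suffers. That’s the best intuition I can come up with for why equal-sized chunks would be preferred.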

Related: If I’m understanding correctly, the discourse salience version of the UID hypothesis means more predictable referents get realized as reduced forms like pronouns. This gets cashed out initially as the surprisal component of the speaker function in (3) (I(words; intended referent, available referent)), which is just about vocabulary specificity (that is, inversely related to how ambiguous the literal meaning of the word is). Then Section 3.2 talks about how to incorporate discourse salience. In particular, (4) incorporates the literal listener interpretation given the word, and (5) is just straight Bayesian inference where the priors over referents are what discourse salience affects. Question: Would we need these discourse-salience-based priors to reappear at the pragmatic listener level if we were using that level? (It seems like they belong there too, right?) I try to sketch how I’m picturing this whole chain after the next paragraph.

Speaking of levels, since O&al2015 are modeling speaker productions, is the S1 level the right level? Or should they be using an S2 level, where the speaker assumes a pragmatic listener is the conversational partner? Maybe not, because we usually save the S2 level for metalinguistic judgments like endorsements in a truth-value judgment task?
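Here’s how I’m picturing the whole stack, with the discourse-salience prior living in the literal listener and the level question made explicit. (A generic referring-expression sketch with invented referents, costs, and salience numbers; not O&al2015’s actual equations.)

```python
import math

# Generic referring-expression RSA sketch (my own toy, not O&al2015's model).
REFERENTS = ["Emma", "Lily"]
WORDS = ["Emma", "Lily", "she"]                     # proper names + a pronoun
LITERAL = {("Emma", "Emma"): 1, ("Emma", "Lily"): 0,
           ("Lily", "Emma"): 0, ("Lily", "Lily"): 1,
           ("she", "Emma"): 1,  ("she", "Lily"): 1}  # 'she' is true of both
COST = {"Emma": 2.0, "Lily": 2.0, "she": 1.0}        # longer forms cost more
SALIENCE = {"Emma": 0.8, "Lily": 0.2}                # discourse-salience prior

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def L0(w):
    """Literal listener: semantics times the discourse-salience prior."""
    return normalize({r: LITERAL[(w, r)] * SALIENCE[r] for r in REFERENTS})

def S1(r, alpha=1.0):
    """Speaker: softmax of informativity (log L0) minus production cost."""
    return normalize({w: math.exp(alpha * (math.log(L0(w)[r] + 1e-10) - COST[w]))
                      for w in WORDS})

def L1(w, alpha=1.0):
    """Pragmatic listener: Bayes over S1, with the salience prior again."""
    return normalize({r: S1(r, alpha)[w] * SALIENCE[r] for r in REFERENTS})

def S2(r, alpha=1.0):
    """One level up: a speaker reasoning about the pragmatic listener."""
    return normalize({w: math.exp(alpha * (math.log(L1(w, alpha)[r] + 1e-10) - COST[w]))
                      for w in WORDS})

print("S1 for salient Emma:", S1("Emma"))
print("S2 for salient Emma:", S2("Emma"))
```

Written this way, my earlier question about whether the salience prior should reappear at the pragmatic listener level is just the question of whether L1 multiplies in SALIENCE again (it does in this sketch), and the S1-vs-S2 question is just which of the two speaker functions you fit to the production data.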

(2) Table 1: Just looking at the log likelihood scores, it seems like frequency-based discourse salience is the way to go (and this effect is much more pronounced in child-directed speech). However, the text in the discussion by the authors notes how the recency-based discourse salience version has better accuracy scores, though most of that is due to the proper name accuracy since every model is pretty terrible at pronoun accuracy. I’m not entirely sure I follow the authors’ point about why the accuracy and log likelihood scores don’t agree on the winner. If the recency-based models return higher probabilities for a proper name, shouldn’t that make the recency-based log likelihood score better than the frequency-based log likelihood score? Is the idea that some proper names get all the probability (for whatever reason) for the recency-based version, and this so drastically lowers the probabilities of the other proper names that a worse log likelihood results?
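My best guess at how accuracy and log likelihood can come apart is something like the following toy case (numbers invented): a model can pick the right form more often, and so have better accuracy, while occasionally being confidently wrong, and those near-zero probabilities on the correct answer are exactly what wrecks a log likelihood.

```python
import math

# Toy case (invented numbers): each entry is the probability the model
# assigned to the form the speaker actually produced, over 5 data points.
model_A = [0.55, 0.55, 0.55, 0.45, 0.45]    # modest everywhere, never terrible
model_B = [0.90, 0.90, 0.90, 0.90, 0.0001]  # right more often, once confidently wrong

for name, probs in [("A", model_A), ("B", model_B)]:
    accuracy = sum(p > 0.5 for p in probs) / len(probs)   # crude argmax-style proxy
    loglik = sum(math.log(p) for p in probs)
    print(f"model {name}: accuracy={accuracy:.2f}, log likelihood={loglik:.2f}")
```

So I think the story has to be something like that: the recency-based version puts a lot of probability on the right proper name most of the time (good accuracy), but when it’s wrong it’s very wrong, and those items drag its log likelihood below the frequency-based version’s.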

But still, no matter what, discourse salience looks like it’s having the most impact (though there’s some impact of expression cost). In the adult-directed dataset, you can actually get pretty close to the best log likelihood with the -cost frequency-based version (-1017) vs. the complete frequency-based version (-958). But if you remove discourse salience, things get much, much worse (-6904). Similarly, in the child-directed dataset, the -cost versions aren’t too much worse than the complete versions, but the -discourse version is horrible.

All that said, what on earth happened with pronoun accuracy? There’s clearly a dichotomy between the proper name results and the pronoun results, no matter what model version you look at (except maybe the adult-directed -unseen frequency-based version).

(3) In terms of next steps, incorporating visual salience seems like a natural extension when calculating referent salience. Probably the best way to do this is to fold it into the prior over referents in the listener function, as a joint distribution over the two salience sources? (I also liked the proposed extension that involves speaker identity as part of the relevant context.) Similarly, incorporating grammatical and semantic constraints seems like a natural extension that could be implemented the same way. Probably a hard part is getting plausible estimates for these priors?
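Concretely, the joint-prior idea I have in mind is something like this (hypothetical referents, weights, and mixing scheme, not anything proposed in the paper):

```python
# One possible way to fold visual salience into the referent prior
# (hypothetical numbers and mixing scheme, not from O&al2015).

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

discourse_salience = {"Emma": 0.7, "Lily": 0.2, "the dog": 0.1}   # from the discourse
visual_salience    = {"Emma": 0.2, "Lily": 0.2, "the dog": 0.6}   # from the scene

def referent_prior(lam=0.5):
    """Mixture of the two salience sources; lam would have to be estimated."""
    return normalize({r: lam * discourse_salience[r] + (1 - lam) * visual_salience[r]
                      for r in discourse_salience})

print(referent_prior(lam=0.5))
```

A normalized product would be the other obvious option; either way, the hard part is the one I flagged above -- getting plausible estimates for these numbers in the first place.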