I really appreciated how O&al2015 used the RSA modeling framework to make a theory (in this case, about discourse salience) concrete enough to implement and then evaluate against observable behavior. As always, this is the kind of thing I think modeling is particularly good at, so the more that we as modelers emphasize that, the better.
Some more targeted thoughts:
(1) The Uniform Information Density (UID) Hypothesis assumes that receiving information in chunks of approximately the same size is better for communication. I was trying to pin down the intuition behind that -- is it that new information is easier to integrate if the amount of hypothesis adjustment needed based on that new information is always the same? (And if so, why should that be exactly? Some kind of processing thing?)
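One standard answer from the processing literature, for what it's worth, is Levy & Jaeger's (2007) argument: if comprehension cost grows superlinearly in surprisal, then for a fixed total amount of information, spreading it evenly across words minimizes total cost. A toy illustration in Python, with invented surprisal values and an assumed quadratic cost function:

    # Toy illustration of the superlinear-cost rationale for UID:
    # two ways to spread the same 8 bits over four words. If processing
    # cost is superlinear in surprisal (here, quadratic -- an assumption),
    # the uniform version is cheaper overall.
    uniform = [2.0, 2.0, 2.0, 2.0]   # bits per word
    bursty  = [0.5, 0.5, 0.5, 6.5]   # same total information, unevenly packed

    def total_cost(surprisals):
        return sum(s ** 2 for s in surprisals)

    print(total_cost(uniform))  # 16.0
    print(total_cost(bursty))   # 43.0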
Related: If I'm understanding correctly, the discourse salience version of the UID hypothesis means that more predictable referents get realized as reduced forms like pronouns. This gets cashed out initially as the surprisal component of the speaker function in (3) (I(words; intended referent, available referent)), which is just about vocabulary specificity (that is, inversely related to how ambiguous the word's literal meaning is). Then 3.2 talks about how to incorporate discourse salience. In particular, (4) incorporates the literal listener's interpretation given the word, and (5) is just straight Bayesian inference where the priors over referents are what discourse salience affects. Question: Would we need these discourse-salience-based priors to reappear at the pragmatic listener level if we were using that level? (It seems like they belong there too, right?)
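To keep the moving parts straight, here's a minimal sketch of that architecture in Python. Everything here is invented for illustration (the two-referent world, the lexicon, the costs, the soft-max rationality parameter alpha) -- it's the general RSA recipe as I understand it, not the paper's actual implementation:

    import numpy as np

    # Toy world: two referents; "she" is compatible with both,
    # while proper names pick out a unique referent.
    referents = ["Mary", "Sue"]
    words = ["Mary", "Sue", "she"]
    meaning = np.array([  # meaning[w, r] = 1 if word w can literally refer to r
        [1, 0],   # "Mary"
        [0, 1],   # "Sue"
        [1, 1],   # "she"
    ], dtype=float)
    cost = np.array([2.0, 2.0, 1.0])  # assumed: pronouns are cheaper to produce

    def literal_listener(salience_prior):
        # L0(r | w) proportional to [[w]](r) * P_salience(r):
        # literal semantics filtered through the discourse salience prior.
        scores = meaning * salience_prior          # (words x referents)
        return scores / scores.sum(axis=1, keepdims=True)

    def speaker(salience_prior, alpha=1.0):
        # S1(w | r) proportional to exp(alpha * (log L0(r | w) - cost(w))):
        # informativity traded off against production cost.
        L0 = literal_listener(salience_prior)
        with np.errstate(divide="ignore"):
            utility = np.log(L0) - cost[:, None]
        scores = np.exp(alpha * utility)
        return scores / scores.sum(axis=0, keepdims=True)  # normalize over words

    # As Mary comes to dominate the salience prior, "she" overtakes "Mary"
    # as the preferred way to refer to her:
    for prior in (np.array([0.5, 0.5]), np.array([0.9, 0.1])):
        S1 = speaker(prior)
        print(prior, dict(zip(words, S1[:, 0].round(3))))

With the uniform prior, "she" gets about 0.58 of the probability mass for referring to Mary; with the 0.9 prior, about 0.71 -- so the salience prior is doing exactly the pronoun-promoting work described above.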
Speaking of levels, since O&al2015 are modeling speaker productions, is the S1 level the right level? Or should they be using an S2 level, where the speaker assumes a pragmatic listener as the conversational partner? Maybe not, since we usually reserve the S2 level for metalinguistic judgments like endorsements in a truth-value judgment task?
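Continuing the toy sketch above (same caveats: all invented), the extra layer of recursion would look something like this -- and note that the discourse salience prior does reappear inside the pragmatic listener, which I think answers my question above about where those priors belong:

    def pragmatic_listener(salience_prior, alpha=1.0):
        # L1(r | w) proportional to S1(w | r) * P(r): Bayes over the
        # pragmatic speaker, with the same salience prior showing up again.
        S1 = speaker(salience_prior, alpha)
        scores = S1 * salience_prior               # (words x referents)
        return scores / scores.sum(axis=1, keepdims=True)

    def speaker_2(salience_prior, alpha=1.0):
        # S2(w | r) proportional to exp(alpha * (log L1(r | w) - cost(w))):
        # same trade-off as S1, but reasoning about the pragmatic listener.
        L1 = pragmatic_listener(salience_prior, alpha)
        with np.errstate(divide="ignore"):
            utility = np.log(L1) - cost[:, None]
        scores = np.exp(alpha * utility)
        return scores / scores.sum(axis=0, keepdims=True)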
(2) Table 1: Just looking at the log likelihood scores, frequency-based discourse salience seems like the way to go (and this effect is much more pronounced in child-directed speech). However, the authors' discussion notes that the recency-based discourse salience version has better accuracy scores, though most of that is due to proper name accuracy, since every model is pretty terrible at pronoun accuracy. I'm not entirely sure I follow the authors' point about why the accuracy and log likelihood scores disagree on the winner. If the recency-based models return higher probabilities for a proper name, shouldn't that make the recency-based log likelihood score better than the frequency-based one? Is the idea that some proper names soak up all the probability (for whatever reason) in the recency-based version, and this so drastically lowers the probabilities of the other proper names that a worse log likelihood results?
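My guess at the mechanics, with totally made-up numbers: argmax accuracy only cares whether the observed expression gets the highest probability, while log likelihood punishes confident misses severely. So a sharper model can win on accuracy and still lose on log likelihood if it occasionally puts nearly all its probability mass in the wrong place:

    import math

    # Hypothetical per-trial probabilities each model assigns to the
    # expression the speaker actually produced (numbers invented to
    # illustrate the accuracy/log-likelihood split, not from the paper):
    frequency_based = [0.6] * 10             # moderately confident everywhere
    recency_based   = [0.9] * 9 + [0.001]    # sharper, but one disastrous miss

    def loglik(probs):
        return sum(math.log(p) for p in probs)

    print(loglik(frequency_based))  # about -5.1
    print(loglik(recency_based))    # about -7.9: worse, despite more argmax wins

That would be consistent with the soaking-up story: recency makes the model very confident about the most recently mentioned referent's name, which is great when it's right and disastrous when it's wrong.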
But still, no matter what, discourse salience looks like it's having the most impact (though there's some impact of expression cost). In the adult-directed dataset, you can actually get pretty close to the best log likelihood with the -cost frequency-based version (-1017) vs. the complete frequency-based version (-958). But if you remove discourse salience, things get much, much worse (-6904). Similarly, in the child-directed dataset, the -cost versions aren't too much worse than the complete versions, but the -discourse version is horrible.
All that said, what on earth happened with pronoun accuracy? There’s clearly a dichotomy between the proper name results and the pronoun results, no matter what model version you look at (except maybe the adult-directed -unseen frequency-based version).
(3) In terms of next steps, incorporating visual salience seems like a natural step when calculating overall salience. Probably the best way to do this is to fold it into the referent prior in the listener function, maybe as a joint distribution over discourse and visual salience (a rough sketch below)? (I also liked the proposed extension that involves speaker identity as part of the relevant context.) Similarly, incorporating grammatical and semantic constraints seems like a natural extension that could be implemented the same way. Probably the hard part is getting plausible estimates for these priors?
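As a rough sketch of what that might look like (continuing the numpy toy above; the log-linear combination and the mixture weight are purely my assumptions, not anything from the paper):

    def combined_prior(discourse_salience, visual_salience, lam=0.5):
        # One hypothetical way to fold visual salience into the referent
        # prior: a weighted product (log-linear mixture) of the two
        # salience distributions, renormalized. lam is an assumed weight.
        scores = discourse_salience ** lam * visual_salience ** (1 - lam)
        return scores / scores.sum()

    # The combined prior would then feed straight into the earlier sketch:
    # speaker(combined_prior(np.array([0.9, 0.1]), np.array([0.2, 0.8])))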