Monday, October 31, 2016

Some thoughts on Degen & Goodman 2014

One of the things I really enjoyed about this paper was seeing the precise assumptions that (we think) underlie dependent measures. It’s important to understand them — and understand the linking story more generally — if you’re going to connect model output (which typically is about some knowledge state that’s achieved/learned) to behavioral results (which involve using that knowledge to generate the observed behavior).

Meanwhile, I was just as surprised as the authors that the most natural of the three behavioral tasks they used (sentence interpretation, i.e., what did the speaker mean by this?) was the one that seemed to wash away the pragmatic effects. I would have thought that pragmatic reasoning is what we use to understand how utterances are used in conversation (i.e., to figure out what the speaker meant in context). So, pragmatic effects ought to be stronger for this kind of task than for the more metalinguistic truth-value-ish (Expt 1) or what’s-the-speaker-going-to-say (Expt 2) tasks. But clearly they weren’t.

D&G2014 offer up a potential explanation involving an RSA model that views the interpretation task as involving a pragmatic listener (who reasons about a speaker informing a naive listener). In contrast, the truth-value and speaker-production tasks involve imagining a speaker’s productions. According to D&G, the reason the pragmatic effects disappear for the interpretation task is that they get washed away by the pragmatic listener’s reasoning. I think I’d like to understand this a bit better (i.e., why exactly is this true, using the equation they provide). Is it because the pragmatic effects are only in play for certain utterances, and the world-state priors are really low for those utterances, so this yields no effect at the pragmatic listener level? (More specifically using equation 1 notation: Is it that P_speaker(w | b, QUD) has the pragmatic effect for certain box world-states b, and these are the ones with low prior P(b)?)
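One way to poke at this question is with a toy, vanilla RSA chain (literal listener, then speaker, then pragmatic listener). To be clear, this is only a sketch and not D&G’s actual model: the world states, the three-utterance lexicon, the rationality parameter `alpha`, and both priors are my own assumptions, and the QUD is left out entirely. It just lets you see how the prior P(b) rescales whatever the speaker term contributes once you get to the pragmatic listener.

```python
# Toy RSA sketch. All names and numbers are illustrative assumptions,
# not taken from D&G 2014. The QUD is omitted for simplicity.

STATES = [0, 1, 2, 3, 4]  # hypothetical world: marbles found out of 4

# Literal semantics for a hypothetical three-word lexicon.
UTTERANCES = {
    "none": lambda b: b == 0,
    "some": lambda b: b >= 1,  # literal "some" is compatible with "all"
    "all": lambda b: b == 4,
}


def normalize(scores):
    """Turn non-negative scores into a probability distribution."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_listener(w, prior):
    """P_L0(b | w): prior restricted to states where w is literally true."""
    return normalize({b: prior[b] * float(UTTERANCES[w](b)) for b in STATES})


def speaker(b, prior, alpha=1.0):
    """P_speaker(w | b): prefers utterances that lead L0 to the true state."""
    return normalize(
        {w: literal_listener(w, prior)[b] ** alpha for w in UTTERANCES}
    )


def pragmatic_listener(w, prior, alpha=1.0):
    """P_L1(b | w), proportional to P_speaker(w | b) * P(b)."""
    return normalize(
        {b: prior[b] * speaker(b, prior, alpha)[w] for b in STATES}
    )


uniform = {b: 0.2 for b in STATES}
# Assumed skewed prior: the "all" state (b = 4) is a priori unlikely.
skewed = {0: 0.24, 1: 0.24, 2: 0.24, 3: 0.24, 4: 0.04}

for name, prior in [("uniform", uniform), ("skewed", skewed)]:
    lit = literal_listener("some", prior)[4]
    prag = pragmatic_listener("some", prior)[4]
    print(name, "P(b=4 | 'some'): literal", lit, "pragmatic", prag)
```

With the uniform prior, hearing “some” drops the probability of the all-state from roughly 0.25 (literal) to roughly 0.06 (pragmatic). With the skewed prior, both numbers are already close to zero, so the absolute difference between literal and pragmatic interpretation is much smaller. Whether that is really the mechanism D&G have in mind is exactly what I’d want to check against their equation.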

Some additional thoughts:

(1) Expt 2, predicting the probability of a speaker’s word choice, Figure 2: It seems funny that speakers give any probability to answers besides “all” and the exact number when shown the complete set of marbles for the utterance “I found X of the marbles.” Even when the QUD is “Did she find all of them?”, we see some probability on “some” (for the “4” set, it’s not that much different from 0, but for the 16 set, it’s up there at 20%). Maybe this is really D&G’s note about people not wanting to be bothered with counting if they can’t subitize? (That is, having probability on “some” is a hedge because the participant is too lazy to count if there are sixteen marbles present.)

(2) Expt 3 and what it means for truth-value judgment (TVJT) tasks that we often use with kids to assess interpretations: Maybe we should back off from truth-judgments and try to go for more naturalistic “which of these did the speaker mean” judgments? For example, we give them an utterance and do some sort of eyetracking thing where they look at one of two pictures that correspond to possible utterance interpretations. This would seem to factor out some of the pragmatic interference, based on the adult results. I guess the main response from the TVJT people is they want to know when children allow a certain interpretation, even if it’s a very minority one. The setup of the TVJT is typically that children will only answer “no” if they really can’t get the interpretation in question, period. But maybe you can get around this with more indirect measures like eye gaze, too. That is, even if children consciously would say “no” for a TVJT, their eye gaze between two pictures would indicate they considered the relevant interpretation at some point during processing.

Tuesday, October 18, 2016

Some thoughts on Kao et al. 2016

I really like the approach this paper takes, where insights from humor research are operationalized using existing formal metrics from language understanding. It’s the kind of approach I think is very fruitful for computational research because it demonstrates the utility of bothering to formalize things — in this case, the outcome is surprisingly good prediction about degrees of funniness and more explanatory power about why exactly something is funny. 

As an organizational note, I love the tactic the authors take here of basically saying “we’ll explain the details of this bit in the next section”, which is an implicit way of saying “here’s the intuition and feel free to skip the details in the next section if you’re not as interested in the implementation.” For me, one big challenge of writing up modeling results of this kind is the level of detail to include when you’re explaining how everything works. It’s tricky because of how wide an audience you’re attempting to interest. Put too little computational detail in and everyone’s irritated at your hand-waving; put too much in and readers get derailed from your main point. So, this presentation style may be a new format to try.

A few more specific thoughts:

(1) I mostly followed the details of the ambiguity and distinctiveness calculations (ambiguity is about entropy, distinctiveness is about KL divergence of the indicator variables f for each meaning). However, I think it’s worth pondering more carefully how the part described at the end of section 2.2, which goes into the (R)elatedness calculation, actually works. If we’re getting relatedness scores between pairs of words (R(w_i, h), where h = homophone and w_i is another word from the utterance), then how do we compile those together to get the R(w_i, m) that shows up in equation 8? For example, where does free parameter r (which was empirically fit to people’s funniness judgments and captures a word’s own relatedness) show up?
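For the part I did follow, here’s a minimal sketch of the two quantities: entropy over the candidate meanings for ambiguity, and a KL divergence between the distributions over f under each meaning for distinctiveness. The probabilities below are made-up toy numbers rather than anything from the paper, the function names are mine, and summing the KL divergence in both directions (to make it symmetric) is an assumption on my part about how the comparison is set up.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q):
    """KL divergence D(p || q); assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def symmetrised_kl(p, q):
    """D(p || q) + D(q || p), so the order of the two meanings doesn't matter."""
    return kl(p, q) + kl(q, p)


# Toy numbers (assumed): P(m | words) for two candidate meanings.
meaning_posterior = [0.55, 0.45]
ambiguity = entropy(meaning_posterior)

# Toy numbers (assumed): distribution over four possible settings of the
# indicator variables f, under each of the two meanings.
f_given_m1 = [0.70, 0.10, 0.10, 0.10]
f_given_m2 = [0.10, 0.10, 0.10, 0.70]
distinctiveness = symmetrised_kl(f_given_m1, f_given_m2)

print("ambiguity:", ambiguity)
print("distinctiveness:", distinctiveness)
```

The intuition this is meant to capture: ambiguity is highest when both meanings are about equally likely, and distinctiveness is highest when each meaning is supported by a different subset of the words.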

(2) I really like that this model is able to pull out exactly which words correspond to which meanings. This feature reminds me of topic models, where each word is generated by a specific topic. Here, a word is generated by a specific meaning (or at least, I think that’s what Figure 1 shows, with the idea that m could be a variety of meanings).

(3) I always find it funny in computational linguistics research that the general language statistics portion (here, in the generative model) can be captured effectively by a trigram model. The linguist in me revolts, but the engineer in me shrugs and thinks if it works, then it’s clearly good enough. The pun classification results here are simply another example of why computational linguistics often uses trigram models to approximate language structure and pretty much finds them adequate for most of what it wants to do. Maybe for puns (and other joke types) that involve structural ambiguity rather than phonological ambiguity, we’d need something more than trigrams, though.
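Just to make concrete how little structure a trigram model sees, here’s a bare-bones sketch: count triples of adjacent words and estimate the probability of a word from the two words before it. The tiny corpus is invented for illustration, there’s no smoothing, and none of this is the authors’ implementation.

```python
from collections import Counter


def train_trigrams(sentences):
    """Count trigrams and their two-word contexts, with boundary padding."""
    tri, ctx = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            tri[context + (tokens[i],)] += 1
            ctx[context] += 1
    return tri, ctx


def trigram_prob(tri, ctx, w1, w2, w3):
    """Unsmoothed maximum-likelihood estimate of P(w3 | w1, w2)."""
    context = (w1, w2)
    if ctx[context] == 0:
        return 0.0  # unseen context; a real model would smooth or back off
    return tri[(w1, w2, w3)] / ctx[context]


# Invented toy corpus (assumption), just to have something to count.
corpus = [
    "he pulled his hair out",
    "she pulled his hair",
    "he pulled his hat off",
]
tri, ctx = train_trigrams(corpus)
print(trigram_prob(tri, ctx, "pulled", "his", "hair"))
```

Everything the model knows about “hair” here comes from the two preceding words, which is why anything depending on longer-distance or hierarchical structure falls outside what it can represent.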