Tuesday, October 18, 2016

Some thoughts on Kao et al. 2016

I really like the approach this paper takes, where insights from humor research are operationalized using existing formal metrics from language understanding. It’s the kind of approach I think is very fruitful for computational research because it demonstrates the payoff of bothering to formalize things: in this case, the outcome is surprisingly good prediction of degrees of funniness, plus more explanatory power about why exactly something is funny.

As an organizational note, I love the tactic the authors take here of basically saying “we’ll explain the details of this bit in the next section”, which is an implicit way of saying “here’s the intuition, and feel free to skip the details in the next section if you’re not as interested in the implementation.” For me, one big challenge of writing up modeling results of this kind is deciding how much detail to include when explaining how everything works. It’s tricky because of how wide an audience you’re attempting to interest: put too little computational detail in and everyone’s irritated at your hand-waving; put too much in and readers get derailed from your main point. So this presentation style may be a new format to try.

A few more specific thoughts:

(1) I mostly followed the details of the ambiguity and distinctiveness calculations (ambiguity is about entropy, distinctiveness is about KL divergence of the indicator variables f for each meaning). However, I think it’s worth pondering more carefully how the part described at the end of section 2.2 works, which goes into the (R)elatedness calculation. If we’re getting relatedness scores between pairs of words, R(w_i, h), where h is the homophone and w_i is another word from the utterance, then how do we compile those together to get the R(w_i, m) that shows up in equation 8? For example, where does the free parameter r (which was empirically fit to people’s funniness judgments and captures a word’s own relatedness) show up?
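To pin down what I mean by “ambiguity is about entropy, distinctiveness is about KL divergence,” here is a back-of-the-envelope sketch. The numbers and variable names (p_meaning, f_given_m1, etc.) are made up for illustration and are not the paper’s actual values or equations; this is just the shape of the two measures:

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (in bits)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    """KL divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

# Posterior over the two candidate meanings m1, m2 given the sentence
# (made-up numbers). Ambiguity is high when both meanings stay plausible.
p_meaning = [0.55, 0.45]
ambiguity = entropy(p_meaning)

# Each meaning induces a distribution over which words "support" it (the
# indicator variables f). Distinctiveness is high when the supporting
# word sets differ sharply between the two meanings; symmetrizing the KL
# is my own simplification here.
f_given_m1 = [0.6, 0.1, 0.3]
f_given_m2 = [0.1, 0.7, 0.2]
distinctiveness = 0.5 * (kl(f_given_m1, f_given_m2) + kl(f_given_m2, f_given_m1))
```

A good pun, on this account, should score high on both quantities at once.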

(2) I really like that this model is able to pull out exactly which words correspond to which meanings. This feature reminds me of topic models, where each word is generated by a specific topic. Here, a word is generated by a specific meaning (or at least, I think that’s what Figure 1 shows, with the idea that m could be a variety of meanings).

(3) I always find it funny in computational linguistics research that the general language statistics portion (here, in the generative model) can be captured effectively by a trigram model. The linguist in me revolts, but the engineer in me shrugs and thinks that if it works, then it’s clearly good enough. The pun classification results here are simply another example of computational linguistics using trigram models to approximate language structure and finding them adequate for most purposes. Maybe for puns (and other joke types) that involve structural ambiguity rather than phonological ambiguity, we’d need something more than trigrams, though.
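For readers who haven’t built one: a trigram model really is this simple, which is part of why the linguist in me revolts. The toy corpus and the tiny probability floor for unseen trigrams below are my own illustration, not the paper’s actual language model:

```python
from collections import Counter, defaultdict

def train_trigrams(sentences):
    """MLE trigram counts with sentence-boundary padding."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return counts

def trigram_prob(counts, w1, w2, w3):
    """P(w3 | w1, w2), with a tiny floor for unseen contexts."""
    ctx = counts[(w1, w2)]
    total = sum(ctx.values())
    return ctx[w3] / total if total else 1e-6

# Made-up two-sentence corpus for illustration.
corpus = [["the", "magician", "got", "so", "mad"],
          ["the", "magician", "pulled", "a", "rabbit"]]
model = train_trigrams(corpus)
```

All the “structure” the model sees is which word tends to follow which two-word context; anything longer-range, like the structural ambiguity I mention above, is invisible to it.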

1 comment:

  1. Definitely an interesting paper! I like the quantification of some basic concepts in humor as information-theoretic measures; this feels like the right approach to me.

    Re: Lisa's #1, I agree that there's some ambiguity going on in the description of their calculations. It's also unclear to me how the parameter r is incorporated into the similarity equations or what its effect might be, especially since it is tuned to optimize the fit to human judgments. I believe that R(w_i, m) is the same as R(w_i, h), where the homophone slot takes the value h or h' depending on which of the two possible interpretations the sentence is given.

    What I dislike most is probably the number of moving parts in the model and the lack of any exploration of whether they are actually necessary. Could we get a baseline model that relies 100% on the trigram model? Did they really need to collect similarity ratings from human participants, or could they have used readily available metrics like cosine distance in a word-embedding space (as popularized by word2vec; Mikolov et al., 2013)?
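    To be concrete about the cheap alternative I'm proposing: cosine similarity between embedding vectors is a one-liner, no human raters required. The three-dimensional "embeddings" below are invented for illustration (real word2vec vectors have hundreds of dimensions), so only the shape of the computation is the point:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for this example.
vec_hare = [0.9, 0.1, 0.2]
vec_hair = [0.2, 0.8, 0.1]
vec_rabbit = [0.85, 0.15, 0.25]

# A relatedness proxy: "rabbit" should come out closer to "hare"
# than to its homophone "hair".
sim_hare = cosine_similarity(vec_hare, vec_rabbit)
sim_hair = cosine_similarity(vec_hair, vec_rabbit)
```

    Whether such an off-the-shelf proxy would match the human similarity ratings well enough is exactly the ablation I wish the paper had run.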

    I also dislike some of the minutiae of the data preparation: the fact that different people rated different TYPES of sentences, for instance, rather than seeing the sentences in randomized blocks. Also, some of the puns were found online and some were made up by an undergrad (presumably), with no report of whether there's a difference between those in terms of funniness ratings, which seems crucial for the work to be taken seriously. So I do worry that there are some confounds here, and that the details in the model are not as necessary as the paper makes them seem. The work clearly needs to be expanded to other formats of puns; many are framed as a dialogue, which could easily be accommodated. Also, MORE DATA! There's such a small amount here, and, having done absolutely no background research, I feel like they should have been able to find more.

    All that being said, it's a great start. Will be interesting to see if they decide this is a research line worth pursuing further!