Friday, May 24, 2013

Some thoughts on Kwiatkowski et al 2012

One of the things I really enjoyed about this paper was that it was a much fuller syntax & semantics system than anything I've seen in a while, which means we get to see the nitty-gritty of the assumptions required to make it all work. Having seen the assumptions, though, I did find it a little unfair for the authors to claim that no language-specific knowledge was required - as far as I could tell, the "language-universal" rules linking syntax and semantics at the very least seem to be a language-specific kind of knowledge (in the sense of domain-specific vs. domain-general). In this respect, whatever learning algorithms they might explore, the overall approach seems similar to other learning models I've seen that are predicated on very precise theoretical linguistic knowledge (e.g., the parameter-setting systems of Yang 2002, Sakas & Fodor 2001, Niyogi & Berwick 1996, and Gibson & Wexler 1994, among others). It just so happens here that CCG assumes different primitives/principles than those other systems do - but domain-specific primitives/principles are still there a priori.

Getting back to the semantic learning - I'm a big fan of their learning words besides nouns, and of connecting with the behavioral language acquisition literature on syntactic bootstrapping and fast mapping. That being said, the actual semantics they seemed to learn was a bit different from what I think the fast-mapping people generally intend. In particular, if we look at Figure 5, while three different quantifier meanings are learned, what's learned is more the form the meaning takes than the actual lexical meaning of the word (i.e., the form for a, another, and any looks identical, so any differences in meaning among these words are not recognized, even though they clearly do differ in meaning). Lexical meaning, though, is what people are generally talking about with fast mapping. What this looks like is almost grammatical categorization, where knowing the grammatical category tells you the general form the meaning will have (due to those linking rules between syntactic category and semantic form) rather than the precise meaning - and that's very much in line with syntactic bootstrapping, where the syntactic context might point you towards verb-y meanings or preposition-y meanings, for example.
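To make the worry concrete, here's a minimal sketch (the entries and the template string are my reconstruction of the flavor of Figure 5, not the paper's actual output): if the learner only recovers the shape of the logical form, these three quantifiers are indistinguishable.

```python
# Hypothetical learned lexical entries: a syntactic category paired
# with a generalized-quantifier template. The template string is
# identical for all three words, so the lexicon encodes no meaning
# difference among them.
lexicon = {
    "a":       ("NP/N", "lam f. lam g. exists x. f(x) & g(x)"),
    "another": ("NP/N", "lam f. lam g. exists x. f(x) & g(x)"),
    "any":     ("NP/N", "lam f. lam g. exists x. f(x) & g(x)"),
}

# Collapse the lexicon to its distinct (category, form) pairs.
forms = set(lexicon.values())
print(len(forms))  # 1 - three words, one learned "meaning"
```

The point being that whatever distinguishes a from another from any has to live somewhere other than these entries.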

More specific thoughts:

I found it interesting that the authors wanted to explicitly respond to the criticism that statistical learning models can't generate sudden, step-like behavior changes. I think it's certainly an unspoken view among many in linguistics that statistical learning implies more gradual learning (which was usually seen as a bonus, from what I understood, given how noisy the data are). It's also unclear to me whether the data taken as evidence for step-wise changes really reflect a step-wise change, or instead only seem step-wise because of how often the samples were taken and how much learning happened in between. It's interesting that the model here can generate such changes when learning word order (in Figure 6), though I think the only case that really stands out for me is the 5-meaning example, around 400 utterances.
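There's a standard way to see how gradual statistics can still yield step-like behavior, which I'd sketch like this (a toy illustration, not the paper's model): the learner's internal preference moves smoothly, but observed behavior is the argmax over hypotheses, which flips all at once when a threshold is crossed.

```python
# Toy learner: internal probability of one word order drifts up
# gradually, but the *behavior* we'd observe is whichever hypothesis
# is currently preferred - so the output flips abruptly.
pref = 0.30                      # internal preference for SOV order
behavior = []
for utterance in range(20):
    pref += 0.03                 # smooth, gradual statistical update
    behavior.append("SOV" if pref > 0.5 else "SVO")

# The observed behavior is a sudden step, not a gradual blend.
print(behavior.count("SVO"), behavior.count("SOV"))  # 6 14
```

If real step-wise acquisition data look like this, sampling frequency matters a lot, since sampling only before and after the flip would make even a smooth internal change look discrete.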

I could have used a bit more unpacking of the CCG framework in Figure 2. I know there were space limitations, but the translation from semantic type to the example logical form wasn't always obvious to me. For example, the first and last examples (S_dcl and PP) have the same semantic type but not the same lambda calculus form. Is the semantic type what's linked to the syntactic category (presumably), and then there are additional rules for how to generate the lambda form for any given semantic type?
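The kind of "additional rule" I have in mind could be as simple as a deterministic skeleton generator (this is my own toy sketch of the idea, with invented placeholder names, not anything from the paper): given a semantic type, mechanically emit a lambda term of that type with fresh variables.

```python
import itertools

def skeleton(sem_type, names=None):
    """Build a lambda-term skeleton whose type matches sem_type.
    Types are 'e', 't', or an (argument, result) pair."""
    if names is None:
        names = itertools.count()
    if sem_type in ('e', 't'):
        # Basic type: emit a fresh placeholder constant.
        return f"P{next(names)}"
    arg_type, result_type = sem_type
    v = f"x{next(names)}"        # fresh variable for the argument
    return f"lam {v}. {skeleton(result_type, names)}"

# <e,t>: a one-place predicate skeleton
print(skeleton(('e', 't')))              # lam x0. P1
# <e,<e,t>>: a two-place predicate skeleton
print(skeleton(('e', ('e', 't'))))       # lam x0. lam x1. P2
```

If the paper's linking works this way, the syntactic category pins down the semantic type, and a rule like this pins down the general lambda shape - which would also explain why different categories sharing a type can still carry different forms only if extra, category-specific rules intervene.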

This provides a nice example of a case where the information that's easily available in dependency structures appears especially useful, since the authors describe (in section 6) how they created a deterministic procedure for using the primitive labels in the dependency structures to create the lambda forms. (Though as a side note, I was surprised that this mapping only worked for a third of the child-directed speech examples, leaving out not only fragments but also imperatives and nouns with prepositional-phrase modifiers. I guess it's not unreasonable to first get your system working on a constrained subset of the data, though.)
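My guess at the flavor of that deterministic procedure, as a sketch (the arc labels SBJ/OBJ and the output format are my invention, not the corpus's actual scheme): walk the arcs off a head word and emit a logical form keyed on the arc labels, punting on anything that doesn't fit the expected pattern.

```python
def to_logical_form(head, deps):
    """deps: list of (label, word) dependency arcs off the head verb.
    Returns a predicate-argument logical form, or None if the
    structure doesn't match a handled pattern."""
    args = {label: word for label, word in deps}
    if 'SBJ' in args and 'OBJ' in args:
        return f"{head}({args['SBJ']}, {args['OBJ']})"
    if 'SBJ' in args:
        return f"{head}({args['SBJ']})"
    return None  # fragments, imperatives, etc. fall through, unhandled

print(to_logical_form("eat", [("SBJ", "you"), ("OBJ", "cookie")]))
# eat(you, cookie)
print(to_logical_form("sit", []))  # None - an imperative-like case
```

A procedure built this way would naturally cover only the subset of utterances whose arc patterns it anticipates, which fits with the mapping working for only a third of the child-directed speech.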

I wish they had told us a bit more about the guessing procedure they used for parsing unseen utterances, since it had a clear beneficial impact throughout the learning period. Was it random (and so guessing at all was better than not guessing, since sometimes you'd be right, as opposed to always being penalized for not having a representation for a given word)? Was it some kind of probabilistic sampling? Or maybe just always picking the most probable hypothesis?
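The three possibilities I'm imagining, as a quick sketch (all names and the category inventory here are hypothetical, and the paper may do something else entirely):

```python
import random

def guess_random(categories):
    """Uniform guess: any representation beats none."""
    return random.choice(categories)

def guess_sample(categories, probs):
    """Sample a hypothesis in proportion to its current probability."""
    return random.choices(categories, weights=probs, k=1)[0]

def guess_argmax(categories, probs):
    """Always take the currently most probable hypothesis."""
    return max(zip(categories, probs), key=lambda cp: cp[1])[0]

cats = ["N", "NP/N", "S\\NP"]
probs = [0.5, 0.3, 0.2]
print(guess_argmax(cats, probs))  # N
```

These would predict different learning curves - uniform guessing helps only by chance, while sampling and argmax both leverage what's already been learned - so it would have been nice to know which one produced the reported benefit.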



