Wednesday, May 13, 2015

Some thoughts on Kolodny et al. 2015

There are two main things I really enjoyed about this paper: (1) the explicit attempt to incorporate known properties of language acquisition into the proposed model (unsupervised learning, incremental learning, generative capacity of the learner), and (2) the breadth of empirical studies they tried to validate the proposed model on. Having said this, each of these things also has a slight downside for me, given how it was handled in the paper.

First, there seems to be a common refrain of “biological realism”, with the idea that the proposed model achieves this far better than any other model to date. I found myself wondering how true that is: pretty much every acquisition model we examine in the reading group includes the core properties of unsupervised learning and generative capacity, and all the algorithmic-level ones include incremental learning of some kind. What seems to separate the proposed model from these is the potential domain-generality of its components. That is, it’s meant to apply to any sequential, hierarchically structured system, not just to language. But even there, isn’t that exactly what Bayesian inference does too? It’s the units and hypothesis spaces that are language-specific, not the inference mechanism.

Second, because K&al covered empirical data from so many studies, I felt like I didn’t really understand any individual study that well, or even the specifics of how the model works on a concrete example. This is probably a length constraint (breadth of coverage trumped depth of coverage), but I really do wish more space had been devoted to a concrete walk-through of how these graphs get built up incrementally (and how the different link types are decided, what it means for something to be “shifted in time”, etc.). I want to like this model, but I just don’t understand the nitty-gritty of how it works.

So, given this, I wasn’t too surprised that the Pearl & Sprouse island effects didn’t work out. The issue, to me, is that K&al were running their model over units that weren’t abstract enough. The P&S strategy worked because it used trigrams of phrase structure (not trigrams of POS categories, as K&al described it), and not just any phrase structure: specifically, the phrase structure nodes that would be “activated” because the gap is contained inside them. So the units are even more abstract than phrase structure in general; they’re a subset of phrase structure nodes, and that’s what the trigrams get made out of. Trying to capture the same effects with local context over words (or even categories built from clumps of words or phrases) seems like using the wrong units. I think K&al’s idea is that the appropriate “functionally similar” abstract units would be built up over time with the slot capacity of the graph inference algorithm (and maybe that’s why they alluded to a data sparseness issue). And that might be true…but it certainly remains to be concretely demonstrated.
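To make the contrast in units concrete, here’s a toy sketch (my own illustration, not code from either paper; the function name and the container-node labels are hand-picked for the example): trigrams over the words around a gap versus trigrams over the chain of phrase structure nodes containing the gap.

```python
def trigrams(seq):
    """All contiguous trigrams of a sequence."""
    return list(zip(seq, seq[1:], seq[2:]))

# Local word context around a gap (marked "__"): the kind of unit a
# word-level learner sees.
words = "what do you think that she bought __".split()
print(trigrams(words))
# [('what', 'do', 'you'), ('do', 'you', 'think'), ...]

# Trigrams over the chain of phrase structure nodes containing the gap
# (P&S-style units); the node labels here are purely illustrative.
container_nodes = ["start", "IP", "VP", "CP-that", "IP", "VP", "end"]
print(trigrams(container_nodes))
# [('start', 'IP', 'VP'), ('IP', 'VP', 'CP-that'), ...]
```

Same trigram machinery in both cases; the question is entirely about what the trigrams are made of.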

Some other specific thoughts:

(1) 2.1, “…a unit that re-appears within a short time is likely to be significant” — This seems related to the idea of burstiness.

(2) 2.2, “…tokens are either separated by whitespaces…or…a whitespace is inserted between every two adjacent tokens” — Is this a default for a buffer size of two units? And if so, why? Something about adjacency?

(3) 2.3, “…create a new supernode, A + B, if sanctioned by Barlow’s (1990) principle of suspicious coincidence, subject to a prior” — How exactly does this work? Is it like Bayesian inference? What determines the prior? (I sketch one guess at the co-occurrence check after these notes.)

(4) 2.4, “…when a recurring sequence…is found within the short-term memory by alignment of the sequence to a shifted version of itself” — How exactly is the shifted version created? How big is the buffer? How cognitively intensive is this to do? (A second sketch after these notes shows the sort of self-alignment I imagined.)

(5) 2.6, “…i.e., drawing with a higher probability nodes that contain longer sequences” — Why would this bias be built in explicitly? If anything, I would think shorter sequences would have a higher probability.

(6) 3.1, Figure 2: It seems like U-MILA suddenly does just great on 9- and 10-word sequences, after doing poorly on 6- to 8-word sequences. Why should this be?
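On (3), here is my guess at the general shape of a suspicious-coincidence check, just to have something concrete to react to. This is not K&al’s actual computation (and it ignores the prior entirely); I’m only assuming the standard reading of Barlow’s principle, i.e., that A and B co-occur far more often than their individual frequencies would predict. The function name and threshold are my own.

```python
from collections import Counter

def suspicious_coincidence(tokens, a, b, threshold=2.0):
    """Return True if the pair (a, b) co-occurs adjacently far more often
    than its unigram frequencies predict (a PMI-style lift test).
    The threshold is arbitrary and stands in for whatever prior/criterion
    the paper actually uses."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_a = unigrams[a] / n
    p_b = unigrams[b] / n
    p_ab = bigrams[(a, b)] / max(n - 1, 1)
    if p_a == 0 or p_b == 0:
        return False
    return p_ab / (p_a * p_b) > threshold

tokens = "the dog saw the cat the dog chased the cat".split()
print(suspicious_coincidence(tokens, "the", "dog"))  # True: lift is about 2.8
```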
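On (4), this is how I pictured the “alignment to a shifted version of itself” step; again, a guess at the general idea rather than the paper’s actual algorithm, with the buffer size and the cognitive-cost question left open.

```python
def recurring_subsequences(buffer, min_len=2):
    """Align the short-term memory buffer against shifted copies of itself
    and collect any subsequence that matches its own shifted image."""
    hits = []
    n = len(buffer)
    for shift in range(1, n):
        run = []
        for i in range(n - shift):
            if buffer[i] == buffer[i + shift]:
                run.append(buffer[i])
            else:
                if len(run) >= min_len:
                    hits.append((tuple(run), shift))
                run = []
        if len(run) >= min_len:
            hits.append((tuple(run), shift))
    return hits

buffer = "a b c x a b c y".split()
print(recurring_subsequences(buffer))  # [(('a', 'b', 'c'), 4)]
```

Even in this toy version, the number of comparisons grows with the square of the buffer size, which is why I’d like to know how big the buffer is supposed to be.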