Sunday, November 25, 2012

Some thoughts on Frank (2012)

I thought this was a really nice big-picture piece about computational modeling work in language acquisition, and it tries (admirably!) to consolidate insights from different domains about the kinds of learning assumptions/strategies that are useful. This is such an incredibly good thing to do, I think - one of the questions I get a lot is whether there's one general-purpose style of computational model that's the right way to do things, and I'm usually left shrugging and saying, "Depends on what you're trying to do." And to some extent, of course, this is right - but there's also something to be said about what the different useful models have in common.

Another note: Despite the empirical coverage, I did feel there was something of a disconnect between the phenomena generative linguists get excited about (w.r.t. poverty of the stimulus, for example - syntactic islands, case theory, etc.) and the phenomena modeled in the studies discussed here. There's nothing wrong with this, since everyone's goal is to understand language acquisition, and that means acquisition of a lot of different kinds of knowledge. But I did wonder how the insights discussed here could be applied to more sophisticated knowledge acquisition problems in language. Frank already notes that it's unclear what insights successful models of more sophisticated knowledge have in common.

Some more targeted thoughts:

Frank focuses on two metrics of model success: sufficiency (basically, acquisition success) and fidelity (fitting patterns of human behavior). I've seen other proposed metrics, such as formal sufficiency, developmental compatibility, and explanatory power (discussed, for example, in Pearl 2010, which builds on prior work by Yang). I feel like formal sufficiency maps pretty well to sufficiency (and may actually cover fidelity too). Developmental compatibility, though, is more about psychological plausibility, and explanatory power is about the model's ability to give informative (explanatory) answers about what drives the acquisition process being modeled. I think all of the studies discussed hold up on the explanatory power metric, so that's fine. It's unclear how well they hold up for developmental compatibility - it may not matter if they're computational-level analyses, for example. But I feel like developmental compatibility deserves more prominence as something to think about when judging a computational model. (But maybe that's my algorithmic bias showing through.)

Related point: Frank is clearly aware of the tension between computational-level and algorithmic-level approaches, and spends some time discussing things like incremental vs. batch learning. I admit I was surprised to see this, though: "Fully incremental learning prevents backtracking or re-evaluation of hypotheses in light of earlier data". If I'm understanding this correctly, the idea is that you can't use earlier data at all in a fully incremental model. I think this conflates incremental with memoryless - you can have an incremental learner that keeps some memory of prior data (usually in some kind of compressed format, perhaps tallied statistics of some kind). For me, all incremental means is that the learner processes data as it comes in - it doesn't preclude remembering prior data in some (or even a lot of) detail.
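
To make the distinction concrete, here's a minimal sketch (my own toy illustration, not anything from the paper) of an incremental learner whose memory of prior data is a compressed set of tallies rather than nothing at all:

```python
from collections import Counter

class IncrementalLearner:
    """Processes data points one at a time, but keeps a compressed
    memory of everything seen so far (here, simple running tallies)."""

    def __init__(self):
        self.counts = Counter()  # compressed record of prior data
        self.total = 0

    def observe(self, item):
        # Incremental: each data point is processed as it arrives.
        self.counts[item] += 1
        self.total += 1

    def probability(self, item):
        # But hypotheses can still be (re-)evaluated against statistics
        # summarizing ALL earlier data - incremental != memoryless.
        return self.counts[item] / self.total if self.total else 0.0

learner = IncrementalLearner()
for token in ["ba", "di", "ba", "gu", "ba"]:
    learner.observe(token)
print(learner.probability("ba"))  # 0.6, recovered from remembered tallies
```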

Related point: Human memory constraints. In the word segmentation section, Frank mentions that experimental results suggest that "learners may not store the results of segmentation veridically, falsely interpolating memories that they have heard novel items that share all of their individual transitions within a set of observed items". At first, I thought this was about humans not storing the actual segmentations in memory (and I thought, well, of course not - they're storing the recovered word forms). But the second bit made me think this was actually even more abstract than that - it seems to suggest that artificial-language participants were extracting probabilistic rules about word forms, rather than the word forms themselves. Maybe this is because the word forms were disconnected from meaning in the experiments described, so the most compact representation was of the rules for making word forms, rather than of the word forms themselves?
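
One way to picture this (a toy illustration of mine, not the actual experiments): if what's stored is transition statistics rather than word forms, then a novel form whose transitions are all attested is indistinguishable from a remembered one.

```python
from collections import Counter

def transition_probs(words):
    """Estimate syllable-to-syllable transitional probabilities
    from a set of observed (syllabified) word forms."""
    pair_counts, first_counts = Counter(), Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def familiarity(word, tp):
    """Product of transitional probabilities along the word."""
    score = 1.0
    for a, b in zip(word, word[1:]):
        score *= tp.get((a, b), 0.0)
    return score

# Toy artificial language (made up): words as syllable tuples.
observed = [("pa", "bi", "ku"), ("ti", "bi", "do")]
tp = transition_probs(observed)
# A never-heard form built entirely from attested transitions
# looks exactly as "familiar" as a genuinely observed one:
print(familiarity(("pa", "bi", "ku"), tp))  # 0.5 (observed)
print(familiarity(("pa", "bi", "do"), tp))  # 0.5 (novel interpolation)
```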

I loved the Goldsmith (2010) quote: "...if you dig deep enough into any task in acquisition, it will become clear that in order to model that task effectively, a model of every other task is necessary". This is probably generally true no matter what you're studying - you always have to simplify and pretend things are disconnected when you start out in order to make any progress. But then, once you know a little something, you can relax the idealizations. And Frank notes the synergies among acquisition tasks, which seems like exactly the right way to think about it (at least, now that we think we know something about the individual acquisition tasks involved). A good chunk of the exciting work going on in acquisition modeling involves solving multiple tasks simultaneously, leveraging information from the different tasks to make solving all of them easier. However, once you start trying to do this, you need a precise model of how that leveraging/integration process works.

Another great quote (this time from George Box): "all models are wrong, but some are useful". So true - and related to the point above. I think a really nice contribution Frank makes is in thinking about ways in which models can be useful - whether they provide a general framework or are formal demonstrations of simple principles, for example.

I think this quote might ruffle a few linguist feathers: "...lexicalized (contain information that is linked to individual word forms), the majority of language acquisition could be characterized as 'word learning'. Inferring the meaning of individual lexical items...". While this could technically be true (given really complex ideas about word "meaning"), the complexity of the syntactic acquisition task gets a little lost here, especially given what many people think of as "word meaning". In particular, the rules for putting words together aren't necessarily connected directly to lexical semantics (though of course, individual word meaning plays a part).

I think the Frank et al. work on intention inference when learning a lexicon demonstrates a nice sequence of research w.r.t. the utility of computational models. Basically, child behavior was best explained by a principle of mutual exclusivity. So, for a while, that was a placeholder, i.e., something like "Use mutual exclusivity to make your decision". Then Frank et al. came along, hypothesized where mutual exclusivity could come from, and showed how it could arise from more basic learning biases (e.g., "use probabilistic learning this way"). That is, mutual exclusivity itself didn't have to be a basic unit. This reminds me of the Subset Principle in generative linguistics, which falls out nicely from the Size Principle of Bayesian inference.
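
For concreteness, here's a minimal sketch of the Size Principle at work (my own toy example, assuming a uniform prior and examples sampled uniformly from the true hypothesis's extension):

```python
def size_principle_posterior(hypotheses, data):
    """Bayesian size principle: if examples are sampled uniformly from
    the true hypothesis's extension, then P(data | h) = (1/|h|)^n for
    any h containing all n examples. Smaller consistent hypotheses are
    preferred, and the preference sharpens exponentially with n."""
    likelihoods = {}
    for name, extension in hypotheses.items():
        consistent = all(d in extension for d in data)
        likelihoods[name] = (1.0 / len(extension)) ** len(data) if consistent else 0.0
    total = sum(likelihoods.values())
    return {name: l / total for name, l in likelihoods.items()}

# Hypothetical word-meaning hypotheses (made up for illustration):
hypotheses = {
    "dalmatian": {"fido", "rex"},
    "dog": {"fido", "rex", "spot", "lassie"},
}
print(size_principle_posterior(hypotheses, ["fido"]))         # ~0.67 vs. ~0.33
print(size_principle_posterior(hypotheses, ["fido", "rex"]))  # ~0.80 vs. ~0.20
```

The smaller, subset hypothesis dominates as consistent examples accumulate - the subset preference falls out of the likelihood, with no dedicated principle needed.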

It's an interesting idea that humans learn best when there are multiple (informationally redundant) cues available, as opposed to just one really informative cue. I'm not sure the Mintz frequent frame is a really good example of this, though - a frame vs. a bigram seems like the same kind of statistical cue. Though maybe the point is more that the framing words provide more redundancy, rather than being different kinds of cues.
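
To spell out why they look like the same kind of cue (my own toy sketch, not Mintz's actual implementation):

```python
from collections import defaultdict

def frequent_frames(corpus):
    """Tally Mintz-style frames (preceding word, following word) and
    collect the target words occurring in each frame. A bigram cue
    would condition on the preceding word only; the frame adds a
    redundant second framing word, not a different kind of statistic."""
    frames = defaultdict(list)
    for sentence in corpus:
        for prev, target, nxt in zip(sentence, sentence[1:], sentence[2:]):
            frames[(prev, nxt)].append(target)
    return sorted(frames.items(), key=lambda kv: len(kv[1]), reverse=True)

# Tiny made-up corpus:
corpus = [
    ["you", "want", "it"],
    ["you", "see", "it"],
    ["you", "take", "it"],
]
print(frequent_frames(corpus)[0])
# (('you', 'it'), ['want', 'see', 'take']) - verbs cluster in one frame
```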

It's also a really interesting idea to measure success by having the output of a model be an intermediate representation used in some other task that has an uncontroversial gold standard. Frank talks about it in the context of syntactic categories, but I could easily imagine the same thing applying to word segmentation. It's definitely a recurring problem that we don't want perfect segmentation for models of infant word segmentation - but then, what do we want? So maybe we can use the output of word segmentation as the input to word- (or morpheme-) meaning mapping.
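
A sketch of what that evaluation pipeline could look like (all names and the mini-corpus are invented for illustration): segmentation quality is read off the accuracy of a deliberately simple word-meaning mapper fed the segmenter's output.

```python
from collections import Counter, defaultdict

def downstream_score(segmented_utterances, scenes, gold_lexicon):
    """Score a segmenter not against a gold segmentation, but by how
    useful its output tokens are for a downstream word-meaning mapping
    task (here, bare-bones cross-situational co-occurrence counting)."""
    cooccur = defaultdict(Counter)
    for tokens, referents in zip(segmented_utterances, scenes):
        for t in tokens:
            for r in referents:
                cooccur[t][r] += 1
    learned = {t: c.most_common(1)[0][0] for t, c in cooccur.items()}
    hits = sum(1 for w, m in gold_lexicon.items() if learned.get(w) == m)
    return hits / len(gold_lexicon)

# Made-up mini-corpus: segmenter output paired with observed referents.
segmented = [["ball"], ["doggie", "ball"], ["doggie"]]
scenes = [["BALL"], ["DOG", "BALL"], ["DOG"]]
print(downstream_score(segmented, scenes, {"ball": "BALL", "doggie": "DOG"}))  # 1.0
```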

It took me a little while to understand what "expressive" meant in this context. I think it relates to the informational content of a representation - if a representation is expressive, it can cover a lot of data while being very compact (e.g., rule-based systems, instead of mappings between individual lexical items). A quote near the end gets at this more directly: "...it becomes possible to generate new sentences and to encode sentences more efficiently. At all levels of organization, language is non-random: it is characterized by a high degree of redundancy and hence there is a lot of room for compression." I think this is basically an information-theoretic motivation for having a grammar (which is great!). In a similar vein, it seems like this would be an argument in favor of Universal Grammar-style parameters, since they would be a very good compression of complex regularities and relationships in the data.
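
A crude toy illustration of that compression argument (character counts standing in for bits, entirely my own example): a rule-based representation pays for each regularity once, and the savings grow with every form the rule covers.

```python
def listing_cost(forms):
    """Description length (characters as a crude stand-in for bits)
    of storing every form verbatim."""
    return sum(len(f) for f in forms)

def grammar_cost(stems, rule="+ed"):
    """Cost of storing the stems once plus a single general rule."""
    return sum(len(s) for s in stems) + len(rule)

stems = ["walk", "talk", "jump", "play", "climb"]
past = [s + "ed" for s in stems]
# Listing stems and past forms verbatim vs. stems plus one rule:
print(listing_cost(stems + past), "vs", grammar_cost(stems))  # 52 vs 24
```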

~~~

References

Pearl, L. 2010. Using computational modeling in language acquisition research. In E. Blom & S. Unsworth (Eds.), Experimental Methods in Language Acquisition Research. John Benjamins.
