Friday, February 3, 2012

Some thoughts on Waterfall et al (2010)

What I really like about this paper is the opening discussion where they sketch the broad ideas that motivated the studies discussed in the rest of the paper.  They explicitly talk about why the aim of language acquisition is a grammar, why we should care about the algorithmic level, what developmental computational psycholinguistics ought to be, why current computational models are still lacking because they miss out on the social situatedness of language, and what exactly is meant by "psychologically real" (and also how that differs from "algorithmically learnable").  I found this to be very valuable to just have all in one place.  And I admit, it got my hopes up for what kind of model they would actually be using.

Unfortunately (for me), the rest of the paper ended up being somewhat anti-climactic because they don't end up implementing a model that has all the features of interest.  Of course, that's a tall order, but they go through the process of running models that have the first three features, and then they talk about a lovely new discourse-related information type that seems like it should be incorporated into their model - and then they don't incorporate it.  I think I was expecting them to at least talk about how to incorporate it into the models they spent so much time on in the beginning, even if it was infeasible at the current time to actually implement (for whatever reason).  But that didn't seem to be what happened.

This isn't to say that the models they implemented and the identification of the "variation set" construct aren't interesting - it's just that I was expecting more based on the opening.  As it is, the paper ends up feeling a bit scattered to me - a lot of potentially useful pieces, but they're not tied together very well.

Some more targeted thoughts:

p.674: I like that they were questioning the use of a gold standard, given that our theories about what the syntactic structure might be may not necessarily match psychological reality.  I did find their definitions of recall and precision a bit hard to understand, though.  Like many other things in the paper, I would have found an explicit formula (and possibly an example) to be more helpful than the text description.  My best understanding of recall was something like the number of new generalizations divided by the test set plus the number of new generalizations, while precision was something like the number of correct new generalizations over the total number of new generalizations.

p.676: They talk about how a strength of their models is that there's no preliminary knowledge of things like grammatical categories (parts-of-speech).  While it's nice to be able to say "Look what we can do with no knowledge!", I think this actually makes the problem less psychologically realistic. As far as I know, everyone's willing to grant that the child has some (at least rudimentary) knowledge of grammatical categories before the child starts positing syntactic structure.  This is the kind of thing we might get from a child using frequent frames, for instance.

The ADIOS algorithm: I admit, I found this description very difficult to decipher without accompanying examples.  It appears to be a batch algorithm, or is it (it appears that the graph is "rewired" every time a new pattern is detected)?  What's an example of a bundle? What's a local flow quantity that would act as a context-sensitive probabilistic criterion for a significant bundle?  How exactly does that work?  How dissimilar is this whole process from frequent frames, which also induce equivalence classes?  What are the basic abilities/knowledge required to make this algorithm work - the ability to create a graph, to identify bundles, to allow recursion of abstract patterns?

The ConText algorithm:  This was a little better, because they provided a simple example.  But again, I found myself wanting more explicit definitions for the different model components in order to understand how reasonable (or not) a model this was psychologically.  For example, there's a local context window of 2, which means in a sentence like "I really like cute penguins", we would get a context vector for "like" where the lefthand context is "I really" and the righthand context is "cute penguins".  Okay, great (though I worry about a window of 2 on each side in terms of data sparseness).  And in order to construct equivalence classes based on this, the algorithm operates in batch mode over the data.  Again, okay.  But then, some kind of distance measure is posited to compare different context vectors to each other involving the angle between context vectors - how is this instantiated?  What does the angle between "I really" and "But I" look like, for example? Presumably these are mapped into real numbers somehow...  On a related note, once the algorithm gets clusters based on these context vectors, it then seems to do something with rewriting sequences - but what are sequences?  Are these the utterances themselves, the partially abstracted representations the learner is forming, something else?

p.681: ConText results - I thought it was interesting that the ConText model ends up with subcategorization (for example, eat and drink being in the same class).  This again reminds of frequent frame results, and made me want an explicit compare and contrast.

p.683: Human judgments of acceptability of new sentences created by ConText learner - I thought it was a bit strange to ask the participants to judge the acceptability based on how likely it was to appear in child-directed speech.  Would the participants have a good sense of child-directed speech?  My experience with undergrads who parse utterances from child-directed speech is that they're utterly surprised by how "ungrammatical" and semi-nonsensical conversational speech (and especially child-directed speech) is.

Variation sets: This is something of real value to computational models, I think. We have empirical evidence that children especially benefit from these particular data units and we have a reasonable idea of how to automatically identify them, and so we could reasonable expect a model to be extra sensitive to these kinds of data (perhaps give these data more weight).  There's an interesting comment on p.688 where variation sets with roughly 50% of the material changing are the most helpful to children.  My big question was why - what's so special about 50%?  Does this represent some optimal tradeoff in terms of recognition and contrast?  Another interesting note on p.689 and Table 2 on p.695, where they looked at how predictive the frequent n-grams were in variation sets for part-of-speech - some of them are pretty predictive, which is nice, and this shows that sometimes n-grams are useful, as opposed to needing framing elements (this was something a paper by Chemla et al. 2009 looked at).  I do wonder at how this predictive quality would hold up cross-linguistically, though - what about languages where the wh-word doesn't move, or languages without auxiliary "do"?

Incremental learning (p.698): There's some discussion at the very end about how to transform ConText into an incremental learner, which I think is a good thing to think about.  However, I wonder about the  motivation behind using the gap automatically (i.e., a furry marmot gets additional "frames" of ___ furry marmot, a ____ marmot, and a furry _____ presumably).  Is the idea that this will jumpstart the abstraction process, which otherwise would have to wait until it saw another instance that used two of those words?  (Or in the case of a context window of 2 on each side, 4 of the words?)



References

Chemla, E., Mintz, T., Bernal, S., & Christophe, A. (2009). Categorizing Words Using "Frequent Frames": What Cross-Linguistic Analyses Reveal About Distributional Acquisition Strategies. Developmental Science.





No comments:

Post a Comment