Discussion board for the reading group based out of UCI.
Monday, February 27, 2012
Thanks to everyone who was able to join us for our spirited discussion of Yang (2010). I think we definitely clarified what that study accomplishes in the debate between the two theoretical viewpoints. Next time on March 12, we'll be looking at a paper that also investigates productivity, examining it through the learning angle, in addition to the basic question of representation.
O'Donnell, T.J., Snedeker, J., Tenenbaum, J.B., & Goodman, N.D. (2011). Productivity and reuse in language. Proceedings of the Thirty-Third Annual Conference of the Cognitive Science Society. Boston, MA.
See you then!
Friday, February 24, 2012
Some thoughts on Yang (2010)
I found this paper a real delight to read - like many of Yang's other papers that we've looked at, it's very clear what was done and how this relates to the larger questions that are being examined. In particular, I thought it was excellent to compare the item-based approach to a generative approach, based on what predictions they would make for children's productions. As Yang pointed out, a lot of previous intuitions about what it means to have a generative (or productive) grammar didn't take into account the Zipfian nature of linguistic data. So, by having a way to generate predictions about how much productivity (as measured by overlap) is expected under each viewpoint, we not only get support for the generative system viewpoint but also actually have support against (at least one version of) the item-based approach. Given how popular the item-based approach is in some circles (e.g., a 2009 PNAS article by Bannard, Lieven, & Tomasello), I thought this was quite striking. From my viewpoint, this is one great way to use mathematical & modeling techniques: to adjudicate between competing theoretical representations.
Some more targeted thoughts:
- I really liked the quotes from Tomasello presented in section 1 - they give a clear idea of what exactly is claimed by the item-based approach, and how its advocates have previously used (apparently flawed) intuitions about expected productivity to support that approach. I thought a quote at the end of section 3.3 summed it up beautifully: "...the advocates of item-based learning not only rejected the alternative hypothesis without adequate statistical tests, but also accepted the favored hypothesis without adequate statistical tests."
- The remark in section 2.2 about how even adult usage isn't "productive" by the standard of the item-based crowd is a really nice point. If adult usage isn't "productive", but we believe adults have a generative system, then this should make us question our assumption that "unproductive" child usage indicates a lack of a generative system. Of course, I suppose one might argue that maybe we don't think adults have a fully generative system (this is the view of construction grammar, to some extent, I believe.)
- In section 3.2, I thought Table 1 was a beautiful demonstration of the match between expected overlap for the generative system and the empirically observed overlap in children's speech.
- A minor point about the S/N threshold discussed in 3.2 - I get that S/ln N is a reasonable approximation for rank, especially as N gets very large. However, I'm not quite sure I understand why S/N was chosen as the threshold. I get that it's an upper bound kind of thing, but if S/ln N grows more slowly than S/N, why not just use S/ln N to get a more accurate threshold? It's not as if ln N is hard to calculate.
- In section 3.3, I get that this is merely an attempt to make the item-based approach explicit (and maybe the item-based folk would think it's not the right characterization), but I think it's a pretty good attempt. It gets at the heart of what their theory predicts - you get lots of storage of individual lexical item combinations. Then, of course, Table 2 shows how this representation doesn't match the empirically observed overlap rates nearly as well, so we have a point against that representation.
- Section 4 is nice in that it suggests that this way of testing theoretical representations should be a general-purpose one - do it for determiner usage, but also for verbal morphology and verb argument structure. Though this analysis wasn't conducted for those other phenomena, I was very convinced that the data show a Zipfian distribution, and so we might expect a generative system to be compatible with them.
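For anyone who wants to play with the S/N vs. S/ln N question from the point about section 3.2, here is a minimal sketch for comparing the quantities. It assumes an idealized Zipfian distribution in which the word of rank r has probability 1/(r H_N), with H_N the N-th harmonic number; that idealization, the function names, and the example values S = 1000 and N = 100 are my own assumptions for illustration, not Yang's actual derivation.

```python
import math

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, the normalizer of an idealized Zipfian distribution."""
    return sum(1.0 / r for r in range(1, n + 1))

def expected_count(rank, sample_size, n_types):
    """Expected number of tokens of the word at `rank` in a sample of size S,
    assuming p(rank) = 1 / (rank * H_N)."""
    return sample_size / (rank * harmonic(n_types))

def rank_cutoffs(sample_size, n_types):
    """Three candidate rank cutoffs for the same S and N (hypothetical helper)."""
    return {
        "S/H_N": sample_size / harmonic(n_types),   # rank where the expected count is exactly 1
        "S/ln N": sample_size / math.log(n_types),  # same idea, using H_N ~ ln N
        "S/N": sample_size / n_types,
    }

# Illustrative values only: 1000 tokens drawn over 100 word types.
for name, value in rank_cutoffs(1000, 100).items():
    print(f"{name}: {value:.1f}")
```

With these illustrative values, S/N comes out to 10 while S/ln N is roughly 217, so the two cutoffs can sit quite far apart; trying other values of S and N is a quick way to see how the gap changes.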
~~~
References:
Bannard, C., Lieven, E., & Tomasello, M. (2009). Modeling children's early grammatical knowledge. Proc Natl Acad Sci U S A, 106(41), 17284-9.
Monday, February 6, 2012
Next time on Feb 27: Yang (2010)
Thanks to everyone who was able to join our extremely lively discussion on Waterfall et al. (2010), and their approach to learning generative grammars from realistic data! Next time on February 27, we'll be looking at a paper that examines a way to quantify claims of linguistic productivity.
Yang, C. (2010 Ms.) Who's Afraid of George Kingsley Zipf? Unpublished Manuscript, University of Pennsylvania.
See you then!
Friday, February 3, 2012
Some thoughts on Waterfall et al. (2010)
What I really like about this paper is the opening discussion where they sketch the broad ideas that motivated the studies discussed in the rest of the paper. They explicitly talk about why the aim of language acquisition is a grammar, why we should care about the algorithmic level, what developmental computational psycholinguistics ought to be, why current computational models are still lacking because they miss out on the social situatedness of language, and what exactly is meant by "psychologically real" (and also how that differs from "algorithmically learnable"). I found it very valuable to have all of this in one place. And I admit, it got my hopes up for what kind of model they would actually be using.
Unfortunately (for me), the rest of the paper ended up being somewhat anti-climactic because they don't end up implementing a model that has all the features of interest. Of course, that's a tall order, but they go through the process of running models that have the first three features, and then they talk about a lovely new discourse-related information type that seems like it should be incorporated into their model - and then they don't incorporate it. I think I was expecting them to at least talk about how to incorporate it into the models they spent so much time on in the beginning, even if it was infeasible at the current time to actually implement (for whatever reason). But that didn't seem to be what happened.
This isn't to say that the models they implemented and the identification of the "variation set" construct aren't interesting - it's just that I was expecting more based on the opening. As it is, the paper ends up feeling a bit scattered to me - a lot of potentially useful pieces, but they're not tied together very well.
Some more targeted thoughts:
p.674: I like that they were questioning the use of a gold standard, given that our theories about what the syntactic structure might be may not necessarily match psychological reality. I did find their definitions of recall and precision a bit hard to understand, though. Like many other things in the paper, I would have found an explicit formula (and possibly an example) to be more helpful than the text description. My best understanding of recall was something like the number of new generalizations divided by the test set plus the number of new generalizations, while precision was something like the number of correct new generalizations over the total number of new generalizations.
p.676: They talk about how a strength of their models is that there's no preliminary knowledge of things like grammatical categories (parts-of-speech). While it's nice to be able to say "Look what we can do with no knowledge!", I think this actually makes the problem less psychologically realistic. As far as I know, everyone's willing to grant that the child has some (at least rudimentary) knowledge of grammatical categories before the child starts positing syntactic structure. This is the kind of thing we might get from a child using frequent frames, for instance.
The ADIOS algorithm: I admit, I found this description very difficult to decipher without accompanying examples. It appears to be a batch algorithm - or is it, given that the graph seems to be "rewired" every time a new pattern is detected? What's an example of a bundle? What's a local flow quantity that would act as a context-sensitive probabilistic criterion for a significant bundle? How exactly does that work? How dissimilar is this whole process from frequent frames, which also induce equivalence classes? What are the basic abilities/knowledge required to make this algorithm work - the ability to create a graph, to identify bundles, to allow recursion of abstract patterns?
The ConText algorithm: This was a little better, because they provided a simple example. But again, I found myself wanting more explicit definitions for the different model components in order to understand how reasonable (or not) a model this was psychologically. For example, there's a local context window of 2, which means in a sentence like "I really like cute penguins", we would get a context vector for "like" where the lefthand context is "I really" and the righthand context is "cute penguins". Okay, great (though I worry about a window of 2 on each side in terms of data sparseness). And in order to construct equivalence classes based on this, the algorithm operates in batch mode over the data. Again, okay. But then, some kind of distance measure is posited to compare different context vectors to each other involving the angle between context vectors - how is this instantiated? What does the angle between "I really" and "But I" look like, for example? Presumably these are mapped into real numbers somehow... On a related note, once the algorithm gets clusters based on these context vectors, it then seems to do something with rewriting sequences - but what are sequences? Are these the utterances themselves, the partially abstracted representations the learner is forming, something else?
p.681: ConText results - I thought it was interesting that the ConText model ends up with subcategorization (for example, eat and drink being in the same class). This again reminds me of frequent frame results, and made me want an explicit compare and contrast.
p.683: Human judgments of acceptability of new sentences created by ConText learner - I thought it was a bit strange to ask the participants to judge the acceptability based on how likely it was to appear in child-directed speech. Would the participants have a good sense of child-directed speech? My experience with undergrads who parse utterances from child-directed speech is that they're utterly surprised by how "ungrammatical" and semi-nonsensical conversational speech (and especially child-directed speech) is.
Variation sets: This is something of real value to computational models, I think. We have empirical evidence that children especially benefit from these particular data units and we have a reasonable idea of how to automatically identify them, and so we could reasonably expect a model to be extra sensitive to these kinds of data (perhaps give these data more weight). There's an interesting comment on p.688 where variation sets with roughly 50% of the material changing are the most helpful to children. My big question was why - what's so special about 50%? Does this represent some optimal tradeoff in terms of recognition and contrast? Another interesting note comes on p.689 and in Table 2 on p.695, where they looked at how predictive the frequent n-grams were in variation sets for part-of-speech - some of them are pretty predictive, which is nice, and this shows that sometimes n-grams are useful, as opposed to needing framing elements (this was something a paper by Chemla et al. 2009 looked at). I do wonder how this predictive quality would hold up cross-linguistically, though - what about languages where the wh-word doesn't move, or languages without auxiliary "do"?
Incremental learning (p.698): There's some discussion at the very end about how to transform ConText into an incremental learner, which I think is a good thing to think about. However, I wonder about the motivation behind using the gap automatically (i.e., a furry marmot gets additional "frames" of ___ furry marmot, a ____ marmot, and a furry _____ presumably). Is the idea that this will jumpstart the abstraction process, which otherwise would have to wait until it saw another instance that used two of those words? (Or in the case of a context window of 2 on each side, 4 of the words?)
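Just to spell out what I'm picturing for the automatic gaps, here is a tiny sketch that replaces each word of an utterance with a gap in turn. The function name and the "___" gap symbol are mine, and this is only my reading of the proposal, not the authors' implementation.

```python
def gap_frames(phrase, gap="___"):
    """Return one frame per word, each with that word replaced by a gap."""
    words = phrase.split()
    return [" ".join(words[:i] + [gap] + words[i + 1:]) for i in range(len(words))]

print(gap_frames("a furry marmot"))
# ['___ furry marmot', 'a ___ marmot', 'a furry ___']
```

A three-word utterance yields three frames immediately, which is the sense in which this could jumpstart abstraction from a single instance.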
References
Chemla, E., Mintz, T., Bernal, S., & Christophe, A. (2009). Categorizing Words Using "Frequent Frames": What Cross-Linguistic Analyses Reveal About Distributional Acquisition Strategies. Developmental Science.