Some thoughts on Yang (2010)

I found this paper a real delight to read - like many of Yang's other papers that we've looked at, it's very clear what was done and how this relates to the larger questions that are being examined.  In particular, I thought it was excellent to compare the item-based approach to a generative approach, based on what predictions they would make for children's productions.  As Yang pointed out, a lot of previous intuitions about what it means to have a generative (or productive) grammar didn't take into account the Zipfian nature of linguistic data.  So, by having a way to generate predictions about how much productivity (as measured by overlap) is expected under each viewpoint, we not only get support for the generative system viewpoint but also get evidence against (at least one version of) the item-based approach.  Given how popular the item-based approach is in some circles (e.g., a 2009 PNAS article by Bannard, Lieven, & Tomasello), I thought this was quite striking.  From my viewpoint, this is one great way to use mathematical & modeling techniques: to adjudicate between competing theoretical representations.

Some more targeted thoughts:

  • I really liked in section 1 where the quotes from Tomasello were presented - this gives a clear idea about what exactly is claimed by the item-based approach, and how they have previously used (apparently flawed) intuitions about expected productivity to support that approach. I thought a quote at the end of section 3.3 summed it up beautifully:  "...the advocates of item-based learning not only rejected the alternative hypothesis without adequate statistical tests, but also accepted the favored hypothesis without adequate statistical tests."
  • The remark in section 2.2 about how even adult usage isn't "productive" by the standard of the item-based crowd is a really nice point.  If adult usage isn't "productive", but we believe adults have a generative system, then this should make us question our assumption that "unproductive" child usage indicates a lack of a generative system.  Of course, I suppose one might argue that maybe we don't think adults have a fully generative system (this is the view of construction grammar, to some extent, I believe).
  • In section 3.2, I thought Table 1 was a beautiful demonstration of the match between expected overlap for the generative system and the empirically observed overlap in children's speech. 
  • A minor point about the S/N threshold discussed in 3.2 - I get that S/ln N is a reasonable approximation for rank, especially as N gets very large.  However, I'm not quite sure I understand why S/N was chosen as the threshold.  I get that it's an upper bound kind of thing, but if S/ln N grows more slowly than S/N, why not just use S/ln N to get a more accurate threshold? It's not as if ln N is hard to calculate.
  • In section 3.3, I get that this is merely an attempt to make the item-based approach explicit (and maybe the item-based folk would think it's not the right characterization), but I think it's a pretty good attempt.  It gets at the heart of what their theory predicts - you get lots of storage of individual lexical item combinations.  Then, of course, Table 2 shows how this representation doesn't match the empirically observed overlap rates nearly as well, so we have a point against that representation. 
  • Section 4 is nice in that it suggests that this way of testing theoretical representations should be a general-purpose one - do it for determiner usage, but also for verbal morphology and verb argument structure.  Though this analysis wasn't conducted for those other phenomena, I was very convinced that the data show a Zipfian distribution, and so we might expect a generative system to be compatible with them.
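
To make the "low overlap doesn't mean no grammar" intuition concrete, here is a rough sketch of how one might compute the overlap expected from a fully productive system when nouns are Zipfian. This is my own illustration, not Yang's exact calculation: the function names are made up, noun probabilities are assumed to be proportional to 1/rank, the 2/3 vs. 1/3 determiner preference is an illustrative assumption, and overlap is averaged over all noun types in the vocabulary rather than only those that happen to be sampled.

```python
import random


def zipf_probs(num_types):
    """Zipfian probabilities, p_r proportional to 1/r, for ranks 1..num_types."""
    weights = [1.0 / r for r in range(1, num_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_overlap(num_types, sample_size, det_probs=(2 / 3, 1 / 3)):
    """Average probability that a noun type occurs with BOTH determiners.

    Assumes each token picks its noun (Zipfian) and its determiner
    independently, i.e. a fully productive rule. det_probs is an
    illustrative assumption.
    """
    d_a, d_the = det_probs
    total = 0.0
    for p in zipf_probs(num_types):
        never_seen = (1 - p) ** sample_size
        never_with_a = (1 - p * d_a) ** sample_size
        never_with_the = (1 - p * d_the) ** sample_size
        # Inclusion-exclusion: P(seen with "a" and seen with "the").
        total += 1 - never_with_a - never_with_the + never_seen
    return total / num_types


def simulated_overlap(num_types, sample_size, det_probs=(2 / 3, 1 / 3), seed=0):
    """Monte Carlo check of expected_overlap on one simulated sample."""
    rng = random.Random(seed)
    nouns = rng.choices(range(num_types), weights=zipf_probs(num_types), k=sample_size)
    dets = rng.choices(("a", "the"), weights=det_probs, k=sample_size)
    seen = {}
    for noun, det in zip(nouns, dets):
        seen.setdefault(noun, set()).add(det)
    both = sum(1 for det_set in seen.values() if len(det_set) == 2)
    # Same denominator as expected_overlap: all types, sampled or not.
    return both / num_types


if __name__ == "__main__":
    for n in (100, 500, 2000):
        print(n, round(expected_overlap(100, n), 3), round(simulated_overlap(100, n), 3))
```

The thing to notice is that when the sample is small relative to the number of noun types, the expected overlap comes out well below 1 even though every noun can, by construction, combine with either determiner. That is the same qualitative point the paper makes against reading low overlap as evidence of item-based storage.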

