Wednesday, October 20, 2010

Next time: Bod (2009)

Thanks to everyone who was able to join us this time to discuss Yang (2010)! We had quite a rousing discussion. Next time on November 3, we'll be looking at Bod (2009) (available at the CoLa reading group schedule page), who has a differing viewpoint on models of syntactic acquisition.

Monday, October 18, 2010

Yang (2010): Some thoughts

One thing I always like about Yang's work: whether or not you agree with what he says, it's always very clear what his position is and what evidence he considers relevant to the question at hand. Because of this, his papers (for me) are very enjoyable reads.

One thing that stood out to me in this paper was his stance on computational-level vs. algorithmic-level models of syntactic acquisition. Right up front, he establishes his view that algorithmic-level models are the ones with the most to contribute (a line of discussion that continues in section 4, where he seems dismissive of some existing computational-level models). I do have great sympathy for wanting to create algorithmic-level models, but I still believe computational-level models have something to offer. The basic idea for me is this: if you have an ideal learner that can't learn the required knowledge from the available data, that seems like a great starting point for a poverty of the stimulus claim. (It may turn out that some algorithmic-level model doesn't have the same issue, but then you know the "magic" is in the specific process that algorithmic-level model uses. And maybe that "magic" corresponds to some prior knowledge or innate bias in the learning procedure, etc. At any rate, the ideal learner model has contributed something.)

I also found Yang's discussion of the PAC learnability framework enlightening in section 3. A couple of comments stood out to me:

  • p.6: The comment about how to make an infinite language finite by ignoring sufficiently long sentences (ones that, for example, involve lots of recursion). Yang notes that few language scientists would find the notion of a finite language appealing. On the other hand, I feel like we could have some sympathy for people who believe that arbitrarily long sentences are not really part of the language. Yes, they're part of the language by definition (of what recursion gives you, for example), but they seem not to be part of the language if we define language as something like "the strings that people could utter in order to communicate". I think Yang's larger point remains, though: the set of strings generated by the grammar of any language is infinite.
  • In that same paragraph, Yang seems dismissive of the idea that the hypothesis space of probabilistic context-free grammars (PCFGs) is realistic in current model implementations, specifically because the "prior probabilities of these grammars must be assumed". While it may be the case that some models take this approach, I don't think it's necessarily true. If you already have a PCFG, couldn't the prior for the grammar be derived from some combination of the rules' probabilities? (I feel like Hsu & Chater (2010) do something like this with their MDL framework, where the prior is the encoding of the grammar. See the sketch just below this list.)
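
To make that last point concrete, here's a minimal sketch (my own toy illustration, not Yang's or Hsu & Chater's actual implementation) of how a PCFG's prior could be read off of the grammar itself rather than assumed: encode each rule's symbols and its probability, add up the bits, and treat 2^(-total bits) as an unnormalized prior. The rules, the probabilities, the uniform code over symbols, and the way rule probabilities are costed are all just illustrative choices.

    import math

    # Toy PCFG: each rule is (left-hand side, right-hand side, probability).
    # The rules and probabilities here are made up purely for illustration.
    toy_pcfg = [
        ("S",  ["NP", "VP"], 1.0),
        ("NP", ["Det", "N"], 0.7),
        ("NP", ["N"],        0.3),
        ("VP", ["V", "NP"],  0.6),
        ("VP", ["V"],        0.4),
    ]

    symbols = {"S", "NP", "VP", "Det", "N", "V"}
    bits_per_symbol = math.log2(len(symbols))  # naive uniform code over the symbol inventory

    def grammar_code_length(rules):
        """Bits needed to write the grammar down: the symbols in each rule,
        plus that rule's probability (costed here as -log2(p), though a real
        MDL account would need a fixed-precision code for probabilities)."""
        total = 0.0
        for lhs, rhs, prob in rules:
            total += bits_per_symbol * (1 + len(rhs))  # LHS symbol + RHS symbols
            total += -math.log2(prob)                  # the rule's probability
        return total

    bits = grammar_code_length(toy_pcfg)
    prior = 2 ** (-bits)  # shorter grammars get a larger (unnormalized) prior
    print(f"grammar code length: {bits:.1f} bits, unnormalized prior: {prior:.3g}")

This is just the usual correspondence between code lengths and probabilities (a grammar that takes L bits to state gets a prior proportional to 2^-L), which is why "assuming the prior" and "encoding the grammar" can end up being two descriptions of the same move.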


Wednesday, October 6, 2010

Next time: Yang (2010)

Thanks to everyone who was able to come to our discussion this week! Next time we'll be reading a review of computational models of syntax acquisition by Charles Yang, available for download at the CoLa reading group's schedule page.

Monday, October 4, 2010

Hsu & Chater (2010): Some Thoughts

So this was a bit longer of an article than the ones we've been reading, probably because it was trying to establish a framework for answering questions about language acquisition rather than tackling a single problem or phenomenon. I definitely appreciated the effort the authors put in at the beginning to motivate the particular framework they advocate (including the extensive energy-saving appliance analogy). It's certainly true that applying the framework to a range of different phenomena shows its utility.

More targeted thoughts:
  • The authors do go out of their way to highlight that this is a framework for learnability (rather than the "acquirability" I'm fond of), since it assumes an ideal learner. They often mention that it represents an "upper bound on learnability", and note that they provide a "predicted order for the acquisition by children". I think it's important to remember that this upper bound and predicted order only apply if children view the problem the particular way the authors frame it. Looking through the supplemental material and the specifics of the phenomena they examine, sometimes very specific knowledge (or kinds of knowledge) is assumed (for example, knowledge of traces, "it-punctuation", and "non-trivial" parenthood - which may or may not be equivalent to the linguistic notion of c-command). In addition, I think they have to assume that no other information is available for these phenomena in order for the "predicted order for the acquisition by children" to hold. I have the same reservations about the data they use to evaluate their ideal MDL learners - some of it quite clearly isn't child-directed speech, presumably because the frequencies in child-directed speech are too low for their MDL learner to function properly. But...doesn't that say something about the data real children are exposed to, and how there may not be sufficient data for various phenomena in child-directed speech? Relatedly, maybe the mismatch the authors see between their model's predictions and actual child behavior in figure 5 comes from the fact that they didn't train their model on child-directed speech data?
  • On a related note, I really wonder if there's some way to translate the MDL framework into something more realistic for children (cognitively plausible, etc.). The intuition behind the framework is simple: you want a balance between simplicity of the grammar and coverage of the data. The Bayesian framework is a specific form of this balancing act, and can be adapted fairly easily into an online kind of process. Can the MDL framework? What would code length correspond to - efficient processing and representation of data? The authors do try to point out where the MDL evaluation would come in during learning: it's the decision to add or not add a rule to the existing grammar (so the comparison is between two competing grammars that differ by one rule). See the sketch after this list for the kind of comparison I have in mind.
  • I also really want to be able to map some of the MDL-specific notions onto something psychological. (Though maybe this isn't possible.) For example, what is C[prob] (the constant value for encoding probabilities)? Is it some psychological processing/evaluation cost? In some cases, the fact that a particular form is required in order to learn the restriction reminds me strongly of the notion of "unambiguous data" that's been around in generative grammar acquisition work for a while.
  • Specific phenomena: I was surprised by some of the results the authors found. For example, "what is" (Table 5) - the frequency in child-directed speech is over 12000 occurrences per year, but in adult-directed speech it's less than 300 per year? That seems odd. The same happens for "who is" (Table 6). Turning to the dative alternation examples, "donate" apparently appears far more often per year (15 times) than "shout" (less than 4), "whisper" (less than 2), or "suggest" (less than 1), which also strikes me as odd. Also, for Table 16 on the transitive/intransitive examples, how does an encoding "savings" of 0.0 bits lead to any kind of learning under this framework? Maybe this is a rounding error?
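
Following up on the online-MDL question above, here's a rough sketch of the comparison as I understand it, with the two-part MDL score written out as grammar bits plus data bits. The function names and the numbers are my own toy illustration, not the authors' implementation; the point is just that the add-this-rule decision is a comparison of two total code lengths, which lines up with the Bayesian trade-off of -log P(grammar) - log P(data | grammar).

    import math

    def mdl_score(grammar_bits, data_probs):
        """Two-part MDL score: bits to encode the grammar itself, plus bits to
        encode the observed data under that grammar (-log2 p for each datum).
        This is the same trade-off as -log2(prior) + -log2(likelihood)."""
        data_bits = sum(-math.log2(p) for p in data_probs)
        return grammar_bits + data_bits

    def should_add_rule(old_grammar_bits, old_data_probs,
                        new_grammar_bits, new_data_probs):
        """Compare the grammar without the candidate rule to the grammar with it,
        and adopt the rule only if the total description length drops
        (i.e., the 'savings' in bits is positive)."""
        savings = (mdl_score(old_grammar_bits, old_data_probs)
                   - mdl_score(new_grammar_bits, new_data_probs))
        return savings > 0, savings

    # Made-up numbers: the candidate rule costs 6 extra grammar bits, but lets the
    # grammar assign higher probability to three observed sentences.
    adopt, savings = should_add_rule(40.0, [0.001, 0.002, 0.001],
                                     46.0, [0.010, 0.020, 0.010])
    print(f"adopt rule: {adopt}, savings: {savings:.1f} bits")

(On this reading, a savings of exactly 0.0 bits gives the learner no reason to prefer either grammar, which is why the Table 16 number reads to me like rounding.)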