More targeted thoughts:
- The authors do go out of their way to highlight that this is a framework for learnability (rather than the "acquirability" I'm fond of), since it assumes an ideal learner. They often mention that it represents an "upper bound on learnability", and note that they provide a "predicted order for the acquisition by children". I think it's important to remember that this upper bound and predicted order still only apply if children view the problem in the particular way that the authors frame it. Looking through the supplemental material and the specifics of the phenomena they examine, sometimes very specific knowledge (or kinds of knowledge) is assumed (for example, knowledge of traces, "it-punctuation", "non-trivial" parenthood - which may or may not be equivalent to the linguistic notion of c-command). In addition, I think they have to assume that no other information is available for these phenomena in order for the "predicted order for the acquisition by children" to hold. I have the same reservations about the data they use to evaluate their ideal MDL learners - some of the data quite clearly isn't child-directed speech, because the frequencies in child-directed speech are too low for their MDL learner to function properly. But...doesn't that say something about the data real children are exposed to, and how there may not be sufficient data for various phenomena in child-directed speech? Relatedly, maybe the mismatch the authors see between their model's predictions and actual child behavior in Figure 5 has to do with the fact that they didn't train their model on child-directed speech data?
- On a related note, I really wonder if there's some way to translate the MDL framework into something more realistic for children (cognitively plausible, etc.). The ideas behind the framework are simple and intuitive - you want a balance between simplicity of grammar and data coverage. The Bayesian framework is a specific form of this balancing act, and can easily be adapted into an online kind of process. Can MDL? What would code length correspond to - efficient processing and representation of data? The authors definitely try to point out where the MDL evaluation would come in for learning, saying that it is the decision to add or not add a rule to the existing grammar (and so the comparison is between two competing grammars that differ by one rule).
- I also really want to be able to map some of the MDL-specific notions to something psychological. (Though maybe this isn't possible.) For example, what is C[prob] (the constant value of encoding probabilities)? Is it some psychological processing/evaluation cost? In some cases, the fact that a particular form is required in order to learn the restriction reminds me strongly of the notion of "unambiguous data" that's been around in generative grammar acquisition for a while.
- Specific phenomena: I was surprised by some of the results the authors found. For example, take "what is" (Table 5): the frequency in child-directed speech is over 12000 occurrences per year, but in adult-directed speech it is less than 300 per year? That seems odd. The same happens for "who is" (Table 6). Turning to the dative alternation examples, "donate" apparently appears far more often per year (15 times) than "shout" (less than 4), "whisper" (less than 2), or "suggest" (less than 1). That also seems odd to me. Finally, for Table 16 on the transitive/intransitive examples, how does an encoding "savings" of 0.0 bits lead to any kind of learning under this framework? Maybe this is a rounding error?
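
To make the one-rule comparison and the "0.0 bits" puzzle a bit more concrete, here is a minimal sketch of how I picture the decision. It is my own simplification, not the authors' actual encoding scheme: I assume the extra rule has a one-time cost in bits, that every relevant occurrence saves the same number of bits, and that the learner adopts the rule once the accumulated savings exceed the cost. The function names and all numbers in the usage note are hypothetical.

```python
import math


def prefers_new_rule(rule_cost_bits, savings_bits_per_occurrence, n_occurrences):
    """Choose between two grammars that differ by one rule.

    Assumption (mine): the grammar with the extra rule wins as soon as the
    bits saved on the data outweigh the bits spent on stating the rule.
    """
    return savings_bits_per_occurrence * n_occurrences > rule_cost_bits


def years_until_learned(rule_cost_bits, savings_bits_per_occurrence, occurrences_per_year):
    """Years of input needed before the extra rule pays for itself.

    Assumption (mine): the input arrives at a constant yearly rate.
    """
    savings_per_year = savings_bits_per_occurrence * occurrences_per_year
    if savings_per_year <= 0:
        # With no savings per occurrence, the rule never pays for itself.
        return math.inf
    return rule_cost_bits / savings_per_year
```

Under these assumptions, `years_until_learned(100, 0.0, 12000)` is infinite no matter how frequent the form is, which is why a true savings of 0.0 bits puzzles me. If the real value were something small like 0.04 bits that merely displays as 0.0, then `years_until_learned(100, 0.04, 12000)` comes out at well under a year, so the rounding explanation seems at least possible.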