More targeted thoughts:
- The authors do go out of their way to highlight that this is a framework for learnability (rather than the "acquirability" I'm fond of), since it assumes an ideal learner. They often mention that it represents an "upper bound on learnability", and note that they provide a "predicted order for the acquisition by children". I think it's important to remember that this upper bound and predicted order still only apply if children view the problem in the particular way that the authors frame it. Looking through the supplemental material and the specifics of the phenomena they examine, sometimes very specific knowledge (or kinds of knowledge) is assumed (for example, knowledge of traces, "it-punctuation", "non-trivial" parenthood - which may or may not be equivalent to the linguistic notion of c-command). In addition, I think they have to assume that no other information is available for these phenomena in order for the "predicted order for the acquisition by children" to hold. I have the same reservations about the data they use to evaluate their ideal MDL learners - some of the data quite clearly isn't child-directed speech because the frequencies are too low for their MDL learner to function properly. But...doesn't that say something about the data real children are exposed to, and how there may not be sufficient data for various phenomena in child-directed speech? Also, relatedly, maybe the mismatch the authors see between their model's predictions and actual child behavior in Figure 5 has to do with the fact that they didn't train their model on child-directed speech data?
- On a related note, I really wonder if there's some way to translate the MDL framework to something more realistic for children (cognitively plausible, etc.). The intuition behind the framework is simple - you want a balance between simplicity of grammar and data coverage. The Bayesian framework is a specific form of this balancing act, and can be adapted easily to be an online kind of process. Can MDL? What would code length correspond to - efficient processing and representation of data? The authors definitely try to point out where the MDL evaluation would come in for learning, saying that it is the decision to add or not add a rule to the existing grammar (and so the comparison is between two competing grammars that differ by one rule).
- I also really want to be able to map some of the MDL-specific notions to something psychological. (Though maybe this isn't possible.) For example, what is C[prob] (the constant value of encoding probabilities)? Is it some psychological processing/evaluation cost? In some cases, the fact that a particular form is required in order to learn the restriction reminds me strongly of the notion of "unambiguous data" that's been around in generative grammar acquisition for a while.
- Specific phenomena: I was surprised by some of the results the authors found. For example, "what is" (Table 5) - the frequency in child-directed speech is over 12000 occurrences per year, but in adult-directed speech it is less than 300 per year? That seems odd. The same happens for "who is" (Table 6). Turning to the dative alternation examples, "donate" apparently appears far more often per year (15 times) than "shout" (less than 4), "whisper" (less than 2), or "suggest" (less than 1). That seems odd to me. Also, for Table 16 on the transitive/intransitive examples, how does an encoding "savings" of 0.0 bits lead to any kind of learning under this framework? Maybe this is a rounding error?
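To make the one-rule comparison mentioned above concrete, here is a minimal sketch of how such a decision could be computed. It is not the authors' actual encoding scheme: the function names, the 20-bit rule cost, and the toy forms "A" and "B" are all hypothetical, and the data cost is simply the standard -log2(probability) code length summed over observed tokens.

```python
import math

def data_cost_bits(counts, probs):
    """Bits needed to encode the observed tokens: sum of -count * log2(prob)."""
    return sum(-n * math.log2(probs[form]) for form, n in counts.items() if n > 0)

def total_length(grammar_bits, counts, probs):
    """Two-part code length: grammar cost plus cost of the data given the grammar."""
    return grammar_bits + data_cost_bits(counts, probs)

def should_add_rule(old, new, counts):
    """Compare two grammars that differ by one rule.

    old and new are (grammar_bits, probs) pairs. Returns (add_rule, savings_in_bits),
    where the rule is added only if the total code length strictly shrinks.
    """
    old_len = total_length(old[0], counts, old[1])
    new_len = total_length(new[0], counts, new[1])
    return new_len < old_len, old_len - new_len

# Hypothetical toy case: the old grammar allows forms A and B equally often,
# the new grammar pays 20 extra bits for a restriction that rules out B.
old_grammar = (0.0, {"A": 0.5, "B": 0.5})
new_grammar = (20.0, {"A": 1.0})

few_tokens = should_add_rule(old_grammar, new_grammar, {"A": 10})
many_tokens = should_add_rule(old_grammar, new_grammar, {"A": 30})
```

In this toy setup each observed "A" saves one bit under the restricted grammar, so the restriction only pays for itself after more than 20 observations. With a strict comparison like this one, a savings of exactly 0.0 bits would not license adding the rule, which is why the Table 16 value looks puzzling unless it is rounded.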
I don't recall if anyone answered your question about how this use of MDL is the same as Bayesian inference. The relationship is that the description length is interpreted as being the inverse of the likelihood, so a length of 0 is p = 1 and an infinite length is p = 0. I think the posterior is then the inverse of the sum of the encoded model (the prior) and data lengths. I don't know the exact inverse used though. I do believe the tricky bit is that because the encoding of the model is different than the data, the choice of encoding methods has a lot to do with what the results mean. I'm sure there are mathematical papers that sort out all this in detail.
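For reference, the usual textbook version of this correspondence uses a negative logarithm as the "inverse". The symbols below (G for the grammar, D for the data, L for code length in bits) are my own labels, and this may not match the exact formulation in the paper:

```latex
% Code length and probability determine each other:
L(x) = -\log_2 P(x) \quad\Longleftrightarrow\quad P(x) = 2^{-L(x)}

% So a length of 0 bits corresponds to P = 1, and L \to \infty corresponds to P \to 0.

% Two-part code versus Bayesian posterior:
P(G \mid D) \;\propto\; P(G)\,P(D \mid G) \;=\; 2^{-\left(L(G) + L(D \mid G)\right)}

% Hence minimizing the total description length maximizes the posterior:
\arg\min_G \big[\, L(G) + L(D \mid G) \,\big] \;=\; \arg\max_G \, P(G \mid D)
```

Under this reading, the choice of encoding for the grammar plays the role of the prior, which is why that choice affects what the results mean.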
So thinking about what kind of encoding we might want when judging biological plausibility, it seems like the ideal code would be one that corresponded to some neurological capacity or aspect of performance. If we could know how much brain volume is used for rule encoding and data encoding then that might make a very good measure of plausibility!
But even if we could, I'm not all that sure that there is the same optimization pressure on the model as there is on the data. While it is important to have a fair idea of whether it is even possible for a certain model to be used by humans, that isn't the same thing as thinking that what we're doing is "optimal" in any sense. In fact, because human language is widely regarded as a human-only capacity, there is no reason to believe it has undergone much optimization at all, due to a lack of competitive pressure on the language faculty, assuming, as we generally do, that the human language faculty has great utility in survival.
If we could know how much brain volume is used for rule encoding and data encoding then that might make a very good measure of plausibility!
That's an interesting thought - is the idea that this would provide a limit on what a reasonably sized encoding is?