Something I really liked about this paper was Johnson’s sensitivity to the problems that occur during actual acquisition, even as he gave an intuitive overview of the different approximation algorithms used in machine learning. He also made a point of connecting with linguistic theory related to acquisition (e.g., Universal Grammar, the uniqueness constraint). This makes it much easier for acquisition people who aren’t necessarily modelers to understand why they should care about these approaches, especially when the particular structures Johnson uses for his demonstrations (PCFGs) are known to be not quite right (which Johnson himself helpfully points out right at the beginning).
Some more targeted thoughts:
(1) Johnson makes a point at the very beginning about the utility of joint inference of syntactic structure and grammatical categories (which he calls lexical categories), and how better performance is obtained that way (as opposed to solving one problem after another). This seems to be another example of this joint-inference-is-better thing, which is getting a fair amount of play in the acquisition modeling literature. Bigger point: Information from one problem can help usefully constrain another. Smaller quibble: I think grammatical categories may be learned earlier than syntactic structure, so we may want something like an informed prior when it comes to the grammatical categories if we still want syntactic structure and grammatical categorization to be solved simultaneously.
(2) This comment in section 3: “…suggesting the attractive possibility that at least some aspects of language acquisition may be an almost cost-free by-product of parsing. That is, the child’s efforts at language comprehension may supply the information they need for language acquisition.” This reminds me very strongly of Fodor’s (1998) “Parsing to Learn” approach, which talks about exactly this idea. (A number of follow-up papers with William Sakas also tackle this issue.) Fodor’s learner was using parsing to help figure out Universal Grammar parameter settings, but the idea is exactly the same: because parsing is already happening, the learner can leverage the information from that process to learn about the structure of her language.
Fodor, J. D. 1998. Parsing to learn. Journal of Psycholinguistic Research, 27(3), 339-374.
(3) Related to the smaller quibble above in (1): Johnson notes later on in section 3 that “it’s hard to see how any ‘staged’ learner (which attempted to learn lexical entries before learning syntax, or vice versa) could succeed on this data”. The important unspoken part is “using just this strategy”, I’m assuming — because certainly it’s possible to learn grammatical categories using other strategies just fine. In fact, most of the grammatical categorization models I’m aware of do just this (though some do incorporate aspects of syntactic structure in the grammatical category inference).
(4) This point in section 5 seems spot on to me: “…language learning may require additional information beyond that contained in a set of strings of surface forms.” Johnson jumps straight to non-linguistic information, but I’m imagining that semantics would still count as linguistic, and that seems super-important for learning a number of syntactic structures (e.g., animacy for learning the appropriate meanings for tough-constructions: “The apple was easy to eat” vs. “The girl was eager to eat (the apple)”).
(5) The production model by Johnson & Riezler (2002) discussed later on in that section was interesting, where the input is the intended logical form (hierarchical semantic structure…which presumably maps to syntactic structure?) and the output is the observed string. Presumably this is how you could design a generative learning model, where the goal is to infer the syntactic structure that corresponds to the observed string, with the idea that the syntactic structure was used to generate the string.
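In Bayesian terms, the inference for that kind of generative learner might be written as below. This is only my sketch in standard notation: s (the observed string), t (a candidate syntactic structure), and G (the grammar) are labels I'm introducing here, not Johnson's.

```latex
P(t \mid s, G) \;=\; \frac{P(s \mid t, G)\, P(t \mid G)}{\sum_{t'} P(s \mid t', G)\, P(t' \mid G)}
```

For a PCFG, P(s | t, G) is 1 when s is the yield of t and 0 otherwise, so the posterior just renormalizes the tree probabilities over the possible parses of s.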
(6) This in the conclusion: “…in principle it should be possible for Bayesian priors to express the kinds of rich linguistic knowledge that linguists posit for Universal Grammar. It would be extremely interesting to investigate just what a statistical estimator using linguistically plausible parameters might be able to learn.” — Exactly this! I’ve long (vaguely) pondered how to connect the sorts of parameters in, say, a parametric representation of metrical phonology to the kinds of precise mathematical priors Bayesian models use. Somehow, somehow it seems possible…and then perhaps the two uses of “parameter” could be reconciled more precisely.
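To make that connection slightly more concrete, here is a minimal sketch of one way a binary linguistic parameter could be given a Bayesian prior. Everything in it is my own illustrative assumption, not something from Johnson's paper: the parameter (a hypothetical quantity-sensitivity setting), the Beta pseudo-counts, the made-up data counts, and the simplification that each data point unambiguously signals one value or the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPrior:
    """Beta(alpha, beta) belief about the probability that a binary parameter is 'on'."""
    alpha: float
    beta: float

    def mean(self) -> float:
        # Expected probability that the parameter value is 'on'.
        return self.alpha / (self.alpha + self.beta)

    def update(self, on_count: int, off_count: int) -> "BetaPrior":
        # Conjugate Beta-Bernoulli update: observed counts are added to the pseudo-counts.
        return BetaPrior(self.alpha + on_count, self.beta + off_count)


# Hypothetical priors for a quantity-sensitivity parameter (assumed values):
# an "informed" prior biased toward 'on', and a flat prior with no bias.
informed = BetaPrior(alpha=8.0, beta=2.0)
flat = BetaPrior(alpha=1.0, beta=1.0)

# Made-up input: 3 data points signalling 'on', 9 signalling 'off'.
on_count, off_count = 3, 9

informed_post = informed.update(on_count, off_count)
flat_post = flat.update(on_count, off_count)

print(f"informed: prior mean {informed.mean():.2f}, posterior mean {informed_post.mean():.2f}")
print(f"flat:     prior mean {flat.mean():.2f}, posterior mean {flat_post.mean():.2f}")
```

On this view, a UG-style default would look like a prior with strong pseudo-counts: the same data move the informed learner less than the flat one, so it takes more disconfirming evidence to override the default.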