Monday, November 12, 2012

Some thoughts on Hsu et al. 2011

This seems to be more of an overview paper, showcasing how to apply a probabilistic learning framework at the computational level to problems in language acquisition, whether we're concerned with theoretical learnability results or with predicting observable behavior. As a followup to Hsu & Chater (2010), which we discussed a few years back, it re-emphasizes some of the nice intuitions in the minimum description length (MDL) framework (such as "more compact representations are better"). I think a strength of this framework is its ability to identify pieces of linguistic knowledge that are hard to learn from the available data, since this is exactly the sort of thing poverty of the stimulus (PoS) is all about. (Of course, the results rest on the particular assumptions made about the input, the forms of the rules, etc., but that's true of all computational analyses, I think.)

On a related note, I did notice that nearly all the phenomena examined by Hsu et al. were based on lexical item classification (verb argument subcategorization) or contraction (what generativists might call "traces" in some cases). This is fine (especially the "wanna" case, which I have actually seen used in PoS arguments), but I was surprised that we're not really getting into the kind of complex sentential semantics or syntax that I usually see talked about in generativist circles (e.g., syntactic islands, case theory - see Crain & Pietroski (2002) for some examples on the semantic side). Also, even though Hsu et al.'s own analysis shows that wanna & that-traces are "practically" unlearnable (i.e., even with probabilistic learning, these look like PoS problems), they seem to close the paper by sort of downplaying this: "probabilistic language learning is theoretically and computationally possible".

Some more targeted thoughts below:

I think my biggest issue with the computational learnability analyses (and proofs) is that I find it very hard to connect them to the psychological problem of language acquisition that I'm used to thinking about. (In fact, Kent Johnson in UCI's LPS department has a really nice 2004 paper talking about how this connection probably shouldn't have been made with the (in)famous Gold (1967) learnability results.) I do understand that this type of argument is meant to combat the claim about the "logical problem of language acquisition", with the specific interpretation that the "logical problem" comes from computational learnability results (and the Gold paper in particular). However, I've also seen "logical problem of language acquisition" applied to the simple fact that there are induction problems in language acquisition: the data are compatible with multiple hypotheses, and "logically" any of them could be right, but only one actually is, hence "logical problem". This second interpretation still seems right to me, and I don't feel particularly swayed to change this view after reading the learnability results here (though, again, maybe that's because I have trouble connecting these results to the psychological problem).

Related to the point above - in section 2, where we see a brief description of the learnability proof, the process is described as an algorithm that "generates a sequence of guesses concerning the generative probabilistic model of the language". Are these guesses probabilities over utterances, probabilities over the generative grammars that produce the utterances, or something else? It seems like we might want them to be probabilities over the generative grammars, but then don't we need some definition of the hypothesis space of possible generative grammars?

I had a little trouble understanding the distinction that Hsu et al. were making between discriminative and generative models in the introduction. Basically, it seemed to me that "discriminative" behavior could be the output of a generative model, so we could view a discriminative model as a special case of a generative model. So is the idea that we really want to emphasize that humans are identifying the underlying probability distribution, instead of just making binary classifications based on their grammars? That is, that there is no such thing as "grammatical" and "ungrammatical", but instead these are epiphenomena of thresholding a probabilistic system?

In section 3, at the very end, Hsu et al. mention that the ideal statistical learner provides an "upper bound" on learnability.  I found this somewhat odd - I always thought of ideal learners as providing a lower bound in some sense, since they're not constrained by cognitive resource limitations, and are basically looking at the question of whether the data contain enough information to solve the problem in question.

The practical example in 3.2 with the "going to" contraction threw me for a bit, since I couldn't figure out how to interpret this: "Under the new grammar, going to contraction never occurs when to is a preposition and thus 0 bits are required to encode contraction." Clearly, the intent is that "no contraction" is cheaper to encode than the process of contraction, but why is that? Especially since the new grammar, which has the "don't contract when to is a preposition" exception, seems to require an extra rule. Looking back at Hsu & Chater (2010), it seems that rules with probability 1 (like going to --> going to when to=prep) require 0 bits to encode. So in effect, the new grammar that has a special exception when to is a preposition gets a data encoding boost, even though the actual grammar model is longer (since it has this exception explicitly encoded). So, "exceptions" that always apply (in a context-dependent way) are cheaper than general rules when the observable data appear in that context.
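To make that trade-off concrete for myself, here is a minimal sketch of the bookkeeping, assuming the standard Shannon code length of -log2(p) bits for an outcome with probability p. The token count, the 0.5 contraction probability under the old grammar, and the 20-bit cost of the extra rule are all made-up numbers for illustration, not values from Hsu et al. or Hsu & Chater (2010), and the function names are my own.

```python
import math

def bits(prob):
    """Shannon code length, in bits, of an outcome with probability `prob`."""
    if prob == 1.0:
        return 0.0  # a certain outcome carries no information
    return -math.log2(prob)

def data_cost(n_contracted, n_uncontracted, p_contract):
    """Bits needed to encode each contract/don't-contract choice in the data."""
    cost = 0.0
    if n_contracted:
        cost += n_contracted * bits(p_contract)
    if n_uncontracted:
        cost += n_uncontracted * bits(1.0 - p_contract)
    return cost

# Hypothetical numbers, chosen only for illustration.
N_PREP_TOKENS = 100      # prepositional "going to" tokens, none contracted
EXTRA_RULE_BITS = 20.0   # assumed cost of stating the exception in the grammar

# Old grammar: contraction is optional everywhere (assumed p = 0.5),
# and no extra rule is needed, so its added grammar cost is 0 bits.
old_total = 0.0 + data_cost(0, N_PREP_TOKENS, p_contract=0.5)

# New grammar: contraction never happens when "to" is a preposition,
# so every observed uncontracted token has probability 1 and costs 0 bits.
new_total = EXTRA_RULE_BITS + data_cost(0, N_PREP_TOKENS, p_contract=0.0)

# Number of tokens at which the exception has paid for itself.
break_even = math.ceil(EXTRA_RULE_BITS / bits(0.5))

print(old_total, new_total, break_even)
```

With these invented numbers, the old grammar spends 100 bits on the data while the new grammar spends 20 bits on the rule and nothing on the data, so the exception pays for itself after 20 prepositional tokens. That break-even count is the quantity that, as I understand it, gets converted into "years to learn" once you know how often the construction shows up in the input.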

I liked the idea that learnability should correlate with grammaticality judgments, with the idea that more "learnable" rules (i.e., ones with more data in the input) are encountered more and so their probabilities are stronger in whichever direction. In looking at the computational results though, I have to admit I was surprised that "going to" ranked 12th in learnability (Fig 2), maybe putting it on the order of 50 years to learn. That rule seems very easy, and I assume the grammaticality judgments are very strong for it. (My intuitions are at least.)

A small methodological quibble, section 4.1: "...because many constructions do not occur often enough for statistical significance [in child-directed speech]...we use...the full Corpus of Contemporary American English." Isn't this the point for PoS arguments, though? There are differences between child-directed and adult-directed input (especially between child-directed speech and adult-directed written text), particularly at the lexical item level that Hsu et al. are looking at (and even at very abstract levels like wh-dependencies: Pearl & Sprouse (forthcoming)). So if we don't find these constructions often enough in child-directed speech, and the thing we're concerned with is child acquisition of language, doesn't this also suggest there's a potential PoS problem?

I liked that Hsu et al. connect their work to entrenchment theory, and basically provide a formal (computational-level) instantiation of how/why entrenchment occurs.


Crain, S. & P. Pietroski. 2002. Why language acquisition is a snap. The Linguistic Review, 19, 163-183.

Gold, E. 1967. Language Identification in the Limit. Information and Control, 10, 447-474.

Hsu, A. & N. Chater. 2010. The Logical Problem of Language Acquisition: A Probabilistic Perspective. Cognitive Science, 34, 972-1016.

Johnson, K. 2004. Gold's Theorem and Cognitive Science. Philosophy of Science, 71, 571-592.

Pearl, L. & J. Sprouse. Forthcoming 2012. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition.
