Wednesday, November 28, 2012

See you in the winter!

Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us throughout the fall quarter! The CoLa Reading Group will resume in the winter quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Sunday, November 25, 2012

Some thoughts on Frank (2012)

I thought this was a really nice big picture piece about computational modeling work in language acquisition, and it tries (admirably!) to consolidate insights in different domains about the kind of learning assumptions/strategies that are useful. This is such an incredibly good thing to do, I think - one of the questions I get a lot is whether there's one general purpose style of computational model that's the right way to do things, and I'm usually left shrugging and saying, "Depends on what you're trying to do." And to some extent of course, this is right - but there's also something to be said about what the different useful models have in common.

Another note: Despite the empirical coverage, I did feel there was something of a disconnect between the phenomena generative linguists get excited about (w.r.t. poverty of the stimulus, for example - syntactic islands, case theory, etc.) and the phenomena modeled in the studies discussed here. There's nothing wrong with this, since everyone's goal is to understand language acquisition, and that means acquisition of a lot of different kinds of knowledge. But I did wonder how the insights discussed here could be applied to more sophisticated knowledge acquisition problems in language. Frank already notes that it's unclear what insights successful models of more sophisticated knowledge have in common.

Some more targeted thoughts:

Frank focuses on two metrics of model success: sufficiency (basically, acquisition success) and fidelity (fitting patterns of human behavior). I've seen other proposed metrics, such as formal sufficiency, developmental compatibility, and explanatory power (discussed, for example, in Pearl 2010, which is based on prior work by Yang). I feel like formal sufficiency maps pretty well to sufficiency (and actually may cover fidelity too). Developmental compatibility, though, is more about psychological plausibility, and explanatory power is about the ability of the model to give informative (explanatory) answers about what causes the acquisition process modeled. I think all of the studies discussed hold up on the explanatory power metric, so that's fine. It's unclear how well they hold up for developmental compatibility - it may not matter if they're computational-level analyses, for example. But I feel like that's something that should be mentioned as a more prominent thing to think about when judging a computational model. (But maybe that's my algorithmic bias showing through.)

Related point: Frank clearly is aware of the tension between computational-level and algorithmic-level approaches, and spends some time discussing things like incremental vs. batch learning. I admit, I was surprised to see this though: "Fully incremental learning prevents backtracking or re-evaluation of hypotheses in light of earlier data". If I'm understanding this correctly, the idea is that you can't use earlier data at all in a fully incremental model. I think this conflates incremental with memoryless - for example, you can have an incremental learner that has some memory of prior data (usually in some kind of compressed format, perhaps tallying statistics of some kind, etc.). For me, all incremental means is that the learner processes data as it comes in - it doesn't preclude the ability to remember prior data with some (or even a lot of) detail.
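To make that distinction concrete, here's a minimal sketch (my own toy example, not anything from Frank's paper) of an incremental learner that processes data one item at a time but still keeps a compressed memory of prior data, as running counts:

```python
from collections import Counter

class IncrementalLearner:
    """Incremental but not memoryless: data are processed one item
    at a time, while a compressed summary of all prior data is kept
    as running counts (sufficient statistics)."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, item):
        # Incremental: one data point at a time...
        self.counts[item] += 1
        self.total += 1

    def probability(self, item):
        # ...but estimates still reflect everything seen so far.
        return self.counts[item] / self.total if self.total else 0.0

learner = IncrementalLearner()
for word in ["doggy", "kitty", "doggy", "doggy"]:
    learner.observe(word)
print(learner.probability("doggy"))  # 0.75
```

The point is just that "fully incremental" constrains the order of processing, not the amount of detail retained about earlier data.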

Related point: Human memory constraints. In the word segmentation section, Frank mentions that experimental results suggest that "learners may not store the results of segmentation veridically, falsely interpolating memories that they have heard novel items that share all of their individual transitions within a set of observed items". At first, I thought this was about humans not storing the actual segmentations in memory (and I thought, well, of course not - they're storing the recovered word forms). But the second bit made me think this was actually even more abstract than that - it seems to suggest that artificial language participants were extracting probabilistic rules about word forms, rather than the word forms themselves. Maybe this is because the word forms were disconnected from meaning in the experiments described, so the most compact representation was of the rules for making word forms, rather than the word forms themselves?
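Since the experimental results are framed in terms of transitions, here's a toy computation of syllable transitional probabilities, the statistic artificial-language learners are standardly assumed to track (the stream and "words" here are invented for illustration, not taken from the experiments Frank describes):

```python
from collections import Counter

def transitional_probs(syllables):
    """P(next | current) over adjacent syllable pairs."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}

# Two made-up words presented in varied order: within-word transitions
# are perfectly predictive, between-word transitions are not.
words = [["tu", "pi", "ro"], ["go", "la", "bu"]]
order = [0, 1, 0, 0, 1, 1, 0, 1]
stream = [syl for i in order for syl in words[i]]
tps = transitional_probs(stream)
print(tps[("tu", "pi")])  # 1.0 (within-word)
print(tps[("ro", "go")])  # 0.75 (between-word)
```

A learner who stores only these transition statistics (rather than the word forms themselves) would indeed accept novel items whose individual transitions were all observed, which seems to be the memory result described.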

I loved the Goldsmith (2010) quote: "...if you dig deep enough into any task in acquisition, it will become clear that in order to model that task effectively, a model of every other task is necessary". This is probably generally true, no matter what you're studying, actually - you always have to simplify and pretend things are disconnected when you start out in order to make any progress. But then, once you know a little something, you can relax the idealizations. And Frank notes the synergies in acquisition tasks, which seems like exactly the right way to think about it (at least, now that we think we know something about the individual acquisition tasks involved). It seems like a good chunk of the exciting work going on in acquisition modeling is investigating solving multiple tasks simultaneously, leveraging information from the different tasks to make solving all of them easier. However, once you start trying to do this, you then need to have a precise model of how that leveraging/integration process works.

Another great quote (this time from George Box): "all models are wrong, but some are useful". So true - and related to the point above. I think a really nice contribution Frank makes is in thinking about ways in which models can be useful - whether they provide a general framework or are formal demonstrations of simple principles, for example.

I think this quote might ruffle a few linguist feathers: "...lexicalized (contain information that is linked to individual word forms), the majority of language acquisition could be characterized as 'word learning'. Inferring the meaning of individual lexical items...". While technically this could be true (given really complex ideas about word "meaning"), the complexity of the syntactic acquisition task gets a little lost here, especially given what many people think of as "word meaning". In particular, the rules for putting words together aren't necessarily connected directly to lexical semantics (though of course, individual word meaning plays a part).

I think the Frank et al. work on intention inference when learning a lexicon demonstrates a nice sequence of research w.r.t. the utility of computational models. Basically, child behavior was best explained by a principle of mutual exclusivity. So, for a while, that was a placeholder, i.e., something like "Use mutual exclusivity to make your decision". Then, Frank et al. came along and hypothesized where mutual exclusivity could come from, and showed how it could arise from more basic learning biases (e.g., "use probabilistic learning this way"). That is, mutual exclusivity itself didn't have to be a basic unit. This reminds me of the Subset Principle in generative linguistics, which falls out nicely from the Size Principle of Bayesian inference.
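A toy version of the Size Principle at work (my own sketch, not Frank et al.'s actual model): if examples are assumed to be sampled from the true hypothesis, each consistent example multiplies the likelihood by 1/|hypothesis|, so smaller (subset) hypotheses are increasingly favored as data accumulate.

```python
def posterior(data, hypotheses, prior=None):
    """Bayesian posterior under the Size Principle likelihood:
    n consistent examples sampled from hypothesis h have
    likelihood (1/|h|)^n, favoring smaller hypotheses."""
    prior = prior or {h: 1 / len(hypotheses) for h in hypotheses}
    scores = {}
    for name, extension in hypotheses.items():
        if all(d in extension for d in data):
            scores[name] = prior[name] * (1 / len(extension)) ** len(data)
        else:
            scores[name] = 0.0  # hypothesis ruled out by the data
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

# A subset hypothesis vs. its superset, both consistent with the data:
hyps = {"subset": {"a", "b"}, "superset": {"a", "b", "c", "d"}}
print(posterior(["a", "b", "a"], hyps))
```

After just three consistent examples, the subset hypothesis already has posterior 8/9, which is the Subset-Principle-like behavior falling out of the probabilistic machinery rather than being stipulated.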

It's an interesting idea that humans do best at learning when there are multiple (informationally redundant) cues available, as opposed to just one really informative cue. I'm not sure if the Mintz frequent frame is a really good example of this, though - it seems like a frame vs. a bigram is really just the same kind of statistical cue. Though maybe the point is more that the framing words provide more redundancy, rather than being different kinds of cues.
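For concreteness, here's a sketch of what frequent frame extraction looks like (the corpus is invented, and Mintz's actual procedure uses frequency thresholds over real child-directed speech):

```python
from collections import defaultdict

def frequent_frames(corpus, min_count=2):
    """Collect Mintz-style frames (A _ B): the two framing words
    jointly categorize whatever appears between them."""
    frames = defaultdict(list)
    for sentence in corpus:
        for a, x, b in zip(sentence, sentence[1:], sentence[2:]):
            frames[(a, b)].append(x)
    return {f: xs for f, xs in frames.items() if len(xs) >= min_count}

corpus = [["you", "want", "it"], ["you", "see", "it"], ["you", "like", "it"]]
print(frequent_frames(corpus))  # {('you', 'it'): ['want', 'see', 'like']}
```

Seen this way, a frame really is just a conjunction of two bigram contexts, which supports the redundancy reading: the two framing words are informationally redundant cues of the same statistical type, not different kinds of cues.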

It's also a really interesting idea to measure success by having the output of a model be an intermediate representation used in some other task that has an uncontroversial gold standard. Frank talks about it in the context of syntactic categories, but I could easily imagine the same thing applying to word segmentation. It's definitely a recurring problem that we don't want perfect segmentation for models of infant word segmentation - but then, what do we want? So maybe we can use the output of word segmentation as the input to word- (or morpheme-) meaning mapping.

It took me a little while to understand what "expressive" meant in this context. I think it relates to the informational content of some representation - so if a representation is expressive, it can cover a lot of data while being very compact (e.g., rule-based systems, instead of mappings between individual lexical items). A quote near the end gets at this more directly: "...it becomes possible to generate new sentences and to encode sentences more efficiently. At all levels of organization, language is non-random: it is characterized by a high degree of redundancy and hence there is a lot of room for compression." I think this is basically an information-theoretic motivation for having a grammar (which is great!). In a similar vein, it seems like this would be an argument in favor of Universal Grammar-style parameters, because they would be a very good compression of complex regularities and relationships in the data.
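A quick empirical illustration of that redundancy point (my own toy demonstration, using generic off-the-shelf compression rather than anything grammar-like):

```python
import random
import zlib

# Language-like data is highly redundant, so it compresses very well;
# random bytes of the same length barely compress at all.
random.seed(0)
regular = ("the cat sees the dog . " * 200).encode()
noise = bytes(random.getrandbits(8) for _ in range(len(regular)))

print(len(zlib.compress(regular)) / len(regular))  # well under 0.1
print(len(zlib.compress(noise)) / len(noise))      # close to 1.0
```

A grammar is doing something analogous but far more structured: exploiting the non-randomness of language to encode (and generate) sentences compactly.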



Pearl, L. 2010. Using computational modeling in language acquisition research. In E. Blom & S. Unsworth (eds.), Experimental Methods in Language Acquisition Research. John Benjamins.

Wednesday, November 14, 2012

Next time on 11/28/12 @ 2pm in SBSG 2221 = Frank (2012)

Thanks to everyone who participated in our vigorous and thoughtful discussion of Hsu et al. (2011)!  For our next meeting on Wednesday November 28th @ 2pm in SBSG 2221, we'll be looking at a paper that investigates the role of computational models in the study of early language acquisition and how to evaluate them.

Frank, M. 2012. Computational models of early language acquisition. Manuscript, Stanford University.

Monday, November 12, 2012

Some thoughts on Hsu et al. 2011

So this seems to be more of an overview paper showcasing how to apply a probabilistic learning framework at the computational level to problems in language acquisition, whether we're concerned with theoretical learnability results or predicting observable behavior. As a followup to Hsu & Chater (2010), which we discussed a few years back, this re-emphasized some of the nice intuitions in the MDL framework (such as "more compact representations are better"). I think a strength of this framework is its ability to identify linguistic knowledge pieces that are hard to learn from the available data, since this is exactly the sort of thing poverty of the stimulus (PoS) is all about. (Of course, the results rest on the particular assumptions made about the input, forms of the rules, etc., but that's true of all computational analyses, I think.) On a related note, I did notice that nearly all the phenomena examined by Hsu et al. were based on lexical item classification (verb argument subcategorization) or contraction (what a generativist might call "traces" in some cases). This is fine (especially the "wanna" case, which I have seen actually used in PoS arguments), but I was surprised that we're not really getting into the kind of complex sentential semantics or syntax that I usually see talked about in generativist circles (e.g., syntactic islands, case theory - see Crain & Pietroski (2002) for some examples on the semantic side). Also, even though Hsu et al.'s own analysis shows that wanna & that-traces are "practically" unlearnable (i.e., even with probabilistic learning, these look like PoS problems), it seems like they close the paper by sort of downplaying this: "probabilistic language learning is theoretically and computationally possible".

Some more targeted thoughts below:

I think my biggest issue with the computational learnability analyses (and proofs) is that I find it very hard to connect them to the psychological problem of language acquisition that I'm used to thinking about. (In fact, Kent Johnson in UCI's LPS department has a really nice 2004 paper talking about how this connection probably shouldn't have been made with the (in)famous Gold (1967) learnability results.) I do understand that this type of argument is meant to combat the claim about the "logical problem of language acquisition", with the specific interpretation that the "logical problem" comes from computational learnability results (and the Gold paper in particular). However, I've also seen "logical problem of language acquisition" applied to the simple fact that there are induction problems in language acquisition, i.e., the data are compatible with multiple hypotheses, and "logically" any of them could be right, but only one actually is, so "logical problem". This second interpretation still seems right to me, and I don't feel particularly swayed to change this view after reading the learnability results here (though maybe that's (again) because I have trouble connecting these results to the psychological problem).

Related to the point above - in section 2, where we see a brief description of the learnability proof, the process is described as an algorithm that "generates a sequence of guesses concerning the generative probabilistic model of the language".  Are these guesses probabilities over utterances, probabilities over the generative grammars that produce the utterances, something else?  It seems like we might want them to be probabilities over the generative grammars, but then don't we need some definition of the hypothesis space of possible generative grammars?

I had a little trouble understanding the distinction that Hsu et al. were making between discriminative and generative models in the introduction. Basically, it seemed to me that "discriminative" behavior could be the output of a generative model, so we could view a discriminative model as a special case of a generative model. So is the idea that we really want to emphasize that humans are identifying the underlying probability distribution, instead of just making binary classifications based on their grammars? That is, that there is no such thing as "grammatical" and "ungrammatical", but instead these are epiphenomena of thresholding a probabilistic system?
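Here's a toy version of that thresholding picture (my own sketch, not Hsu et al.'s formulation): a generative model assigns probabilities to sentences, and binary "grammatical"/"ungrammatical" judgments fall out of a cutoff on those probabilities.

```python
import math

# A toy generative model: sentence probability under a unigram "grammar".
probs = {"the": 0.4, "cat": 0.3, "sleeps": 0.3}

def logprob(sentence):
    """Generative side: assign a probability to any word sequence."""
    return sum(math.log(probs[w]) for w in sentence)

def judged_grammatical(sentence, threshold=math.log(0.01)):
    """Discriminative behavior as a special case: binary judgments
    are just an epiphenomenon of thresholding the generative model."""
    return logprob(sentence) > threshold

print(judged_grammatical(["the", "cat"]))  # True  (p = 0.12 > 0.01)
print(judged_grammatical(["cat"] * 4))     # False (p = 0.0081 < 0.01)
```

On this view, the learner's real target is the underlying distribution; "grammatical" vs. "ungrammatical" is just where the cutoff happens to land.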

In section 3, at the very end, Hsu et al. mention that the ideal statistical learner provides an "upper bound" on learnability.  I found this somewhat odd - I always thought of ideal learners as providing a lower bound in some sense, since they're not constrained by cognitive resource limitations, and are basically looking at the question of whether the data contain enough information to solve the problem in question.

The practical example in 3.2 with the "going to" contraction threw me off for a bit, since I couldn't figure out how to interpret this: "Under the new grammar, going to contraction never occurs when to is a preposition and thus 0 bits are required to encode contraction." Clearly, the intent is that "no contraction" is cheaper to encode than the process of contraction, but why is that? Especially since the new grammar that has "don't contract when to is a preposition" seems to require an extra rule. Looking back to Hsu & Chater (2010), it seems that rules with probability 1 (like going to --> going to when to=prep) require 0 bits to encode. So in effect, the new grammar that has a special exception when to is a preposition gets a data encoding boost, even though the actual grammar model is longer (since it has this exception explicitly encoded). So, "exceptions" that always apply (in a context-dependent way) are cheaper than general rules when the observable data appear in that context.
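The arithmetic behind the 0-bits claim, as I understand it (a reconstruction of the intuition, not Hsu & Chater's exact coding scheme): under an optimal code, an outcome with probability p costs -log2(p) bits, so deterministic outcomes are free to encode in the data, and the savings accumulate over every relevant data point.

```python
import math

def bits(p):
    """Optimal code length (in bits) for an outcome with probability p."""
    return 0.0 if p == 1.0 else -math.log2(p)

# Deterministic "exception": when to is a preposition, contraction
# never occurs, so each such observation costs nothing to encode.
print(bits(1.0))  # 0.0

# By contrast, a 50/50 contraction rule costs 1 bit per observation.
print(bits(0.5))  # 1.0

# Over, say, 1000 relevant data points, the deterministic grammar
# saves 1000 bits of data-encoding cost, which can outweigh the
# extra grammar bits needed to state the exception explicitly.
print(1000 * (bits(0.5) - bits(1.0)))  # 1000.0
```

So MDL trades a slightly longer grammar for a much shorter data encoding, which is exactly the boost the exception-bearing grammar gets.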

I liked the idea that learnability should correlate with grammaticality judgments, with the idea that more "learnable" rules (i.e., ones with more data in the input) are encountered more and so their probabilities are stronger in whichever direction. In looking at the computational results though, I have to admit I was surprised that "going to" ranked 12th in learnability (Fig 2), maybe putting it on the order of 50 years to learn. That rule seems very easy, and I assume the grammaticality judgments are very strong for it. (Mine are, at least.)

A small methodological quibble, section 4.1: "...because many constructions do not occur often enough for statistical significance [in child-directed speech]...we use...the full Corpus of Contemporary American English." Isn't this the point for PoS arguments, though?  There are differences between child-directed and adult-directed input (especially between child-directed speech and adult-directed written text), especially at this lexical item level that Hsu et al. are looking at (and also even at very abstract levels like wh-dependencies: Pearl & Sprouse (forthcoming)). So if we don't find these often enough in child-directed speech, and the thing we're concerned with is child acquisition of language, doesn't this also suggest there's a potential PoS problem?

I liked that Hsu et al. connect their work to entrenchment theory, and basically provide a formal (computational-level) instantiation of how/why entrenchment occurs.


Crain, S. & P. Pietroski. 2002. Why language acquisition is a snap. The Linguistic Review, 19, 163-183.

Gold, E. 1967. Language Identification in the Limit. Information and Control, 10, 447-474.

Hsu, A. & N. Chater. 2010. The Logical Problem of Language Acquisition: A Probabilistic Perspective. Cognitive Science, 34, 972-1016.

Johnson, K. 2004. Gold's Theorem and Cognitive Science. Philosophy of Science, 71, 571-592.

Pearl, L. & J. Sprouse. Forthcoming 2012. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition.