Computational Models of Language (at UC Irvine): November 2010

Monday, November 29, 2010

Thoughts on Parisien & Stevenson (2010)

So I very much like the fact that they're combining aspects of previous models (one of which we looked at last time: Perfors, Tenenbaum, & Wonnacott (2010)) into a model that is tackling a more realistic learning problem: not only grouping verbs into classes based on their usage in various constructions, but also identifying the relevant constructions themselves from among many different verbs and construction types. A couple of more targeted thoughts:

I think this is the first time I've seen "competency model" used this way. Is this basically the same as what we've been calling computational-level models, since this model is interested in whether the statistical information is present in the environment (rather than worrying about how humans could extract that information)?
Practical note: I didn't realize a MATLAB package (NPBayes) was available that does this kind of Bayesian inference, courtesy of Teh at UCL. This seems like a very nice option if WinBUGS isn't your thing.
Figure 3, generalization of novel dative verbs: The difference between the model with verb classes (Model 2) and the model without verb classes (Model 1) doesn't seem so great to me. While it's true there's a small change in the right direction for PD only and DO only verbs, it's unclear to me that this is really a huge advantage. Implications: Knowing about verb classes isn't a big advantage at this stage of acquisition? (Which seems not quite right, given what we saw with Perfors, Tenenbaum, & Wonnacott (2010), where having those classes was a key feature in the good model.)
The comparison of the model's behavior to three-year-old generalization behavior, and why it's not the same: While it's entirely possible that they're right and the difference is due to the model learning from too small (and biased) a corpus, isn't the idea to try to use the data children have access to in order to make the kind of judgments children do? So whether the sample is biased or not compared to normal (adult?) corpora like the ICE-GB, the point is that the Manchester corpus is fairly large (1.5 million words of child-directed speech) and presumed to be a reasonable sample of the kind of data children hear - shouldn't the model generalize from these data the way children do if it's a competency model? I suppose it's possible that the Manchester corpus is particularly biased for some reason, but this corpus is made up of data from multiple children, so it would be somewhat surprising if the data sets all happened to be biased the same way by chance.

Wednesday, November 17, 2010

Next time: Parisien & Stevenson (2010)

Thanks to all who were able to join us for our discussion of Perfors, Tenenbaum, & Wonnacott (2010)! I think we definitely had some good ideas about how to incorporate semantics into the existing learning model in a more sophisticated way. Next time, we'll be looking at another Bayesian take on syntax learning, courtesy of Parisien & Stevenson (2010) (downloadable from the schedule section of the CoLa Reading Group webpage).

Monday, November 15, 2010

Thoughts on Perfors, Tenenbaum, & Wonnacott (2010)

I think the main innovation of this model (and this style of modeling in general) is to provide a formal representation of how multiple levels of knowledge can be learned simultaneously in a mathematical framework. This is also (unsurprisingly) something that appeals to me very much. Interestingly, this kind of thing seems to be very similar to the idea of linguistic parameters that nativists talk about. A linguistic parameter would most naturally correspond to hyper-parameters (or hyper-hyper-parameters, etc.), with individual linguistic items providing a way to identify the linguistic parameter value - which then allows generalization to items rarely or not yet seen.
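
To make the multiple-levels idea concrete for myself, here's a minimal toy sketch in Python. It is not the authors' model: I'm assuming each verb has its own probability of appearing in one construction (say, DO) versus the other, that all verbs share a symmetric Beta(a, a) prior, and that the learner entertains only two hypothetical candidate values of the hyper-parameter a. All names and counts below are made up for illustration.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(counts, alpha, beta):
    """Log marginal likelihood of per-verb (k, n) counts under a shared
    Beta(alpha, beta) prior. Binomial coefficients are omitted because
    they are the same for every hyper-parameter setting."""
    return sum(log_beta(alpha + k, beta + n - k) - log_beta(alpha, beta)
               for k, n in counts)

def posterior_over_hypers(counts, candidates):
    """Posterior over candidate (alpha, beta) pairs, assuming a uniform
    prior over the candidates (an assumption of this sketch)."""
    logs = [log_marginal(counts, a, b) for a, b in candidates]
    top = max(logs)
    weights = [exp(l - top) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]

def predict_novel(k, n, counts, candidates):
    """Predicted probability that a novel verb, seen k times out of n in
    the first construction, appears in that construction next time."""
    post = posterior_over_hypers(counts + [(k, n)], candidates)
    return sum(p * (a + k) / (a + b + n)
               for p, (a, b) in zip(post, candidates))

# Hypothetical hyper-parameter candidates: a = 0.1 says "verbs tend to be
# all-or-nothing", a = 10 says "verbs tend to mix the two constructions".
candidates = [(0.1, 0.1), (10.0, 10.0)]

# Hypothetical (uses in first construction, total uses) per known verb.
categorical_verbs = [(10, 10), (0, 10), (10, 10), (0, 10)]
mixed_verbs = [(5, 10), (6, 10), (4, 10), (5, 10)]

# A novel verb heard once, in the first construction.
p_after_categorical = predict_novel(1, 1, categorical_verbs, candidates)
p_after_mixed = predict_novel(1, 1, mixed_verbs, candidates)
```

When the known verbs are all-or-nothing, a single observation of the novel verb pushes the prediction strongly towards that construction; when the known verbs mix constructions, the same observation barely moves it. That's the sense in which the higher-level value, identified from individual items, licenses generalization to items rarely or not yet seen.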

The interesting claim (to me) that the authors make here, of course, is that these hyper-parameters have extremely weak priors and so (perhaps) domain-specific knowledge is not required to set their values. I think this may still leave open the question of how the learner knows what the parameters are, however. In this model, the parameters are domain-general things, but linguistic parameters are often assumed to be domain-specific (ex: head-directionality, subject-drop, etc.). Perhaps the claim would be then that linguistic parameters can be re-imagined as these domain-general parameters, and all the details the domain-specific parameters were originally created to explain would fall out from some interplay between the new domain-general parameters and the data.

Some more targeted thoughts:

p.2, dative alternation restrictions: I admit I have great curiosity about how easy it is to saturate on these kinds of restrictions. If they're easily mutable, then they're not necessarily the kind of phenomena that linguists often posit linguistic principles and parameters for. Instead, the alternations that we see would be more a sort of accident of usage, rather than reflecting any deep underlying restrictions of language structure. This idea of "accident of usage" comes up again on p.5, where they mention that the distinctions in usage don't seem to be semantically-driven (no completely reliable semantic cues).

p.12, footnote 4: This footnote mentions that the model doesn't involve memory limitations the way humans do, which leads me to my usual question when dealing with rational models - how do we convert this into a process model? Is it straightforward to add memory limitations and other processing effects? And then, once you have this, do the results found with the computational-level model still occur? This gets at the difference between "is it possible to do" (yes, apparently) and "is it possible for humans to do" (unclear).

p.15, related thought to the above one involving the different epochs of training: If this process of running the model's inference engine after every so many data points was taken to its extreme, then it seems we could create an incremental version that does its inference thing after every data point encountered. This would be a first step towards creating a process model, I think.
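
As a toy illustration of the incremental extreme (again, not the authors' model, just a conjugate Beta-Binomial learner I'm assuming for the sketch), updating after every single data point gives exactly the same posterior as one batch update, while only ever storing two numbers:

```python
def batch_posterior(observations, alpha=1.0, beta=1.0):
    """Update a Beta(alpha, beta) prior once, using all 0/1 observations."""
    k = sum(observations)
    n = len(observations)
    return alpha + k, beta + (n - k)

def incremental_posterior(observations, alpha=1.0, beta=1.0):
    """Update the same prior after every single observation, keeping only
    the two running Beta parameters between data points."""
    for x in observations:
        alpha += x
        beta += 1 - x
    return alpha, beta

# Hypothetical data: 1 = verb heard in construction A, 0 = construction B.
data = [1, 1, 0, 1, 0, 1, 1]
batch = batch_posterior(data)
incremental = incremental_posterior(data)
```

In this toy case the incremental learner loses nothing, because the two counts are sufficient statistics. I wouldn't expect the full hierarchical model to reduce this neatly, which is where the real process-model work (and the memory limitations) would come in.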

p.23: related to the large opening comment, with respect to the semantic features: This seems like a place where nativists might want to claim an innate bias to heed certain kinds of semantic features over others. (And then they can think about whether the necessary bias is domain-specific or domain-general.)

Wednesday, November 3, 2010

Next time: Perfors, Tenenbaum, & Wonnacott (2010)

Thanks to everyone who was able to join us this time to discuss Bod (2009)! Next time on November 17, we'll be looking at Perfors, Tenenbaum, & Wonnacott (2010) (available at the CoLa reading group schedule page), who apply hierarchical Bayesian models to learning syntactic alternations.

Monday, November 1, 2010

Thoughts on Bod (2009)

I'm very fond of the general idea underlying this approach, where the structure of sentences is used explicitly to generalize appropriately during acquisition. The way that Bod's learner has access to all possible tree structures reminds me very much of work by Janet Fodor and William Sakas, who have some papers in the early 2000s about their Structural Triggers Learner, which also has access to all possible tree structures. I think the interesting addition in Bod's work is that tree structures can be discontiguous in ways that don't necessarily have to do with dependencies (e.g., Fig 13, p.768, with the discontiguous subtree that involves the subject and part of the object). That being said, I don't know how reasonable/plausible it is for a child to keep track of these kinds of strange discontinuities, really. Also, I don't know how plausible it is to track statistics on all possible sub-trees. I know Bod offers some techniques for making this tractable, but it seems trickier than the Bayesian hypothesis spaces, because those hypothesis spaces of structures are very large but importantly implicit - the learner doesn't actually deal with all the items in the hypothesis space. Bod's learner, on the other hand, seems to need to track all those possible sub-trees explicitly.
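
To get a feel for the size of the space, here's a quick back-of-the-envelope sketch. It only counts the unlabeled binary bracketings of an n-word sentence (the Catalan number C(n-1)), which is my simplification: it ignores category labels and doesn't count the sub-tree fragments within each tree, so the number of things to track explicitly would be larger still.

```python
from math import comb

def num_binary_trees(n_words):
    """Number of distinct unlabeled binary bracketings over n_words words,
    i.e. the Catalan number C(n_words - 1)."""
    if n_words < 1:
        raise ValueError("a sentence needs at least one word")
    n = n_words - 1
    return comb(2 * n, n) // (n + 1)

# Growth in the number of candidate trees as sentences get longer.
for length in (3, 5, 10, 15, 20):
    print(length, num_binary_trees(length))
```

Even for fairly short sentences the count grows quickly, which is why explicit tracking (as opposed to an implicit hypothesis space) seems like the hard part to me.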

More specific comments:

I admit my feathers got a little ruffled in section 6 with the poverty of the stimulus discussion. On p.777, Bod cites Crain (1991) who claims that (at the time - it's been 20 years since then) complex yes/no questions were the "parade case of an innate constraint". And then, Bod goes on to show how the U-DOP learner can learn complex yes/no questions. This is all well and good, because the "innate constraint" the nativists claimed was needed to learn this is precisely what Bod's U-DOP learner uses: structure-dependence. So it would actually be really bad (and strange) if the U-DOP learner, with all its knowledge of language structure, couldn't learn how to form complex yes/no questions properly. It seems to me that what Bod has done is show a method that uses structure-dependence in order to learn complex yes/no questions from the input. Since his learner assumes the knowledge the nativists say children need to assume, I don't think he can claim that he's shown anything that should change nativists' views on the problem.
It seems like this learner is actually tackling a harder problem than is necessary, since children will likely have some idea of grammatical category knowledge (even if they don't have it for all words yet). Given this, children also may be able to use some simple probability information between grammatical categories to form initial groupings (constituents) - so the U-DOP learner is actually considering a wider hypothesis space of possible tree structures when it allows any fragment of a sentence to form a productive unit (e.g., "the book" (constituent) vs. "in the").
I found it interesting that binarity plays such an integral role for this learner. That property seems similar to the property "Merge" (described on Wikipedia) in current generative linguistics.
It also seems like the overall process behind the U-DOP is a formalization of the chunking or "algebraic learning" process that gets talked about a lot for learning. In this case, it's chunking over tree structures. This struck me particularly in section 5.2, on p.774, with the "blow it up" example.
Smaller note: Why does the U-DOP do so poorly on Chinese, when compared to German and English data in section 4? It makes me wonder if there's something language-specific about approaching the learning problem this way, or perhaps something language-specific about using this particular structural representation.
