Tuesday, May 28, 2013

Have a good summer, and see you in the fall!


Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us this past academic year!

The CoLa Reading Group will be on hiatus this summer, and we'll resume again in the fall quarter.  As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Friday, May 24, 2013

Some thoughts on Kwiatkowski et al 2012

One of the things I really enjoyed about this paper was that it was a much fuller syntax & semantics system than anything I've seen in a while, which means we get to see the nitty-gritty of the assumptions required to make it all work. Having seen those assumptions, though, I did find it a little unfair for the authors to claim that no language-specific knowledge was required - as far as I could tell, the "language-universal" rules linking syntax and semantics are at the very least a language-specific kind of knowledge (in the sense of domain-specific vs. domain-general). In this respect, whatever learning algorithms they might explore, the overall approach seems similar to other learning models I've seen that are predicated on very precise theoretical linguistic knowledge (e.g., the parameter-setting systems of Yang 2002, Sakas & Fodor 2001, Niyogi & Berwick 1996, and Gibson & Wexler 1994, among others). It just so happens here that CCG assumes different primitives/principles than those other systems do - but domain-specific primitives/principles are still there a priori.

Getting back to the semantic learning - I'm a big fan of their learning words besides nouns, and of connecting with the language acquisition behavioral literature on syntactic bootstrapping and fast mapping.  That being said, the actual semantics they seem to learn is a bit different from what I think the fast-mapping people generally intend.  In particular, if we look at Figure 5, while three different quantifier meanings are learned, what's learned is more about the form the meaning takes than the actual lexical meaning of the word (i.e., the form for a, another, and any looks identical, so any differences in meaning are not recognized, even though these words clearly do differ in meaning). Lexical meaning is what people are generally talking about for fast mapping, though. What this looks like is almost grammatical categorization, where knowing the grammatical category means you know the general form the meaning will have (due to those linking rules between syntactic category and semantic form) rather than the precise meaning - and that's very much in line with syntactic bootstrapping, where the syntactic context might point you towards verb-y meanings or preposition-y meanings, for example.
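To make the "form of the meaning" point concrete, in a standard generalized-quantifier treatment (my illustration, not necessarily the exact logical forms in their Figure 5), all three determiners come out with the same skeleton:

\[
\textit{a},\ \textit{another},\ \textit{any} \;\mapsto\; \lambda P.\,\lambda Q.\,\exists x.\,[P(x) \wedge Q(x)]
\]

That captures that each word combines a noun property P with a predicate Q, but it says nothing about the "different from the one already mentioned" part of another or the free-choice flavor of any.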

More specific thoughts:

I found it interesting that the authors wanted to explicitly respond to the criticism that statistical learning models can't generate sudden, step-like behavior changes.  I think it's certainly an unspoken view among many in linguistics that statistical learning implies more gradual learning (which was usually seen as a bonus, from what I understood, given how noisy the data are). It's also unclear to me whether the data taken as evidence for step-wise changes really reflect a step-wise change, or instead only seem step-wise because of how often the samples were taken and how much learning happened in between.  It's interesting that the model here can generate it for learning word order (in Figure 6), though the only case that really stands out for me is the 5-meaning example, around 400 utterances.

I could have used a bit more unpacking of the CCG framework in Figure 2. I know there were space limitations, but the translation from semantic type to the example logical form wasn't always obvious to me. For example, the first and last examples (S_dcl and PP) have the same semantic type but not the same lambda calculus form. Is the semantic type what's linked to the syntactic category (presumably), and then there are additional rules for how to generate the lambda form for any given semantic type?
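For readers who, like me, could use the CCG mapping spelled out, here's the general shape I have in mind (a textbook-style illustration, not the paper's actual Figure 2):

\[
\begin{array}{lll}
\text{NP} & e & \textit{john} \\
\text{S}\backslash\text{NP} & \langle e,t\rangle & \lambda x.\,\textit{sleeps}(x) \\
(\text{S}\backslash\text{NP})/\text{NP} & \langle e,\langle e,t\rangle\rangle & \lambda y.\,\lambda x.\,\textit{likes}(x,y)
\end{array}
\]

Each syntactic category is paired with a semantic type, and the type constrains (but, as the S_dcl/PP pair shows, doesn't fully determine) the lambda form.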

The paper also provides a nice example of where the information that's easily available in dependency structures appears more useful, since the authors describe (in section 6) how they created a deterministic procedure that uses the primitive labels in the dependency structures to create the lambda forms. (As a side note, though, I was surprised that this mapping only worked for a third of the child-directed speech examples, leaving out not only fragments but also imperatives and nouns with prepositional phrase modifiers. I guess it's not unreasonable to first get your system working on a constrained subset of the data, though.)
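For concreteness, here's the flavor of mapping I imagine - a hypothetical, much-simplified sketch of my own, in which the relation labels and the coverage are invented, not the actual procedure from section 6:

```python
# Hypothetical sketch: read a verb's arguments off a toy dependency analysis and
# emit a CCG-style lexical entry. Labels and coverage are my invention, not the paper's.

def verb_entry(verb, dependents):
    """dependents: {relation: word}, e.g. {"SUBJ": "you", "OBJ": "it"}."""
    if "SUBJ" in dependents and "OBJ" in dependents:        # transitive
        return ("(S\\NP)/NP", f"lambda y.lambda x.{verb}(x,y)")
    if "SUBJ" in dependents:                                 # intransitive
        return ("S\\NP", f"lambda x.{verb}(x)")
    return None   # fragments, imperatives, PP-modified nouns, etc. would get dropped here

print(verb_entry("read", {"SUBJ": "you", "OBJ": "it"}))    # ('(S\\NP)/NP', 'lambda y.lambda x.read(x,y)')
print(verb_entry("sleep", {"SUBJ": "Mommy"}))              # ('S\\NP', 'lambda x.sleep(x)')
print(verb_entry("sit", {}))                               # None -- no analysis, like ~2/3 of the corpus
```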

I wish they had told us a bit more about the guessing procedure they used for parsing unseen utterances, since it had a clear beneficial impact throughout the learning period. Was it random (so that guessing at all beats not guessing, since sometimes you'd be right, as opposed to always being penalized for not having a representation for a given word)?  Was it some kind of probabilistic sampling?  Or maybe just always picking the most probable hypothesis?




Wednesday, May 15, 2013

Some thoughts on Frank et al. 2010

What I liked most about this article was the way the authors chose to explore the space of possibilities in a very computational-level way; I think this is a great example of what I'd like to see more of. As someone also interested in the cross-linguistic viability of our models, I have to commend them for testing not just on one foreign language, but on three.

There were a number of aspects of the model, though, that I think could have been more clearly specified. For instance, I don't believe they ever explicitly say that the model presumes knowledge of the number of states to be learned. Actual infants presumably don't get that number handed to them, so it would be nice to know what would happen if you inferred it from the data. It turns out there's a well-specified model that does exactly that, but I'll get to it later. Another problem with their description of the model has to do with how the hyperparameters are sampled. They apparently simplify the process by resampling them only once per iteration of the Gibbs sampler. I'm happy with this, although I'm going to assume it was a typo that they say they ran their model for 2000 iterations (Goldwater seems to prefer 20,000). Gibbs samplers tend to converge more slowly on time-dependent models, so it would be nice to have some evidence that the sampler has actually converged. Splitting the data by sentence type seems to increase the size of their confidence intervals by quite a lot, which may be an artifact of having less data per parameter, but could also be due to a lack of convergence.
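To make concrete what "resampling the hyperparameters once per iteration" looks like in practice, here's a toy, uncollapsed Gibbs sampler for a Bayesian HMM on synthetic data - my own sketch, not Frank et al.'s model or sampler:

```python
# Toy uncollapsed Gibbs sampler for a Bayesian HMM on synthetic data (my own sketch,
# not Frank et al.'s model), showing a hyperparameter resampled once per sweep.
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)
K, V, T = 3, 10, 200                       # states (assumed known!), vocab size, sequence length
obs = rng.integers(0, V, size=T)           # stand-in for a tag-able word sequence
z = rng.integers(0, K, size=T)             # initial state assignments
alpha, beta = 1.0, 1.0                     # symmetric Dirichlet hyperparameters

def sample_params(z):
    """Draw transition/emission matrices from their Dirichlet posteriors given z."""
    tc = np.full((K, K), alpha)
    ec = np.full((K, V), beta)
    for t in range(1, T):
        tc[z[t - 1], z[t]] += 1
    for t in range(T):
        ec[z[t], obs[t]] += 1
    A = np.array([rng.dirichlet(row) for row in tc])
    B = np.array([rng.dirichlet(row) for row in ec])
    return A, B

def resample_states(z, A, B):
    """One sweep over hidden states, each conditioned on its neighbours and its emission."""
    for t in range(T):
        p = B[:, obs[t]].copy()
        if t > 0:
            p *= A[z[t - 1], :]
        if t < T - 1:
            p *= A[:, z[t + 1]]
        z[t] = rng.choice(K, p=p / p.sum())
    return z

def resample_alpha(alpha, A):
    """One Metropolis step on the transition hyperparameter (flat prior on log alpha)."""
    prop = alpha * np.exp(rng.normal(0.0, 0.1))
    def logp(a):  # log density of the K sampled Dirichlet rows of A under concentration a
        return K * (lgamma(K * a) - K * lgamma(a)) + (a - 1) * np.log(A).sum()
    return prop if np.log(rng.random()) < logp(prop) - logp(alpha) else alpha

for _ in range(2000):                      # the reported 2000 sweeps; 20,000 would be safer
    A, B = sample_params(z)
    z = resample_states(z, A, B)
    alpha = resample_alpha(alpha, A)       # hyperparameter resampled once per sweep
```

Nothing about running the loop 2000 times guarantees convergence, of course; trace plots of the log joint probability or multiple chains from different initializations would be the usual sanity checks, and that's the kind of evidence I'm wishing for here.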

Typically I have to chastise modelers who attempt to use VI or V-measure, but fortunately they are not doing anything technically wrong here. They are correct that comparing these scores across corpora is hazardous at best. Both of these measures are biased: VI prefers small numbers of tags, and V-measure prefers large numbers of tags (they claim at some point that it is "invariant" to different numbers of tags; this is not true, however!). It turns out that another measure, V-beta, is more useful than either of these two in that it is unbiased with respect to the number of categories. So there's my rant about the wonders of V-beta.
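For reference, here's a small self-contained sketch (my own, with invented labels) of VI, homogeneity/completeness, and the V-beta generalization of V-measure, computed from the standard conditional-entropy definitions:

```python
import numpy as np

def entropies(gold, induced):
    gold, induced = np.asarray(gold), np.asarray(induced)
    cats, tags = np.unique(gold), np.unique(induced)
    # joint distribution: gold categories (rows) x induced tags (columns)
    joint = np.array([[np.mean((gold == c) & (induced == k)) for k in tags] for c in cats])
    pc, pk = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    H_C = -np.sum(pc * np.log(pc))
    H_K = -np.sum(pk * np.log(pk))
    H_C_given_K = -np.sum(joint[mask] * np.log((joint / pk)[mask]))
    H_K_given_C = -np.sum(joint[mask] * np.log((joint / pc[:, None])[mask]))
    return H_C, H_K, H_C_given_K, H_K_given_C

def cluster_scores(gold, induced, beta=1.0):
    H_C, H_K, H_CK, H_KC = entropies(gold, induced)
    vi = H_CK + H_KC                                  # variation of information (0 = perfect)
    h = 1.0 if H_C == 0 else 1.0 - H_CK / H_C         # homogeneity
    c = 1.0 if H_K == 0 else 1.0 - H_KC / H_K         # completeness
    v_beta = (1 + beta) * h * c / (beta * h + c)      # beta = 1 recovers the usual V-measure
    return {"VI": vi, "homogeneity": h, "completeness": c, "V_beta": v_beta}

gold = ["N", "N", "N", "V", "V", "DET", "DET", "DET"]  # invented gold categories
induced = [0, 0, 1, 1, 1, 2, 2, 2]                     # invented tagger output
print(cluster_scores(gold, induced))
print(cluster_scores(gold, induced, beta=2.0))         # beta > 1 weights completeness more heavily
```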

What I really would have liked to see is an infinite HMM (iHMM) run on these data: a well-specified, very similar model that can infer the number of grammatical categories from the data, and one that has had an efficient sampler since 2008, so there's no reason they couldn't run it over their corpus. It's very useful for us to know what the space of possibilities is, but to what extent would their results change if they gave up the assumption that the learner knows from the get-go how many categories there are? I'd be excited to see how well it performed.
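For anyone who hasn't run into it, the iHMM (HDP-HMM) simply replaces the fixed K-state transition matrix with a hierarchical Dirichlet process - roughly,

\[
\beta \sim \mathrm{GEM}(\gamma), \qquad
\pi_k \mid \beta \sim \mathrm{DP}(\alpha_0, \beta), \qquad
\phi_k \sim H,
\]
\[
z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad
x_t \mid z_t \sim F(\phi_{z_t}),
\]

so the number of occupied states is something the sampler infers rather than a constant you hand it; the efficient sampler I'm thinking of is the beam sampler.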

The one problem with both the models they show here and the iHMM is that neither allows transition probabilities or emission probabilities (depending on the model) to share information across sentence types; the sentence types are treated as entirely separate. They mention this in their conclusion, but I wonder if there's any way to share that information usefully without hand coding it somehow.
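One off-the-shelf possibility (my speculation, not something the paper proposes) would be a hierarchical prior that ties each sentence type's transition distributions to a shared global one, e.g.

\[
\pi_k^{\text{(global)}} \sim \mathrm{Dir}(\alpha), \qquad
\pi_k^{(s)} \mid \pi_k^{\text{(global)}} \sim \mathrm{Dir}\!\left(\alpha_0\, \pi_k^{\text{(global)}}\right) \quad \text{for each sentence type } s,
\]

so each sentence type can deviate from the shared transition behavior only as much as its data warrant; the same construction works for emissions.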

Overall, I'm really happy someone is doing this. I liked the use of some very salient information to help tackle a hard problem, but I would've liked to see it made a little more realistic by inferring the number of grammatical categories. I also would've liked to see better evidence of convergence (perhaps with a beam sampler instead of Gibbs; at the very least, I hope they ran it for more than 2000 iterations).

Tuesday, May 14, 2013

Next time on 5/28/13 @ 2pm in SBSG 2200 = Kwiatkowski et al. 2012

Thanks to everyone who joined our meeting this week, where we had a very thoughtful discussion about the experimental design for investigating "less is more" and the implications of the computational modeling in Perfors 2012.  Next time on Tuesday May 28 @ 2pm in SBSG 2200, we'll be looking at an article that presents an incremental learning model that incorporates both syntactic and semantic information during learning:

Kwiatkowski, T., Goldwater, S., Zettlemoyer, L., & Steedman, M. 2012. A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.



See you then!
-Lisa


Monday, May 13, 2013

Some thoughts on Perfors 2012 (JML)

One of the things I quite liked about this paper was the description of the intuitions behind the different model parameters and capacity limitations. As a computational modeler who's seen ideal Bayesian learners before, could I have just as easily decoded this from a standard graphical model representation? Sure.  Did I like having the intuitions laid out for me anyway?  You bet. Moreover, if we want these kinds of models to be recognized and used within language research, it's good to know how to explain them like this. On a related note, I also appreciated that Perfors explicitly recognized the potential issues involved in extending her results to actual language learning. As with most models, hers is a simplification, but it may be a useful simplification, and there are probably useful ways to un-simplify it.

It was also good to see the discussion of the relationship between the representations this model used for memory and the existing memory literature. (Given the publication venue, this probably isn't so surprising, but given that my knowledge of memory models is fairly limited, it was helpful to see this spelled out.)

I think the most surprising thing for me was how much memory loss was required before the regularization bias could come into play and the model showed regularization. Do we really think children only remember 10-20% of what they hear? (Maybe they do, though, especially in more realistic scenarios.)

More specific thoughts:

Intro: I found the distinctions made between different "less is more" hypothesis variants helpful - in particular, the difference between a "starting small" version that imposes explicit restrictions on the input (because of attention, memory, etc.) in order to identify useful units in the input, and a general regularization tendency (which may be a byproduct of cognitive limitations but isn't specifically about ignoring some of the input) that "smooths" the input in some sense.

Section 2.1.2: The particular task Perfors chooses to investigate experimentally is based on previous tasks used with children and adults to test regularization, but I wonder what kind of task it seemed like to the adult subjects. Since the stimuli were presented orally, did the subjects think of each one as a single word with some internal inconsistency (and so treat the variable part as morphology tacked onto a noun), or did they think of each one as a consistent word plus a separate determiner-like thing (making this more of a combinatorial syntax task)?  I guess it doesn't really matter for the purposes of regularization - if children can regularize syntax (creoles, Nicaraguan Sign Language, Simon), then presumably they regularize morphology too (e.g., children's overregularization of the English past tense, like goed), and it's not unreasonable to assume that the same regularization process applies to both. Perfors touches briefly on how adults may have perceived the task in the discussion (p. 40) - she mentions that mutual exclusivity might come into play if adults viewed this as a word learning task, and cause more of a bias for regularization.  Whether it's a morphology task or a combinatorial syntax task, I'm not sure I agree with that - mutual exclusivity seems like it would only apply if adults assumed the entire word was the name of the object (as opposed to the determiner-thing being an actual determiner like the or a, or morphology like -ed or -ing). Because only a piece of the entire "word" changed with each presentation of the object, it doesn't seem like adults would make that assumption.

Section 3.0.6: For the Prior bias, it seems like the prior is constructed from the global frequency of the determiner (based on the CRP). This seems reasonable, but I wonder whether it would matter to have a lexical-item-based prior (maybe in addition to the global prior). I could imagine that the proportion of forgotten data for any individual item might be quite high (even if it's low for other items) when global memory loss is less than 80-90%, which might allow the regularization effects to show up without needing to forget 80-90% of all the data.
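To spell out what I mean, here's a toy sketch of a CRP-style predictive distribution (my own illustration with invented counts, not Perfors's actual implementation):

```python
import numpy as np

def crp_predictive(counts, alpha=1.0):
    """Predictive probabilities over already-seen determiners plus a brand-new one."""
    counts = np.asarray(counts, dtype=float)
    return np.append(counts, alpha) / (counts.sum() + alpha)   # last entry = "new determiner"

# Global prior: pooled remembered determiner counts across the whole lexicon (invented numbers).
print(crp_predictive([45, 5]))    # [0.882, 0.098, 0.020] -- dominated by the observed counts

# Lexical-item-based prior: remembered counts for a single noun. Even with modest global
# forgetting, one item can have very few surviving tokens, so the alpha term (the prior)
# carries much more of the predictive probability and the regularization bias has more
# room to act for that item.
print(crp_predictive([2, 0]))     # [0.667, 0.000, 0.333]
```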

Section 4: It's an interesting observation that the previous experiments that found regularization effects were conducted over multiple days, where consolidation during sleep would presumably have occurred. Perfors mentions this as a potential memory distortion that occurs not during encoding or retrieval, but rather during memory maintenance. If this is true, running the experiments with adults again, but over multiple days, should presumably allow this effect to show up.