Friday, February 27, 2015

Next time on 3/13/15 @ 11:30am in SBSG 2221 = Viau et al. 2010

Thanks to everyone who was able to join us for our enlightening discussion of Frank et al. 2013!  For our next CoLa reading group meeting on Friday March 13 at 11:30am in SBSG 2221, we'll be looking at an article that discusses some experimental results relating to quantifiers and scope. These kinds of results can help us think about the learning process underlying them, and how we might be able to use modeling to discover things about that learning process.

Viau, J., Lidz, J., & Musolino, J. 2010. Priming of abstract logical representations in 4-year-olds. Language Acquisition, 17(1-2), 26-50.

See you then!

Wednesday, February 25, 2015

Some thoughts on Frank et al. 2013

One of the things I really liked about this paper was the intention to integrate more context into a computational model of acquisition (in this case, implemented as the child using utterance type information). While the particular utterance types may be idealized, it’s an excellent first step to show where this information helps and how the number of utterance types impacts that helpfulness (basically, as a proxy for more preceding context when there’s data sparseness). More generally, this got me thinking about approximation, e.g., approximating more context by the utterance type cue and approximating hierarchical structure with trigrams. We know it’s not the same, but it seems to be good enough, perhaps because it manages to capture the relevant property anyway. (For the utterance type approximating more context, this seems to be true — utterance type tells you about category order stuff more generally, and preceding context gives you specific category order information for the local environment.)
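Just to make the "utterance type as proxy for context" idea concrete for myself, here's a toy sketch (mine, not the authors' code, with invented tags and utterance types) of the difference between pooling tag-transition counts over all utterances and conditioning them on utterance type, BHMM-T-style. The point is that a single pooled transition distribution blurs ordering facts that the type-conditioned counts keep separate:

```python
from collections import defaultdict

# Toy tagged corpus: (utterance_type, [tags]) pairs.
# Tags and utterance types are invented for illustration.
corpus = [
    ("declarative", ["PRO", "V", "DET", "N"]),   # e.g., "She sees the dog"
    ("declarative", ["DET", "N", "V", "ADJ"]),   # e.g., "The dog is happy"
    ("question",    ["V", "PRO", "DET", "N"]),   # e.g., "Does she see the dog"
    ("question",    ["V", "DET", "N", "ADJ"]),   # e.g., "Is the dog happy"
]

# Pooled bigram transition counts (one transition distribution, BHMM-style).
pooled = defaultdict(int)
# Type-conditioned counts (one distribution per utterance type, BHMM-T-style).
by_type = defaultdict(int)

for utt_type, tags in corpus:
    for prev, nxt in zip(tags, tags[1:]):
        pooled[(prev, nxt)] += 1
        by_type[(utt_type, prev, nxt)] += 1

# Utterance type recovers ordering information the pooled counts blur:
# in this toy data, V-before-PRO only happens in questions.
print(pooled[("V", "PRO")])                     # -> 1
print(by_type[("question", "V", "PRO")])        # -> 1
print(by_type[("declarative", "V", "PRO")])     # -> 0
```

So utterance type here acts exactly like a coarse stand-in for "what kind of local ordering environment am I in" without the learner having to track the full preceding context.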

The authors also note that the actual utterance types children infer may be based on a number of cues, such as prosody (in particular, pitch contour or intonation). Knowing the ideal number of utterance types might be useful so we know how many classes we’re aiming for, based on these cues. At the very least, the results here suggest fewer may be better. Relatedly: recent experimental work by Geffen & Mintz (2014) suggests 12-month-olds can at least make a binary classification between declaratives vs. yes/no questions in English in the absence of prosodic contour cues — so there may be other cues infants are able to use besides prosody at the age when early grammatical categorization would be happening.
*Reference: Geffen, S. & Mintz, T. 2014. Can You Believe It? 12-Month-Olds Use Word Order to Distinguish Between Declaratives and Polar Interrogatives. Language Learning and Development. DOI: 10.1080/15475441.2014.951595.

More specific thoughts:

(1) Introduction, age ranges: “…children who are at the point of learning syntax — at 2-3 years of age” — This is just me being persnickety, but I think the age is closer to 1 if we’re talking about early categorization before the learner has any knowledge of categories (which is the start state of the learner modeled here).  I don’t think it matters for the model they use here and the cues they rely on, but it’s a more general point about these kinds of computational models. If we’re going to model a process where the learner is basically starting from scratch (no prior knowledge of categories here), then this is going to be happening very early and probably won’t persist for very long. Even after a little of this kind of learning, the learner then has some knowledge, which ought to bias future learning (future categorization here). This brings up the tricky subject of what the output ought to be for such early stage learning models (which Lawrence, Galia, and I have been worrying about lately). Do we really want adult-level grammatical categories? Maybe not. But what’s acceptable output and how do we tell if we’ve got it? F&al2013 sensibly compare the inferred categories to the CHILDES-annotated categories, which are based on adult categories. But if this is meant to model early categorization occurring around 12 months and only lasting long enough to boost other categorization, maybe that’s not the output we want.

(2) English experiments, 4.1.1. Corpora, methods quibble: “Wh-words are tagged as adverbs…pronouns…or determiners.”  — I wonder why. Wh-words have pretty distinct properties with respect to word order (wh-fronting in English), among other things. It seems like it might have been more useful to cluster wh-words together into their own category.

(3) English experiments, 4.1.2. Inference: Somewhat related to the point above, it’d be a nice extension to not preset the number of categories the learner is meant to identify, and instead infer how many categories are best and what words belong in those categories. (Hello, infinite BHMM…)
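To unpack the "infinite" aside: the usual trick for not presetting the number of categories is a Dirichlet-process-style prior, where new categories can always be opened. Here's a minimal sketch (my own toy illustration of the Chinese Restaurant Process prior, not anything from the paper) showing how the number of clusters emerges from the data rather than being fixed in advance:

```python
import random

random.seed(0)

def crp_assignments(n_items, alpha):
    """Sample cluster assignments from a Chinese Restaurant Process:
    the number of clusters is not fixed in advance but grows with the data.
    alpha controls how readily new clusters are opened."""
    assignments = []
    counts = []  # counts[k] = number of items currently in cluster k
    for i in range(n_items):
        # Join existing cluster k with prob counts[k] / (i + alpha),
        # or open a new cluster with prob alpha / (i + alpha).
        weights = counts + [alpha]
        r = random.random() * (i + alpha)
        cum = 0.0
        for k, w in enumerate(weights):
            cum += w
            if r < cum:
                break
        if k == len(counts):
            counts.append(1)  # open a new cluster (a new category)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

z = crp_assignments(50, alpha=1.0)
print(len(set(z)))  # number of categories inferred, not preset
```

In an infinite BHMM this prior sits over the hidden states, so the learner infers both how many categories there are and which words belong to them.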

(4) 5.2, BHMM-E: This is a really nice demonstration of how wrong assumptions hurt. It seems like we often see models that show how assumptions are helpful (because, hey, that’s interesting!), but it’s less often that we see such a clean demonstration of active harm resulting (instead of it just having no effect).

(5) 5.4.3, cross-linguistic variation: “Spanish does not show the same improvement…BHMM-T models do not differ from the baseline BHMM” — It sounds like whether infants heed utterance type as a cue may need to be learned, rather than just being a thing they do. Though since it doesn’t actually harm (it just doesn’t help), maybe it’s okay for infants to try to use it anyway in Spanish. However, just brainstorming about how infants might learn to pay attention to utterance type…perhaps they could notice word order differences across utterance types (i.e., use various cues to identify utterance types and then see if word order as defined by specific recognizable words — rather than grammatical categories — seems to change). Then, if word order varies, use utterance type information for categorization; if not, then don’t.
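Here's a toy sketch of what that brainstormed check could look like (entirely my own invention, with made-up utterances): track where a specific frequent word tends to appear in each utterance type, and only adopt the utterance-type cue if its position shifts noticeably:

```python
# Toy utterances, each labeled with an utterance type the child has already
# identified from other cues (e.g., prosody). Words and labels are invented.
utterances = [
    ("declarative", ["you", "can", "see", "it"]),
    ("declarative", ["you", "can", "go"]),
    ("question",    ["can", "you", "see", "it"]),
    ("question",    ["can", "you", "go"]),
]

def mean_position(word, utts):
    """Average normalized position (0 = first word, 1 = last word)
    of a specific word within the utterances that contain it."""
    positions = [tokens.index(word) / (len(tokens) - 1)
                 for _, tokens in utts if word in tokens]
    return sum(positions) / len(positions)

decls = [u for u in utterances if u[0] == "declarative"]
quests = [u for u in utterances if u[0] == "question"]

# How much does this word's position shift between utterance types?
shift = abs(mean_position("can", decls) - mean_position("can", quests))
print(shift)

# If the shift is large, word order varies with utterance type, so the
# utterance-type cue is worth using for categorization; if near zero, skip it.
use_utterance_type = shift > 0.25  # arbitrary threshold for this toy example
print(use_utterance_type)
```

On this kind of check, English (with its aux-fronting) would trigger the cue while a language whose word order doesn't vary by utterance type wouldn't, which matches the Spanish result.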

Friday, February 13, 2015

Next time on 2/27/15 @ 11:30am in SBSG 2221 = Frank et al. 2013

Thanks to everyone who was able to join us for our vigorous and educational discussion of Yurovsky & Frank’s 2014 Ms!  For our next CoLa reading group meeting on Friday February 27 at 11:30am in SBSG 2221, we'll be looking at an article that shows how a Bayesian model of early grammatical categorization can incorporate (and benefit from) information relating to utterance type.

Frank, S., Goldwater, S., & Keller, F. 2013. Adding sentence types to a model of syntactic category acquisition. Topics in Cognitive Science, 5(3), 495-521.

See you then!

Wednesday, February 11, 2015

Some thoughts on Yurovsky & Frank 2014 Ms

One thing I really enjoyed about this paper was the integration of cognitive resource constraints (memory and attention) into an ideal learner model. I may have some quibbles as to calling this “algorithmic” vs. “computational” (more on this below), since that distinction for me has to do with the inference process, but the core idea of including these aspects in the learning model seems like a nice step forward.

That being said, I thought the way “attention” was integrated was a bit curious — if I’m understanding correctly, it was part of the speaker’s intentions (I). Is this because the listener focuses her attention on the speaker’s intention to refer to something repeatedly? That’s the best link I could come up with. (More discussion on this below, too.) If so, I could imagine this ability maturing over time, so that early word-learners (~1 year old) have less ability to do this accurately than adults.

Back to more general things: This was also a nice demonstration of how two very different stories of a process can be implementations of a more general approach (as represented by the σ variable). Still, as the authors themselves note, it’s unclear what this particular study shows for either L1 or L2 learning. But it’s a good methodology demonstration, and maybe once more L1 data is available, this model can be applied to tell us something about word learning in toddlers.

More specific comments: 

(1) Introduction, “…both of these algorithmic-level solutions will, in the limit, produce successful word-reference mapping, they will do so at very different rates…may be necessary to posit additional biases and constraints on learners in order for human-scale lexicons to be learned in human-scale time from the input available to children” — This is a very good point, and highlights one important measure of algorithmic-level approaches. That being said, I think the particular approaches being discussed here are really only meant to apply to very early word-learning when almost no words are already known, which may only last a short while. So, the “human-scale lexicon” may be rather small.

(2) Model, p.17: “…the most convenient place to integrate attention is in defining the learner’s beliefs about P(I | O)…[o]ne possibility is to let each object be equally likely to be the intended reference…[a]lternatively, the learner could place all the probability mass on one hypothesized referent…more flexible alternative is to assign some probability mass σ to the hypothesized referent…” — So this is the specific instantiation I alluded to in my comment at the beginning. Since I is meant to be about the speaker’s intentions, it seems like this has to be some kind of theory of mind thing, where the listener assumes the speaker is intending to talk about everything with uniform probability (option one), only one thing all the time (option two), or some things more than others (option three). This seems vaguely odd as a model of listener “attention”, though it may capture assumptions about communicative goals very naturally.
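For my own bookkeeping, here's how I read the σ parameterization (a sketch of my interpretation, not the authors' code): all three options are points on one dial, with the uniform and all-mass-on-one options as the endpoints.

```python
def p_intention_given_objects(objects, hypothesized, sigma):
    """Listener's belief P(I | O): probability each object is the speaker's
    intended referent. sigma is the mass on the hypothesized referent;
    the remainder is spread uniformly over the other objects.
    sigma = 1/len(objects) recovers the uniform option (one),
    sigma = 1.0 recovers the all-mass-on-one option (two)."""
    n = len(objects)
    rest = (1.0 - sigma) / (n - 1) if n > 1 else 0.0
    return {obj: (sigma if obj == hypothesized else rest) for obj in objects}

objs = ["ball", "dog", "cup"]
print(p_intention_given_objects(objs, "dog", sigma=1/3))  # option one: uniform
print(p_intention_given_objects(objs, "dog", sigma=1.0))  # option two: one referent
print(p_intention_given_objects(objs, "dog", sigma=0.8))  # option three: graded
```

Seen this way, σ is less a model of the listener's attention per se and more a model of how concentrated the listener believes the speaker's referential intentions are, which is why the theory-of-mind reading feels more natural to me.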

(3) General Discussion, p.21: “…graded shift in representation was well-described by an ideal learning model, but only when this model was modified to take into account psychological constraints on attention and memory…the shift from a computational to an algorithmic (or, psychological) description was critical” — And this is where my quibbles arise. I completely agree that integrating resource constraints is a great step forward, but I hesitate to say these were integrated at the algorithmic level. The inference process was still MCMC, if I understood correctly, and I don’t think any modification was done to it. So, for me, that’s a way to approximate the optimal inference, and so is a computational-level thing. Maybe this is more “rational process model”, though (one step down from pure computational, but not yet what I'd call algorithmic)?