I really enjoyed seeing this extension of a reasonable existing word-learning model (which was focused on concrete nouns) to something that tries to capture more of the complexity of word meaning learning. I admit I was surprised to find out that the extension was on the semantics side (compositional meanings) rather than some sort of syntactic bootstrapping (using surrounding word contexts), especially given their opening example. Given the extensive syntactic bootstrapping experimental literature, I think a really cool extension would be to incorporate the idea that words appearing in similar distributional contexts have similar meanings. Maybe this requires a more sophisticated “meaning” hypothesis space, though?
I also appreciated seeing the empirical predictions resulting from their model (good modeling practices, check!). More specifically, they talk about why their model does better with a staged input representation, and suggest that learning from one, then two, then three words would lead to the same result as learning from three, then two, then one word (which is not so intuitive, and therefore an interesting prediction). To be honest, however, I didn’t quite follow the nitty-gritty details of why that should be, so that’s worth hashing out together.
More specific thoughts:
(1) The learners here have the assumption that a word refers to a subset of world-states, and that presumably could be quite large (infinite even) if we’re talking about all possible combinations of objects, properties, and actions, etc. So this means the learner needs to have some restrictions on the possible components of the world-states. I think that’s pretty reasonable — we know from experimental studies that children have conceptual biases, and so probably also have equivalent perceptual biases that filter down the set of possible world-states in the hypothesis space.
(2) The “wag” example walk-through: I’m not sure I understand exactly how the likelihood works here. “Wag” refers to side-to-side motion. If the learner thinks “wag” refers to side-to-side motion + filled/black shading, this is described as being “consistent with the observed data”. But what about the instances of “wag” occurring with non-filled items (du ri wag, pu ri wag) - these aren’t consistent with that hypothesis. So shouldn’t the likelihood of generating those data, given this hypothesis, be 0? M&G2015 also note for this case that “the likelihood is relatively low in that the hypothesis picks out a larger number of world-states”. But isn’t side-to-side+black/filled compatible with fewer world-states than side-to-side alone?
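To make the worry in (2) concrete, here is a minimal sketch of a strict size-principle likelihood. This is my assumption about the general shape of the likelihood, not necessarily what M&G2015 actually use (theirs may well tolerate noise). The toy world of (motion, shading) pairs and the names `extension` and `likelihood` are made up for illustration.

```python
from itertools import product

# Hypothetical toy world: every combination of a motion and a shading.
MOTIONS = ["side-to-side", "up-down"]
SHADINGS = ["filled", "unfilled"]
WORLD_STATES = set(product(MOTIONS, SHADINGS))

def extension(motion=None, shading=None):
    """World-states picked out by a hypothesis; None means 'unconstrained'."""
    return {
        (m, s) for (m, s) in WORLD_STATES
        if (motion is None or m == motion) and (shading is None or s == shading)
    }

def likelihood(observations, hypothesis):
    """Strict size principle (an assumption): 1/|h| per observation in h, else 0."""
    p = 1.0
    for obs in observations:
        p *= (1.0 / len(hypothesis)) if obs in hypothesis else 0.0
    return p

wag_motion_only = extension(motion="side-to-side")
wag_motion_and_filled = extension(motion="side-to-side", shading="filled")

# "wag" heard with one filled and one unfilled item, both moving side to side.
data = [("side-to-side", "filled"), ("side-to-side", "unfilled")]

print(len(wag_motion_only), likelihood(data, wag_motion_only))
print(len(wag_motion_and_filled), likelihood(data, wag_motion_and_filled))
```

Under this strict version, the side-to-side + filled hypothesis picks out fewer world-states and gets a likelihood of 0 as soon as an unfilled item wags. So if the paper's description is right, the likelihood presumably has to be defined differently (for example, with some allowance for noise), which is exactly the part I'd like to understand.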
(3) I like the incorporation of memory noise (which makes this simulation more cognitively plausible). Certainly the unintentional swapping of a word is one way to implement memory noise that doesn’t require messing with the guts of the Bayesian model (it’s basically an update to the input the model gets). I wonder what would happen if we messed with the internal knowledge representation instead (or in addition to this) and let the learned mappings degrade over time. I could imagine implementing that as some kind of fuzzy sampling of the probabilities associated with the mappings between word and world-state.
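The two kinds of noise in (3) could be sketched roughly as follows. Both functions are hypothetical: `swap_one_word` is my reading of input-side swapping (I'm assuming a swap replaces one word with a different vocabulary word), and `decay_mappings` is just one simple way to let learned mappings degrade, by pulling each word's distribution over meanings toward uniform. The parameter names `swap_prob` and `decay` are mine, not the paper's.

```python
import random

def swap_one_word(utterance, vocabulary, swap_prob, rng):
    """Input-side noise: with probability swap_prob, replace one word
    with a different word from the vocabulary."""
    words = list(utterance)
    if words and rng.random() < swap_prob:
        i = rng.randrange(len(words))
        alternatives = [w for w in vocabulary if w != words[i]]
        if alternatives:
            words[i] = rng.choice(alternatives)
    return words

def decay_mappings(mapping_probs, decay):
    """Knowledge-side noise: mix each word's meaning distribution with uniform."""
    decayed = {}
    for word, probs in mapping_probs.items():
        uniform = 1.0 / len(probs)
        decayed[word] = {
            meaning: (1 - decay) * p + decay * uniform
            for meaning, p in probs.items()
        }
    return decayed

rng = random.Random(0)  # seeded so the sketch is reproducible
vocab = ["du", "pu", "ri", "wag"]
noisy = swap_one_word(["du", "ri", "wag"], vocab, swap_prob=1.0, rng=rng)

learned = {"wag": {"side-to-side": 0.8, "up-down": 0.2}}
degraded = decay_mappings(learned, decay=0.5)
print(noisy, degraded)
```

The first function only touches the input, so the Bayesian model itself is unchanged. The second would have to be applied between learning episodes, which means changing the learner's stored state.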
(4) Figure 3, with the adult artificial learning results from Kersten & Earles 2001: Adults are best at object or path mapping, and are much worse at manner mapping. My guess is that this has to do with the English bias for manner-of-motion encoded in verbs over direction-of-motion (which happens to be the opposite of the Spanish bias). So, these results could come from a transfer effect from the English L1: in essence, due to their L1 bias, it doesn’t occur to the English subjects to encode the manner as a separate word from the verb-y/action-y type word. Given what we know about the development of these language-specific verb biases, this may not be present in the same way in children learning their initial language (e.g., there’s some evidence that all kids come predisposed for direction-of-motion encoding; see Maguire et al. 2010). At any rate, it seems easy enough to build in a salience bias for one type of world-state: just weight the prior accordingly. At the moment, the model doesn’t show the same manner deficit, so this could be an empirically-grounded bias to add to the model to account for those behavioral results.
Maguire, M. J., Hirsh-Pasek, K., Golinkoff, R. M., Imai, M., Haryu, E., Vanegas, S., Okada, H., Pulverman, R., & Sanchez-Davis, B. (2010). A developmental shift from similar to language-specific strategies in verb acquisition: A comparison of English, Spanish, and Japanese. Cognition, 114(3), 299-319.
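For the salience bias suggested in (4), weighting the prior might look something like the sketch below. Everything here is hypothetical: the hypothesis labels, the `weighted_prior` helper, and especially the weights, which are made-up numbers and not estimates from any data.

```python
def weighted_prior(hypotheses, salience):
    """Prior proportional to the salience weight of each hypothesis's feature type."""
    weights = {h: salience[kind] for h, kind in hypotheses.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

# Hypothetical candidate meanings for a novel word, tagged by the type of
# world-state feature each one picks out.
hypotheses = {"the-object": "object", "the-path": "path", "the-manner": "manner"}

# Made-up weights: manner is treated as less salient as a separate word.
salience = {"object": 1.0, "path": 1.0, "manner": 0.5}

prior = weighted_prior(hypotheses, salience)
print(prior)
```

With equal weights this reduces to a uniform prior, so the bias is a single knob that could be fit against the adult behavioral results.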
(5) Also Figure 3: I’m not sure what to make of the model comparison with human behavior. I agree that there’s a qualitative match with respect to improvement for staged exposure over full exposure. Other than that? Maybe the percent correct if averaged (sort of) for eta = 0.25. I guess the real question is how well the model is supposed to match the adult behavior. (That is, maybe I’m being too exacting in my expectations for the output behavior of the model, given what it has built into it.)
(6) Simulation 3 setup: I didn’t quite follow this. Is the idea that the utterance is paired with four world-states, and the learner assumes the utterance refers to one of them? If so, what does this map to in a realistic acquisition scenario? Having more conceptual mappings possible? In general, I think the page limit forced the authors to cut the description of this simulation short, which makes it tricky to understand.