Tuesday, October 23, 2018

Some thoughts on Gauthier et al. 2018

I love seeing examples of joint learning because not only do joint learning models tend to do better than sequential models, but joint learning also seems to be the best fit to how real children learn language (and other things). [I remember a more senior colleague who works on a variety of acquisition processes that happen during infancy and toddlerhood saying something like the following: “I used to think babies first learned how to segment words, then learned their language-specific sound categories, and then figured out words. I don’t think those babies exist anymore.”] As G&al2018 find, it can also be more efficient to learn jointly than sequentially. Why? Because you harness information from “the other thing” when you’re learning jointly, while you simply ignore that information if you’re learning sequentially. I think a real hurdle in the past has been how to mathematically define joint learning models so that the math is tractable with current techniques. Happily (at least when it comes to making modeling progress), that hurdle seems to be getting surmounted.
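To make that concrete, here’s a toy sketch of the argument (the likelihoods, data, and variables below are all invented for illustration -- this is not G&al2018’s actual model). A joint learner keeps a posterior over two coupled latent variables, while a sequential learner commits to the first variable using only that variable’s dedicated cue, throwing away what the other evidence says about it:

types, meanings = [0, 1], [0, 1]

def likelihood(obs, t, m):
    """P(obs | t, m) for obs = (syntactic cue, referent cue); numbers are made up."""
    syn_cue, ref_cue = obs
    p_syn = 0.6 if syn_cue == t else 0.4             # weak, noisy cue to t alone
    expected_ref = 1 if (t == 1 and m == 1) else 0   # the coupling between t and m
    p_ref = 0.9 if ref_cue == expected_ref else 0.1  # strong cue, but needs both
    return p_syn * p_ref

# Invented data: the syntactic cues mildly mislead about t; the referent cues don't.
data = [(0, 1), (0, 1), (1, 1), (0, 1)]

# Joint learner: one posterior over (t, m) pairs, updated by every cue.
joint = {(t, m): 1.0 for t in types for m in meanings}
for obs in data:
    joint = {tm: p * likelihood(obs, *tm) for tm, p in joint.items()}
total = sum(joint.values())
joint = {tm: p / total for tm, p in joint.items()}

# Sequential learner: choose t from the syntactic cues only, then freeze it.
t_scores = {t: 1.0 for t in types}
for syn_cue, _ in data:
    t_scores = {t: s * (0.6 if syn_cue == t else 0.4) for t, s in t_scores.items()}
t_committed = max(t_scores, key=t_scores.get)

print("joint posterior:", joint)                 # nearly all mass on (t=1, m=1)
print("sequential commits to t =", t_committed)  # t = 0, and it's stuck there

In this toy setup, the referent cues point strongly toward (t=1, m=1), but the sequential learner never consults them when choosing t, commits to t=0, and afterwards can’t learn anything about m.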


It’s also great to see models being evaluated against observable child behavior, rather than a target linguistic knowledge state that we can’t observe. It’s much easier to defend why your model matches behavior (answer: because it’s what we see children doing -- even if it’s only a qualitative match, like what we see here) than to defend why your model aims for one specific theoretically motivated knowledge state rather than some other, equally plausible theoretically motivated one.


What’s exciting about the results is how little you need to build in to get the performance jump. You have to build in the possibility of connections between certain pieces of information in the overhypothesis (e.g., syntactic type to attribute), but not the explicit content of those connections (what the probabilities are). So, stepping back, this supports prior knowledge that focuses your attention on certain building blocks (i.e., “look at these connections”), but doesn’t have to explicitly define the exact form built from those blocks. Defining that exact form is what you as a child learn to do, based on your input. To me, this is the way forward for generative theorizing about what’s in Universal Grammar.
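As a minimal sketch of that “structure given, content learned” split (a count-based stand-in of my own, not the paper’s actual hierarchical model): the table linking syntactic type to attribute is wired in ahead of time, but its probabilities start flat and get filled in by the input.

SYNTACTIC_TYPES = ["adjective", "noun"]
ATTRIBUTES = ["color", "shape", "material"]
ALPHA = 1.0  # symmetric pseudocount: the built-in connections start out flat

# The *possibility* of a syntactic-type-to-attribute link is built in here...
counts = {s: {a: ALPHA for a in ATTRIBUTES} for s in SYNTACTIC_TYPES}

def observe(syn_type, attribute):
    """...but the *content* of each link comes only from observed input."""
    counts[syn_type][attribute] += 1

def p_attribute_given_type(syn_type):
    """Posterior-mean estimate of P(attribute | syntactic type)."""
    row = counts[syn_type]
    total = sum(row.values())
    return {a: c / total for a, c in row.items()}

# Hypothetical input where prenominal adjectives mostly name colors:
for _ in range(8):
    observe("adjective", "color")
observe("adjective", "shape")

print(p_attribute_given_type("adjective"))  # color dominates, learned from input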


Other specific thoughts:
(1) It’s nice to see the mention of Abend et al. 2017 -- that’s a paper I recently ran across that did an impressive job of jointly learning word meaning and syntactic structure. It looks like G&al2018 use the CCG formalism too, which is very interesting, as CCG has a couple of core building blocks that can generate a lot of possible language structure. This is similar in spirit to Minimalism (few building blocks, lots of generative capacity), but CCG now has acquisition models associated with it that explain how learning could work, while Minimalism doesn’t yet.
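For readers who haven’t run into CCG, here’s a deliberately simplified sketch of the “few building blocks” point (real CCG categories and combinators are richer than this): two application rules plus lexical category assignments already derive phrases.

def forward_apply(left, right):
    """Forward application: X/Y combined with Y yields X."""
    if "/" in left:
        x, y = left.split("/", 1)
        if y == right:
            return x
    return None

def backward_apply(left, right):
    """Backward application: Y combined with X\\Y yields X."""
    if "\\" in right:
        x, y = right.split("\\", 1)
        if y == left:
            return x
    return None

# Toy lexicon with simplified categories.
lexicon = {"the": "NP/N", "blue": "N/N", "cube": "N", "rolled": "S\\NP"}

# Derive "the blue cube rolled":
n = forward_apply(lexicon["blue"], lexicon["cube"])  # blue cube -> N
np = forward_apply(lexicon["the"], n)                # the [blue cube] -> NP
s = backward_apply(np, lexicon["rolled"])            # [the blue cube] rolled -> S
print(n, np, s)  # N NP S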


(2) Given the ages in the Smith et al. 1992 study (2;11-3;9), it’s interesting that G&al2018 are focusing on the meaning of the prenominal adjective position. While this seems perfectly reasonable to start with, I could also imagine that children of this age have something like syntactic categories, so it’s not just the prenominal adjective position that has some meaning connection, but adjectives in general. It’d be handy to know the distribution of meanings for adjectives as a category, and use that in addition to the more specific positional information about prenominal adjective meaning. (It seems like this might be closer to what three-year-olds are using.) Or maybe the idea is that this is a model of how those categories form in the first place, and we then see the results of that process in the three-year-olds?
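A sketch of what that combination might look like -- just interpolating a position-specific meaning distribution with a category-general one; the numbers and the interpolation weight below are invented:

# Position-specific: meanings seen in the prenominal adjective slot.
p_prenominal = {"color": 0.60, "shape": 0.25, "other": 0.15}
# Category-general: meanings seen for adjectives anywhere.
p_adjective = {"color": 0.40, "shape": 0.35, "other": 0.25}

LAM = 0.7  # hypothetical weight on the more specific (positional) information

p_combined = {m: LAM * p_prenominal[m] + (1 - LAM) * p_adjective[m]
              for m in p_prenominal}
print(p_combined)  # a blend of positional and categorical evidence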


(3) In the reference game, I wonder if the 3D nature of the environment matters. Given the properties of interest (shape, color), it seems like the same investigation could be accomplished with a simple list of potential referents and their properties (color, shape, material, size). Maybe this is for ease of extension later on, where perceptual properties of the objects (e.g., distance, contrast) might impact learner inferences about an intended referent?
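For instance, a flattened version of the game could just filter a list of property records (all property names and values below are invented):

referents = [
    {"shape": "cube",   "color": "blue",  "material": "rubber", "size": "small"},
    {"shape": "sphere", "color": "red",   "material": "metal",  "size": "large"},
    {"shape": "cube",   "color": "green", "material": "rubber", "size": "large"},
]

def candidates(mentioned):
    """Referents consistent with the properties mentioned in the utterance."""
    return [r for r in referents
            if all(r.get(attr) == val for attr, val in mentioned.items())]

print(candidates({"color": "blue", "shape": "cube"}))  # unique referent
print(candidates({"material": "rubber"}))              # still ambiguous: two left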


(4) Marr’s levels check: This seems to be a computational-level (=~rational) model when it comes to inference (using optimal algorithms of various kinds for lexicon induction), yet it also incorporates incremental learning -- which makes it feel more like an algorithmic-level (=~process) model. Typically, I think about rational vs. process models as answering different kinds of acquisition questions. Rational = “Is it possible for a learner to accomplish this acquisition task, given this input, these abilities, and this desired output?”; process = “Is it possible for a child to accomplish this acquisition task, given this input, known child abilities, known child limitations (both cognitive and in learning time), and this desired output?” This model starts to incorporate at least one known limitation of child learners: they see and learn from data incrementally, rather than being able to hold all the data in mind at once for analysis.
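A tiny illustration of that batch-versus-incremental distinction, on a toy proportion-estimation task with invented data: the incremental learner keeps only a running estimate and a count, never the whole data set, yet lands on the same answer here.

stream = [1, 0, 1, 1, 0, 1, 1, 1]  # invented observations: 1 = color use, 0 = other

# Batch (rational-analysis style): hold all the data in mind at once.
batch_estimate = sum(stream) / len(stream)

# Incremental (process-model style): one observation at a time, constant memory.
estimate, n = 0.0, 0
for x in stream:
    n += 1
    estimate += (x - estimate) / n  # running-mean update

assert abs(estimate - batch_estimate) < 1e-9
print(batch_estimate, estimate)  # same answer, very different memory demands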


(5) If I’m interpreting Figure 4 correctly, I think s|t (syntactic type given abstract type, e.g., adjective given color) would correspond to a sort of inverse syntactic bootstrapping (traditional syntactic bootstrapping: the linguistic context provides the cue to word meaning). Here, the attribute of color, for example, gives you the syntactic type of adjective. Then, does w|v (word form given attribute value, e.g., “blue” given the blue color) correspond to the more standard idea of a basic lexicon that consists just of word-form-to-referent mappings?
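In code form, that reading of the factorization might look like this -- two separate conditional tables, one for each direction of mapping (all probabilities invented):

# s|t: syntactic type given abstract attribute type ("inverse" bootstrapping).
p_s_given_t = {
    "color": {"adjective": 0.9, "noun": 0.1},
    "shape": {"adjective": 0.2, "noun": 0.8},
}

# w|v: word form given attribute value (a plain word-to-referent lexicon).
p_w_given_v = {
    "blue_value": {"blue": 0.95, "navy": 0.05},
    "cube_value": {"cube": 0.90, "block": 0.10},
}

def p_realization(attr_type, syn_type, attr_value, word):
    """How likely this attribute surfaces as this syntactic type and word form."""
    return p_s_given_t[attr_type][syn_type] * p_w_given_v[attr_value][word]

print(p_realization("color", "adjective", "blue_value", "blue"))  # 0.9 * 0.95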


(6) As a proof of concept, I definitely understand starting with a synthetic set of referring expressions. But maybe a next step is to use naturalistic distributions of color+shape combinations? The corpus data used in the initial corpus analysis seem like a good reference distribution.
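Something as simple as frequency-weighted sampling from the corpus counts would do it (the counts below are placeholders, not the paper’s actual numbers):

import random

# Placeholder corpus counts for color+shape combinations.
empirical_counts = {("blue", "cube"): 12, ("red", "sphere"): 30, ("green", "cube"): 5}
pairs = list(empirical_counts)
weights = [empirical_counts[p] for p in pairs]

# Draw synthetic referring-expression attributes with naturalistic frequencies.
synthetic = random.choices(pairs, weights=weights, k=10)
print(synthetic)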


(7) Figure 5a (and the same pattern shows up in 5b): It seems like the biggest difference is that the overhypothesis model jumps up to higher performance more quickly (though the base model catches up after not too many more examples). It’s striking how much can be learned after only 50 examples or so -- and this super-fast learning highlights why this is more a rational (computational-level) model than a process (algorithmic-level) one: it’s unlikely children can do the same thing after 50 examples.
