Tuesday, October 23, 2018

Some thoughts on Gauthier et al. 2018

I love seeing examples of joint learning because not only do joint learning models tend to do better than sequential models, but joint learning also seems to be the best fit for how real children learn (language) things. [I remember a more senior colleague who works on a variety of acquisition processes that happen during infancy and toddlerhood saying something like the following: “I used to think babies first learned how to segment words, then learned their language-specific sound categories, and then figured out words. I don’t think those babies exist anymore.”] As G&al2018 find, this is because it can be more efficient to learn jointly than sequentially. Why? Because you harness information from “the other thing” when you’re learning jointly, while you simply ignore that information if you’re learning sequentially. I think a real hurdle in the past has been how to mathematically define joint learning models so the math is solvable with current techniques. Happily (at least when it comes to making modeling progress), that seems like a hurdle that’s being surmounted.


It’s also great to see models being evaluated against observable child behavior, rather than a target linguistic knowledge state that we can’t observe. It’s much easier to defend why your model is matching behavior (answer: because it’s what we see children doing -- even if it’s only a qualitative match, like what we see here) than it is to defend why your model is aiming for one specific theoretically-motivated target knowledge state instead of another, equally plausible one.


What’s exciting about the results is how much you don’t need to build in to get the performance jump. You have to build in the possibility of connections between certain pieces of information in the overhypothesis (e.g., syntactic type to attribute), but not the explicit content of those connections (what the probabilities are). So, stepping back, this supports prior knowledge that focuses your attention on certain building blocks (i.e., “look at these connections”), but doesn’t have to explicitly define the exact form built from those blocks. Defining that exact form is what you as a child learn to do, based on your input. To me, this is the way forward for generative theorizing about what’s in Universal Grammar.
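
To make the “possibility but not content” point concrete, here’s a minimal sketch (my own toy construction, not G&al2018’s actual hierarchical model): the table linking syntactic position to attribute type is given in advance, but every number in it has to be learned from observed examples.

```python
# Toy sketch (not the paper's model): an overhypothesis that builds in WHICH
# things can be connected (syntactic position <-> attribute type) but learns
# the CONTENT of the connection (the probabilities) from examples.

class PositionAttributeOverhypothesis:
    """Tracks P(attribute type | syntactic position) from observed pairings."""

    def __init__(self, positions, attributes, pseudocount=1.0):
        # Built in: the candidate positions and attribute types.
        # Not built in: how strongly any position predicts any attribute type.
        self.counts = {p: {a: pseudocount for a in attributes} for p in positions}

    def observe(self, position, attribute):
        self.counts[position][attribute] += 1.0

    def prob(self, attribute, position):
        row = self.counts[position]
        return row[attribute] / sum(row.values())

# After seeing mostly color words in prenominal adjective position, a novel
# word in that position is expected to name a color too.
oh = PositionAttributeOverhypothesis(
    positions=["prenominal_adj", "bare_noun"],
    attributes=["color", "shape", "material"],
)
for _ in range(8):
    oh.observe("prenominal_adj", "color")
oh.observe("prenominal_adj", "material")
print(round(oh.prob("color", "prenominal_adj"), 2))  # 0.75
```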


Other specific thoughts:
(1) It’s nice to see the mention of Abend et al. 2017 -- that’s a paper I recently ran across that did an impressive job of jointly learning word meaning and syntactic structure. It looks like G&al2018 use the CCG formalism too, which is very interesting, since CCG has a small set of core building blocks that can generate a lot of possible language structure. This is similar in spirit to Minimalism (few building blocks, lots of generative capacity), but CCG now has acquisition models associated with it that explain how learning could work, while Minimalism doesn’t yet have comparable models.


(2) Given the ages in the Smith et al. 1992 study (2;11-3;9), it’s interesting that G&al2018 are focusing on the meaning of the prenominal adjective position. While this seems perfectly reasonable to start with, I could also imagine that children of this age have something like syntactic categories, and so it’s not just the prenominal adjective position that has some meaning connection, but adjectives in general that have some meaning connection. It’d be handy to know the distribution of meanings for adjectives in general, and use that in addition to the more specific positional information of prenominal adjective meaning. (It seems like this might be closer to what 3-year-olds are using.) Maybe the idea is that this is a model of how those categories form in the first place, and then we see the results of it in the three-year-olds?


(3) In the reference game, I wonder if the 3D nature of the environment matters. Given the properties of interest (shape, color), it seems like the same investigation could be accomplished with a simple list of potential referents and their properties (color, shape, material, size). Maybe this is for ease of extension later on, where perceptual properties of the objects (e.g., distance, contrast) might impact learner inferences about an intended referent?
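
For what it’s worth, here’s the kind of flat stand-in I have in mind (my own construction, not the paper’s actual environment): each candidate referent is just a bundle of properties, with no perceptual detail at all.

```python
# A flat stand-in for the 3D reference-game world: a list of candidate
# referents, each described only by its properties.
referents = [
    {"id": 0, "color": "blue", "shape": "cube",   "material": "plastic", "size": "small"},
    {"id": 1, "color": "red",  "shape": "cube",   "material": "wood",    "size": "large"},
    {"id": 2, "color": "blue", "shape": "sphere", "material": "wood",    "size": "small"},
]

def candidates(description):
    """Return the referents consistent with a (possibly partial) description."""
    return [r for r in referents
            if all(r.get(attr) == value for attr, value in description.items())]

print([r["id"] for r in candidates({"color": "blue"})])                   # [0, 2]
print([r["id"] for r in candidates({"color": "blue", "shape": "cube"})])  # [0]
```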


(4) Marr’s levels check: This seems to be a computational-level (=~rational) model when it comes to inference (using optimal algorithms of various kinds for lexicon induction), yet it also incorporates incremental learning -- which makes it feel more like an algorithmic-level (=~process) model. Typically, I think about rational vs. process models as answering different kinds of acquisition questions. Rational = “Is it possible for a learner to accomplish this acquisition task, given this input, these abilities, and this desired output?”; process = “Is it possible for a child to accomplish this acquisition task, given this input, known child abilities, known child limitations (both cognitive and in terms of learning time), and this desired output?” This model starts to incorporate at least one known limitation of child learners -- they see and learn from data incrementally, rather than being able to hold all the data in mind at once for analysis.
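
Here’s a toy illustration (mine, not the paper’s inference machinery) of what that incremental constraint amounts to: the learner only ever carries a running summary forward rather than the full data set, even though in this simple conjugate case it lands on exactly the same answer as a batch learner.

```python
# Batch vs. incremental updating for a simple Beta-Binomial learner.
def batch_posterior(data, alpha=1.0, beta=1.0):
    """Posterior (alpha, beta) computed with all observations in hand at once."""
    successes = sum(data)
    return alpha + successes, beta + len(data) - successes

def incremental_posterior(data_stream, alpha=1.0, beta=1.0):
    """Same posterior, but each observation is folded in and then forgotten."""
    for x in data_stream:
        alpha, beta = alpha + x, beta + (1 - x)
    return alpha, beta

data = [1, 1, 0, 1, 0, 1, 1]  # e.g., whether each prenominal adjective named a color
print(batch_posterior(data))              # (6.0, 3.0)
print(incremental_posterior(iter(data)))  # (6.0, 3.0) -- same answer, constant memory
```

For conjugate toy cases like this the two coincide exactly; the more process-like (and more interesting) cases are the ones where incremental, memory-limited inference starts to diverge from the batch ideal.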


(5) If I’m interpreting Figure 4 correctly, I think s|t (syntactic type given abstract type, e.g., adjective given color) would correspond to a sort of inverse syntactic bootstrapping (traditional syntactic bootstrapping: the linguistic context provides the cue to word meaning). Here, the attribute of color, for example, gives you the syntactic type of adjective. Then, does w|v (word form given attribute value, e.g., “blue” given the blue color) correspond to the more standard idea of a basic lexicon that consists just of word-form-to-referent mappings?
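
To make my reading of those two pieces concrete, here’s a tiny generative sketch (the distributions are invented for illustration, and the variable names are just my gloss on Figure 4, not the paper’s code):

```python
import random

# P(s | t): syntactic type given abstract attribute type -- e.g., color-type
# attributes tend to surface as adjectives (the "inverse bootstrapping" piece).
P_s_given_t = {
    "color": {"adjective": 0.9, "noun": 0.1},
    "shape": {"adjective": 0.2, "noun": 0.8},
}

# P(w | v): word form given attribute value -- the bare form-to-referent lexicon.
P_w_given_v = {
    "blue_value":   {"blue": 0.95, "bluish": 0.05},
    "square_value": {"square": 0.9, "block": 0.1},
}

def sample(dist):
    """Draw one outcome from a dict of outcome -> probability."""
    r, cumulative = random.random(), 0.0
    for outcome, p in dist.items():
        cumulative += p
        if r < cumulative:
            return outcome
    return outcome  # guard against floating-point underflow

def describe(t, v):
    """Generate a (syntactic type, word form) pair for attribute type t with value v."""
    return sample(P_s_given_t[t]), sample(P_w_given_v[v])

print(describe("color", "blue_value"))  # usually ('adjective', 'blue')
```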


(6) As proof of concept, I definitely understand starting with a synthetic group of referring expressions. But maybe a next step is to use naturalistic distributions of color+shape combinations? The data from the initial corpus analysis seem like a good reference distribution.


(7) Figure 5a (and also shown in 5b): It seems like the biggest difference is that the overhypothesis model jumps up to higher performance more quickly (though the base model catches up after not too many more examples). It’s striking how much can be learned after only 50 examples or so -- this super-fast learning highlights why this is more a rational (computational-level) model than a process (algorithmic-level) one. It’s unlikely children can do the same thing after 50 examples.

Tuesday, October 9, 2018

Some thoughts on Linzen & Oseki 2018

I really appreciate L&O2018’s focus on the replicability of linguistic judgments in non-English languages (and especially their calm tone about it). I think the situation of potentially unreliable judgments emerging during review highlights the utility of something like registered reports, even for theoretical researchers. If someone finds out during the planning stage that the contrasts they thought were so robust actually aren’t, this may help avoid wasted time building theories to account for the data in question (or perhaps bring in considerations of language variation). [Side note: I have especially strong feelings about this issue, having struggled to share an author’s judgments about allowed vs. disallowed interpretations in many a semantics seminar paper in graduate school.]

In theory, aspects of the peer review process are supposed to help cover this, but as L&O2018 note in section 4.1, this is harder for non-English languages. To help, L&O2018 suggest an open review system (section 4.2) built around a crowdsourced database of published acceptability judgments, which sounds incredible. Someone should totally fund the construction of that. As L&O2018 note, this would be especially helpful for less-studied languages that have fewer native speakers.

I’m also completely with L&O2018 on focusing on judgments that aren’t self-evident -- but then, who makes the call about what’s self-evident and what’s not? Is it about the subjective confidence of the individual (what’s “obvious to any native speaker”, as noted in section 4)? And if so, what if an individual finds something self-evident, but it’s actually a legitimate point of variation that this individual isn’t aware of, and so another individual wouldn’t view it as self-evident? I guess this is part of what L&O2018 set out to test, i.e., whether a trained linguist’s subjective confidence is a reliable guide to what’s self-evident? Section 2.2 covers this, with the three-way classification. But even still, I wonder about the facts that are theoretically presupposed because they’re self-evident vs. theoretically meaningful because they’re not. It’d be great if there were some objective, measurable signal that distinguished them, aside from acceptability judgment replications of course (since the whole point of having such a signal would be to focus replications on the ones that weren’t self-evident). Mahowald et al. (2016)’s approach of unanimous judgments from 7 people on 7 variants of the data point in question seems like one way to do this -- basically, it’s a mini acceptability judgment replication. And it does seem more doable, especially with the crowdsourced judgment platform L&O2018 advocate.
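
As a rough schematic (mine, and not Mahowald et al.’s exact procedure), that mini-replication boils down to something like this: for each variant of the contrast, did every rater prefer the supposedly acceptable version?

```python
# ratings: variant -> list of (acceptable_version_score, starred_version_score),
# one pair per rater (scores invented for illustration).
ratings = {
    "variant_1": [(6, 2), (7, 3), (5, 1)],
    "variant_2": [(6, 5), (4, 5), (6, 2)],  # one rater reverses the contrast
}

def unanimous(pairs):
    """True if every rater rated the acceptable version above the starred one."""
    return all(good > bad for good, bad in pairs)

for variant, pairs in ratings.items():
    print(variant, "replicates" if unanimous(pairs) else "flag for a fuller study")
```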

One more thought: L&O2018 make a striking point about the importance of relative acceptability and how acceptability isn’t the same as grammaticality, since raw acceptability values can differ so widely for “grammatical” and “ungrammatical” items. For example, if an ungrammatical item has a high acceptability score (e.g., H8’s starred version had a mean score of 6.06 out of 7), and there’s no obvious dialectal variation, how do we interpret that? L&O2018 reasonably hypothesize that this means it’s not actually ungrammatical. But then, is ungrammaticality just a matter of falling below some acceptability threshold? That is, is low acceptability necessary for (or highly correlated with) ungrammaticality?