Friday, November 15, 2019

Some thoughts on Villata et al. 2019

I appreciated seeing up front the traditional argument about economy of representation, because it’s often at the heart of theoretical debate. What’s interesting to me is the assumption that something is more economical based on intuition alone, without any formal way to evaluate how economical it actually is. So, good on V&al2019 for thinking about this issue explicitly. More generally, when I hear this kind of debate about categorical grammar + external gradience vs. gradience in the grammar itself, I often wonder how on earth you could tell the difference. V&al2019 are approaching this from an economy angle, rather than a behavioral-fit angle, and showing a proof of concept with the SOSP model. That said, it’s interesting to note that the SOSP implementation of a gradient grammar clearly includes both syntactic and semantic features -- and that’s where its ability to handle some of the desiderata comes from.

Other comments:
(1) Model implementation involving the differential equations: If I understand this correctly, the model accomplishes the inference about which treelets to choose at the computational level. (This is because it seems to require a full enumeration of the possible structures that can be formed, which is a massively parallel process and not something we think humans are doing on a word-by-word basis.) Computational-level inference for me is exactly like this: we think this is the inference computation humans are trying to do, but they approximate it somehow. So, we’re not committed to this being the algorithm humans use to accomplish that inference.

That said, the mechanism V&al2019 describe here isn’t an optimal inference mechanism in the way Gibbs sampling is, since it seems to allow the equivalent of probabilistic sampling (where a sub-optimal option can win out in the long run). So, this differs from the way I often see computational-level inference in Bayesian modeling land, because there the goal is to identify the optimal result of the desired computation.
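To make that contrast concrete, here’s a minimal sketch (my own construal, not V&al2019’s actual dynamics or equations): optimal inference would deterministically return the highest-harmony option, while noisy settling lets a close competitor win on a nontrivial share of trials.

```python
# A minimal sketch of the contrast (my construal, not V&al2019's equations):
# noisy settling lets a sub-optimal option win some share of the time,
# whereas optimal inference would deterministically return the argmax.
import random

def settle(harmonies, noise=0.5, trials=10_000):
    """Crude stand-in for noisy attractor dynamics: on each trial, perturb
    every option's harmony with Gaussian noise and let the highest win."""
    wins = [0] * len(harmonies)
    for _ in range(trials):
        noisy = [h + random.gauss(0, noise) for h in harmonies]
        wins[noisy.index(max(noisy))] += 1
    return [w / trials for w in wins]

harmonies = [1.0, 0.7]    # option 0 is optimal; option 1 is close behind
print(settle(harmonies))  # roughly [0.66, 0.34]: the sub-optimal option
                          # still wins about a third of the time
```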

(2) Generating grammaticality vs. acceptability judgments from model output: I often hear “acceptability” used for the overall human judgment, of which grammaticality is one (major) component alongside other factors (like memory constraints or lexical choice, etc.). I originally thought the point of this model was that we’re trying to generate the judgment straight from the grammar, rather than from those other factors (so it would be a grammaticality judgment). But maybe the judgment is getting termed acceptability rather than grammaticality because a word’s feature vector also includes semantic features (whatever those look like)?

(3) I appreciate the explanation in the Simulations section about the difference between whether islands and subject islands -- basically, for subject islands, there are no easy lexical alternatives that would let sub-optimal treelets persist and eventually license a link between the wh-phrase and the gap. But something I want to clear up is the issue of parallel vs. greedy parsing. I had thought that the SOSP approach does greedy parsing because it finds what it considers the (noisy) local-maximum option for any given word, and proceeds from there. So, for whether islands, it picks either the wonder option that’s an island or the wonder option that gets coerced to think, because the two are somewhat close in how harmonic they are. (For subject islands, there’s only one option -- the island one -- so that’s the one that gets picked.) Given this, how can we talk about the whether island as having two options at all? Is it that on any given parse, we pick one, and it’s just that sometimes it’s the one that allows a dependency to form? That would be fine. We’d just expect to see individual variation from instance to instance, and the aggregate effect would be that D-linked whether islands are better, basically because sometimes they’re okay-ish and sometimes they’re not.
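If that reading is right, a toy simulation of greedy selection over noisy harmonies shows how the aggregate gradience would fall out (option names and harmony values here are invented for illustration, not taken from the paper):

```python
# Toy illustration of greedy parsing with noisy local maxima (values invented).
import random

def pick_parse(options, noise=0.3):
    """Greedily commit to the noisily-best option on this parse."""
    return max(options, key=lambda opt: opt["harmony"] + random.gauss(0, noise))

# Whether island: two options close in harmony, only one of which links the gap.
whether_island = [
    {"name": "wonder-as-island", "harmony": 1.0, "gap_link": False},
    {"name": "wonder-as-think",  "harmony": 0.8, "gap_link": True},
]
# Subject island: only the island option exists, so it always gets picked.
subject_island = [
    {"name": "island-only", "harmony": 1.0, "gap_link": False},
]

trials = 10_000
linked = sum(pick_parse(whether_island)["gap_link"] for _ in range(trials))
print(linked / trials)  # a nontrivial fraction of parses form the dependency,
                        # so aggregate judgments come out gradient
linked_sub = sum(pick_parse(subject_island)["gap_link"] for _ in range(trials))
print(linked_sub / trials)  # always 0.0: the dependency never forms
```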

Friday, November 1, 2019

Some thoughts on Gauthier et al. 2019

General thoughts:
I really enjoy seeing this kind of computational cognitive model, where the model is not only generating general patterns of behavior (like the ability to get the right interpretation for a novel utterance), but specifically matching a set of child behavioral results. I think it’s easier to believe in the model’s informativity when you see it able to account for a specific set of results. And those results then provide a fair benchmark for future models. (So, yay, good developmental modeling practice!)

Other thoughts:
(1) It’s always great to show what can be accomplished “from scratch” (as G&al2019 note), though this is probably harder than the child’s actual task. Presumably, by the time children are using syntactic bootstrapping to learn harder lexical items, they already have a lexicon seeded with some concrete noun items. But this is fine for a proof of concept -- basically, if we can get success on the harder task of starting from scratch, then we should also get success when we start with a head start in the lexicon. (Caveat: unless a concrete noun bias in the early lexicon somehow skews the learning in the wrong direction for some reason.)

(2) It’s a pity that the Abend et al. 2017 study wasn’t discussed more thoroughly -- that’s another model that uses a CCG representation for the semantics, a loose idea of which meaning elements are available from the scene, and this kind of rational search over possible syntactic rules, given naturalistic input. That model achieves syntactic bootstrapping, along with a variety of other features like one-shot learning, accelerated learning of individual vocabulary items corresponding to specific syntactic categories, and easier learning of nouns (thereby creating a noun bias in early lexicons). It seems like a compare & contrast with that Bayesian model would have been really helpful, especially noting what about those learning scenarios was simplified, compared with the one used here.

For instance, “naturalistic” for G&al2019 means utterances that make reference to abstract events and relations. This isn’t what’s normally meant by naturalistic, because these utterances are still idealized (i.e., artificial). That said, these idealized data have more complex pieces in them that make them similar to naturalistic language data. I have no issue with this, per se -- it’s often a very reasonable first step, especially for cognitive models that take a while to run.

(3) Figure 4: It looks like there’s a dependency where meaning depends on syntactic form, but not the other way around -- I guess that’s the linking rule? But I wonder why that direction and not the other. That is, shouldn’t form depend on meaning too, especially if we’re thinking about this as a generative model where the output is the utterance? (We start with a meaning and get the language form for that, which suggests the arrow should go from meaning to syntactic form.) Certainly, it seems like you need something connecting syntactic type to meaning if you’re going to get syntactic bootstrapping, and I can see in their description of the inference process why it’s helpful to have the meaning depend on the structure: they infer the meaning from the structure for a novel verb via P(m_w | s_w), which only works if the arrow goes from s_w to m_w.
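Spelling this out for myself (my own sketch of the factorization in the paper’s s_w/m_w notation, not anything from G&al2019):

```latex
% With the arrow from syntactic form to meaning, the joint factorizes so that
% the bootstrapping inference for a novel verb is a direct conditional lookup:
\begin{align}
P(s_w, m_w) &= P(s_w)\,P(m_w \mid s_w)
  && \text{(form $\rightarrow$ meaning, as in Figure 4)} \\
\hat{m}_w &= \operatorname*{arg\,max}_{m} P(m \mid s_w)
  && \text{(novel-verb inference: read off the conditional)} \\
P(m_w \mid s_w) &\propto P(s_w \mid m_w)\,P(m_w)
  && \text{(what the reversed arrow would require, via Bayes)}
\end{align}
```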

(4) It took me a little bit to understand what was going on in equations 2 and 3, so let me summarize what I think I got here: if we want the probability of a particular meaning (which comprises several independent predicates), we multiply the probabilities of those predicates together (that’s equation 3). To get the probability of each predicate, we sum over all instances of that predicate that are associated with that syntactic type (that’s equation 2).
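Here’s a toy sketch of that reading in code (the lexical entries, weights, and the explicit normalization are all my inventions for illustration, not G&al2019’s implementation):

```python
# Toy sketch of my reading of eqs. 2 and 3 (entries, weights, and the
# normalization step are invented for illustration). A lexical entry pairs
# a syntactic type with a meaning (a set of predicates) and a learned weight.
lexicon = [
    # (syntactic_type, meaning_predicates, weight)
    ("S\\NP", frozenset({"cause", "move"}), 2.0),
    ("S\\NP", frozenset({"move"}),          1.0),
    ("NP",    frozenset({"object"}),        3.0),
]

def p_predicate_given_type(pred, syn_type):
    """Eq. 2, as I understand it: sum the weights of all entries with this
    syntactic type whose meaning contains the predicate, normalized by the
    total weight for that type."""
    total = sum(w for t, m, w in lexicon if t == syn_type)
    hits  = sum(w for t, m, w in lexicon if t == syn_type and pred in m)
    return hits / total if total > 0 else 0.0

def p_meaning_given_type(predicates, syn_type):
    """Eq. 3: the meaning's predicates are independent, so multiply."""
    p = 1.0
    for pred in predicates:
        p *= p_predicate_given_type(pred, syn_type)
    return p

print(p_meaning_given_type({"cause", "move"}, "S\\NP"))  # (2/3) * (3/3) = 2/3
```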

(5) The learner is constrained to encode only a limited number of entries per word at all times (i.e., only the l highest-weight lexical entries per wordform are retained): I love the ability to constrain the number of entries per wordform. This seems exactly right from what I know of the kid word-learning literature, and I wonder how often a limit of two is the best. From Figure 7, it looks like 2 is pretty darned good (pretty much overlapping 7, and better than 3 or 5, if I’m reading those colors correctly).
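Mechanically, I take the constraint to be something like the following (a hedged sketch with invented names; the real model’s bookkeeping is surely more involved):

```python
# Hedged sketch of the retention constraint as I read it: after each update,
# keep only the l highest-weight entries per wordform.
def prune_lexicon(entries_by_word, l=2):
    """entries_by_word maps wordform -> list of (entry, weight) pairs;
    returns the same mapping with only the top-l entries per wordform."""
    return {
        word: sorted(entries, key=lambda e: e[1], reverse=True)[:l]
        for word, entries in entries_by_word.items()
    }

# e.g., prune_lexicon({"gorp": [("S\\NP: move", 0.9), ("NP: object", 0.4),
#                               ("S\\NP: cause & move", 0.2)]}, l=2)
# keeps just the two highest-weight entries for "gorp".
```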