Friday, February 11, 2022

Some thoughts on Wilcox et al. 2021

This paper made me really happy because it involved careful thought about what was being investigated, accessible intuitions about how each model works, what the selected models can and can’t tell us, how the models should be evaluated, sensible ways to interpret the model results, and why we should care. Of course, I did have (a lot of) thoughts occur to me as I was reading (more on this below), but this is probably one of the few papers I’ve read recently using neural net models that I care about, as a developmental linguist who does cognitive modeling. Thanks, authors!


Specific thoughts:

(1) Poverty of the stimulus vs. the argument from poverty of the stimulus (i.e., viable solutions to poverty of the stimulus): I think it’s useful to really separate these two ideas. Poverty of the stimulus is about whether the data are actually compatible with multiple generalizations. This seems to be true for learning constraints on filler-gap dependencies (though the assertion depends on what data are considered relevant in the input signal, which is why it’s important to be clear about what the input is). But the argument from poverty of the stimulus is about viable solutions, i.e., the biases that are built in to navigate the possibilities and converge on the right generalization.


The abstract’s wording focuses on poverty of the stimulus itself for syntactic islands, while the general discussion in section 6.2 clearly focuses on the (potential) viable solutions uncovered via the models explored in the paper. That is, the focus isn’t on whether there’s poverty of the stimulus for learning about islands, but rather on what built-in stuff it would take to solve it. And that’s where the linguistic nativist vs. non-linguistic nativist/empiricist discussion comes in. I think this distinction between poverty of the stimulus itself and the argument from poverty of the stimulus gets mushed together a bit sometimes, so it can be helpful to note it explicitly. Still, the authors are very careful in 6.2 to talk about what they’re interested in as the argument from poverty of the stimulus, and not poverty of the stimulus itself.


(2) Introduction, Mapping out a “lower bound for learnability”: I’m not quite sure I follow what this means: a lower bound in the sense of what’s learnable from this kind of setup, I guess? Which is why anything unlearnable might still require a language-specific constraint? 


Also, I’m not sure I quite follow the distinction between top-down vs. bottom-up constraints being made here. Is it that top-down constraints are explicitly defined and implemented, as opposed to bottom-up constraints being something that emerges from whatever was explicitly defined and implemented? But if so, isn’t that more of an implementational-level distinction, rather than a core aspect of the definition (=computational-level description) of the constraint? That is, the bottom-up thing could be explicitly defined, if only we understood better how the explicitly defined things caused it to emerge?


(3) The “psycholinguistics paradigm” for model assessment: I really like this approach, precisely because it doesn’t commit you to a theory-specific internal representation. In general, this is a huge plus for evaluating models against observable behavior. Even if you do use an internal representation (and someone doesn’t happen to like it), you can still say that whatever’s going on can yield human behavior, so it must have something human-like about it. The same is true for distributed/connectionist language models, where it’s hard to tell what the internal representations are, aside from being vectors of numbers.
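
For my own notes, here’s the kind of surprisal comparison I take this paradigm to rest on. This is just a minimal sketch in Python; the probabilities (and the sentence frames in the comments) are invented for illustration rather than taken from any of the paper’s models:

import math

def surprisal(p):
    # Surprisal in bits: -log2 of the probability a model assigns to a word in context.
    return -math.log2(p)

# Invented next-word probabilities at the same critical region under two conditions.
p_plus_filler  = 0.012    # e.g., after "I know what the lion devoured ..."
p_minus_filler = 0.0007   # e.g., after "I know that the lion devoured ..."

# The behavioral measure is just a difference in surprisal between conditions,
# which can be compared across models (and against human acceptability or reading
# patterns) without ever inspecting a model's internal representations.
print(surprisal(p_minus_filler) - surprisal(p_plus_filler))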


(4) The expected superadditive pattern when both the filler and gap are present: Why should this be superadditive, instead of just additive? What extra thing is happening to make the presence of both yield a superadditive pattern? I have the same question once we get to island stimuli, too, where the factors are filler presence, gap presence, and island structure presence. 
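
To pin down what I’m asking, here’s the toy arithmetic for a 2x2 filler-by-gap design. The surprisal values are invented, and I’m reading “superadditive” as “the +filler/+gap cell departs from what the two main effects alone would predict” (i.e., a nonzero interaction term), which is my gloss rather than the authors’ exact wording:

# Invented surprisals (in bits) at the critical region for the four conditions.
s = {
    ("-filler", "-gap"): 10.0,
    ("+filler", "-gap"): 12.0,  # filler alone: +2 bits
    ("-filler", "+gap"): 13.0,  # gap alone: +3 bits
    ("+filler", "+gap"): 11.0,  # both together: not 10 + 2 + 3 = 15
}

filler_effect = s[("+filler", "-gap")] - s[("-filler", "-gap")]
gap_effect = s[("-filler", "+gap")] - s[("-filler", "-gap")]
additive_prediction = s[("-filler", "-gap")] + filler_effect + gap_effect

interaction = s[("+filler", "+gap")] - additive_prediction
print(interaction)  # -4.0: the filler licenses the gap, so the combination is much
                    # better than the two penalties stacked on their own would predict

So under that reading, my question is really about why we should expect the licensing relationship to show up as an interaction term rather than as two penalties that simply stack.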


(5) The domain-general property of the neural models: The neural models don’t build in any bias for language-specific representations, but language-specific representations are still in the hypothesis space. So, is it possible that the best-fitting internal representations are language-specific? This would be similar to Bayesian approaches (e.g., Perfors et al. 2011) that allow the hypothesis space to include domain-general options, but where inference leads the learner to select language-specific options.


(6) The input: Just a quick note that the neural models here were trained on non-childlike input both in terms of content (e.g., newswire text, Wikipedia) and quantity (though I do appreciate the legwork of estimating input quantity). This isn’t a really big deal for the proof-of-concept goal here, but it starts to matter more for more targeted arguments about how children could learn various pieces of filler-gap knowledge so reliably from their experience. Of course, the authors are aware of this and explicitly discuss it right after they introduce the different models (thanks, authors!).


One thing that could be done: cross-check the input quantity against known ages of acquisition (e.g., Complex NP islands in English by age four, De Villiers et al. 2008). Since the authors say input quantity doesn’t really affect their reported results anyway, this should be both easy to do and unlikely to change any major findings.


The second thing that could be done is to train these models on child-directed speech samples and see whether the results hold. The CHILDES database should have enough input samples for high-resource languages, and whatever limitations there might be in terms of sampling from multiple children at multiple ages from multiple backgrounds (and other variables), it seems like a step in the right direction that isn’t too hard to take (though I guess that does depend on how hard it is to train these models).
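
For what it’s worth, assembling that kind of training set doesn’t seem conceptually hard. Here’s a rough sketch, assuming CHAT-format transcripts downloaded from CHILDES; the directory path is hypothetical, and this is only a crude first pass (real CHAT parsing would need to handle the annotation codes on the utterance tiers):

from pathlib import Path

TARGET_CHILD = "*CHI:"  # CHAT transcripts label each utterance tier with a speaker code

def child_directed_utterances(chat_file):
    # Keep every speaker tier that isn't the target child's own speech,
    # stripping the speaker label to leave just the utterance text.
    for line in Path(chat_file).read_text(encoding="utf-8").splitlines():
        if line.startswith("*") and not line.startswith(TARGET_CHILD):
            yield line.split(":", 1)[1].strip()

# Hypothetical corpus location; train whatever LM you like on these utterances
# and then rerun the same filler-gap surprisal contrasts.
corpus = [utt
          for f in Path("childes/Eng-NA").rglob("*.cha")
          for utt in child_directed_utterances(f)]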


(7) Proof-of-concept argument with these neural models: The fact that these models struggle with issues of length and word frequency in non-human-like ways does suggest that they might do other things (like learn about filler-gap dependencies) in non-human-like ways too. So we have to be careful about what kind of argument this proof-of-concept is — that is, it’s a computational-level “is it possible at all” argument, rather than a computational-level “is it possible for humans, who have these known biases/limitations, etc.” argument.


(8) N-grams always fail: Is this just because the 5-token window isn’t big enough, so there’s no hope of capturing dependencies that are longer than that? I expect so, but I don’t remember the authors saying so explicitly.
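
Here’s the toy version of the window problem I have in mind (the sentences and the exact window size are just for illustration; the point is that a 5-gram model conditions on at most the previous 4 tokens):

n = 5  # a 5-gram model sees only the previous n - 1 = 4 tokens

short = "I know what the lion devoured at sunrise".split()
i = short.index("at")                 # the post-gap region
print(short[max(0, i - (n - 1)):i])   # ['what', 'the', 'lion', 'devoured'] -- the filler just fits

longer = "I know what the hungry old lion quickly devoured at sunrise".split()
i = longer.index("at")
print(longer[max(0, i - (n - 1)):i])  # ['old', 'lion', 'quickly', 'devoured'] -- the filler is gone

Once the filler falls outside the window, no amount of counting can recover the dependency, so the failure seems baked into the model class rather than telling us something interesting about the input.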


(9) Figure 5: I want to better understand why inversion is an okay behavior (I’m looking at you, GRNN). Does that mean that a gap in matrix position with a licensing filler in the subject is now more surprising than no gap in matrix position with no licensing filler in the subject? I guess that’s not too weird. Basically, GRNN doesn’t want gaps in places they shouldn’t be (which seems reminiscent of island restrictions, since islands are places where gaps shouldn’t be).


(10) One takeaway from the neural modeling results: the non-transformer models do better at generalizing. Do we think this is just due to overfitting (training input size, number of parameters), or something else?


(11) Coordination islands: I know the text says all four neural models showed a significant reduction in wh-effects, so I guess the reductions must be significant between the control conditions and the first-conjunct gaps. But there seems to be a qualitative difference in the attenuation we see for a gap in the first conjunct vs. the second conjunct (and it holds for all four neural models). I wonder why that should be.


(12) Figure 10, checking my understanding: So, seeing no gap inside a control structure is sometimes less surprising than seeing no gap inside a left-branching structure… I think this may have to do with the weirdness of the control structures, if I’m following 14 correctly? In particular, the -gap control is “I know that you bought an expensive a car last week” and the -gap island is “I know how expensive you bought a car last week”. This may come back to being more precise about surprisal expectations for control vs. island structures. Usually, control structures are fine (grammatical), but here they’re not, and that could interfere with the surprisal pattern we’re looking for.


(13) Subject islands: It was helpful to get a quick explanation about why the GRNN didn’t do as well as the other neural models here (basically, not having a robust wh-effect for the control structures). A quick explanation of this type would be helpful for other cases where we see some neural models (seem to) fail, like the first conjunct for Coordination islands, and then Left Branch and Sentential Subject islands.


(14) Table 14: (just a shout out) Thank you so much, authors, for providing this. Unbelievably helpful summary.


(15) One takeaway the authors point out: If learning is about maximizing the probability of the input data, then these neural approaches are similar to previous approaches that do the same thing. In particular, maximizing input data probability corresponds to the likelihood component of any Bayesian learning approach, which seems sensible. Then the difference is just in the prior part, which corresponds to the inductive biases built in.
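
Spelled out with toy numbers (nothing here is from the paper), the decomposition I have in mind is just log-posterior = log-likelihood + log-prior, with the data-probability-maximizing part playing the likelihood role and the built-in inductive biases playing the prior role:

import math

# Two toy candidate generalizations and invented scores for some observed input data.
log_likelihood = {"island-respecting": math.log(1e-40), "island-violating": math.log(1e-45)}
log_prior = {"island-respecting": math.log(0.5), "island-violating": math.log(0.5)}

# log P(h | data) = log P(data | h) + log P(h) - log P(data); the last term is the
# same for every hypothesis, so it drops out of the comparison.
log_posterior = {h: log_likelihood[h] + log_prior[h] for h in log_likelihood}
print(max(log_posterior, key=log_posterior.get))  # "island-respecting"

# With a flat prior (as above), the comparison reduces to maximizing input-data
# probability; different built-in biases just mean a different log_prior term.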


(16) General discussion: I’m not quite sure I follow why linguistic nativist biases would contrast with empiricist biases by a priori downweighting certain possibilities. Maybe this is another way of saying that a language-specific bias skews/limits the hypothesis space in a certain way only if it’s a language-based hypothesis space? In contrast, a domain-general bias skews/limits the hypothesis space no matter what kind of hypothesis space it is. The particular domain-general bias of maximizing input probability of course can’t operate a priori, since the learner needs to see the input data first. But other kinds of domain-general biases seem like they could skew the hypothesis space a priori (e.g., the simplicity preference from Perfors et al. 2006).


(17) Another takeaway from the general discussion is that the learner doesn’t obviously need built-in language-specific biases to learn these island constraints. But I would love to know what abstract representations get built up in the best-performing neural models from this set, like the JRNN. These are likely linguistic in some sense, since they’re word forms passed through a convolutional neural network (and therefore compressed somehow), and it would be great to know whether they look like syntactic categories we recognize or like something else.


So, I’m totally on board with being able to navigate to the right knowledge in this case without needing language-specific (in contrast with domain-general) help. I just would love to know more about the intermediate representations, and what it takes to plausibly construct them (especially for small humans).