I’m totally with Linzen on linguistic theory providing better evaluation items for RNNs. (Hurrah for linguistic theory contributions!) In contrast, I’m just not sold yet on the utility of RNNs for modeling human language development or processing. The interpretability issue just kills it for me (as it does for Rawski & Heinz): how can we know if the RNN is or isn’t representing something? And if we have a concrete idea about what it should be representing vs. not, why not use a symbolic model? (More on this below in the “Other thoughts” section.)
I find it heartening to hear that other folks like Rawski & Heinz are also talking about the ML revolution with deep learning techniques as “alchemy”, longing for the “rigor police” to return. I sympathize with the rigor police.
Rawski & Heinz offer their take on the rigor police, highlighting the contributions that computational learnability (CL) investigations can make, with respect to the problems that RNNs are currently being pitched at. In particular, Rawski & Heinz note how CL approaches can answer the question of “Is it possible to learn this thing at all, given this characterization of the learning problem?” The major selling point is that CL results are easily interpretable (“analytically transparent”). This is a key difference that matters a lot for understanding what’s going on. That said, I tend to have concerns with different CL implementations (basically, if they don’t characterize the learning problem in a way that maps well to children’s language acquisition, I don’t know why I should care as a developmental linguist). But this is a different, solvable problem (i.e., investigate characterizations that do map well). In contrast, interpretability of RNNs isn’t as immediately solvable.
Other thoughts:
(1) Linzen, on RNNs for testing what constraints are needed for learning different things: So far, I haven’t been convinced that it’s helpful to use neural networks to test what innate knowledge is required. All we know when we stumble upon a neural network that can learn something is that it hasn’t explicitly encoded knowledge beforehand in a way that’s easy to interpret; who knows what the implicit knowledge is that’s encoded in the architecture and initialization values? (As Rawski & Heinz note, ignorance of bias doesn’t mean absence of bias.)
(2) Linzen, “language model” = “estimating how likely a particular word is to occur given the words that have preceded it”. I was surprised by this definition. What about other language tasks? I honestly thought “language model” referred to the representation of language knowledge, rather than the evaluation task. So, the language model is the thing that allows you to predict the next word, given the previous words, not the prediction itself. Richard Futrell says this definition of “language model” is right for current ML use, though. (Thanks, Richard!)
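To make the next-word-prediction sense of “language model” in (2) concrete, here is a minimal sketch. It is a toy bigram model with add-one smoothing, not an RNN and not anything from Linzen’s paper; the corpus and the function names (train_bigram_model, next_word_prob) are made up for illustration. The point is just that the “model” is the stored counts, and the prediction is something you compute from them.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        # Hypothetical boundary markers so sentence starts/ends are predictable too.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts, vocab

def next_word_prob(model, prev, word):
    """Estimate P(word | prev) with add-one smoothing over the vocabulary."""
    counts, vocab = model
    total = sum(counts[prev].values())
    return (counts[prev][word] + 1) / (total + len(vocab))

# Toy corpus (an assumption for illustration only).
corpus = ["the man eats pizza", "the man eats pasta", "the dog sleeps"]
model = train_bigram_model(corpus)

print(next_word_prob(model, "the", "man"))    # 0.25
print(next_word_prob(model, "the", "pizza"))  # small but nonzero, thanks to smoothing
```

Here the knowledge is fully inspectable (it’s a table of counts), which is exactly what we don’t get with the distributed representations in an RNN doing the same prediction task.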
(3) Linzen, on using psycholinguistic materials designed to identify linguistic knowledge in humans in order to identify implicit linguistic knowledge in RNNs: This approach makes a lot of sense to me. The human mind is a black box, just like the RNN, and we have decades of materials designed to identify the nature of the knowledge inside that black box. So, I think the key is to start with the most basic tests, since the more complex tests build in assumptions about human knowledge due to the results from the basic ones.
(4) Linzen, noting the importance of having baseline models that are known not to be able to represent the linguistic properties of interest: But how do we know they can’t? Aren’t RNNs universal function approximators, so they can (theoretically) capture any behavior, given enough data? Maybe the point is to use one where we know it’s failed on the linguistic knowledge in question somehow…
(5) Linzen, on the Gulordava et al. RNNs that did better at capturing long-distance agreement when semantic information was helpful: “This suggests that the models did learn some of the syntactic principles underlying subject-verb agreement.” Does it? Maybe if we think “syntactic principles” = something based on the sequence of words, rather than word meaning (i.e., a very broad definition of “the syntactic principles”). But I have no idea how we could tell that the RNN used anything like the syntactic principles we think humans use.
(6) Linzen, on using RNNs for learnability tests: “First, is it indeed the case that the linguistic phenomenon in question cannot be learned from child-directed speech without the proposed constraint?” -- I’m sympathetic to this, but how do we know the RNN isn’t implicitly encoding that constraint in its distributed vectors?
“Second, and equally important, does the proposed constraint in fact aid acquisition?” -- Again, I’m very sympathetic, but why not use a symbolic model for this? Then you can easily tell the model has vs. doesn’t have the proposed constraint. (To be fair, Linzen notes this explicitly: “...the inductive biases of most neural network architectures are not well characterized.”)
(7) Linzen, on building in structural knowledge by giving that structural knowledge as part of the RNN’s input (e.g., “the man” together, then “eats pizza” together = structural knowledge that those two chunks are meaningful chunks): If this is an example of building in a proposed constraint, how do we know the RNN is using those chunks the way we think? Why couldn’t it be doing something wild and wacky with those chunks, instead of treating them as “structured units”? I guess by having chunks at all, it counts as doing something structural? But then how do we make the equivalent of an overhypothesis, where the model likes structured units, but we let the model pick out which structured units it wants?
(8) Linzen, “...neural networks replicate a behavioral result from psycholinguistics without the theoretical machinery...suggest that the human behavior...might arise from statistical patterns in the input.” Plus whatever implicit biases the RNN has, right? It’s not just statistical patterns working over a blank slate. For example, in the agreement attraction case Linzen discusses, how do we know the RNN didn’t encode some kind of markedness thing for plurals in its distributed representation?
Related to that same study, if the RNNs then show they’re not behaving like humans in other respects, how can we be sure that the behavior which looks human-like actually has the same underlying cause/representation as it does in humans? And if it doesn’t, what have we learned from the RNNs about how humans represent it?
(9) Rawski & Heinz, taking a grammar as target of acquisition, because it’s something of finite size with a symbolic, generative structure: Learning is then a problem of “grammatical inference”. This clearly differs from Linzen’s characterization, where the target of acquisition is something (a function) that can generate accurate predictions, and who cares what it looks like? Note that grammars can make predictions too — and we know what they look like and how they work to make those predictions. (Rigor police, check!)
(10) Rawski & Heinz, on typological arguments for learnability: I have a slight concern with their typological argument. In particular, just because we don’t see certain patterns across existing human languages doesn’t mean they’re impossible. It seems like we should couple typological observations with experimental studies of what generalizations are possible for humans to make when the data are available to support those generalizations.
A related thought regarding typological predictions, though: this seems like a useful evaluation metric for RNNs. In particular, any RNN that’s successful on one language can be applied to other languages’ input to see if it makes the right cross-linguistic generalizations.
(11) Rawski & Heinz, on Weiss et al. 2018, which extracted a (symbolic) deterministic FSA representation from an RNN: This seems like exactly what we want for interpretability, though it’s more about identifying a symbolic representation that makes the same predictions as the RNN, rather than reading off the symbolic representation from the RNN. But I guess it doesn’t really matter, as long as you’re sure the symbolic representation really is doing exactly what the RNN is?
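The “as long as you’re sure” worry in (11) can be sketched in code. This is not the Weiss et al. extraction procedure; it’s a simple bounded check, under my own assumptions, that compares a hand-written DFA against a black-box acceptor on every string up to some length. The DFA (even number of “a”s), the stand_in_network function (a stand-in for a trained RNN, deliberately built to diverge on longer strings), and all names are hypothetical.

```python
from itertools import product

def dfa_accepts(transitions, start, accepting, string):
    """Run a deterministic finite-state automaton over the string."""
    state = start
    for symbol in string:
        state = transitions[(state, symbol)]
    return state in accepting

def find_disagreement(dfa, black_box, alphabet, max_len):
    """Return the first string (shortest first) where the DFA and the
    black box disagree, or None if they agree up to max_len."""
    transitions, start, accepting = dfa
    for length in range(max_len + 1):
        for symbols in product(alphabet, repeat=length):
            s = "".join(symbols)
            if dfa_accepts(transitions, start, accepting, s) != black_box(s):
                return s
    return None

# Hypothetical symbolic model: accept strings with an even number of "a"s.
even_a_dfa = (
    {("even", "a"): "odd", ("even", "b"): "even",
     ("odd", "a"): "even", ("odd", "b"): "odd"},
    "even",
    {"even"},
)

def stand_in_network(s):
    """Stand-in for a trained RNN (an assumption, not a real network):
    matches the DFA on short strings, accepts everything longer than 4."""
    if len(s) <= 4:
        return s.count("a") % 2 == 0
    return True

print(find_disagreement(even_a_dfa, stand_in_network, "ab", 4))  # None
print(find_disagreement(even_a_dfa, stand_in_network, "ab", 5))  # aaaaa
```

So agreement on everything you tested only tells you the two match on what you tested. The symbolic model is easy to inspect, but whether it’s really doing what the RNN is doing remains an empirical question beyond the tested strings.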