Tuesday, April 16, 2019

Some thoughts on Pater 2019

As you might imagine, a lot of my thoughts are covered by my commentary that we’re reading as one of the selections next time. But here’s the briefer version: I love seeing the fusion of linguistic representations with statistical methods. The real struggle for me as a cognitive modeler is figuring out when RNNs are preferable to symbolic models that are more easily interpretable (e.g., hierarchical Bayesian models that use overhypotheses to define a wider space of latent hypotheses).

At the very end of Pater’s article, I see a potentially exciting path forward with the advent of RNNs (or other models with distributed representations) that are interpretable. I’m definitely a fan of techniques that allow the learning of hidden structure without it being explicitly encoded — this is the same thing I see in hierarchical Bayesian overhypotheses. More on this below (and in my commentary for next time).

Specific thoughts:

(1) I couldn’t agree more with the importance of incorporating statistical approaches more thoroughly into learning/acquisition theories, but I have yet to be sold on the neural network side. It really depends on the kind of network: is it meant to match neurobiology (e.g., Avery & Krichmar 2017; Beyeler, Rounds, Carlson, Dutt, & Krichmar 2017; Krichmar, Conrad, & Asada 2015; Neftci, Augustine, Paul, & Detorakis 2017; Neftci, Binas, Rutishauser, Chicca, Indiveri, & Douglas 2013), or is it a computational-level approach built on distributed representations (which I think describes most RNNs)? The latter seems hard to decipher, and so less useful for exploring symbolic theories more completely. Then again, maybe the point is to explore non-symbolic theories.

Pater notes the following about non-symbolic approaches: “...it is hard to escape the conclusion that a successful theory of learning from realistic data will have a neural component.” If by “neural” Pater means an implementational-level description, sure. But I’m not sold on distributed representations being necessary for a successful theory of learning -- a theory can operate at the computational or algorithmic levels instead.

(2) I completely agree that structure-independent representations (statistical sequences that don’t involve phrases, etc.) can only get you so far. The interesting thing from an NLP standpoint, of course, is exactly how far they can get you -- which often turns out to be much further than I would have expected. For example, n-grams over words (not even syntactic categories!!) work remarkably well as features for opinion spam detection, with nearly 90% classification accuracy (Ott et al. 2011, 2013). Though I guess n-grams do heuristically encode some local structure.
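To make that concrete, here’s a minimal sketch of what word-n-gram features feeding a simple classifier can look like. This is not the Ott et al. pipeline, just the generic recipe; the toy reviews and labels are invented for illustration.

```python
# Minimal sketch: word n-gram features feeding a simple classifier,
# in the spirit of (but not a reproduction of) the Ott et al. setup.
# The toy reviews and labels below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "the room was clean and the staff were friendly",
    "my family and I had an amazing luxury experience",
    "the bed was comfortable but the wifi kept dropping",
    "this hotel is the best place I have ever stayed in my life",
]
labels = [0, 1, 0, 1]  # 0 = truthful, 1 = deceptive (toy labels)

# Unigrams + bigrams over word forms -- no syntactic categories,
# no hierarchical structure, just local word sequences.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(reviews, labels)
print(model.predict(["the staff were friendly and the room was clean"]))
```

Nothing in this pipeline knows about phrases or hierarchy; the features are purely local word sequences, which is exactly what makes its success interesting.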

(3) RNNs seem to need hierarchical representations to work well (e.g., the Recurrent Neural Network Grammars of Dyer et al. 2016, and more generally the push to incorporate hierarchical structure into current neural network approaches in AI/NLP). But sequence-to-sequence models do pretty well without any explicit structure built in. So, if sequence-to-sequence models can handle aux-inversion (e.g., as in McCoy, Frank, & Linzen 2018...well, at least sort of -- it’s not clear they handle it the way humans do), what do we make of that from the linguistic cognition perspective?
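For concreteness, here’s a toy sketch of why aux-inversion is such a revealing test case. This is in the spirit of the McCoy, Frank, & Linzen setup but much simpler: the sentences are invented, and the “structure-sensitive” rule is faked rather than computed over real parse trees.

```python
# Toy illustration of the aux-inversion generalization problem, in the spirit
# of (but much simpler than) McCoy, Frank, & Linzen (2018). The sentences and
# rule implementations are invented for illustration.
AUX = {"can", "will", "does"}

def move_first_aux(words):
    """Linear rule: front the first auxiliary in the string."""
    i = next(k for k, w in enumerate(words) if w in AUX)
    return [words[i]] + words[:i] + words[i + 1:]

def move_main_aux(words):
    """Stand-in for the structure-sensitive rule: here it just fronts the
    last auxiliary, which happens to be the main-clause one in these examples
    (a real learner would need the hierarchical structure to get this right)."""
    i = max(k for k, w in enumerate(words) if w in AUX)
    return [words[i]] + words[:i] + words[i + 1:]

# Simple declaratives (the kind that dominate the input) can't tell the rules apart:
simple = "the dog will bark".split()
assert move_first_aux(simple) == move_main_aux(simple)

# The critical test items pull the rules apart, and only the structure-sensitive
# answer matches what humans do:
critical = "the dog that can swim will bark".split()
print(" ".join(move_first_aux(critical)))  # "can the dog that swim will bark" (non-human-like)
print(" ".join(move_main_aux(critical)))   # "will the dog that can swim bark" (human-like)
```

The question for the sequence-to-sequence results is which of these generalizations the network’s distributed representations actually implement.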

This comes back to the question of model interpretation. With symbolic models, it’s usually clear what theory of representation is being evaluated. For RNNs, do we know what the distributed representations/continuous hypotheses are encoding? (This is of course less of a problem from the engineering perspective -- we’re happy if we can get the machines to do it as well as or better than humans.) As Pater noted, some read-out can be done with clever model comparisons, and some distributed representations (e.g., Palangi et al.’s (2017) Tensor Product Recurrent Networks) may in fact encode syntactic structures we recognize. So then, the question is what we’re actually gaining from the distributed representation.
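As a sketch of what that kind of read-out can look like in practice, here’s a minimal diagnostic (“probing”) classifier. Everything here is a stand-in: the hidden states are random vectors rather than states from a trained RNN, and the labels are invented -- the point is just the shape of the method.

```python
# Minimal sketch of the "read-out" idea: a diagnostic (probing) classifier
# checks whether some linguistic distinction is linearly decodable from a
# network's hidden states. The hidden states below are random stand-ins;
# in practice they would come from a trained RNN, and the labels from an
# annotated corpus (both are assumptions for illustration).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(200, 50))   # one 50-dim state per word/sentence
labels = rng.integers(0, 2, size=200)        # e.g., singular vs. plural subject

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.25, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# If the probe beats chance on held-out states, the distinction is encoded
# (at least linearly); with these random vectors it should hover near 50%.
print("probe accuracy:", probe.score(X_test, y_test))
```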

Pater: “...it is given the building blocks of symbols and their roles, but must learn their configurations”. This starts to sound like the latent vs. explicit hypothesis space construction of Perfors (2012), which can be implemented in a variety of ways (e.g., variational learning as in Yang 2002). That is, these approaches allow the modeler to specify the building blocks but let the model construct the explicit hypotheses that get evaluated, based on the model’s prior biases (the RNN architecture, Bayesian overhypothesis hyperparameters, etc.). Something that could be interesting: the RNN version allows construction of explicit hypotheses from the building blocks that are outside what the modeler would have built into the overhypothesis parameters; that is, they may be perfectly reasonable hypotheses given the building blocks, but go against the natural overhypothesis-style parametric biases and so would get a low probability of being generated (and subsequently evaluated).

Since the RNN generates hypotheses with whatever architectural biases mold the explicit hypothesis construction, it may give higher probability to hypotheses that were lower-probability for a hierarchical Bayesian model. That is, the Bayesian overhypotheses may be quite general (especially if we back off to over-over-hypotheses, and so on), but at some level there still has to be an explicit bias for how hypotheses are generated from overhypotheses, and that bias has to be specified by the modeler. This may cause Bayesian modelers to miss ways that certain building blocks can generate the kinds of linguistic hypotheses we care about.
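To make the “explicit bias” point concrete, here’s a minimal sketch of the overhypothesis-to-hypothesis step in a hierarchical Bayesian setup. The Beta form and all the numbers are invented for illustration; the point is just that the modeler has to commit to some generative recipe, and that recipe privileges certain explicit hypotheses over others.

```python
# Minimal sketch: an overhypothesis (shared Beta hyperparameters) generates
# explicit hypotheses (per-category biases theta). The choice of this
# generative recipe is the explicit, modeler-specified bias discussed above.
# The Beta setup and all numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def sample_explicit_hypotheses(alpha, beta, n=5):
    """Overhypothesis = (alpha, beta); explicit hypotheses = sampled thetas."""
    return rng.beta(alpha, beta, size=n)

# Even "general" overhypotheses privilege some hypotheses over others:
# extreme thetas are rare under Beta(2, 2) but common under Beta(0.5, 0.5).
print("thetas under Beta(2, 2):    ", np.round(sample_explicit_hypotheses(2, 2), 2))
print("thetas under Beta(0.5, 0.5):", np.round(sample_explicit_hypotheses(0.5, 0.5), 2))
```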

An analogy: genetic algorithms can be used to identify solutions that humans didn’t think of because they carry out a much wider search of the latent hypothesis space; humans are fettered by their biases about what an optimal solution is going to look like. The parallel here: symbolic modelers may be fettered by their ideas about how building blocks can be used to generate explicit hypotheses, while RNNs may allow a wider search of the latent hypothesis space because they’re bound by different (implicit) ideas, via the RNN architecture. So, the solution an RNN comes up with (assuming you can interpret it) may provide a novel representational option, given the building blocks it was handed.
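A minimal genetic-algorithm sketch, just to illustrate the analogy: the search is driven only by a fitness score, not by any human hunch about what a good solution should look like. The toy bit-string problem and all the settings here are invented for illustration.

```python
# Toy genetic algorithm: selection + mutation + crossover search a space of
# bit strings using only a fitness score. The target and settings are invented.
import random

random.seed(0)
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # the toy "optimal solution"

def fitness(genome):
    """Score a candidate by how many bits match the target."""
    return sum(a == b for a, b in zip(genome, TARGET))

def mutate(genome, rate=0.1):
    """Flip each bit with a small probability."""
    return [1 - bit if random.random() < rate else bit for bit in genome]

def crossover(a, b):
    """Splice two parents at a random cut point."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

# Start from random candidates and let selection + variation do the searching.
population = [[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
for _ in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]  # keep the fitter half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)
print("best genome:", best, "fitness:", fitness(best))
```

Nothing in the search cares how a human would have decomposed the problem; the analogy is that an RNN’s search over its weight space is similarly unconstrained by the modeler’s hunches about hypothesis construction.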

Bigger point: RNNs and distributed representations may provide a novel way of doing exploratory theorizing (especially for syntactic learning), to the extent that their innards are interpretable. For theory evaluation, on the other hand, it’s better to go with a symbolic model that’s already easy to understand...unless your theory is about the building blocks themselves, leaving unspecified the explicit hypotheses they build and evaluate.