Before I read a single word of this paper, I already loved the idea: encoding useful symbolic knowledge into a distributed representation that’s been proven capable of Awesome Language Feats. This seems like exactly what we want in order to better understand how language acquisition is possible. I know the goal here is to make artificial neural networks (ANNs) better at language acquisition, but the way to do that is inspired by how children do the same thing. So it seems like there’s good potential for accomplishing the goal I tend to be more interested in, which is using ANNs to better understand (tiny) human cognition.
Other targeted thoughts:
(1) In describing how the Bayesian prior is encoded into the ANN, M&G2023 say “hypotheses are sampled from that prior to create tasks that instantiate inductive bias in the data”. When I first read this, I wanted to understand better what it means to create a task from a sampled hypothesis. Section 2 says “each ‘task’ is a language so that the inductive bias being distilled is a prior over the space of languages”. So…that would be a language whose distribution over elements matches the sampled hypothesis? (That might make sense, assuming a hypothesis in the Bayesian model is a distribution over elements of the potential language.)
After reading section 2, Step 2, this seems like what they’re doing. It’s just that the term “task” was new to me here, and doesn’t seem to describe what’s going on. Maybe this term comes from the ML literature on meta-learning.
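To make the “task = sampled language” idea concrete for myself, here’s a toy sketch (mine, not the paper’s actual prior or data format): each sampled hypothesis defines a distribution over strings, and a “task” is just a small dataset generated from that distribution.

```python
import random

# Toy illustration of "sample a hypothesis, turn it into a task".
# The prior and the languages here are made up for clarity; M&G2023's
# actual prior is over formal languages, not this simple scheme.

def sample_hypothesis(rng):
    """Sample one 'language': a vocabulary plus a typical string length."""
    vocab = rng.sample(list("abcdefgh"), k=rng.randint(2, 4))
    mean_len = rng.randint(2, 6)
    def generate():
        n = max(1, round(rng.gauss(mean_len, 1)))
        return "".join(rng.choice(vocab) for _ in range(n))
    return generate

def make_task(rng, n_examples=10):
    """One sampled hypothesis = one 'task': a mini-corpus from that language."""
    generate = sample_hypothesis(rng)
    return [generate() for _ in range(n_examples)]

rng = random.Random(0)
for task in (make_task(rng) for _ in range(3)):
    print(task)   # each list of strings instantiates the prior in data
```

The meta-learning step then sees many such tasks, one per sampled hypothesis, rather than one big pooled corpus.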
(2) Meta-learning (learning to learn): M&G2023 use model-agnostic meta-learning (MAML), and they say MAML can be viewed as a way to perform hierarchical Bayesian modeling. Why? Because MAML involves learning the equivalent of hyperparameters – the original model’s parameters – rather than only the model that actually learns directly from the data. It seems important to understand how the original model’s parameters are adjusted on the basis of the temporary model’s learning of the sampled data. I don’t think I quite understand how this works.
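My current (rough) understanding of the mechanics, as a toy sketch: for each sampled task, a temporary copy M′ takes a few gradient steps on that task’s data (the inner loop), and then the original parameters M are updated according to how well the adapted M′ does on held-out data from the same task (the outer loop). For simplicity this sketch uses the first-order approximation to MAML; the full algorithm also backpropagates through the inner-loop updates. The toy model (a single logit predicting how often ‘a’ occurs) is mine, not the paper’s.

```python
import math, random

def loss_and_grad(theta, data):
    """Bernoulli negative log-likelihood of the data and its gradient w.r.t. the logit theta."""
    p = 1 / (1 + math.exp(-theta))
    nll = -sum(math.log(p) if x == "a" else math.log(1 - p) for x in data) / len(data)
    grad = sum(p - (1.0 if x == "a" else 0.0) for x in data) / len(data)
    return nll, grad

def maml_step(theta, task, inner_lr=0.5, outer_lr=0.1, inner_steps=3):
    support, query = task                    # two splits of one sampled language
    theta_prime = theta                      # temporary model M'
    for _ in range(inner_steps):             # inner loop: M' learns this language
        _, g = loss_and_grad(theta_prime, support)
        theta_prime -= inner_lr * g
    _, g_outer = loss_and_grad(theta_prime, query)
    return theta - outer_lr * g_outer        # outer loop: nudge the original M

rng = random.Random(0)
def sample_task(n=10):
    p_a = rng.random()                       # one "language": how often 'a' occurs
    draw = lambda: ["a" if rng.random() < p_a else "b" for _ in range(n)]
    return draw(), draw()

theta = 0.0                                  # parameters of the original model M
for _ in range(200):
    theta = maml_step(theta, sample_task())
```

The key point, if I have it right, is that M never fits any one task’s data directly; it is only moved in directions that make the next M′ adapt to a sampled language faster.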
Related: Pre-training vs. prior-training. M&G2023 describe these approaches as a head start (pre-training) vs. learning to learn (prior-training). It feels like the details of how prior-training works are now important – that is, transferring what was learned from temporary model M’ to original model M in the MAML approach. This transfer is clearly meant to be different from pre-training, which involves training M on a more general task…which is somehow not “learning to learn”, even though the task is general. I may just need to read more in this literature to understand the difference, though.
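Continuing the toy sketch above (again, my gloss rather than the paper’s exact setup): the difference I take away is that pre-training applies ordinary gradient descent to M itself on pooled data from many languages, so M ends up fitting the average language, whereas prior-training only ever updates M through the outer loop, so M is optimized to be a good starting point for adapting to any one language.

```python
# Pre-training, by contrast: train M directly on pooled data from many
# sampled languages (no inner/outer split). Reuses loss_and_grad and
# sample_task from the sketch above.
theta_pre = 0.0
for _ in range(200):
    support, query = sample_task()
    _, g = loss_and_grad(theta_pre, support + query)   # ordinary training of M itself
    theta_pre -= 0.1 * g
# theta_pre drifts toward the average language; the prior-trained theta is
# instead tuned so that a few inner-loop steps on a new language go far.
```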
(3) M&G2023 note that the prior-trained neural network can learn like a Bayesian model (e.g., pretty well from 10 examples), but it’s way faster because of its parallel-processing architecture. This comment about the relative speed of Bayesian models vs. prior-trained neural networks that encode the equivalent of Bayesian inductive biases definitely makes me think about language evolution. Basically, why do human languages have the shape they do? Perhaps because languages can be learned via inductive biases that can be encoded into parallel-processing, distributed-representation machines (i.e., human neural networks) that work fast.
(4) It’s great to see strong performance from the prior-trained NN, but the fact that the other NNs do pretty darned well too seems notable. That is, 8.5 million words may be enough even for NNs with weak inductive biases. M&G2023 note at the end of the section that a better demo would use a smaller corpus, among other considerations, and they in fact explore smaller input sizes (hurrah!).
(5) Out-of-distribution generalization: The prior-trained NN always does a little better. Again, it’s great to see the improvement, but is it surprising that the standard NN without the inductive bias does pretty well too? Maybe this is because the standard NN had enough data? (Although M&G2023 say in the next subsection that this may have to do with the distilled inductive biases not being that helpful. So the issue is distilling better biases, i.e., ones defined more directly over naturalistic data…somehow?) I wonder what would happen if we focused on the versions that only had 1/32nd of the data, since that’s one case where the prior-trained NN definitely did better than the standard one.
(6) Future work: M&G2023 note that future work can distill different inductive biases into NNs and see which ones work better. I love this idea, but I think we should be clear about the assumptions we would be making here. Basically, if we’re going to test different theories of inductive biases, then we’re committing to the NN representation as “good enough” to simulate computation in the human mind. That’s fine, but we should say so explicitly, especially since it can be hard to interpret what other biases might be active in any given ANN implementation (e.g., LSTMs vs. Transformers).