Monday, December 4, 2023

Some thoughts on McCoy & Griffiths 2023

Before I read a single word of this paper, I already loved the idea: encoding useful symbolic knowledge into a distributed representation that’s been proven capable of Awesome Language Feats. This seems like exactly what we want in order to better understand how language acquisition is possible. I know the goal here is about making artificial neural networks (ANNs) better at language acquisition, but the way to do that is inspired by how children do the same thing. So it seems like there’s good potential for accomplishing the goal I tend to be more interested in, which is using ANNs to better understand (tiny) human cognition.


Other targeted thoughts:

(1) In describing how the Bayesian prior is encoded into the ANN, M&G2023 say “hypotheses are sampled from that prior to create tasks that instantiate inductive bias in the data”. When I first read this, I wanted to understand better what it means to create a task from a sampled hypothesis. Section 2 says “each ‘task’ is a language”, so that the inductive bias being distilled is a prior over the space of languages. So…that would be a language whose distribution over elements matches the sampled hypothesis? (That might make sense, assuming a hypothesis in the Bayesian model is a distribution over elements of the potential language.)


After reading section 2, Step 2, this seems like what they’re doing. It’s just that the term “task” was new to me here, and doesn’t seem to describe what’s going on. Maybe this term comes from the ML literature on meta-learning.
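To pin down my reading of Step 2, here’s a minimal sketch of what I take “creating a task from a sampled hypothesis” to mean: a hypothesis is a probabilistic language (a distribution over strings), and a task is just train/test data sampled from that language. Everything here (the toy vocabulary, the function names) is my own placeholder, not M&G2023’s actual setup.

```python
import random

# Assumption: a "hypothesis" is a probabilistic language, here just a
# distribution over a handful of candidate strings, and the prior is a
# distribution over such hypotheses (here, induced by random weights).

def sample_hypothesis(rng):
    """Sample one language (hypothesis) from a toy prior."""
    vocab = ["a b", "a a b", "a b b", "b a"]
    weights = [rng.random() for _ in vocab]
    total = sum(weights)
    return {s: w / total for s, w in zip(vocab, weights)}

def make_task(hypothesis, rng, n_train=10, n_test=5):
    """A 'task' in the meta-learning sense: train and test strings drawn
    from the same sampled language. Doing well across many such tasks
    requires internalizing the prior the languages came from."""
    strings = list(hypothesis)
    probs = list(hypothesis.values())
    return {"train": rng.choices(strings, weights=probs, k=n_train),
            "test": rng.choices(strings, weights=probs, k=n_test)}

rng = random.Random(0)
tasks = [make_task(sample_hypothesis(rng), rng) for _ in range(3)]
print(tasks[0]["train"])
```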


(2) Meta-learning (learning to learn): M&G2023 use model-agnostic meta-learning (MAML), and they say MAML can be viewed as a way to perform hierarchical Bayesian modeling. Why? Because MAML involves learning about the equivalent of hyperparameters – the original model’s parameters – rather than only the model that actually learns directly from the data. It seems important to understand how the original model’s parameters are adjusted on the basis of the temporary model’s learning of the sampled data. I don’t think I quite understand how this works.
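Writing out the loop helps me: below is a first-order-MAML-style sketch (my own toy with a one-parameter linear model, not M&G2023’s actual training code) of how I understand the update. The temporary model M’ takes a few gradient steps on one task’s training data, and then the original model M gets nudged using the error of M’ on that task’s held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: each task is "fit y = w_task * x", with w_task drawn from a
# prior (mean 3.0). The meta-learned initialization should drift toward
# that prior mean, so a few inner steps suffice on any new task.

def sample_task():
    w_task = rng.normal(loc=3.0, scale=0.5)       # a hypothesis from the prior
    x = rng.uniform(-1, 1, size=20)
    y = w_task * x
    return (x[:10], y[:10]), (x[10:], y[10:])     # (train, test) for this task

def grad(w, x, y):
    return np.mean(2 * (w * x - y) * x)           # d/dw of mean squared error

w_meta = 0.0                                      # original model M
inner_lr, outer_lr, inner_steps = 0.1, 0.05, 3

for _ in range(2000):
    (x_tr, y_tr), (x_te, y_te) = sample_task()
    w_prime = w_meta                              # temporary model M'
    for _ in range(inner_steps):                  # M' adapts to this task
        w_prime -= inner_lr * grad(w_prime, x_tr, y_tr)
    # Outer update: nudge M using the error of M' on held-out data. (This is
    # the first-order approximation; full MAML backprops through the inner loop.)
    w_meta -= outer_lr * grad(w_prime, x_te, y_te)

print(round(w_meta, 2))                           # ends up near 3.0
```

In this toy, the meta-learned parameter drifts toward the mean of the task prior, which I take to be the sense in which the prior gets baked into the initialization.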


Related: Pre-training vs. prior-training. M&G2023 describe these approaches as a head start (pre-training) vs. learning to learn (prior-training). It feels like the details of how prior-training works are now important – that is, how what was learned by temporary model M’ gets transferred back to original model M in the MAML approach. This transfer is clearly meant to be different from pre-training, which involves training M on a more-general task…which is somehow not “learning to learn”, even though the task is general. I may just need to read more in this literature to understand the difference, though.
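My gloss on the contrast (mine, not the paper’s): pre-training fits M directly to pooled general data, whereas prior-training fits M so that the adapted model M’ does well after its inner-loop updates. Continuing the same toy setting as above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pre-training, for contrast with the meta-loop above: fit the original
# model M directly to pooled data from many sampled tasks, with no inner
# loop and no notion of adapting to a new task first.
w_pre, lr = 0.0, 0.1
for _ in range(2000):
    w_task = rng.normal(loc=3.0, scale=0.5)
    x = rng.uniform(-1, 1, size=20)
    y = w_task * x
    w_pre -= lr * np.mean(2 * (w_pre * x - y) * x)

print(round(w_pre, 2))   # also lands near 3.0 in this (too-)simple toy case
```

In this one-parameter toy, both procedures end up in basically the same place, so it doesn’t show why prior-training should win – the difference is in what’s being optimized: immediate performance on the general data (a head start) vs. performance after adaptation (learning to learn).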


(3) M&G2023 note that the prior-trained neural network can learn like a Bayesian model (e.g., pretty well from 10 examples), but it’s way faster because of the parallel processing architecture. This comment about the relative speed of Bayesian models vs. prior-trained neural networks that encode the equivalent of Bayesian inductive biases definitely makes me think about language evolution considerations. Basically, why do human languages have the shape they do? Because languages can be learned via inductive biases that can be encoded into parallel-processing, distributed-representation machines (i.e., human neural networks) that work fast.


(4) It’s great to see strong performance from the prior-trained NN, but the fact that the other NNs do pretty darned well too seems noteworthy. That is, 8.5 million words may be enough even for NNs with weak inductive biases. M&G2023 note at the end of the section that a better demo would be a smaller corpus, among other considerations, and they in fact explore smaller input sizes (hurrah!).


(5) Out-of-distribution generalization: The prior-trained NN always does a little better. Again, it’s great to see the improvement, but is it surprising that the standard NN without the inductive bias does pretty well too? Maybe this is because the standard NN had enough data? (Although M&G2023 say in the next subsection that this may have to do with the distilled inductive biases not being that helpful. So the issue is distilling better biases, i.e., ones defined more directly over naturalistic data…somehow?) I wonder what would happen if we focused on the versions that only had 1/32nd of the data, since that’s one case where the prior-trained NN definitely did better than the standard one.


(6) Future work: M&G2023 note that future work can distill different inductive biases into NNs and see which ones work better. I love the idea of this, but I think we should be clear about the assumptions we would be making here. Basically, if we’re going to test different theories of inductive biases, then we’re committing to the NN representation as “good enough” to simulate computation in the human mind. This is fine, but we should be clear about it, especially since it can be hard to interpret what other biases might be active in any given ANN implementation (e.g., LSTMs vs. Transformers).


Wednesday, November 29, 2023

Some thoughts on Frank 2023a and 2023b

I’m definitely on board with the spirit of these papers. My position: I would love to understand more about how children do what they do when it comes to language acquisition. If that also helps large language models (LLMs) do what they do better, then that’s great too.


Some other specific thoughts, responding to certain ideas in “Bridging the gap”: 

(1) I definitely understand that the interactive, social nature of children’s input matters. In particular, the social part in child language acquisition is usually about why certain input has more impact than other input – the input in an interactive, social environment gets absorbed better by kids. But absorption doesn’t seem to be the problem for LLMs – they take in their data just fine. That said, it does seem like the interaction part helps ChatGPT (i.e., the ability to query).


More generally, it could be that what a certain input quality (e.g., being social and interactive) does for human kids isn’t necessary for an LLM. But, we don’t know that until we understand why that input quality helps kids in the first place.



(2) I also understand that multimodal input gives concrete extensions to some concepts, and so helps “ground out” meaning in the real world for kids. I’m less sure how multimodal input would help current AI systems — is it maybe helpful for bootstrapping the rest of the cognitive system (somehow?) that allows flexible reasoning?


(3) I think there’s a really good point made about needing the apples-to-apples comparison for evaluation. I remember that, earlier on in the evaluation of speech segmentation models, the models were compared against perfect (adult-like) segmentation accuracy, and few cognitively plausible ones did all that well. In contrast, when these same models were tested on the segmentation tasks given to infants (which were meant to demonstrate infant segmentation ability), most models did just fine. Now, whether the models accomplished segmentation the way that the infants did is a different question, and one that would also apply to LLMs once we have apples-to-apples comparisons.
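Just to make the two evaluation styles concrete for myself, here’s a toy sketch (entirely my own construction, with a simple transitional-probability scorer standing in for a segmentation model): the infant-style test asks whether the model prefers familiarized words over part-words, while the gold-standard test would instead score every boundary against adult segmentation.

```python
import random
from collections import Counter

rng = random.Random(0)

# Toy stand-in for a segmentation model: score a candidate word by the
# average transitional probability between its adjacent syllables.
def train_tp(stream):
    pairs = Counter(zip(stream, stream[1:]))
    firsts = Counter(stream[:-1])
    return lambda word: sum(pairs[(a, b)] / firsts[a]
                            for a, b in zip(word, word[1:])) / (len(word) - 1)

# Familiarization stream in the spirit of infant segmentation studies:
# three "words" concatenated in random order with no pauses.
words = [("tu", "pi", "ro"), ("go", "la", "bu"), ("bi", "da", "ku")]
stream = []
for _ in range(100):
    block = words[:]
    rng.shuffle(block)
    for w in block:
        stream.extend(w)
score = train_tp(stream)

# Infant-style evaluation: does the model prefer the familiarized words
# over part-words (sequences spanning a word boundary)?
part_words = [("ro", "go", "la"), ("bu", "bi", "da")]
print(all(score(w) > score(pw) for w in words for pw in part_words))

# A gold-standard-style evaluation would instead score every predicted word
# boundary in a fully (adult-)segmented corpus, a much stricter target.
```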


Tuesday, April 25, 2023

Some thoughts on Degen 2023

To me, this is a beautifully accessible review article for the probabilistic pragmatics approach, as implemented in RSA. (Figure 1 in particular made me happy – these helpful visuals really are worth it, though I know it’s hard to get them together just right.)  This review article definitely gets me wondering more about how to use RSA for language acquisition (especially when it discusses bounded cognition).

In particular, what’s the (potential) difference between a child’s approximation of Bayesian inference and an adult’s approximation? How much can be captured by this mental computation being pretty good but the units over which inference is operating being immature (e.g., utterance alternatives, meaning options, priors)? For instance, how worthwhile is it to try and capture child behavior on different pragmatic phenomena by assuming adult-like Bayesian inference but non-adult-like units that inference operates over? 

Scontras & Pearl 2021 did this a little for quantifier-scope interpretation, but those child data were from five-year-olds, who are known to be pretty adult-like for non-pragmatic things. What about younger kids? And of course, what about other pragmatic phenomena that we have child data for?
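To make the adult-like-inference-over-immature-units question concrete, here’s a vanilla RSA sketch (the standard literal-listener / speaker / pragmatic-listener recursion for the “some”/“all” scalar implicature; the particular lexicon, alternatives, prior, and rationality parameter are just placeholder choices on my part). The point is that the “units” I’d want to make non-adult-like – the utterance alternatives, the candidate meanings, the prior – are exactly the inputs to this recursion, while the inference itself stays the same.

```python
import numpy as np

# Vanilla RSA for the "some"/"all" scalar implicature. The pieces a child
# might have in non-adult-like form are exactly the inputs here: the
# utterance alternatives, the candidate meanings, and the prior.

meanings = ["some-but-not-all", "all"]
utterances = ["some", "all"]                 # the alternative set
lexicon = np.array([[1.0, 1.0],              # "some" is literally true of both
                    [0.0, 1.0]])             # "all" is true only of "all"
prior = np.array([0.5, 0.5])                 # prior over meanings
alpha = 1.0                                  # speaker rationality

def normalize(m):
    return m / m.sum(axis=-1, keepdims=True)

L0 = normalize(lexicon * prior)      # literal listener: P(meaning | utterance)
S1 = normalize(L0.T ** alpha)        # pragmatic speaker: P(utterance | meaning)
L1 = normalize(S1.T * prior)         # pragmatic listener: P(meaning | utterance)

print(dict(zip(meanings, L1[0].round(2))))   # interpretation of "some"
# -> strengthened toward "some-but-not-all", the scalar implicature
```

So one version of the child model would literally be this recursion with, say, a reduced utterances list or a skewed prior, and nothing else changed.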

Tuesday, April 18, 2023

Some thoughts on Diercks et al. 2023

I really appreciated the leisurely pace and accessible tone of this writing, especially for someone who’s not super-familiar with the nuts and bolts of the Minimalist approach, but very interested in development. Here we can see one of the perks of not having a strict page limit. :)


Some other thoughts:


(1) One key idea of Developmental Minimalist Syntax (DMS) seems to be that the current bottom-up description of possible representations (which is what I take the iterated Merge cycles of the Minimalist approach to be) would actually have a cognitive correlate that we can observe and evaluate (i.e., stages of development). That is, this way of compactly describing acceptable/grammatical adult representations corresponds to an actual cognitive process (at the computational level of description, in Marr’s terms) whose signal can be seen in children’s developmental stages. So, this would support the validity (utility?) of describing adult representations this way.


(2) I didn’t quite follow the link between Minimalist Analytical Constructions (MACs) and Universal Cognition for Language. Is the idea that there are certain representations in the adult knowledge system, and we don’t care if their origin is language-specific? It sounds like that, from the text that follows. 


Later on, MACs are described as children’s “toolkit for grammaticalizing their language”. Would this mean that the adult representations are what children use to make sense of (“grammaticalize”) their language? That is, the representations children develop allow them to parse their input into useful information. In my standard way of thinking about these things, the developed/developing representations that children have allow them to perceive certain information in their input (which then is transformed into their “perceptual intake” of the input signal).


In ch 3, part 4, we get a fuller definition: “grammaticalizing” means arriving at and encoding generalizations for the language. So, I think that’s compatible with my idea above that “grammaticalizing” has to do with the developing adult-like representations, and children parse their input with whatever they’ve already developed along the way.


(3) Thinking about acquisition as addition, rather than replacement: Just to clarify, children can have immature representations in one of two ways: 


(1) a representation is immature because it’s still changing ([hug X] instead of [Predicate X]), or 


(2) a representation is immature because it’s fixed into the adult-like state, but it’s only part of the full adult-like structure (e.g., just the VP) instead of the complete structure [CP [TP [vP [VP ]]]]. This second version is talked about a little later in ch. 3 under “mixed status utterances”, which can have an adult-like part and an immature part.
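A tiny sketch of how I’m picturing the two cases (my own notation, with nested lists standing in for the bracketed structures):

```python
# Case 1: immature because still changing – an item-specific frame rather
# than the general adult-like one.
item_specific = ["hug", "X"]            # [hug X]
general_frame = ["Predicate", "X"]      # [Predicate X]

# Case 2: immature because truncated – adult-like in form, but only part of
# the full structure.
full_adult = ["CP", ["TP", ["vP", ["VP", "put", "the ball"]]]]
truncated = ["VP", "put", "the ball"]   # just the VP layer

# On the "acquisition as addition" view, development for Case 2 is wrapping
# more layers around what's already there, not replacing it.
def add_layer(label, structure):
    return [label, structure]

print(add_layer("vP", truncated))       # ['vP', ['VP', 'put', 'the ball']]
```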


(4) Predictions for VP before vP (section 4.3): So, I think a prediction of DMS is that we shouldn’t generally see agentive subjects combining productively with verbs (which would be vP) before we see verbs combining productively with their objects (which would be VP). (Example: we shouldn’t see “I put” before “put the ball” or “put down”, except perhaps as a specific stored item.)


How would we then distinguish an item-specific combination that seems to violate this prediction from a genuinely productive, language-general use involving that item? (That is, if we encounter “I put” before “put the ball”, how do we know whether it’s an item-specific use or a productive, language-general use?) Is it about where the child seems to be with respect to language-general use overall (e.g., productively using verbs with objects, but not subjects with verbs)? That is, we’d assume that an instance of “I put” would be item-specific and immature, but “put down” would be productive and general?
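One way I could imagine operationalizing the item-specific vs. productive distinction (purely my own sketch of the kind of corpus measure I mean, not something Diercks et al. propose): count how many distinct verb types a child combines in each frame, and only call a frame productive once it’s attested across several different verbs.

```python
from collections import defaultdict

# Hypothetical child sample, coded as (verb, frame) pairs.
utterances = [
    ("put", "verb+object"), ("throw", "verb+object"), ("eat", "verb+object"),
    ("want", "verb+object"), ("put", "subject+verb"), ("put", "subject+verb"),
]

verb_types_per_frame = defaultdict(set)
for verb, frame in utterances:
    verb_types_per_frame[frame].add(verb)

# A crude productivity criterion: a frame counts as productive only once it
# appears with at least N different verb types.
N = 3
for frame, verbs in verb_types_per_frame.items():
    status = "productive" if len(verbs) >= N else "item-specific (so far)"
    print(frame, sorted(verbs), "->", status)
```

On this kind of criterion, “I put” on its own stays item-specific while verb+object comes out productive, which is the pattern the prediction cares about.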


Friday, February 10, 2023

Some thoughts on Hahn et al. (2022)

I love the way Hahn et al. (2022) set up the two approaches they’re combining – it seems like the most natural thing in the world to combine them and reap the benefits of both. Hats off to the authors for some masterful narrative there.


In general, I’d love to think about how to apply the resource-rational lossy-context surprisal approach to models of acquisition. It seems like this approach to input representation could be applied to child input for any given existing model (say, of syntactic learning, but really for learning anything), so that we get a better sense of what (skewed) input children might actually be working from when they’re trying to infer properties of their native language. 


A first pass might be just to use this adult-like version to skew children’s input (maybe a neural model trained on child-directed speech to get appropriate retention probabilities, etc.). That said, I can also imagine that the retention rate might just generally be lower for kids (and kids of different ages) compared to adults, because of lower thresholds on the parts that go into calculating that retention rate (e.g., the delta parameter that modulates how much context goes into calculating next-word probabilities). Still, the exciting thing for me is the idea that this is a way to formally implement “developing processing” (or even just “more realistic processing”) in a model that’s meant to capture developing representations.
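To make the modeling idea concrete for myself, here’s a minimal sketch of the general lossy-context-surprisal recipe as I understand it: context words are retained stochastically (with retention falling off with distance), and surprisal is the average negative log probability of the next word given the lossy context. The toy bigram “language model”, the retention function, and the base_rate parameter are all my own stand-ins (base_rate playing roughly the role the paper’s retention/deletion parameter would play); a lower base_rate is how I’d imagine the child version.

```python
import math
import random

rng = random.Random(0)

# Stand-in next-word model: a toy bigram P(next | most recent retained word).
# In practice this would be a neural LM trained on child-directed speech.
bigram = {("the", "dog"): 0.4, ("the", "ball"): 0.6,
          ("dog", "ran"): 0.7, ("dog", "barked"): 0.3}
FLOOR = 0.05    # backoff probability when the lossy context is uninformative

def retention_prob(distance, base_rate):
    """Probability of keeping a context word, decaying with distance back
    from the current position. A child version might have a lower base_rate."""
    return base_rate ** distance

def lossy_surprisal(context, next_word, base_rate, n_samples=2000):
    total = 0.0
    for _ in range(n_samples):
        kept = [w for i, w in enumerate(reversed(context))
                if rng.random() < retention_prob(i + 1, base_rate)]
        last = kept[0] if kept else None       # most recent retained word
        p = bigram.get((last, next_word), FLOOR)
        total += -math.log2(p)
    return total / n_samples                   # average surprisal in bits

print(lossy_surprisal(["the", "dog"], "ran", base_rate=0.9))  # "adult"
print(lossy_surprisal(["the", "dog"], "ran", base_rate=0.5))  # lossier "child"
```

The lossier (lower base_rate) version ends up with higher average surprisal for the same next word, which is the kind of “developing processing” knob I have in mind.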