Tuesday, January 20, 2026

Some thoughts on Yang 2025

General thoughts: I’m definitely interested in what we can learn about humans from neural models—especially acquisition—so I really appreciate the goal of this paper. I especially appreciate the creation of a test set that better aligns with classic poverty-of-the-stimulus (PoS) phenomena. Even if my main takeaway ended up being “this is a great test set—thanks for making it!”, I think the paper raises some interesting questions about what we should count as success when we evaluate neural learners as models of child language acquisition.

1. But first, a small quibble about the characterization of prior Bayesian work

The paper describes earlier Bayesian modeling work along the lines of “Bayesian models succeed only in idealized hypothesis spaces” and “Bayesian models reproduce this behavior under idealized assumptions.”

I get the intended contrast here, but I don’t think that’s quite fair. The hypothesis spaces in this work were often large; the key point was that they were constrained along particular dimensions. And importantly, they were sometimes less constrained than what existing generative accounts say the learner’s hypothesis space must look like. So yes, there were assumptions. But they weren’t necessarily more “idealized” than the alternatives, and if anything, they seem less so to me.

2. Implications of the current results: what does “above chance” buy us?

The results I keep coming back to are in Table 4: models trained on developmentally realistic input (BABY / BABY-F) show above-chance performance for nearly all phenomena.

One possible (optimistic) interpretation is: GPT-2 has what it takes, mechanistically. That is, given psychologically plausible child input, the learning mechanism can leverage it to generate the target qualitative pattern (at least “better than chance” across the board).

But I’m still trying to figure out what we should take away from that. “Above chance” is encouraging, but it’s also a low bar: it could reflect partial structural sensitivity, shallow heuristics that happen to correlate with the right answer, or unstable learning that doesn’t match the robustness of child generalization. Without a clear linking story between model outputs and child behavior, it’s hard to know what kind of evidence “above chance” really provides in the PoS debate. Maybe it’s good enough? (Or maybe not.)
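(For concreteness: as far as I can tell, “above chance” in this literature usually means a two-alternative forced choice over minimal pairs, where the model gets credit whenever it assigns higher probability to the grammatical member of the pair, so chance is 50%. Here’s a minimal sketch of that kind of scoring, not the paper’s actual evaluation; the off-the-shelf GPT-2 checkpoint, the whether-island example pair, and scoring by total log-probability are all my assumptions.)

```python
# Minimal-pair (BLiMP-style) scoring sketch: does the model prefer the
# grammatical sentence? Chance over many pairs is 50%.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to the sentence's tokens."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, .loss is the mean cross-entropy per predicted token.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

# Hypothetical whether-island minimal pair (not from the paper's test set).
good = "What do you think Mary bought?"
bad = "What do you wonder whether Mary bought?"

print("prefers grammatical:", sentence_logprob(good) > sentence_logprob(bad))
# Accuracy over many such pairs is then compared against the 50% chance level.
```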

3. Hierarchical bias via pre-pretraining: recursion bootcamp helps… and hurts?

One manipulation I found especially interesting: following Hu et al. (2025), the authors pre-pretrain GPT2-mini on a shuffled k-Dyck language for 2K steps (basically a short bootcamp in recursion and nested structure, with balanced parentheses) before training on BABY-F. The idea is to provide a cognitively plausible hierarchical inductive bias.
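To make the manipulation concrete, here’s a rough sketch of what a Dyck-style pre-pretraining corpus could look like. This samples plain nested Dyck-k strings; I’m not sure exactly how the “shuffled” variant from Hu et al. (2025) is constructed, and the bracket inventory (k = 4), length cap, and opening probability below are illustrative choices of mine, not the paper’s.

```python
# Sketch of a Dyck-k corpus generator: k bracket types, properly nested.
import random

# k = 4 bracket types; purely an illustrative inventory.
BRACKETS = [("(", ")"), ("[", "]"), ("{", "}"), ("<", ">")]

def sample_dyck(max_len: int = 40, p_open: float = 0.5) -> str:
    """Sample one properly nested Dyck-k string of roughly max_len symbols."""
    stack, out = [], []
    while len(out) < max_len or stack:
        if stack and (len(out) >= max_len or random.random() > p_open):
            out.append(stack.pop())       # close the most recent open bracket
        else:
            opener, closer = random.choice(BRACKETS)
            out.append(opener)
            stack.append(closer)          # remember which closer we owe
    return " ".join(out)

random.seed(0)
for line in (sample_dyck() for _ in range(5)):
    print(line)
```

The intuition, presumably, is that predicting the right closing bracket forces the model to track a stack-like state, which is exactly the kind of hierarchical bookkeeping the authors want to instill before language training.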

What happens? There’s modest improvement for Anaphoric one (and some improvement for Wanna, though less reliably), but performance declines for Islands, Binding, and Question Formation.

At first glance, this feels surprising. If those three are “structural” phenomena, why would giving the model a hierarchical head start make it worse?

One way to interpret this is not “the model wasn’t using hierarchy for those phenomena,” but instead: a generic recursion bias is the wrong kind of hierarchy for these cases. Dyck-style nesting might encourage chunking and bracket-matching-like strategies, which could help with something like anaphoric one (constituent-like substitution), but distort the cues needed for islands/binding/question formation, which rely on more specific structural constraints (locality, c-command, movement dependencies).

This made me wonder: maybe children’s structural biases are less specific, or less strongly enforced, than what this Dyck pre-pretraining induces, so children (unlike the model) don’t “overfit” to the wrong kind of hierarchical representation.