Friday, February 20, 2026

Some thoughts on Portelance et al. 2024

General thoughts: I so appreciate how careful Portelance and colleagues are in interpreting the results of their modeled learners and tying those results back to current theoretical perspectives on the acquisition of function words. (Also, I appreciate the Pearl (2023) citation in that discussion — thank you!)

I also really like the developmental plausibility of the setup. The models aren’t learning language for its own sake. The feedback signal here is about task success, not about linguistic form. As the authors say, language is an auxiliary objective — a tool for accomplishing something else: communicating about the visual world. That feels right. Children are focused on acting and understanding in the world, and language turns out to be an efficient way to do that with other humans. Getting feedback about whether your interpretation of a scene works — but not about whether your internal representation of a connective is correct — seems developmentally plausible.

At a broad level, the paper offers a genuine proof-of-concept: aspects of the meanings of logical connectives and relational terms can be learned from distributions of linguistic and visual information, without prior knowledge of linguistic meaning.

But once we zoom in, things get more interesting.

(1) What does it mean to be “sensitive to alternative expressions”?

A central question in the paper is whether the existence of “alternative expressions” affects acquisition. In the case of and and or, this connects to a classic Gricean idea: hearing or often leads us to infer “not and,” because if and were true, it would have been more informative to say so.

The authors suggest that their models show early evidence of being “sensitive to alternative expressions when interpreting language.” But that can mean at least two different things:

(A) Representations change when alternatives are present in the training distribution.
(B) The system reasons about alternatives during interpretation in a Gricean sense.

The experiments strongly support (A). It’s less clear to me that they establish (B).

When and and or are both present in training, performance shifts. Removing one affects how the other behaves. That shows interaction and competition. But interaction between representations isn't (yet) the same thing as reasoning about alternative utterances in a pragmatic sense.

(2) Truth-conditional geometry matters

One structural feature seems especially important: the geometry of the truth space.

For and and or, we have a nested relation: AND ⊂ OR

When both conjuncts are true, both AND and inclusive-OR are true. That creates an overlap region in which two different expressions yield the same answer.

In contrast, for pairs like behind/in front of or more/fewer, the relation between the two terms is symmetric rather than nested: they only overlap in contexts where both are false (e.g., the same spatial position, equal numbers). There's no scalar nesting.
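
To make that geometry concrete, here is a toy enumeration of the truth spaces (my own illustration, not anything from the paper): over the four possible two-conjunct worlds, the worlds where and is true are a strict subset of the worlds where inclusive or is true, while a converse pair like more/fewer never answers "yes" in the same world.

```python
from itertools import product

# Enumerate all worlds defined by two conjunct truth values (p, q).
worlds = list(product([False, True], repeat=2))

and_true = {w for w in worlds if w[0] and w[1]}
or_true = {w for w in worlds if w[0] or w[1]}  # inclusive or

# Nesting: every world where AND is true is also a world where OR is true.
assert and_true <= or_true
print("AND-true worlds:", and_true)            # {(True, True)}
print("OR-true worlds:", or_true)              # three of the four worlds
print("overlap (same 'yes' answer):", and_true & or_true)

# Contrast: a converse pair like more/fewer over object counts (a, b).
counts = [(a, b) for a in range(3) for b in range(3)]
more_true = {c for c in counts if c[0] > c[1]}
fewer_true = {c for c in counts if c[0] < c[1]}

# No nesting: the two never say "yes" of the same world.
assert not (more_true & fewer_true)
```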

This difference seems to matter.

If two expressions yield the same answer in many contexts, the modeled learner repeatedly sees:

Same world → two different words → same label

In a distributed learning system, that encourages representational similarity (“representational entanglement”). When those same expressions diverge elsewhere (e.g., one-conjunct-true cases for and vs or), the gradients conflict. From a logical perspective, opposing truth values increase discriminability. But from a distributed learning perspective, opposing labels on similar inputs can increase gradient conflict.
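
Here is a minimal sketch of that intuition, under toy assumptions that are mine rather than the paper's (a single learned embedding per connective and a bilinear score against fixed world features): in a both-conjuncts-true world, the shared "yes" label pushes the two embeddings in the same direction, while in a one-conjunct-true world, the diverging labels produce opposing gradients on the same input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim = 4

# One learned embedding per connective (a toy stand-in for the model's word representations).
emb = {"and": rng.normal(size=dim), "or": rng.normal(size=dim)}

def world_features(p, q):
    # Crude fixed "scene encoding" of a two-conjunct world.
    return np.array([1.0, float(p), float(q), float(p and q)])

def bce_grad(word, feats, label):
    # Gradient of binary cross-entropy wrt the word embedding,
    # for a bilinear score: score = emb[word] . feats.
    pred = sigmoid(emb[word] @ feats)
    return (pred - label) * feats

# Both-conjuncts-true world: AND and OR share the label 1,
# so both embeddings are pushed the same way -> representational entanglement.
f_tt = world_features(True, True)
g_and, g_or = bce_grad("and", f_tt, 1.0), bce_grad("or", f_tt, 1.0)
print("gradient cosine (both conjuncts true):",
      g_and @ g_or / (np.linalg.norm(g_and) * np.linalg.norm(g_or)))  # ~ +1

# One-conjunct-true world: labels diverge (AND = 0, OR = 1),
# so gradients on the same features point in opposite directions -> conflict.
f_tf = world_features(True, False)
g_and, g_or = bce_grad("and", f_tf, 0.0), bce_grad("or", f_tf, 1.0)
print("gradient cosine (one conjunct true):",
      g_and @ g_or / (np.linalg.norm(g_and) * np.linalg.norm(g_or)))  # ~ -1
```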

I think this helps explain why the "yes" contexts for and are fragile when or is present, and why removing or in Experiment 2 stabilizes performance on and. It may not require the model to be explicitly reasoning about alternative utterances: overlapping supervision is enough.

(3) What counts as “meaning” here?

In this modeling setup, meaning is operationalized as the pattern of answer behavior across worlds.

So, if two expressions systematically yield the same answer in a subset of contexts, the model may treat them as similar in those contexts. If they diverge elsewhere, it may expect them to diverge consistently. The system is learning statistical mappings between linguistic forms and response patterns, not necessarily structured semantic objects.
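
One toy way to cash that out (again, my own operationalization, not the paper's analysis): treat each expression's "meaning" as its vector of answers across worlds and measure similarity as answer agreement. On that measure, and and or are behaviorally identical in the both-true region and behaviorally opposed where exactly one conjunct is true, with nothing in the representation that marks them as structured logical operators.

```python
from itertools import product

# A world is a pair of conjunct truth values; an expression's "meaning" here
# is just its answer pattern across those worlds.
worlds = list(product([False, True], repeat=2))

answers = {
    "and": [p and q for p, q in worlds],
    "or": [p or q for p, q in worlds],
}

def agreement(a, b, subset=None):
    # Fraction of the selected worlds where the two expressions give the same answer.
    idx = list(subset) if subset is not None else list(range(len(worlds)))
    return sum(answers[a][i] == answers[b][i] for i in idx) / len(idx)

overlap_idx = [i for i, (p, q) in enumerate(worlds) if p and q]   # both conjuncts true
divergent_idx = [i for i, (p, q) in enumerate(worlds) if p != q]  # exactly one true

print(agreement("and", "or", overlap_idx))    # 1.0 -> behaviorally identical here
print(agreement("and", "or", divergent_idx))  # 0.0 -> behaviorally opposed here
print(agreement("and", "or"))                 # 0.5 overall
```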

That distinction becomes important when interpreting claims about logical knowledge, which seems like a different type of “meaning” to me.

(4) Implications for logical nativists

From a logical nativist perspective, the burden of proof is not simply showing that statistical learning can approximate correct answers in a constrained domain. The question is whether it yields the structured, compositional representations children appear to have.

So, I’m not sure how worried the logical nativists would be by the findings here.


Tuesday, January 20, 2026

Some thoughts on Yang 2025

General thoughts: I’m definitely interested in what we can learn about humans from neural models—especially acquisition—so I really appreciate the goal of this paper. I especially appreciate the creation of a test set that better aligns with classic poverty-of-the-stimulus (PoS) phenomena. Even if my main takeaway ended up being “this is a great test set—thanks for making it!”, I think the paper raises some interesting questions about what we should count as success when we evaluate neural learners as models of child language acquisition.

1. But first, a small quibble about the characterization of prior Bayesian work

The paper describes earlier Bayesian modeling work along the lines of “Bayesian models succeed only in idealized hypothesis spaces” and “Bayesian models reproduce this behavior under idealized assumptions.”

I get the intended contrast here, but I don't think that's quite fair. The hypothesis spaces in this work were often large; the key point was that they were constrained along particular dimensions. And importantly, they were sometimes less constrained than what existing generative accounts claim the learner's hypothesis space must look like. So yes, there are assumptions. But the assumptions weren't necessarily more "idealized" than the alternatives, and if anything, they seem less so to me.

2. Implications of the current results: what does “above chance” buy us?

The results I keep coming back to are in Table 4: models trained on developmentally realistic input (BABY / BABY-F) show above-chance performance for nearly all phenomena.

One possible (optimistic) interpretation is: GPT-2 has what it takes, mechanistically. That is, given psychologically plausible child input, the learning mechanism can leverage it to generate the target qualitative pattern (at least "better than chance" across the board).

But I’m still trying to figure out what we should take away from that. “Above chance” is encouraging, but it’s also a low bar: it could reflect partial structural sensitivity, shallow heuristics that happen to correlate with the right answer, or unstable learning that doesn’t match the robustness of child generalization. Without a clear linking story between model outputs and child behavior, it’s hard to know what kind of evidence “above chance” really provides in the PoS debate. Maybe it’s good enough? (Or maybe not.)

3. Hierarchical bias via pre-pretraining: recursion bootcamp helps… and hurts?

One manipulation I found especially interesting: following Hu et al. (2025), the authors pre-pretrain GPT2-mini on a shuffled k-Dyck language for 2K steps (basically a short bootcamp in recursion and nested structure, with balanced parentheses) before training on BABY-F. The idea is to provide a cognitively plausible hierarchical inductive bias.
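
For concreteness, here's a rough sketch of the kind of corpus that "recursion bootcamp" implies (a toy generator of my own; the exact shuffled-Dyck construction and hyperparameters in Hu et al. 2025 may well differ): balanced, properly nested brackets drawn from k bracket types.

```python
import random

# Toy generator of Dyck-k style strings: balanced, nested brackets over k types.
# This only illustrates the nested-bracket core of the "recursion bootcamp" idea;
# the actual shuffled k-Dyck setup in Hu et al. (2025) may differ in its details.
BRACKETS = [("(", ")"), ("[", "]"), ("{", "}")]  # k = 3 bracket types

def dyck_string(max_len, p_open=0.5, rng=random):
    # Produces roughly max_len tokens (may run a token or two over while
    # closing whatever is still open at the end).
    tokens, stack = [], []
    while len(tokens) < max_len:
        if stack and (rng.random() > p_open or len(tokens) + len(stack) >= max_len):
            # Close the most recently opened bracket: this enforces proper nesting.
            tokens.append(stack.pop())
        else:
            open_b, close_b = rng.choice(BRACKETS)
            tokens.append(open_b)
            stack.append(close_b)
    tokens.extend(reversed(stack))  # close anything still open
    return " ".join(tokens)

random.seed(0)
for _ in range(5):
    print(dyck_string(max_len=20))
```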

What happens? There’s modest improvement for Anaphoric one (and some improvement for Wanna, though less reliably), but performance declines for Islands, Binding, and Question Formation.

At first glance, this feels surprising. If those three are “structural” phenomena, why would giving the model a hierarchical head start make it worse?

One way to interpret this is not that "the model wasn't using hierarchy for those phenomena," but rather that a generic recursion bias is the wrong kind of hierarchy for these cases. Dyck-style nesting might encourage chunking and bracket-matching-like strategies, which could help with something like anaphoric one (constituent-like substitution), but distort the cues needed for islands/binding/question formation, which rely on more specific structural constraints (locality, c-command, movement dependencies).

This made me wonder: maybe children’s structural biases are less specific, or less strongly enforced, than what this Dyck pretraining induces—so they don’t “overfit” to the wrong kind of hierarchical representation.