Wednesday, May 27, 2026

Some thoughts on Kakouros & Räsänen 2015

General thoughts:

One thing I really appreciated about this paper is that it maps wonderfully onto the terms I use in my own acquisition theorizing, both for human-inspired intake (what aspects of the signal get attended to / extracted) and for human-generated behavior as the target (comparing the model’s predictions against actual human prominence judgments).

In terms of mapping to infant acquisition, it seems like a great first proof-of-concept for prosodic surprisal as a key mechanism for identifying prominence. In particular, if infants (like adults) are especially sensitive to statistically unexpected regions of the speech stream, could this help guide attention toward informationally important material?

More specific thoughts:


1. The role of duration

The paper initially presents F0, energy, and duration as the major correlates of prominence, but then only explicitly models F0 and energy. At first this confused me, since duration is described as highly important cross-linguistically.

But later the authors explain that duration is effectively folded into the model through integration over word duration and syllabic weighting. So the implementation is actually doing a kind of dimensionality reduction: duration isn’t independently modeled, but its effects are partially reconstructable through the temporal integration of the other features.

That makes sense computationally, though I still wonder whether something is lost by not modeling duration directly.

2. Quantization – is it still useful today?

The quantization step discretizes continuous acoustic trajectories into 32 bins so that they can be modeled with n-grams.

My immediate thought: if we weren’t using an n-gram-based learning model, would we need to quantize? (For example, we could use whatever state-of-the-art unsupervised classification exists for continuous structure.)
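
To make the quantize-then-n-gram idea concrete for myself, here is a minimal Python sketch. It is not the authors’ implementation: the equal-width binning, the bigram order, the add-one smoothing, the function names, and the toy F0 values are all my own assumptions for illustration. Only the 32-bin count comes from the paper.

```python
import math
from collections import Counter

N_BINS = 32  # bin count from the paper; everything else here is assumed


def quantize(values, lo, hi, n_bins=N_BINS):
    """Map continuous values to bin indices 0..n_bins-1 using equal-width bins."""
    width = (hi - lo) / n_bins
    return [min(max(int((v - lo) / width), 0), n_bins - 1) for v in values]


def train_bigram(sequences, n_bins=N_BINS):
    """Return a function prob(prev, cur) estimated with add-one smoothing."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def prob(prev, cur):
        return (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + n_bins)

    return prob


def surprisal(seq, prob):
    """Per-transition surprisal in bits: -log2 P(cur | prev)."""
    return [-math.log2(prob(p, c)) for p, c in zip(seq, seq[1:])]


# Hypothetical F0 contours in Hz: training data is flat, the test contour jumps.
flat = [[120.0, 121.0, 120.5, 121.5, 120.0] for _ in range(20)]
train = [quantize(c, 80.0, 400.0) for c in flat]
prob = train_bigram(train)

test_contour = quantize([120.0, 121.0, 300.0], 80.0, 400.0)
scores = surprisal(test_contour, prob)
print(scores)  # the jump to 300 Hz gets the larger surprisal
```

The point of the sketch is just that the discretization exists to make counting possible. A model that estimates densities over continuous values directly would not need this step, which is why I wonder whether it is still necessary.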

3. The locality bias 

One finding is that higher-order n-grams don’t help that much, and the authors interpret this in locality terms. This intuitively appeals to me, because it seems reasonable that humans naturally impose locality biases, especially if we’re talking about real-time processing of something like an entire utterance.

4. F0 and energy seem surprisingly redundant

I thought it was really interesting that F0 alone performs about as well as energy alone, and combining them only modestly improves performance. This is a surprising apparent redundancy, given that these features don’t obviously seem reducible to each other.

One thought: maybe the cleanness of the input signal (e.g., no background noise) kept these features from helping much more in combination. In more realistic, noisier scenarios, they might not be so redundant.

5. Thresholding 

One thing I was left wondering about was the thresholding step. Prominence is ultimately assigned by applying a threshold relative to the utterance distribution.

Conceptually, I actually like the idea that prominence is contextual rather than absolute — that matches linguistic intuitions fairly well.

But computationally, I still wonder:

  • how sensitive is performance to the threshold choice?
  • how much tuning is effectively hidden there?
  • would humans themselves have dynamically adaptive thresholds?
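
To think through the sensitivity question, here is a small Python sketch of a threshold defined relative to the utterance’s own score distribution. I am assuming a mean-plus-k-standard-deviations rule purely for illustration; the paper’s actual rule may differ, and the word scores and the parameter `k` are hypothetical.

```python
from statistics import mean, pstdev


def label_prominent(word_scores, k=1.0):
    """Mark a word prominent if its score exceeds mean + k * std of the utterance.

    Assumed rule for illustration only, not necessarily the paper's.
    """
    threshold = mean(word_scores) + k * pstdev(word_scores)
    return [s > threshold for s in word_scores]


def sweep(word_scores, ks):
    """Count how many words are labeled prominent at each value of k."""
    return {k: sum(label_prominent(word_scores, k)) for k in ks}


# Hypothetical per-word surprisal scores for one utterance.
scores = [1.0, 1.2, 0.9, 3.0, 1.1]

print(label_prominent(scores, k=1.0))
print(sweep(scores, ks=[-0.5, 0.0, 1.0, 2.0]))

# Because the threshold is relative, rescaling every score leaves the labels unchanged.
louder = [10 * s for s in scores]
print(label_prominent(louder, k=1.0) == label_prominent(scores, k=1.0))
```

Even in this toy case, the number of words labeled prominent moves from three to zero as `k` goes from -0.5 to 2.0, which is the kind of hidden tuning I have in mind. The relative definition does deliver the contextual property I like, though: the labels depend on where a word sits in its utterance, not on absolute score values.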

Monday, May 4, 2026

Some thoughts on Räsänen et al. 2018

General thoughts: I really like this kind of question in acquisition modeling: what does the learner actually have access to early on? A lot of existing work assumes children operate over clean, adult-like syllables. That’s clearly idealized. So the more interesting question is this: what could plausibly function as syllables before phonology is in place—and would those units be good enough for downstream learning?

Räsänen et al. (2018) take a nice step in this direction. Instead of assuming syllables, they derive syllable-like “acoustic chunks” directly from the speech signal using sonority. These aren’t phonological syllables, but rather perceptual units that fall out of general auditory processing, grounded in properties of the human auditory system. So now… let’s talk “syllables.”

What counts as a useful “syllable”?

The core result is that these acoustic chunks align reasonably well with annotated syllable boundaries across languages. That’s encouraging: it suggests learners could extract something syllable-like without prior linguistic knowledge.

But… do we actually need a match to adult-like syllables at this stage of acquisition? The goal is to use syllables as input to other processes (like word segmentation). So, to me, the relevant question becomes: are these units good enough for the tasks syllables are supposed to support?

I would love to see a downstream test in future work. For example: take these acoustically-derived units and feed them into a word segmentation model. Do we still get reasonable performance? Does it degrade relative to idealized syllables? Or is it surprisingly robust—maybe especially for infant-directed speech like that in the Brent corpus?

For me, that’s the next step: not just approximating syllables, but testing whether those approximations are functionally adequate. This paper lays excellent groundwork for asking that question concretely.


Friday, February 20, 2026

Some thoughts on Portelance et al. 2024




General thoughts: I so appreciate how careful Portelance and colleagues are in interpreting the results of their modeled learners and tying those results back to current theoretical perspectives on the acquisition of function words. (Also, I appreciate the Pearl (2023) citation in that discussion — thank you!)

I also really like the developmental plausibility of the setup. The models aren’t learning language for its own sake. The feedback signal here is about task success, not about linguistic form. As the authors say, language is an auxiliary objective — a tool for accomplishing something else: communicating about the visual world. That feels right. Children are focused on acting and understanding in the world, and language turns out to be an efficient way to do that with other humans. Getting feedback about whether your interpretation of a scene works — but not about whether your internal representation of a connective is correct — seems developmentally plausible.

At a broad level, the paper offers a genuine proof-of-concept: aspects of the meanings of logical connectives and relational terms can be learned from distributions of linguistic and visual information, without prior knowledge of linguistic meaning.

But once we zoom in, things get more interesting.

(1) What does it mean to be “sensitive to alternative expressions”?

A central question in the paper is whether the existence of “alternative expressions” affects acquisition. In the case of and and or, this connects to a classic Gricean idea: hearing or often leads us to infer “not and,” because if and were true, it would have been more informative to say so.

The authors suggest that their models show early evidence of being “sensitive to alternative expressions when interpreting language.” But that can mean at least two different things:

(A) Representations change when alternatives are present in the training distribution.
(B) The system reasons about alternatives during interpretation in a Gricean sense.

The experiments strongly support (A). It’s less clear to me that they establish (B).

When and and or are both present in training, performance shifts. Removing one affects how the other behaves. That shows interaction and competition. But interaction between representations isn't (yet) the same thing as reasoning about alternative utterances in a pragmatic sense.

(2) Truth-conditional geometry matters

One structural feature seems especially important: the geometry of the truth space.

For and and or, we have a nested relation: AND ⊂ OR

When both conjuncts are true, both AND and inclusive-OR are true. That creates an overlap region in which two different expressions yield the same answer.

In contrast, for pairs like behind/in front of or more/fewer, the relation is symmetric. They only overlap in contexts where both are false (e.g., equivalent spatial position, equal numbers). There’s no scalar nesting.
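
The two geometries can be spelled out by enumerating a small set of worlds. This is just a Python sketch of the logic described above; the world encodings (pairs of truth values for and/or, pairs of small counts for more/fewer) are my own simplification and not the paper’s stimuli.

```python
from itertools import product

# Worlds for and/or: truth values of the two conjuncts.
bool_worlds = set(product([False, True], repeat=2))
and_true = {w for w in bool_worlds if w[0] and w[1]}
or_true = {w for w in bool_worlds if w[0] or w[1]}

# Worlds for more/fewer: hypothetical counts of two object types, 0..2 each.
count_worlds = set(product(range(3), repeat=2))
more_true = {w for w in count_worlds if w[0] > w[1]}
fewer_true = {w for w in count_worlds if w[0] < w[1]}

# Nested relation: every AND-true world is also OR-true, but not vice versa.
print(and_true < or_true)

# Symmetric relation: no world makes both true, and neither set contains the other.
print(more_true & fewer_true == set())

# Where the two expressions give the same answer.
same_and_or = {w for w in bool_worlds if (w in and_true) == (w in or_true)}
same_more_fewer = {w for w in count_worlds if (w in more_true) == (w in fewer_true)}
print(sorted(same_and_or))      # both-false and both-true worlds
print(sorted(same_more_fewer))  # only the equal-count worlds, where both are false
```

For and/or, the same-answer region includes a world where both expressions are true. For more/fewer, the same-answer region contains only worlds where both are false.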

This difference seems to matter.

If two expressions yield the same answer in many contexts, the modeled learner repeatedly sees:

Same world → two different words → same label

In a distributed learning system, that encourages representational similarity (“representational entanglement”). When those same expressions diverge elsewhere (e.g., one-conjunct-true cases for and vs or), the gradients conflict. From a logical perspective, opposing truth values increase discriminability. But from a distributed learning perspective, opposing labels on similar inputs can increase gradient conflict.
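
Here is a toy Python illustration of what I mean by gradient conflict. It uses a single logistic unit rather than anything like the paper’s architecture, and the feature vectors (two scene features plus two partially overlapping word features) are entirely hypothetical. The only claim is the general one: when two inputs are similar, giving them opposite labels produces opposing gradients, while giving them the same label produces aligned gradients.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad(w, x, y):
    """Gradient of the logistic loss with respect to w for one example (x, y)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical features: scene (2 dims) followed by word (2 dims).
word_and = [0.9, 0.1]  # overlapping ("entangled") word representations
word_or = [0.1, 0.9]
scene_both_true = [1.0, 1.0]
scene_one_true = [1.0, 0.0]

w = [0.0, 0.0, 0.0, 0.0]  # untrained weights

# Both conjuncts true: "and" and "or" both get the answer yes (1).
g_and_both = logistic_grad(w, scene_both_true + word_and, 1)
g_or_both = logistic_grad(w, scene_both_true + word_or, 1)

# One conjunct true: "and" gets no (0), "or" gets yes (1).
g_and_one = logistic_grad(w, scene_one_true + word_and, 0)
g_or_one = logistic_grad(w, scene_one_true + word_or, 1)

print(cosine(g_and_both, g_or_both))  # positive: updates reinforce each other
print(cosine(g_and_one, g_or_one))    # negative: updates pull in opposite directions
```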

I think this helps explain why and “yes” contexts are fragile when or is present, and why removing or in Experiment 2 stabilizes and performance. It may not require the model to be explicitly reasoning about alternative utterances — overlapping supervision is enough.

(3) What counts as “meaning” here?

In this modeling setup, meaning is operationalized as the pattern of answer behavior across worlds.

So, if two expressions systematically yield the same answer in a subset of contexts, the model may treat them as similar in those contexts. If they diverge elsewhere, it may expect them to diverge consistently. The system is learning statistical mappings between linguistic forms and response patterns, not necessarily structured semantic objects.

That distinction becomes important when interpreting claims about logical knowledge, which seems like a different type of “meaning” to me.

(4) Implications for logical nativists

From a logical nativist perspective, the burden of proof is not simply showing that statistical learning can approximate correct answers in a constrained domain. The question is whether it yields the structured, compositional representations children appear to have.

So, I’m not sure how worried the logical nativists would be by the findings here.


Tuesday, January 20, 2026

Some thoughts on Yang 2025

General thoughts: I’m definitely interested in what we can learn about humans from neural models—especially acquisition—so I really appreciate the goal of this paper. I especially appreciate the creation of a test set that better aligns with classic poverty-of-the-stimulus (PoS) phenomena. Even if my main takeaway ended up being “this is a great test set—thanks for making it!”, I think the paper raises some interesting questions about what we should count as success when we evaluate neural learners as models of child language acquisition.

1. But first, a small quibble about the characterization of prior Bayesian work

The paper describes earlier Bayesian modeling work along the lines of “Bayesian models succeed only in idealized hypothesis spaces” and “Bayesian models reproduce this behavior under idealized assumptions.”

I get the intended contrast here, but I don’t think that’s quite fair. The hypothesis spaces in this work were often large—the key point was that they were constrained along particular dimensions. But importantly, they were sometimes less constrained than existing generative accounts about what the learner’s hypothesis space must look like. So yes, there are assumptions. But the assumptions weren’t necessarily more “idealized” than the alternatives, and if anything, they seem less so to me.

2. Implications of the current results: what does “above chance” buy us?

The results I keep coming back to are in Table 4: models trained on developmentally realistic input (BABY / BABY-F) show above-chance performance for nearly all phenomena.

One possible (optimistic) interpretation is: GPT-2 has what it takes, mechanistically. That is, given psychologically-plausible child input, the learning mechanism can leverage it to generate the target qualitative pattern (at least “better than chance” across the board).

But I’m still trying to figure out what we should take away from that. “Above chance” is encouraging, but it’s also a low bar: it could reflect partial structural sensitivity, shallow heuristics that happen to correlate with the right answer, or unstable learning that doesn’t match the robustness of child generalization. Without a clear linking story between model outputs and child behavior, it’s hard to know what kind of evidence “above chance” really provides in the PoS debate. Maybe it’s good enough? (Or maybe not.)

3. Hierarchical bias via pre-pretraining: recursion bootcamp helps… and hurts?

One manipulation I found especially interesting: following Hu et al. (2025), the authors pre-pretrain GPT2-mini on a shuffled k-Dyck language for 2K steps (basically a short bootcamp in recursion and nested structure, with balanced parentheses) before training on BABY-F. The idea is to provide a cognitively-plausible hierarchical inductive bias.
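
For readers who haven’t run into Dyck languages, here is a small Python sketch of the difference between ordinary nested Dyck strings and the shuffled variant. I am assuming that “shuffled k-Dyck” means the standard shuffle of k single-bracket Dyck languages, where each bracket type must be balanced on its own but different types may cross. The function names and the two-bracket alphabet are mine, not the paper’s.

```python
PAIRS = {"(": ")", "[": "]"}  # k = 2 bracket types, chosen for illustration


def is_dyck(s, pairs=PAIRS):
    """Strict nesting: every closer must match the most recent unmatched opener."""
    closers = set(pairs.values())
    stack = []
    for ch in s:
        if ch in pairs:
            stack.append(ch)
        elif ch in closers:
            if not stack or pairs[stack.pop()] != ch:
                return False
        else:
            return False
    return not stack


def is_shuffle_dyck(s, pairs=PAIRS):
    """Each bracket type is balanced on its own; different types may cross."""
    opener_of = {close: open_ for open_, close in pairs.items()}
    depth = {open_: 0 for open_ in pairs}
    for ch in s:
        if ch in pairs:
            depth[ch] += 1
        elif ch in opener_of:
            depth[opener_of[ch]] -= 1
            if depth[opener_of[ch]] < 0:
                return False
        else:
            return False
    return all(d == 0 for d in depth.values())


print(is_dyck("([])"), is_shuffle_dyck("([])"))  # nested: in both languages
print(is_dyck("([)]"), is_shuffle_dyck("([)]"))  # crossing: only in the shuffled one
```

If that reading of “shuffled” is right, the pre-pretraining language allows crossing dependencies that strict nesting forbids, which seems worth keeping in mind when asking what kind of hierarchical bias it actually instills.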

What happens? There’s modest improvement for Anaphoric one (and some improvement for Wanna, though less reliably), but performance declines for Islands, Binding, and Question Formation.

At first glance, this feels surprising. If those three are “structural” phenomena, why would giving the model a hierarchical head start make it worse?

One way to interpret this is not “the model wasn’t using hierarchy for those phenomena,” but instead: a generic recursion bias is the wrong kind of hierarchy for these cases. Dyck-style nesting might encourage chunking and bracket-matching-like strategies, which could help with something like anaphoric one (constituent-like substitution), but distort the cues needed for islands/binding/question formation, which rely on more specific structural constraints (locality, c-command, movement dependencies).

This made me wonder: maybe children’s structural biases are less specific, or less strongly enforced, than what this Dyck pretraining induces—so they don’t “overfit” to the wrong kind of hierarchical representation.