Wednesday, May 27, 2026

Some thoughts on Kakouros & Räsänen 2015

General thoughts:

One thing I really appreciated about this paper is that it maps wonderfully onto the terms of my own acquisition theorizing: human-inspired intake (which aspects of the signal get attended to and extracted) and human-generated behavior as the target (comparing the model’s predictions against actual human prominence judgments).

In terms of mapping to infant acquisition, it seems like a great first proof-of-concept for prosodic surprisal as a key mechanism for identifying prominence. In particular, if infants (like adults) are especially sensitive to statistically unexpected regions of the speech stream, could this help guide attention toward informationally important material?

More specific thoughts:


1. The role of duration

The paper initially presents F0, energy, and duration as the major correlates of prominence, but then only explicitly models F0 and energy. At first this confused me, since duration is described as highly important cross-linguistically.

But later the authors explain that duration is effectively folded into the model through integration over word duration and syllabic weighting. So the implementation is actually doing a kind of dimensionality reduction: duration isn’t independently modeled, but its effects are partially reconstructable through the temporal integration of the other features.

That makes sense computationally, though I still wonder whether something is lost by not modeling duration directly.
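
To convince myself of how duration can sneak in through integration, here’s a minimal sketch. It assumes that "integration over word duration" can be approximated by summing a frame-level score over each word’s frames; the syllabic weighting is left out, and the function name and the toy numbers are mine, not the paper’s.

```python
def word_scores(frame_surprisal, word_spans):
    """Sum a frame-level score over each word's (start, end) frame span.

    Summation is an assumed stand-in for the paper's integration step.
    """
    return [sum(frame_surprisal[start:end]) for start, end in word_spans]

# Two hypothetical words with identical per-frame scores but different durations.
frames = [1.0] * 10
spans = [(0, 3), (3, 10)]  # a 3-frame word followed by a 7-frame word
scores = word_scores(frames, spans)
print(scores)
```

Even with a flat per-frame score, the longer word ends up with the larger total, so duration affects the outcome without ever being a modeled feature. Averaging instead of summing would remove that effect, which is one reason I’d want to know exactly which form the integration takes.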

2. Quantization – is it still useful today?

The quantization step maps continuous acoustic trajectories onto sequences over 32 discrete bins, so that they can be modeled with n-grams.

My immediate thought: if we weren’t using an n-gram-based learning model, would we need to quantize at all? (For example, we could use whatever state-of-the-art unsupervised classification exists for continuous structure.)
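
To make the quantize-then-n-gram idea concrete, here’s a rough sketch. Everything specific in it is an assumption on my part: equal-width bins (the paper may well bin differently), a bigram model, and add-one smoothing. The only number taken from the paper is the 32 bins.

```python
import math
from collections import Counter

def quantize(values, n_bins=32):
    """Map continuous values to bin indices 0..n_bins-1 (equal-width bins, assumed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a completely flat contour
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def train_bigram(sequences):
    """Count bigrams and their left contexts over quantized training sequences."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts

def surprisal(seq, bigrams, contexts, n_bins=32):
    """Per-transition surprisal -log2 P(cur | prev), with add-one smoothing."""
    out = []
    for prev, cur in zip(seq, seq[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + n_bins)
        out.append(-math.log2(p))
    return out

# Toy data: the model only ever sees an alternating 0/1 pattern in training.
training = [[0, 1, 0, 1, 0, 1, 0, 1]] * 5
bigrams, contexts = train_bigram(training)
test_surprisal = surprisal([0, 1, 0, 5], bigrams, contexts)
print(test_surprisal)
```

The unseen jump to bin 5 gets the highest surprisal of the three transitions, which is the sense in which statistically unexpected stretches stand out. The n-gram counts are what force the discretization here; a density model over the continuous values could in principle assign surprisal without this step.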

3. The locality bias 

One finding is that higher-order n-grams don’t help that much, and the authors interpret this in terms of locality. This appeals to me intuitively, because it seems reasonable that humans naturally impose locality biases, especially when it comes to real-time processing of something like an entire utterance.

4. F0 and energy seem surprisingly redundant

I thought it was really interesting that F0 alone performs about as well as energy alone, and that combining them improves performance only modestly. This apparent redundancy is surprising, given that these features don’t obviously seem reducible to each other.

One thought: maybe the input signal was so clean (e.g., no background noise) that either feature alone already carried most of the useful information, leaving little for the combination to add. In more realistic, noisier scenarios, the two features might not be so redundant.

5. Thresholding 

One thing I was left wondering about was the thresholding step. Prominence is ultimately assigned by applying a threshold relative to the utterance distribution.

Conceptually, I actually like the idea that prominence is contextual rather than absolute — that matches linguistic intuitions fairly well.

But computationally, I still wonder:

  • how sensitive is performance to the threshold choice?
  • how much tuning is effectively hidden there?
  • would humans themselves have dynamically adaptive thresholds?
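
To make the sensitivity question concrete, here’s a small sketch of one hypothetical form of utterance-relative thresholding: flag a word as prominent when its score exceeds the utterance mean plus lam times the standard deviation. The formula, the parameter name lam, and the toy scores are my assumptions; the paper’s actual rule may differ.

```python
import statistics

def prominent_words(scores, lam=0.0):
    """Flag words whose score exceeds mean + lam * stdev of the utterance's scores.

    This threshold form is an assumption for illustration only.
    """
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # population stdev over this one utterance
    threshold = mu + lam * sigma
    return [s > threshold for s in scores]

# Hypothetical word-level scores for a five-word utterance.
utterance = [2.0, 9.0, 3.0, 4.0, 12.0]
for lam in (-1.0, 0.0, 1.0):
    flags = prominent_words(utterance, lam)
    print(lam, sum(flags), flags)
```

Sweeping lam like this on held-out data would show how quickly the number of flagged words changes, and therefore how much of the reported performance depends on where that one parameter is set.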

Monday, May 4, 2026

Some thoughts on Räsänen et al. 2018

General thoughts:

I really like this kind of question in acquisition modeling: what does the learner actually have access to early on? A lot of existing work assumes children operate over clean, adult-like syllables. That’s clearly idealized. So the more interesting question is this: what could plausibly function as syllables before phonology is in place, and would those units be good enough for downstream learning?

Räsänen et al. (2018) take a nice step in this direction. Instead of assuming syllables, they derive syllable-like “acoustic chunks” directly from the speech signal using sonority. These aren’t phonological syllables, but rather perceptual units that fall out of general auditory processing, grounded in properties of the human auditory system. So now… let’s talk “syllables.”
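
For intuition about what "chunks from the signal" could look like, here’s a deliberately crude sketch: smooth an amplitude envelope and treat its local minima as chunk boundaries. This is not the paper’s algorithm, which is more elaborate and is not reproduced here; the moving-average smoother, the window size, and the synthetic envelope are all my own stand-ins.

```python
import math

def smooth(x, win):
    """Centered moving average; a crude stand-in for an auditory envelope filter."""
    half = win // 2
    out = []
    for i in range(len(x)):
        seg = x[max(0, i - half): i + half + 1]
        out.append(sum(seg) / len(seg))
    return out

def chunk_boundaries(envelope, win=5):
    """Return indices of interior local minima of the smoothed envelope."""
    env = smooth(envelope, win)
    return [i for i in range(1, len(env) - 1)
            if env[i] < env[i - 1] and env[i] <= env[i + 1]]

# Synthetic envelope with three energy "humps", each 20 frames long.
envelope = [math.sin(math.pi * t / 20) ** 2 for t in range(60)]
boundaries = chunk_boundaries(envelope)
print(boundaries)
```

On this toy input the boundaries land in the dips between humps, which is the basic intuition: energy peaks behave like nuclei and the valleys between them like edges. Real speech would need a proper envelope extraction step and some guard against spurious minima.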

What counts as a useful “syllable”?

The core result is that these acoustic chunks align reasonably well with annotated syllable boundaries across languages. That’s encouraging: it suggests learners could extract something syllable-like without prior linguistic knowledge.
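
"Align reasonably well" is usually quantified by matching predicted boundaries to annotated ones within a tolerance window. Here’s a minimal sketch of that kind of scoring; the 50 ms tolerance, the greedy one-to-one matching, and the example boundary times are assumptions of mine, not the paper’s evaluation protocol.

```python
def boundary_scores(predicted, reference, tol=0.05):
    """Precision, recall, and F-score for boundaries (in seconds).

    Each reference boundary can be matched by at most one prediction,
    using simple greedy matching within +/- tol.
    """
    unmatched = sorted(reference)
    hits = 0
    for p in sorted(predicted):
        match = next((r for r in unmatched if abs(r - p) <= tol), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_score

# Hypothetical boundary times for one utterance.
predicted = [0.21, 0.48, 0.85]
reference = [0.20, 0.50, 0.70, 0.95]
precision, recall, f_score = boundary_scores(predicted, reference)
print(precision, recall, f_score)
```

A score like this measures agreement with adult-style annotation, so it is exactly the kind of metric that can look mediocre even when the units are perfectly usable for a downstream learner.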

But… do we actually need a match to adult-like syllables at this stage of acquisition? The goal is to use syllables as input to other processes (like word segmentation). So, to me, the relevant question becomes: are these units good enough for the tasks syllables are supposed to support?

I would love to see a downstream test in future work. For example: take these acoustically-derived units and feed them into a word segmentation model. Do we still get reasonable performance? Does it degrade relative to idealized syllables? Or is it surprisingly robust—maybe especially for infant-directed speech like that in the Brent corpus?

For me, that’s the next step: not just approximating syllables, but testing whether those approximations are functionally adequate. This paper lays excellent groundwork for asking that question concretely.