I really like seeing this kind of model comparison work, as computational models like this encode specific theories of a developmental process (here, how language-specific sound contrasts get learned). I think we see a lot of good practices demonstrated in this paper when it comes to this approach, especially when borrowing models from the NLP world: using naturalistic data, explicitly highlighting the model distinctions and what they mean in terms of representation and learning mechanism, comparing model output to observable behavioral data (more on this below), and generating testable behavioral predictions that will distinguish between the currently winning models.
Specific thoughts:
(1) Comparing model output to observable behavior: I love that M&al2020 do this with their models, especially since most previous models tried to learn unobservable, theoretically motivated representations. This is so useful. If you want the model’s target to be an unobserved knowledge state (like phonetic categories), you’re going to have a fight with the people who care about that knowledge representation level -- namely, is your target knowledge the right form? If instead you make the model’s target some observable behavior, then no one can argue with the target. The behavior is an empirical fact, and your model either can generate it or not. That saves a lot of angst on the modeling side and makes for far more convincing results. Bonus: You can then peek inside the model to see what representation it used to generate the observed behavior, and potentially inform the debates about what representation is the right one.
(2) Simulating the ABX task results: So, this seemed a little subtle to me, which is why I want to spell out what I understood (which may well not be quite right). Model performance is calculated by how many individual stimuli the model gets right -- for instance, 0% correct would mean getting every stimulus wrong, 50% is chance performance (i.e., no discrimination), and 100% is perfect discrimination. I guess maybe this deals with the discrimination threshold issue (i.e., how you know if a given stimulus pair is actually different enough to be discriminated) by just treating each stimulus as a probabilistic draw from a distribution? That is, highly overlapping distributions mean the A-X distance is often about the same as the B-X distance, and so this works out to no discrimination... I think I need to think this through with the collective a little. It feels like the model’s representation of a stimulus is a draw from a distribution over possible representations, and then that’s what gets translated into the error rate. So, if you get enough stimuli, you get enough draws, and that means the aggregate error rate captures the true degree of separation for these representations. I think?
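To check my own intuition about (2), here is a toy simulation. It is only a sketch under loud assumptions and not M&al2020's actual procedure: each stimulus representation is a single number drawn from a Gaussian, distance is just absolute difference, and the function name and all parameter values are made up for illustration.

```python
import random


def abx_error_rate(mean_a, mean_b, sd, n_trials=20000, seed=0):
    """Estimate the ABX error rate for two toy categories.

    Assumption (not from the paper): each representation is a 1-D
    Gaussian draw, and distance is the absolute difference.
    """
    rng = random.Random(seed)
    errors = 0.0
    for _ in range(n_trials):
        a = rng.gauss(mean_a, sd)  # token A from category a
        b = rng.gauss(mean_b, sd)  # token B from category b
        x = rng.gauss(mean_a, sd)  # X is a fresh token of A's category
        d_ax = abs(a - x)
        d_bx = abs(b - x)
        if d_ax > d_bx:
            errors += 1.0          # X looked closer to the wrong category
        elif d_ax == d_bx:
            errors += 0.5          # split exact ties
    return errors / n_trials


# Identical distributions: A-X and B-X distances look alike.
overlapping = abx_error_rate(0.0, 0.0, 1.0)
# Means ten standard deviations apart: almost no confusions.
separated = abx_error_rate(0.0, 10.0, 1.0)
print(overlapping, separated)
```

With fully overlapping distributions the error rate should hover around 50% (chance), and with well-separated ones it should drop toward 0%, which is the "aggregate error rate tracks separation" idea, at least in this simplified setup.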
(3) On the weak word-level supervision: This turns out to mean recognizing that different tokens of a word are in fact the same word form. That’s not crazy from an acquisition perspective -- meaning could help determine that the same lexical item was used in context (e.g., “kitty” one time and “kitty” another time when pointing at the family pet).
(4) Cognitive plausibility of the models: So what strikes me about the RNN models is that they’re clearly coming from the engineering side of the world -- I don’t know if we have evidence that humans do this forced encoding-decoding process. It doesn’t seem impossible (after all, we have memory and attention bottlenecks galore, especially as children), but I just don’t know if anyone’s mapped these autoencoder-style implementations to the cognitive computations we think kids are doing. So, even though the word-level supervision part of the correspondence RNNs seems reasonable, I have no idea about the other parts of the RNNs. Contrast this with the Dirichlet process Gaussian mixture model -- this kind of generative model is easy to map to a cognitive process of categorization, and the computation carried out by the MCMC sampling can be approximated by humans (or so it seems).
(5) Model input representations: MFCCs from 25 ms frames are used. M&al2020 say this is grounded in human auditory processing. This is news to me! I had thought MFCCs were something that NLP had found worked, but that we didn’t really know about links to human auditory perception. Wikipedia says the mel (M) part is what’s connected to human auditory processing, in that spacing the frequency bands on the mel scale is what approximates the human auditory response. But the rest of the process of getting MFCCs from the acoustic input, who knows? This contrasts with using something like phonetic features, which certainly seems to be more like our conscious perception of what’s in the acoustic signal.
Still, M&al2020 then use speech alignments that map chunks of speech to corresponding phones. So, I think that the alignment process on the MFCCs yields something more like what linguistic theory bases things on, namely phones that would be aggregated together into phonetic categories.
Related thought, from the conclusion: “models learning representations directly from unsegmented natural speech can correctly predict some of the infant phone discrimination data”. Notably, there’s the transformation into MFCCs and speech alignment into phones, so the unit of representation is something more like phones, right? (Or whole words of MFCCs for the (C)AE-RNN models?) So should we take away something about what the infant unit of speech perception is from there, or not? I guess I can’t tell if the MFCC transformation and phone alignment are meant as an algorithmic-level description of how infants would get their phone-like/word-like representations, or if instead it’s a computational-level implementation where we think infants get phone-like/word-like representations out, but infants need to approximate the computation performed here.
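To make the mel part of (5) a bit more concrete for myself: below is a small sketch of what "bands spaced by mel" looks like. It uses one commonly cited Hz-to-mel formula (2595 * log10(1 + f/700)); I don't know whether M&al2020's pipeline uses exactly this variant, and the frequency range and number of bands here are arbitrary illustration values, not the paper's settings.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mel with one commonly used formula (assumed variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(low_hz, high_hz, n_bands):
    """Center frequencies (Hz) of n_bands bands equally spaced in mel."""
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_bands + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_bands)]


# Hypothetical settings: 10 bands between 0 and 8000 Hz.
centers = mel_band_centers(0.0, 8000.0, 10)
# Gaps between neighboring centers, in Hz.
gaps = [hi - lo for lo, hi in zip(centers, centers[1:])]
print([round(c) for c in centers])
print([round(g) for g in gaps])
```

The bands are evenly spaced in mel but get wider and wider in Hz, so low frequencies get finer resolution than high ones. That is the piece that is supposed to mirror human hearing; the sketch says nothing about the remaining MFCC steps.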
(6) Data sparseness: Blaming data sparseness for no model getting the Catalan contrast doesn’t seem crazy to me. Around 8 minutes of Catalan training data (if I’m reading Table 3 correctly) isn’t a lot. If I’m reading Table 3 incorrectly, and it’s actually under 8 hours of Catalan training data, that still isn’t a lot. I mean, we’re talking less than a day’s worth of input for a child, even if this is in hours.
(7) Predictions for novel sound contrasts: I really appreciate seeing these predictions, and the brief discussion of what the differences are (i.e., the CAE-RNN is better for differences in length, while the DPGMM is better for ones that observably differ in short time slices). What I don’t know is what to make of that -- and presumably M&al2020 didn’t either. They did their best to hook these findings into what’s known about human speech perception (i.e., certain contrasts, like those involving /θ/, are harder for human listeners and are harder for the CAE-RNN too), but the general distinction of length vs. observable short time chunks is unexplained. The only infant data to hook back into is whether certain contrasts are realized earlier than others, but the Catalan one was the earlier one at 8 months, and no model got that.