Tuesday, November 17, 2020

Some thoughts on Matusevych et al. 2020

I really like seeing this kind of model comparison work, as computational models like this encode specific theories of a developmental process (here, how language-specific sound contrasts get learned). I think we see a lot of good practices demonstrated in this paper when it comes to this approach, especially when borrowing models from the NLP world: using naturalistic data, explicitly highlighting the model distinctions and what they mean in terms of representation and learning mechanism, comparing model output to observable behavioral data (more on this below), and generating testable behavioral predictions that will distinguish currently-winning models. 


Specific thoughts:

(1) Comparing model output to observable behavior: I love that M&al2020 do this with their models, especially since most previous models tried to learn unobservable, theoretically-motivated representations. This is so useful. If you want the model’s target to be an unobserved knowledge state (like phonetic categories), you’re going to have a fight with the people who care about that knowledge representation level -- namely, over whether your target knowledge is the right form. If instead you make the model’s target some observable behavior, then no one can argue with you. The behavior is an empirical fact, and your model either can generate it or not. It saves much angst on the modeling side, and makes for far more convincing results. Bonus: You can then peek inside the model to see what representation it used to generate the observed behavior, and potentially inform the debates about which representation is the right one.


(2) Simulating the ABX task results: So, this seemed a little subtle to me, which is why I want to spell out what I understood (which may well be not quite right). Model performance is calculated by how many individual stimuli the model gets right -- for instance, 0% = no discrimination, 50% = chance performance, and 100% = perfect discrimination. I guess maybe this deals with the discrimination threshold issue (i.e., how you know if a given stimulus pair is actually different enough to be discriminated) by just treating each stimulus as a probabilistic draw from a distribution? That is, highly overlapping distributions mean A-X is often judged the same as B-X, and so this works out to no discrimination... I think I need to think this through with the collective a little. It feels like the model’s representation is a draw from a distribution over possible representations, and that’s what gets translated into the error rate. So, if you get enough stimuli, you get enough draws, and the aggregate error rate captures the true degree of separation between these representations. I think?
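
To make my own understanding concrete, here’s a minimal sketch of how ABX discrimination over model representations might be computed. The fixed-length vector encoding and the cosine distance are my assumptions for illustration -- not necessarily the exact setup M&al2020 use.

```python
# A toy ABX calculation: assumes each stimulus token has been encoded
# by the model as a fixed-length vector; the distance metric is my choice.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_accuracy(a_tokens, b_tokens, dist=cosine_distance):
    """For every (A, B, X) triple with X drawn from A's category,
    count the model correct if X sits closer to A than to B.
    (Needs at least two tokens in a_tokens.)"""
    correct, total = 0, 0
    for i, x in enumerate(a_tokens):
        for j, a in enumerate(a_tokens):
            if i == j:          # X must be a different token than A
                continue
            for b in b_tokens:
                correct += dist(a, x) < dist(b, x)
                total += 1
    return correct / total      # 0.5 ~ chance, 1.0 ~ perfect discrimination
```

Under this setup, the draws-from-a-distribution intuition falls out naturally: if the two categories’ representations overlap heavily, X is about as likely to land near B as near A, and accuracy converges to chance as the number of triples grows.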


(3) On the weak word-level supervision: This turns out to be recognizing that tokens of a word are in fact the same word form. That’s not crazy from an acquisition perspective -- meaning could help determine that the same lexical item was used in context (e.g., “kitty” one time and “kitty” another time when pointing at the family pet).
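
As a data structure, that weak supervision signal really is just pairs of tokens known to share a word form, with no phone-level labels at all. A tiny sketch (the token arrays standing in for MFCC sequences are made up):

```python
# Weak word-level supervision as data: same-word token pairs, no phone labels.
import numpy as np

rng = np.random.default_rng(0)
tokens = {
    "kitty": [rng.normal(size=(40, 13)), rng.normal(size=(52, 13))],  # 2 tokens
    "ball":  [rng.normal(size=(35, 13))],                             # 1 token
}

# Correspondence-style training pairs: (one token, another token, same word)
pairs = [(a, b)
         for toks in tokens.values()
         for a in toks for b in toks if a is not b]
print(len(pairs))   # only "kitty" contributes pairs here (2 ordered pairs)
```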


(4) Cognitive plausibility of the models: So what strikes me about the RNN models is that they’re clearly coming from the engineering side of the world -- I don’t know if we have evidence that humans do this forced encoding-decoding process. It doesn’t seem impossible (after all, we have memory and attention bottlenecks galore, especially as children), but I just don’t know if anyone’s mapped these autoencoder-style implementations to the cognitive computations we think kids are doing. So, even though the word-level supervision part of the correspondence RNNs seems reasonable, I have no idea about the other parts of the RNNs. Contrast this with the Dirichlet process Gaussian mixture model -- this kind of generative model is easy to map to a cognitive process of categorization, and the computation carried out by the MCMC sampling can be approximated by humans (or so it seems).
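
For comparison, here’s roughly what that generative-model story looks like in code -- a minimal sketch using scikit-learn’s variational approximation to a Dirichlet process GMM (M&al2020 use MCMC sampling instead, and my toy “frames” here are random stand-ins for real acoustic features):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 13))        # stand-in for MFCC frames

dpgmm = BayesianGaussianMixture(
    n_components=50,                        # a truncation level, not a fixed K
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    random_state=0,
).fit(frames)

labels = dpgmm.predict(frames)              # each frame's inferred category
print("categories actually used:", len(np.unique(labels)))
```

The cognitive mapping is straightforward: frames are observations, mixture components are candidate phonetic categories, and the learner infers how many categories the data actually support.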


(5) Model input representations: MFCCs from 25ms-long frames are used. M&al2020 say this is grounded in human auditory processing. This is news to me! I had thought MFCCs were something that NLP had found worked, but that we didn’t really know about links to human auditory perception. Wikipedia says the mel (M) part is what’s connected to human auditory processing, in that spacing the frequency bands by “mel” is what approximates the human auditory response. But as for the rest of the process of getting MFCCs from the acoustic input, who knows? This contrasts with using something like phonetic features, which certainly seem closer to our conscious perception of what’s in the acoustic signal.
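
For concreteness, here’s roughly what that front-end looks like -- a minimal sketch with librosa, where the specific parameter values (13 coefficients, 10ms hop) are standard choices of mine, not necessarily M&al2020’s:

```python
import librosa

# "utterance.wav" is a hypothetical input file
y, sr = librosa.load("utterance.wav", sr=16000)
mfccs = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                   # number of cepstral coefficients kept
    n_fft=int(0.025 * sr),       # 25ms analysis window = 400 samples at 16kHz
    hop_length=int(0.010 * sr),  # 10ms hop between frames
)
print(mfccs.shape)               # (13, n_frames)
```

The mel-spaced filterbank is the step with the auditory grounding; the later steps (log compression, discrete cosine transform) are the “who knows?” part.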


Still, M&al2020 then use speech alignments that map chunks of speech to corresponding phones. So, I think that the alignment process on the MFCCs yields something more like what linguistic theory bases things on, namely phones that would be aggregated together into phonetic categories.


Related thought, from the conclusion: “models learning representations directly from unsegmented natural speech can correctly predict some of the infant phone discrimination data”. Notably, there’s the transformation into MFCCs and the speech alignment into phones, so the unit of representation is something more like phones, right? (Or whole words of MFCCs for the (C)AE-RNN models?) So should we take away something about what the infant unit of speech perception is from that, or not? I guess I can’t tell if the MFCC transformation and phone alignment are meant as an algorithmic-level description of how infants would get their phone-like/word-like representations, or if instead this is a computational-level implementation, where we think infants get phone-like/word-like representations out but need to approximate the computation performed here.


(6) Data sparseness: Blaming data sparseness for no model getting the Catalan contrast doesn’t seem crazy to me. Around 8 minutes of Catalan training data (if I’m reading Table 3 correctly) isn’t a lot. If I’m reading Table 3 incorrectly, and it’s actually under 8 hours of Catalan training data, that still isn’t a lot. I mean, we’re talking less than a day’s worth of input for a child, even if this is in hours.


(7) Predictions for novel sound contrasts: I really appreciate seeing these predictions, and the brief discussion of what the differences are (i.e., the CAE-RNN is better for differences in length, while the DPGMM is better for contrasts that observably differ in short time slices). What I don’t know is what to make of that -- and presumably M&al2020 didn’t either. They did their best to hook these findings into what’s known about human speech perception (i.e., certain contrasts involving /θ/ are harder for human listeners, and are harder for the CAE-RNN too), but the general distinction of length vs. observable short time chunks is unexplained. The only infant data to hook back into is whether certain contrasts are acquired earlier than others, but the Catalan one was the earlier one at 8 months, and no model got that.


Tuesday, November 3, 2020

Some thoughts on Fourtassi et al. 2020

It’s really nice to see a computational cognitive model both (i) capture previously-observed human behavior (here, that of very young children in a specific word-learning experimental task), and (ii) make new, testable predictions that the authors then test in order to validate the developmental theory implemented in the model. What’s particularly nice (in my opinion) about the specific new prediction made here is that it seems so intuitive in hindsight -- of *course* noisiness in the representation of the referent (here: how distinct the objects are from each other) could impact the downstream behavior being measured, since it matters for generating that behavior. But it sure wasn’t obvious to me before seeing the model, and I was fairly familiar with this particular debate and set of studies. That’s the thing about good insights, though -- they’re often obvious in hindsight, but you don’t notice them until someone explicitly points them out. So, this computational cognitive model, by concretely implementing the different factors that lead to the behavior being measured, highlighted that there’s a new factor that should be considered to explain children’s non-adult-like behavior. (Yay, modeling!)


Other thoughts:

(1) Qualitative vs. quantitative developmental change: It certainly seems difficult (currently) to capture qualitative change in computational cognitive models. One of the biggest issues is how to capture qualitative “conceptual” change in, say, a Bayesian model of development. At the moment, the best I’m aware of is implementing models that themselves individually have qualitative differences and then doing model comparison to see which best captures child behavior. But that’s about snapshots of the child’s state, not about how qualitative change happens. Ideally, what we’d like is a way to define building blocks that allow us to construct “novel” hypotheses from their combination...but then qualitative change is about adding a completely new building block. And where does that come from?


Relatedly, having continuous change (“quantitative development”) is certainly in line with the Continuity Hypothesis in developmental linguistics. Under that hypothesis, kids are just navigating through pre-defined options (that adult languages happen to use), rather than positing completely new options (which would be a discontinuous, qualitative change). 



(2) Model implementation: F&al2020 assume an unambiguous 1-1 mapping between concepts and labels, meaning that the child has learned these mappings completely correctly in the experimental setup. Given the age of the original children (14 months, and actually 8 months too), this seems like a simplification. But it’s not an unreasonable one -- importantly, if the behavioral effects can be captured without making the model more complicated, then that’s good to know. It means the main things that matter don’t include this assumption about how well children learn the labels and mappings in the experimental setup.
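
Here’s a toy sketch of how that setup could work once the 1-1 mapping is assumed known. The uniform confusion parameters epsilon_s (sound) and epsilon_v (visual) are my simplification of the fuzziness idea, not F&al2020’s exact model:

```python
import numpy as np

def choice_probs(epsilon_s, epsilon_v):
    """Two minimal-pair labels, two referents, 1-1 mapping assumed learned.
    epsilon_s: prob. the heard label is confused with its minimal pair;
    epsilon_v: prob. the two referents are confused with each other.
    Returns (P(choose referent 0), P(choose referent 1)) given label 0."""
    p_label = np.array([1 - epsilon_s, epsilon_s])        # which label was heard?
    p_referent = np.array([[1 - epsilon_v, epsilon_v],    # which object is which?
                           [epsilon_v, 1 - epsilon_v]])
    return p_label @ p_referent

print(choice_probs(epsilon_s=0.1, epsilon_v=0.0))  # sound fuzziness only
print(choice_probs(epsilon_s=0.1, epsilon_v=0.2))  # both fuzzy: nearer chance
```

Even this toy version shows the paper’s key move: referent fuzziness pushes choices toward chance all on its own, independent of anything on the sound side.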


(3) Model validation with kids and adults: Of course we can quibble with the developmental difference between a 4-year-old and a 14-month-old when it comes to their perception of the sounds that make up words and of referent distinctiveness. But as a proof of concept to show that visual salience matters, I think this is a reasonable first step. A great follow-up is to actually run the experiment with 14-month-olds and vary the visual salience in just the same way, as alluded to in the general discussion.


(4) Figure 6: Model 2 (sound fuzziness = visual referent fuzziness) is pretty good at matching kids and adults, but Model 3 (sound fuzziness isn’t the same amount as visual referent fuzziness) is a little better. I wonder, though: is Model 3 enough better to justify the additional model complexity? Model 2 accounting for 0.96 of the variance seems pretty darned good.
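
One standard way to ask the “enough better?” question is with a complexity-penalized criterion like AIC. A toy sketch, where the log-likelihoods are placeholders rather than F&al2020’s actual numbers:

```python
def aic(log_likelihood, n_params):
    # Akaike Information Criterion: lower is better
    return 2 * n_params - 2 * log_likelihood

# Model 2: one shared fuzziness parameter;
# Model 3: separate sound and visual fuzziness (one extra parameter).
aic_m2 = aic(log_likelihood=-120.0, n_params=1)   # hypothetical fit
aic_m3 = aic(log_likelihood=-119.2, n_params=2)   # hypothetical fit

# Model 3 has to improve the fit by enough to pay for its extra
# parameter; otherwise the simpler Model 2 wins.
print("prefer:", "Model 2" if aic_m2 <= aic_m3 else "Model 3")
```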


So, suppose we say that Model 2 is actually the best, once we take model complexity into account. The implication is interesting -- perceptual fuzziness, broadly construed, is what’s going on, whether that fuzziness is over auditory stimuli or visual stimuli (or over categorizations based on those auditory and visual stimuli, like phonetic categories and object categories). This contrasts with domain-specific fuzziness, where auditory stimuli have their fuzziness and visual stimuli have a different fuzziness (i.e., Model 3). So, if this is what’s happening, would this be more in line with some common underlying factor that feeds into perception, like memory or attention?


F&al2020 are very careful to note that their model doesn’t say why the fuzziness goes away, just that it goes away as kids get older. But I wonder...


(5) On minimal pairs for learning: I think another takeaway of this paper is that minimal pairs in visual stimuli -- just like minimal pairs in auditory stimuli -- are unlikely to be helpful for young learners. This is because young kids may miss that there are two distinct things at all: two word forms that need different meanings, or two visual referents that need different labels. Potential practical advice with babies: Don’t try to point out tiny contrasts (auditory or visual) to make your point that two things are different. That’ll work better for adults (and older children).


(6) A subtle point that I really appreciated being walked through: F&al2020 note that just because their model predicts that kids have higher sound uncertainty than adults, that doesn’t mean the model goes against previous accounts showing that children are good at encoding fine phonetic detail. Instead, the issue may be about what kids think counts as a categorical distinction (i.e., how kids choose to view that fine phonetic detail) -- so, the sound uncertainty could come from downstream processing of phonetic detail that’s been encoded just fine.