Tuesday, January 25, 2022

Some thoughts on van der Slik et al. 2021

I really appreciate the thoughtfulness that went into the reanalysis of the original Hartshorne et al. 2018 data on second language acquisition and a potential critical/sensitive period. What struck me (more on this below) was the subtlety of the distinction that van der Slik et al. 2021 were really looking at: it’s not really “critical period” vs. no critical period, but rather a sensitive period where some language ability is equal before a certain point vs. not. In particular, both the discontinuous (=sensitive period) and continuous (=no sensitive period) approaches assume a dropoff at some point, and that dropoff is steeper at some points than others (hence the S-shaped curve). So the existence of a dropoff isn’t really in dispute. Instead, the question is whether abilities before that dropoff point are equal (and in fact equal to native ability, which would be a sensitive period) or not. To me, this is certainly interesting, but the big picture remains that there’s a steeper dropoff after some predictable point, and it’s useful to know when that point is.



Specific thoughts:

(1) A bit more on the discontinuous vs. continuous models, and sensitive periods vs. not: I totally sympathize with the idea that a continuous sigmoidal function is the more parsimonious explanation for the available data, especially given the plausibility of external factors (i.e., non-biological factors like schooling) for the non-immersion learners. So, turning back to the idea of a critical/sensitive period, we still get a big dropoff in rate of learning, and if the slope is steep enough at the initial onset of the S-curve, it probably looks pretty stark. Is the big difference between that and a canonical sensitive period simply that the time before the dropoff isn’t all the same? That is, for a canonical sensitive period, all ages before the cutoff are the same. In contrast, for the continuous sigmoidal curve, all ages before the point of accelerated dropoff are mostly the same, but there may be small differences the older you are (see the sketch below). If that’s the takeaway, then great; we just have to be more nuanced about how we define what happens before the “cutoff” point. But the fact that a younger brain is better (broadly speaking) is true in either case.
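To make the contrast concrete for myself, here’s a minimal sketch of the two shapes as I understand them (the functional forms and parameter names are my own toy versions, not the ones fitted in the paper):

    import numpy as np

    def continuous_rate(age, r0, mid, width):
        # One smooth sigmoid: even before the steep dropoff, older
        # learners are already (slightly) below younger ones.
        return r0 / (1.0 + np.exp((np.asarray(age, dtype=float) - mid) / width))

    def discontinuous_rate(age, r0, t_c, width):
        # Sensitive-period version: learning rate is exactly flat up to
        # the cutoff t_c, then declines sigmoidally starting at t_c.
        age = np.asarray(age, dtype=float)
        decline = r0 * 2.0 / (1.0 + np.exp((age - t_c) / width))
        return np.where(age <= t_c, r0, decline)

Both curves end up with the same steep dropoff; the only difference is whether the pre-dropoff region is perfectly flat or gently sloping.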


(2) L1 vs. L2 sensitive periods: It’s a good point that these may in fact be different (missing the L1 cutoff seems more catastrophic). This difference seems to call into question how much we can infer about a critical/sensitive period for L1 acquisition on the basis of L2 acquisition. Later results in this paper suggest a qualitative split: for early immersion learners (<10 years old), bilinguals, and monolinguals (L1), the best-fitting model is continuous with a sigmoidal dropoff, while for later immersion learners it’s discontinuous, with a constant rate followed by a sigmoidal dropoff. So maybe we can extrapolate from L2 to L1, provided we look at the right set of L2 learners (i.e., early immersion learners). And certainly we can learn useful things about L2 critical/sensitive periods.


(3) AIC score interpretation: I think I need more of a primer on this, as I was pretty confused about how to interpret these scores. I had thought that a negative score closer to 0 is better because the measure is based on log likelihood, and closer to 0 means a “smaller” negative, which is a higher probability. Various googling suggests that the lowest score overall is better, but I don’t understand how you get a negative number in the first place if you’re subtracting the log likelihood. That is, you’re subtracting a negative number (because likelihoods are small probabilities often much less than 1), which is equivalent to adding a positive number. So, I would have expected these scores to be positive numbers.
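For my own future reference, the standard definition is AIC = 2k − 2·ln(L̂), where k is the number of parameters and L̂ is the maximized likelihood. A quick sanity-check snippet (the numbers are made up) showing how negative scores can happen:

    def aic(k, log_likelihood):
        # Standard definition: AIC = 2k - 2*ln(L_hat); lower is better.
        return 2 * k - 2 * log_likelihood

    # Discrete data: likelihood <= 1, so ln(L_hat) <= 0 and AIC > 0.
    print(aic(k=3, log_likelihood=-120.5))  # 247.0

    # Continuous data: the likelihood is a *density*, which can exceed 1,
    # so ln(L_hat) can be positive and AIC can come out negative.
    print(aic(k=3, log_likelihood=400.0))   # -794.0

If that’s right, then my confusion was assuming likelihoods are always probabilities below 1; with densities, a very negative AIC is still just “lowest wins”.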


Thursday, January 13, 2022

Some thoughts on Hu et al. 2021

It’s a nice change of pace for me to take a look at pragmatic modeling work more from the engineering/NLP side of the world (rather than the purely cognitive side), as I think this paper does. That said, I wonder if some of the specific techniques used here, such as the training of the initial context-free lexicon, might be useful for thinking about how humans represent meaning (especially meaning that feeds into pragmatic reasoning).


I admit, I also would have benefited from the authors having more space to explain their approach in different places (more on this below). For instance, the intuition of self-supervised vs. regular supervised learning is something I get, but the specific implementation of the self-supervised approach (in particular, why it counts as self-supervised) was a little hard for me to follow.


Specific thoughts:

(1) H&al2021 describe a two-step learning process, where the first step is learning a lexicon without “contextual supervision”. It sounds like this is a “context-free” lexicon, like the L0 level of RSA, which typically involves the semantic representation only (a toy sketch of what I mean is below). Though I do wonder how “context-free” the basic semantic representations actually are (e.g., they may incorporate the linguistic contexts words appear in), to be honest. But I suppose the main distinction is that no intentions or social information are involved.
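Here’s the kind of context-free L0 I have in mind, as a toy sketch (the lexicon, names, and values are mine, not H&al2021’s):

    import numpy as np

    # Toy Boolean lexicon: rows = utterances, columns = referents.
    # This is the "context-free" part: just which words are true of
    # which referents, with no intentions or social information.
    utterances = ["blue", "teal"]
    referents = ["blue_chip", "teal_chip"]
    lexicon = np.array([[1.0, 1.0],   # "blue" is (loosely) true of both
                        [0.0, 1.0]])  # "teal" is true of the teal chip only

    def L0(utt_idx, prior=None):
        # Literal listener: P(referent | utterance) ∝ lexicon * prior.
        prior = np.ones(lexicon.shape[1]) if prior is None else prior
        scores = lexicon[utt_idx] * prior
        return scores / scores.sum()

    print(L0(0))  # "blue" -> [0.5, 0.5]: literally ambiguous
    print(L0(1))  # "teal" -> [0.0, 1.0]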


The second step is to learn “pragmatic policies” by optimizing an appropriate objective function without “human supervision”. I initially took this to mean unsupervised learning, but then H&al2021 clarified (e.g., in section 3) that they instead meant that certain types of information provided by humans aren’t included during training, which is useful from an engineering perspective because that kind of data can be costly to get. And so the learning gets the label “self-supervised”, from the standpoint of that withheld information.


(2) Section 4.3, on the self-supervised learning (SSL) pragmatic agents.


For the AM model that the RSA implementations use, H&al2021 say that they train the base-level agents with full contextual supervision and then “enrich” them with subsequent AM steps. I think I need this unpacked more. I follow what it means to train agents with full contextual supervision: in particular, include the contexts provided by the color triples. But I don’t understand what enriching the agents with AM steps afterwards means. How is that separate/different from the initial training process? Is the initial training not done via AM optimization? For the GD model, we see a similar process, with pragmatic enrichment done via GD steps rather than AM steps. It seems important to understand this, as the distinction is what gets the approach classified as self-supervised rather than fully supervised (my attempt at a sketch is below).
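My best guess at what the AM enrichment looks like, based on standard RSA-style iterated reasoning rather than on H&al2021’s actual implementation: you take the trained base listener and apply a few alternating renormalization steps, with no extra labeled data required (which would be why it counts as self-supervised). A toy sketch:

    import numpy as np

    def am_enrich(base_listener, n_steps=2, alpha=1.0):
        # base_listener: matrix of P(referent | utterance),
        # rows = utterances, columns = referents.
        # Alternate between a speaker step (normalize over utterances)
        # and a listener step (normalize over referents), RSA-style.
        L = np.asarray(base_listener, dtype=float)
        for _ in range(n_steps):
            S = L ** alpha
            S = S / S.sum(axis=0, keepdims=True)  # speaker step
            L = S / S.sum(axis=1, keepdims=True)  # listener step
        return L

If something like this is right, then the initial training and the enrichment really are separate: training fits the base agent to the supervised data once, and the AM steps are a post-hoc optimization applied on top of it.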


(3) For the GD approach, the listener model can train an utterance encoder and a color context encoder. But why wouldn’t a listener use decoders, since listening can intuitively be thought of as decoding? I guess decoding is just the inverse of encoding, so maybe it’s translatable?
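My guess at why encoders suffice: the listener never has to generate anything; it just embeds the utterance and each candidate color into a shared space and ranks the candidates by similarity. A hypothetical sketch (all names and numbers are mine):

    import numpy as np

    def listener_scores(utt_vec, color_vecs):
        # Encoder-only listener: score each candidate color by its
        # dot product with the utterance embedding, then softmax.
        # No decoder is needed because nothing gets generated; we
        # only rank a fixed set of candidate referents.
        logits = color_vecs @ utt_vec
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    utt = np.array([0.2, 0.9])      # toy utterance embedding
    colors = np.array([[0.1, 0.8],  # toy color-context embeddings
                       [0.9, 0.1],
                       [0.4, 0.5]])
    print(listener_scores(utt, colors))

So “decoding” in the intuitive sense is done by the scoring step, not by a decoder network.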


(4) I think I’m unclear on what “ground truth” is in Figure 2a, and why we’re interested in it if humans themselves don’t always match it. I would have thought the ground truth would be what humans do for this kind of pragmatic language use.