Wednesday, October 19, 2022

Some thoughts on Hitczenko & Feldman 2022

I love seeing work that evaluates an idea against naturalistic data. It’s often the exciting next “proof of concept” once you’ve got an implemented theory that works on idealized data or controlled experimental data.


Some other thoughts:

(1) I completely sympathize with the idea that anything from the broader context might be relevant for discriminating contrastive dimensions. I think the question then becomes how infants decide which contextual factors to pay attention to, out of all the possible ones. Are certain ones simply more salient, period, or salient because the infant brain has certain perceptual biases, etc.? What's the hypothesis space of possible contextual features, and how might an infant navigate through that hypothesis space?


(2) Thinking about noise: I wonder how much noise this kind of approach can tolerate. For instance (and this is a point H&F2022 bring up in the discussion), if infants have a fuzzier notion of distributional similarity than Earthmover's distance/KL divergence/whatever because of their developing learning abilities, can they still catch on to these distributional differences?


H&F2022 also implement some ideas about fuzzier (mis)perception of the input, which show that this approach can tolerate at least 20% noise in perception. So maybe someone could implement the fuzzier-distributional-similarity idea in a similar way.
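To make that concrete for myself, here's a toy sketch (entirely my own construction, not H&F2022's model, cue, or data) of the kind of check I have in mind: simulate a one-dimensional cue in two contexts, corrupt 20% of the tokens to mimic misperception, and see whether a distributional distance like Earthmover's distance still separates the contexts.

```python
# Toy sketch: does a distributional difference between two contexts survive
# when some proportion of tokens is misperceived? Assumes a single 1-D cue
# (say, a duration cue in ms); all numbers here are made up for illustration.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def sample_context(mean, sd, n=500):
    """Draw cue values for one context."""
    return rng.normal(mean, sd, n)

def misperceive(samples, noise_rate=0.2, shift_sd=30):
    """Corrupt a fraction of tokens to mimic imperfect infant perception."""
    corrupted = samples.copy()
    idx = rng.random(len(samples)) < noise_rate
    corrupted[idx] += rng.normal(0, shift_sd, idx.sum())
    return corrupted

context_a  = misperceive(sample_context(mean=120, sd=20))  # e.g., "short" context
context_b  = misperceive(sample_context(mean=160, sd=20))  # e.g., "long" context
context_a2 = misperceive(sample_context(mean=120, sd=20))  # second sample of context A

between = wasserstein_distance(context_a, context_b)   # contrastive contexts
within  = wasserstein_distance(context_a, context_a2)  # same context, resampled

print(f"between-context distance: {between:.1f}")
print(f"within-context distance:  {within:.1f}")
# If 'between' stays well above 'within' at 20% misperception, the
# distributional signal is still recoverable despite the noise.
```

A fuzzier notion of similarity could then just be swapped in for the distance function, to see how much extra sloppiness the contrast can survive.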


Tuesday, October 4, 2022

Some thoughts on Cao et al. 2022

I really like seeing modeling work like this where a more complex, ideal computation (here, EIG) can be well-approximated by a simpler, more heuristic computation (here, surprisal and KL divergence) when it comes to capturing developmental behavior. Of course, this paper is presenting a first-pass evaluation over adult behavior, but as the authors note, future work can extend their evaluation to infant looking behavior. I definitely would like to see how well this approach works for infant data, since I'd be surprised if there wasn't some immaturity (e.g., resource constraints or other biases) at work in the computation itself in infants, compared with adult decision-making. And then the interesting question is how to capture that immaturity – for instance, do the approximations work even better than the idealized EIG computation? Would even simpler heuristics that don't approximate EIG as well, but are also backward-looking rather than forward-looking, be better?
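To keep the three linking quantities straight in my own head, here's a toy side-by-side (my own grid-based Bernoulli learner; nothing here is Cao et al.'s actual RANCH implementation or stimuli): surprisal of the current observation, KL divergence from prior beliefs to updated beliefs, and EIG for the next observation.

```python
# Toy learner over a Bernoulli parameter theta, tracked on a discrete grid.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)    # grid over the Bernoulli parameter
prior = np.ones_like(theta) / len(theta)  # flat prior over the grid

def predictive(beliefs, y):
    """P(next observation = y) under current beliefs."""
    return np.sum(beliefs * (theta if y == 1 else 1 - theta))

def update(beliefs, y):
    """Posterior over theta after observing y in {0, 1}."""
    post = beliefs * (theta if y == 1 else 1 - theta)
    return post / post.sum()

def surprisal(beliefs, y):
    return -np.log(predictive(beliefs, y))

def kl(p, q):
    return np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))

def eig(beliefs):
    """Expected KL(posterior || current beliefs) over the next observation."""
    return sum(predictive(beliefs, y) * kl(update(beliefs, y), beliefs) for y in (0, 1))

beliefs = prior
for t, y in enumerate([1, 1, 1, 1, 0, 1, 1, 1]):   # made-up observation sequence
    s = surprisal(beliefs, y)
    new_beliefs = update(beliefs, y)
    print(f"t={t}: surprisal={s:.3f}  "
          f"KL={kl(new_beliefs, beliefs):.3f}  EIG(next)={eig(new_beliefs):.3f}")
    beliefs = new_beliefs
```

Surprisal and KL only need what has already been seen, while EIG requires averaging over hypothetical future observations, which is exactly the backward-looking vs. forward-looking contrast I mean above.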


Other specific thoughts:


(1) Noisy perception: It's really nice to see this worked into a developmental model, since – especially for infants – imperfect representations of stimuli seem like a plausible situation. That is, the "perceptual intake" into the learning system depends on immature knowledge and abilities, and is therefore different from the input signal that's out there in the world. (To be fair, the perceptual intake for adults is also different from the input signal out there in the world, and adults don't have immature knowledge and abilities. So children basically have to learn to be adult-like in how they "skew" the input signal.)


(2) The RANCH model involves accumulating noisy samples and choosing what to do at each moment. This sounds like the diffusion model of decision-making from mathematical psych to me. I wonder if RANCH is an implementation of that (and if not, how they differ)?
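For my own reference, here's the classic accumulate-to-bound setup I have in mind when I say "diffusion model" (textbook form with toy parameters; this is not RANCH):

```python
# Drift-diffusion sketch: accumulate noisy evidence until it hits a bound;
# the hitting time is the predicted response (or looking) time.
import numpy as np

rng = np.random.default_rng(0)

def diffusion_trial(drift=0.3, bound=1.0, dt=0.01, noise_sd=1.0, max_steps=10_000):
    """Return (decision, time) for one accumulate-to-bound trial."""
    evidence = 0.0
    for step in range(1, max_steps + 1):
        evidence += drift * dt + noise_sd * np.sqrt(dt) * rng.standard_normal()
        if abs(evidence) >= bound:
            return ("upper" if evidence > 0 else "lower"), step * dt
    return "no decision", max_steps * dt

decisions, times = zip(*(diffusion_trial() for _ in range(1000)))
print("mean decision time:", np.mean(times))
print("p(upper bound):", np.mean([d == "upper" for d in decisions]))
```

The question is whether RANCH's moment-by-moment choice (keep sampling vs. look away) ends up equivalent to a hitting-time process like this, or whether the Bayesian belief update over noisy samples makes it importantly different.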


(3) What the learner needs to know: A key idea here is that the motivation to sample the input at all is because the learner knows perception is noisy. To me, this is pretty reasonable knowledge to build into a modeled child. It reminds me of Perkins et al. 2022 where the learner knows misperception occurs, and so has to learn to filter out erroneous data. Importantly there, the modeled learner doesn’t have to know the specifics beyond that.


Perkins, L., Feldman, N. H., & Lidz, J. (2022). The Power of Ignoring: Filtering Input for Argument Structure Acquisition. Cognitive Science, 46(1), e13080.

Friday, February 11, 2022

Some thoughts on Wilcox et al. 2021

This paper made me really happy because it involved careful thought about what was being investigated, offered an accessible intuition for how each model works, and was clear about what the selected models can and can't tell us, how the models should be evaluated, how to sensibly interpret the model results, and why we should care. Of course, a lot of things occurred to me as I was reading (more on this below), but this is probably one of the few papers I've read recently using neural net models that I care about, as a developmental linguist who does cognitive modeling. Thanks, authors!


Specific thoughts:

(1) Poverty of the stimulus vs. the argument from poverty of the stimulus (i.e., viable solutions to poverty of the stimulus): I think it's useful to really separate these two ideas. Poverty of the stimulus is about whether the data are actually compatible with multiple generalizations. This seems to be true for learning constraints on filler-gap dependencies (though this assertion depends on the data considered relevant in the input signal, which is why it's important to be clear about what the input is). But the argument from poverty of the stimulus is about viable solutions, i.e., the biases that are built in to navigate the possibilities and converge on the right generalization.


The abstract wording focuses on poverty of the stimulus itself for syntactic islands, while the general discussion in 6.2. is clearly focusing on the (potential) viable solutions uncovered via the models explored in the paper. That is, the focus isn’t about whether there’s poverty of the stimulus for learning about islands, but rather what built-in stuff it would take to solve it. And that’s where the linguistic nativist vs. non-linguistic nativist/empiricist discussion comes in. I think this distinction between poverty of the stimulus itself and the argument from poverty of the stimulus gets mushed together a bit sometimes, so it can be helpful to note it explicitly. Still, the authors are very careful in 6.2. to talk about what they’re interested in as the argument from poverty of the stimulus, and not poverty of the stimulus itself.


(2) Introduction, Mapping out a "lower bound for learnability": I'm not quite sure I follow what this means: a lower bound in the sense of what's learnable from this kind of setup, I guess? So anything these models can't learn might still require a language-specific constraint?


Also, I’m not sure I quite follow the distinction between top-down vs bottom-up being made about constraints. Is it that top-down is explicitly defined and implemented, as opposed to bottom-up being an emerging thing from whatever was explicitly defined and implemented? But if so, isn’t that more of an implementational-level distinction, rather than a core aspect of the definition (=computational-level) of the constraint? That is, the bottom-up thing could be explicitly defined, if only we understood better how the explicitly defined things caused it to emerge?


(3) The "psycholinguistics paradigm" for model assessment: I really like this approach, precisely because it doesn't commit you to a theory-specific internal representation. In general, this is a huge plus for evaluating models against observable behavior. Even if you do use an internal representation (and someone happens not to like it), you can still say that whatever's going on inside the model can yield human behavior, so it must have something human-like about it. The same is true for distributed/connectionist language models where it's hard to tell what the internal representations are, aside from being vectors of numbers.


(4) The expected superadditive pattern when both the filler and gap are present: Why should this be superadditive, instead of just additive? What extra thing is happening to make the presence of both yield a superadditive pattern? I have the same question once we get to island stimuli, too, where the factors are filler presence, gap presence, and island structure presence. 
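To spell out the arithmetic as I understand the design (with entirely made-up surprisal values, not the paper's numbers), additivity vs. superadditivity comes down to whether the 2x2 interaction term is zero:

```python
# Hypothetical surprisal values (bits) at the critical region; my toy numbers.
S = {
    ("-filler", "-gap"): 10.0,
    ("-filler", "+gap"): 14.0,   # unlicensed gap: very surprising
    ("+filler", "-gap"): 12.5,   # unresolved filler: somewhat surprising
    ("+filler", "+gap"): 9.0,    # licensed gap: least surprising
}

filler_effect_without_gap = S[("+filler", "-gap")] - S[("-filler", "-gap")]
filler_effect_with_gap    = S[("+filler", "+gap")] - S[("-filler", "+gap")]
interaction = filler_effect_with_gap - filler_effect_without_gap

print(f"filler effect, no gap: {filler_effect_without_gap:+.1f}")
print(f"filler effect, gap:    {filler_effect_with_gap:+.1f}")
# Purely additive effects would make the interaction zero; the expected
# "superadditive" licensing pattern is a nonzero (here, negative) interaction.
print(f"interaction:           {interaction:+.1f}")
```

So my question is really about why the presence of both filler and gap should buy extra probability beyond the sum of the two individual effects.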


(5) The domain-general property of the neural models: The neural models don't build in any bias for language-specific representations, but language-specific representations are in the hypothesis space. So, is it possible the best-fitting internal representations are language-specific? This would be similar to Bayesian approaches (e.g., Perfors et al. 2011) that allow the hypothesis space to include domain-general options, but where inference leads the learner to select language-specific options.


(6) The input: Just a quick note that the neural models here were trained on non-childlike input, both in terms of content (e.g., newswire text, Wikipedia) and quantity (though I do appreciate the legwork of estimating input quantity). This isn't a really big deal for the proof-of-concept goal here, but it starts to matter more for more targeted arguments about how children could learn various filler-gap knowledge so reliably from their experience. Of course, the authors are aware of this and explicitly discuss it right after they introduce the different models (thanks, authors!).


One thing that could be done: cross-check the input quantity with known ages of acquisition (e.g., Complex NP islands in English by age four; De Villiers et al. 2008). Since the authors say input quantity doesn't really affect their reported results anyway, this should be both easy to do and unlikely to change any major findings.


A second thing would be to train these models on child-directed speech samples and see if the results hold. The CHILDES database should have enough input samples for high-resource languages, and whatever limitations there might be in terms of sampling from multiple children at multiple ages from multiple backgrounds (and other variables), it seems like a step in the right direction that isn't too hard to take (though I guess that does depend on how hard it is to train these models).


(7) Proof-of-concept argument with these neural models: The fact that these models do struggle with issues of length and word frequency in non-human-like ways suggests that they might do other things (like learn about filler-gap dependencies) in non-human-like ways too. So we have to be careful about what kind of argument this proof-of-concept supports: it's a computational-level "is it possible at all" argument, rather than a computational-level "is it possible for humans, who have these known biases/limitations, etc." argument.


(8) N-grams always fail: Is this just because the 5-token window isn’t big enough, so there’s no hope of capturing dependencies that are longer? I expect so, but don’t remember the authors saying something explicitly like that.
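Here's the tiny illustration I have in mind (my own toy example sentence, not one from the paper):

```python
# A 5-gram conditions on only the previous 4 tokens, so by the time we reach
# the potential gap site, the filler is outside the window and the model
# literally cannot tell a licensed gap from an unlicensed one.
prefix = "I know what the chef in the kitchen prepared".split()
window = prefix[-4:]    # what a 5-gram sees when predicting the next word
print(window)           # ['in', 'the', 'kitchen', 'prepared'] -- no 'what' in sight
```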


(9) Figure 5: I want to better understand why inversion is an ok behavior (I’m looking at you, GRNN).  Does that mean that now a gap in matrix position with a licensing filler in the subject is more surprising than no gap in matrix position with no licensing filler in the subject? I guess that’s not too weird. Basically, GRNN doesn’t want gaps in places they shouldn’t be (which seems reminiscent of island restrictions, as islands are places where gaps shouldn’t be).


(10) One takeaway from the neural modeling results: Non-transformer models do better at generalizing. Do we think this is just due to overfitting the data (training input size, parameter count), or something else?


(11) Coordination islands: I know the text says all four neural models showed a significant reduction in wh-effects, so I guess the reductions must be significant between the control conditions and the first-conjunct gaps. But there seems to be a qualitative difference in the attenuation we see for a gap in the first conjunct vs. the second conjunct (and it holds for all four neural models). I wonder why that should be.


(12) Figure 10, checking my understanding: So, seeing no gap inside a control structure is sometimes less surprising than seeing no gap inside a left-branching structure…I think this may have to do with the weirdness of the control structures, if I'm following 14 correctly? In particular, the -gap control is "I know that you bought an expensive a car last week" and the -gap island is "I know how expensive you bought a car last week". This may come back to being more precise about surprisal expectations for control vs. island structures. Usually, control structures are fine (grammatical), but here they're not, and so that could interfere with the surprisal pattern we're looking for.


(13) Subject islands: It was helpful to get a quick explanation about why the GRNN didn’t do as well as the other neural models here (basically, not having a robust wh-effect for the control structures). A quick explanation of this type would be helpful for other cases where we see some neural models (seem to) fail, like the first conjunct for Coordination islands, and then Left Branch and Sentential Subject islands.


(14) Table 14: (just a shout out) Thank you so much, authors, for providing this. Unbelievably helpful summary.


(15) One takeaway the authors point out: If learning is about maximizing input data probability, then these neural approaches are similar to previous approaches that do this. In particular, maximizing input data probability corresponds to the likelihood component of any Bayesian learning approach, which seems sensible. Then, the difference is just about the prior part, which corresponds to the inductive biases built in.
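Schematically, the correspondence I have in mind (my paraphrase, not the authors' notation):

```latex
% Schematic only: maximizing input probability plays the role of the likelihood.
P(\text{grammar} \mid \text{input})
  \;\propto\;
  \underbrace{P(\text{input} \mid \text{grammar})}_{\substack{\text{likelihood:}\\ \text{what maximizing input probability targets}}}
  \times
  \underbrace{P(\text{grammar})}_{\substack{\text{prior:}\\ \text{the built-in inductive biases}}}
```

So two learners with the same likelihood term can still differ entirely in the prior term, which is where the inductive-bias debate lives.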


(16) General discussion: I'm not quite sure I follow why linguistic nativist biases would contrast with empiricist biases by a priori downweighting certain possibilities. Maybe this is another way of saying that a language-specific bias skews/limits the hypothesis space in a certain way only if it's a language-based hypothesis space, whereas a domain-general bias skews/limits the hypothesis space no matter what kind of hypothesis space it is? The particular domain-general bias of maximizing input probability of course can't operate a priori, since the learner needs to see the input data. But other kinds of domain-general biases seem like they could skew the hypothesis space a priori (e.g., the simplicity preference from Perfors et al. 2006).


(17) Another takeaway from the general discussion is that the learner doesn’t obviously need built-in language-specific biases to learn these island constraints. But I would love to know what abstract representations get built up in the best-performing neural models from this set, like JRNN. These are likely linguistic, as they’re word forms passed through a convolutional neural network (and therefore compressed somehow), and it would be great to know if they look like syntactic categories we recognize or something else. 


So, I’m totally on board with being able to navigate to the right knowledge in this case without needing language-specific (in contrast with domain-general) help. I just would love to know more about the intermediate representations, and what it takes to plausibly construct them (especially for small humans).


Tuesday, January 25, 2022

Some thoughts on van der Slik et al. 2021

I really appreciate the thoughtfulness that went into the reanalysis of the original Hartshorne et al. 2018 data on second language acquisition and a potential critical/sensitive period. What struck me (more on this below) was the subtlety of the distinction that van der Slik et al. 2021 were really looking at: I think it's not really "critical period" vs. not, but rather a sensitive period where some language ability is equal before a certain point vs. not. In particular, both the discontinuous (=sensitive period) and continuous (=no sensitive period) approaches assume a dropoff at some point, and that dropoff is steeper at some points than others (hence the S-shaped curve). So the fact that there is in fact a dropoff isn't really in dispute. Instead, the question is whether, before that dropoff point, abilities are equal (and in fact equal to native abilities, which is what a sensitive period would predict) or not. To me, this is certainly interesting, but the big picture remains that there's a steeper dropoff after some predictable point, and it's useful to know when that point is.



Specific thoughts:

(1) A bit more on the discontinuous vs. continuous models, and sensitive periods vs. not: I totally sympathize with the idea that a continuous sigmoidal function is the more parsimonious explanation for the available data, especially given the plausibility of external factors (i.e., non-biological factors like schooling) for the non-immersion learners. So, turning back to the idea of a critical/sensitive period, we still get a big dropoff in rate of learning, and if the slope is steep enough at the initial onset of the S-curve, it probably looks pretty stark. Is the big difference between that and a canonical sensitive period simply that the time before the dropoff isn’t all the same? That is, for a canonical sensitive period, all ages before the cutoff are the same. In contrast, for the continuous sigmoidal curve, all ages before the point of accelerated dropoff are mostly the same, but there may in fact be small differences the older you are. If that’s the takeaway, then great — we just have to be more nuanced in how we define what happens before the “cutoff” point. But the fact that a younger brain is better (broadly speaking) is true in either case.
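Schematically, the contrast I have in mind between the two model families looks something like this (my own sketch of the functional forms, not the paper's exact parameterization), with r(t) standing for learning rate as a function of age of onset t:

```latex
% Continuous ("no sensitive period"): one sigmoid throughout
r_{\text{cont}}(t) \;=\; \frac{r_0}{1 + e^{\,k\,(t - t_c)}}

% Discontinuous ("sensitive period"): flat up to a cutoff t_0, then the decline
r_{\text{disc}}(t) \;=\;
\begin{cases}
  r_0 & t \le t_0 \\[4pt]
  \dfrac{r_0}{1 + e^{\,k\,(t - t_c)}} & t > t_0
\end{cases}
```

Under the continuous form, the "flat" early region is just the top of the sigmoid, so small age differences before the accelerated dropoff are possible; under the discontinuous form, everything before t_0 is literally identical.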


(2) L1 vs. L2 sensitive periods: It's a good point that these may in fact be different (missing the L1 cutoff seems more catastrophic). This difference seems to call into question how much we can infer about a critical/sensitive period for L1 acquisition on the basis of L2 acquisition. Later results from this paper suggest a qualitative split: for early-immersion learners (<10 years old), bilinguals, and monolinguals (L1), a continuous model with a sigmoidal dropoff is the best fit, while for later-immersion learners, a discontinuous model with a constant rate followed by a sigmoidal dropoff is the best fit. So maybe we can extrapolate from L2 to L1, provided we look at the right set of L2 learners (i.e., early-immersion learners). And certainly we can learn useful things about L2 critical/sensitive periods.


(3) AIC score interpretation: I think I need more of a primer on this, as I was pretty confused about how to interpret these scores. I had thought that a negative score closer to 0 is better because the measure is based on log likelihood, and closer to 0 means a "smaller" negative, which is a higher probability. Various googling suggests the lowest score is better, but I don't understand how you get a negative number in the first place if you're subtracting the log of the likelihood. That is, you're subtracting a negative number (because likelihoods are small probabilities often much less than 1), which is equivalent to adding a positive number. So, I would have expected these scores to be positive numbers.
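For my own reference, the standard definition as I understand it, with k the number of fitted parameters and L-hat the maximized likelihood:

```latex
\mathrm{AIC} \;=\; 2k \;-\; 2\ln\hat{L}
```

If that's right, then lower is better (including more negative), and a negative AIC just means ln(L-hat) > k. That can happen because, for continuous data, the likelihood is a product of density values that can individually exceed 1, so the log likelihood isn't forced to be negative the way a log of a probability is.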


Thursday, January 13, 2022

Some thoughts on Hu et al. 2021

It's a nice change of pace for me to take a look at pragmatic modeling work more from the engineering/NLP side of the world (rather than the purely cognitive side), as I think this paper does. That said, I wonder if some of the specific techniques used here, such as the training of the initial context-free lexicon, might be useful for thinking about how humans represent meaning (especially meaning that feeds into pragmatic reasoning).


I admit, I also would have benefited from the authors having more space to explain their approach in different places (more on this below). For instance, the intuition of self-supervised vs. regular supervised learning is something I get, but the specific implementation of the self-supervised approach (in particular, why it counts as self-supervised) was a little hard for me to follow.


Specific thoughts:

(1) H&al2021 describe a two-step learning process, where the first step is learning a lexicon without "contextual supervision". It sounds like this is a "context-free" lexicon, like the L0 level of RSA, which typically involves the semantic representation only. Though I do wonder how "context-free" the basic semantic representations actually are (e.g., they may incorporate the linguistic contexts words appear in), to be honest. But I suppose the main distinction is that no intentions or social information are involved.
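For concreteness, here's the vanilla RSA recursion as I usually think of it (textbook version with a made-up two-utterance, three-object lexicon; this is not H&al2021's implementation), just to anchor what the "context-free" L0 level consults:

```python
# Minimal vanilla-RSA sketch: L0 only consults the literal semantics (the
# lexicon); pragmatic reasoning enters at S1/L1.
import numpy as np

utterances = ["blue", "circle"]
objects = ["blue circle", "blue square", "green circle"]

# Boolean literal semantics: is the utterance true of the object?
lexicon = np.array([
    [1, 1, 0],   # "blue"   is true of: blue circle, blue square
    [1, 0, 1],   # "circle" is true of: blue circle, green circle
], dtype=float)

def L0(lexicon):
    """Literal listener: P(object | utterance), truth conditions + uniform prior."""
    return lexicon / lexicon.sum(axis=1, keepdims=True)

def S1(lexicon, alpha=1.0):
    """Pragmatic speaker: softmax over utterances, utility = log L0."""
    utility = alpha * np.log(L0(lexicon) + 1e-12)
    exp_utility = np.exp(utility)
    return exp_utility / exp_utility.sum(axis=0, keepdims=True)

def L1(lexicon):
    """Pragmatic listener: P(object | utterance) proportional to S1 (uniform object prior)."""
    s1 = S1(lexicon)
    return s1 / s1.sum(axis=1, keepdims=True)

print("L0 (rows = utterances, cols = objects):\n", np.round(L0(lexicon), 2))
print("L1 (rows = utterances, cols = objects):\n", np.round(L1(lexicon), 2))
```

The point is that L0 only ever sees the lexicon, while everything about speaker intentions comes in at the S1/L1 layers, which seems to be roughly where H&al2021's second, "pragmatic policy" step lives.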


The second step is to learn "pragmatic policies" by optimizing an appropriate objective function without "human supervision". I initially took this to mean unsupervised learning, but then H&al2021 clarified (e.g., in section 3) that they instead meant that certain types of information provided by humans aren't included during training, which is useful from an engineering perspective because that kind of data can be costly to get. And so the learning gets the label "self-supervised", from the standpoint of that withheld information.


(2) Section 4.3, on the self-supervised learning (SSL) pragmatic agents.


For the AM model that the RSA implementations use, H&al2021 say that they train the base-level agents with the full contextual supervision and then "enrich" them with subsequent AM steps. I think I need this unpacked more. I follow what it means to train agents with the full contextual supervision: in particular, include the contexts provided by the color triples. But I don't understand what enriching the agents with AM steps afterwards means. How is that separate from, or different from, the initial training process? Is the initial training not done via AM optimization? For the GD model, we see a similar process, with pragmatic enrichment done via GD steps rather than AM steps. It seems important to understand this, since the distinction is what gets this approach classified as self-supervised rather than fully supervised.


(3) For the GD approach, the listener model can train an utterance encoder and color context encoder. But why wouldn’t a listener be using decoders, since listeners can be intuitively thought of as decoding? I guess decoding is just the inverse of encoding, so maybe it’s translatable?


(4) I think I'm unclear on what "ground truth" is in Figure 2a, and why we're interested in it if humans themselves don't always match it. I would have thought the ground truth would be what humans do for this kind of pragmatic language use.