Monday, December 4, 2023

Some thoughts on McCoy & Griffiths 2023

Before I read a single word of this paper, I already loved the idea of this: encoding useful symbolic knowledge into a distributed representation that’s been proven capable of Awesome Language Feats. This seems like exactly what we want in order to better understand how language acquisition is possible. I know the goal here is about making artificial neural networks (ANNs) better at language acquisition, but the way to do that is inspired by how children do the same thing. So it seems like there’s a good potential for accomplishing the goal I tend to be more interested in, which is using ANNs to better understand (tiny) human cognition.

Other targeted thoughts:

(1) In describing how the Bayesian prior is encoded into the ANN, M&G2023 say “hypotheses are sampled from that prior to create tasks that instantiate inductive bias in the data”. When I first read this, I wanted to understand better what it means to create a task from a sampled hypothesis. Section 2 says “each ‘task’ is a language, so that the inductive bias being distilled is a prior over the space of languages.” So…that would be a language whose distribution over elements matches the sampled hypothesis? (That might make sense, assuming a hypothesis in the Bayesian model is a distribution over elements of the potential language.)

After reading section 2, Step 2, this seems like what they’re doing. It’s just that the term “task” was new to me here, and doesn’t seem to describe what’s going on. Maybe this term comes from the ML literature on meta-learning.
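To make sure I follow, here’s a toy sketch of what I think “sample a hypothesis, then create a task from it” could look like. The setup is entirely hypothetical (a bigram language drawn from a flat prior), not M&G2023’s actual code:

```python
import random

random.seed(0)

# Hypothetical prior: each "language" is a bigram distribution over a tiny
# alphabet, with transition probabilities drawn at random (a flat-ish prior).
ALPHABET = ["a", "b", "#"]  # "#" ends a string

def sample_language():
    """Sample one hypothesis (a language) from the prior."""
    lang = {}
    for prev in ["a", "b", "^"]:  # "^" marks string start
        weights = [random.random() for _ in ALPHABET]
        total = sum(weights)
        lang[prev] = [w / total for w in weights]
    return lang

def sample_string(lang, max_len=10):
    """Sample a string from a sampled language."""
    out, prev = [], "^"
    while len(out) < max_len:
        sym = random.choices(ALPHABET, weights=lang[prev])[0]
        if sym == "#":
            break
        out.append(sym)
        prev = sym
    return "".join(out)

def make_task(n_examples=5):
    """One 'task' = a batch of strings from one sampled language."""
    lang = sample_language()
    return [sample_string(lang) for _ in range(n_examples)]

# Meta-training data: many tasks, each one instantiating the prior in data.
tasks = [make_task() for _ in range(3)]
```

So a “task” is just the dataset generated by one draw from the prior, and the prior only shows up indirectly, through the spread of tasks.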

(2) Meta-learning to learn: M&G2023 use model-agnostic meta-learning (MAML), and they say MAML can be viewed as a way to perform hierarchical Bayesian modeling. Why? Because MAML involves learning about the equivalent of hyperparameters – the original model’s parameters – rather than only the model that actually learns directly from the data. It seems important to understand how the original model’s parameters are adjusted on the basis of the temporary model’s learning from the sampled data. I don’t think I quite understand how this works.
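Here’s my attempt to make the inner/outer loop concrete with a scalar toy model, where the gradients can be written out by hand. This is my own illustration (quadratic losses, made-up learning rates), not the paper’s setup:

```python
# Toy MAML: each task t asks the model to match a target value, with
# loss(theta, t) = (theta - t)**2. The temporary model M' takes one
# gradient step from M; M is then updated using M''s post-update loss.

def inner_step(theta, task, alpha=0.1):
    """M' = one gradient step on the task, starting from M's parameters."""
    grad = 2 * (theta - task)
    return theta - alpha * grad

def maml_outer_grad(theta, task, alpha=0.1):
    """Gradient of M''s post-update loss with respect to M's parameters."""
    theta_prime = inner_step(theta, task, alpha)
    # Chain rule through the inner step: d(theta')/d(theta) = 1 - 2*alpha
    # for this quadratic loss, so the outer gradient flows back to M.
    return 2 * (theta_prime - task) * (1 - 2 * alpha)

def maml_train(tasks, theta=0.0, alpha=0.1, beta=0.05, epochs=200):
    for _ in range(epochs):
        # Average the outer gradient over a batch of sampled tasks.
        g = sum(maml_outer_grad(theta, t, alpha) for t in tasks) / len(tasks)
        theta -= beta * g
    return theta

# Tasks drawn from a "prior" centered at 3: MAML moves theta to a point
# from which one inner step adapts well to any task from that prior.
theta_meta = maml_train([2.0, 3.0, 4.0])
```

The key bit for the M’-to-M transfer question: M never sees the task loss directly. The only thing that reaches M is the gradient of M’’s performance, backpropagated through the inner update itself.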

Related: Pre-training vs. prior-training. M&G2023 describe these approaches as a head start (pre-training) vs. learning to learn (prior-training). It feels like the details of how prior-training works are now important – that is, how what was learned by temporary model M’ gets transferred back to original model M in the MAML approach. This transfer is clearly meant to be different from pre-training, which involves training M on a more-general task…which is somehow not “learning to learn”, even though the task is general. I may just need to read more in this literature to understand the difference, though.

(3) M&G2023 note that the prior-trained neural network can learn like a Bayesian model (e.g., pretty well from 10 examples), but it’s way faster because of the parallel processing architecture. This comment about the relative speed of Bayesian models vs. prior-trained neural networks that encode the equivalent of Bayesian inductive biases definitely makes me think about language evolution considerations. Basically, why do human languages have the shape they do? Because languages can be learned via inductive biases that can be encoded into parallel-processing, distributed-representation machines (i.e., human neural networks) that work fast.

(4) It’s great to see strong performance from the prior-trained NN, but the fact that the other NNs do pretty darned well too seems notable. That is, 8.5 million words may be enough even for NNs with weak inductive biases. M&G2023 note at the end of the section that a better demo would be a smaller corpus, among other considerations, and they in fact explore smaller input sizes (hurrah!).

(5) Out-of-distribution generalization: The prior-trained NN always does a little better. Again, it’s great to see the improvement, but is it surprising that the standard NN without the inductive bias does pretty well too? Maybe this is because the standard NN had enough data? (Although M&G2023 say in the next subsection that this may have to do with the distilled inductive biases not being that helpful. So the issue is distilling better biases, i.e., ones defined over naturalistic data more directly…somehow?) I wonder what would happen if we focused on the versions that only had 1/32nd of the data, since that’s one case where the prior-trained NN definitely did better than the standard one.

(6) Future work: M&G2023 note that future work can distill different inductive biases into NNs and see which ones work better. I love the idea of this, but I think we should be clear about the assumptions we would be making here. Basically, if we’re going to test different theories of inductive biases, then we’re committing to the NN representation as “good enough” to simulate computation in the human mind. This is fine, but we should be clear about it, especially since it can be hard to interpret what other biases might be active in any given ANN implementation (e.g., LSTMs vs. Transformers).

Wednesday, November 29, 2023

Some thoughts on Frank 2023a and 2023b

I’m definitely on board with the spirit of these papers. My position: I would love to understand more about how children do what they do when it comes to language acquisition. If that also helps large language models (LLMs) do what they do better, then that’s great too.

Some other specific thoughts, responding to certain ideas in “Bridging the gap”: 

(1) I definitely understand that the interactive, social nature of children’s input matters. In particular, the social part in child language acquisition is usually about why certain input has more impact than other input – input in an interactive, social environment gets absorbed better by kids. But absorption doesn’t seem to be the problem for LLMs – they take in their data just fine. That said, it does seem like the interaction part helps ChatGPT (i.e., the ability to query it).

More generally, it could be that what a certain input quality (e.g., being social and interactive) does for human kids isn’t necessary for an LLM. But, we don’t know that until we understand why that input quality helps kids in the first place.

(2) I also understand that multimodal input gives concrete extensions to some concepts, and so helps “ground out” meaning in the real world for kids. I’m less sure how multimodal input would help current  AI systems — is it maybe helpful for bootstrapping the rest of the cognitive system (somehow?) that allows flexible reasoning?

(3) I think there’s a really good point made about needing the apples-to-apples comparison for evaluation. I remember earlier in the evaluation of speech segmentation models, the models were compared against perfect (adult-like) accuracy of segmentation, and few cognitively-plausible ones did all that well. In contrast, when these same models were tested on the segmentation tasks given to infants (which were meant to demonstrate infant segmentation ability), most models did just fine. Now, whether the models accomplished segmentation the way that the infants did is a different question, and one that would also apply to LLMs once we have apples-to-apples comparisons.

Tuesday, April 25, 2023

Some thoughts on Degen 2023

To me, this is a beautifully accessible review article for the probabilistic pragmatics approach, as implemented in RSA. (Figure 1 in particular made me happy – these helpful visuals really are worth it, though I know it’s hard to get them together just right.)  This review article definitely gets me wondering more about how to use RSA for language acquisition (especially when it discusses bounded cognition).

In particular, what’s the (potential) difference between a child’s approximation of Bayesian inference and an adult’s approximation? How much can be captured by this mental computation being pretty good but the units over which inference is operating being immature (e.g., utterance alternatives, meaning options, priors)? For instance, how worthwhile is it to try and capture child behavior on different pragmatic phenomena by assuming adult-like Bayesian inference but non-adult-like units that inference operates over? 

Scontras & Pearl 2021 did this a little for quantifier-scope interpretation, but those child data were from five-year-olds, who are known to be pretty adult-like for non-pragmatic things. What about younger kids? And of course, what about other pragmatic phenomena that we have child data for?
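To pin down for myself what “adult-like Bayesian inference over non-adult-like units” could mean, here’s a minimal RSA sketch using the classic some/all scalar implicature. The semantics, flat priors, and alternative sets are my own toy illustration (standard RSA with no cost term), not anything from the paper:

```python
# Worlds: SOME = some-but-not-all, ALL = all.
# Literal semantics: "some" is true in both worlds; "all" only in ALL.

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

SEMANTICS = {"some": {"SOME": 1, "ALL": 1}, "all": {"SOME": 0, "ALL": 1}}
WORLDS = ["SOME", "ALL"]

def literal_listener(utt):
    """L0: condition a flat prior on the literal semantics."""
    return normalize({w: SEMANTICS[utt][w] for w in WORLDS})

def speaker(world, alternatives):
    """S1: choose among true alternatives, favoring informative ones."""
    scores = {u: literal_listener(u).get(world, 0)
              for u in alternatives if SEMANTICS[u][world]}
    return normalize(scores)

def pragmatic_listener(utt, alternatives):
    """L1 via Bayes: P(world | utt) prop. to P(utt | world) * P(world)."""
    post = {w: speaker(w, alternatives).get(utt, 0) for w in WORLDS}
    return normalize(post)

adult = pragmatic_listener("some", ["some", "all"])  # knows "all" was an option
child = pragmatic_listener("some", ["some"])         # impoverished alternatives
```

The inference machinery is identical in both calls; only the alternative set changes. The “adult” derives the some-but-not-all implicature, while the “child” – running the very same Bayesian computation over fewer utterance alternatives – doesn’t.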

Tuesday, April 18, 2023

Some thoughts on Diercks et al. 2023

I really appreciated the leisurely pace and accessible tone of this writing, especially for someone who’s not super-familiar with the nuts and bolts of the Minimalist approach, but very interested in development. Here we can see one of the perks of not having a strict page limit. :)

Some other thoughts:

(1) One key idea of Developmental Minimalist Syntax (DMS) seems to be that the current bottom-up description of possible representations (which is what I take the iterated Merge cycles of the Minimalist approach to be) would actually have a cognitive correlate that we can observe and evaluate (i.e., stages of development). That is, this way of compactly describing acceptable/grammatical adult representations corresponds to an actual cognitive process (at the computational level of description, in Marr’s terms) whose signal can be seen in children’s developmental stages. So, this would support the validity (utility?) of describing adult representations this way.

(2) I didn’t quite follow the link between Minimalist Analytical Constructions (MACs) and Universal Cognition for Language. Is the idea that there are certain representations in the adult knowledge system, and we don’t care if their origin is language-specific? It sounds like that, from the text that follows. 

Later on, MACs are described as children’s “toolkit for grammaticalizing their language”. Would this mean that the adult representations are what children use to make sense of (“grammaticalize”) their language? That is, the representations children develop allow them to parse their input into useful information. In my standard way of thinking about these things, the developed/developing representations that children have allow them to perceive certain information in their input (which then is transformed into their “perceptual intake” of the input signal).

In ch 3, part 4, we get a fuller definition: “grammaticalizing” means arriving at and encoding generalizations for the language. So, I think that’s compatible with my idea above that “grammaticalizing” has to do with the developing adult-like representations, and children parse their input with whatever they’ve already developed along the way.

(3) Thinking about acquisition as addition, rather than replacement: Just to clarify, children can have immature representations in one of two ways: 

(1) a representation is immature because it’s still changing ([hug X] instead of [Predicate X]), or 

(2) a representation is immature because it’s fixed into the adult-like state, but it’s only part of the full adult-like structure (e.g., VP) instead of the adult-like full structure [CP [TP [vP [VP ]]]].  This second version is talked about later in ch. 3 a little in “mixed status utterances”, which can have an adult-like part and an immature part.

(4) Predictions for VP before vP (section 4.3): So, I think a prediction of DMS is that we shouldn’t generally see agentive subjects combining productively with verbs (which would be vP) before we see verbs combining productively with their objects (which would be VP). (Ex: Not “I put” before “put the ball” or “put down”, as a specific item.) 

How would we then distinguish an item-specific combination that might seem to violate this from a language-general implementation involving that item that might seem to violate this? (That is, if we encounter “I put” before “put the ball”, how do we know if it’s an item-specific use or a productive language-general use?) Is it about where the child seems to be with respect to language-general use (e.g., productively using verbs with objects, but not subjects with verbs)? That is, we’d assume that an instance of “I put” would be item-specific and immature, but “put down” would be productive and general?

Friday, February 10, 2023

Some thoughts on Hahn et al. (2022)

I love the way Hahn et al. (2022) set up the two approaches they’re combining – it seems like the most natural thing in the world to combine them and reap the benefits of both. Hats off to the authors for some masterful narrative there.

In general, I’d love to think about how to apply the resource-rational lossy-context surprisal approach to models of acquisition. It seems like this approach to input representation could be applied to child input for any given existing model (say, of syntactic learning, but really for learning anything), so that we get a better sense of what (skewed) input children might actually be working from when they’re trying to infer properties of their native language. 

A first pass might be just to use this adult-like version to skew children’s input (maybe a neural model trained on child-directed speech to get appropriate retention probabilities, etc.). That said, I can also imagine that the retention rate might just generally be less for kids (and kids of different ages) compared to adults because of lower thresholds on the parts that go into calculating that retention rate (e.g., the delta parameter that modulates how much context goes into calculating next-word probabilities). Still,  the exciting thing for me is the idea that this is a way to formally implement “developing processing” (or even just “more realistic processing”) in a model that’s meant to capture developing representations.
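Here’s the kind of thing I have in mind, as a toy sketch. The “language model” and all the probabilities are invented; only the retention mechanism is meant to echo the lossy-context idea, with a lower retention rate standing in for more child-like memory:

```python
import math, random

random.seed(1)

# Toy "language model": the next-word probability depends only on whether
# the word "not" survives in the retained context (a stand-in for how
# erased context shifts predictions). Hypothetical numbers throughout.
def p_next(word, retained_context):
    if "not" in retained_context:
        return {"bad": 0.7, "good": 0.3}[word]
    return {"bad": 0.2, "good": 0.8}[word]

def lossy_surprisal(word, context, retention, n_samples=5000):
    """Expected surprisal (bits) over noisy memory traces of the context:
    each context word is independently retained with prob `retention`."""
    total = 0.0
    for _ in range(n_samples):
        trace = [w for w in context if random.random() < retention]
        total += -math.log2(p_next(word, trace))
    return total / n_samples

context = ["that", "was", "not"]
adult_like = lossy_surprisal("bad", context, retention=0.9)
child_like = lossy_surprisal("bad", context, retention=0.4)
```

With the lower retention rate, the critical word “not” is erased more often, so the “child” learner’s effective input is more skewed – which is exactly the sort of systematically distorted intake a developmental model could then learn from.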

Wednesday, October 19, 2022

Some thoughts on Hitczenko & Feldman 2022

I love seeing work that evaluates an idea against naturalistic data. It’s often the exciting next “proof of concept” once you’ve got an implemented theory that works on idealized data or controlled experimental data.

Some other thoughts:

(1) I completely sympathize with the idea that anything from the broader context might be relevant for discriminating contrastive dimensions. I think the question then becomes how infants decide which contextual factors to pay attention to, out of all the possible ones. Are certain ones just more salient, period, or salient because the infant brain has certain perceptual biases, etc.? What’s the hypothesis space of possible contextual features, and how might an infant navigate through that hypothesis space?

(2) Thinking about noise: I wonder how much noise this kind of approach can tolerate. For instance (and this is a point that H&F2022 bring up in the discussion), if infants have a fuzzier notion of distributional similarity than Earthmover’s distance/KL divergence/whatever because of their developing learning abilities, can they still catch onto these distributional differences?

H&F2022 also implement some ideas for fuzzier (mis)perception of the input, which shows this approach can tolerate at least 20% noise in perception. So maybe someone could implement the fuzzier distributional similarity idea in a similar way.
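As a toy version of that, here’s a distributional comparison via KL divergence with a crude uniform-misperception mixture. The histograms and the noise model are made up for illustration (this is not H&F2022’s actual implementation):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) over shared discrete bins, in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def add_misperception(p, noise=0.2):
    """Mix in a uniform 'misperceived' component - a crude stand-in for
    fuzzier infant perception (hypothetical noise model)."""
    u = 1.0 / len(p)
    return [(1 - noise) * pi + noise * u for pi in p]

# Made-up cue histograms for one acoustic dimension in two contexts:
context_a = [0.7, 0.2, 0.1]
context_b = [0.1, 0.2, 0.7]

clean = kl(context_a, context_b)
noisy = kl(add_misperception(context_a), add_misperception(context_b))
```

Under 20% misperception the divergence shrinks but doesn’t vanish, which is one way to see why the approach can tolerate a fair amount of perceptual noise: the contextual distributions stay distinguishable, just less sharply.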

Tuesday, October 4, 2022

Some thoughts on Cao et al. 2022

I really like seeing modeling work like this where a more complex, ideal computation (here, EIG) can be well-approximated by a simpler, more-heuristic computation (here, surprisal and KL divergence) when it comes to capturing developmental behavior. Of course, this paper is presenting a first-pass evaluation over adult behavior, but as the authors note, future work can extend their evaluation to infant looking behavior. I definitely would like to see how well this approach works for infant data, since I’d be surprised if there wasn’t some immaturity (i.e., resource constraints, other biases) at work for the computation itself in infants, compared with adult decision-making. And then the interesting question is how to capture that immaturity – for instance, do the approximations of the computation work even better than the idealized computation with EIG? Would even simpler heuristics that don’t approximate EIG as well but are also backward-looking, rather than forward-looking, be better?
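To keep the forward-looking vs. backward-looking distinction straight for myself, here’s a tiny two-hypothesis example (a toy coin setup of my own, not the RANCH model):

```python
import math

# Two hypotheses about a coin (a stand-in for competing stimulus structures):
P_HEADS = {"fair": 0.5, "biased": 0.9}

def predictive(prior):
    """P(next observation = heads) under current beliefs."""
    return sum(prior[h] * P_HEADS[h] for h in prior)

def posterior(prior, heads):
    """Bayesian update on one observed outcome."""
    post = {h: prior[h] * (P_HEADS[h] if heads else 1 - P_HEADS[h])
            for h in prior}
    z = sum(post.values())
    return {h: v / z for h, v in post.items()}

def kl(p, q):
    return sum(p[h] * math.log2(p[h] / q[h]) for h in p if p[h] > 0)

def surprisal(prior, heads):
    """Backward-looking and cheap: negative log predictive probability
    of what actually happened."""
    p = predictive(prior)
    return -math.log2(p if heads else 1 - p)

def eig(prior):
    """Forward-looking and costly: expected KL(posterior || prior),
    averaging over outcomes that haven't happened yet."""
    p = predictive(prior)
    return (p * kl(posterior(prior, True), prior)
            + (1 - p) * kl(posterior(prior, False), prior))

prior = {"fair": 0.5, "biased": 0.5}
ideal_value = eig(prior)              # what an ideal sampler computes
cheap_proxy = surprisal(prior, True)  # heuristic, after seeing heads
```

The contrast I care about is visible in the structure: EIG requires imagining every possible next outcome and updating on each, while surprisal only needs the predictive probability of the outcome that occurred. That asymmetry is where I’d expect infant resource constraints to bite.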

Other specific thoughts:

(1) Noisy perception: It’s really nice to see this worked into a developmental model, since – especially for infants – imperfect representations of stimuli seem like a plausible situation. That is, the “perceptual intake” into the learning system depends on immature knowledge and abilities, and is therefore different from the input signal that’s out there in the world. (To be fair, the perceptual intake for adults is also different from the input signal out there in the world, and adults don’t have immature knowledge and abilities. So children basically have to learn to be adult-like in how they “skew” the input signal.)

(2) The RANCH model involves accumulating noisy samples and choosing what to do at each moment. This sounds like the diffusion model of decision-making from mathematical psych to me. I wonder if RANCH is an implementation of that (and if not, how they differ)?

(3) What the learner needs to know: A key idea here is that the motivation to sample the input at all is because the learner knows perception is noisy. To me, this is pretty reasonable knowledge to build into a modeled child. It reminds me of Perkins et al. 2022 where the learner knows misperception occurs, and so has to learn to filter out erroneous data. Importantly there, the modeled learner doesn’t have to know the specifics beyond that.

Perkins, L., Feldman, N. H., & Lidz, J. (2022). The Power of Ignoring: Filtering Input for Argument Structure Acquisition. Cognitive Science, 46(1), e13080.

Friday, February 11, 2022

Some thoughts on Wilcox et al. 2021

This paper made me really happy because it involved careful thought about what was being investigated, an accessible intuition about how each model works, what the selected models can and can’t tell us, how the models should be evaluated, sensible ways to interpret model results, and why we should care. Of course, I did have (a lot of) various things occur to me as I was reading (more on this below), but this is probably one of the few papers I’ve read recently using neural net models that I care about, as a developmental linguist who does cognitive modeling. Thanks, authors!

Specific thoughts:

(1) Poverty of the stimulus vs. the Argument from poverty of the stimulus (i.e., viable solutions to poverty of the stimulus): I think it’s useful to really separate these two ideas. Poverty of the stimulus is about whether the data are actually compatible with multiple generalizations. I think this seems to be true about learning constraints on filler-gap dependencies (though this assertion depends on the data considered relevant in the input signal, which is why it’s important to be clear about what the input is). But the argument from poverty of the stimulus is about viable solutions, i.e., the biases that are built in to navigate the possibilities and converge on the right generalization.

The abstract wording focuses on poverty of the stimulus itself for syntactic islands, while the general discussion in 6.2. is clearly focusing on the (potential) viable solutions uncovered via the models explored in the paper. That is, the focus isn’t about whether there’s poverty of the stimulus for learning about islands, but rather what built-in stuff it would take to solve it. And that’s where the linguistic nativist vs. non-linguistic nativist/empiricist discussion comes in. I think this distinction between poverty of the stimulus itself and the argument from poverty of the stimulus gets mushed together a bit sometimes, so it can be helpful to note it explicitly. Still, the authors are very careful in 6.2. to talk about what they’re interested in as the argument from poverty of the stimulus, and not poverty of the stimulus itself.

(2) Introduction, Mapping out a “lower bound for learnability”: I’m not quite sure I follow what this means: a lower bound in the sense of what’s learnable from this kind of setup, I guess? Which is why anything unlearnable might still require a language-specific constraint? 

Also, I’m not sure I quite follow the distinction between top-down vs bottom-up being made about constraints. Is it that top-down is explicitly defined and implemented, as opposed to bottom-up being an emerging thing from whatever was explicitly defined and implemented? But if so, isn’t that more of an implementational-level distinction, rather than a core aspect of the definition (=computational-level) of the constraint? That is, the bottom-up thing could be explicitly defined, if only we understood better how the explicitly defined things caused it to emerge?

(3) The “psycholinguistics paradigm” for model assessment: I really like this approach, precisely because it doesn’t commit you to an internal theory-specific representation. In general, this is a huge plus for evaluating models against observable behavior. Even if you use an internal representation (and someone doesn’t happen to like it), you can still say that whatever’s going on can yield human behavior so it must have something human-like about it. The same is true for distributed/connectionist language models where it’s hard to tell what the internal representations are, aside from being vectors of numbers.

(4) The expected superadditive pattern when both the filler and gap are present: Why should this be superadditive, instead of just additive? What extra thing is happening to make the presence of both yield a superadditive pattern? I have the same question once we get to island stimuli, too, where the factors are filler presence, gap presence, and island structure presence. 
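Just to pin down the arithmetic of “superadditive” for myself (the surprisal values below are invented for illustration, not taken from the paper):

```python
# Hypothetical surprisals (bits) for the 2x2 filler-by-gap design:
surprisal = {
    ("-filler", "-gap"): 5.0,
    ("-filler", "+gap"): 12.0,   # unlicensed gap: very surprising
    ("+filler", "-gap"): 9.0,    # filler with no gap: also surprising
    ("+filler", "+gap"): 6.0,    # filler licenses the gap
}

def interaction(s):
    """2x2 interaction term: how far the (+filler, +gap) cell deviates
    from the purely additive prediction of the two main effects."""
    additive_prediction = (s[("+filler", "-gap")]
                           + s[("-filler", "+gap")]
                           - s[("-filler", "-gap")])
    return s[("+filler", "+gap")] - additive_prediction

effect = interaction(surprisal)  # negative => superadditive licensing
```

If filler and gap each just carried an independent cost, the interaction term would be near zero (pure additivity). So I take the superadditive prediction to be that the combined cell is much cheaper than the additive prediction, because the filler and the gap license each other – but that still leaves my question of what mechanism produces the extra, non-additive chunk.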

(5) The domain-general property of the neural models: The neural models aren’t building any bias for language-specific representations in, but language-specific representations are in the hypothesis space. So, is it possible the best-fitting internal representations are language-specific? This would be similar to Bayesian approaches (e.g., Perfors et al 2011) that allow the hypothesis space to include domain-general options, but inference leads the learner to select language-specific options.

(6) The input: Just a quick note that the neural models here were trained on non-childlike input both in terms of content (e.g., newswire text, Wikipedia) and quantity (though I do appreciate the legwork of estimating input quantity). This isn’t a really big deal for the proof-of-concept goal here, but starts to matter more for more targeted arguments about how children could learn various filler-gap knowledge so reliably from their experience. Of course, the authors are aware of this and explicitly discuss this right after they introduce the different models (thanks, authors!). 

One thing that could be done: cross-check the input quantity with known ages of acquisition (e.g., Complex NP islands in English by age four, De Villiers et al. 2008). Since the authors say input quantity doesn’t really affect their reported results anyway, then this should be both easy to do and not change any major findings.

The second thing is to train these models on child-directed speech samples and see if the results hold. The CHILDES database should have enough input samples from high-resource languages, and whatever limitations there might be in terms of sampling from multiple children at multiple ages from multiple backgrounds (and other variables), it seems like a step in the right direction that isn’t too hard to do (though I guess that does depend on how hard it is to train these models).

(7) Proof-of-concept argument with these neural models: The fact that these models do struggle with issues of length and word frequency in non-human-like ways does suggest that they might do other things (like learn about filler-gap dependencies) in non-human-like ways too. So we have to be careful about what kind of argument this proof-of-concept is — that is, it’s a computational-level “is it possible at all” argument, rather than a computational-level “is it possible for humans who have these known biases/limitations, etc” argument.

(8) N-grams always fail: Is this just because the 5-token window isn’t big enough, so there’s no hope of capturing dependencies that are longer? I expect so, but don’t remember the authors saying something explicitly like that.

(9) Figure 5: I want to better understand why inversion is an ok behavior (I’m looking at you, GRNN).  Does that mean that now a gap in matrix position with a licensing filler in the subject is more surprising than no gap in matrix position with no licensing filler in the subject? I guess that’s not too weird. Basically, GRNN doesn’t want gaps in places they shouldn’t be (which seems reminiscent of island restrictions, as islands are places where gaps shouldn’t be).

(10) One takeaway from the neural modeling results: Non-transformer models do better at generalizing.  Do we think this is just due to data overfitting (training input size, parameter number), or something else?

(11) Coordination islands: I know the text says all four neural models showed significant reduction in wh-effects, so I guess the reductions must be significant between the control conditions and the 1st conjunct gaps. But, there seems to be a qualitative difference in attenuation we see for a gap in the first conjunct vs. the second conjunct (and it’s true for all four neural models). I wonder why that should be. 

(12) Figure 10, checking my understanding: So, seeing no gap inside a control structure is less surprising sometimes than seeing no gap inside a left-branching structure…I think this may have to do with the weirdness of the control structures, if I’m following 14 correctly? In particular, the -gap control is “I know that you bought an expensive a car last week” and the -gap island is “I know how expensive you bought a car last week”. This may come back to being more precise about surprisal expectations for control vs. island structures. Usually, control structures are fine (grammatical), but here they’re not, and so that could interfere with the potential surprisal pattern we’re looking for.

(13) Subject islands: It was helpful to get a quick explanation about why the GRNN didn’t do as well as the other neural models here (basically, not having a robust wh-effect for the control structures). A quick explanation of this type would be helpful for other cases where we see some neural models (seem to) fail, like the first conjunct for Coordination islands, and then Left Branch and Sentential Subject islands.

(14) Table 14: (just a shout out) Thank you so much, authors, for providing this. Unbelievably helpful summary.

(15) One takeaway the authors point out: If learning is about maximizing input data probability, then these neural approaches are similar to previous approaches that do this. In particular, maximizing input data probability corresponds to the likelihood component of any Bayesian learning approach, which seems sensible. Then, the difference is just about the prior part, which corresponds to the inductive biases built in.
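In Bayesian terms, the way I’m parsing that decomposition (just restating the takeaway, nothing beyond what the authors say):

```latex
\hat{\theta} \;=\; \arg\max_{\theta}\,\Bigl[\,\underbrace{\log P(D \mid \theta)}_{\text{likelihood: input-data probability}} \;+\; \underbrace{\log P(\theta)}_{\text{prior: built-in inductive biases}}\,\Bigr]
```

A standard language-model training objective optimizes only the first term; the architecture and training procedure then act as an implicit version of the second.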

(16) General discussion: I’m not quite sure I follow why linguistic nativist biases would contrast with empiricist biases by a priori downweighting certain possibilities — maybe this is another way of saying that one type of language-specific bias skews/limits the hypothesis space a certain way only if it’s a language-based hypothesis space? In contrast, a domain-general bias skews/limits the hypothesis space no matter what kind of hypothesis space it is. The particular domain-general bias of maximizing input probability of course doesn’t occur a priori— the learner needs to see the input data. But other kinds of domain-general biases seem like they could skew the hypothesis space a priori (e.g., the simplicity preference from Perfors et al. 2006).

(17) Another takeaway from the general discussion is that the learner doesn’t obviously need built-in language-specific biases to learn these island constraints. But I would love to know what abstract representations get built up in the best-performing neural models from this set, like JRNN. These are likely linguistic, as they’re word forms passed through a convolutional neural network (and therefore compressed somehow), and it would be great to know if they look like syntactic categories we recognize or something else. 

So, I’m totally on board with being able to navigate to the right knowledge in this case without needing language-specific (in contrast with domain-general) help. I just would love to know more about the intermediate representations, and what it takes to plausibly construct them (especially for small humans).

Tuesday, January 25, 2022

Some thoughts on van der Slik et al. 2021

I really appreciate the thoughtfulness that went into the reanalysis of the original Hartshorne et al. 2018 data on second language acquisition and a potential critical/sensitive period. What struck me (more on this below) was the subtlety of the distinction that van der Slik et al. 2021 were really looking at: I think it’s not really a “critical period” vs. not, but rather a sensitive period where some language ability is equal before a certain point vs. not. In particular, both the discontinuous (=sensitive period) and continuous (=no sensitive period) approaches assume a dropoff at some point, and that dropoff is steeper at some points than others (hence, the S-shaped curve). So the fact that there is in fact a dropoff isn’t really in dispute. Instead, the question is whether abilities are equal before that dropoff point (and in fact, equal to native = sensitive period) or not. To me, this is certainly interesting, but the big picture remains that there’s a steeper dropoff after some point that’s predictable, and it’s useful to know when that point is.

Specific thoughts:

(1) A bit more on the discontinuous vs. continuous models, and sensitive periods vs. not: I totally sympathize with the idea that a continuous sigmoidal function is the more parsimonious explanation for the available data, especially given the plausibility of external factors (i.e., non-biological factors like schooling) for the non-immersion learners. So, turning back to the idea of a critical/sensitive period, we still get a big dropoff in rate of learning, and if the slope is steep enough at the initial onset of the S-curve, it probably looks pretty stark. Is the big difference between that and a canonical sensitive period simply that the time before the dropoff isn’t all the same? That is, for a canonical sensitive period, all ages before the cutoff are the same. In contrast, for the continuous sigmoidal curve, all ages before the point of accelerated dropoff are mostly the same, but there may in fact be small differences the older you are. If that’s the takeaway, then great — we just have to be more nuanced in how we define what happens before the “cutoff” point. But the fact that a younger brain is better (broadly speaking) is true in either case.
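A quick sketch of the two model shapes I have in mind (all parameter values invented; only the flat-then-drop vs. always-sloping contrast matters):

```python
import math

def continuous_rate(age, a0=16.0, k=0.6):
    """Continuous model: one sigmoidal decline in learning rate."""
    return 1.0 / (1.0 + math.exp(k * (age - a0)))

def discontinuous_rate(age, cutoff=12.0, a0=16.0, k=0.6):
    """Discontinuous ('sensitive period') model: the rate is flat and
    maximal up to the cutoff, then follows the same sigmoidal decline."""
    return continuous_rate(max(age, cutoff), a0, k)

ages = (4, 8, 12)
young_continuous = [continuous_rate(a) for a in ages]
young_discontinuous = [discontinuous_rate(a) for a in ages]
```

Both curves share the steep dropoff around the same point; the only disagreement is before it, where the discontinuous model makes all young ages exactly equal while the continuous one allows small, monotonic differences. That tiny contrast is what the model comparison is actually adjudicating.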

(2) L1 vs. L2 sensitive periods:  It’s a good point that these may in fact be different (missing the L1 cutoff seems more catastrophic). This difference seems to call into question how much we can infer about a critical/sensitive period for L1 acquisition on the basis of L2 acquisition. Later results from this paper suggest qualitative similarities in early immersion (<10 years old), bilinguals, and monolinguals (L1) vs. later immersion, in terms of whether a continuous model with sigmoidal dropoff (early immersion) vs. a discontinuous model with constant rate followed by sigmoidal dropoff (later immersion) is the best fit. So maybe we can extrapolate from L2 to L1, provided we look at the right set of L2 learners (i.e., early immersion learners). And certainly we can learn useful things about L2 critical/sensitive periods.

(3) AIC score interpretation: I think I need more of a primer on this, as I was pretty confused about how to interpret these scores. I had thought that a negative score closer to 0 is better because the measure is based on log likelihood, and closer to 0 means a “smaller” negative, which is a higher probability. Various googling suggests the lowest score is better, but then I don’t understand how you get a negative number in the first place if the formula subtracts the log likelihood. That is, you’re subtracting a negative number (because likelihoods are small probabilities often much less than 1), which is equivalent to adding a positive number. So, I would have expected these scores to be positive numbers.
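For my own future reference, here's a sketch of where a negative score could come from, assuming the standard definition AIC = 2k - 2·ln(L) (lower is better): if the data are continuous, the "likelihood" is a product of density values, which can individually exceed 1, making ln(L) positive.

```python
import math
from statistics import NormalDist

def aic(log_likelihood, k):
    """Standard definition: AIC = 2k - 2*ln(L). Lower is better."""
    return 2 * k - 2 * log_likelihood

# Discrete data: likelihoods are probabilities <= 1, so ln(L) <= 0
# and AIC comes out positive, as I expected.
aic_discrete = aic(math.log(0.001), k=3)

# Continuous data: the "likelihood" is a product of density values,
# which can each exceed 1 (e.g., a narrow Gaussian), so ln(L) can be
# positive and AIC can go negative.
density = NormalDist(mu=0.0, sigma=0.05).pdf(0.0)  # well above 1
log_l = 10 * math.log(density)  # ten data points near the mean
aic_continuous = aic(log_l, k=2)  # negative
```

So "lowest is best" holds either way; negative scores just signal a density-based likelihood, not anything broken.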

Thursday, January 13, 2022

Some thoughts on Hu et al. 2021

It’s a nice change of pace for me to take a look at pragmatic modeling work more from the engineering/NLP side of the world (rather than the purely cognitive side), as I think this paper does. That said, I wonder if some of the specific techniques used here, such as the training of the initial context-free lexicon, might be useful for thinking about how humans represent meaning (especially meaning that feeds into pragmatic reasoning). 

I admit, I also would have benefited from the authors having more space to explain their approach in different places (more on this below). For instance, the intuition of self-supervised vs. regular supervised learning is something I get, but the specific implementation of the self-supervised approach (in particular, why it counts as self-supervised) was a little hard for me to follow.

Specific thoughts:

(1) H&al2021 describe a two-step learning process, where the first step is learning a lexicon without “contextual supervision”. It sounds like this is a “context-free” lexicon, like the L0 level of RSA, which typically involves the semantic representation only. Though I do wonder how “context-free” the basic semantic representations actually are (e.g., they may incorporate the linguistic contexts words appear in), to be honest. But I suppose the main distinction is that no intentions or social information are involved.

The second step is to learn “pragmatic policies” by optimizing an appropriate objective function without “human supervision”. I initially took this to mean unsupervised learning, but then H&al2021 clarified (e.g., in section 3) that instead they meant that certain types of information provided by humans aren’t included during training, and this is useful from an engineering perspective because that kind of data can be costly to get. And so the learning gets the label “self-supervised”, from the standpoint of that withheld information.

 (2) Section 4.3, on the self-supervised learning (SSL) pragmatic agents.

For the AM model that the RSA implementations use, H&al2021 say that they train the base level agents with the full contextual supervision and then “enrich” it with subsequent AM steps. I think I need this unpacked more. I think I follow what it means to train agents with the full contextual supervision: in particular, include the contexts provided by the color triples. But I don’t understand what enriching the agents with AM steps afterwards means. How is that separate/different from the initial training process? Is the initial training not done via AM optimization? For the GD model, we see a similar process, with pragmatic enrichment done via GD steps, rather than AM steps. It seems like this is important to understand, as this distinction gets this approach classified as self-supervised rather than fully supervised. 

(3) For the GD approach, the listener model can train an utterance encoder and color context encoder. But why wouldn’t a listener be using decoders, since listeners can be intuitively thought of as decoding? I guess decoding is just the inverse of encoding, so maybe it’s translatable?

(4) I think I’m unclear on what “ground truth” is in Figure 2a, and why we’re interested in that if humans don’t match it either sometimes. I would have thought the ground truth would be what humans do for this kind of pragmatic language use.

Tuesday, November 23, 2021

Some thoughts on Bohn et al. 2021

I think it’s really nice to see a developmental RSA model, along with explicit model comparisons. To me, this approach highlights how you can capture specific theories/hypotheses about what exactly is developing via these computational cognitive modeling “snapshots” that capture observable behavior at different ages. Also, we get to see the model-evaluation pipeline often used in RSA adult modeling now used with kids (i.e., the model makes testable predictions that are in fact tested on kids). I also appreciate how careful B&al2021 are with respect to how model parameters link to psychological processes in the discussion (they emphasize in the general discussion that their model necessarily made idealizations to be able to get anywhere).

Some other thoughts:

(1) It’s interesting to me that B&al2021 talk about children integrating all available information, in contrast to alternative models that ignore some information (and don’t do as well). I’m assuming “all” is relative, because a major part of language development is learning which part of the input signal is relevant. For instance, speaker voice pitch is presumably available information, but I don’t think B&al2021 would consider it relevant for the inference process they’re interested in. But I do get that they’re contrasting the winning model with one that ignores some available relevant information.

(2) I feel like the way that B&al2021 talk about informativity seems to differ at points. In one sense, they talk about an informative and cooperative speaker, which seems to link with the general RSA framework of speaker utility as maximizing correct listener inference. In another sense, they connect informativity to alpha specifically, which seems like a narrower sense of “informativity”, maybe tied to how much above 1 alpha is (and therefore how deterministic the probabilities are that the speaker uses).
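The narrower sense of "informativity" maps onto the role alpha plays in the standard RSA speaker rule, where it exponentiates the literal listener's probabilities. A quick sketch with toy numbers (this is the textbook RSA formulation, not B&al2021's exact model, and I'm omitting utterance costs):

```python
def speaker_dist(listener_probs, alpha):
    """Standard RSA pragmatic speaker: S1(u|m) is proportional to
    L0(m|u)**alpha. Higher alpha -> more deterministic choices."""
    weights = [p ** alpha for p in listener_probs]
    total = sum(weights)
    return [w / total for w in weights]

# Two candidate utterances where the literal listener recovers the
# intended meaning with probability 0.6 vs. 0.4:
soft = speaker_dist([0.6, 0.4], alpha=1.0)  # probability matching: [0.6, 0.4]
hard = speaker_dist([0.6, 0.4], alpha=5.0)  # much more peaked
```

So "informative and cooperative" is about the utility function itself (maximizing correct listener inference), while alpha controls how deterministically the speaker acts on that utility.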

(3) Methodology, no-word-knowledge variant: Even after reading the methods section, I was still a little fuzzy about how general vocabulary size is estimated and used in place of specific word familiarity, except that of course it’s the same value for all objects (rather than in fact differing by word familiarity).

Tuesday, November 9, 2021

Some thoughts on Perfors et al. 2010

I’m reminded how much I enjoy this style of modeling work. There’s a lot going on, but the intuitions and motivations for it made sense to me throughout, and I really appreciated how careful P&al2010 were in both interpreting their modeling results and connecting them to the existing developmental literature.

Some thoughts:

(1) I generally am really a fan of building less in, but building what you do build in more abstractly. This approach makes the problem of explaining where that built-in stuff comes from easier -- if you have to explain where fewer things came from, you have less explaining to do.

(2) I really appreciate how careful P&al2010 are with their conclusions about the value of having verb classes. It does seem like the model with classes (K-L3) captures the age-related effect of less overgeneralization much more strongly while the one with a single verb class (L3) doesn’t. But, P&al2010 still note that both technically capture the key effects. Qualitative developmental pattern as the official evaluation measure, check! (Something we see a lot in modeling work, because then you don’t have to explain every nuance of the observed behavior;  instead you can say the model can predict something that matters a lot for producing that observed behavior, even if it’s not the only thing that matters.)

(3) Study 3: It might seem strange to try to add more to the model in Study 2 that already seems to capture the known empirical developmental data with just syntactic distribution information. But, the thing we always have to remember is that learning any particular thing doesn’t occur in a vacuum -- if information is in the input that’s useful, and children don’t filter it out for some reason, then they probably do in fact use it and it’s helpful to see what impact this has on an explanatory model like this. Basically, does the additional information intensify the model-generated patterns or muck them up, especially if it’s noisy? This can tell us about whether kids could be using this additional information (or when they’re using it) or maybe should ignore it, for instance. This comes back at the end of the results presentation, when P&al2010 mention that having 13 features with only 6 being helpful ruins the model -- the model can’t ignore the other 7, tries to incorporate them, and gets mucked up.  Also, as P&al2010 demonstrate here, this approach could differentiate between different model types (i.e., representational theories here: with verb classes vs. without).

(4) Small implementation thing: In Study 3, when noise is added to the semantic feature correlations, so that the appropriate semantic feature only appears 60% of the time: Presumably this would be implemented across verb instances, rather than only 60% of the verbs in that class having the feature? Otherwise, if some verbs always had the feature and some didn’t, I would think the model would probably end up inferring different classes for each syntactic type instead of just one per syntactic type, e.g., a PD-only class with the P feature and a PD-only class with no feature.
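Just to pin down the two readings, here's a sketch of both (my own hypothetical implementation and function names, not P&al2010's code):

```python
import random

def add_noise_token_level(verb_instances, p_feature=0.6, rng=None):
    """Each *instance* of a verb shows the semantic feature 60% of the
    time; every verb type stays probabilistically associated with it."""
    rng = rng or random.Random(0)
    return [(verb, rng.random() < p_feature) for verb in verb_instances]

def add_noise_type_level(verb_types, verb_instances, p_feature=0.6, rng=None):
    """Only 60% of verb *types* carry the feature, but those carry it
    consistently -- the reading I'd worry splits the classes."""
    rng = rng or random.Random(0)
    has_feature = {v: rng.random() < p_feature for v in verb_types}
    return [(verb, has_feature[verb]) for verb in verb_instances]
```

Under the type-level reading, each verb is internally consistent (always with or always without the feature), which is exactly the situation where I'd expect the model to infer separate feature-bearing and feature-less classes per syntactic type.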

Wednesday, October 27, 2021

Some thoughts on Tal et al. 2021

This seemed to me like a straightforward application of a measure of redundancy (measuring whatever level of representation you like) to quantify redundancy in child-directed speech over developmental time. As T&al2021 note, the idea of repetition and redundancy in child-directed speech isn’t new, but this way of measuring it is, and the results certainly accord with current wisdom that (i) repetition in speech is helpful for young children, and (ii) repetition decreases as children get older (and the speech directed at them gets more adult-like). The contributions therefore also seem pretty straightforward: a new, more holistic measure of repetition/redundancy at the lexical level, and the finding that multi-word utterances seem to be the thing that gets repeated less as children get older.

Some other thoughts:

(1) Corpus analysis: For the Providence corpus, with such large samples, I wonder why T&al2021 chose to make only two age bins (12-24 months, and 24-36 months). It seems like there would be enough data there to go finer-grained (like maybe every two months: 12-14, 14-16, etc), and especially zoom in on the gaps in the NewmanRatner corpus between 12 and 24 months.

(2) I had some confusion over the discussion of the NewmanRatner results, regarding the entropy decrease they found with the shuffled word order of Study 2. In particular, I think the explanation for the entropy decrease was that lexical diversity didn’t increase in this sample as children got older. But, I didn’t quite follow why this explained the entropy decrease. More specifically, if lexical diversity stays the same, the shuffled word order keeps the same frequencies of individual words over time, so no change in entropy at the lexical level. With shuffled word order, the multi-word sequences are destroyed, so that should increase entropy. How does no change + entropy increase lead to an overall entropy decrease? 

Relatedly, T&al2021 say  about Study 2 that “the opposite tendencies of lexical- and multi-word repetitiveness in this corpus seem to cancel each other out at 11 months”. This related to my confusion above. Basically, we have constant lexical diversity, so there’s no change to entropy over time coming from the lexical level. Decreasing multi-word repetitions leads to higher entropy over time. What are the opposite tendencies here? It seems like there’s only one tendency (increasing entropy from the loss of the multi-word repetitions).
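For what it's worth, here's a toy version of the shuffling logic as I understand it (the paper's actual entropy measure may differ in its details): shuffling leaves word frequencies -- and hence lexical-level entropy -- untouched, while destroying repeated multi-word sequences and raising sequence-level entropy.

```python
import math
import random
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a frequency table."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bigrams(words):
    return Counter(zip(words, words[1:]))

# Toy child-directed "corpus" with a repeated multi-word frame:
corpus = "you want the ball ? you want the ball ? you want the cup ?".split()
shuffled = corpus[:]
random.Random(0).shuffle(shuffled)

# Unigram entropy depends only on word frequencies, so shuffling can't change it:
h_uni_orig = entropy(Counter(corpus))
h_uni_shuf = entropy(Counter(shuffled))

# Bigram entropy goes up once the repeated sequences are destroyed:
h_bi_orig = entropy(bigrams(corpus))
h_bi_shuf = entropy(bigrams(shuffled))
```

So my confusion stands: on this toy logic, shuffling should only push entropy up (via the multi-word level), which is why the reported entropy decrease needs some extra ingredient to explain it.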

Thursday, October 14, 2021

Some thoughts on Harmon et al. 2021

 I think it’s a testament to the model description that the simulations seemed almost unnecessary to me -- they turned out exactly as (I) expected, given what the model is described as trying to do, based on the frequency of novel types. I also really love seeing modeling work of this kind used to investigate developmental language disorders -- I feel like there’s just not as much of this kind of work out there, and the atypical development community really benefits from it. That said, I do think the paper suffers a bit from length limitations. I definitely had points of confusion about what conceptually was going on (more on this below).

(1) Production probability: The inference problem is described as trying to identify the “production probability”, but it took me awhile to figure out what this might be referring to. For instance, does “production probability” refer to the probability that this item will take some kind of morphology (i.e., be “productive”) vs. not in some moment? If an item has a production probability of say, .5, does that mean that the item is actually “fully” productive, but that productivity is only accessed 50% of the time (so it would be a deployment issue that we see 50% in the output)? Or does it mean that only 50% of the inflections that should be used with that item are actually used (e.g. -ed but not -ing)? (That seems more like a representation issue.) Or does “production probability” mean something else? 

I guess here, if H&al2021 are focusing on just one morpheme, it would be the deployment option, since that morpheme is either used or not. Later on, H&al2021 talk about this probability as “the probability for the inflection”, which does make me think it’s how often one inflection applies, which also aligns with the deployment option. Even later, when talking about the Pitman-Yor process, it seems like H&al2021 are talking about the probability assigned to the fragment that incorporates the inflection directly. So, this corresponds to how often that fragment gets deployed, I think.

(2) Competition, H&al2021 start a train of thought with “if competition is too difficult to resolve on the fly”: I don’t think I understand what “competition” means in this case. That is, what does it mean not to resolve the competition? I thought what was going on was that if the production probability is too low, the competition is lost (resolved) in favor of the non-inflected form. But this description makes it sound like the competition is a separate process (maybe among all the possible inflected forms?), and if that “doesn’t resolve”, then the inflected form loses to another option (which is compensation).

(3) In the description of the Procedural Deficit Hypothesis, DLD kids are said to “produce an unproductive rule”: I don’t think I follow what this means -- is it that these kids produce a form that should be unproductive, like “thank” for think-past tense? This doesn’t seem to align with “memorization using the declarative memory system”, unless these kids are hearing “thank” as think-past tense in their input (which seems unlikely). Maybe this was a typo for “produce an uninflected form”?

(4) The proposed account of H&al2021 is that children are trying to access appropriate semantics, and not just the appropriate form (i.e., they prioritize meaning); so, this is why bare forms win out.  This makes intuitive sense to me from a bottleneck standpoint. If you want to get your message across, you prioritize content over form. This is what little typically-developing kids do, too, during telegraphic speech.

(5) Potentially related work on productivity: I’m honestly surprised there’s no mention of Yang’s work on productivity here -- he has a whole book of work on it (Yang 2016), and his approach focuses on specifying how many types are necessary for a rule to be productive, which seems relevant here.


Yang, C. (2016). The price of linguistic productivity: How children learn to break the rules of language. MIT Press.

(6) During inference, the modeled learner is given parsed input and has to infer fragments: So the assumption is that the DLD child perceived the form and the inflection correctly in the input, but the issue is retrieving that form and inflection during production. I guess this is because DLD kids comprehend morphology just fine, but struggle with production?

(7) Results: “the results of t tests showed that in all models, the probability of producing wug was higher than wugged...due to the high frequency of the base form”: Was this true even for the TD (typically developing child) model? If so, isn’t that not what we want to see, because TD children pass the wug test? 

Also, were these the only two alternatives available, or were other inflectional options on the table too? 

Also, is it that the modeled child just picked the one with the highest probability? 

Are the only options available the chunked inflections (including the null of the bare form), or are fragments that just have STEM + INFLECTION (without specifying the inflection) also possible? If so, how can we tell that option from the STEM + null of the bare form in practice? Both would result in the bare form, I would think.

(8) In the discussion, processing difficulties are said to skew the intake to have fewer novel types, which is crucial for inferring productivity. So, this means that kids don’t infer a high enough probability for the productive fragment, as it were; I guess this doesn’t affect their comprehension, because they can still use the less efficient fragments to parse the input (but maybe not parse it as fast). So maybe this is a more specific hypothesis about the “processing difficulties” that cause them not to parse novel types in the input that well?

(9) Discussion, “past tense rule in the DLD models was not entirely unproductive”: Is this because the fragment probability wasn’t 0? Or, how low does it have to be to be considered unproductive? This brings me back to Yang’s work, where there’s a specific threshold. Below that threshold, it’s unproductive. And that threshold can actually be pretty high  (like, definitely above 50%).
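If I'm remembering Yang (2016) correctly, the Tolerance Principle gives a concrete threshold: a rule spanning N types stays productive as long as it has at most N/ln N exceptions. A quick sketch:

```python
import math

def tolerance_threshold(n_types):
    """Yang's (2016) Tolerance Principle: a rule over N item types
    remains productive as long as the number of exceptions does not
    exceed N / ln(N)."""
    return n_types / math.log(n_types)

def is_productive(n_types, n_exceptions):
    return n_exceptions <= tolerance_threshold(n_types)

# The tolerated exception share is surprisingly large for small N:
tolerance_threshold(10)   # ~4.3 exceptions out of 10 types (~43%)
tolerance_threshold(100)  # ~21.7 exceptions out of 100 types (~22%)
```

So with 100 types, the rule needs to apply to roughly 78% of them to count as productive, and the required rule-following share climbs as N grows -- a genuinely specific answer to "how low does it have to be?".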

(10) Discussion, the qualitative pattern match with TD kids is higher than with DLD kids: I get that qualitative pattern matching is important and useful when talking about child behavior, but 90-95% production vs. 30-60% production looks pretty different from Figure 3. I guess Figure 3’s in log space, and who knows what other linking components are involved. But still, I feel like it would have been rhetorically more effective to talk about higher vs lower usage than give the actual percentages here.

(11) Discussion, “possible that experience with fewer verb types in the past tense, especially with higher frequency, biases children with DLD to store a large number of inflected verbs as a single unit (stem plus inflection) compared to TD children, further undermining productivity": This description makes it sound like storing STEM + inflection directly isn’t productive. But, I thought that was the productive fragment we wanted. Or was this meant as a particular stem + inflection, like hug + ed?

Tuesday, February 23, 2021

Some thoughts on Tenenbaum et al. 2020

I think it’s a really interesting and intuitive idea to add semantic constraints to the task of morphology identification. That said, I do wonder how much of the morphology prefixes and suffixes might already come for free from the initial speech segmentation process. (I’m reminded of work in Bayesian segmentation strategies, where we definitely get some morphology like -ing sliced off for free with some implementations.) If those morphology pieces are already available, perhaps it becomes easier to implement semantically-constrained generalization over morphology transforms. Here, it seems like a lot of the struggle is in the plausibility of the particular algorithm chosen for identifying suffix morphology. Perhaps that could all be sidestepped.

Relatedly, a major issue for me was understanding how the algorithm underlying the developmental model works (more on this below). I’m unclear on what seem to be important implementational details if we want to make claims about cognitive plausibility. But I love the goal of increasing developmental plausibility!

Other specific thoughts:

(1) The goal of identifying transforms: In some sense, this is the foundation of morphology learning systems (e.g., Yang 2002, 2005, 2016) that assume the child already recognizes a derived form as an instance of a root form (e.g., kissed-kiss, drank-drink, sung-sing, went-go). For those approaches, the child knows “kissed” is the past tense of “kiss” and “drank” is the past tense of “drink” (typically because the child has an awareness of the meaning similarity). Then, the child tries to figure out if the -ed transformation or the -in- → -an- transformation is productive morphology. Here, it’s about recognizing valid morphology transforms to begin with (is -in- → -an- really a thing that relates drink-drank and sing-sang?), so it’s a precursor step.

(2) On computational modeling as a goal: For me, it’s funny to state outright that a goal is to build a computational model of some process. Left implicit is why someone would want to do this. (Of course, it’s because a computational model allows us to make concrete the cognitive process we think is going on -- here, a learning theory for morphology -- and then evaluate the predictions that implemented theory makes. But experience has taught me that it’s always a good idea to say this kind of thing explicitly.)

(3) Training GloVe representations on child-directed speech: I love this. It could well be that the nature of children’s input structures the meaning space in a different way than adult linguistic input does, and this could matter for capturing non-adult-like behavior in children.

(4) Morphology algorithm stuff: In general, some of the model implementation details are unclear for me, and it seems important to understand what they are if we want to make claims that a certain algorithm is capturing the cognitive computations that humans are doing.

(a) Parameter P determines which sets (unmodeled, base, derived) the proposed base and derived elements can come from. So this means they don’t just come from the unmodeled set? I think I don’t understand what P is. Does this mean both the “base” and “derived” elements of a pair could come from, say, the “base” set? Later on, they discuss the actual P settings they consider, with respect to “static” vs “non-static”. I don’t quite know what’s going on there, though -- why do the additional three settings for the “Nonstatic” value intuitively connect to a “Nonstatic” rather than “Static” approach? It’s clearly something to do with allowing things to move in and out of the derived bin, in addition to in and out of the base bin...

(b) One step is to discard transforms that don’t meet a “threshold of overlap ratio”. What is this? Is this different from T? It seems like it, but what does it refer to?

(c) Another step is to rank remaining transforms according to the number of wordpairs they explain, with ties broken by token counts. So, token frequency does come back into play, even though the basic algorithm operates over types? I guess the frequencies come from the CHILDES data aggregates.

(d) If the top candidate transform explains >= W wordpairs, it’s kept. So, does this mean the algorithm is only evaluating the top transform each time? That is, it’s discarding the information from all the other potential transforms? That doesn’t seem very efficient...but maybe this has to do with explicit hypothesis testing, with the idea that the child can only entertain one hypothesis at a time…

(e) Each base/derived word pair explained by the new transform is moved to the Base/Derived bins. The exception is if the base form was in the derived bin before; in this case, it doesn’t move. So, if an approved transform seems to actually explain a derived1/derived2 pair, the derived1 element doesn’t go into the base bin? Is the transform still kept? I guess so?

(5) Performance is assessed via hits vs. false alarms, so I think this is an ROC curve. I like the signal detection theory approach, but then shouldn’t we be able to capture performance holistically for each combination by looking at the area under the curve?
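Relatedly, computing the area under the curve from those hit/false-alarm operating points is just a trapezoid-rule integral. A sketch with made-up numbers (these are not T&al2020's results):

```python
def auc(points):
    """Area under an ROC curve given (false_alarm_rate, hit_rate)
    points, via the trapezoid rule. 0.5 = chance, 1.0 = perfect."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical operating points for two parameter settings, anchored
# at (0, 0) and (1, 1):
model_a = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (1.0, 1.0)]
model_b = [(0.0, 0.0), (0.2, 0.4), (0.5, 0.7), (1.0, 1.0)]
auc(model_a), auc(model_b)  # a single holistic score per combination
```

That single number per parameter combination is exactly the holistic comparison I was hoping for.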

Relatedly, transforms are counted as valid if they’re connected to at least three correct base/derived wordpairs, even if they’re also connected to any number of other spurious ones. So, a transform is “correct” if recall >=3, regardless of precision. Okay...this seems a bit arbitrary, though. Why focus on recall, rather than precision for correctness? This seems particularly salient given the discussion a bit further on in the paper that “reliability” (i.e., precision) would better model children’s learning. 

Note: I agree that high precision for early learning (<1 year) is more important than high recall. But I wonder what age this algorithm is meant to be applying to, and if that age would still be better modeled by high precision at the expense of high recall. 

Note 2 from the results later on: I do like seeing qualitative comparison to developmental data, discussing how a particular low-resource setting can capture 8 of the most common valid transforms children have.

(6) T&al2020 talk about a high-resource vs. a low-resource learner. But why not call the high-resource learner an idealized/computational-level learner? Unless Lignos & colleagues meant this to be a process/algorithmic-level learner? (It doesn’t seem like it, but then maybe they were less concerned about some of the cognitive plausibility aspects.)

(7) Fig 3 & 4, and comparisons: 

(a) Fig 3 & 4: I’d love to see the Lignos et al. version with no semantic information for all the parameter values manipulated here. That seems like an easy thing to do (just remove the semantic filtering, but still allow variation for the top number of suffixes N, wordpair threshold W, and permitted wordpairs P for the high-resource learners; for the low-resource learners, just vary W and P). Then, you could also easily compare the area under the curve for this baseline (no semantics) model vs. the semantics models for all the learners (not just the high-resource ones). And that then would make the conclusion that the learners who use semantics do better more robust. (Side note: I totally believe that semantics would help. But it would be great to see that explicitly in the analysis, and to understand exactly how much it helps the different types of learners, both high-resource and low-resource).

(b) Fig 4: I do appreciate the individual parameter exploration, but I’d also like to see a full low-resource learner combination [VC=Full, EC=CHILDES, N=3], too -- at least, if we want to claim that the more developmentally-plausible learners can still benefit from semantic info like this. This is talked about in the discussion some (i.e., VC=Full, EC=CHILDES, N=15 does as well as the original Lignos settings), but it’d be nice to see this plotted in a Figure-4-style plot for easy comparison.

(8) Which morphological transforms we’re after: In the discussion, T&al2020 note that they only focus on suffixes, and certainly the algorithm is only tuned to suffixes. It definitely seems like a more developmentally-plausible algorithm would be able to use meaning to connect more disparate derived forms to their base forms (e.g., drink-drank, think-thought). I’d love to see an algorithm that uses semantic similarity (and syntactic context) as the primary considerations, and then how close the base is to the derived form as a secondary consideration. This would allow the irregulars (like drink-drank, think-thought) to emerge as connected wordpairs. (T&al2020 do sketch some ideas in this direction in the next section, when they talk about model generalizability of morphology, and morphology clustering.)

(9) In the model extension part, T&al2020 say they want to get a “token level understanding of segmentation”. I’m not sure what this means -- is this the clustering together of different morphological transforms that apply to specific words? (I’d call this types, rather than tokens if so.)

(10) T&al2020’s proposed semantic constraint is that valid morphological transforms should connect pairs of base and derived forms that are offset in a consistent direction in semantic space. Hmmm...I guess the idea is that the semantic information encoded by a transform (e.g., past tense, plural, ongoing action) is consistent, so that should be detectable. That doesn’t seem crazy, certainly as a starting hypothesis. My concern in the practical implementation T&al2020 try is the GloVe semantic space, which may or may not actually have this property. The semantic space of embedding models is strange, and not usually very interpretable (currently) in the ways we might hope it to be. But I guess the brief practical demonstration T&al2020 do for their H3 morpheme transforms shows a proof of concept, even if it’s a mystery how a child would agglomeratively cluster things just so. That proof of concept does show it’s in fact possible to cluster just so over the GloVe-defined difference vectors.
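Here's roughly how I picture the consistent-offset check working, with toy 2-d vectors standing in for real GloVe embeddings (the function names and numbers are mine, not T&al2020's):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def offset_consistency(pairs, vectors):
    """Mean pairwise cosine similarity among the base->derived
    difference vectors; high values = a consistent direction."""
    diffs = [
        [d - b for b, d in zip(vectors[base], vectors[derived])]
        for base, derived in pairs
    ]
    sims = [cosine(diffs[i], diffs[j])
            for i in range(len(diffs)) for j in range(i + 1, len(diffs))]
    return sum(sims) / len(sims)

# A real past-tense transform shifts words in roughly the same direction...
vecs = {
    "walk": [1.0, 0.0], "walked": [1.2, 0.9],
    "jump": [0.0, 1.0], "jumped": [0.1, 1.9],
    "talk": [2.0, 1.0], "talked": [2.2, 2.0],
}
consistent = offset_consistency(
    [("walk", "walked"), ("jump", "jumped"), ("talk", "talked")], vecs)

# ...while a spurious transform produces scattered offsets:
vecs.update({"car": [1.0, 1.0], "card": [0.2, 0.1],
             "ten": [0.0, 0.5], "tend": [1.5, 0.4]})
spurious = offset_consistency([("car", "card"), ("ten", "tend")], vecs)
```

Whether actual GloVe space behaves this cleanly is exactly the empirical question, but this is the property the constraint is betting on.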