Computational Models of Language (at UC Irvine)

Monday, May 12, 2025

Some thoughts on Poli et al. 2024

I love how this work uses state-of-the-art “AI” models (i.e., neural networks with distributed representations) to concretely explore ideas about what must be built in in order to explain infant phonetic category development. The authors are quite clear that this is a computational-level model that tries to capture “evolutionary” vs. “developmental” timelines, with the built-in knowledge getting built in during the evolutionary stage, as implemented by the pretraining. To me, this is a beautiful application of the neural network framework that yields easily interpretable results.

Other thoughts:

(1) Pretraining types: Why ambient in Experiment 1 and multilingual in Experiment 2, instead of language-specific sounds? I get that ambient is a reasonable baseline (that did shockingly well), but I’m curious as to why multilingual sounds would be a better initial-state-inducer than American English sounds/Japanese sounds, respectively. Maybe it’s about capturing the “evolutionary aspect” that the sound processing mechanisms are tuned to human speech in general?

Related: Future work might try to pretrain with the kind of sounds that can be heard in utero (I know certain parts of the acoustic signal make it through while others don’t). The language-specific versions of that seem like they’d be a great investigation of prenatal development.

(2) “Panel a) of Figure 5 suggests that pretraining in multilingual is sensibly similar to pretraining on ambient speech sounds in terms of initial speech sound discrimination”.

I don’t understand what’s sensible about this. I might expect that pretraining on speech sounds would yield even better initial performance than pretraining on ambient sounds. The authors themselves note this: “Contrary to our initial hypothesis, training on multilingual speech does not yield higher speech sound discrimination capabilities compared to pretraining on ambient sounds.” So maybe the “sensibly” is a typo?

(3) It seems like no version of this model actually shows true loss of discriminability. Instead, the best case (Experiment 3 in the Appendix) shows a slight reduction of discriminability, though still pretty darned good, at >.90 accuracy. So, I’m not sure what to make of the actual results, even though I really appreciate the way the authors set up the acquisition problem space. Maybe the issue is that this is effectively an ideal learner? So, maybe adding in cognitive constraints would cause the “loss trajectory” to actually become loss of discrimination (accuracy <0.50)?

Thursday, April 24, 2025

Some thoughts on Mita et al. 2025

The utility of incorporating human-like biases into LLMs trying to learn language is that language is a human construct, transmitted by humans to small humans over time, presumably adapted to pre-existing human constraints. So, that’s why “crippling” LLMs could be good – the target of learning is something that results from “crippled” transmission (by humans). I realize this isn’t a new thought, but it struck me particularly with this paper. And that comes back to how to make contributions to the NLP side (and make better LLMs).

I’m still less sure about the contribution to the cognitive science side. I do appreciate Mita et al.’s comment that working memory limitations are therefore a plausible helpful thing occurring in humans during the critical period, and so support a particular implementation of the Less-Is-More hypothesis. Basically, I think the line of reasoning is something like this: “If we do this thing (impose developing working memory) in non-human systems that learn language well, then those systems learn language better. So, maybe humans have this thing too, because humans also learn language well."

Some other thoughts:

(1) Interpreting model representations: I struggled.

I appreciate the attempt in Figure 3 to visualize the differences between the two model implementations, but I honestly don’t know what I’m looking at. Why is the clustering on the right for the DynamicLimit-Exp better than the clustering on the left for the baseline? (Other than the fact that we know the performance is better, so it clearly must be.) Maybe something about cluster dispersion in this 2D space? I think the measures in Table 4 for entropy and mean distance are supposed to help interpret this, but I still don’t know how to get from those to the explanations given:

NoLimit model: “...clusters...appear to contrast and overlap more, suggesting stagnation in representation learning…less distinguishable….loss of diversity in the learned representations”

vs.

DynamicLimit-Exp model: “...more structured and progressive evolution of embeddings…clusters remain well-separated…with clear distinctions…suggests that the model continuously refines its representations without excessive compression”.

I think for this last part about compression the entropy measure in Table 4 helps (“..higher entropy, indicating a balanced representation that avoids excessive compression.”) But I’m still at a loss for the rest of the interpretations.

(2) Justifying the memory implementation

While I totally appreciate the thoughtful discussion of why an exponential curve like the one used by DynamicLimit-Exp is better than a logarithmic or linear one, I think the authors might also have saved themselves some angst by hearkening to the adult working memory literature (like the recency effect: Anderson & Milson 1989), which also has an exponential component.

John R Anderson and Robert Milson. 1989. Human memory: An adaptive perspective. Psychological Review 96(4):703.

Tuesday, February 25, 2025

Some thoughts on Lan et al. 2024

I really feel for the goal of investigating what general-purpose learners can extract from an input signal, with the aim of identifying when poverty of the stimulus might be occurring. I also absolutely share reservations about what current large language models (LLMs) are doing, and how they and their results should be classified (i.e., are they linguistically-neutral? Are their results human-like enough? How do we decide?) With that said, I struggled with certain parts of the argument in this paper, in terms of how to interpret the LLM results. More below, along with some other thoughts.

(1) On using LLMs as proxies for good general-purpose learners, without caring if they’re cognitively plausible or have acquired human-like knowledge

(a) From the paper’s intro: LLMs “can be used as tools for assessing the information in a given corpus without assuming that these models are cognitively plausible in any way and without even asking whether these models have achieved an adequate knowledge of the pattern under consideration.”

I definitely appreciate the note that an LLM can provide useful information about acquisition even if it’s not cognitively plausible and even if it doesn’t achieve the target linguistic knowledge. It’s a pretty blunt admission, and so it’s worth calling out and exploring how this could be possible. (I think the answer is that we need to be very precise about what useful information we think we can gain from a non-cognitive modeled learner that achieves only an approximation of the target knowledge.

My thoughts, and we can see if they align with Lan et al’s thoughts in section 2, is that we can learn something about the signal available in principle in the input to a (powerful) learner with whatever biases that LLM has. I still need some help understanding what counts as a good-enough approximation of the target knowledge, though. Otherwise, I’m not sure what we conclude about whether the input signal available is or isn’t sufficient to infer the target knowledge.

(b) From 2.1: “If a given ANN can reach such an approximation from a sufficiently rich corpus, we can use it as a proxy for a good general-purpose learner, even if the ANN is not such a learner itself.”

This part I understand: if a less-powerful learner can succeed in a given acquisition scenario, then most likely a more-powerful learner will succeed too. (This logic isn’t perfect though – sometimes more-constrained learners (like children) do better than less-constrained learners (like adults). This is the whole fascination with the “Less is More” hypothesis of Newport (1990).

“If the model provides a reasonable approximation of wh-movement from a developmental-realistic corpus, this suggests that a good general-purpose learner will learn the correct pattern from that corpus and that the APS in this domain does not hold.”

I don’t follow this part exactly. I think this reasoning hinges on the model’s approximation being “close enough”, and its lesser learning power somehow corresponding to how far away its approximation is from the actual target pattern. That is, I think this reasoning assumes that the “distance” between the less-powerful learner’s approximation and the target knowledge is directly correlated with the “distance” between the less-powerful learner’s learning ability and the more-powerful learner’s learning ability. Is it? It’s not obviously so to me. I’d be much more comfortable with a modeled learner achieving an approximation that’s “close enough” to the target pattern such that we don’t need to make any special leaps of faith to talk about what a better learner could hypothetically learn.

“And if the model fails to reach such an approximation this suggests that a good general-purpose learner will not learn the correct pattern from the corpus.”

Wait, does it? This I definitely don’t follow. I don’t think we can say anything about other modeled learners with better/different learning capabilities. All we know is that this one modeled learner failed. And so we interpret results (and implications for acquisition) only with respect to the modeled learner that actually was implemented.

(c) From 2.3: “...inadequacy of ANNs as models of linguistic cognition but does not pose a problem for our use of these models as a tool for assessing the informativeness of the input data”

Doesn’t it, though? This comes back to the assumption that a more powerful learner will do better than a less-powerful one. One might argue that humans are less-powerful learners than these ANNs, but here we have an ANN ending up with “worse” learning (because the agreement attraction error discussed in this part seems to be a knowledge competence error rather than a performance error).

“...if the LLM does not systematically assign a much higher probability to the grammatical continuation, one potential explanation for this failure…is that the pattern of wh-movement is not sufficiently well represented in the input data to merit its approximation by the model….would suggest that a good linguistically-neutral learner will not acquire the pattern from the data.”

But does it imply that? I think all it shows is that this particular learner can’t. I don’t think we can reasonably say anything about better learners and their ability to extract information from the input signal.

(2) On GPT-3’s (relative) success

(a) From 4.2: - “We are not sure to what extent these numbers can be taken to indicate an approximation of the relevant patterns. If it is a success then it is hardly a striking one.”

If I’m interpreting Figure 5 correctly, then what we see is that some LLM (GPT-3) increases its success dramatically when different lexical items are used. So, it seems like lexical item choice matters for at least this model, and in a potentially favorable way. My question: Does lexical choice matter for humans when judging these items? To the extent that we see similar variation by lexical item, that’s when I would get interested and think the LLM is doing something similar to humans, and so we ought to start paying attention.

“...so even if it approximates the relevant patterns, this does not indicate that a general-purpose learner would acquire the relevant knowledge from a developmentally-realistic corpus of just a few years of linguistic experience.”

I don’t understand why we’re dinging GPT-3 as not a “general-purpose learner.” More generally, this comes back to why we think using LLMs like GPT-3 is informative for questions of the information available in the input. Either we’re allowing a learner who’s not child-like to assess the information available, or we’re not. I do agree with the issue of extracting information from a developmentally-realistic corpus, but then why are we allowing in LLMs that don’t learn from child language interactions? I guess I feel like this critique is more about the decision to use LLMs in the first place to assess information in the input signal, rather than a failing of any particular LLM.

(b) From 4.3: - “...suggests that current models are in principle capable of improving their approximation of the pattern of wh-movement, but also that this improvement requires much more information than what is present in a corpus that corresponds to anything a child might encounter.”

Right, so this is the criticism that I thought was fair, that we want to assess the information in a developmentally-realistic signal to investigate poverty of the stimulus claims. But then, why are we bothering to use LLMs that aren’t trained on that kind of input? If they succeed, we say, “Ah, but they were trained on unrealistic input! Not applicable.” If they fail, we say, “Didn’t work…even with a lot more data, so it *really* didn’t work.” So maybe the excitement is when the LLMs fail and they got a lot more input signal than is actually available? Then, we could say that they would presumably fail on developmentally-realistic input signal too. So, lo, poverty of the stimulus for this general-purpose learner.

(c) From 6: - “stimulus is simply too poor…by a linguistically-neutral learner….if that turns out to be the case, adult speakers’ knowledge of these aspects would mean that children are innately endowed in ways that are not linguistically neutral.”

Okay, if…but I think we’re pretty far from that. Also, wasn’t an earlier criticism of GPT-3 that it was fine-tuned on language tasks, so now it’s not a linguistically-neutral learner after all? So either it counts as one, or it doesn’t, right? We have to know which column to score its successes (and failures) in, and I can’t tell which one it’s supposed to be here.

(3) Implementation choices and how to interpret results

From 5: Lan et al. say they just wanted to see if LLMs show improvement, so they’re not worrying about multiple runs and hyper-parameter search.

Not that I know a lot about LLM-training, but this again strikes me as a problem of what happens if it doesn’t work/improve? I guess, luckily, the LLMs did improve. But if they hadn’t, how would we know it wasn’t a problem of the random seed or wrong hyper-parameters?

Thursday, January 23, 2025

Some thoughts on Leong & Linzen 2024

I really appreciate trying to leverage a sophisticated language modeling tool (neural networks = NNs) to help us understand child language acquisition. I love the attempt to see how different input affects acquisition (here of passivization in English). That said, I’m still struggling to be convinced of what seems to be a major claim of the paper: “neural network language models as theories of acquisition”. I have Feelings (TM) about this, which I talk about below. Short version: I’d like to believe this, but I just don’t yet. So, I don’t know what to do with these results, given that I care about child language acquisition.

Thoughts:

(1) NNs as theories of acquisition: The Feelings (TM).

What’s the theory exactly? I think it’s probably about the nature of the input, at best. That is, it’s asking what kind of information is in the input signal, and using a NN to extract that information. I feel like we can talk about NNs as measuring signal available in the input, assuming the powerful learning mechanism of the NN. And so sure, the signal either is or isn’t there. But that’s not a theory of acquisition. That’s more an assessment of the input signal (a poverty of the stimulus argument). And I’m all in favor of exploring what information is available vs. not in the input signal. I just feel like that’s not the same as a theory of acquisition, which should speak to how the child uses that information as part of the (acquisitional) intake.

I mean, I really like the idea of zeroing in on the types of input signal that have an effect on generalization behavior (i.e., frequency of active vs. passive but not actionality/affectedness). But what’s missing for me is an explanation of *why* those things matter or don’t matter. This is where a computational cognitive model has a leg up on NNs/LLMs, because the cognitive model version is implementing an interpretable theory of acquisition. Then, when the intake changes and the generated behavior changes, we can look inside to understand why those changes had the effect they did. That’s a more satisfying theory of acquisition to me.

From section 4, Experiment 1B: Comparing language model and human judgments.

For all the reasons outlined here about how neural networks aren’t human-like (neural networks overgeneralize in ways humans don’t, neural networks are less data-efficient than humans), I really hesitate to label a NN model a “theory of acquisition”. Again, I’m for it as a tool for measuring information in the input signal, but not as a theory of the acquisition process.

From 5, Experiment 2: Intervening on training data: “To the extent that the model is a reliable cognitive model of human language learning, our interventions…” – Exactly this. This is my issue. I’m struggling to be convinced that these NNs are reliable cognitive models of human language learning. And with that in question in my mind, I don’t know what to take from these results.

About 8.2 Using neural networks as models of human learners

I really appreciate the attempt of this section to justify how NNs can be used as theories of acquisition, but I still have the same concerns from above. At best, if given plausible input data, these models can assess information available in that input signal. I don’t think they tell us about how a child is using that signal, or offer a “theory” (i.e., explanation) for how acquisition works. I do agree that the ability to manipulate the input signal (or intake) is valuable and hard to do in behavioral experiments. But this is where computational *cognitive* models have a leg up: there we can adjust the input/intake to the modeled child however we want, and we’re implementing a theory that in fact models something about the child’s acquisition process.

From 8.2: “...working with neural networks allows for the ability to probe a model’s internal processes to understand which mechanisms are vital to the model’s learning process and form hypotheses about how humans may learn”.

I would love to see this here. What internal processes of the NNS here link to mechanisms of passivization acquisition?

From 8.2: “Without a clear understanding of the inductive biases of the particular neural network chosen for comparison, we cannot make a fair comparison between these models and our theories of human cognition.”

Yes! Exactly. So, given that we all agree on this point, what do we do with the results here if we’re interested in theories of child language acquisition?

Also, about the input set used (4.2 Training corpus).

If you’re talking about data children have access to, your average kid under tween age probably isn’t reading adult-directed reddit text. At the lexical level, there’s a massive difference in lexical composition in speech directed to young children (under five) at the very least, let alone under 10. There may be structural differences in active vs. passive frequency, based on age of child the speech is directed at (let alone any differences between reddit posting and actual child-directed speech or child-text materials). So, as an acquisition researcher, what do I do with the fact that a learning model can or can’t extract information from adult-directed text about the passive? Does this tell me about the signal available in the actual data children get access to? I’m just struggling to see how this implementation informs acquisition (let alone acquisition theory, which my previous feelings were about).

(2) The impact of lexical semantics

I wish other lexical semantic hypotheses had been explored here besides affectedness, because this is a bit of a straw man. Other verbs can surely passivize – like perception verbs (“see” - Lisa sees penguins. Penguins are seen by Lisa) and subject experiencers (“love” – Lisa loves penguins. Penguins are loved by Lisa). But there’s no action and the theme isn’t affected.

Nguyen & Pearl 2021 have a rundown of some of the nuances of lexical semantics, and how they seem to matter (short version: semantic clusters seem to correlate with the acquisition trajectory of passives).

Nguyen, E., & Pearl, L. (2021). The link between lexical semantic features and children’s comprehension of English verbal be-passives. Language Acquisition, 28(4), 433-450.

But anyway…I guess any lexical semantic hypothesis could be investigated this way, and maybe some future work in this area can look at more nuanced versions of the lexical semantic hypothesis.

From 3.3 Results: Doesn’t the fact that there was an effect of verb class (i.e., estimation, price, duration, and experiencer-theme passive drops were different from agent-patient passive drops) indicate that there’s an effect of lexical semantics? It makes me think the lexical semantic manipulation wasn’t quite the right thing somehow, if an effect of actional (agent-patient) vs. not didn’t show up.

From 7, Experiment 2B: Lexical semantics does not significantly affect our models’ acceptability judgments

So, I get that putting the unpassivizable verb in active sentences which the passivizable verb appeared in will nudge the semantics, but nudge it how? Did all the active sentences show affectedness? I guess this maybe addresses my earlier worry that targeting affectedness alone as the lexical semantic feature wasn’t really fair. Here, who knows what’s being targeted, so it might be affectedness, or it might be some other aspect of the actional passivizable verbs. But this comes back to my greater worry: When we find that this lexical nudge doesn’t impact the model’s passivization performance, what do we do with that result? Does lexical nudging just never work? Is it just this lexical nudge, whatever it actually did, that doesn’t work? Why does this lexical nudge work with some verbs but not others?

(3) About indirect evidence

From 6, Experiment 2A: Frequency significantly affects our models’ acceptability judgments

“...English-speaking hear the passive infrequently in child-directed speech…just four passive utterances that include a by-phrase”.

This is where I think a richer discussion of indirect evidence could be useful (a little of it comes back in the general discussion 8.4 with other passive types). There are other types of passives besides be-passives with by-phrases (e.g., Lisa was annoyed by the claim.) For instance, there are passives (or adjectival-passives) without by-phrases (e.g., Lisa was annoyed), and get-passives (e.g., Lisa got annoyed). I do agree that compared with active uses, the passive is rarer than other syntactic constructions – but there are other quantifiable sources of indirect evidence for the be-passive + by-phrase. The question of how much these are or aren’t impacting generalization behavior seems interesting and testable.

Monday, December 4, 2023

Some thoughts on McCoy & Griffiths 2023

Before I read a single word of this paper, I already loved the idea of this: encoding useful symbolic knowledge into a distributed representation that’s been proven capable of Awesome Language Feats. This seems like exactly what we want in order to better understand how language acquisition is possible. I know the goal here is about making artificial neural networks (ANNs) better at language acquisition, but the way to do that is inspired by how children do the same thing. So it seems like there’s a good potential for accomplishing the goal I tend to be more interested in, which is using ANNs to better understand (tiny) human cognition.

Other targeted thoughts:

(1) In describing how the Bayesian prior is encoded into the ANN, M&G2023 say “hypotheses are sampled from that prior to create tasks that instantiate inductive bias in the data”. When I first read this, I wanted to understand better what it means to create a task from a sampled hypothesis. Section 2 says “each ‘task’ is a language so that the inductive bias being distilled is a prior over the space of languages. So…that would be a language whose distribution over elements matches the sampled hypothesis? (That might make sense, assuming a hypothesis in the Bayesian model is a distribution over elements of the potential language.)

After reading section 2, Step 2, this seems like what they’re doing. It’s just that the term “task” was new to me here, and doesn’t seem to describe what’s going on. Maybe this term comes from the ML literature on meta-learning.

(2) Meta-learning: to learn: M&G2023 use model-agnostic meta-learning (MAML), and they say MAML can be viewed as a way to perform hierarchical Bayesian modeling. Why? Because MAML involves learning about the equivalent of hyperparameters – the original model’s parameters – rather than only the model that actually learns directly from the data. It seems important to understand how the original model’s parameters are adjusted, on the basis of the temporary model’s learning of the sampled data. I don’t think I quite understand how this works.

Related: Pre-training vs. prior-training. M&G2023 describe these approaches as a head start (pre-training) vs. learning to learn (prior training). It feels like the details of how prior training works are now important – that is, transferring what was learned from temporary model M’ to original model M in the MAML approach. This transfer is clearly meant to be different from pre-training, which involves training M on a more-general task…which is somehow not “learning to learn”, even though the task is general. I may just need to read more in this literature to understand the difference, though.

(3) M&G2023 note that the prior-trained neural network can learn like a Bayesian model (e.g., pretty well from 10 examples), but it’s way faster because of the parallel processing architecture. This comment about the relative speed of Bayesian models vs. prior-trained neural networks that encode the equivalent of Bayesian inductive biases definitely makes me think about language evolution considerations. Basically, why do human languages have the shape they do? Because languages can be learned via inductive biases that can be encoded into parallel-processing, distributed-representation machines (i.e., human neural networks) that work fast.

(4) It’s great to see strong performance from the prior-trained NN, but the fact the other NNs do pretty darned good too seems noticeable. That is, 8.5 million words may be enough even for NNs with weak inductive biases. M&G2023 note at the end of the section that a better demo would be a smaller corpus, among other considerations, and they in fact explore smaller input sizes (hurrah!).

(5) out-of-distribution generalization: The prior-trained NN always does a little better. Again, it’s great to see the improvement, but is it surprising that the standard NN without the inductive bias does pretty good too? Maybe this is because the standard NN had enough data? (Although M&G2023 say in the next subsection that this may have to do with the distilled inductive biases not being that helpful. So the issue is distilling better biases, i.e., ones defined over naturalistic data more…somehow?) I wonder what would happen if we focused on the versions that only had 1/32nd of the data, since that’s one case where the prior-trained NN definitely did better than the standard one.

(6) Future work: M&G2023 note that future work can distill different inductive biases into NNs and see which ones work better. I love the idea of this, but I think we should be clear about the assumptions we would be making here. Basically, if we’re going to test different theories of inductive biases, then we‘re committing to the NN representation as “good enough” to simulate computation in the human mind. This is fine, but we should be clear about it, especially since it can be hard to interpret what other biases might be active in any given ANN implementation (e.g., LSTMs vs. Transformers).

Wednesday, November 29, 2023

Some thoughts on Frank 2023a and 2023b

I’m definitely on board with the spirit of these papers. My position: I would love to understand more about how children do what they do when it comes to language acquisition. If that also helps large language models (LLMs) do what they do better, then that’s great too.

Some other specific thoughts, responding to certain ideas in “Bridging the gap”:

(1) I definitely understand that the interactive, social nature of children’s input matters. In particular, the social part in child language acquisition is usually about why certain input has more impact than others - the input in an interactive, social environment gets absorbed better by kids. But absorption doesn’t seem to be the problem for LLMs – they take in their data just fine. That said, it does seem like the interaction part helps Chat-GPT (I.e., the ability to query).

More generally, it could be that what a certain input quality (e.g., being social and interactive) does for human kids isn’t necessary for an LLM. But, we don’t know that until we understand why that input quality helps kids in the first place.

(2) I also understand that multimodal input gives concrete extensions to some concepts, and so helps “ground out” meaning in the real world for kids. I’m less sure how multimodal input would help current AI systems — is it maybe helpful for bootstrapping the rest of the cognitive system (somehow?) that allows flexible reasoning?

(3) I think there’s a really good point made about needing the apples-to-apples comparison for evaluation. I remember earlier in the evaluation of speech segmentation models, the models were compared against perfect (adult-like) accuracy of segmentation, and few cognitively-plausible ones did all that well. In contrast, when these same models were tested on the segmentation tasks given to infants (which were meant to demonstrate infant segmentation ability), most models did just fine. Now, whether the models accomplished segmentation the way that the infants did is a different question, and one that would also apply to LLMs once we have apples-to-apples comparisons.

Tuesday, April 25, 2023

Some thoughts on Degen 2023

To me, this is a beautifully accessible review article for the probabilistic pragmatics approach, as implemented in RSA. (Figure 1 in particular made me happy – these helpful visuals really are worth it, though I know it’s hard to get them together just right.) This review article definitely gets me wondering more about how to use RSA for language acquisition (especially when it discusses bounded cognition).

In particular, what’s the (potential) difference between a child’s approximation of Bayesian inference and an adult’s approximation? How much can be captured by this mental computation being pretty good but the units over which inference is operating being immature (e.g., utterance alternatives, meaning options, priors)? For instance, how worthwhile is it to try and capture child behavior on different pragmatic phenomena by assuming adult-like Bayesian inference but non-adult-like units that inference operates over?

Scontras & Pearl 2021 did this a little for quantifier-scope interpretation, but those child data were from five-year-olds, who are known to be pretty adult-like for non-pragmatic things. What about younger kids? And of course, what about other pragmatic phenomena that we have child data for?

Tuesday, April 18, 2023

Some thoughts on Diercks et al. 2023

I really appreciated the leisurely pace and accessible tone of this writing, especially for someone who’s not super-familiar with the nuts and bolts of the Minimalist approach, but very interested in development. Here we can see one of the perks of not having a strict page limit. :)

Some other thoughts:

(1) One key idea of Developmental Minimalist Syntax (DMS) seems to be that the current bottom-up description of possible representations (which is what I take the iterated Merge cycles of the Minimalist approach to be) would actually have a cognitive correlate that we can observe and evaluate (i.e., stages of development). That is, this way of compactly describing acceptable/grammatical adult representations corresponds to an actual cognitive process (at the computational level of description, in Marr’s terms) whose signal can be seen in children’s developmental stages. So, this would support the validity (utility?) of describing adult representations this way.

(2) I didn’t quite follow the link between Minimalist Analytical Constructions (MACs) and Universal Cognition for Language. Is the idea that there are certain representations in the adult knowledge system, and we don’t care if their origin is language-specific? It sounds like that, from the text that follows.

Later on, MACs are described as children’s “toolkit for grammaticalizing their language”. Would this mean that the adult representations are what children use to make sense of (“grammaticalize”) their language? That is, the representations children develop allow them to parse their input into useful information. In my standard way of thinking about these things, the developed/developing representations that children have allow them to perceive certain information in their input (which then is transformed into their “perceptual intake” of the input signal).

In ch 3, part 4, we get a fuller definition: “grammaticalizing” means arriving at and encoding generalizations for the language. So, I think that’s compatible with my idea above that “grammaticalizing” has to do with the developing adult-like representations, and children parse their input with whatever they’ve already developed along the way.

(3) Thinking about acquisition as addition, rather than replacement: Just to clarify, children can have immature representations in one of two ways:

(1) a representation is immature because it’s still changing ([hug X] instead of [Predicate X]), or

(2) a representation is immature because it’s fixed into the adult-like state, but it’s only part of the full adult-like structure (e.g., VP) instead of the adult-like full structure [CP [TP [vp [VP ]]]]. This second version is talked about later in ch.3 a little in “mixed status utterances”, which can have an adult-like part and an immature part.

(4) Predictions for VP before vP (section 4.3): So, I think a prediction of DMS is that we shouldn’t generally see agentive subjects combining productively with verbs (which would be vP) before we see verbs combining productively with their objects (which would be VP). (Ex: Not “I put” before “put the ball” or “put down”, as a specific item.)

How would we then distinguish an item-specific combination that might seem to violate this from a language-general implementation involving that item that might seem to violate this? (That is, if we encounter “I put” before “put the ball”, how do we know if it’s an item-specific use or a productive language-general use?) Is it about where the child seems to be with respect to language-general use (e.g., productively using verbs with objects, but not subjects with verbs)? That is, we’d assume that an instance of “I put” would be item-specific and immature, but “put down” would be productive and general?

Friday, February 10, 2023

Some thoughts on Hahn et al. (2022)

I love the way Hahn et al. (2022) set up the two approaches they’re combining – it seems like the most natural thing in the world to combine them and reap the benefits of both. Hats off to the authors for some masterful narrative there.

In general, I’d love to think about how to apply the resource-rational lossy-context surprisal approach to models of acquisition. It seems like this approach to input representation could be applied to child input for any given existing model (say, of syntactic learning, but really for learning anything), so that we get a better sense of what (skewed) input children might actually be working from when they’re trying to infer properties of their native language.

A first pass might be just to use this adult-like version to skew children’s input (maybe a neural model trained on child-directed speech to get appropriate retention probabilities, etc.). That said, I can also imagine that the retention rate might just generally be less for kids (and kids of different ages) compared to adults because of lower thresholds on the parts that go into calculating that retention rate (e.g., the delta parameter that modulates how much context goes into calculating next-word probabilities). Still, the exciting thing for me is the idea that this is a way to formally implement “developing processing” (or even just “more realistic processing”) in a model that’s meant to capture developing representations.

Wednesday, October 19, 2022

Some thoughts on Hitczenko & Feldman 2022

I love seeing work that evaluates an idea against naturalistic data. It’s often the exciting next “proof of concept” once you’ve got an implemented theory that works on idealized data or controlled experimental data.

Some other thoughts:

(1) I completely sympathize with the idea that anything from the broader context might be relevant for discriminating contrastive dimensions. I think the question then becomes how infants decide which contextual factors to pay attention to, out of all the possible ones. Are certain ones more salient period, or because the infant brain has certain perceptual biases, etc? What’s the hypothesis space of possible contextual features, and how might an infant navigate through that hypothesis space?

(2) Thinking about noise: I wonder how much noise this kind of approach can tolerate. For instance (and this is a point the H&F2022 bring up in the discussion), if infants have a fuzzier notion of distributional similarity than Earthmover’s distance/KL divergence/whatever because of their developing learning abilities, can they still catch onto these distributional differences?

H&F2022 also implement some ideas for fuzzier (mis)perception of the input, which shows this approach can tolerate at least 20% noise in perception. So maybe someone could implement the fuzzier distributional similarity idea in a similar way.

Tuesday, October 4, 2022

Some thoughts on Cao et al. 2022

I really like seeing modeling work like this where a more complex, ideal computation (here, EIG) can be well-approximated by a simpler, more-heuristic computation (here, surprisal and KL divergence) when it comes to capturing developmental behavior. Of course, this paper is presenting a first-pass evaluation over adult behavior, but as the authors note, future work can extend their evaluation to infant looking behavior. I definitely would like to see how well this approach works for infant data, since I’d be surprised if there wasn’t some immaturity (i.e., resource constraints, other biases) at work for the computation itself in infants, compared with adult decision-making. And then the interesting question is how to capture that immaturity – for instance, do the approximations of the computation work even better than the idealized computation with EIG? Would even simpler heuristics that don’t approximate EIG as well but are also backward-looking, rather than forward-looking, be better?

Other specific thoughts:

(1) Noisy perception: It’s really nice to see this worked into a developmental model, since – especially for infants – imperfect representations of stimuli seems like a plausible situation. That is, the “perceptual intake” into the learning system depends on immature knowledge and abilities, and is therefore different from the input signal that’s out there in the world. (To be fair, the perceptual intake for adults is also different from the input signal out there in the world, and adults don’t have immature knowledge and abilities. So children basically have to learn to be adult-like in how they “skew” the input signal.)

(2) The RANCH model involves accumulating noisy samples and choosing what to do at each moment. This sounds like the diffusion model of decision-making from mathematical psych to me. I wonder if RANCH is an implementation of that (and if not, how they differ)?

(3) What the learner needs to know: A key idea here is that the motivation to sample the input at all is because the learner knows perception is noisy. To me, this is pretty reasonable knowledge to build into a modeled child. It reminds me of Perkins et al. 2022 where the learner knows misperception occurs, and so has to learn to filter out erroneous data. Importantly there, the modeled learner doesn’t have to know the specifics beyond that.

Perkins, L., Feldman, N. H., & Lidz, J. (2022). The Power of Ignoring: Filtering Input for Argument Structure Acquisition. Cognitive Science, 46(1), e13080.

Friday, February 11, 2022

Some thoughts on Wilcox et al. 2021

This paper made me really happy because it involved careful thought about what was being investigated, an accessible intuition about how each model works, what the selected models can and can’t tell us, how the models should be evaluated, sensible ways to interpret model results, and why we should care. Of course, I did have (a lot of) various things occur to me as I was reading (more on this below), but this is probably one of the few papers I’ve read recently using neural net models that I care about, as a developmental linguist who does cognitive modeling. Thanks, authors!

Specific thoughts:

(1) Poverty of the stimulus vs. the Argument from poverty of the stimulus (i.e, viable solutions to poverty of the stimulus): I think it’s useful to really separate these two ideas. Poverty of the stimulus is about whether the data are actually compatible with multiple generalizations. I think this seems to be true about learning constraints on filler-gap dependencies (though this assertion depends on the data considered relevant in the input signal, which is why it’s important to be clear about what the input is). But the argument from poverty of the stimulus is about viable solutions, i.e., the biases that are built in to navigate the possibilities and converge on the right generalization.

The abstract wording focuses on poverty of the stimulus itself for syntactic islands, while the general discussion in 6.2. is clearly focusing on the (potential) viable solutions uncovered via the models explored in the paper. That is, the focus isn’t about whether there’s poverty of the stimulus for learning about islands, but rather what built-in stuff it would take to solve it. And that’s where the linguistic nativist vs. non-linguistic nativist/empiricist discussion comes in. I think this distinction between poverty of the stimulus itself and the argument from poverty of the stimulus gets mushed together a bit sometimes, so it can be helpful to note it explicitly. Still, the authors are very careful in 6.2. to talk about what they’re interested in as the argument from poverty of the stimulus, and not poverty of the stimulus itself.

(2) Introduction, Mapping out a “lower bound for learnability”: I’m not quite sure I follow what this means: a lower bound in the sense of what’s learnable from this kind of setup, I guess? Which is why anything unlearnable might still require a language-specific constraint?

Also, I’m not sure I quite follow the distinction between top-down vs bottom-up being made about constraints. Is it that top-down is explicitly defined and implemented, as opposed to bottom-up being an emerging thing from whatever was explicitly defined and implemented? But if so, isn’t that more of an implementational-level distinction, rather than a core aspect of the definition (=computational-level) of the constraint? That is, the bottom-up thing could be explicitly defined, if only we understood better how the explicitly defined things caused it to emerge?

(3) The “psycholinguistics paradigm” for model assessment: I really like this approach, precisely because it doesn’t commit you to an internal theory-specific representation. In general, this is a huge plus for evaluating models against observable behavior. Even if you use an internal representation (and someone doesn’t happen to like it), you can still say that whatever’s going on can yield human behavior so it must have something human-like about it. The same is true for distributed/connectionist language models where it’s hard to tell what the internal representations are, aside from being vectors of numbers.

(4) The expected superadditive pattern when both the filler and gap are present: Why should this be superadditive, instead of just additive? What extra thing is happening to make the presence of both yield a superadditive pattern? I have the same question once we get to island stimuli, too, where the factors are filler presence, gap presence, and island structure presence.

(5) The domain-general property of the neural models: The neural models aren’t building any bias for language-specific representations in, but language-specific representations are in the hypothesis space. So, is it possible the best-fitting internal representations are language-specific? This would be similar to Bayesian approaches (e.g., Perfors et al 2011) that allow the hypothesis space to include domain-general options, but inference leads the learner to select language-specific options.

(6) The input: Just a quick note that the neural models here were trained on non-childlike input both in terms of content (e.g., newsire text, wikipedia) and quantity (though I do appreciate the legwork of estimating input quantity). This isn’t a really big deal for the proof-of-concept goal here, but starts to matter more for more targeted arguments about how children could learn various filler-gap knowledge so reliably from their experience. Of course, the authors are aware of this and explicitly discuss this right after they introduce the different models (thanks, authors!).

One thing that could be done: cross-check the input quantity with known ages of acquisition (e.g., Complex NP islands in English by age four, De Villiers et al. 2008). Since the authors say input quantity doesn’t really affect their reported results anyway, then this should be both easy to do and not change any major findings.

The second thing is to train these models on child-directed speech samples and see if the results hold. The CHILDES database should have enough input samples from high-resource languages, and whatever limitations there might be in terms of sampling from multiple children at multiple ages from multiple backgrounds (and other variables), it seems like a step in the right direction that isn’t too hard to do (though I guess that does depend on how hard it is to train these models).

(7) Proof-of-concept argument with these neural models: The fact that these models do struggle with issues of length and word frequency in non-human-like ways does suggest that they might do other things (like learn about filler-gap dependencies) in non-human-like ways too. So we have to be careful about what kind of argument this proof-of-concept is — that is, it’s a computational-level “is it possible at all” argument, rather than a computational-level “is it possible for humans who have these known biases/limitations, etc” argument.

(8) N-grams always fail: Is this just because the 5-token window isn’t big enough, so there’s no hope of capturing dependencies that are longer? I expect so, but don’t remember the authors saying something explicitly like that.

(9) Figure 5: I want to better understand why inversion is an ok behavior (I’m looking at you, GRNN). Does that mean that now a gap in matrix position with a licensing filler in the subject is more surprising than no gap in matrix position with no licensing filler in the subject? I guess that’s not too weird. Basically, GRNN doesn’t want gaps in places they shouldn’t be (which seems reminiscent of island restrictions, as islands are places where gaps shouldn’t be).

(10) One takeaway from the neural modeling results: Non-transformer models do better at generalizing. Do we think this is just due to data overfitting (training input size, parameter number), or something else?

(11) Coordination islands: I know the text says all four neural models showed significant reduction in wh-effects, so I guess the reductions must be significant between the control conditions and the 1st conjunct gaps. But, there seems to be a qualitative difference in attenuation we see for a gap in the first conjunct vs. the second conjunct (and it’s true for all four neural models). I wonder why that should be.

(12) Figure 10, checking my understanding: So, seeing no gap inside a control structure is less surprising sometimes than seeing no gap inside a left-branching structure…I think this may have to do with the weirdness of the control structures, if I’m following 14 correctly? In particular, the -gap control is “I know that you bought an expensive a car last week” and the -gap island is “I know how expensive you bought a car last week”. This may come back to being more precise about surprisal expectations for control vs. island structures. Usually, control structures are fine (grammatical), but here they’re not, and so that could interfere with the potential surprisal pattern we’re looking for.

(13) Subject islands: It was helpful to get a quick explanation about why the GRNN didn’t do as well as the other neural models here (basically, not having a robust wh-effect for the control structures). A quick explanation of this type would be helpful for other cases where we see some neural models (seem to) fail, like the first conjunct for Coordination islands, and then Left Branch and Sentential Subject islands.

(14) Table 14: (just a shout out) Thank you so much, authors, for providing this. Unbelievably helpful summary.

(15) One takeaway the authors point out: If learning is about maximizing input data probability, then these neural approaches are similar to previous approaches that do this. In particular, maximizing input data probability corresponds to the likelihood component of any Bayesian learning approach, which seems sensible. Then, the difference is just about the prior part, which corresponds to the inductive biases built in.

(16) General discussion: I’m not quite sure I follow why linguistic nativist biases would contrast with empiricist biases by a priori downweighting certain possibilities — maybe this is another way of saying that one type of language-specific bias skews/limits the hypothesis space a certain way only if it’s a language-based hypothesis space? In contrast, a domain-general bias skews/limits the hypothesis space no matter what kind of hypothesis space it is. The particular domain-general bias of maximizing input probability of course doesn’t occur a priori— the learner needs to see the input data. But other kinds of domain-general biases seem like they could skew the hypothesis space a priori (e.g., the simplicity preference from Perfors et al. 2006).

(17) Another takeaway from the general discussion is that the learner doesn’t obviously need built-in language-specific biases to learn these island constraints. But I would love to know what abstract representations get built up in the best-performing neural models from this set, like JRNN. These are likely linguistic, as they’re word forms passed through a convolutional neural network (and therefore compressed somehow), and it would be great to know if they look like syntactic categories we recognize or something else.

So, I’m totally on board with being able to navigate to the right knowledge in this case without needing language-specific (in contrast with domain-general) help. I just would love to know more about the intermediate representations, and what it takes to plausibly construct them (especially for small humans).

Computational Models of Language (at UC Irvine)

Monday, May 12, 2025

Some thoughts on Poli et al. 2024

Thursday, April 24, 2025

Some thoughts on Mita et al. 2025

Tuesday, February 25, 2025

Some thoughts on Lan et al. 2024

Thursday, January 23, 2025

Some thoughts on Leong & Linzen 2024

Monday, December 4, 2023

Some thoughts on McCoy & Griffiths 2023

Wednesday, November 29, 2023

Some thoughts on Frank 2023a and 2023b

Tuesday, April 25, 2023

Some thoughts on Degen 2023

Tuesday, April 18, 2023

Some thoughts on Diercks et al. 2023

Friday, February 10, 2023

Some thoughts on Hahn et al. (2022)

Wednesday, October 19, 2022

Some thoughts on Hitczenko & Feldman 2022

Tuesday, October 4, 2022

Some thoughts on Cao et al. 2022

Friday, February 11, 2022

Some thoughts on Wilcox et al. 2021

People who think this blog is awesome

Members