Tuesday, February 25, 2025

Some thoughts on Lan et al. 2024

I really feel for the goal of investigating what general-purpose learners can extract from an input signal, with the aim of identifying when poverty of the stimulus might be occurring. I also absolutely share reservations about what current large language models (LLMs) are doing, and how they and their results should be classified (i.e., are they linguistically-neutral? Are their results human-like enough? How do we decide?) With that said, I struggled with certain parts of the argument in this paper, in terms of how to interpret the LLM results. More below, along with some other thoughts.

(1) On using LLMs as proxies for good general-purpose learners, without caring if they’re cognitively plausible or have acquired human-like knowledge

(a) From the paper’s intro: LLMs “can be used as tools for assessing the information in a given corpus without assuming that these models are cognitively plausible in any way and without even asking whether these models have achieved an adequate knowledge of the pattern under consideration.”

I definitely appreciate the note that an LLM can provide useful information about acquisition even if it’s not cognitively plausible and even if it doesn’t achieve the target linguistic knowledge. It’s a pretty blunt admission, and so it’s worth calling out and exploring how this could be possible. (I think the answer is that we need to be very precise about what useful information we think we can gain from a non-cognitive modeled learner that achieves only an approximation of the target knowledge.)

My thought (and we can see if it aligns with Lan et al.’s thoughts in section 2) is that we can learn something about the signal available in principle in the input to a (powerful) learner with whatever biases that LLM has. I still need some help understanding what counts as a good-enough approximation of the target knowledge, though. Otherwise, I’m not sure what we conclude about whether the input signal available is or isn’t sufficient to infer the target knowledge.

(b) From 2.1:  “If a given ANN can reach such an approximation from a sufficiently rich corpus, we can use it as a proxy for a good general-purpose learner, even if the ANN is not such a learner itself.” 

This part I understand: if a less-powerful learner can succeed in a given acquisition scenario, then most likely a more-powerful learner will succeed too. (This logic isn’t perfect though – sometimes more-constrained learners (like children) do better than less-constrained learners (like adults). This is the whole fascination with the “Less is More” hypothesis of Newport (1990).)

“If the model provides a reasonable approximation of wh-movement from a developmental-realistic corpus, this suggests that a good general-purpose learner will learn the correct pattern from that corpus and that the APS in this domain does not hold.” 

I don’t follow this part exactly. I think this reasoning hinges on the model’s approximation being “close enough”, and its lesser learning power somehow corresponding to how far away its approximation is from the actual target pattern. That is, I think this reasoning assumes that the “distance” between the less-powerful learner’s approximation and the target knowledge is directly correlated with the “distance” between the less-powerful learner’s learning ability and the more-powerful learner’s learning ability. Is it? It’s not obviously so to me. I’d be much more comfortable with a modeled learner achieving an approximation that’s “close enough” to the target pattern such that we don’t need to make any special leaps of faith to talk about what a better learner could hypothetically learn.

“And if the model fails to reach such an approximation this suggests that a good general-purpose learner will not learn the correct pattern from the corpus.” 

Wait, does it? This I definitely don’t follow. I don’t think we can say anything about other modeled learners with better/different learning capabilities. All we know is that this one modeled learner failed. And so we interpret results (and implications for acquisition) only with respect to the modeled learner that actually was implemented.

(c) From 2.3: “...inadequacy of ANNs as models of linguistic cognition but does not pose a problem for our use of these models as a tool for assessing the informativeness of the input data” 

Doesn’t it, though? This comes back to the assumption that a more powerful learner will do better than a less-powerful one. One might argue that humans are less-powerful learners than these ANNs, but here we have an ANN ending up with “worse” learning (because the agreement attraction error discussed in this part seems to be a knowledge competence error rather than a performance error).

“...if the LLM does not systematically assign a much higher probability to the grammatical continuation, one potential explanation for this failure…is that the pattern of wh-movement is not sufficiently well represented in the input data to merit its approximation by the model….would suggest that a good linguistically-neutral learner will not acquire the pattern from the data.” 

But does it imply that? I think all it shows is that this particular learner can’t. I don’t think we can reasonably say anything about better learners and their ability to extract information from the input signal.
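To make the “much higher probability” criterion concrete, here’s a minimal sketch (mine, not Lan et al.’s) of the kind of minimal-pair comparison that quote describes. The log probabilities are made up, and the `margin` parameter is a hypothetical stand-in for however much higher counts as “much higher”. The point is just that the score for one and the same modeled learner depends on where we set that threshold, which is the good-enough-approximation question again.

```python
import math

def prefers_grammatical(logp_grammatical, logp_ungrammatical, margin=0.0):
    """True if the grammatical continuation beats the ungrammatical one
    by more than `margin` (in natural-log units)."""
    return (logp_grammatical - logp_ungrammatical) > margin

def accuracy(pairs, margin=0.0):
    """Proportion of minimal pairs on which the grammatical continuation wins."""
    wins = sum(prefers_grammatical(g, u, margin) for g, u in pairs)
    return wins / len(pairs)

# Made-up (grammatical, ungrammatical) log probabilities for four minimal pairs.
# These are illustrative numbers only, not results from any model.
toy_pairs = [(-2.1, -5.3), (-3.0, -3.2), (-4.5, -4.1), (-1.8, -6.0)]

# Lenient criterion: any preference for the grammatical continuation counts.
print(accuracy(toy_pairs))

# Stricter criterion: the grammatical continuation must be over 10x more probable.
print(accuracy(toy_pairs, margin=math.log(10)))
```

Either way, a low score here only describes this particular modeled learner on these particular items.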

(2) On GPT-3’s (relative) success

(a) From 4.2: “We are not sure to what extent these numbers can be taken to indicate an approximation of the relevant patterns. If it is a success then it is hardly a striking one.”

If I’m interpreting Figure 5 correctly, then what we see is that some LLM (GPT-3) increases its success dramatically when different lexical items are used. So, it seems like lexical item choice matters for at least this model, and in a potentially favorable way. My question: Does lexical choice matter for humans when judging these items? To the extent that we see similar variation by lexical item, that’s when I would get interested and think the LLM is doing something similar to humans, and so we ought to start paying attention.

“...so even if it approximates the relevant patterns, this does not indicate that a general-purpose learner would acquire the relevant knowledge from a developmentally-realistic corpus of just a few years of linguistic experience.” 

I don’t understand why we’re dinging GPT-3 as not a “general-purpose learner.” More generally, this comes back to why we think using LLMs like GPT-3 is informative for questions of the information available in the input. Either we’re allowing a learner who’s not child-like to assess the information available, or we’re not. I do agree with the issue of extracting information from a developmentally-realistic corpus, but then why are we allowing in LLMs that don’t learn from child language interactions? I guess I feel like this critique is more about the decision to use LLMs in the first place to assess information in the input signal, rather than a failing of any particular LLM.

(b) From 4.3: “...suggests that current models are in principle capable of improving their approximation of the pattern of wh-movement, but also that this improvement requires much more information than what is present in a corpus that corresponds to anything a child might encounter.”

Right, so this is the criticism that I thought was fair, that we want to assess the information in a developmentally-realistic signal to investigate poverty of the stimulus claims. But then, why are we bothering to use LLMs that aren’t trained on that kind of input? If they succeed, we say, “Ah, but they were trained on unrealistic input! Not applicable.” If they fail, we say, “Didn’t work…even with a lot more data, so it *really* didn’t work.” So maybe the excitement is when the LLMs fail and they got a lot more input signal than is actually available? Then, we could say that they would presumably fail on developmentally-realistic input signal too. So, lo, poverty of the stimulus for this general-purpose learner.

(c) From 6: “stimulus is simply too poor…by a linguistically-neutral learner….if that turns out to be the case, adult speakers’ knowledge of these aspects would mean that children are innately endowed in ways that are not linguistically neutral.”

Okay, if…but I think we’re pretty far from that. Also, wasn’t an earlier criticism of GPT-3 that it was fine-tuned on language tasks, so now it’s not a linguistically-neutral learner after all? So either it counts as one, or it doesn’t, right? We have to know which column to score its successes (and failures) in, and I can’t tell which one it’s supposed to be here.

(3) Implementation choices and how to interpret results

From 5: Lan et al. say they just wanted to see if LLMs show improvement, so they’re not worrying about multiple runs and hyper-parameter search. 

Not that I know a lot about LLM training, but this again strikes me as a problem: what happens if it doesn’t work/improve? I guess, luckily, the LLMs did improve. But if they hadn’t, how would we know it wasn’t a problem of the random seed or the wrong hyper-parameters?
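Here’s a rough sketch of the check I have in mind, assuming we had scores from several random seeds for each training condition. All the numbers and names below are hypothetical (nothing here comes from the paper), and comparing the mean gain against the larger of the two seed-to-seed standard deviations is just one simple heuristic, not a proper statistical test.

```python
from statistics import mean, stdev

def improvement_vs_seed_noise(baseline_runs, enriched_runs):
    """Return (mean gain, seed-to-seed spread).

    The spread is the larger of the two sample standard deviations,
    used here as a crude yardstick for run-to-run noise.
    """
    gain = mean(enriched_runs) - mean(baseline_runs)
    spread = max(stdev(baseline_runs), stdev(enriched_runs))
    return gain, spread

# Hypothetical accuracies, one per random seed, for two training conditions.
baseline = [0.52, 0.61, 0.48, 0.57]
enriched = [0.60, 0.55, 0.66, 0.59]

gain, spread = improvement_vs_seed_noise(baseline, enriched)
print(f"gain={gain:.3f}, seed spread={spread:.3f}")
if gain > spread:
    print("The gain exceeds run-to-run variation.")
else:
    print("The gain is within run-to-run variation, so one run can't tell us much.")
```

With a single run per condition, there’s no spread to compare against at all, which is the worry.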

Thursday, January 23, 2025

Some thoughts on Leong & Linzen 2024

I really appreciate trying to leverage a sophisticated language modeling tool (neural networks = NNs) to help us understand child language acquisition. I love the attempt to see how different input affects acquisition (here of passivization in English). That said, I’m still struggling to be convinced of what seems to be a major claim of the paper: “neural network language models as theories of acquisition”. I have Feelings (TM) about this, which I talk about below. Short version: I’d like to believe this, but I just don’t yet. So, I don’t know what to do with these results, given that I care about child language acquisition.


(1) NNs as theories of acquisition: The Feelings (TM). 

What’s the theory exactly? I think it’s probably about the nature of the input, at best. That is, it’s asking what kind of information is in the input signal, and using a NN to extract that information. I feel like we can talk about NNs as measuring signal available in the input, assuming the powerful learning mechanism of the NN. And so sure, the signal either is or isn’t there. But that’s not a theory of acquisition. That’s more an assessment of the input signal (a poverty of the stimulus argument). And I’m all in favor of exploring what information is available vs. not in the input signal. I just feel like that’s not the same as a theory of acquisition, which should speak to how the child uses that information as part of the (acquisitional) intake.

I mean, I really like the idea of zeroing in on the types of input signal that have an effect on generalization behavior (i.e., frequency of active vs. passive but not actionality/affectedness). But what’s missing for me is an explanation of *why* those things matter or don’t matter. This is where a computational cognitive model has a leg up on NNs/LLMs, because the cognitive model version is implementing an interpretable theory of acquisition. Then, when the intake changes and the generated behavior changes, we can look inside to understand why those changes had the effect they did. That’s a more satisfying theory of acquisition to me.

From section 4, Experiment 1B: Comparing language model and human judgments.

For all the reasons outlined here about how neural networks aren’t human-like (neural networks overgeneralize in ways humans don’t, neural networks are less data-efficient than humans), I really hesitate to label a NN model a “theory of acquisition”. Again, I’m for it as a tool for measuring information in the input signal, but not as a theory of the acquisition process.

From 5, Experiment 2: Intervening on training data: “To the extent that the model is a reliable cognitive model of human language learning, our interventions…” – Exactly this. This is my issue. I’m struggling to be convinced that these NNs are reliable cognitive models of human language learning. And with that in question in my mind, I don’t know what to take from these results.

About 8.2 Using neural networks as models of human learners

I really appreciate the attempt of this section to justify how NNs can be used as theories of acquisition, but I still have the same concerns from above. At best, if given plausible input data, these models can assess information available in that input signal. I don’t think they tell us about how a child is using that signal, or offer a “theory” (i.e., explanation) for how acquisition works. I do agree that the ability to manipulate the input signal (or intake) is valuable and hard to do in behavioral experiments. But this is where computational *cognitive* models have a leg up: there we can adjust the input/intake to the modeled child however we want, and we’re implementing a theory that in fact models something about the child’s acquisition process.

From 8.2: “...working with neural networks allows for the ability to probe a model’s internal processes to understand which mechanisms are vital to the model’s learning process and form hypotheses about how humans may learn”. 

I would love to see this here. What internal processes of the NNs here link to mechanisms of passivization acquisition?

From 8.2: “Without a clear understanding of the inductive biases of the particular neural network chosen for comparison, we cannot make a fair comparison between these models and our theories of human cognition.” 

Yes! Exactly. So, given that we all agree on this point, what do we do with the results here if we’re interested in theories of child language acquisition?

Also, about the input set used (4.2 Training corpus).

If you’re talking about data children have access to, your average kid under tween age probably isn’t reading adult-directed Reddit text. At the lexical level, there’s a massive difference in lexical composition in speech directed to young children (under five) at the very least, let alone under 10. There may be structural differences in active vs. passive frequency, based on the age of the child the speech is directed at (let alone any differences between Reddit posting and actual child-directed speech or child-text materials). So, as an acquisition researcher, what do I do with the fact that a learning model can or can’t extract information from adult-directed text about the passive? Does this tell me about the signal available in the actual data children get access to? I’m just struggling to see how this implementation informs acquisition (let alone acquisition theory, which my previous feelings were about).

(2) The impact of lexical semantics

I wish other lexical semantic hypotheses had been explored here besides affectedness, because this is a bit of a straw man. Other verbs can surely passivize – like perception verbs (“see” - Lisa sees penguins. Penguins are seen by Lisa) and subject experiencers (“love” – Lisa loves penguins. Penguins are loved by Lisa). But there’s no action and the theme isn’t affected.

Nguyen & Pearl 2021 have a rundown of some of the nuances of lexical semantics, and how they seem to matter (short version: semantic clusters seem to correlate with the acquisition trajectory of passives).

Nguyen, E., & Pearl, L. (2021). The link between lexical semantic features and children’s comprehension of English verbal be-passives. Language Acquisition, 28(4), 433-450.

But anyway…I guess any lexical semantic hypothesis could be investigated this way, and maybe some future work in this area can look at more nuanced versions of the lexical semantic hypothesis.

From 3.3 Results: Doesn’t the fact that there was an effect of verb class (i.e., estimation, price, duration, and experiencer-theme passive drops were different from agent-patient passive drops) indicate that there’s an effect of lexical semantics? It makes me think the lexical semantic manipulation wasn’t quite the right thing somehow, if an effect of actional (agent-patient) vs. not didn’t show up.

From 7, Experiment 2B: Lexical semantics does not significantly affect our models’ acceptability judgments

So, I get that putting the unpassivizable verb in active sentences which the passivizable verb appeared in will nudge the semantics, but nudge it how? Did all the active sentences show affectedness? I guess this maybe addresses my earlier worry that targeting affectedness alone as the lexical semantic feature wasn’t really fair. Here, who knows what’s being targeted, so it might be affectedness, or it might be some other aspect of the actional passivizable verbs. But this comes back to my greater worry: When we find that this lexical nudge doesn’t impact the model’s passivization performance, what do we do with that result? Does lexical nudging just never work? Is it just this lexical nudge, whatever it actually did, that doesn’t work? Why does this lexical nudge work with some verbs but not others?

(3) About indirect evidence

From 6, Experiment 2A: Frequency significantly affects our models’ acceptability judgments

“...English-speaking hear the passive infrequently in child-directed speech…just four passive utterances that include a by-phrase”.

This is where I think a richer discussion of indirect evidence could be useful (a little of it comes back in the general discussion 8.4 with other passive types). There are other types of passives besides be-passives with by-phrases (e.g., Lisa was annoyed by the claim.) For instance, there are passives (or adjectival-passives) without by-phrases (e.g., Lisa was annoyed), and get-passives (e.g., Lisa got annoyed). I do agree that compared with active uses, the passive is rarer than other syntactic constructions – but there are other quantifiable sources of indirect evidence for the be-passive + by-phrase. The question of how much these are or aren’t impacting generalization behavior seems interesting and testable.