I'm sympathetic to the goal of investigating what general-purpose learners can extract from an input signal, with the aim of identifying when poverty of the stimulus might be occurring. I also absolutely share reservations about what current large language models (LLMs) are doing, and how they and their results should be classified (e.g., are they linguistically neutral? Are their results human-like enough? How do we decide?). With that said, I struggled with certain parts of the argument in this paper, in terms of how to interpret the LLM results. More below, along with some other thoughts.
(1) On using LLMs as proxies for good general-purpose learners, without caring if they’re cognitively plausible or have acquired human-like knowledge
(a) From the paper’s intro: LLMs “can be used as tools for assessing the information in a given corpus without assuming that these models are cognitively plausible in any way and without even asking whether these models have achieved an adequate knowledge of the pattern under consideration.”
I definitely appreciate the note that an LLM can provide useful information about acquisition even if it's not cognitively plausible and even if it doesn't achieve the target linguistic knowledge. It's a pretty blunt admission, and so it's worth calling out and exploring how this could be possible. (I think the answer is that we need to be very precise about what useful information we think we can gain from a modeled learner that isn't cognitively plausible and achieves only an approximation of the target knowledge.)
My thought (and we can see whether it aligns with Lan et al.'s thoughts in section 2) is that we can learn something about the signal available, in principle, in the input to a (powerful) learner with whatever biases that LLM happens to have. I still need some help understanding what counts as a good-enough approximation of the target knowledge, though. Otherwise, I'm not sure what we can conclude about whether the available input signal is or isn't sufficient to infer the target knowledge.
(b) From 2.1: “If a given ANN can reach such an approximation from a sufficiently rich corpus, we can use it as a proxy for a good general-purpose learner, even if the ANN is not such a learner itself.”
This part I understand: if a less-powerful learner can succeed in a given acquisition scenario, then most likely a more-powerful learner will succeed too. (This logic isn't perfect, though; sometimes more-constrained learners (like children) do better than less-constrained learners (like adults), which is the whole fascination with the "Less is More" hypothesis of Newport (1990).)
“If the model provides a reasonable approximation of wh-movement from a developmental-realistic corpus, this suggests that a good general-purpose learner will learn the correct pattern from that corpus and that the APS in this domain does not hold.”
I don’t follow this part exactly. I think this reasoning hinges on the model’s approximation being “close enough”, and its lesser learning power somehow corresponding to how far away its approximation is from the actual target pattern. That is, I think this reasoning assumes that the “distance” between the less-powerful learner’s approximation and the target knowledge is directly correlated with the “distance” between the less-powerful learner’s learning ability and the more-powerful learner’s learning ability. Is it? It’s not obviously so to me. I’d be much more comfortable with a modeled learner achieving an approximation that’s “close enough” to the target pattern such that we don’t need to make any special leaps of faith to talk about what a better learner could hypothetically learn.
“And if the model fails to reach such an approximation this suggests that a good general-purpose learner will not learn the correct pattern from the corpus.”
Wait, does it? This I definitely don’t follow. I don’t think we can say anything about other modeled learners with better/different learning capabilities. All we know is that this one modeled learner failed. And so we interpret results (and implications for acquisition) only with respect to the modeled learner that actually was implemented.
(c) From 2.3: “...inadequacy of ANNs as models of linguistic cognition but does not pose a problem for our use of these models as a tool for assessing the informativeness of the input data”
Doesn't it, though? This comes back to the assumption that a more-powerful learner will do better than a less-powerful one. One might argue that humans are less-powerful learners than these ANNs, but here we have an ANN ending up with "worse" learning (because the agreement attraction error discussed in this part seems to be a competence error in the underlying knowledge rather than a performance error).
“...if the LLM does not systematically assign a much higher probability to the grammatical continuation, one potential explanation for this failure…is that the pattern of wh-movement is not sufficiently well represented in the input data to merit its approximation by the model….would suggest that a good linguistically-neutral learner will not acquire the pattern from the data.”
But does it imply that? I think all it shows is that this particular learner can’t. I don’t think we can reasonably say anything about better learners and their ability to extract information from the input signal.
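For concreteness, here is a minimal sketch of the kind of minimal-pair probability comparison the quote describes, as I understand it. It uses GPT-2 via the Hugging Face transformers library purely as a stand-in for the models in the paper, and the wh-movement pair is my own illustrative example, not an item from Lan et al.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a stand-in model; any causal LM could be substituted here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def sentence_logprob(sentence: str) -> float:
    """Sum of log-probabilities the model assigns to each token given its left context."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 2..n
    targets = ids[:, 1:]                                    # the tokens actually observed
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()


# An illustrative wh-movement (island) minimal pair; not an item from the paper.
grammatical = "What did the author say that the editor liked?"
ungrammatical = "What did the author make the claim that the editor liked?"

gap = sentence_logprob(grammatical) - sentence_logprob(ungrammatical)
print(f"log P(grammatical) - log P(ungrammatical) = {gap:.2f}")
```

Summed log-probability is only one scoring choice; normalizing by sentence length is another, and the choice matters when the two members of a pair differ in length.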
(2) On GPT-3’s (relative) success
(a) From 4.2: “We are not sure to what extent these numbers can be taken to indicate an approximation of the relevant patterns. If it is a success then it is hardly a striking one.”
If I’m interpreting Figure 5 correctly, then what we see is that some LLM (GPT-3) increases its success dramatically when different lexical items are used. So, it seems like lexical item choice matters for at least this model, and in a potentially favorable way. My question: Does lexical choice matter for humans when judging these items? To the extent that we see similar variation by lexical item, that’s when I would get interested and think the LLM is doing something similar to humans, and so we ought to start paying attention.
“...so even if it approximates the relevant patterns, this does not indicate that a general-purpose learner would acquire the relevant knowledge from a developmentally-realistic corpus of just a few years of linguistic experience.”
I don’t understand why we’re dinging GPT-3 as not a “general-purpose learner.” More generally, this comes back to why we think using LLMs like GPT-3 is informative for questions of the information available in the input. Either we’re allowing a learner who’s not child-like to assess the information available, or we’re not. I do agree with the issue of extracting information from a developmentally-realistic corpus, but then why are we allowing in LLMs that don’t learn from child language interactions? I guess I feel like this critique is more about the decision to use LLMs in the first place to assess information in the input signal, rather than a failing of any particular LLM.
(b) From 4.3: “...suggests that current models are in principle capable of improving their approximation of the pattern of wh-movement, but also that this improvement requires much more information than what is present in a corpus that corresponds to anything a child might encounter.”
Right, so this is the criticism I thought was fair: we want to assess the information in a developmentally-realistic signal to investigate poverty of the stimulus claims. But then why are we bothering to use LLMs that aren't trained on that kind of input? If they succeed, we say, “Ah, but they were trained on unrealistic input! Not applicable.” If they fail, we say, “Didn't work…even with a lot more data, so it *really* didn't work.” So maybe the excitement is when the LLMs fail even though they got much more input signal than is actually available? Then we could say that they would presumably fail on a developmentally-realistic input signal too. So, lo, poverty of the stimulus for this general-purpose learner.
(c) From 6: “stimulus is simply too poor…by a linguistically-neutral learner….if that turns out to be the case, adult speakers’ knowledge of these aspects would mean that children are innately endowed in ways that are not linguistically neutral.”
Okay, if…but I think we’re pretty far from that. Also, wasn’t an earlier criticism of GPT-3 that it was fine-tuned on language tasks, so now it’s not a linguistically-neutral learner after all? So either it counts as one, or it doesn’t, right? We have to know which column to score its successes (and failures) in, and I can’t tell which one it’s supposed to be here.
(3) Implementation choices and how to interpret results
From 5: Lan et al. say they just wanted to see whether LLMs show improvement, so they're not worrying about multiple runs and hyper-parameter search.
Not that I know a lot about LLM training, but this again strikes me as raising the question of what happens if the models don't improve. I guess, luckily, the LLMs did improve. But if they hadn't, how would we know the failure wasn't due to an unlucky random seed or the wrong hyper-parameters?
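The minimal robustness check I'd want before interpreting a null result is something like the sketch below: repeat the fine-tuning and evaluation over several seeds and a small hyper-parameter grid and look at the spread. The seeds, learning rates, and the train_and_evaluate function here are all hypothetical placeholders, not Lan et al.'s actual setup.

```python
import itertools
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Fix the relevant random number generators for a single run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def train_and_evaluate(lr: float) -> float:
    """Stand-in for the actual fine-tuning + minimal-pair evaluation;
    it returns a dummy score just so the scaffolding runs end to end."""
    return random.random()


# A small grid: several seeds crossed with a few plausible learning rates.
seeds = [0, 1, 2, 3, 4]
learning_rates = [1e-5, 3e-5, 5e-5]

scores: dict[float, list[float]] = {}
for seed, lr in itertools.product(seeds, learning_rates):
    set_seed(seed)
    scores.setdefault(lr, []).append(train_and_evaluate(lr))

# Mean and spread per setting, so a null result can't be pinned on one
# unlucky initialization or one badly chosen hyper-parameter value.
for lr, runs in scores.items():
    print(f"lr={lr:.0e}: mean={np.mean(runs):.3f}, sd={np.std(runs):.3f}")
```

Even a handful of seeds per setting would make a "no improvement" outcome much more interpretable than a single run.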