The utility of incorporating human-like biases into LLMs trying to learn language is that language is a human construct, transmitted by humans to small humans over time, presumably adapted to pre-existing human constraints. So, that’s why “crippling” LLMs could be good – the target of learning is something that results from “crippled” transmission (by humans). I realize this isn’t a new thought, but it struck me particularly with this paper. And that comes back to how to make contributions to the NLP side (and make better LLMs).
I’m still less sure about the contribution to the cognitive science side. I do appreciate Mita et al.’s comment that working memory limitations are therefore plausibly a helpful thing occurring in humans during the critical period, and so support a particular implementation of the Less-Is-More hypothesis. Basically, I think the line of reasoning is something like this: “If we do this thing (impose developing working memory) in non-human systems that learn language well, then those systems learn language better. So, maybe humans have this thing too, because humans also learn language well.”
Some other thoughts:
(1) Interpreting model representations: I struggled.
I appreciate the attempt in Figure 3 to visualize the differences between the two model implementations, but I honestly don’t know what I’m looking at. Why is the clustering on the right (the DynamicLimit-Exp model) better than the clustering on the left (the baseline)? (Other than the fact that we know the performance is better, so it clearly must be.) Maybe something about cluster dispersion in this 2D space? I think the entropy and mean distance measures in Table 4 are supposed to help interpret this, but I still don’t know how to get from those to the explanations given:
NoLimit model: “...clusters...appear to contrast and overlap more, suggesting stagnation in representation learning…less distinguishable….loss of diversity in the learned representations”
vs.
DynamicLimit-Exp model: “...more structured and progressive evolution of embeddings…clusters remain well-separated…with clear distinctions…suggests that the model continuously refines its representations without excessive compression”.
I think for this last part about compression the entropy measure in Table 4 helps (“…higher entropy, indicating a balanced representation that avoids excessive compression”). But I’m still at a loss for the rest of the interpretations.
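To make my confusion concrete: here’s one plausible way measures like those in Table 4 could be computed from a set of embeddings. This is purely a guess at what the authors might mean, not their actual code — the function name and the histogram-based entropy estimate are my own assumptions.

```python
import numpy as np

def embedding_diagnostics(embeddings, n_bins=20):
    """Two rough diagnostics for a set of embedding vectors (hypothetical
    reconstruction of Table-4-style measures, not the paper's code).

    entropy: Shannon entropy of a histogram over pairwise distances --
    higher values would suggest more spread-out, less compressed representations.
    mean_dist: mean pairwise Euclidean distance.
    """
    X = np.asarray(embeddings, dtype=float)
    # All pairwise Euclidean distances (upper triangle only, no self-distances).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(len(X), k=1)
    pair = dists[iu]
    mean_dist = pair.mean()
    # Histogram-based Shannon entropy of the distance distribution.
    counts, _ = np.histogram(pair, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return entropy, mean_dist

# Sanity check: tightly clustered points should give a smaller mean
# pairwise distance than spread-out points.
rng = np.random.default_rng(0)
tight = rng.normal(0, 0.01, size=(50, 8))
loose = rng.normal(0, 1.0, size=(50, 8))
_, d_tight = embedding_diagnostics(tight)
_, d_loose = embedding_diagnostics(loose)
print(d_tight < d_loose)  # True
```

Even with something like this in hand, though, the step from the numbers to claims like “stagnation in representation learning” is the part I can’t reconstruct.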
(2) Justifying the memory implementation
While I totally appreciate the thoughtful discussion of why an exponential curve like the one used by DynamicLimit-Exp is better than a logarithmic or linear one, I think the authors might also have saved themselves some angst by hearkening back to the adult working memory literature (like the recency effect: Anderson & Milson 1989), which also has an exponential component.
John R. Anderson and Robert Milson. 1989. Human memory: An adaptive perspective. Psychological Review, 96(4), 703.
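For readers trying to picture what’s being compared: a growing working-memory window could be scheduled in any of these three shapes. This is a hypothetical sketch of the idea (my own function names and rate parameter, not the paper’s implementation):

```python
import math

def memory_limit(step, total_steps, max_window, schedule="exp", rate=5.0):
    """Hypothetical working-memory window schedules (illustrative only).

    Returns how many recent tokens the learner may attend to at a given
    training step, growing from 1 up to max_window over training.
    """
    t = step / total_steps  # normalized training progress in [0, 1]
    if schedule == "linear":
        frac = t
    elif schedule == "log":
        frac = math.log1p(t * (math.e - 1))  # log curve rescaled to [0, 1]
    elif schedule == "exp":
        # Slow start, fast finish: saturating exponential growth.
        frac = (math.exp(rate * t) - 1) / (math.exp(rate) - 1)
    else:
        raise ValueError(schedule)
    return max(1, round(frac * max_window))

# The exponential schedule keeps the window small for most of training,
# then expands quickly near the end.
print([memory_limit(s, 10, 100, "exp") for s in range(0, 11, 2)])
# → [1, 1, 4, 13, 36, 100]
```

The intuition for preferring the exponential shape is that the model spends most of its training under a tight constraint and only unlocks a large window late, which is loosely parallel to the exponential decay term in Anderson & Milson’s rational analysis of memory.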