It’s nice to see this type of computational cognitive model: a proof of concept for an intuitive (though potentially vague) idea about how children regularize their input to yield more deterministic/categorical grammar knowledge than the input would seem to suggest on the surface. In particular, it’s intuitive to talk about children perceiving some of the input as signal and some as noise, but much more persuasive to see it work in a concrete implementation.
Specific thoughts:
(1) Intake vs. input filtering: Not sure I followed the distinction about filtering the child’s intake vs. filtering the child’s input. The basic pipeline is that external input signal gets encoded using the child’s current knowledge and processing abilities (perceptual intake) and then a subset of that is actually relevant for learning (acquisition intake). So, for filtering the (acquisition?) intake, this would mean children look at the subset of the input perceived as relevant and assume some of that is noise. For filtering the input, is the idea that children would assume some of the input itself is noise and so some of it is thrown out before it becomes perceptual intake? Or is it that the child assumes some of the perceptual intake is noise, and tosses that before it gets to the acquisition intake? And how would that differ for the end result of the acquisition intake?
Being a bit more concrete helps me think about this:
Filtering the input --
Let’s let the input be a set of 10 signal pieces and 2 noise pieces (10S, 2N).
Let’s say filtering occurs on this set, so the perceptual intake is now 10S.
Then maybe the acquisition intake is a subset of those, so it’s 8S.
Filtering the intake --
Our input is again 10S, 2N.
(Accurate) perceptual intake takes in 10S, 2N.
Then acquisition intake could be the subset 7S, 1N.
So okay, I think I get it -- filtering the input gets you a cleaner signal while filtering the intake gets you some subset (cleaner or not, but certainly more focused).
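Here is a minimal Python sketch of the toy example above, just to check my own bookkeeping. It is not the paper's model. The subset sizes (8S, and 7S plus 1N) are the arbitrary ones from my example, and the names (`external_input`, `count`, and so on) are made up for illustration.

```python
from collections import Counter

def count(items):
    """Return (number of signal pieces, number of noise pieces)."""
    c = Counter(items)
    return c["S"], c["N"]

# Toy external input from the example: 10 signal pieces, 2 noise pieces.
external_input = ["S"] * 10 + ["N"] * 2

# Filtering the input: noise is dropped before perceptual intake.
perceptual_a = [x for x in external_input if x == "S"]   # 10S
acquisition_a = perceptual_a[:8]                         # arbitrary subset: 8S

# Filtering the intake: perception is accurate, and the relevant subset
# can still contain noise.
perceptual_b = list(external_input)                      # 10S, 2N
signal_b = [x for x in perceptual_b if x == "S"]
noise_b = [x for x in perceptual_b if x == "N"]
acquisition_b = signal_b[:7] + noise_b[:1]               # arbitrary subset: 7S, 1N

print(count(acquisition_a), count(acquisition_b))
```

The only point is that the first pipeline can never pass noise on to the acquisition intake, while the second one can.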
(2) Using English L1 and L2 data in place of ASL: Clever stand-in! I was wondering what they would do for an ASL corpus. But this highlights how to focus on the relevant aspects for modeling. Here, it’s more important to get the same kind of unpredictable variation in use than it is to get ASL data.
(3) Model explanations: I really appreciate the effort here to give the intuitions behind the model pieces. I wonder if it might have been more effective to have a plate diagram, and walk through the high-level explanation for each piece, and then the specifics with the model variables. As it was, I think I was able to follow what was going on in this high-level description because I’m familiar with this type of model already, but I don’t know if that would be true for people who aren’t as familiar. (For example, the bit about considering every partition is a high-level way of talking about Gibbs sampling, as they describe in section 4.2.)
(4) Model priors: If the prior over determiner class is 1/7, then it sounds like the model already knows there are 7 classes of determiner. Similar to a comment raised about the reading last time, why not infer the number of determiner classes, rather than knowing there are 7 already?
(5) Corpus preprocessing: Interesting step of “downsampling” the counts from the corpora by taking the log. This effectively compresses the differences between determiner frequencies, I think. I wonder why they did this, instead of just using the normalized frequencies? They say this was to compensate for the skewed distribution of frequent determiners like “the”, but I don’t think I understand why that’s a problem. What does it matter if you have a lot of “the”, as long as you have enough of the other determiners too? They have the minimum cutoff of 500 instances after all.
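To see how much the log compresses things, here is a small sketch comparing normalized raw counts with normalized log counts. The counts are invented for illustration (only the 500 minimum echoes the paper's cutoff), and I'm assuming a natural log followed by normalization, which may not match exactly what the authors did.

```python
import math

# Hypothetical determiner counts; NOT taken from the paper's corpora.
counts = {"the": 50000, "a": 20000, "some": 2000, "every": 500}

def normalize(weights):
    """Scale a dict of non-negative weights so the values sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

raw_share = normalize(counts)
log_share = normalize({k: math.log(v) for k, v in counts.items()})

# Ratio between the most and least frequent determiner under each scheme.
raw_ratio = raw_share["the"] / raw_share["every"]
log_ratio = log_share["the"] / log_share["every"]

for det in counts:
    print(f"{det:>6}: raw={raw_share[det]:.3f}  log={log_share[det]:.3f}")
print(f"the/every ratio: raw={raw_ratio:.1f}, log={log_ratio:.2f}")
```

With these made-up numbers, a 100-to-1 difference in raw counts shrinks to less than 2-to-1 after the log, so the rare determiners carry far more relative weight than they do in the raw data.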
(6) Figure 1: It looks like the results from the non-native corpus with the noise filter recover the rates of sg, pl, and mass noun combination pretty well (compared against the gold standard). But the noise filter over the native corpus skews a bit towards determiners that allow more noun types than in the gold standard (e.g., more determiners allowing 3 noun types). Side note: I like this evaluation metric a little better than inferring fixed determiner classes, because individual determiner behavior (how many noun classes it allows) can be counted more directly. We don’t need to worry about whether we have the right determiner classes or not.
(7) Evaluation metrics: Related to the previous thought, maybe a more direct evaluation metric is to just compare allowed vs. disallowed noun vectors for each individual determiner? Then the class assignment becomes a means to that end, rather than being the evaluation metric itself. This may help deal with the issue of capturing the variability in the native input that shows up in simulation 2.
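A rough sketch of what I have in mind for the per-determiner comparison, assuming each determiner gets a binary vector over (sg, pl, mass) marking which noun types it allows. The function name, the three example determiners, and the "predicted" vectors are all mine, chosen for illustration; they are not results from the paper.

```python
NOUN_TYPES = ("sg", "pl", "mass")

def per_determiner_accuracy(predicted, gold):
    """For each determiner, the fraction of noun types where predicted == gold."""
    scores = {}
    for det, gold_vec in gold.items():
        pred_vec = predicted[det]
        matches = sum(p == g for p, g in zip(pred_vec, gold_vec))
        scores[det] = matches / len(gold_vec)
    return scores

# Illustrative gold vectors: 1 = allowed, 0 = disallowed, ordered as NOUN_TYPES.
gold = {"the": (1, 1, 1), "a": (1, 0, 0), "much": (0, 0, 1)}
# Hypothetical model output that wrongly lets "a" combine with mass nouns.
predicted = {"the": (1, 1, 1), "a": (1, 0, 1), "much": (0, 0, 1)}

scores = per_determiner_accuracy(predicted, gold)
for det, score in scores.items():
    print(f"{det}: {score:.2f}")
```

Scoring this way never asks whether two determiners landed in the same class, only whether each determiner's allowed noun types match the gold standard.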
(8) L1 vs. L2 input results: The model learns there’s less noise in the native input case, and filters less; this leads to capturing more variability in the determiners. S&al2020 don’t seem happy about this, but is this so bad? If there’s true variability in native speaker grammars, then there’s variability.
In the discussion, S&al2020 say that the behavior they wanted was the same for both native and non-native input, since Simon learned the same as native ASL speakers. So that’s why they’re not okay with the native input results. But I’m trying to imagine how the noisy channel input model they designed could possibly give the same results when the input has different amounts of variability -- by nature, it would filter out less input when there seems to be more regularity in the input to begin with (i.e., the native input). I guess it was possible that just the right amount of the input would be filtered out in each case to lead to exactly the same classification results? And then that didn’t happen.