Monday, April 27, 2020

Some thoughts on Schneider et al. 2020

It’s nice to see this type of computational cognitive model: a proof of concept for an intuitive (though potentially vague) idea about how children regularize their input to yield more deterministic/categorical grammar knowledge than the input would seem to suggest on the surface. In particular, it’s intuitive to talk about children perceiving some of the input as signal and some as noise, but much more persuasive to see it work in a concrete implementation.

Specific thoughts:
(1) Intake vs. input filtering: I'm not sure I followed the distinction between filtering the child's intake vs. filtering the child's input. The basic pipeline is that the external input signal gets encoded using the child's current knowledge and processing abilities (perceptual intake), and then a subset of that is actually relevant for learning (acquisition intake). So, filtering the (acquisition?) intake would mean children look at the subset of the input perceived as relevant and assume some of that is noise. For filtering the input, is the idea that children assume some of the input itself is noise, so some of it is thrown out before it becomes perceptual intake? Or is it that the child assumes some of the perceptual intake is noise, and tosses that before it gets to the acquisition intake? And how would that differ for the end result of the acquisition intake?

Being a bit more concrete helps me think about this:
Filtering the input --
Let’s let the input be a set of 10 signal pieces and 2 noise pieces (10S, 2N).
Let’s say filtering occurs on this set, so the perceptual intake is now 10S.
Then maybe the acquisitional intake is a subset of those, so it’s 8S.

Filtering the intake --
Our input is again 10S, 2N.
(Accurate) perceptual intake takes in 10S, 2N.
Then acquisitional intake could be the subset 7S, 1N.

So okay, I think I get it -- filtering the input gets you a cleaner signal while filtering the intake gets you some subset (cleaner or not, but certainly more focused).
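
Here's that distinction as a tiny toy simulation (mine, not theirs -- the counts and the random subsetting are just placeholders for whatever the real encoding and relevance processes are):

import random

random.seed(0)
input_data = ["S"] * 10 + ["N"] * 2   # 10 signal pieces, 2 noise pieces

def filter_input(data):
    # Filter the input: noise is discarded before it ever becomes perceptual intake
    perceptual_intake = [x for x in data if x == "S"]         # 10S
    acquisition_intake = random.sample(perceptual_intake, 8)  # e.g., 8S
    return acquisition_intake

def filter_intake(data):
    # Filter the intake: everything is perceived, and noise has to be discounted later
    perceptual_intake = list(data)                            # 10S, 2N
    acquisition_intake = random.sample(perceptual_intake, 8)  # e.g., 7S, 1N
    return acquisition_intake

print(filter_input(input_data))   # all signal
print(filter_intake(input_data))  # may still contain noise

In the intake-filtering version, noise can survive into what the learner actually learns from, and it's the noisy-channel inference that has to discount it; in the input-filtering version, the noise never makes it in at all.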

(2) Using English L1 and L2 data in place of ASL: Clever stand-in! I was wondering what they would do for an ASL corpus. But this highlights how to focus on the aspects that matter for modeling. Here, getting the same kind of unpredictable variation in use is more important than getting ASL data specifically.

(3) Model explanations: I really appreciate the effort here to give the intuitions behind the model pieces. I wonder if it might have been more effective to have a plate diagram, and walk through the high-level explanation for each piece, and then the specifics with the model variables. As it was, I think I was able to follow what was going on in this high-level description because I’m familiar with this type of model already, but I don’t know if that would be true for people who aren’t as familiar. (For example, the bit about considering every partition is a high-level way of talking about Gibbs sampling, as they describe in section 4.2.)
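
For concreteness, here's roughly what "considering every partition" cashes out as procedurally: a Gibbs-style sweep that repeatedly resamples each determiner's class assignment given everything else. This is a generic sketch with made-up counts and with the per-class noun-type distributions held fixed (a real sampler would also resample or integrate those out), not S&al2020's actual parameterization:

import numpy as np

rng = np.random.default_rng(0)

K = 7                                  # number of determiner classes (given)
counts = rng.integers(0, 20, (12, 3))  # fake data: 12 determiners x (sg, pl, mass) counts
theta = rng.dirichlet(np.ones(3), K)   # placeholder per-class noun-type distributions
prior = np.full(K, 1.0 / K)            # the uniform 1/7 prior over classes
z = rng.integers(0, K, 12)             # current class assignments

for sweep in range(50):
    for d in range(12):
        log_like = counts[d] @ np.log(theta).T   # fit of determiner d to each class
        log_post = np.log(prior) + log_like
        p = np.exp(log_post - log_post.max())
        z[d] = rng.choice(K, p=p / p.sum())      # resample d's class assignment

print(z)   # one sampled partition of the 12 determiners into classes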

(4) Model priors: If the prior over determiner class is 1/7, then it sounds like the model already knows there are 7 classes of determiner. Similar to a comment raised about the reading last time, why not infer the number of determiner classes, rather than knowing there are 7 already? 
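
For what it's worth, the usual way to let a model like this infer the number of classes instead of fixing it at 7 is a nonparametric prior such as the Chinese Restaurant Process, where there's always some probability mass on a brand-new class. A minimal sketch (alpha is an arbitrary concentration value I picked; none of this is in S&al2020):

from collections import Counter

def crp_prior(assignments, alpha=1.0):
    # Prior probability of joining each existing class, plus starting a new one
    sizes = Counter(assignments)
    n = len(assignments)
    probs = {k: size / (n + alpha) for k, size in sizes.items()}
    probs["new class"] = alpha / (n + alpha)
    return probs

print(crp_prior([0, 0, 1, 2, 2, 2]))
# roughly {0: 0.29, 1: 0.14, 2: 0.43, 'new class': 0.14}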

(5) Corpus preprocessing: Interesting step of “downsampling” the counts from the corpora by taking the log. This effectively squishes probability differences down, I think. I wonder why they did this instead of just using the normalized frequencies. They say it was to compensate for the skewed distribution of frequent determiners like “the”...but I don’t think I understand why that’s a problem. What does it matter if you have a lot of “the”, as long as you have enough of the other determiners too? They have the minimum cutoff of 500 instances, after all.
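
Just to see the squishing concretely with made-up counts (not their actual numbers): a 100x difference in raw frequency shrinks to well under a 2x difference after taking the natural log.

import math

raw = {"the": 50000, "a": 20000, "each": 500}          # invented counts
logged = {det: math.log(c) for det, c in raw.items()}

print(raw)     # "the" is 100x more frequent than "each"
print(logged)  # roughly {'the': 10.8, 'a': 9.9, 'each': 6.2} -- only about 1.7x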

(6) Figure 1: It looks like the results from the non-native corpus with the noise filter recover the rates of sg, pl, and mass noun combination pretty well (compared against the gold standard). But the noise filter over the native corpus skews a bit towards determiners allowing more noun types than the gold standard does (e.g., more determiners allowing 3 noun types). Side note: I like this evaluation metric a little better than inferring fixed determiner classes, because individual determiner behavior (how many noun types it allows) can be counted more directly. We don’t need to worry about whether we have the right determiner classes or not.

(7) Evaluation metrics: Related to the previous thought, maybe a more direct evaluation metric would be to compare the allowed vs. disallowed noun-type vectors for each individual determiner. Then the class assignment becomes a means to that end, rather than being the evaluation metric itself. This may help deal with the issue of capturing the variability in the native input that shows up in simulation 2.
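
Something like this is what I have in mind, with invented allowed/disallowed vectors just to show the bookkeeping (1 = the determiner combines with that noun type, in sg/pl/mass order):

gold = {"the": (1, 1, 1), "a": (1, 0, 0), "many": (0, 1, 0)}       # invented gold vectors
inferred = {"the": (1, 1, 1), "a": (1, 0, 0), "many": (0, 1, 1)}   # invented model output

# Cell-wise agreement over (determiner, noun type) pairs, no classes involved
cells = [(g, i)
         for det in gold
         for g, i in zip(gold[det], inferred[det])]
accuracy = sum(g == i for g, i in cells) / len(cells)
print(accuracy)   # 8 of 9 cells match -> about 0.89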

(8) L1 vs. L2 input results:  The model learns there’s less noise in the native input case, and filters less; this leads to capturing more variability in the determiners. S&al2020 don’t seem happy about this, but is this so bad? If there’s true variability in native speaker grammars, then there’s variability. 

In the discussion, S&al2020 say that the behavior they wanted was the same for both native and non-native input, since Simon learned the same as native ASL speakers. So that’s why they’re not okay with the native input results. But I’m trying to imagine how the noisy channel input model they designed could possibly give the same results when the input has different amounts of variability -- by nature, it would filter out less input when there seems to be more regularity in the input to begin with (i.e., the native input). I guess it was possible that just the right amount of the input would be filtered out in each case to lead to exactly the same classification results? And then that didn’t happen.

Tuesday, April 14, 2020

Some thoughts on Perkins et al. 2020

General thoughts: I love this model as an example of incremental learning in action, where developing representations and developing processing abilities are taken seriously -- here, we can see how these developing components can yield pretty good learning of transitivity relations and an input filter, and then eventually canonical word order. I also deeply appreciate the careful caveats P&al2020 give in the general discussion for how to interpret their modeling results. This is so important, because it’s so easy to misinterpret modeling results (especially if you weren’t the one doing the modeling -- and sometimes, even if you *are* the one doing the modeling!).

Other thoughts (I had a lot!):

(1) A key point seems to be that the input representation matters -- definitely preaching to the choir, here! What’s true of cognitive modeling seems true for (language) learning, period: garbage in, garbage out. (Also, high-quality stuff in = high-quality stuff out.) Relatedly, I love the “quality over quantity” takeaway in the general discussion, when it comes to the data children use for learning. This seems exactly right to me, and is the heart of most “less is more” language learning proposals.

(2) A core aspect of the model is that the learner recognizes the possibility of misparsing some of the input. This doesn’t seem like unreasonable prior knowledge to have -- children are surely aware that they make mistakes in general, just by not being able to do/communicate the things they want. So, the “I-make-mistakes” overhypothesis could potentially transfer to this specific case of “I-make-mistakes-when-understanding-the-language-around-me”.

(3) It’s important to remember that this isn’t a model of simultaneously/jointly learning transitivity and word order (for the first part of the manuscript, I thought it was). Instead, it’s a joint learning model that will yield the rudimentary learning components (initial transitivity classes, some version of wh-dependencies that satisfy canonical word order) that a subsequent joint learning process could use. That is, it’s the precursor learning process that would allow children to derive useful learning components they’ll need in the future.  The things that are in fact jointly learned are rudimentary transitivity and how much of the input to trust (i.e., the basic word order filter).

(4) Finding that learning with a uniform prior works just as well:  This is really interesting to me because a uniform prior might explain how very young children can accomplish this inference. That is, they can get a pretty good result even with a uniform prior -- it’s wrong, but it doesn’t matter. Caveat: The model doesn’t differentiate transitive vs. intransitive if its prior is very biased towards the alternating class. But do we care, unless we think children would be highly biased a priori towards the alternating class?

Another simple (empirically-grounded) option is to seed the priors based on the current verbs the child knows, which is a (small) subset of the language’s transitive, intransitive, and alternating verbs. (P&al2020 mention this possibility as part of an incrementally-updating modeled learner.) As long as most of the verbs in that subset aren’t alternating (and so don’t produce that highly-skewed-towards-alternating prior), it looks like the English child will end up making good inferences about subsequent verbs.
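
A quick sketch of that seeding idea, with a toy vocabulary and add-one smoothing that are entirely my choices (P&al2020 just mention the general possibility):

from collections import Counter

known_verbs = {            # a made-up early vocabulary with class labels
    "hit": "transitive", "see": "transitive",
    "sleep": "intransitive", "fall": "intransitive", "go": "intransitive",
    "eat": "alternating", "break": "alternating",
}
classes = ["transitive", "intransitive", "alternating"]
counts = Counter(known_verbs.values())

smoothing = 1.0
prior = {c: (counts[c] + smoothing) / (len(known_verbs) + smoothing * len(classes))
         for c in classes}
print(prior)   # roughly {'transitive': 0.3, 'intransitive': 0.4, 'alternating': 0.3}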

(5) I feel for the authors in having the caveat about how ideal Bayesian inference is a proof of concept only. It’s true! But it’s a necessary first step (and highly recommended before trying more child-realistic inference processes -- which may in fact be “broken” forms of the idealized Bayesian computation that Gibbs sampling accomplishes here). Moreover, pretty much all our cognitive models are proofs of concept (i.e., existence proofs that something is possible). That is, we always have to idealize something to make any progress. So, the authors here do the responsible thing and remind us about where they’re idealizing so that we know how to interpret the results.

(6) The second error parameter (delta) about the rate of object drop -- I had some trouble interpreting it. I guess maybe it’s a version of “Did I miss $thing (which only affects that argument) or did I swap $thing with something else (which affects that argument and another argument)?” But then in the text explaining Figure 1, it seems like delta is the global rate of erroneously generating a direct object when it shouldn’t be there. Is this the same as “drop the direct object” vs. “confuse it for another argument”? It doesn’t quite seem like it. This is “I misparsed but accidentally made a direct object anyway when I shouldn’t have,” not necessarily “I confused the direct object with another argument”. Though maybe it could be “I just dropped the direct object completely”?
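
Here's the reading of delta I keep landing on, written out as a toy generative step. To be clear, this is my own gloss (with arbitrary parameter values), not necessarily P&al2020's exact generative story: epsilon is how often a clause gets misparsed, and delta is how often a misparsed clause ends up looking like it has a direct object anyway.

import random

random.seed(1)

def observe_frame(verb_takes_object, epsilon=0.1, delta=0.3):
    # Returns True if the child's parse of this clause shows a direct object
    if random.random() < epsilon:
        # Misparse: the observed object slot is noise, filled with probability delta,
        # regardless of whether the verb really had an object here
        return random.random() < delta
    # Faithful parse: reflects the verb's actual use in this clause
    return verb_takes_object

samples = [observe_frame(verb_takes_object=False) for _ in range(10000)]
print(sum(samples) / len(samples))   # close to epsilon * delta = 0.03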

(7) As the authors note themselves, the model’s results look like a basic fuzzy thresholding decision over direct-object rates (intransitive: 0-15%; alternating: 15% to around 80%; transitive: around 80-100%). Nothing wrong with this at all, but maybe the key is to have the child’s representation of the input take into account some of the nuances mentioned in the results discussion (like wait used with temporal adjuncts) that would cause these thresholds to be more accurate. Then, the trick to learning isn’t about fancy inference (though I do love me some Bayesian inference), but rather the input to that inference.
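
Written out literally, that decision rule is something like the sketch below (the cutoffs are the approximate ones from the results, not exact values):

def classify(direct_object_rate):
    # Fraction of a verb's clauses (post-filter) that contain a direct object
    if direct_object_rate <= 0.15:
        return "intransitive"
    if direct_object_rate <= 0.80:
        return "alternating"
    return "transitive"

print(classify(0.05), classify(0.50), classify(0.90))
# intransitive alternating transitive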

(8) My confusion about the “true” error parameter values (epsilon and delta): What do error parameters mean for the true corpus? That a non-canonical word order occurred? But weren’t all non-canonical instances removed in the curated input set?

(9) Figure 5:  If I’m interpreting the transitive graph correctly, it looks like super-high delta and epsilon values yield the best accuracy. In particular, if epsilon (i.e., how often to ignore the input) is near 1, we get high accuracy (near 1). What does that mean? The prior is really good for this class of verbs? This is the opposite of what we see with the alternating verbs, where low epsilon yields the best accuracy (so we shouldn’t ignore the input).

Relatedly though, it’s a good point that the three verb classes have different epsilon balances that yield high accuracy. And I appreciated the explanation that a high epsilon means lowering the threshold for membership into the class (e.g., transitive verbs).

(10) The no-filter baseline (with epsilon = 0): Note that this (dumb) strategy has better performance across all verbs (.70) simply because it gets all the alternating verbs right, and those comprise the bulk of the verbs. But this is definitely an instance of perfect recall (of alternating) at the cost of precision (transitive and intransitive).
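
The arithmetic behind that, with toy proportions I picked so the numbers come out to .70 (not the paper's actual verb counts):

verbs = {"alternating": 35, "transitive": 10, "intransitive": 5}   # invented counts
total = sum(verbs.values())

# Label every verb "alternating"
accuracy = verbs["alternating"] / total   # 0.70, since alternating verbs dominate
recall_alternating = 1.0                  # every alternating verb is labeled correctly
recall_transitive = 0.0                   # but no transitive verb ever is
recall_intransitive = 0.0                 # and no intransitive verb either
print(accuracy)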

(11) It’s a nice point that the model performs like children seem to in the presence of noisy input (where the noisy input doesn’t obviously have a predictable source of noise) --  i.e., children overregularize, and so does the model. And the way the model learns this is by having global parameters, so information from any individual verb informs those global parameters, which in turn affects the model’s decisions about other individual verbs. 

(12) I really like the idea of having different noise parameters depending on the sources of noise the learner thinks there are. This might require us to have a more articulated idea of the grammatical process that generates data, so that noise could come from different pieces of that process. Then, voila -- a noise parameter for each piece.
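
Sketching what that might look like, with noise sources and rates that are entirely invented for illustration:

noise_rates = {                          # hypothetical sources and rates
    "misparsed_wh_dependency": 0.04,     # a moved object mistaken for a missing one
    "missed_object_pronoun": 0.02,       # the object was there but went unheard
    "speaker_disfluency": 0.01,          # the input itself was ill-formed
}

def p_any_noise(noise_rates):
    # Probability that at least one source corrupted a given clause,
    # assuming the sources apply independently
    p_clean = 1.0
    for rate in noise_rates.values():
        p_clean *= 1.0 - rate
    return 1.0 - p_clean

print(p_any_noise(noise_rates))   # about 0.07

Each rate could then be inferred separately, with the relevant piece of the grammatical process telling you which clauses each source could even apply to.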

(13) It’s also a cool point about the importance of variation -- the variation provides anchor points (here: verbs the modeled child thinks are definitely transitive or intransitive). If there were no variation, the modeled child wouldn’t have these anchor points, and so would be hindered in deciding how much noise there might be. At a more general level, this idea about the importance of variation seems like an example where something “harder” about the learning problem (here: variation is present in the verbs) actually makes learning easier.

(14) Main upshot: The modeled child can infer an appropriate filter (= ”I misparse things sometimes” + “I add/delete a direct object sometimes”) at the same time as inferring classes of verbs with certain argument structure (transitive, intransitive, and alternating). Once these classes are established, learners can use the classes to generalize properties of (new) verbs in those classes, such as transitive verbs having subjects and objects, which correspond to agents and patients in English.

Relatedly, I’d really love to think more about this with respect to how children learn complex linking theories like UTAH and rUTAH, which involve a child knowing collections of links between verb arguments (like subject and object) and event participants (like agent and patient). That is, let’s assume the learning process described in this paper happens and children have some seed classes of transitive, intransitive, and alternating + the knowledge of the argument structure associated with each class (must have direct object [transitive], must not have direct object [intransitive], may have direct object [alternating]). I think children would still have to learn the links between arguments and event participants, right? That is, they’d still need to learn that the subject of a transitive verb is often an agent in the event. But they’d at least be able to recognize that certain verbs have these arguments, and so be able to handle input with movement, like wh-questions for transitive verbs.