Tuesday, April 14, 2020

Some thoughts on Perkins et al. 2020

General thoughts: I love this model as an example of incremental learning in action, where developing representations and developing processing abilities are taken seriously -- here, we can see how these developing components can yield pretty good learning of transitivity relations and an input filter, and then eventually canonical word order. I also deeply appreciate the careful caveats P&al2020 give in the general discussion for how to interpret their modeling results. This is so important, because it’s so easy to misinterpret modeling results (especially if you weren’t the one doing the modeling -- and sometimes, even if you *are* the one doing the modeling!).

Other thoughts (I had a lot!):

(1) A key point seems to be that the input representation matters -- definitely preaching to the choir here! What’s true of cognitive modeling seems true for (language) learning period: garbage in, garbage out. (Also, high quality stuff in = high quality stuff out.) Relatedly, I love the “quality over quantity” takeaway in the general discussion, when it comes to the data children use for learning. This seems exactly right to me, and is at the heart of most “less is more” language learning proposals.

(2) A core aspect of the model is that the learner recognizes the possibility of misparsing some of the input. This doesn’t seem like unreasonable prior knowledge to have -- children are surely aware that they make mistakes in general, just by not being able to do/communicate the things they want. So, the “I-make-mistakes” overhypothesis could potentially transfer to this specific case of “I-make-mistakes-when-understanding-the-language-around-me”.

(3) It’s important to remember that this isn’t a model of simultaneously/jointly learning transitivity and word order (for the first part of the manuscript, I thought it was). Instead, it’s a joint learning model that will yield the rudimentary learning components (initial transitivity classes, some version of wh-dependencies that satisfy canonical word order) that a subsequent joint learning process could use. That is, it’s the precursor learning process that would allow children to derive useful learning components they’ll need in the future.  The things that are in fact jointly learned are rudimentary transitivity and how much of the input to trust (i.e., the basic word order filter).

(4) Finding that learning with a uniform prior works just as well:  This is really interesting to me because a uniform prior might explain how very young children can accomplish this inference. That is, they can get a pretty good result even with a uniform prior -- it’s wrong, but it doesn’t matter. Caveat: The model doesn’t differentiate transitive vs. intransitive if its prior is very biased towards the alternating class. But do we care, unless we think children would be highly biased a priori towards the alternating class?

Another simple (empirically-grounded) option is to seed the priors based on the current verbs the child knows, which is a (small) subset of the language’s transitive, intransitive, and alternating verbs. (P&al2020 mention this possibility as part of an incrementally-updating modeled learner.) As long as most of the verbs in that subset aren’t alternating (and so don’t cause that highly-skewed-towards-alternating prior), it looks like the English child will end up making good inferences about subsequent verbs.
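
Just to make that concrete for myself, here’s a toy sketch of what “seed the prior from the verbs you already know” could look like. The seed lexicon, the base pseudocount, and the resulting class probabilities are all my own made-up illustration, not anything from P&al2020:

from collections import Counter

CLASSES = ("transitive", "intransitive", "alternating")

def seeded_prior(known_verbs, base_count=1.0):
    """Dirichlet-style pseudocounts over verb classes: a uniform base (the
    'wrong but harmless' prior) plus one count per already-learned verb."""
    counts = Counter({c: base_count for c in CLASSES})
    counts.update(known_verbs.values())
    total = sum(counts.values())
    return {c: counts[c] / total for c in CLASSES}  # expected class probabilities

# Hypothetical early verb vocabulary: mostly non-alternating verbs.
known = {"hit": "transitive", "hug": "transitive",
         "sleep": "intransitive", "eat": "alternating"}
print(seeded_prior(known))
# -> roughly {'transitive': 0.43, 'intransitive': 0.29, 'alternating': 0.29}

As long as that seed vocabulary isn’t mostly alternating, the resulting prior stays fairly flat, which fits nicely with the “uniform prior works fine” result above.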

(5) I feel for the authors in having the caveat about how ideal Bayesian inference is a proof of concept only. It’s true! But it’s a necessary first step (and highly recommended before trying more child-realistic inference processes -- which may in fact be “broken” forms of the idealized Bayesian computation that Gibbs sampling accomplishes here). Moreover, pretty much all our cognitive models are proofs of concept (i.e., existence proofs that something is possible). That is, we always have to idealize something to make any progress. So, the authors here do the responsible thing and remind us about where they’re idealizing so that we know how to interpret the results.
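
To make the “idealized Bayesian computation” piece concrete for myself, here’s a miniature Gibbs-style sketch: alternate between resampling each verb’s class given the current noise parameters, and resampling the noise parameters given the classes. To be clear, this is not P&al2020’s actual model -- the class-specific direct-object rates, the coarse grid over (epsilon, delta), and the toy counts are all my own simplifying assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: for each verb, (frames with a direct object, total frames).
counts = {"hit": (48, 50), "sleep": (2, 50), "eat": (30, 50)}
CLASS_P_DO = {"transitive": 0.95, "intransitive": 0.05, "alternating": 0.6}  # assumed rates
classes = list(CLASS_P_DO)
grid = np.linspace(0.05, 0.95, 10)  # coarse grid of values for epsilon and delta

def loglik(k, n, p_do, eps, delta):
    """Log-likelihood of k direct-object frames out of n: a frame is faithful
    with prob (1 - eps), or noise-generated (with a DO at rate delta) otherwise."""
    p = (1 - eps) * p_do + eps * delta
    return k * np.log(p) + (n - k) * np.log(1 - p)

# Initialize the latent variables.
z = {v: classes[rng.integers(len(classes))] for v in counts}
eps, delta = 0.5, 0.5

for sweep in range(200):
    # (1) Resample each verb's class given the current epsilon and delta.
    for v, (k, n) in counts.items():
        logp = np.array([loglik(k, n, CLASS_P_DO[c], eps, delta) for c in classes])
        probs = np.exp(logp - logp.max())
        probs /= probs.sum()
        z[v] = classes[rng.choice(len(classes), p=probs)]
    # (2) Resample epsilon and delta (on the grid) given the class assignments.
    logp = np.array([[sum(loglik(k, n, CLASS_P_DO[z[v]], e, d)
                          for v, (k, n) in counts.items())
                      for d in grid] for e in grid])
    probs = np.exp(logp - logp.max())
    probs /= probs.sum()
    i = rng.choice(probs.size, p=probs.ravel())
    eps, delta = grid[i // len(grid)], grid[i % len(grid)]

# Typically recovers hit = transitive, sleep = intransitive, eat = alternating,
# with epsilon and delta near the low end of the grid for this clean toy data.
print(z, round(float(eps), 2), round(float(delta), 2))

Even in this toy version, step (2) is where the pooling happens: the noise parameters are global, so every verb’s frames nudge them -- which is also the mechanism behind point (11) below.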

(6) The second error parameter (delta), the rate of object drop -- I had some trouble interpreting it. At first I read it as a version of “Did I miss $thing (which only affects that argument), or did I swap $thing with something else (which affects that argument and another argument)?” But in the text explaining Figure 1, delta seems to be the global rate of erroneously generating a direct object when it shouldn’t be there. Is that the same as “drop the direct object” vs. “confuse it with another argument”? It doesn’t quite seem like it. The Figure 1 reading is “I misparsed, but accidentally made a direct object anyway when I shouldn’t have”, not necessarily “I confused the direct object with another argument”. Though maybe it could also cover “I just dropped the direct object completely”?
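
To pin down my Figure 1 reading, here’s a toy generative sketch where delta is the global rate at which a misparsed frame ends up with a perceived direct object, regardless of the verb. This is just my own interpretation made runnable, not the paper’s actual definition, and the per-class direct-object rates are invented:

import random

P_DO_TRUE = {"transitive": 1.0, "intransitive": 0.0, "alternating": 0.6}  # assumed

def observed_frame(verb_class, epsilon, delta, rng=random):
    """Return True if the child perceives a direct object in this frame."""
    if rng.random() < epsilon:                   # misparse: the verb's true behavior is irrelevant
        return rng.random() < delta              # ...a direct object shows up at the global rate delta
    return rng.random() < P_DO_TRUE[verb_class]  # faithful parse: the verb's true behavior

# Under this reading, even an intransitive verb shows "direct objects"
# about epsilon * delta of the time.
random.seed(1)
frames = [observed_frame("intransitive", epsilon=0.2, delta=0.5) for _ in range(10000)]
print(sum(frames) / len(frames))  # ~0.10 = 0.2 * 0.5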

(7) As the authors note themselves, the model’s results look like a basic fuzzy thresholding decision over the rate of direct objects: roughly 0-15% = intransitive, 15% to around 80% = alternating, around 80-100% = transitive. Nothing wrong with this at all, but maybe the key is to have the child’s representation of the input take into account some of the nuances mentioned in the results discussion (like *wait* used with temporal adjuncts) that would cause these thresholds to be more accurate. Then, the trick to learning isn’t about fancy inference (though I do love me some Bayesian inference), but rather the input to that inference.
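
Here’s what that fuzzy thresholding reading boils down to; the cutoffs (15% and around 80%) are my paraphrase of the results above, so treat them as illustrative rather than the model’s actual decision rule:

def classify_by_do_rate(n_do_frames, n_frames, low=0.15, high=0.80):
    """Classify a verb by the proportion of its frames with a direct object."""
    rate = n_do_frames / n_frames
    if rate <= low:
        return "intransitive"
    if rate >= high:
        return "transitive"
    return "alternating"

print(classify_by_do_rate(2, 50))   # 4% direct objects  -> intransitive
print(classify_by_do_rate(30, 50))  # 60% direct objects -> alternating
print(classify_by_do_rate(47, 50))  # 94% direct objects -> transitive

The suggestion in this point then amounts to cleaning up the counts before any classification happens (e.g., not letting something like *wait a minute* count as a direct-object frame), rather than making the decision rule fancier.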

(8) My confusion about the “true” error parameter values (epsilon and delta): What do error parameters mean for the true corpus? That a non-canonical word order occurred? But weren’t all non-canonical instances removed in the curated input set?

(9) Figure 5:  If I’m interpreting the transitive graph correctly, it looks like super-high delta and epsilon values yield the best accuracy. In particular, if epsilon (i.e., how often to ignore the input) is near 1, we get high accuracy (near 1). What does that mean? The prior is really good for this class of verbs? This is the opposite of what we see with the alternating verbs, where low epsilon yields the best accuracy (so we shouldn’t ignore the input).

Relatedly though, it’s a good point that the three verb classes have different epsilon balances that yield high accuracy. And I appreciated the explanation that a high epsilon means lowering the threshold for membership into the class (e.g., transitive verbs).

(10) The no-filter baseline (with epsilon = 0): Note that this (dumb) strategy has better performance across all verbs (.70) simply because it gets all the alternating verbs right, and those comprise the bulk of the verbs. But this is definitely an instance of perfect recall (of alternating) at the cost of precision: every transitive and intransitive verb gets swept into the alternating class.
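
A toy worked version of that, where the 70/20/10 split across classes is something I invented just so overall accuracy lands at the reported .70:

gold = ["alternating"] * 70 + ["transitive"] * 20 + ["intransitive"] * 10
pred = ["alternating"] * len(gold)  # no filter, no distinctions: call everything alternating

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
recall_alt = sum(g == p == "alternating" for g, p in zip(gold, pred)) / gold.count("alternating")
precision_alt = sum(g == p == "alternating" for g, p in zip(gold, pred)) / pred.count("alternating")
print(accuracy, recall_alt, precision_alt)  # 0.7, 1.0, 0.7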

(11) It’s a nice point that the model performs like children seem to in the presence of noisy input (where the noisy input doesn’t obviously have a predictable source of noise) -- i.e., children overregularize, and so does the model. And the model behaves this way because it has global noise parameters: information from any individual verb informs those global parameters, which in turn affect the model’s decisions about other individual verbs.

(12) I really like the idea of having different noise parameters depending on the sources of noise the learner thinks there are. This might require us to have a more articulated idea of the grammatical process that generates data, so that noise could come from different pieces of that process. Then, voila -- a noise parameter for each piece.
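
Something like this, maybe -- where the noise sources and their rates below are pure speculation on my part, just to show the shape of having one parameter per hypothesized piece of the generative process:

import random

NOISE_SOURCES = {  # hypothesized corruption -> probability it hits a given frame
    "misparsed_wh_dependency": 0.05,
    "dropped_direct_object": 0.03,
    "missegmented_clause": 0.02,
}

def frame_survives_filter(rng=random):
    """A frame counts as trustworthy only if no hypothesized noise source fires on it."""
    return all(rng.random() >= rate for rate in NOISE_SOURCES.values())

random.seed(0)
kept = sum(frame_survives_filter() for _ in range(10000)) / 10000
print(kept)  # ~0.90, i.e., roughly (1 - .05) * (1 - .03) * (1 - .02)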

(13) It’s also a cool point about the importance of variation -- the variation provides anchor points (here: verbs the modeled child thinks are definitely transitive or intransitive). If there were no variation, the modeled child wouldn’t have these anchor points, and so would be hindered in deciding how much noise there might be. At a more general level, this idea about the importance of variation seems like an example where something “harder” about the learning problem (here: variation is present in the verbs) actually makes learning easier.

(14)  Main upshot: The modeled child can infer an appropriate filter (=”I mis-parse things sometimes” + “I add/delete a direct object sometimes”) at the same time as inferring classes of verbs with certain argument structure (transitive, intransitive, and alternating). Once these classes are established, then learners can use the classes to generalize properties of (new) verbs in those classes, such as transitive verbs having subjects and objects, which correspond to agents and patients in English. 

Relatedly, I’d really love to think more about this with respect to how children learn complex linking theories like UTAH and rUTAH, which involve a child knowing collections of links between verb arguments (like subject and object) and event participants (like agent and patient). That is, let’s assume the learning process described in this paper happens and children have some seed classes of transitive, intransitive, and alternating + the knowledge of the argument structure associated with each class (must have direct object [transitive], must not have direct object [intransitive], may have direct object [alternating]). I think children would still have to learn the links between arguments and event participants, right? That is, they’d still need to learn that the subject of a transitive verb is often an agent in the event. But they’d at least be able to recognize that certain verbs have these arguments, and so be able to handle input with movement, like wh-questions for transitive verbs.
