Monday, December 4, 2017

Some thoughts on Perkins et al. 2017

I really enjoy seeing Bayesian models like this because it’s so clear exactly what’s built in and how. In this particular model, a couple of things struck me: 

(1) This learner needs to have prior (innate? definitely linguistic) knowledge that there are three classes of verbs (transitive, intransitive, and alternating) with different properties. That actually goes a bit beyond just saying a verb has some probability of taking a direct object, which I think is pretty uncontroversial.

(2) The learner only has to know that its parsing is fallible and so causes errors; notably, the learner doesn't need to know the error rate(s) beforehand. So, as P&al2017 note in their discussion, this means less specific knowledge about the filter has to be built in a priori.
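To make concrete what's built in versus what's inferred, here's a minimal sketch of the generative story as I understand it (this is my own illustration in Python, not P&al2017's actual implementation, and all the probabilities are made-up numbers): each verb belongs to one of three classes, each class determines how often a true direct object appears, and the error parameters epsilon and delta corrupt what the learner observes. The learner is assumed to know this structure, but epsilon and delta themselves would be inferred from the data.

```python
import random

random.seed(0)

# Hypothetical class-level probabilities of a true direct object:
# transitives almost always take one, intransitives almost never,
# alternating verbs sometimes. Illustrative numbers only.
P_OBJECT = {"transitive": 0.95, "intransitive": 0.05, "alternating": 0.5}

def observe_frame(verb_class, epsilon, delta):
    """Generate one observed frame (True = a direct object is perceived).

    epsilon: probability the parse is erroneous (not verb-specific).
    delta: probability an erroneous parse yields a spurious direct object.
    The learner knows this *structure* but not the values of
    epsilon and delta, which it must infer.
    """
    if random.random() < epsilon:
        # Parsing error: the observed frame is noise, not the verb's truth.
        return random.random() < delta
    # Correct parse: the frame reflects the verb's true behavior.
    return random.random() < P_OBJECT[verb_class]

# Simulate input for an intransitive verb under a noisy parser.
frames = [observe_frame("intransitive", epsilon=0.1, delta=0.3)
          for _ in range(1000)]
# Some direct objects appear despite the verb being intransitive
# (expected rate here: 0.9 * 0.05 + 0.1 * 0.3 = 0.075).
print(sum(frames) / len(frames))
```

The point of the sketch is just that the observed direct-object rate is a mixture of the verb's true behavior and parser noise, and the learner's job is to unmix them without being told epsilon or delta up front.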

Other thoughts:
(1) Thinking some about the initial stage of learning P&al2017 describe in section 2: this learner isn't supposed to yet know that a wh-word can connect to the object of the verb. It's true that acquiring that specific knowledge is hard without already knowing which verbs are transitive (as P&al2017 point out). But does the learner know anything about wh-words looking for connections to things later in the utterance? For example, maybe the learner encounters other wh-words that are clearly connected to the subject or to the object of a preposition: "Who ate a sandwich?", "Who did Amy throw a frisbee to?". In those cases, it's not a question of verb subcategorization: the wh-word is connecting to (standing in for) something later on in the utterance.

If the learner does know that wh-words are searching for something later in the utterance to connect to, thanks to experience with non-object wh-words, then maybe a wh-word that connects to the object of a verb isn't so mysterious (e.g., "What did John eat?"). That is, because the child knows the wh-word connects to something else and there's already a subject present, that leaves the object. Then non-basic wh-questions actually can be parsed correctly and don't have to be filtered out. In fact, they're signals of a verb's transitivity.

Maybe P&al2017’s idea is that this wh-awareness is a later stage of development. But I do wonder how early this more basic wh-words-indicate-a-connection knowledge is available.

(2) Thinking about the second part of the filter, involving delta (the chance of getting a spurious direct object due to a parsing error): I would have thought this depended on which verb it was. Maybe it helps to think of a specific parsing error that would yield a spurious direct object. From section 5.1, we get this concrete example: "wait a minute", with "a minute" parsed as a direct object. It does seem like that kind of error should depend on whether the verb is likely to have a direct object there to begin with, rather than being one general direct-object-hallucination rate. I could imagine that spurious direct objects are more likely to occur for intransitive verbs, for instance.

I do get that parsing error propensity (epsilon) doesn't depend on the verb, though.
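To see why a verb-general delta might matter, here's a back-of-the-envelope sketch (again my own illustration with hypothetical numbers, not the paper's actual joint inference, which estimates everything at once). The observed direct-object rate is a mixture of the verb's true rate and parser noise. If a verb like "wait" attracts spurious objects at a higher rate than the model's single delta assumes, then inverting the mixture with the global delta inflates the recovered transitivity of that verb.

```python
def observed_rate(p_true, epsilon, delta):
    """Expected fraction of frames showing a direct object:
    correct parses reflect p_true; erroneous parses show a
    spurious object with probability delta."""
    return (1 - epsilon) * p_true + epsilon * delta

def recovered_p_true(p_obs, epsilon, delta):
    """Invert the mixture to recover the verb's true object rate,
    assuming the filter's epsilon and delta are accurate."""
    return (p_obs - epsilon * delta) / (1 - epsilon)

epsilon = 0.1
global_delta = 0.3   # the model's single, verb-general delta
wait_delta = 0.8     # hypothetical: "wait" attracts spurious objects
                     # ("wait a minute") far more often than delta says

# True direct-object rate for an intransitive verb like "wait" is ~0,
# but its input is generated with the higher verb-specific error rate.
p_obs = observed_rate(0.0, epsilon, wait_delta)            # 0.08
estimate = recovered_p_true(p_obs, epsilon, global_delta)  # inflated above 0
print(round(p_obs, 3), round(estimate, 3))
```

So under these made-up numbers, a verb that is truly never transitive comes out with a recovered object rate of about 0.056 instead of 0, just because its spurious-object rate exceeds the global delta. That's the worry about a single delta in miniature.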

(3) Thinking about the model’s target state: P&al2017 base this on adult classes from Levin (1993), but I wonder if it might be fairer to adjust that based on the actual child-directed speech usage (e.g., what’s in Table 2). For example, if “jump” was only ever used intransitively in this input sample, is it a fair target state to say it should be alternating? 

I guess this comes down to the general problem of defining the target state for models of early language learning. Here, what you’d ideally like is an output set of verb classes that corresponds to those of a very young child (say, a year old). That, of course, is hard to get. Alternatively, maybe what you want to have is some sort of downstream evaluation where you see if a model using that inferred knowledge representation can perform the way young children are attested to in some other task.

For example, one of the behaviors of this model, as noted in section 5.1, is that it assigns lots of alternating verbs to be either transitive or intransitive. It would be great to test this behaviorally with kids of the appropriate age to see if they also have these same mis-assignments.


(4) Related to the above about the overregularization tendencies: I love the idea that P&al2017 suggest in the discussion about this style of assumption (i.e., "the parser will make errors, but I don't know how often"). They note that it could be useful for modeling cases of child overregularization. We certainly have a ton of data where children seem more deterministic than adults in the presence of noisy data. It'd be great to try to capture some of those known behavioral differences with a model like this.