Computational Models of Language (at UC Irvine): Some thoughts on Yang 2015

Just from a purely organizational standpoint, I really appreciate how explicitly the goals of this paper are laid out (basically, (i) here’s why the other strategy won’t work, and (ii) here’s why this new one does). Also, because of the clarity of the presentation, I’ll be interested to read Goldberg & Boyd’s response for next time. Additionally, I greatly enjoyed reading about the application of what I’ve been calling “indirect positive evidence” (Pearl & Mis in press), that is, things present in the input that can be leveraged indirectly to tell you about something else you’re currently trying to learn (here: leveraging distributional cues for locative particles and PPs to learn about a-adjectives). I really do think this is the way to deal with a variety of acquisition problems (and as I’ve mentioned before, it’s the same intuition that underlies both linguistic parameters and Bayesian overhypotheses: Pearl & Lidz 2013). In my opinion, the more we see explicit examples of how indirect positive evidence can work for various language acquisition problems, the better.

Some more specific thoughts:

(1) I found it quite helpful to have the different cues to a-adjectives listed out, in particular that the phonological cue of beginning with a schwa isn’t 100% reliable, while the morphological cue of being split into aspectual “a” (= something like presently occurring?) + root is nearly 100% reliable. It reminds me of the Gagliardi et al. (2012) work on children’s differing sensitivity to available cues when categorizing nouns in Tsez. In particular, Gagliardi et al. found that the model had to be more sensitive to phonological cues than semantic cues in order to match children’s behavior. This possibly has to do with the ability to reliably observe phonological cues as compared to semantic cues. I suspect the fairly diagnostic morphological cue might also be more observable, since it involves recognition of a free morpheme within the a-adjective (e.g., wake in awake).

(2) Related point: the actual trajectory of children’s development with a-adjectives. This is something that seems really relevant for determining which learning strategies children are using (as Yang himself points out, when he notes that all the experiments from Boyd & Goldberg are with adults). Do children make errors and use infrequent non-a-adjectives only predicatively (i.e., they don’t think they can use them attributively)? And on the flip side, do they use some a-adjectives attributively? Knowing about the errors children make (or lack thereof) can help us decide if they’re really learning on a lexical item by lexical item basis, or instead recognizing certain classes of adjectives and therefore able to make generalizations from one class instance to another (or perhaps more likely, at what age they recognize the classes of adjectives).

Yang quite admirably does a corpus search of naturalistic child productions, and the results are consistent with children knowing not to use a-adjectives attributively. But that’s not quite the same as behavioral evidence where children definitively show they disallow (or strongly disprefer) the attributive usage.

(3) Indirect negative evidence: One of Yang’s concerns is that this kind of evidence “requires comparing the extensions of the competing hypotheses”. I get the general gist of this, but I think we run into the same problem with all the language hypothesis spaces we set up, where one language’s parameter is a subset of another’s. That is, classical approaches like the Subset Principle run into the exact same problem. This is something we always have to deal with, and I think it depends very much on the hypothesis spaces children entertain.

Moreover, on the flip side, how much of a problem is it really? For the concrete example we’re given about the language that includes “the asleep cat” vs. the language that doesn’t, the extensional difference is one utterance (or one category of utterances, if we group them all together under a-adjectives). How computationally hard is this to calculate? Importantly, we really just need to know that the difference is one construction; the rest of the language’s extension doesn’t matter. So it seems like there should be a way to form a hypothesis space exactly like the one described above (P = “the asleep cat” is allowed vs. not-P = “the asleep cat” is not allowed)?

Also, related to the point about how Boyd & Goldberg’s strategy works: does it even matter what other constructions do appear with those adjectives (i.e., the cat is asleep)? Isn’t it enough that “the asleep cat” doesn’t? I guess the point is that you want to have appropriate abstract classes like the ones described in section 3.1, i.e., predicative usage = “the cat is asleep”, “the cat is nice”; attributive = *“the asleep cat”, “the nice cat”. This makes the P hypothesis more like “asleep can be used both predicatively and attributively” and the not-P hypothesis more like “asleep can be used only predicatively”. But okay, let’s assume children have enough syntactic knowledge to manage this. Then we go back to the point about how hard it is in practice to deal with hypothesis space extensions. Especially once we add this kind of abstraction in, it doesn’t seem too hard at all, unless I’m missing something (which is always possible).
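To make the P vs. not-P comparison concrete, here is a minimal sketch of how cheap it could be. It makes assumptions that go beyond anything in the paper: the function name, the equal priors, and especially `q_attributive` (a made-up probability that any one use of the adjective would be attributive if attributive use were allowed) are all hypothetical, chosen just for illustration.

```python
def posterior_not_p(n_predicative, q_attributive=0.5, prior_not_p=0.5):
    """Posterior for not-P ("asleep" is predicative-only) after observing
    n_predicative predicative uses and no attributive uses.

    q_attributive is a hypothetical parameter: the chance that a given use
    would be attributive if hypothesis P (both usages allowed) were true.
    """
    # Under P, every observed predicative use has probability (1 - q).
    like_p = (1.0 - q_attributive) ** n_predicative
    # Under not-P, predicative use is the only option, so probability 1.
    like_not_p = 1.0
    numerator = prior_not_p * like_not_p
    return numerator / (numerator + (1.0 - prior_not_p) * like_p)


# The only part of the extension that matters is the one construction
# that distinguishes the two hypotheses.
for n in (0, 1, 10):
    print(n, posterior_not_p(n))
```

Under these assumptions, nothing about the rest of the language’s extension ever enters the calculation, which is the point made above.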

(4) I personally have a great love for the Tolerance Principle, and I enjoyed seeing its usage here. But, as always, it gets me thinking about the relationship between the Tolerance Principle and Bayesian inference, especially when we have nice hypothesis spaces laid out like we do here. So, here’s my thinking at the moment:

For the Tolerance Principle, we have a setup like this:

Hypotheses:

H1 = the generalization applies to all N items, even though e exceptions exist.

H2 = there is no generalization, and all N items do their own thing.

Data:

O = items the pattern/rule is observed to apply to

e = exceptional items the pattern/rule should apply to but doesn’t

N - O - e = unobserved items (if any). We can simplify this and just assume all items have been observed to either follow the pattern (and be in O) or not (and be in e), so N - O - e = 0.

Turning over to Bayesian thinking, let’s assume the priors for H1 and H2 are equal. So, all the work is really done in the likelihood, i.e., P(Hx | data) is proportional to P(Hx) [prior] * P(data | Hx) [likelihood].

Okay, so how do we calculate P(data | H1) vs. P(data | H2)? The data here is O pattern-following items and e exceptions, where N = O + e.

To calculate both likelihoods, we need to know the probability of generating those O pattern-following items and the probability of generating those e exceptions under both H1 and H2. I think this kind of question is where we get into the derivation of the Tolerance Principle, as described by Yang (2005). In particular, there’s an idea that if you have a rule (as in H1), it’s cheaper to store and access the right forms when there are enough items that follow the rule.

More specifically, it’s some kind of constant cost for those O items (rule application), though the constant cost involves some work because you actually have to do the computation of the rule/pattern over the item. For the e exceptions, there’s some cost of accessing the stored form individually, based on the frequency of the stored items. Importantly, if you have H1 with a rule + exceptions, every time you use the rule, you have to look through the exceptions first and then apply the rule. For H2 where everything is individually stored, you just wander down the list by frequency until you get to the individual item you care about.
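The search-cost comparison above can be sketched roughly as follows. This is only an illustration under assumptions that are not spelled out in this post: item frequencies are taken to be Zipfian, stored items are searched serially in frequency-rank order, a fraction e/N of lookups is taken to target an exception, and the constant cost of applying the rule itself is left out. The function names are hypothetical.

```python
def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def cost_full_listing(n):
    """Expected search steps under H2: all n items individually stored.

    Assumes the item of rank r has probability 1 / (r * H_n) and takes
    r steps to reach, so the expectation works out to n / H_n.
    """
    return n / harmonic(n)


def cost_rule_plus_exceptions(n, e):
    """Expected search steps under H1: a rule plus e listed exceptions.

    Assumption: e/n of lookups target an exception (found by ranked search
    among the e exceptions); the remaining lookups scan all e exceptions
    before the rule applies. The rule's own constant cost is ignored.
    """
    if e == 0:
        return 0.0
    return (e / n) * cost_full_listing(e) + (1.0 - e / n) * e


n_items = 100
for n_exceptions in (10, 40):
    print(n_exceptions,
          cost_rule_plus_exceptions(n_items, n_exceptions),
          cost_full_listing(n_items))
```

With N = 100 under these assumptions, ten exceptions come out cheaper than full listing and forty come out more expensive, which is the qualitative trade-off described above.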

The Tolerance Principle seems to be the result of doing this likelihood calculation, and giving a categorical decision. Instead of spelling out P(data | H1) and P(data | H2) explicitly, Yang (2005) worked out the decision point: if e <= N / ln N, then P(data | H1) is higher (i.e., having the rule is worth it). So, if we wanted to generate the actual likelihood probabilities for H1 and H2, we’d want to plumb the depths of the Tolerance Principle derivation to determine these. And maybe that would be useful for tracking the trajectory of generalization over time, because it’s very possible these probabilities wouldn’t be close to 0 or 1 immediately. (Quick thoughts: P(data | H1) = something like (p_individualaccess)^e * (p_followsrule)^O; P(data | H2) = something like (p_individualaccess)^N.)
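As a rough sketch of the two pieces above, the categorical decision and the “quick thoughts” likelihoods can be written out like this. The function names are hypothetical, and `p_individual_access` and `p_follows_rule` are placeholders whose real values would have to come out of the Tolerance Principle derivation; nothing here says what they are.

```python
import math


def tolerance_threshold(n):
    """Largest number of exceptions a rule over n items tolerates: n / ln n."""
    if n < 2:
        raise ValueError("n must be at least 2 so that ln(n) > 0")
    return n / math.log(n)


def rule_is_productive(n, e):
    """Categorical decision: keep the rule (H1) if e <= n / ln n."""
    return e <= tolerance_threshold(n)


def log_likelihoods(n, e, p_individual_access, p_follows_rule):
    """Log versions of the quick-thought likelihoods, with O = n - e."""
    o = n - e
    log_h1 = e * math.log(p_individual_access) + o * math.log(p_follows_rule)
    log_h2 = n * math.log(p_individual_access)
    return log_h1, log_h2


def posterior_h1(n, e, p_individual_access, p_follows_rule):
    """P(H1 | data) with equal priors, so only the likelihoods matter."""
    log_h1, log_h2 = log_likelihoods(n, e, p_individual_access, p_follows_rule)
    diff = log_h2 - log_h1
    if diff > 700:  # avoid overflow in exp; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


print(rule_is_productive(100, 21), rule_is_productive(100, 22))
```

One thing the sketch makes visible: with fixed values for the two probabilities, the likelihood ratio reduces to (p_followsrule / p_individualaccess)^O, so any sensitivity to the number of exceptions would have to come from how those probabilities themselves depend on N and e.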

~~~

References:

Gagliardi, A., Feldman, N. H., & Lidz, J. 2012. When suboptimal behavior is optimal and why: Modeling the acquisition of noun classes in Tsez. In Proceedings of the 34th annual conference of the Cognitive Science Society (pp. 360-365).

Pearl, L., & Lidz, J. 2013. Parameters in Language Acquisition. The Cambridge Handbook of Biolinguistics, 129-159.

Pearl, L., & Mis, B. (in press - updated 2/2/15). The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Language.

Yang, C. 2005. On productivity. Linguistic Variation Yearbook, 5(1), 265-302.

Monday, February 15, 2016
