Computational Models of Language (at UC Irvine): February 2016

Monday, February 29, 2016

Some thoughts on Goldberg & Boyd 2015

I definitely appreciated G&B2015’s clarification of how precisely statistical preemption and categorization are meant to work for learning about a-adjectives (or at least, one concrete implementation of it). In particular, statistical preemption is likened to blocking, which means the learner needs to have an explicit set of alternatives over which to form expectations. For a-adjectives, the relevant alternatives could be something like “the sleeping boy” vs. “the asleep boy”. If both are possible, then “the asleep boy” should appear sometimes (i.e., with some probability). When it doesn’t, this is because it’s blocked. Clearly, we could easily implement this with Bayesian inference (or as G&B2015 point out themselves, with simple error-driven learning), provided we have the right hypothesis space.

For example, H1 = only “the sleeping boy” is allowed, while H2 = “the sleeping boy” and “the asleep boy” are both allowed. H1 will win over H2 in a very short amount of time as long as children hear lots of non-a-adjective equivalents (like “sleeping”) in this syntactic construction. The real trick is making sure these are the hypotheses under consideration. For example, there seems to be another reasonable way to think about the hypothesis space, based on the relative clause vs. attributive syntactic usage. H1 = “the boy who is asleep”; H2 = “the asleep boy” and “the boy who is asleep”. Here, we really need to see instances of relative-clause usage to drive us towards H1.
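To make the first H1 vs. H2 comparison concrete, here is a minimal Python sketch of the Bayesian version. It is not G&B2015's implementation. The function name `posterior_h1`, the per-use rate `q_asleep` at which “the asleep boy” would show up if it were allowed, and the equal priors are all assumptions chosen for illustration.

```python
def posterior_h1(n_attributive, q_asleep=0.05, prior_h1=0.5):
    """Posterior probability of H1 after n attributive uses, none with "asleep".

    H1: only "the sleeping boy" is allowed, so P(no "asleep" on one use) = 1.
    H2: both forms are allowed, so P(no "asleep" on one use) = 1 - q_asleep.
    q_asleep is a hypothetical rate, not an estimate from any corpus.
    """
    likelihood_h1 = 1.0
    likelihood_h2 = (1.0 - q_asleep) ** n_attributive
    weighted_h1 = prior_h1 * likelihood_h1
    weighted_h2 = (1.0 - prior_h1) * likelihood_h2
    return weighted_h1 / (weighted_h1 + weighted_h2)


if __name__ == "__main__":
    # Each attributive use that is not "the asleep boy" nudges the learner toward H1.
    for n in (0, 10, 50, 100):
        print(n, round(posterior_h1(n), 3))
```

How fast H1 wins depends entirely on the assumed `q_asleep`: the smaller the expected rate of the blocked form, the more “sleeping”-type data the learner needs before its absence becomes informative.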

It makes me think about the more general issue of determining the hypothesis space that statistical preemption (or Bayesian inference, etc.) is supposed to operate over. G&B2015 explicitly note this themselves in the beginning of section 5, and talk more about hypothesis space construction in 5.2. For the a-adjective learning story G&B2015 promote, I would think some sort of recognition of the semantic similarity of words and the syntactic environments is the basis of the hypothesis space generation.

Some other thoughts:

(1) Section 1: I thought it was an interesting point about “afraid” being sucked into the a-adjective class even though it lacks the morphological property (aspectual “a-” prefix + free morpheme, the way we see with “asleep”, “ablaze”, “alone”, etc.). This is presumably because of the relevant distributional properties categorizing it with the other a-adjectives? (That is, it’s “close enough”, given the other properties it has.)

(2) Section 2: Just as a note about the description of the experimental tasks, I wonder why they didn’t use novel a-adjectives that matched the morphological segmentation properties that the real a-adjectives and alternatives have, i.e., asleep and sleepy, so ablim and blimmy (instead of chammy).

(3) Section 3: G&B2015 note that Yang’s child-directed survey didn’t find a-adjectives being used in relative clauses (i.e., the relevant syntactic distribution cue). So, this is a problem if you think you need to see relative clause usage to learn something about a-adjectives. But, as mentioned above (and also in Yang 2015), I think that’s only one way to learn about them. There are other options, based on semantic equivalents (“sleeping”, “sleepy”, etc. vs. “asleep”) or similarity to other linguistic categories (e.g., the Yang 2015 approach with locative particles).

(4) Section 4: I really appreciate the explicit discussion of how the distributional similarity-based classification would need to work for the locative particles-strategy to pan out (i.e., Table 1). It’s the next logical step once we have Yang’s proposal about using locative particles in the first place.

(5) Section 4: I admit a bit of trepidation about the conclusion that the available distributional evidence for locative particles is insufficient to lump them together with a-adjectives. It’s the sort of thing where we have to remember that children are learning a system of knowledge, and so while the right-type adverb modification may not be a slam dunk for distinguishing a-adjectives from non-a-adjectives, I do wonder if the collection of syntactic distribution properties (e.g., probability of coordination with PPs, etc.) would cause children to lump a-adjectives together with locative particles and prepositional phrases and, importantly, not with non-a-adjectives. Or perhaps, more generally, the distributional information might cause children to just separate out a-adjectives, and note that they have some overlap with locative particles/PPs and also with regular non-a-adjectives.

Side note: This is the sort of thing ideal learner models are fantastic at telling us: is the information sufficient to draw conclusion x? In this case, the conclusion would be that non-a-adjectives go together, given the various syntactic distribution cues available. G&B2015 touch on this kind of model at the beginning of section 5.2, mentioning the Perfors et al. 2010 work.

(6) Section 5: I was delighted to see the Hao (2015) study, which gets us the developmental trajectory for a-adjective categorization (or at least, how a-adjectives project onto syntactic distribution). Ten years old is really old for most acquisition stuff. So, this accords with the evidence being pretty scanty (or at least, children taking a while until they can recognize that the evidence is there, and then make use of it).

Monday, February 15, 2016

Some thoughts on Yang 2015

Just from a purely organizational standpoint, I really appreciate how explicitly the goals of this paper are laid out (basically, (i) here’s why the other strategy won’t work, and (ii) here’s why this new one does). Also, because of the clarity of the presentation, I’ll be interested to read Goldberg & Boyd’s response for next time. Additionally, I greatly enjoyed reading about the application of what I’ve been calling “indirect positive evidence” (Pearl & Mis in press), that is, things that are present in the input that can be leveraged indirectly to tell you about something else you’re currently trying to learn about (here: leverage distributional cues for locative particles and PPs to learn about a-adjectives). I really do think this is the way to deal with a variety of acquisition problems (and as I’ve mentioned before, it’s the same intuition that underlies both linguistic parameters and Bayesian overhypotheses: Pearl & Lidz 2013). In my opinion, the more we see explicit examples of how indirect positive evidence can work for various language acquisition problems, the better.

Some more specific thoughts:

(1) I found it quite helpful to have the different cues to a-adjectives listed out, in particular that the phonological cue of beginning with the schwa isn’t 100%, while the morphological cue of being split into aspectual “a” (= something like presently occurring?) + root is nearly 100%. It reminds me of the Gagliardi et al. (2012) work on children’s differing sensitivity to available cues when categorizing nouns in Tsez. In particular, Gagliardi et al. found that the model had to be more sensitive to phonological cues than semantic cues in order to match children’s behavior. This possibly has to do with the ability to reliably observe phonological cues as compared to semantic cues. I suspect the fairly diagnostic morphological cue might also be more observable, since it involves recognition of a free morpheme within the a-adjective (e.g., wake in awake).

(2) Related point: the actual trajectory of children’s development with a-adjectives. This is something that seems really relevant for determining which learning strategies children are using (as Yang himself points out, when he notes that all the experiments from Boyd & Goldberg are with adults). Do children make errors and use infrequent non-a-adjectives only predicatively (i.e., they don’t think they can use them attributively)? And on the flip side, do they use some a-adjectives attributively? Knowing about the errors children make (or lack thereof) can help us decide if they’re really learning on a lexical item by lexical item basis, or instead recognizing certain classes of adjectives and therefore able to make generalizations from one class instance to another (or perhaps more likely, at what age they recognize the classes of adjectives).

Yang quite admirably does a corpus search of naturalistic child productions, which is consistent with children knowing not to use a-adjectives attributively, but it’s not quite the same as behavioral evidence where children definitively show they disallow (or strongly disprefer) the attributive usage.

(3) Indirect negative evidence: One of Yang’s concerns is that this kind of evidence “requires comparing the extensions of the competing hypotheses”. I get the general gist of this, but I think we run into the same problem with all the language hypothesis spaces we set up, where one language’s parameter is a subset of another’s. That is, classical approaches like the Subset Principle run into the exact same problem. This is something we always have to deal with, and I think it depends very much on the hypothesis spaces children entertain.

Moreover, on the flip side, how much of a problem is it really? For the concrete example we’re given about the language that includes “the asleep cat” vs. the language that doesn’t, the extensional difference is one utterance (or one category of utterances, if we group them all together under a-adjectives). How computationally hard is this to calculate? Importantly, we really just need to know that the difference is one construction; the rest of the language’s extension doesn’t matter. So it seems like there should be a way to form a hypothesis space exactly like the one described above (P = “the asleep cat” is allowed vs. not-P = “the asleep cat” is not allowed)?

Also, related to the point about how Boyd & Goldberg’s strategy works — does it even matter what other constructions do appear with those adjectives (i.e., the cat is asleep)? Isn’t it enough that “the asleep cat” doesn’t? I guess the point is that you want to have appropriate abstract classes like the ones described in section 3.1, i.e., predicative usage = “the cat is asleep”, “the cat is nice”; attributive = *“the asleep cat”, “the nice cat”. This makes the P hypothesis more like “asleep can be used both predicatively and attributively” and the not-P class is “asleep can be used only predicatively”. But okay, let’s assume children have enough syntactic knowledge to manage this. Then we go back to the point about how hard it is in practice to deal with hypothesis space extensions. Especially once we add this kind of abstraction in, it doesn’t seem too hard at all, unless I’m missing something (which is always possible).

(4) I personally have a great love for the Tolerance Principle, and I enjoyed seeing its usage here. But, as always, it gets me thinking about the relationship between the Tolerance Principle and Bayesian inference, especially when we have nice hypothesis spaces laid out like we do here. So, here’s my thinking at the moment:

For the Tolerance Principle, we have a setup like this:

Hypotheses:

H1 = the generalization applies to all N items, even though e exceptions exist.

H2 = there is no generalization, and all N items do their own thing.

Data:

O = items the pattern/rule is observed to apply to

e = exceptional items the pattern/rule should apply to but doesn’t

N - O - e = unobserved items (if any). We can simplify this and just assume all items have been observed to either follow the pattern (and be in O) or not (and be in e), so N - O - e = 0.

Turning over to Bayesian thinking, let’s assume the priors for H1 and H2 are equal. So, all the work is really done in the likelihood, i.e., P(Hx | data) is proportional to P(Hx) [prior] * P(data | Hx) [likelihood].

Okay, so how do we calculate P(data | H1) vs. P(data | H2)? The data here is O pattern-following items and e exceptions, where N = O + e.

To calculate both likelihoods, we need to know the probability of generating those O pattern-following items and the probability of generating those e exceptions under both H1 and H2. I think this kind of question is where we get into the derivation of the Tolerance Principle, as described by Yang (2005). In particular, there’s an idea that if you have a rule (as in H1), it’s cheaper to store and access the right forms when there are enough items that follow the rule.

More specifically, it’s some kind of constant cost for those O items (rule application), though the constant cost involves some work because you actually have to do the computation of the rule/pattern over the item. For the e exceptions, there’s some cost of accessing the stored form individually, based on the frequency of the stored items. Importantly, if you have H1 with a rule + exceptions, every time you use the rule, you have to look through the exceptions first and then apply the rule. For H2 where everything is individually stored, you just wander down the list by frequency until you get to the individual item you care about.
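One way to make this storage-and-access intuition concrete is the Python sketch below. It rests on assumptions the description above doesn't spell out, so it shouldn't be read as the exact derivation in Yang (2005): items are searched serially in frequency-rank order, token frequencies are Zipfian (the item at rank i has probability 1/(i * H_n), where H_n is the n-th harmonic number), and under H1 an item is an exception with probability e/N. The function names are hypothetical.

```python
def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def cost_list(n):
    """Expected number of list entries visited under H2 (everything stored).

    Assumes Zipfian frequencies and a frequency-ranked serial search, so the
    expectation is sum_i i * 1/(i * H_n) = n / H_n.
    """
    if n == 0:
        return 0.0
    return n / harmonic(n)


def cost_rule(n, e):
    """Expected cost under H1 (rule plus e stored exceptions).

    An exception (assumed probability e/n) is found by searching the ranked
    exception list; a rule-following item is reached only after all e
    exceptions have been checked and rejected.
    """
    p_exception = e / n
    return p_exception * cost_list(e) + (1.0 - p_exception) * e


if __name__ == "__main__":
    n = 100
    for e in (5, 10, 20, 30, 50):
        cheaper = "rule" if cost_rule(n, e) < cost_list(n) else "list"
        print(e, round(cost_rule(n, e), 2), round(cost_list(n), 2), cheaper)
```

Under these assumptions the rule is the cheaper option when exceptions are few, and the full list wins once there are many, which is the qualitative pattern the paragraph above describes.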

The Tolerance Principle seems to be the result of doing this likelihood calculation, and giving a categorical decision. Instead of spelling out P(data | H1) and P(data | H2) explicitly, Yang (2005) worked out the decision point: if e <= N / ln N, then P(data | H1) is higher (i.e., having the rule is worth it). So, if we wanted to generate the actual likelihood probabilities for H1 and H2, we’d want to plumb the depths of the Tolerance Principle derivation to determine these. And maybe that would be useful for tracking the trajectory of generalization over time, because it’s very possible these probabilities wouldn’t be close to 0 or 1 immediately. (Quick thoughts: P(data | H1) = something like (p_individualaccess)^e * (p_followsrule)^O; P(data | H2) = something like (p_individualaccess)^N.)
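The categorical decision point e <= N / ln N is easy to compute directly. Below is a minimal Python sketch; the function names are mine, and the check is only meaningful for N of at least 2, since ln 1 = 0.

```python
import math


def tolerance_threshold(n_items):
    """Maximum number of exceptions a rule over n_items can tolerate: N / ln N."""
    if n_items < 2:
        raise ValueError("threshold is undefined for fewer than 2 items")
    return n_items / math.log(n_items)


def rule_is_tolerated(n_items, n_exceptions):
    """True if the number of exceptions is at or below N / ln N."""
    return n_exceptions <= tolerance_threshold(n_items)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(tolerance_threshold(n), 2))
    print(rule_is_tolerated(100, 21), rule_is_tolerated(100, 22))
```

For N = 100 the threshold is about 21.7, so a rule with 21 exceptions is tolerated and one with 22 is not. The tolerated proportion of exceptions, 1 / ln N, shrinks as N grows.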

~~~

References:

Gagliardi, A., Feldman, N. H., & Lidz, J. 2012. When suboptimal behavior is optimal and why: Modeling the acquisition of noun classes in Tsez. In Proceedings of the 34th annual conference of the Cognitive Science Society (pp. 360-365).

Pearl, L., & Lidz, J. 2013. Parameters in Language Acquisition. The Cambridge Handbook of Biolinguistics, 129-159.

Pearl, L., & Mis, B. (in press - updated 2/2/15). The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Language.

Yang, C. (2005). On productivity. Linguistic variation yearbook, 5(1), 265-302.

Monday, February 1, 2016

Some thoughts on van Schijndel & Elsner 2014

I really like the idea of seeing how far you can get with understanding filler-gap interpretation, given very naive ideas about language structure (i.e., linear w.r.t. verb position, as vS&E2014 do). Even if it’s not this particular shallow representation (and instead maybe a syntactic skeleton like the kind Gutman et al. 2015 talked about), the idea of what a “good enough” representation can do for scaffolding other acquisition processes is something near and dear to my heart.

One niggling thing: given that vS&E2014 say that this model represents a learner between 15 and 25-30 months, it’s likely the syntactic knowledge is vastly more sophisticated at the end of the learning (i.e., ~25 months). So the assumptions of simplified syntactic input may not be as necessary (or appropriate) later on in development. More generally, this kind of extended modeling timeline makes me want more integration with the kind of acquisition framework of Lidz & Gagliardi (2015), which incorporates developing knowledge into the model’s input & inference.

One other thing I really appreciated in this paper was how much they strove to connect the modeling assumptions and evaluation with developmental trajectory data. We can argue about the implementation of the information those empirical data provide, sure, but at least vS&E2014 are trying to seriously incorporate the known facts so that we can get an informative model.

Other specific thoughts:

(1) At the end of section 3, vS&E2014 say the model “assumes that semantic roles have a one-to-one correspondence with nouns in a sentence”. So…is it surprising that “A and B gorped” is interpreted as “A gorped B” since it’s built into the model to begin with? That is, this misinterpretation is exactly what a one-to-one mapping would predict - A and B don’t get the same role (subject/agent) because only one of them can get the role. Unless I misunderstood what the one-to-one correspondence is doing.

(2) I wasn’t quite sure about this assumption mentioned in section 3: “To handle recursion, this work assumes children treat the final verb in each sentence as the main verb…”. So in the example in Table 1, “Susan said John gave (the) girl (a) book”, “gave” is the “main” verb because…why? Why not just break the sentence up by verbs anyway? (That is, “said” would get positions relative to it and “gave” would get positions relative to it, and they might overlap, but…okay?) Is this assumption maybe doing some other kind of work, like with respect to where gaps tend to be?

(3) If I’m understanding the evaluation in section 5 correctly, it seems that semantic roles commonly associated with subject and object (i.e., agent, patient, etc. depending on the specific verb) are automatically assigned by the model. I think this works for standard transitive and intransitive verbs really well, but I wonder about unaccusatives (fall, melt, freeze, etc.) where the subject is actually the “done-to” thing (i.e., Theme or Patient, so the event is actually affecting that thing). This is something that would be available if you had observable conceptual information (i.e., you could observe the event the utterance refers to and determine the role that participant plays in the event).

Practically speaking, it means the model assigning “theme/patient” to the subject position (preverbal) would be correct for unaccusatives. But I don’t think the current model does this - in fact, if it just uses “subject” and “object” to stand in for thematic/conceptual roles, the “correct” assignment would be the subject NP of unaccusatives as an “object” (Theme/Patient)….which would be counted as incorrect for this model. (Unless the BabySRL corpus that vS&E2014 used labels thematic roles and not just grammatical roles? It was a bit unclear.) I guess the broader issue is the complexity of different predicate types, and the fact that there isn’t a single mapping that works for all of them.

This came up again for me in section 6 when vS&E2014 compare their results to the competing BabySRL model and they note that when given a NV frame (like with intransitives or unaccusatives), BabySRL labels the lone NP as an “object” 30 or 40% of the time. If the verb is an unaccusative, this would actually be correct (again, assuming “object” maps to “patient” or “theme”).

(4) Section 6: “…these observations suggest that any linear classifier which relies on positioning features will have difficulties modeling filler-gap acquisition” — including the model here? It seemed like the one vS&E2014 used captured the filler-gap interpretations effects they were after, and yet relied on positioning features (relative to the main verb).

References:

Gutman, A., Dautriche, I., Crabbe, B., & Christophe, A. (2015). Bootstrapping the Syntactic Bootstrapper: Probabilistic Labeling of Prosodic Phrases. Language Acquisition, 22(3), 285-309.

Lidz, J., & Gagliardi, A. (2015). How Nature Meets Nurture: Universal Grammar and Statistical Learning. Annu. Rev. Linguist., 1(1), 333-353.
