One of the things I really liked about this paper was that it implements a computational model that makes predictions, and then tests those predictions experimentally. It's becoming more of a trend to do both within a single paper, but often it's too involved to describe both parts, and so they end up in separate papers. Fortunately, here we see something concise enough to fit both in, and that's a lovely thing.
I also really liked that R&al investigate the logical problem of language acquisition (LPLA) by targeting one specific instance of that problem that's been held up (or used to be held up as recently as ten years ago) as an easily understood example of the LPLA. I'm definitely sympathetic to R&al's conclusions, but I don't think I believe the implication that this debunks the LPLA. I do believe it's a way to solve it for this particular instantiation, but the LPLA is about induction problems in general -- not just this one, not just subset problems, but all kinds of induction problems. And I do think that induction problems abound in language acquisition.
It was interesting to me how R&al talked about positive and negative evidence -- it almost seemed like they conflated two dimensions that are distinct: positive (something present) vs. negative (something absent), and direct (about that data point) vs. indirect (about related data points). For example, they equate positive evidence with "the reinforcement of successful predictions", but to me, that could be a successful prediction about what's supposed to be there (direct positive evidence) or a successful prediction about what's not supposed to be there (indirect negative evidence). Similarly, prediction error is equated with negative evidence, but a prediction error could be about predicting something should be there but it actually isn't (indirect negative evidence) or about predicting something shouldn't be there but it actually is (direct positive evidence -- and in particular, counterexamples). However, I do agree with their point that indirect negative evidence is a reasonable thing for children to be using, because of children's prediction ability.
Another curious thing for me was that the particular learning story R&al implement forces them to commit to what children's semantic hypothesis space is for a word (since it hinges on selecting the appropriate semantic hypothesis for the word as well as the appropriate morphological form, and using that to make predictions). This seemed problematic, because the semantic hypothesis space is potentially vast, particularly if we're talking about what semantic features are associated with a word. And maybe the point is their story should work no matter what the semantic hypothesis space is, but that wasn't obviously true to me.
As an alternative, it seemed to me that the same general approach could be taken without having to make that semantic hypothesis space commitment. In particular, suppose the child is merely tracking the morphological forms, and recognizes the +s regular pattern from other plural forms. This causes them to apply this rule to "mouse" too. Children's behavior indicates there's a point where they use both "mice" and "mouses", so this is a morphological hypothesis that allows both forms (H_both). The correct hypothesis only allows "mice" (H_mice), so it's a subset-superset relationship of the hypotheses (H_mice is a subset of H_both). Using Bayesian inference (and the accompanying Size Principle) should produce the same results we see computationally (the learner converges on the H_mice hypothesis over time). It seems like it should also be capable of matching the experimental results: early on, examples of the regular rule indirectly boost the H_both hypothesis more, but later on when children have seen enough suspicious coincidences of "mice" input only, the indirect boost to H_both matters less because H_mice is much more probable.
So then, I think the only reason to add on this semantic hypothesis space the way R&al's approach does is if you believe the learning story is necessarily semantic, and therefore must depend on the semantic features.
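To make the morphology-only alternative concrete, here is a minimal sketch of the Bayesian comparison between H_mice and H_both. It is not R&al's model. It assumes each hypothesis spreads its probability evenly over the plural forms it allows (the Size Principle), that the learner only ever hears "mice", and that tokens are independent. The function name and the low starting prior of 0.1 for H_mice (standing in for the early boost that regular plurals give to H_both) are hypothetical choices for illustration.

```python
def posterior_h_mice(n_mice_tokens, prior_h_mice=0.5):
    """Posterior probability of H_mice after hearing "mice" n times (never "mouses")."""
    # Size Principle: a hypothesis spreads its probability evenly over the
    # forms it allows, so the smaller hypothesis gives "mice" more mass.
    likelihood_h_mice = 1.0 ** n_mice_tokens  # H_mice allows only {"mice"}
    likelihood_h_both = 0.5 ** n_mice_tokens  # H_both allows {"mice", "mouses"}
    numerator = likelihood_h_mice * prior_h_mice
    denominator = numerator + likelihood_h_both * (1.0 - prior_h_mice)
    return numerator / denominator


# Assumed (hypothetical) starting point: experience with regular plurals
# has left H_both strongly favored over H_mice.
posteriors = [posterior_h_mice(n, prior_h_mice=0.1) for n in range(11)]

for n, p in enumerate(posteriors):
    print(f"{n:2d} tokens of 'mice': P(H_mice | data) = {p:.3f}")
```

Under these assumptions, every additional "mice" token halves the relative support for H_both, so the learner ends up at H_mice even from an unfavorable prior. Nothing in the sketch refers to semantic features.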
Some more specific thoughts:
(1) The U-shaped curve of development: R&al talk about the U-shaped curve of development in a way that seemed odd to me. In particular, in section 6 (p.767), they call the fact that "children who have been observed to produce mice in one context may still frequently produce overregularized forms such as mouses in another" a U-shaped trajectory. But this seems to me to just be one piece of the trajectory (the valley of the U, rather than the overall trajectory).
(2) The semantic cues issue comes back in an odd way in section 6.7, where R&al say that the "error rate of unreliable cues" will "help young speakers discriminate the appropriate semantic cues to irregulars" (p.776). What semantic cues would these be? (Aren't the semantics of "mouses" and "mice" the same? The difference is morphological, rather than semantic.)
(3) R&al promote the idea that a useful thing computational approaches to learning do is "discover structure in the data" rather than trying to "second-guess the structure of those data in advance" (section 7.4, p.782). That seems like a fine idea, but I don't think it's actually what they did in this particular computational model. In particular, didn't they predefine the hypothesis space of semantic cues? So yes, structure was discovered, but it was discovered in a hypothesis space that had already been constrained (and this is the main point of modern linguistic nativists, I think -- you need a well-defined hypothesis space to get the right generalizations out).
Discussion board for the reading group based out of UCI.
Friday, May 30, 2014
Monday, May 19, 2014
Next time on 6/2/14 @ 3:00pm in SBSG 2221 = Ramscar et al. 2013
Thanks to everyone who was able to join us for our delightful discussion of Kol et al. 2014! We had some really thoughtful commentary on model evaluation. Next time on Jun 2 @ 3:00pm in SBSG 2221, we'll be looking at an article that discusses how children recover from errors during learning, and how this relates to induction problems in language acquisition.
Ramscar, M., Dye, M., & McCauley, S. 2013. Error and expectation in language learning: The curious absence of mouses in adult speech. Language, 89(4), 760-793.
http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/RamscarEtAl2013_RecoveryFromOverreg.pdf
See you then!
Friday, May 16, 2014
Some thoughts on Kol et al. 2014
I completely love that this paper is highlighting the strength of computational models for precisely evaluating theories about language learning strategies (which is an issue near and dear to my heart). As K&al2014 so clearly note, a computational model forces you to implement all the necessary pieces of your theory and can show you where parts are underspecified. And then, when K&al2014 demonstrate the issues with the TBM, they can identify what parts seem to be causing the problem and where the theory needs to include additional information/constraints.
On a related note, I love that K&al2014 are worrying about how to evaluate model output — again, an issue I’ve been thinking about a lot lately. They end up doing something like a bigger picture version of recall and precision — we don’t just want the model to generate all the true utterances (high recall). We want it to also not generate the bad utterances (high precision). And they demonstrate quite clearly that the TBM’s generative power is great…so great that it generates the bad utterances, too (and so has low precision from this perspective). Which is not so good after all.
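As a rough illustration of the recall/precision framing, here is a minimal sketch using the standard set-based definitions. This is not K&al2014's actual evaluation procedure, which the post only describes as "something like" recall and precision. The function name and the toy utterances are made up for the example.

```python
def precision_recall(generated, attested):
    """Set-based precision and recall of generated utterances against attested ones."""
    generated, attested = set(generated), set(attested)
    true_positives = generated & attested
    # Precision: what fraction of what the model generates is actually attested?
    precision = len(true_positives) / len(generated) if generated else 0.0
    # Recall: what fraction of the attested utterances can the model generate?
    recall = len(true_positives) / len(attested) if attested else 0.0
    return precision, recall


# Hypothetical toy data: the model covers every attested utterance,
# but also produces word orders nobody would say.
attested = ["want more juice", "where is the ball"]
generated = attested + ["juice more want", "the is ball where"]

precision, recall = precision_recall(generated, attested)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case the model gets perfect recall while half of its output is bad, which is the overgeneration pattern described above.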
But what was even more interesting to me was their mention of measures like perplexity to test the “quality of the grammars” learned, with the idea that good quality grammars make the real data less perplexing. Though they didn’t do it here, I wonder if there’s a reasonable way to do that for the learning strategy they talk about here — it’s not a grammar exactly, but it’s definitely a collection of units and operations that can be used to generate an output. So, as long as you have a generative model for how to produce a sequence of words, it seems like you could use a perplexity measure to compare this particular collection of units and operations against something like a context-free grammar (or even just various versions of the TBM learning strategy).
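To show what a perplexity comparison could look like, here is a minimal sketch. It assumes each candidate model can assign a probability to every word of a held-out utterance. A smoothed unigram model and a uniform model stand in for the things one would actually compare (e.g., a TBM-style collection of units and operations vs. a context-free grammar). The function names, the add-one smoothing, and the toy utterances are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_unigram(utterances, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(word for utt in utterances for word in utt.split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def perplexity(word_probs, utterances):
    """Per-word perplexity: 2 to the power of the average negative log2 probability."""
    log_prob, n_words = 0.0, 0
    for utt in utterances:
        for word in utt.split():
            log_prob += math.log2(word_probs[word])
            n_words += 1
    return 2 ** (-log_prob / n_words)


# Hypothetical toy data, not taken from any corpus.
train = ["more juice", "more milk", "want more juice"]
held_out = ["want more milk"]
vocab = {"more", "juice", "milk", "want", "ball"}

unigram = train_unigram(train, vocab)
uniform = {word: 1 / len(vocab) for word in vocab}

print(f"unigram perplexity: {perplexity(unigram, held_out):.3f}")
print(f"uniform perplexity: {perplexity(uniform, held_out):.3f}")
```

The model with lower perplexity finds the held-out data less surprising. The same `perplexity` function would apply to any generative model that can score a word sequence, as long as it is given that model's probabilities.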
Some more targeted thoughts:
(1) K&al2014 make a point in the introduction that simulations that “specifically implement definitions provided by cognitive models of language acquisition are rare”. I found this a very odd thing to say: isn’t every model an implementation of some theory of a language learning strategy? Maybe the point is more that we have a lot of cognitive theories that don’t yet have computational simulations.
(2) There’s a certain level of arbitrariness that K&al2014 note for things like how many matching utterances have to occur for frames to be established (e.g., if it occurs twice, it’s established). Similarly, the preference for choosing consecutive matches over non-consecutive matches is ranked above the preference for choosing more frequent matches. It’s not clear there are principled reasons for this ordering (at least, not from the description here; in fact, I don’t think the consecutive preference is implemented in the model K&al2014 put together later on). So, in some sense, these are sort of free parameters in the cognitive theory.
(3) Something that struck me about having high recall on the child-produced utterances with the TBM model — K&al2014 find that the TBM approach can account for a large majority of the utterances (in the high 80s and sometimes 90s). But what about the rest of them (i.e., those 10 or 20% that aren’t so easily reconstructable)? Is it just a sampling issue (and so having denser data would show that you could construct these utterances too)? Or is it more what the linguistic camp tends to assume, where there are knowledge pieces that aren’t a direct/transparent translation of the input? In general, this reminds me of what different theoretical perspectives focus their efforts on — the usage-based camp (and often the NLP camp for computational linguistics) is interested in what accounts for most of everything out there (which can maybe be thought of as the “easy” stuff), while the UG-based camp is interested in accounting for the “hard” stuff (even though that may be a much smaller part of the data).
Monday, May 5, 2014
Next time on 5/19/14 @ 3:00pm in SBSG 2221 = Kol et al. 2014
Thanks to everyone who was able to join us for our thorough discussion of Orita et al. 2013! We had some really excellent ideas for how to extend the model to connect with children's interpretations of utterances. Next time on May 19 @ 3:00pm in SBSG 2221, we'll be looking at an article that discusses how to evaluate formal models of acquisition, focusing on a particular model of early language acquisition as a case study:
Kol, S., Nir, B., & Wintner, S. 2014. Computational evaluation of the Traceback Method. Journal of Child Language, 41(1), 176-199.
http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/KolEtAl2014_CompEvalTraceback.pdf
See you then!
Friday, May 2, 2014
Some thoughts on Orita et al. 2013
There are several aspects of this paper that I really enjoyed. First, I definitely appreciate the clean and clear description of the circularity in this learning task, where you can learn about the syntax if you know the referents…and you can learn about the referents if you know the syntax (chicken and egg, check).
I also love how hard the authors strive to ground their computational model in empirical data. Now granted, the human simulation paradigm may have its own issues (more on this below), but it’s a great way to try to get at least some approximation of the contextual knowledge children might have access to.
I also really liked the demonstration of the utility of discourse/non-linguistic context information vs. strong syntactic prior knowledge — and how having the super-strong syntax knowledge isn’t enough. This is something that’s a really important point, I think: It’s all well and good to posit detailed, innate, linguistic knowledge as a necessary component for solving an acquisition problem, but it’s important to make sure that this component actually does solve the learning problem (and be aware of what else it might need in order to do so). This paper provides an excellent demonstration of why we need to check this…because in this case, that super-strong syntactic knowledge didn’t actually work on its own. (Side note: The authors are very aware that their model still relies on some less-strong syntactic knowledge, like the relevance of syntactic locality and c-command, but the super-strong syntactic knowledge was on top of that less-strong knowledge.)
More specific thoughts:
(1) The human simulation paradigm (HSP):
In some sense, this task strikes me as similar to ideal learner computational models — we want to see what information is useful in the available input. For the HSP, we do this by seeing what a learner with adult-level cognitive resources can extract. For ideal learners, we do this by seeing what inferences a learner with unlimited computational resources can make, based on the information available.
On the other hand, there’s definitely a sense in which the HSP is not really an ideal learner parallel. First, adult-level processing resources are not the same as unlimited processing resources (they’re just better than child-level processing resources). Second, the issue with adults is that they have a bunch of knowledge to build on about how to extract information from both linguistic and non-linguistic context…and that puts constraints on how they process the available information that children might not have. In effect, the adults may have biases that cause them to perceive the information differently, and this may actually be sub-optimal when compared to children (we don’t really know for sure…but it’s definitely different from what children do).
Something which is specific to this particular HSP task is that the stated goal is to “determine whether conversational context provides sufficient information for adults” to guess the intended referent. But where does the knowledge about how to use the conversational context to interpret the blanked out NP (as either reflexive, non-reflexive, or lexical) come from? Presumably from adults’ prior experience with how these NPs are typically used. This isn’t something we think children would have access to, though, right? So this is a very specific case of that second issue above, where it’s not clear that the information adults extract is a fair representation of the information children extract, due to prior knowledge that adults have about the language.
Now to be fair, the authors are very aware of this (they have a nice discussion about it in the Experiment 1 discussion section), so again, this is about trying to get some kind of empirical estimate to base their computational model’s priors on. And maybe in the future we can come up with a better way to get this information. For example, it occurs to me that the non-linguistic context (i.e., environment, visual scene info) might be usable. If the caretaker has just bumped her knee, saying “Oops, I hurt myself” is more likely than “Oops, I hurt you”. It may be that the conversational context approximated this to some extent for adults, but I wonder if this kind of thing could be extracted from the video samples we have on CHILDES. What you’d want is a variant of the HSP where you show the video clip with the NP beeped out, so the non-linguistic context is available, along with the discourse information in the preceding and subsequent utterances.
(2) Figure 2: Though I’m fairly familiar with Bayesian models by now, I admit that I loved having text next to each level reminding me what each variable corresponded to. Yay, authors.
(3) General discussion point at the end about unambiguous data: This is a really excellent point, since we don’t like to have to rely on the presence of unambiguous data too much in real life (because typically when we go look for it in realistic input, it’s only very rarely there). Something I’d be interested in is how often unambiguous data for this pronoun categorization issue does actually occur. If it’s never (or almost never, relatively speaking), then this becomes a very nice selling point for this learning model.