Wednesday, December 3, 2014

See you in the winter!

Thanks so much to everyone who was able to join us for our invigorating discussion today about Richie et al. 2014, and to everyone who's joined us throughout the fall quarter! The CoLa Reading Group will resume again in the winter quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Monday, December 1, 2014

Some thoughts on Richie et al. 2014

One thing I really like about this article is that it provides a nice example of how to make a concrete, empirically grounded computational model to test an intuitive theory about a particular phenomenon. In this case, it’s the impact of multiple speakers on lexicon emergence (in particular, the many-to-one vs. many-to-many dynamic). While I do have one or two questions about the model implementation, it was generally pretty straightforward to understand and nicely intuitive in its own right — and so for me, this is an excellent demonstration of how to use modeling in an informative way. On a related note, while the authors certainly packed both experimental and modeling pieces into the paper, it actually didn’t feel all that rushed (but perhaps this is because I’m already fairly familiar with the model). 

Some more targeted thoughts:

p.185, Introduction: “The disconnect between experimental and computational approaches is a general concern for research on collective and cooperative behavior” — I think this is really the biggest concern I always have for models of general language evolution. At least for the sign lexicon emergence, we actually have examples of it happening “in the wild”, so we can ground the models that way (in the input representation, transmission, output evaluation, etc.). But this becomes far harder (if not impossible) to do for the evolution of language in the human species. What are reasonable initial conditions?  What is the end result supposed to look like anyway? Ack. And that doesn’t even begin to get into the problem of idealization in evolutionary modeling (what to idealize, is it reasonable to do that, and so on). So for my empirical heart, one of the main selling points of this modeling study is the availability of the empirical data, and the attempt to work it into the model in a meaningful way.

p.185, Introduction: “A probabilistic model of language learning…situated in a social setting appears to capture the observed trends of conventionalization” — One of the things I’m wondering is how much the particulars of the probabilistic learning model matter. Could you be a rational Bayesian type for example, rather than a reinforcement learner, and get the same results? In some sense, I hope this would be true since the basic intuition that many-to-many is better for convergence seems sensible. But would the irrelevance of population dynamics persist, or not? Since that’s one of the less intuitive results (perhaps due to the small population size, perhaps due to something else), I wonder.

p.186, 2.1.4 Coding: “…we coded every gesture individually for its Conceptual Component” — On a purely practical note, I wonder how these were delineated. Is it some recognizable unit in time (so the horns of the cow would occur before the milking action of the cow, and that’s how you decide they’re two meaning pieces)? Is it something spatial?  Something else, like a gestalt of different features? I guess I’ve been thinking about the simultaneous articulation aspects of signed languages like ASL, and this struck me as something that could be determined by human perceptual bias (which could be interesting in its own right).

p.190, 3.2: For the adjustment of p using the Linear-Reward-Penalty, is the idea that each Conceptual Component’s (CC’s) probability is adjusted, one at a time? I’m trying to map this to what I recall of previous uses of Yang’s learning model (e.g., Yang 2004), where the vector would be of different grammar parameters, and the learner actually can’t tell which parameter is responsible for successful analysis of the observed data point. In that case, all parameter probabilities are either rewarded or punished, based on the success (or failure) of input analysis. Here, since the use (or not) of a given CC is observed, you don’t have to worry about that. Instead, each one can be rewarded or punished based on its observed use (or not). So, in some sense, this is simpler than previous uses of the learning model, precisely because this part of learning is observed.
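To make that reading concrete, here is a minimal sketch of a per-CC Linear-Reward-Penalty update. It is only my interpretation, not the authors' implementation: the function name `update_cc_probabilities`, the dictionary representation, and the learning rate `gamma` are all assumptions for illustration. The update rule itself is the standard linear reward-penalty form (reward: p + gamma(1 - p); penalty: (1 - gamma)p).

```python
def update_cc_probabilities(probs, observed_ccs, gamma=0.01):
    """Return updated per-CC probabilities after one observed gesture.

    probs        -- dict mapping each CC label to its current probability
    observed_ccs -- set of CC labels that were used in the observed gesture
    gamma        -- learning rate (hypothetical value, not from the paper)
    """
    updated = {}
    for cc, p in probs.items():
        if cc in observed_ccs:
            # Reward: this CC was observed, so move p toward 1.
            updated[cc] = p + gamma * (1.0 - p)
        else:
            # Penalty: this CC was not observed, so move p toward 0.
            updated[cc] = (1.0 - gamma) * p
    return updated


# Example: the interlocutor used "horns" but not "milking" for COW.
before = {"horns": 0.5, "milking": 0.5}
after = update_cc_probabilities(before, {"horns"}, gamma=0.1)
```

Because each CC's use is directly observed, every entry gets its own reward or penalty. In the grammar-parameter setting, by contrast, a single success or failure signal would have to be applied to all parameters at once.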

p.191, 3.5: “…we run the simulations over 2 million instances of communications” -- So for the homesigners with the many-to-one setup, this can easily be interpreted as 2 million homesigner-nonhomesigner interactions. For the deaf population simulation, is this just 2 million deaf-deaf communication instances, among any of the 10 pairs in a population of 5? Or does each of the 10 pairs get 2 million interactions? The former seems fairer for comparison with the homesigner population, but the latter is also a sensible instantiation. If it’s the latter, then the overall frequency of interactions within the population might be what’s driving faster convergence.
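A quick back-of-the-envelope calculation shows how much the two readings differ. This is just arithmetic under my own assumption that, with a shared budget, pairs are chosen uniformly at random; the function name `interactions_per_agent` is made up for illustration.

```python
from itertools import combinations


def interactions_per_agent(n_agents, budget, per_pair):
    """Return (number of pairs, expected interactions each agent takes part in).

    per_pair=False: the budget is shared across all pairs (pairs assumed to
                    be sampled uniformly at random).
    per_pair=True:  every pair gets the full budget.
    """
    n_pairs = len(list(combinations(range(n_agents), 2)))
    pairs_per_agent = n_agents - 1  # each agent belongs to n - 1 pairs
    if per_pair:
        return n_pairs, pairs_per_agent * budget
    return n_pairs, budget * pairs_per_agent / n_pairs


shared = interactions_per_agent(5, 2_000_000, per_pair=False)
separate = interactions_per_agent(5, 2_000_000, per_pair=True)
```

Under these assumptions, the shared-budget reading gives each deaf agent fewer interactions than the homesigner (who takes part in all 2 million), while the per-pair reading gives each agent several times more. So which reading is right matters for interpreting the speed of convergence.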

Yang, C. D. (2004). Universal Grammar, statistics or both? Trends in cognitive sciences, 8(10), 451-456. 

Wednesday, November 19, 2014

Next time on 12/3/14 @ 4:00pm in SBSG 2221 = Richie et al. 2014

Thanks to everyone who was able to join us for our illuminating discussion of Qing & Franke 2014!  For our next CoLa reading group meeting on Wednesday December 3 at a special time of 4:00pm in SBSG 2221, we'll be looking at an article that investigates the emergence of homesign lexicons, paying particular attention to the impact of other speakers.

Richie, R., Yang, C., & Coppola, M. 2014. Modeling the Emergence of Lexicons in Homesign. Topics in Cognitive Science, 6, 183-195.

See you then!

Monday, November 17, 2014

Some thoughts on Qing & Franke 2014

One of the things I really liked about this article was the attention to formalizing the pragmatic facts, and the attention to explaining the intuitions behind the model. (That being said, I probably could have benefited from more of a primer on degree semantics, since I had some trouble following the exact instantiation of that in the model.) Still, Q&F2014’s point is to demonstrate the utility of certain assumptions for learning about degree adjectives and then to rigorously evaluate them using standard Bayesian methods, and I think they succeeded on that computational-level goal. In general, I suspect the length constraint was a big issue for this paper — so much was packed in that of course many things had to be glossed over. I do wish Q&F had spent a bit more time on the discussion and conclusions, however — I was left wondering exactly what to make of these results as someone who cares about how acquisition works. For instance, what does an individual need to learn (c, theta?) vs. what’s already built in (lambda?)?

Some more targeted thoughts:

(1) p.2, “…to measure how efficient a standard theta for ‘tall’ is for describing basketball players, we calculate on average how likely the speaker will manage to convey the height of a random basketball player by adopting that standard.” — This text sounds like the goal is to convey exact height, rather than relative height (importantly, is the player in question “tall” relative to the standard theta?). But it seems like relative height would make more sense. (That is, “tall” doesn’t tell you the player's 7’1” vs. 7’2”, but rather that he’s tall compared to other basketball players, captured by that theta.)

(2) p.2, c: I admit, I struggled to understand how to interpret c specifically. I get the general point about how c captures a tradeoff between communicative efficiency and absolute general applicability (side note: which means…? It always applies, I think?). But what does it mean to have communicative efficiency dominate absolute general applicability (with c close to 0) -- that the adjective doesn’t always apply?  I guess this is something of a noise factor, more or less. And then there’s another noise factor with the degree of rationality in an individual, lambda.

(3) p.3, Parameters Learning section: c_A is set to range between -1 and 0. Given the interpretations of c we just got on p.2, does this mean Q&F are assuming that the adjectives they investigate (big, dark, tall, full) are generally inapplicable (and so have a higher theta baseline to apply), since c can only be negative if it’s non-zero? It doesn’t seem unreasonable, but if so, this is an assumption they build into the learner. Why not allow it to range from -1 to 1, and allow the learner to assume positive c values are a possibility?

(4) p.6, Conclusion: “Combining the idea of pragmatic reasoning as social cognition…” — Since they’re just looking at individual production in their model (and individual judgments in their experiment), where is the social cognition component? Is it in how the baseline theta is assessed? Something else?

(5) p.6, Conclusion: “…we advanced the hypothesis that the use of gradable adjectives is driven by optimality of descriptive language use.” — What does this mean exactly? How does it contrast with optimal contextual categorization and referential language use? This is definitely a spot where I wish they had had more space to explain, since this seems to get at the issue of how we interpret the results here.

Wednesday, October 29, 2014

Next time on 11/19/14 @ 10:30am in SBSG 2221 = Qing & Franke 2014

Thanks to everyone who was able to join us for our incisive and informative discussion of Barak et al. 2014!  For our next CoLa reading group meeting on Wednesday November 19 at 10:30am in SBSG 2221, we'll be looking at an article that investigates the acquisition of gradable adjectives like "tall", using a Bayesian approach that incorporates pragmatic reasoning.

Qing, C. & Franke, M. 2014. Meaning and Use of Gradable Adjectives: Formal Modeling Meets Empirical Data. Proceedings of the Cognitive Science Society.

See you then!

Monday, October 27, 2014

Some thoughts on Barak et al. 2014

One of the things I really liked about this paper was the additional "verb class" layer, which is of course what allows similarities between verbs to be identified, based on their syntactic structure distributions. This seems like an obvious thing, but I don't think I've seen too many incremental models that actually have hierarchy in them (in contrast to ideal learner models operating in batch mode, which often have hierarchical levels in them). So that was great to see. Relatedly, the use of syntactic distributions from other verbs too (not just mental state verbs and communication/perception verbs) feels very much like indirect positive evidence (Pearl & Mis 2014 terminology), where something present in the input is informative, even if it's not specifically about the thing you're trying to learn. And that's also nice to see more explicit examples of. Here, this indirect positive evidence provides a nice means to generalize from communication/perception verbs to mental state verbs.

I also liked the attention spent on the perceptual coding problem (I'm using Lidz & Gagliardi 2014 terminology now) as it relates to mental state verbs, since it definitely seems true that mental state concepts/semantic primitives are going to be harder to extract from the non-linguistic environment, as compared to communication events or perception events.

More specific comments:

(1) Overview of the Model, "The model also includes a component that simulates the difficulty of children attending to the mental content...also simulates this developing attention to mental content as an increasing ability to correctly interpret a scene paired with an SC utterance as having mental semantic properties." -- Did I miss where it was explained how this was instantiated? This seems like exactly the right thing to do, since semantic feature extraction should be noisy early on and get better over time. But how did this get implemented? (Maybe it was in the Barak et al. 2012 reference?)

(2) Learning Constructions of Verb Usages, "...prior probability of cluster P(k) is estimated as the proportion of frames that are in k out of all observed input frames, thus assigning a higher prior to larger clusters representing more frequent constructions." -- This reminds me of adaptor grammars, where both type frequency and token frequency have roles to play (except, if I understand this implementation correctly, it's only token frequency that matters for the constructions, and it's only at the verb class level that type frequency matters, where type = verb).

(3) Learning Verb Classes, "...creation of a new class for a given verb distribution if the distribution is not sufficiently similar to any of those represented by the existing verb classes.", and the new class is a uniform distribution over all constructions. This seems like a sensible way to get at the same thing generative models do by having some small amount of probability assigned to creating a new class. I wonder if there are other ways to implement it, though. Maybe something more probabilistic where, after calculating the probabilities of it being in each existing verb class and the new uniform distribution one, the verb is assigned to a class based on that probability distribution. (Basically, something that doesn't use the argmax, but instead samples.)

(4) Generation of Input Corpora, "...frequencies are extracted from a manual annotation of a sample of 100 child-directed utterances per verb" -- I understand manual annotation is a pain, but it does seem like this isn't all that many per verb. Though I suppose if there are only 4 frames they're looking at, it's not all that bad.  That being said, the range of syntactic frames is surely much more than that, so if they were looking at the full range, it seems like they'd want to have more than 100 samples per verb.

(5) Set-up of Simulations: "...we train our model on a randomly generated input corpus of 10,000 input frames" -- I'd be curious about how this amount of input maps onto the amount of input children normally get to learn these mental state verbs. It actually isn't all that much input. But maybe it doesn't matter for the model, which settles down pretty quickly to its final classifications?

(6) Estimating Event Type Likelihoods: "...each verb entry in our lexicon is represented as a collection of features, including a set of event primitives...think is {state, cogitate, belief, communicate}" -- I'm very curious as to how these are derived, as some of them seem very odd for a child's representation of the semantic content available. (Perhaps automatically derived from existing electronic resources for adult English? And if so, is there a more realistic way to instantiate this representation?)

(7) Experimental Results: "...even for Desire verbs, there is still an initial stage where they are produced mostly in non-mental meaning." -- I wish B&al had had space for an example of this, because I had an imagination fail about what that would be. "I want" used in a non-mental meaning? What is that for "want"?
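Two of the points above, the frequency-based cluster prior in (2) and the sampling alternative to argmax in (3), can be sketched in a few lines. This is a toy illustration under my own assumptions, not B&al's implementation: the function names, the cluster and class labels, and the scores are all hypothetical.

```python
import random
from collections import Counter


def cluster_priors(frame_assignments):
    """P(k) = proportion of observed frames assigned to cluster k.

    frame_assignments -- one cluster label per observed input frame (tokens),
                         so larger clusters get a higher prior.
    """
    counts = Counter(frame_assignments)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}


def choose_class(scores, sample=True, rng=None):
    """Pick a verb class given non-negative scores for each candidate.

    scores -- dict mapping class label (including a 'NEW' class) to a score
    sample -- if True, sample in proportion to the scores;
              if False, take the argmax
    """
    if not sample:
        return max(scores, key=scores.get)
    rng = rng or random.Random()
    labels = list(scores)
    weights = [scores[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


priors = cluster_priors(["k1", "k1", "k1", "k2"])
scores = {"class_A": 0.6, "class_B": 0.3, "NEW": 0.1}  # hypothetical scores
best = choose_class(scores, sample=False)
```

With argmax, the low-scoring new class is never chosen here. With sampling, it is chosen occasionally, which is closer to what generative models do when they reserve a small amount of probability for creating a new class.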

Lidz, J. & Gagliardi, A. 2014 to appear. How Nature Meets Nurture: Universal Grammar and Statistical Learning. Annual Review of Linguistics.

Pearl & Mis 2014. The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Manuscript, UCI. [lingbuzz:]

Wednesday, October 15, 2014

Next time on 10/29/14 @ 10:30am in SBSG 2221 = Barak et al. 2014

Thanks to everyone who was able to join us for our invigorating discussion of Lidz & Gagliardi 2014!  For our next meeting on Wednesday October 29 at 10:30am in SBSG 2221, we'll be looking at an article that investigates the acquisition of a particular subset of lexical items, known as mental state verbs (like "want", "wish", "think", "know"). This computational modeling study focuses on different syntactic information that children could be leveraging.

Barak, L., Fazly, A., & Stevenson, S. 2014. Gradual Acquisition of Mental State Meaning: A Computational Investigation. Proceedings of the Cognitive Science Society.

See you then!

Monday, October 13, 2014

Some thoughts on Lidz & Gagliardi 2014

My Bayesian-inclined brain really had a fun time trying to translate everything in this acquisition model into Bayesian terms, and I think it actually lends itself quite well to this -- model specification, model variables, inference, likelihood, etc. I'm almost wondering if it's worth doing this explicitly in another paper for this model (maybe for a different target audience, like a general cognitive sciences crowd). I think it'd make it easier to understand the nuances L&G highlight, since these nuances track so well with different aspects of Bayesian modeling. (More on this below.)

That being said, it took me a bit to wrap my head around the distinction between perceptual and acquisitional intake, probably because of that mapping I kept trying to do to the Bayesian terminology. I think in the end I sorted out exactly what each meant, but this is worth talking about more since they do (clearly) mean different things. What I ended up with: perceptual intake is what can be reliably extracted from the input, while acquisitional intake is the subset relevant for the model variables (and of course the model/hypothesis space that defines those variables needs to already be specified).

Related to this: It definitely seems like prior knowledge is involved to implement both intake types, but the nature of that prior knowledge is up for grabs. For example, if a learner is biased to weight cues differently for the acquisitional intake, does that come from prior experience about the reliability of these cues for forming generalizations, or is it specified in something like Universal Grammar, irrespective of how useful these cues have been previously? Either seems possible. To differentiate them, I guess you'd want to do what L&G are doing here, where you try to find situations where the information use doesn't map to the information reliability, since that's something that wouldn't be expected from derived prior knowledge. (Of course, then you have to have a very good idea about what exactly the child's prior experience was like, so that you could tell what they perceived the information reliability to be.)

One other general comment: I loved how careful L&G were to highlight when empirical evidence doesn't distinguish between theoretical viewpoints. So helpful. It really underscores why these theoretical viewpoints have persisted in the face of all the empirical data we now have available.

More specific comments:

(1) The mapping to Bayesian terms that I was able to make:
-- Universal Grammar = hypothesis space/model specification
(a) Abstract: "Universal Grammar provides representations that support deductions that fall outside of experience...these representations define the evidence the learners use..." -- Which makes sense, because if the model is specified, the relevant data are also specified (anything that impacts the model variables is relevant).
(b) p.6, "The UG component identifies the class of representations that shape the nature of human grammatical systems".

-- Perceptual Intake = parts of the input that could impact model variables
p.10, "contain[s]...information relevant to making inferences"

-- Acquisitional Intake = parts of the input that do impact model variables

-- Inference engine = likelihood?
(a) p.10, "...makes predictions about what the learner should expect to find in the environment"...presumably, given a particular hypothesis. So, this is basically a set of likelihoods (P(D | H)) for all the Hs in the hypothesis space (defined by UG, for example).
(b) p.21, "...the inference engine, which selects specified features of that representation (the acquisitional intake) to derive conclusions about grammatical representations". This makes it sound like the inference engine is the one selecting the model variables, which doesn't sound like likelihood at all. Unless inference is over the model variables, which are already defined for each H.

-- Updated Grammar, deductive consequences = posterior over hypotheses
p.30, "...inferential, using distributional evidence to license conclusions about the abstract representations underlying language"
Even though L&G distinguish between inferential and deductive aspects, I feel like they're still talking about the hypothesis space. The inferential part is selecting the hypothesis (using the posterior) and the deductive consequences part is all the model variables that are connected to that hypothesis.

(2) The difference about inference: p.4, "On the input-driven view, abstract linguistic representations are arrived at by a process of generalization across specific cases...", and this is interpreted as "not inference" (in contrast to the knowledge-driven tradition). But a process of "generalization across specific cases" certainly sounds a lot like inference, because something has to determine exactly how that generalization is constrained (even if it's non-linguistic constraints like economy or something). So I'm not sure it's fair to say the input-driven approach doesn't use inference, per se. Instead, it sounds like the distinction L&G want is about how that inference is constrained (input-driven: non-linguistic constraints;  knowledge-driven: linguistic hypothesis space).

(3) Similarly, I also feel it's not quite fair to divide the world into "nothing like the input" (knowledge-driven) vs. "just like the input, only compressed" (input-driven) (p.5). Instead, it seems like this is more of a continuum, and some representations can be "not obviously" like the input, and yet still be derived from it. The key is knowing exactly what the derivation process is -- for example, for the knowledge-driven approach, the representations could be viewed as similar to the input at an abstract level, even if the surface representation looks very different.

(4) p.6, "...the statistical sensitivities of the learner are sometimes distinct from ideal-observer measures of informativity...reveal the role learners play in selecting relevant input to drive learning."  So if the learner has additional constraints (say, on how the perceptual intake is implemented), could these be incorporated into the learner assumptions that would make up an ideal learner model? That is, if we're not talking about constraints that are based on cognitive resources but are instead talking about learner biases, couldn't we build an ideal-observer model that has those biases? (Or maybe the point is that perceptual intake only comes from constraints on cognitive resources?)

(5) p.8, "must come from a projection beyond their experience". I think we have to be really careful about claiming this -- maybe "direct experience" is better, since even things you derive are based on some kind of experience, unless you assume everything about them is innate. But the basic point is that some previously-learned or innately-known stuff may matter for how the current direct experience is utilized.

(6) p.9, (referring to distribution of pronouns & interpretations), "...we are aware of no proposals outside the knowledge-driven tradition". Hello, modeling call! (Whether for the knowledge-driven theory, or other theories.)

(7) p.9, "...most work in generative linguistics has been the specification of these representations". I think some of the ire this approach has inspired from the non-generative community could be mitigated by considering which of these representations could be derived (and importantly, from what). It seems like not as many generative researchers (particularly ones who don't work a lot on acquisition) think about the origin of these representations. But since some of them can get quite complex, it rubs some people the wrong way to call them all innate. But really, it doesn't have to be that way -- some might be innate, true, but some of these specifications might be built up from other simpler innate components and/or derived from prior experience.

(8) p.15, "...predicted that the age of acquisition of a grammar with tense is a function of the degree to which the input unambiguously supports that kind of grammar..." And this highlights the importance of what counts as unambiguous data (which is basically data where likelihood p(D | H) is 0 for all but the correct H). And this clearly depends on the model variables involved in all the different Hs (which should be the same??).

(9) p.25, "...preference for using phonological information over semantic information likely reflects perceptual intake in the initial stages of noun class learning". So this could easily be a derived bias, but I would think we would still call it "knowledge-driven" -- it's just that it's derived knowledge, rather than innate knowledge that caused it.

(10) Section 6, Kannada empirical facts -- So interesting! Every time I see this, I always have a quiet moment of goggling. It seems like such an interesting challenge to figure out what these facts could be learned from. Something about binding? Something about goal-prominence? I feel like the top of p.35 has a parameter-style proposal linking possession constructions and these ditransitive facts, which would then be model variables. The Viau & Lidz 2011 proposal that cares about what kind of NPs are in different places also seems like another model variable. Of course, these are very specific pieces of knowledge about model variables...but still, will this actually work (like, can we implement a model that uses these variables and run it)? And if it does, can the more specific model variables be derived from other aspects of the input, or do you really have to know about those specific model variables?

(11) Future Issues, p.47: Yes.  All of these. Because modeling. (Especially 5, but really, all of them.)
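To make the Bayesian mapping in (1) and the notion of unambiguous data in (8) a bit more concrete, here is a toy sketch. It is my own illustration, not anything L&G propose: the two hypotheses, the data types, and all the probabilities are hypothetical, and the function names are made up.

```python
def posterior(priors, likelihoods, data):
    """Posterior over hypotheses: P(H | D) is proportional to P(H) * P(D | H).

    priors      -- dict mapping hypothesis to P(H)
    likelihoods -- dict mapping hypothesis to {datum type: P(datum | H)}
    data        -- list of observed datum types (treated as independent)
    """
    unnormalized = dict(priors)
    for datum in data:
        for h in unnormalized:
            unnormalized[h] *= likelihoods[h].get(datum, 0.0)
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the data are impossible under every hypothesis")
    return {h: value / total for h, value in unnormalized.items()}


def is_unambiguous(datum, likelihoods, target):
    """True if the datum has non-zero likelihood only under the target H."""
    others_zero = all(
        likelihoods[h].get(datum, 0.0) == 0.0 for h in likelihoods if h != target
    )
    return likelihoods[target].get(datum, 0.0) > 0.0 and others_zero


# Hypothetical hypothesis space: a grammar with tense vs. one without.
likelihoods = {
    "tense": {"inflected_verb": 0.4, "bare_verb": 0.6},
    "no_tense": {"bare_verb": 1.0},
}
priors = {"tense": 0.5, "no_tense": 0.5}
after_bare = posterior(priors, likelihoods, ["bare_verb"])
after_inflected = posterior(priors, likelihoods, ["bare_verb", "inflected_verb"])
```

In this toy setup, the hypothesis space plays the role of UG, the likelihood table plays the role of the inference engine's predictions, and the returned distribution is the updated grammar. A single unambiguous datum is enough to rule out the competing hypothesis, while ambiguous data only shift the posterior.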

Friday, October 3, 2014

Next time on 10/15/14 @ 10:30am in SBSG 2221 = Lidz & Gagliardi 2014

Hi everyone,

It looks like a good collective time to meet will be Wednesdays at 10:30am for this quarter, so that's what we'll plan on.  Our first meeting will be on October 15, and our complete schedule is available on the webpage at 

On October 15, we'll be looking at a review article that discusses a particular learning model drawing on language-specific and domain-general knowledge to explain the process of acquisition. For modelers, it's especially useful to consider the specific implementations proposed, as these are theoretically and empirically motivated learning strategies that we can investigate via computational modeling.

Lidz, J. & Gagliardi, A. 2014 to appear. How Nature Meets Nurture: Universal Grammar and Statistical Learning. Annual Review of Linguistics.

See you on October 15!

Monday, June 2, 2014

Thanks and see you in the fall!

Thanks to everyone who was able to join us for our delightful discussion of Ramscar et al. 2013, and to everyone who's joined us this past academic year!

The CoLa Reading Group will be on hiatus this summer, and we'll resume again in the fall quarter.  As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Friday, May 30, 2014

Some thoughts about Ramscar et al. 2013

One of the things I really liked about this paper was that it implements a computational model that makes predictions, and then tests those predictions experimentally. It's becoming more of a trend to do both within a single paper, but often it's too involved to describe both parts, and so they end up in separate papers. Fortunately, here we see something concise enough to fit both in, and that's a lovely thing.

I also really liked that R&al investigate the logical problem of language acquisition (LPLA) by targeting one specific instance of that problem that's been held up (or used to be held up as recently as ten years ago) as an easily understood example of the LPLA. I'm definitely sympathetic to R&al's conclusions, but I don't think I believe the implication that this debunks the LPLA. I do believe it's a way to solve it for this particular instantiation, but the LPLA is about induction problems in general -- not just this one, not just subset problems, but all kinds of induction problems. And I do think that induction problems abound in language acquisition.

It was interesting to me how R&al talked about positive and negative evidence -- it almost seemed like they conflated two dimensions that are distinct: positive (something present) vs. negative (something absent), and direct (about that data point) vs. indirect (about related data points). For example, they equate positive evidence with "the reinforcement of successful predictions", but to me, that could be a successful prediction about what's supposed to be there (direct positive evidence) or a successful prediction about what's not supposed to be there (indirect negative evidence). Similarly, prediction error is equated with negative evidence, but a prediction error could be about predicting something should be there but it actually isn't (indirect negative evidence) or about predicting something shouldn't be there but it actually is (direct positive evidence -- and in particular, counterexamples).  However, I do agree with their point that indirect negative evidence is a reasonable thing for children to be using, because of children's prediction ability.

Another curious thing for me was that the particular learning story R&al implement forces them to commit to what children's semantic hypothesis space is for a word (since it hinges on selecting the appropriate semantic hypothesis for the word as well as the appropriate morphological form, and using that to make predictions). This seemed problematic, because the semantic hypothesis space is potentially vast, particularly if we're talking about what semantic features are associated with a word. And maybe the point is their story should work no matter what the semantic hypothesis space is, but that wasn't obviously true to me.

As an alternative, it seemed to me that the same general approach could be taken without having to make that semantic hypothesis space commitment. In particular, suppose the child is merely tracking the morphological forms, and recognizes the +s regular pattern from other plural forms. This causes them to apply this rule to "mouse" too. Children's behavior indicates there's a point where they use both "mice" and "mouses", so this is a morphological hypothesis that allows both forms (H_both). The correct hypothesis only allows "mice" (H_mice), so it's a subset-superset relationship of the hypotheses (H_mice is a subset of H_both). Using Bayesian inference (and the accompanying Size Principle) should produce the same results we see computationally (the learner converges on the H_mice hypothesis over time). It seems like it should also be capable of matching the experimental results: early on, examples of the regular rule indirectly boost the H_both hypothesis more, but later on when children have seen enough suspicious coincidences of "mice" input only, the indirect boost to H_both matters less because H_mice is much more probable.

So then, I think the only reason to add on this semantic hypothesis space the way R&al's approach does is if you believe the learning story is necessarily semantic, and therefore must depend on the semantic features.
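The Bayesian alternative sketched above (H_mice vs. H_both with the Size Principle) can be made concrete with a toy calculation. This is only a sketch under my own assumptions: the learner hears only "mice" for the plural, data are sampled uniformly from whichever hypothesis is true (strong sampling, which is what yields the Size Principle), and the prior favoring H_both is a made-up stand-in for the indirect boost from the regular +s pattern.

```python
def size_principle_posterior(data, hypotheses, priors):
    """Posterior over hypotheses, with likelihood (1 / |H|) per consistent datum.

    data       -- list of observed plural forms
    hypotheses -- dict mapping hypothesis name to the set of forms it allows
    priors     -- dict mapping hypothesis name to its prior probability
    """
    unnormalized = {}
    for name, allowed in hypotheses.items():
        if all(form in allowed for form in data):
            likelihood = (1.0 / len(allowed)) ** len(data)
        else:
            likelihood = 0.0  # hypothesis cannot generate some observed form
        unnormalized[name] = priors[name] * likelihood
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no hypothesis is consistent with the data")
    return {name: value / total for name, value in unnormalized.items()}


hypotheses = {"H_mice": {"mice"}, "H_both": {"mice", "mouses"}}
priors = {"H_mice": 0.1, "H_both": 0.9}  # hypothetical prior favoring H_both

# Probability of H_mice after hearing 0, 1, ..., 10 tokens of "mice".
trajectory = [
    size_principle_posterior(["mice"] * n, hypotheses, priors)["H_mice"]
    for n in range(11)
]
```

Early on, H_both dominates because of its prior, which corresponds to the stage where both "mice" and "mouses" are produced. Each additional "mice" token halves the relative support for H_both, so the suspicious coincidence of only ever hearing "mice" eventually makes H_mice far more probable, with no semantic hypothesis space involved.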

Some more specific thoughts:

(1) The U-shaped curve of development: R&al talk about the U-shaped curve of development in a way that seemed odd to me. In particular, in section 6 (p.767), they call the fact that "children who have been observed to produce mice in one context may still frequently produce overregularized forms such as mouses in another" a U-shaped trajectory. But this seems to me to just be one piece of the trajectory (the valley of the U, rather than the overall trajectory).

(2) The semantic cues issue comes back in an odd way in section 6.7, where R&al say that the "error rate of unreliable cues" will "help young speakers discriminate the appropriate semantic cues to irregulars" (p.776). What semantic cues would these be? (Aren't the semantics of "mouses" and "mice" the same? The difference is morphological, rather than semantic.)

(3) R&al promote the idea that a useful thing computational approaches to learning do is "discover structure in the data" rather than trying to "second-guess the structure of those data in advance" (section 7.4, p.782). That seems like a fine idea, but I don't think it's actually what they did in this particular computational model. In particular, didn't they predefine the hypothesis space of semantic cues? So yes, structure was discovered, but it was discovered in a hypothesis space that had already been constrained (and this is the main point of modern linguistic nativists, I think -- you need a well-defined hypothesis space to get the right generalizations out).

Monday, May 19, 2014

Next time on 6/2/14 @ 3:00pm in SBSG 2221 = Ramscar et al. 2013

Thanks to everyone who was able to join us for our delightful discussion of Kol et al. 2014! We had some really thoughtful commentary on model evaluation. Next time on Jun 2 @ 3:00pm in SBSG 2221, we'll be looking at an article that discusses how children recover from errors during learning, and how this relates to induction problems in language acquisition.

Ramscar, M., Dye, M., & McCauley, S. 2013. Error and expectation in language learning: The curious absence of mouses in adult speech. Language, 89(4), 760-793.

See you then!

Friday, May 16, 2014

Some thoughts on Kol et al. 2014

I completely love that this paper is highlighting the strength of computational models for precisely evaluating theories about language learning strategies (which is an issue near and dear to my heart). As K&al2014 so clearly note, a computational model forces you to implement all the necessary pieces of your theory and can show you where parts are underspecified. And then, when K&al2014 demonstrate the issues with the TBM, they can identify what parts seem to be causing the problem and where the theory needs to include additional information/constraints.

On a related note, I love that K&al2014 are worrying about how to evaluate model output — again, an issue I’ve been thinking about a lot lately.  They end up doing something like a bigger picture version of recall and precision — we don’t just want the model to generate all the true utterances (high recall). We want it to also not generate the bad utterances (high precision). And they demonstrate quite clearly that the TBM’s generative power is great…so great that it generates the bad utterances, too (and so has low precision from this perspective). Which is not so good after all.
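Here's a minimal sketch of that recall/precision idea over sets of utterances. The toy utterances are invented for illustration and the function name is my own; this is not K&al2014's actual evaluation procedure.

```python
def precision_recall(generated, attested):
    """Precision and recall of a model's generated utterances against the
    utterances actually attested in the data (both treated as sets)."""
    generated, attested = set(generated), set(attested)
    hits = generated & attested
    precision = len(hits) / len(generated) if generated else 0.0
    recall = len(hits) / len(attested) if attested else 0.0
    return precision, recall

# Invented toy data: the model generates every attested utterance,
# plus three bad ones.
attested = {"want juice", "more juice", "want cookie"}
generated = attested | {"juice want", "cookie more juice", "want want"}

precision, recall = precision_recall(generated, attested)
print(precision, recall)  # prints 0.5 1.0
```

Perfect recall with mediocre precision is exactly the overgeneration profile described above.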

But what was even more interesting to me was their mention of measures like perplexity to test the “quality of the grammars” learned, with the idea that good quality grammars make the real data less perplexing. Though they didn’t do it here, I wonder if there’s a reasonable way to do that for the learning strategy they talk about here — it’s not a grammar exactly, but it’s definitely a collection of units and operations that can be used to generate an output. So, as long as you have a generative model for how to produce a sequence of words, it seems like you could use a perplexity measure to compare this particular collection of units and operations against something like a context-free grammar (or even just various versions of the TBM learning strategy).
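As a rough sketch of what that comparison could look like: perplexity only needs the probability each model assigns to each token of held-out data, so any generative model can be plugged in. The two lists of probabilities below are made-up numbers standing in for two hypothetical models, not output from the TBM or from any real grammar.

```python
import math

def perplexity(token_probs):
    """Perplexity of a model on a sequence, given the probability the model
    assigned to each token: exp of the average negative log probability."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical per-token probabilities for the same four-word utterance
# under two different generative models.
model_a_probs = [0.5, 0.25, 0.5, 0.25]
model_b_probs = [0.1, 0.1, 0.2, 0.1]

print(perplexity(model_a_probs), perplexity(model_b_probs))
```

The model with the lower perplexity finds the real data less surprising, which is the sense of "quality" at issue here.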

Some more targeted thoughts:

(1) K&al2014 make a point in the introduction that simulations that “specifically implement definitions provided by cognitive models of language acquisition are rare”.  I found this a very odd thing to say — isn’t every model an implementation of some theory of a language strategy? Maybe the point is more that we have a lot of cognitive theories that don’t yet have computational simulations.

(2) There’s a certain level of arbitrariness that K&al2014 note for things like how many matching utterances have to occur for frames to be established (e.g., if it occurs twice, it’s established). Similarly, the preference for choosing consecutive matches over non-consecutive matches is more important than choosing more frequent matches. It’s not clear there are principled reasons for this ordering (at least, not from the description here, and in fact, I don’t think the consecutive preference is implemented in the model K&al2014 put together later on). So, in some sense, these are sort of free parameters in the cognitive theory.

(3) Something that struck me about having high recall on the child-produced utterances with the TBM model — K&al2014 find that the TBM approach can account for a large majority of the utterances (in the high 80s and sometimes 90s). But what about the rest of them (i.e., those 10 or 20% that aren’t so easily reconstructable)? Is it just a sampling issue (and so having denser data would show that you could construct these utterances too)? Or is it more what the linguistic camp tends to assume, where there are knowledge pieces that aren’t a direct/transparent translation of the input? In general, this reminds me of what different theoretical perspectives focus their efforts on — the usage-based camp (and often the NLP camp for computational linguistics) is interested in what accounts for most of everything out there (which can maybe be thought of as the “easy” stuff), while the UG-based camp is interested in accounting for the “hard” stuff (even though that may be a much smaller part of the data).

Monday, May 5, 2014

Next time on 5/19/14 @ 3:00pm in SBSG 2221 = Kol et al. 2014

Thanks to everyone who was able to join us for our thorough discussion of Orita et al. 2013! We had some really excellent ideas for how to extend the model to connect with children's interpretations of utterances. Next time on May 19 @ 3:00pm in SBSG 2221,  we'll be looking at an article that discusses how to evaluate formal models of acquisition, focusing on a particular model of early language acquisition as a case study:

Kol, S., Nir, B., & Wintner, S. 2014. Computational evaluation of the Traceback Method. Journal of Child Language, 41(1), 176-199.

See you then!

Friday, May 2, 2014

Some thoughts on Orita et al. 2013

There are several aspects of this paper that I really enjoyed. First, I definitely appreciate the clean and clear description of the circularity in this learning task, where you can learn about the syntax if you know the referents…and you can learn about the referents if you know the syntax (chicken and egg, check). 

I also love how hard the authors strive to ground their computational model in empirical data. Now granted, the human simulation paradigm may have its own issues (more on this below), but it’s a great way to try to get at least some approximation of the contextual knowledge children might have access to. 

I also really liked the demonstration of the utility of discourse/non-linguistic context information vs. strong syntactic prior knowledge — and how having the super-strong syntax knowledge isn’t enough. This is something that’s a really important point, I think: It’s all well and good to posit detailed, innate, linguistic knowledge as a necessary component for solving an acquisition problem, but it’s important to make sure that this component actually does solve the learning problem (and be aware of what else it might need in order to do so). This paper provides an excellent demonstration of why we need to check this…because in this case, that super-strong syntactic knowledge didn’t actually work on its own. (Side note: The authors are very aware that their model still relies on some less-strong syntactic knowledge, like the relevance of syntactic locality and c-command, but the super-strong syntactic knowledge was on top of that less-strong knowledge.)

More specific thoughts:

(1) The human simulation paradigm (HSP): 
In some sense, this task strikes me as similar to ideal learner computational models — we want to see what information is useful in the available input. For the HSP, we do this by seeing what a learner with adult-level cognitive resources can extract. For ideal learners, we do this by seeing what inferences a learner with unlimited computational resources can make, based on the information available. 

On the other hand, there’s definitely a sense in which the HSP is not really an ideal learner parallel. First, adult-level processing resources are not the same as unlimited processing resources (they’re just better than child-level processing resources). Second, the issue with adults is that they have a bunch of knowledge to build on about how to extract information from both linguistic and non-linguistic context…and that puts constraints on how they process the available information that children might not have. In effect, the adults may have biases that cause them to perceive the information differently, and this may actually be sub-optimal when compared to children (we don’t really know for sure…but it’s definitely different than children).

Something which is specific to this particular HSP task is that the stated goal is to “determine whether conversational context provides sufficient information for adults” to guess the intended referent.  But where does the knowledge about how to use the conversational context to interpret the blanked out NP (as either reflexive, non-reflexive, or lexical) come from? Presumably from adults’ prior experience with how these NPs are typically used. This isn’t something we think children would have access to, though, right? So this is a very specific case of that second issue above, where it’s not clear that the information adults extract is a fair representation of the information children extract, due to prior knowledge that adults have about the language.

Now to be fair, the authors are very aware of this (they have a nice discussion about it in the Experiment 1 discussion section), so again, this is about trying to get some kind of empirical estimate to base their computational model’s priors on. And maybe in the future we can come up with a better way to get this information.  For example, it occurs to me that the non-linguistic context (i.e., environment, visual scene info) might be usable. If the caretaker has just bumped her knee, saying “Oops, I hurt myself” is more likely than “Oops, I hurt you”. It may be that the conversational context approximated this to some extent for adults, but I wonder if this kind of thing could be extracted from the video samples we have on CHILDES. What you’d want to do is do a variant of the HSP where you show the video clip with the NP beeped out, so the non-linguistic context is available, along with the discourse information in the preceding and subsequent utterances.

(2) Figure 2: Though I’m fairly familiar with Bayesian models by now, I admit that I loved having text next to each level reminding me what each variable corresponded to. Yay, authors.

(3) General discussion point at the end about unambiguous data: This is a really excellent point, since we don’t like to have to rely on the presence of unambiguous data too much in real life (because typically when we go look for it in realistic input, it’s only very rarely there). Something I’d be interested in is how often unambiguous data for this pronoun categorization issue does actually occur. If it’s never (or almost never, relatively speaking), then this becomes a very nice selling point for this learning model.

Monday, April 14, 2014

Next time on 5/5/14 @ 2:15pm in SBSG 2221 = Orita et al. 2013

Thanks to everyone who was able to join us for our invigorating discussion of the Han et al. 2013 manuscript! Next time on May 5 @ 2:15pm in SBSG 2221,  we'll be looking at an article that presents a Bayesian learning model for pronoun acquisition, with a special focus on the role of discourse information:

Orita, N., McKeown, R., Feldman, N. H., Lidz, J., & Boyd-Graber, J. 2013. Discovering Pronoun Categories using Discourse Information. Proceedings of the Cognitive Science Society.

See you on May 5!

Thursday, April 10, 2014

Some thoughts on the Han et al. 2013 Manuscript

One of the things I greatly enjoyed about this paper is that it really takes a tricky learning issue seriously: What happens if you don't get any indicative data about a certain hypothesis space (in this case, defined as a set of possible grammars related to verb-raising)? Do humans just remain permanently ambivalent (which is a rational thing to do, and what I think any Bayesian model would do), or do they pick one (somehow)? The super-tricky thing in testing this, of course, is how you find something that humans have no input about and actually ascertain what grammar they picked. If there's no input (i.e., a relevant example in the language) that discerns between the grammar options, how do you tell?

And that's how we find ourselves in some fairly complex syntax and semantics involving quantifiers and negation in Korean, and their relationship to verb-raising. I did find myself somewhat surprised by the (apparent) simplicity of the test sentences (e.g., the equivalent of "The man did not wash every car in front of his house"). Because the sentences are so simple, I'm surprised they wouldn't occur at all in the input with the appropriate disambiguating contexts (i.e., the subset of these sentences that occur in a neg>every-compatible context, like the man washing 2 out of 3 of the cars in the above example). Maybe this is more about their relative sparseness, with the idea that while they may appear, they're so infrequent that they're just not noticeable by a human learner during the lifespan.  But that starts to become a tricky argument when you get to adults -- you might think that adults encountering examples like these over time would eventually learn from them. (You might even argue that this happened between session one and session two for the adults that were tested one month apart: they learned (or solidified learning) from the examples in the first session and held onto that newly gained knowledge for the second session.)

One reason this matters is that there's a big difference between no data and sparse data for a Bayesian model. Nothing can be learned from no data, but something (even if it's only a very slight bias) can be learned from sparse data, assuming the learner pays attention to those data when they occur. For this reason, I'd be interested in some kind of corpus analysis of realistic Korean input of how often these types of negation + object quantifier sentences occur (with the appropriate disambiguating context ideally, but at the very least, how often the sentence structure itself occurs). If they really don't occur at all, this is interesting, since we have the idea described in the paper that humans are picking a hypothesis even in the absence of input (and then we have to investigate why). If these sentences do occur, but only very sparsely, this is still interesting, but speaks more about the frequency thresholds at which learning occurs.
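The no-data vs. sparse-data contrast is easy to see in a toy two-grammar Bayesian learner. This is only a sketch under assumptions I'm making up for illustration: the grammar labels, the even prior, and the per-example likelihoods (3/5 vs. 2/5, i.e., each disambiguating example is only weak evidence) are not from Han et al.

```python
from fractions import Fraction

def posterior_grammar_a(n_examples,
                        likelihood_a=Fraction(3, 5),
                        likelihood_b=Fraction(2, 5),
                        prior_a=Fraction(1, 2)):
    """Posterior probability of hypothetical grammar A after n_examples
    disambiguating examples, each assumed to have probability likelihood_a
    under grammar A and likelihood_b under grammar B."""
    weight_a = prior_a * likelihood_a ** n_examples
    weight_b = (1 - prior_a) * likelihood_b ** n_examples
    return weight_a / (weight_a + weight_b)

# No data leaves the learner at the prior; sparse data nudges it.
for n in (0, 1, 2, 10):
    print(n, float(posterior_grammar_a(n)))
```

With zero examples the learner stays exactly at its prior (the permanently ambivalent outcome), while even one or two examples produce a slight bias that keeps growing as more trickle in.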

Thursday, April 3, 2014

Next time on 4/14/14 @ 2:15pm in SBSG 2221 = Han et al. 2013 Manuscript

Hi everyone,

It looks like a good collective time to meet will be Mondays at 2:15pm for this quarter, so that's what we'll plan on.  Our first meeting will be on April 14, and our complete schedule is available on the webpage at 

On April 14, we'll be looking at an article that investigates how learners generalize in the absence of input that distinguishes between two hypotheses. This is an experimental paper that makes explicit links to the learning process and may provide fruitful data for acquisition modeling work.

Han, C., Lidz, J., & Musolino, J. 2013. Grammar Selection in the Absence of Evidence: Korean Scope and Verb-Raising Revisited. Manuscript, University of Maryland College Park and Rutgers University. Please do not cite without permission from the authors.

See you on April 14!

Friday, March 14, 2014

Thanks and see you in the spring!

Thanks so much to everyone who was able to join us for our thoughtful, spirited discussion today, and to everyone who's joined us throughout the winter quarter! The CoLa Reading Group will resume again in the spring quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly exciting!

Wednesday, March 12, 2014

Some thoughts on Goodman & Stuhlmüller 2013

One thing I really liked about this paper was the use of a formal model of a theory (in this case, a theory of pragmatic implicature cancellation) to predict specific usages that may not have originally been considered by previous researchers working on the theory. In this case, it's a certain combination of theory of mind and implicature cancellation which leads to specific predictions when the listener knows the speaker has partial vs. complete knowledge.

One thing I was curious about was G&S's use of the term "partial implicature". They were specifically using it when, for example, the speaker had knowledge of two of the three apples and said "one apple is red". The listeners inferred that either one or two of the three apples were red, but not all three. If I'm understanding G&S correctly, the partial implicature is that the listeners didn't infer all three apples were red, but did allow for more than one apple to be red. To me, this is the listener thinking, "Ah, the speaker used one instead of two, which means that one of the apples the speaker saw definitely isn't red. But the third one the speaker didn't see might be red, so either one or two apples total are red." If this is true, then it seems to me like the "partial implicature" is really a regular implicature (i.e., one = one or two) over the restricted domain of two apples (the one the speaker saw that was definitely red and the one the speaker didn't see yet). If this is true, I'm not sure this result says something particularly different than the rest of the results (i.e., partial implicature, under this interpretation, doesn't seem different from regular implicature). More specifically, there are cases where the implicature is cancelled, and cases where it isn't, and this happens to be one where it isn't. The magic comes in on the domain restriction the listener imposes, rather than being anything about the implicature itself.

Some other thoughts:

(1) Intro, p.174: I was glad G&S were explicit about who argues for a purely modularized form of implicature, because when I first read that option, my thought was, "Really?  That sounds a bit like a straw man." To me, pragmatics is definitely the aspect of language knowledge that seems most amenable to incorporating non-linguistic knowledge because it's about how we use language to communicate. From what I took of G&S's summary of the strongly modular theories, I would assume that those theories haven't yet looked at incorporating this issue of incomplete knowledge on the part of the speaker?

(2) Experiment 1, p.178: I was trying to work this through for the "one" case, with "some" being used. If Laura looks at one of three letters, and then says "Some of the letters have checks inside", my intuition about what this means becomes a little wonky. For me, "some" means more than one, but of course, Laura can't know about more than one of the letters. So how do I interpret what she's saying here? My first inclination is to think she's got some kind of magic knowledge about the other letters and so knows that at least one of the other two has a check, and I wonder if some of the respondents indicated this when G&S checked for how much knowledge the listeners believed Laura to have.  For the ones left over who believed Laura only knew about one, we can see in Figure 2 that they basically wibbled between interpreting some as two or three.

(3) Figure 2 results for exact number words, p.182: When we look at the "one" row (bottom of (B) and (D)), it seems like the model prefers "one" to mean two when only one object is known (1st panel of B).  However, people prefer "one" to mean three (1st panel of D).  Notably, if we move up one row to the "two" row (middle of (B) and (D)), it seems like here the model happily prefers "two" to mean three (1st panel of B), as do people (1st panel of D). I'm not sure how to interpret this exactly -- it certainly seems like these are different preferences qualitatively at the very least, as one matches human preferences (interpreting "two") and the other doesn't (interpreting "one"). I'm also not sure why the model mechanics would yield different predictions in these two cases. Is it something about the number word meaning "one more" somehow? If so, that would explain why the model prefers "one" to mean two, and prefers "two" to mean three. However, it's not immediately obvious to me why this would fall out of the model mechanics.

(4) Conclusion, p.183: As a follow up from our discussion last time, it seems like there's some evidence from the pragmatic inference modeling literature that there's a boundary of two on the recursion depth for theory of mind: "...we assume only one such level of reasoning...The quantitative fits we have shown suggest that limited recursion and optimization are psychologically realistic assumptions." This provides another bit of evidence that the center-embedding limit for structural recursion probably isn't specific to syntax. (Though then it becomes interesting why other kinds of recursion in syntax don't seem to have this same limit, e.g., right-branching: "This is the dog who chased the cat who ate the rat who stole the cheese." We clearly can process these three embeddings without a problem.)

Friday, February 28, 2014

Next time on 3/14/14 @ 3:20pm in SBSG 2221 = Goodman & Stuhlmüller 2013

Thanks to everyone who was able to join us for our exciting discussion of Levinson 2013 & the Legate et al. 2013 reply manuscript! We had some particularly interesting thoughts about potential recursion in non-syntactic domains, such as theory of mind, which relates to our article for Friday March 14 at 3:20pm in SBSG 2221: Goodman & Stuhlmüller 2013. In particular, G&S2013 discuss a rational model of how speakers interpret utterances, where listeners model the speaker's model of selecting utterances:

Goodman, N. & Stuhlmüller, A. 2013. Knowledge and Implicature: Modeling Language Understanding as Social Cognition. Topics in Cognitive Science, 5, 173-184.

See you then!

Wednesday, February 26, 2014

Some thoughts on Levinson 2013 + Legate, Pesetsky, & Yang 2013 reply (manuscript)

One of the take-away points I had from Levinson 2013 [L13] was the idea that center-embedding is not a structural option specific to syntax, since there are examples of this same structural option in dialogue. I had the impression then that L13 wanted to use this to mean this particular type of recursion is not language-specific, as dialogue is using language to communicate information and it's the information communicated (via the speech acts) that's center-embedded. (At least, that's how I'm interpreting "speech acts" as "actions in linguistic clothing".) I'm not quite sure I believe that, since I would classify speech acts as a type of linguistic knowledge (specifically, how to translate intention into the specific linguistic form required to convey that intention). But suppose we classify this kind of knowledge as not really linguistic, per se -- then wouldn't the interesting question be about how unique this type of structural option is to human communication systems, since that relates to questions about the Faculty of Language (broad or narrow)? And presumably, this then links back to whether non-human animals can learn these syntactic structures (doesn't seem to be true as far as we know) or these types of embedded interactions (also doesn't seem to be true, I think?).

As a general caveat, I should note that while I followed the simpler examples of center embedding in dialogue, I was much less clear about the more complex examples that involved multiple center-embeddings and cross-serial dependencies (for example, deciding that something was an embedded question rather than a serial follow-up, like in example (14), the middle of (16), some of the embeddings in (17)). This may be due to my very light background in pragmatics and dialogue analysis, however. Still, it seemed that Legate, Pesetsky, & Yang 2013 [LPY13] had similar reservations about some of these dialogue dependencies.

LPY13 also had very strong reactions to both the syntactic and computational claims in L13, in addition to these issues about how to assign structure to discourse. I was quite sympathetic to (and convinced by) LPY13's syntactic and computational objections as a whole, from the cross-linguistic frequency of embedding to the non-centrality of center embedding for recursion to the not-debate about whether natural languages were regular. They also brought out a very interesting point about the restrictions on center embedding in speech acts (example (13)), which seemed to match some of the restrictions observed in syntax.  If it's true that these restrictions are there, and we see them in both linguistic (syntax) and potentially non-linguistic (speech act) areas, then maybe this is nice evidence for a domain-general restriction on processing this kind of structure. (And so maybe we should be looking for it seriously elsewhere too.)

More specific comments:

L13: There's a comment in section 4 about whether it's more complex to treat English as a large system of simple rules or a small system of complex rules. Isn't this exactly the kind of thing that rational inference gets at (e.g., Perfors, Tenenbaum, & Regier 2011 find that the context-free grammar works better than a regular or linear grammar on child-directed English speech -- as LPY13 note)? With respect to recursion, L13 cites the Perfors et al. 2010 study, which LPY13 correctly note doesn't have to do with regular languages vs. non-regular languages. Instead, that study finds that a mixture of recursive and non-recursive context-free rules (surprisingly) is the best, rather than having all recursive or all non-recursive rules, despite this seeming to duplicate a number of rules.

L13: Section 6, using the transformation from pidgin to creole as evidence for syntactic embedding coming from other capacities like joint action abilities: It's true that one of the hallmarks of pidgins vs. creoles is the added syntactic complexity, which (broadly speaking) seems to come from children learning the pidgin, adding syntactic structure to it that's regular and predictable, and ending up with something that has the same syntactic complexity as any other language. I'm not sure I understand why this tells us anything about where the syntactic complexity is coming from, other than something internal to the children (since they obviously aren't getting it from the pidgin in any direct way). Is it that these children are talking to each other, and it's the dialogue that provides a model for the embedded structures, for example?

LPY13: I'm not quite sure I agree with the objection LPY13 raise about whether dialogue embeddings represent structures (p.9). I agree that there don't seem to be very many restrictions, certainly when compared to syntactic structure. But just because there are multiple licit options doesn't mean there isn't a structure corresponding to each of them. It just may be exactly this: there are multiple possible structures that we allow in dialogue. So maybe this is really more of an issue about how we tell structure is present (as opposed to linear "beads on a string", for example).

Friday, February 14, 2014

Next time on 2/28/14 @ 3:20pm in SBSG 2221 = Levinson 2013 + Legate et al. 2013 Manuscript

Thanks to everyone who was able to join us for our delightfully thoughtful discussion of the Omaki & Lidz 2013 manuscript! Next time on Friday February 28 at 3:20pm in SBSG 2221, we'll be looking at an article that argues that recursion is a central part of cognition, even if it's curiously restricted in the realm of syntax (Levinson 2013).

Levinson, S. 2013. Recursion in pragmatics. Language, 89(1), 149-162.

In addition, we'll read a reply to that article that critically examines the assumptions underlying that argument (Legate et al. 2013 Manuscript).

Legate, J., Pesetsky, D., & Yang, C. 2013. Recursive Misrepresentations: a Reply to Levinson (2013). Revised version to appear in Language. Please do not cite without permission from Julie Legate.

Wednesday, February 12, 2014

Some thoughts on the Omaki & Lidz 2013 Manuscript

There are many things that made me happy about this manuscript as a modeler, not the least of which is the callout to modelers about what ought to be included in their models of language acquisition (hurrah for experimentally-motivated guidance!). For example, there's good reason to believe that a "noise parameter" that simply distorts the input in some way can be replaced by a more targeted perceptual intake noise parameter that distorts the input in particular ways. Also, I love how explicit O&L are about the observed vs. latent variables in their view of the acquisition process -- it makes me want to draw plate diagrams. And of course, I'm a huge fan of the distinction between input and intake.

Another thing that struck me was the effects incrementality could have. For example, it could cause prioritization of working memory constraints over repair costs, especially when repair is costly, because the data's coming at you now and you have to do something about it. This is discussed in light of the parser and syntax, but I'm wondering how it translates to other types of linguistic knowledge (and perhaps more basic things like word segmentation, lexical acquisition, and grammatical categorization). If this is about working memory constraints, we might expect it to apply whenever the child's "processor" (however that's instantiated for each of these tasks) gets overloaded. So, at the beginning of word segmentation, it's all about making your first guess and sticking to it (perhaps leading to snowball effects of mis-segmentation, as you use your mis-segmentations to segment other words). But maybe later, when you have more of a handle on word segmentation, it's easier to revise bad guesses (which is one way to recover from mis-segmentations, aside from continuing experience).

This relates to the cost of revision in areas besides syntax. In some sense, you might expect that cost is very much tied to how hard it is to construct the representation in the first place. For syntax (and the related sentential semantics), that can continue to be hard for a really long time, because these structures are so complex. And as you get better at it, it gets faster, so revision gets less costly. But looking at word segmentation, is constructing the "representation" ever that hard? (I'm trying to think what the "representation" would be, other than the identification of the lexical item, which seems pretty basic assuming you've abstracted to the phonemic level.) If not, then maybe word segmentation revision might be less costly, and so the swing from being revision-averse to revision-friendly might happen sooner for this task than in other tasks.

Some more targeted thoughts:

(i) One thing about the lovely schematic in Figure 1: I can definitely get behind the perceptual intake feeding the language acquisition device (LAD) and (eventually) feeding the action encoding, but I'm wondering why it's squished together with "linguistic representations". I would have imagined that perceptual intake directly feeds the LAD, and the LAD feeds the linguistic representation (which then feeds the action encoding). Is the idea that there's a transparent mapping between perceptual intake and linguistic representations, so separating them is unnecessary? And if so, where's the place for acquisitional intake (talked about in footnote 1 on p.7), which seems like it might come between perceptual intake and LAD?

(ii) I found it a bit funny that footnote 2 refers to the learning problem as "inference-under-uncertainty" rather than the more familiar "poverty of the stimulus" (PoS). Maybe PoS has too many other associations with it, and O&L just wanted to sidestep any misconceptions arising from the term? (In which case, probably a shrewd move.)

(iii) In trying to understand the relationship between vocabulary size and knowledge of pronoun interpretation (principle C), O&L note that children who had faster lexical access were not faster at computing principle C, so it's not simply that children who could access meaning faster were then able to do the overall computation faster. This means that the hypothesis that "more vocabulary" equals "better at dealing with word meaning", which equals "better at doing computations that require word meaning as input" can't be what's going on. So do we have any idea what the link between vocabulary size and principle C computation actually is? Is vocabulary size the result of some kind of knowledge or ability that would happen after initial lexical access, and so would be useful for computing principle C too? One thought that occurred to me was that someone who's good at extracting sentential level meaning (i.e., because their computations over words happen faster) might find it easier to learn new words in the first place. This then could lead to a larger vocabulary size. So, this underlying ability to compute meaning over utterances (including using principle C) could cause a larger vocabulary, rather than knowing lots of words causing faster computation.

(iv) I totally love the U-shaped development of filler-gap knowledge in the Gagliardi et al. (submitted) study. It's nice to see an example of this qualitative behavior in a realm besides morphology. The explanation seems similar, too -- a more sophisticated view of the input causes errors, which take some time to recover from. But the initial simplified view leads to surface behavior that seems right, even if the underlying representation isn't at that point. Hence, U-shaped performance curve. Same for the White et al. 2011 study -- U-shaped learning in syntactic bootstrapping for the win.

(v) I really liked the note on p.45 in the conclusion about how the input vs. intake distinction could really matter for L2 acquisition. It's nice to see some explicit ideas about what the skew that occurs is and why it might be occurring. (Basically, this feels like a more explicit form of the "less is more" hypothesis, where adult processing is distorting the input in predictable ways.)

Friday, January 24, 2014

Next time on 2/14/14 @ 3pm in SBSG 2221 = Omaki & Lidz 2013 Manuscript

Thanks to everyone who was able to join us for our thorough and thoughtful discussion of the Meylan et al. 2014 manuscript! Next time on Friday February 14 at 3pm in SBSG 2221, we'll be looking at an article manuscript that argues for the need to consider the development of children's processing abilities at the same time as we consider their acquisition of knowledge. This is particularly relevant to computational modelers who must explicitly model what the child's input looks like and how that input is used, for example.

Omaki, A. & Lidz, J. 2013. Linking parser development to acquisition of linguistic knowledge. Manuscript, Johns Hopkins University and University of Maryland, College Park. Please do not cite without permission from Akira Omaki. 

Wednesday, January 22, 2014

Some thoughts on Meylan et al. 2014 Manuscript

One of the things I really enjoyed about this paper was the framing they give to explain why we should care about the emergence of grammatical categories, with respect to the existing debate between (some of) the nativists and (some of) the constructivists.  Of course I'm always a fan of clever applications of Bayesian inference to problems in language acquisition, but I sometimes really miss the level of background story that we get here. (So hurrah!)

That being said, I was somewhat surprised to see the conclusion M&al2014 drew with respect to their results that (some kind of) a nativist view wasn't supported. To me, the fact that we see very early rapid development of this grammatical category knowledge is an unexpected thing from the "gradual emergence based on data" story (i.e., constructivist perspective). So, what's causing the rapid development? I know it's not the focus of M&al2014's work here, but positing some kind of additional learning guidance seems necessary to explain these results. And until we have a story for how that guidance would be learned, the "it's innate" answer is a pretty good placeholder.  So, for me, that places the results in the nativist side, though maybe not the strict "grammatical categories are innate" version. Maybe I'm being unfair to the constructivist side, though -- would they have an explanation for the rapid, early development?

Another very cool thing was the application of this approach to the Speechome data set. It's been around for a while, but we don't have a lot of studies that use it and it's such an amazing resource. One of the things I wondered, though, was whether the evaluation metric M&al2014 propose can only work if you have this density of data. It seems like that might be true, given the issues with confidence intervals on the CHILDES datasets. If so, this is different from Yang's metric [Yang 2013], which can be used on much smaller datasets. (My understanding is that as long as you have enough data to form a Zipfian distribution, you have enough for Yang's metric to be applied.)

One thing I didn't quite follow was the argument about why only a developmental analysis is possible, rather than both a developmental and a comparative analysis. I completely understand that adults may have different values for their generalized determiner preferences, but we assume that they realize determiners are a grammatical class. So, given this, whatever range of values adults have is the target state for acquisition, right? And this should allow a comparative analysis between wherever the child is and wherever the adult is. (Unless I'm missing something about this.)

Some more targeted thoughts:

As a completely nit-picky thing that probably doesn't matter, it took me a second to get used to calling grammatical categories syntactic abstractions. I get that they're the basis for (many) syntactic generalizations, but I wouldn't have thought of them as syntactic, per se.  (Clearly, this is just a terminology issue, and other researchers that M&al2014 cite definitely have called it syntactic knowledge, too.)

M&al2014 state in the previous work section that Yang's metric is "not well-suited to discovering if a child could be less than fully productive at a given stage of development". I'm not sure I understand why this is so - if the observed overlap in the child's output is less than the expected overlap from a fully productive system, isn't that exactly the indicator of a less than fully productive system?
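To make the observed-vs-expected comparison concrete, here's a minimal Python sketch of how I understand the overlap logic. It assumes noun frequencies are Zipfian and that determiners combine freely with nouns (that's the "fully productive" baseline). The function names, the toy data, and the determiner probabilities are all my own placeholders, and averaging over all noun ranks is a simplification, so treat this as an illustration rather than Yang's exact procedure.

```python
from collections import defaultdict


def observed_overlap(pairs):
    """Fraction of noun types that occur with more than one determiner.

    pairs: iterable of (determiner, noun) tokens from a sample.
    """
    dets_by_noun = defaultdict(set)
    for det, noun in pairs:
        dets_by_noun[noun].add(det)
    if not dets_by_noun:
        return 0.0
    n_overlapping = sum(len(dets) > 1 for dets in dets_by_noun.values())
    return n_overlapping / len(dets_by_noun)


def expected_overlap(n_nouns, sample_size, det_probs):
    """Expected overlap if determiners and nouns combine independently.

    Assumes (hypothetically) that the noun of rank r has probability
    1 / (r * H_N), i.e. a Zipfian distribution over n_nouns types.
    For each noun, the chance of being seen with 2+ determiners is
    1 - P(never sampled) - sum_i P(sampled, but only with determiner i).
    """
    harmonic = sum(1.0 / r for r in range(1, n_nouns + 1))
    total = 0.0
    for rank in range(1, n_nouns + 1):
        p = 1.0 / (rank * harmonic)
        never = (1.0 - p) ** sample_size
        # (d*p + 1 - p)^S = P(every token is either "det i + this noun"
        # or some other noun); it still includes the "never sampled" case.
        only_one = sum((d * p + 1.0 - p) ** sample_size for d in det_probs)
        total += 1.0 + (len(det_probs) - 1) * never - only_one
    return total / n_nouns


# Toy child sample (made-up data, purely for illustration).
child = [("the", "dog"), ("a", "dog"), ("the", "cat")]
print(observed_overlap(child))
print(expected_overlap(n_nouns=2, sample_size=len(child), det_probs=[2 / 3, 1 / 3]))
```

On this reading, an observed value that falls well below the expected one is exactly the signature of a less-than-fully-productive system, which is why the quoted claim puzzles me.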

In the generative model M&al2014 use, they have a latent variable that represents the unrecorded caregiver input (DA), which is assumed to be drawn from the same distribution as the observed caregiver input (dA). I don't follow what this variable contributes, especially if it follows the same distribution as the observed data.

The table just below figure 4:  I'm not sure I followed this. What would rich morphology be for English data, for example? And are the values for "Current" the v value inferred for the child? Are the Yang 2013 values calculated based on his expected overlap metric?

I wonder if the reason there were developmental changes found in the Speechome corpus is more about having enough data in the appropriate age range (i.e., < 2 years old). The other corpora had a much wider range of ages, and it could very well be that the ones that included younger-than-2-year-old data had older-age data included in the earliest developmental window investigated.

There's a claim made in the discussion that "no previous analysis has taken into account the input that individual children hear in judging whether their subsequent determiner usage has changed its productivity". I think what M&al2014 intend is something related to the explicit modeling of how much of the productions are imitated chunks, and if so, that seems completely fine (though one could argue that the Yang 2010 manuscript goes into quite some detail modeling this option). However, the way the current sentence reads, it seems a bit odd to say no previous analysis has cared about the input -- certainly Yang's metric can be used to assess productivity in child-directed speech utterances, which are the children's input. This is how a comparative analysis would presumably be made using Yang's metric.

Similarly, there's a claim near the end that the Bayesian analysis "makes inferences regarding developmental change of continuity in a single child possible". While it's true that this can be done with the Bayesian analysis, there seems to be an implicit claim that the other metrics can't do this. But I'm pretty sure it can also be done with the other metrics out there (e.g., Yang's). You basically apply the metric to data at multiple time points, and track the change, just as M&al2014 did here with the Bayesian metric.
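Here's a rough Python sketch of what "apply the metric at multiple time points" could look like for a single child, using a simple overlap measure (the fraction of noun types seen with more than one determiner). The function name, the three-month window size, and the toy tokens are my own assumptions, not anything taken from M&al2014 or Yang.

```python
from collections import defaultdict


def overlap_by_window(tokens, window_months=3):
    """Compute determiner-noun overlap separately for each age window.

    tokens: iterable of (age_in_months, determiner, noun).
    Returns {window_start_age: overlap}, ordered by age.
    """
    nouns_by_window = defaultdict(lambda: defaultdict(set))
    for age, det, noun in tokens:
        start = int(age // window_months) * window_months
        nouns_by_window[start][noun].add(det)

    result = {}
    for start in sorted(nouns_by_window):
        dets_by_noun = nouns_by_window[start]
        n_overlapping = sum(len(dets) > 1 for dets in dets_by_noun.values())
        result[start] = n_overlapping / len(dets_by_noun)
    return result


# Made-up productions from one hypothetical child.
tokens = [
    (18, "the", "dog"), (19, "the", "cat"),
    (21, "a", "dog"), (22, "the", "dog"), (23, "a", "cat"),
]
print(overlap_by_window(tokens))
```

One caveat: raw overlap grows with sample size, so each window's value would need to be compared against the overlap expected for a sample of that size before reading a trend as developmental change.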


Yang, C. 2013. Ontogeny and phylogeny of language. Proceedings of the National Academy of Sciences, 110 (16). doi:10.1073/pnas.1216803110.

Thursday, January 9, 2014

Next time on 1/24/14 @ 3pm in SBSG 2221 = Meylan et al. 2014 Manuscript

It looks like the best collective time to meet will be Fridays at 3pm for this quarter, so that's what we'll plan on.  Our first meeting will be in a few weeks on January 24.  Our complete schedule is available on the webpage at 

On Jan 24, we'll be looking at an article that examines a formal metric to gauge productivity for grammatical categories, based on hierarchical Bayesian modeling.

UPDATE for Jan 24: Michael Frank was kind enough to provide us with an updated version of the 2013 paper (2013 version linked below), which they're intending to submit for a journal publication. It's already received some outside feedback, and they'd be delighted to hear any thoughts we had on it.  Michael preferred the manuscript not be posted publicly however, so I've sent it around as an attachment to the mailing list.

Meylan, S., Frank, M. C., & Levy, R. 2013. Modeling the development of determiner productivity in children's early speech. Proceedings of the 35th Annual Meeting of the Cognitive Science Society.

Categorical productivity is typically used to determine when abstract knowledge that a category actually exists is acquired (think "VERB exists, not just see and kiss and want! Woweee! Who knew?"), which is a fundamental building block for more complex linguistic knowledge.

I think the metric proposed in this session's article is particularly useful to compare and contrast against the metric that's been proposed recently by Yang (which is based on straight probability calculations), so I encourage you to have a look at that one as well:

Yang, C. 2013. Ontogeny and phylogeny of language. Proceedings of the National Academy of Sciences, 110 (16). doi:10.1073/pnas.1216803110.

See you on Jan 24!