Friday, January 23, 2015

Next time on 2/13/15 @ 11:30am in SBSG 2221 = Yurovsky & Frank 2014 Ms

Thanks to everyone who was able to join us for our educational discussion of Johnson 2013!  For our next CoLa reading group meeting on Friday February 13 at 11:30am in SBSG 2221, we'll be looking at a manuscript that explores a model of word learning, integrating non-linguistic aspects such as memory and attention.

Yurovsky, D. & Frank, M. 2014. An Integrative Account of Constraints on Cross-Situational Learning. Manuscript, Stanford University.


See you then!

Wednesday, January 21, 2015

Some thoughts on Johnson 2013

Something I really liked about this paper was Johnson’s sensitivity to the problems that occur during actual acquisition, even as he gave an intuitive overview of the different approximation algorithms used in machine learning. He also made a point of connecting with linguistic theory related to acquisition (e.g., Universal Grammar, the uniqueness constraint, etc.). This makes it much easier for acquisition people who aren’t necessarily modelers to understand why they should care about these approaches, especially when the particular structures Johnson uses for his demonstrations (PCFGs) are known to be not quite right (which Johnson himself helpfully points out right at the beginning).

Some more targeted thoughts:

(1) Johnson makes a point at the very beginning about the utility of joint inference of syntactic structure and grammatical categories (which he calls lexical categories), and how better performance is obtained that way (as opposed to solving one problem after another). This seems to be another example of this joint-inference-is-better thing, which is getting a fair amount of play in the acquisition modeling literature. Bigger point: Information from one problem can help usefully constrain another. Smaller quibble: I think grammatical categories may be learned earlier than syntactic structure, so we may want something like an informed prior when it comes to the grammatical categories if we still want syntactic structure and grammatical categorization to be solved simultaneously.

(2) This comment in section 3: “…suggesting the attractive possibility that at least some aspects of language acquisition may be an almost cost-free by-product of parsing. That is, the child’s efforts at language comprehension may supply the information they need for language acquisition.” This reminds me very strongly of Fodor’s (1998) “Parsing to Learn” approach, which talks about exactly this idea. (A number of follow-up papers with William Sakas also tackle this issue.) Fodor’s learner was using parsing to help figure out Universal Grammar parameter settings, but the idea is exactly the same — because parsing is already happening, the learner can leverage the information from that process to learn about the structure of her language.

Fodor, J. D. 1998. Parsing to learn. Journal of Psycholinguistic Research, 27(3), 339-374.

(3) Related to the smaller quibble above in (1): Johnson notes later on in section 3 that “it’s hard to see how any ‘staged’ learner (which attempted to learn lexical entries before learning syntax, or vice versa) could succeed on this data”. The important unspoken part is “using just this strategy”, I’m assuming — because certainly it’s possible to learn grammatical categories using other strategies just fine. In fact, most of the grammatical categorization models I’m aware of do just this (though some do incorporate aspects of syntactic structure in the grammatical category inference).

(4) This point in section 5 seems spot on to me: “…language learning may require additional information beyond that contained in a set of strings of surface forms.” Johnson jumps straight to non-linguistic information, but I’m imagining that semantics would still be counted as linguistic, and that seems super-important for a number of syntactic structure phenomena (e.g., animacy for learning the appropriate meanings for tough-constructions: The apple was easy to eat. vs. The girl was eager to eat (the apple).)

(5) The production model by Johnson & Riezler (2002) discussed later in that section was interesting: the input is the intended logical form (hierarchical semantic structure…which presumably maps to syntactic structure?) and the output is the observed string. Presumably this is how you could design a generative learning model, where the goal is to infer the syntactic structure that corresponds to the observed string, with the idea that the syntactic structure was used to generate the string.

(6) This in the conclusion: “…in principle it should be possible for Bayesian priors to express the kinds of rich linguistic knowledge that linguists posit for Universal Grammar. It would be extremely interesting to investigate just what a statistical estimator using linguistically plausible parameters might be able to learn.” — Exactly this! I’ve long (vaguely) pondered how to connect the sorts of parameters in, say, a parametric representation of metrical phonology to the kinds of precise mathematical priors Bayesian models use. Somehow, somehow it seems possible…and then perhaps the two uses of “parameter” could be reconciled more precisely.

Friday, January 9, 2015

Next time on 1/23/15 @ 11:30am in SBSG 2221 = Johnson 2013

Hi everyone,

It looks like a good collective time to meet will be Fridays at 11:30am for this quarter, so that's what we'll plan on.  Our first meeting will be on January 23, and our complete schedule is available on the webpage at 


On January 23, we'll be discussing a book chapter that looks closely at the idea that language acquisition is a statistical inference problem, and examines how to translate current machine learning statistical inference approaches to implementations that would work for acquisition.

Johnson, M. 2013. Language acquisition as statistical inference. In Stephen R. Anderson, Jacques Moeschler, and Fabienne Reboul (eds.), The Language-Cognition Interface, Librairie Droz, Geneva, 109-134.

http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/Johnson2013_LangAcqStatInf.pdf

See you on January 23!

Wednesday, December 3, 2014

See you in the winter!

Thanks so much to everyone who was able to join us for our invigorating discussion today about Richie et al. 2014, and to everyone who's joined us throughout the fall quarter! The CoLa Reading Group will resume again in the winter quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Monday, December 1, 2014

Some thoughts on Richie et al. 2014

One thing I really like about this article is that it provides a nice example of how to make a concrete, empirically grounded computational model to test an intuitive theory about a particular phenomenon. In this case, it’s the impact of multiple speakers on lexicon emergence (in particular, the many-to-one vs. many-to-many dynamic). While I do have one or two questions about the model implementation, it was generally pretty straightforward to understand and nicely intuitive in its own right — and so for me, this is an excellent demonstration of how to use modeling in an informative way. On a related note, while the authors certainly packed both experimental and modeling pieces into the paper, it actually didn’t feel all that rushed (but perhaps this is because I’m already fairly familiar with the model). 

Some more targeted thoughts:

p.185, Introduction: “The disconnect between experimental and computational approaches is a general concern for research on collective and cooperative behavior” — I think this is really the biggest concern I always have for models of general language evolution. At least for the sign lexicon emergence, we actually have examples of it happening “in the wild”, so we can ground the models that way (in the input representation, transmission, output evaluation, etc.). But this becomes far harder (if not impossible) to do for the evolution of language in the human species. What are reasonable initial conditions?  What is the end result supposed to look like anyway? Ack. And that doesn’t even begin to get into the problem of idealization in evolutionary modeling (what to idealize, is it reasonable to do that, and so on). So for my empirical heart, one of the main selling points of this modeling study is the availability of the empirical data, and the attempt to work it into the model in a meaningful way.

p.185, Introduction: “A probabilistic model of language learning…situated in a social setting appears to capture the observed trends of conventionalization” — One of the things I’m wondering is how much the particulars of the probabilistic learning model matter. Could you be a rational Bayesian type for example, rather than a reinforcement learner, and get the same results? In some sense, I hope this would be true since the basic intuition that many-to-many is better for convergence seems sensible. But would the irrelevance of population dynamics persist, or not? Since that’s one of the less intuitive results (perhaps due to the small population size, perhaps due to something else), I wonder.

p.186, 2.1.4 Coding: “…we coded every gesture individually for its Conceptual Component” — On a purely practical note, I wonder how these were delineated. Is it some recognizable unit in time (so the horns of the cow would occur before the milking action of the cow, and that’s how you decide they’re two meaning pieces)? Is it something spatial?  Something else, like a gestalt of different features? I guess I’ve been thinking about the simultaneous articulation aspects of signed languages like ASL, and this struck me as something that could be determined by human perceptual bias (which could be interesting in its own right).

p.190, 3.2: For the adjustment of p using the Linear-Reward-Penalty, is the idea that each Conceptual Component’s (CC’s) probability is adjusted, one at a time? I’m trying to map this to what I recall of previous uses of Yang’s learning model (e.g., Yang 2004), where the vector would be of different grammar parameters, and the learner actually can’t tell which parameter is responsible for successful analysis of the observed data point. In that case, all parameter probabilities are either rewarded or punished, based on the success (or failure) of input analysis. Here, since the use (or not) of a given CC is observed, you don’t have to worry about that. Instead, each one can be rewarded or punished based on its observed use (or not). So, in some sense, this is simpler than previous uses of the learning model, precisely because this part of learning is observed.
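For concreteness, here's a minimal sketch of how I'm picturing the per-CC Linear-Reward-Penalty update (the function names, learning rate, and toy CCs are my own invention, not the paper's):

```python
# Minimal sketch of a Linear-Reward-Penalty update applied per Conceptual
# Component (CC), as I understand it -- variable names and the learning
# rate are illustrative, not from Richie et al.

def lr_p_update(p, used, gamma=0.01):
    """Update the probability p of producing a CC, given whether it was
    observed in the current communication (used=True) or not."""
    if used:
        return p + gamma * (1.0 - p)   # reward: move p toward 1
    else:
        return (1.0 - gamma) * p       # penalize: move p toward 0

# Each CC's probability is updated independently, since its use is observed:
probs = {"horns": 0.5, "milking": 0.5}
observed = {"horns": True, "milking": False}
probs = {cc: lr_p_update(p, observed[cc]) for cc, p in probs.items()}
```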

p.191, 3.5: “…we run the simulations over 2 million instances of communications” — So for the homesigners with the many-to-one setup, this can easily be interpreted as 2 million homesigner-nonhomesigner interactions. For deaf population simulation, is this just 2 million deaf-deaf communication instances, among any of the 10 pairs in a population of 5? Or does each of the 10 pairs get 2 million interactions? The former seems fairer for comparison with the homesigner population, but the latter is a possible sensible instantiation. If it’s the latter, then the overall frequency of interactions within the population might be what’s driving faster convergence. 

Yang, C. D. (2004). Universal Grammar, statistics or both? Trends in Cognitive Sciences, 8(10), 451-456.




Wednesday, November 19, 2014

Next time on 12/3/14 @ 4:00pm in SBSG 2221 = Richie et al. 2014

Thanks to everyone who was able to join us for our illuminating discussion of Qing & Franke 2014!  For our next CoLa reading group meeting on Wednesday December 3 at a special time of 4:00pm in SBSG 2221, we'll be looking at an article that investigates the emergence of homesign lexicons, paying particular attention to the impact of other speakers.

Richie, R., Yang, C., & Coppola, M. 2014. Modeling the Emergence of Lexicons in Homesign. Topics in Cognitive Science, 6, 183-195.


See you then!

Monday, November 17, 2014

Some thoughts on Qing & Franke 2014

One of the things I really liked about this article was the attention to formalizing the pragmatic facts, and the attention to explaining the intuitions behind the model. (That being said, I probably could have benefited from more of a primer on degree semantics, since I had some trouble following the exact instantiation of that in the model.) Still, Q&F2014’s point is to demonstrate the utility of certain assumptions for learning about degree adjectives and then to rigorously evaluate them using standard Bayesian methods, and I think they succeeded on that computational-level goal. In general, I suspect the length constraint was a big issue for this paper — so much was packed in that of course many things had to be glossed over. I do wish Q&F had spent a bit more time on the discussion and conclusions, however — I was left wondering exactly what to make of these results as someone who cares about how acquisition works. For instance, what does an individual need to learn (c, theta?) vs. what’s already built in (lambda?)?

Some more targeted thoughts:

(1) p.2, “…to measure how efficient a standard theta for ‘tall’ is for describing basketball players, we calculate on average how likely the speaker will manage to convey the height of a random basketball player by adopting that standard.” — This text sounds like the goal is to convey exact height, rather than relative height (importantly, is the player in question “tall” relative to the standard theta?). But it seems like relative height would make more sense. (That is, “tall” doesn’t tell you whether the player is 7’1” or 7’2”, but rather that he’s tall compared to other basketball players, captured by that theta.)

(2) p.2, c: I admit, I struggled to understand how to interpret c specifically. I get the general point about how c captures a tradeoff between communicative efficiency and absolute general applicability (side note: which means…? It always applies, I think?). But what does it mean to have communicative efficiency dominate absolute general applicability (with c close to 0) -- that the adjective doesn’t always apply?  I guess this is something of a noise factor, more or less. And then there’s another noise factor with the degree of rationality in an individual, lambda.

(3) p.3, Parameters Learning section: c_A is set to range between -1 and 0. Given the interpretations of c we just got on p.2, does this mean Q&F are assuming that the adjectives they investigate (big, dark, tall, full) are generally inapplicable (and so have a higher theta baseline to apply), since c can only be negative if it’s non-zero? It doesn’t seem unreasonable, but if so, this is an assumption they build into the learner. Why not allow it to range from -1 to 1, and allow the learner to assume positive c values are a possibility?

(4) p.6, Conclusion: “Combining the idea of pragmatic reasoning as social cognition…” — Since they’re just looking at individual production in their model (and individual judgments in their experiment), where is the social cognition component? Is it in how the baseline theta is assessed? Something else?

(5) p.6, Conclusion: “…we advanced the hypothesis that the use of gradable adjectives is driven by optimality of descriptive language use.” — What does this mean exactly? How does it contrast with optimal contextual categorization and referential language use? This is definitely a spot where I wish they had had more space to explain, since this seems to get at the issue of how we interpret the results here.

Wednesday, October 29, 2014

Next time on 11/19/14 @ 10:30am in SBSG 2221 = Qing & Franke 2014

Thanks to everyone who was able to join us for our incisive and informative discussion of Barak et al. 2014!  For our next CoLa reading group meeting on Wednesday November 19 at 10:30am in SBSG 2221, we'll be looking at an article that investigates the acquisition of gradable adjectives like "tall", using a Bayesian approach that incorporates pragmatic reasoning.

Qing, C. & Franke, M. 2014. Meaning and Use of Gradable Adjectives: Formal Modeling Meets Empirical Data. Proceedings of the Cognitive Science Society.


See you then!




Monday, October 27, 2014

Some thoughts on Barak et al. 2014

One of the things I really liked about this paper was the additional "verb class" layer, which is of course what allows similarities between verbs to be identified, based on their syntactic structure distributions. This seems like an obvious thing, but I don't think I've seen too many incremental models that actually have hierarchy in them (in contrast to ideal learner models operating in batch mode, which often have hierarchical levels in them). So that was great to see. Relatedly, the use of syntactic distributions from other verbs too (not just mental state verbs and communication/perception verbs) feels very much like indirect positive evidence (Pearl & Mis 2014 terminology), where something present in the input is informative, even if it's not specifically about the thing you're trying to learn. And that's also nice to see more explicit examples of. Here, this indirect positive evidence provides a nice means to generalize from communication/perception verbs to mental state verbs.

I also liked the attention spent on the perceptual coding problem (I'm using Lidz & Gagliardi 2014 terminology now) as it relates to mental state verbs, since it definitely seems true that mental state concepts/semantic primitives are going to be harder to extract from the non-linguistic environment, as compared to communication events or perception events.



More specific comments:

(1) Overview of the Model, "The model also includes a component that simulates the difficulty of children attending to the mental content...also simulates this developing attention to mental content as an increasing ability to correctly interpret a scene paired with an SC utterance as having mental semantic properties." -- Did I miss where it was explained how this was instantiated? This seems like exactly the right thing to do, since semantic feature extraction should be noisy early on and get better over time. But how did this get implemented? (Maybe it was in the Barak et al. 2012 reference?)

(2) Learning Constructions of Verb Usages, "...prior probability of cluster P(k) is estimated as the proportion of frames that are in k out of all observed input frames, thus assigning a higher prior to larger clusters representing more frequent constructions." -- This reminds me of adaptor grammars, where both type frequency and token frequency have roles to play (except, if I understand this implementation correctly, it's only token frequency that matters for the constructions, and it's only at the verb class level that type frequency matters, where type = verb).
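As a toy illustration of that prior (with made-up counts, just to show the token-frequency flavor):

```python
# P(k) estimated as the proportion of observed frames assigned to cluster k,
# as described in the quote. The toy assignments below are invented.

from collections import Counter

frame_assignments = ["k1", "k1", "k2", "k1", "k3", "k2"]  # cluster per observed frame
counts = Counter(frame_assignments)
total = sum(counts.values())
prior = {k: c / total for k, c in counts.items()}
# prior == {'k1': 0.5, 'k2': 0.333..., 'k3': 0.166...}  -- bigger clusters get bigger priors
```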

(3) Learning Verb Classes, "...creation of a new class for a given verb distribution if the distribution is not sufficiently similar to any of those represented by the existing verb classes.", and the new class is a uniform distribution over all constructions. This seems like a sensible way to get at the same thing generative models do by having some small amount of probability assigned to creating a new class. I wonder if there are other ways to implement it, though. Maybe something more probabilistic where, after calculating the probabilities of it being in each existing verb class and the new uniform distribution one, the verb is assigned to a class based on that probability distribution. (Basically, something that doesn't use the argmax, but instead samples.)
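Here's a rough sketch of the sampling alternative I have in mind, with hypothetical class names and scores (the argmax branch stands in for the current behavior):

```python
# Instead of assigning a verb to the best-scoring class (argmax), sample a
# class in proportion to the scores, including the new-class option.

import random

def assign_class(scores, sample=True):
    """scores: dict mapping class name -> (unnormalized) probability,
    including a 'new_class' option scored against a uniform distribution."""
    if not sample:
        return max(scores, key=scores.get)       # current argmax-style behavior
    total = sum(scores.values())
    r = random.uniform(0, total)
    cumulative = 0.0
    for k, s in scores.items():
        cumulative += s
        if r <= cumulative:
            return k
    return k  # fallback for floating-point edge cases

scores = {"class_1": 0.6, "class_2": 0.3, "new_class": 0.1}  # hypothetical values
chosen = assign_class(scores, sample=True)
```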

(4) Generation of Input Corpora, "...frequencies are extracted from a manual annotation of a sample of 100 child-directed utterances per verb" -- I understand manual annotation is a pain, but it does seem like this isn't all that many per verb. Though I suppose if there are only 4 frames they're looking at, it's not all that bad.  That being said, the range of syntactic frames is surely much more than that, so if they were looking at the full range, it seems like they'd want to have more than 100 samples per verb.

(5) Set-up of Simulations: "...we train our model on a randomly generated input corpus of 10,000 input frames" -- I'd be curious about how this amount of input maps onto the amount of input children normally get to learn these mental state verbs. It actually isn't all that much input. But maybe it doesn't matter for the model, which settles down pretty quickly to its final classifications?

(6) Estimating Event Type Likelihoods: "...each verb entry in our lexicon is represented as a collection of features, including a set of event primitives...think is {state, cogitate, belief, communicate}" -- I'm very curious as to how these are derived, as some of them seem very odd for a child's representation of the semantic content available. (Perhaps automatically derived from existing electronic resources for adult English? And if so, is there a more realistic way to instantiate this representation?)

(7) Experimental Results: "...even for Desire verbs, there is still an initial stage where they are produced mostly in non-mental meaning." -- I wish B&al had had space for an example of this, because I had an imagination fail about what that would be. "Want" used in a non-mental meaning? What would that look like for "want"?


References:
Lidz, J. & Gagliardi, A. 2014, to appear. How Nature Meets Nurture: Universal Grammar and Statistical Learning. Annual Review of Linguistics.

Pearl, L. & Mis, B. 2014. The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Manuscript, UCI. [lingbuzz: http://ling.auf.net/lingbuzz/001922]

Wednesday, October 15, 2014

Next time on 10/29/14 @ 10:30am in SBSG 2221 = Barak et al. 2014

Thanks to everyone who was able to join us for our invigorating discussion of Lidz & Gagliardi 2014!  For our next meeting on Wednesday October 29 at 10:30am in SBSG 2221, we'll be looking at an article that investigates the acquisition of a particular subset of lexical items, known as mental state verbs (like "want", "wish", "think", "know"). This computational modeling study focuses on different syntactic information that children could be leveraging.

Barak, L., Fazly, A., & Stevenson, S. 2014. Gradual Acquisition of Mental State Meaning: A Computational Investigation. Proceedings of the Cognitive Science Society.


http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/BarakEtAl2014_LearningMentalStateVerbs.pdf


See you then!

Monday, October 13, 2014

Some thoughts on Lidz & Gagliardi 2014

My Bayesian-inclined brain really had a fun time trying to translate everything in this acquisition model into Bayesian terms, and I think it actually lends itself quite well to this -- model specification, model variables, inference, likelihood, etc. I'm almost wondering if it's worth doing this explicitly in another paper for this model (maybe for a different target audience, like a general cognitive sciences crowd). I think it'd make it easier to understand the nuances L&G highlight, since these nuances track so well with different aspects of Bayesian modeling. (More on this below.)

That being said, it took me a bit to wrap my head around the distinction between perceptual and acquisitional intake, probably because of that mapping I kept trying to do to the Bayesian terminology. I think in the end I sorted out exactly what each meant, but this is worth talking about more since they do (clearly) mean different things.  What I ended up with: perceptual intake is what can be reliably extracted from the input, while acquisitional intake is the subset relevant for the model variables (and of course the model/hypothesis space that defines those variables need to already be specified).

Related to this: It definitely seems like prior knowledge is involved to implement both intake types, but the nature of that prior knowledge is up for grabs. For example, if a learner is biased to weight cues differently for the acquisitional intake, does that come from prior experience about the reliability of these cues for forming generalizations, or is it specified in something like Universal Grammar, irrespective of how useful these cues have been previously? Either seems possible. To differentiate them, I guess you'd want to do what L&G are doing here, where you try to find situations where the information use doesn't map to the information reliability, since that's something that wouldn't be expected from derived prior knowledge. (Of course, then you have to have a very good idea about what exactly the child's prior experience was like, so that you could tell what they perceived the information reliability to be.)

One other general comment: I loved how careful L&G were to highlight when empirical evidence doesn't distinguish between theoretical viewpoints. So helpful. It really underscores why these theoretical viewpoints have persisted in the face of all the empirical data we now have available.

More specific comments:

(1) The mapping to Bayesian terms that I was able to make (a toy sketch of the whole mapping follows below):
-- Universal Grammar = hypothesis space/model specification
Motivation:
(a) Abstract: "Universal Grammar provides representations that support deductions that fall outside of experience...these representations define the evidence the learners use..." -- Which makes sense, because if the model is specified, the relevant data are also specified (anything that impacts the model variables is relevant).
(b) p.6, "The UG component identifies the class of representations that shape the nature of human grammatical systems".

-- Perceptual Intake = parts of the input that could impact model variables
Motivation:
p.10, "contain[s]...information relevant to making inferences"

-- Acquisitional Intake = parts of the input that do impact model variables

-- Inference engine = likelihood?
Motivation:
(a) p.10, "...makes predictions about what the learner should expect to find in the environment"...presumably, given a particular hypothesis. So, this is basically a set of likelihoods (P(D | H)) for all the Hs in the hypothesis space (defined by UG, for example).
...except
(b) p.21, "...the inference engine, which selects specified features of that representation (the acquisitional intake) to derive conclusions about grammatical representations". This makes it sound like the inference engine is the one selecting the model variables, which doesn't sound like likelihood at all. Unless inference is over the model variables, which are already defined for each H.


-- Updated Grammar, deductive consequences = posterior over hypotheses
Motivation:
p.30, "...inferential, using distributional evidence to license conclusions about the abstract representations underlying language"
Even though L&G distinguish between inferential and deductive aspects, I feel like they're still talking about the hypothesis space. The inferential part is selecting the hypothesis (using the posterior) and the deductive consequences part is all the model variables that are connected to that hypothesis.
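Just to make the mapping concrete, here's a toy sketch of the whole pipeline in those Bayesian terms (all of the hypotheses, likelihood values, and the intake filter are invented placeholders, not anything from L&G):

```python
# UG = hypothesis space, perceptual intake = relevant data filter,
# inference engine = likelihoods P(D | H), updated grammar = posterior.

hypotheses = ["H1", "H2", "H3"]                 # defined by UG (hypothesis space)
prior = {"H1": 1/3, "H2": 1/3, "H3": 1/3}

def likelihood(datum, h):
    """P(datum | h): the inference engine's predictions about the environment."""
    toy_table = {("d", "H1"): 0.8, ("d", "H2"): 0.1, ("d", "H3"): 0.1}
    return toy_table.get((datum, h), 0.0)

def perceptual_intake(input_data):
    """Parts of the input that could bear on the model variables (toy filter)."""
    return [d for d in input_data if d == "d"]

def update(prior, data):
    """Posterior over hypotheses = the 'updated grammar'."""
    posterior = dict(prior)
    for d in data:
        posterior = {h: posterior[h] * likelihood(d, h) for h in hypotheses}
    z = sum(posterior.values())
    return {h: p / z for h, p in posterior.items()} if z > 0 else prior

posterior = update(prior, perceptual_intake(["d", "x", "d"]))  # H1 ends up favored
```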


(2) The difference about inference: p.4, "On the input-driven view, abstract linguistic representations are arrived at by a process of generalization across specific cases...", and this is interpreted as "not inference" (in contrast to the knowledge-driven tradition). But a process of "generalization across specific cases" certainly sounds a lot like inference, because something has to determine exactly how that generalization is constrained (even if it's non-linguistic constraints like economy or something). So I'm not sure it's fair to say the input-driven approach doesn't use inference, per se. Instead, it sounds like the distinction L&G want is about how that inference is constrained (input-driven: non-linguistic constraints;  knowledge-driven: linguistic hypothesis space).


(3) Similarly, I also feel it's not quite fair to divide the world into "nothing like the input" (knowledge-driven) vs. "just like the input, only compressed" (input-driven) (p.5). Instead, it seems like this is more of a continuum, and some representations can be "not obviously" like the input, and yet still be derived from it. The key is knowing exactly what the derivation process is -- for example, for the knowledge-driven approach, the representations could be viewed as similar to the input at an abstract level, even if the surface representation looks very different.


(4) p.6, "...the statistical sensitivities of the learner are sometimes distinct from ideal-observer measures of informativity...reveal the role learners play in selecting relevant input to drive learning."  So if the learner has additional constraints (say, on how the perceptual intake is implemented), could these be incorporated into the learner assumptions that would make up an ideal learner model? That is, if we're not talking about constraints that are based on cognitive resources but are instead talking about learner biases, couldn't we build an ideal-observer model that has those biases? (Or maybe the point is that perceptual intake only comes from constraints on cognitive resources?)


(5) p.8, "...it must come from a projection beyond their experience". I think we have to be really careful about claiming this -- maybe "direct experience" is better, since even things you derive are based on some kind of experience, unless you assume everything about them is innate. But the basic point is that some previously-learned or innately-known stuff may matter for how the current direct experience is utilized.


(6) p.9, (referring to distribution of pronouns & interpretations), "...we are aware of no proposals outside the knowledge-driven tradition". Hello, modeling call! (Whether for the knowledge-driven theory, or other theories.)


(7) p.9, "...most work in generative linguistics has been the specification of these representations". I think some of the ire this approach has inspired from the non-generative community could be mitigated by considering which of these representations could be derived (and importantly, from what). It seems like not as many generative researchers (particularly ones who don't work a lot on acquisition) think about the origin of these representations. But since some of them can get quite complex, it rubs some people the wrong way to call them all innate. But really, it doesn't have to be that way -- some might be innate, true, but some of these specifications might be built up from other simpler innate components and/or derived from prior experience.


(8) p.15, "...predicted that the age of acquisition of a grammar with tense is a function of the degree to which the input unambiguously supports that kind of grammar..." And this highlights the importance of what counts as unambiguous data (which is basically data where likelihood p(D | H) is 0 for all but the correct H). And this clearly depends on the model variables involved in all the different Hs (which should be the same??).
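In toy terms (with invented hypothesis names and likelihood values):

```python
# A datum d is unambiguous when P(d | H) is 0 for every hypothesis except
# the correct one. Hypotheses and likelihoods below are made up.

likelihoods = {
    "d_ambiguous":   {"H_tense": 0.5, "H_no_tense": 0.5},
    "d_unambiguous": {"H_tense": 0.4, "H_no_tense": 0.0},
}

def is_unambiguous(datum, correct_h):
    return all(p == 0.0 for h, p in likelihoods[datum].items() if h != correct_h)

assert is_unambiguous("d_unambiguous", "H_tense")
assert not is_unambiguous("d_ambiguous", "H_tense")
```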


(9) p.25, "...preference for using phonological information over semantic information likely reflects perceptual intake in the initial stages of noun class learning". So this could easily be a derived bias, but I would think we would still call it "knowledge-driven" -- it's just that it's derived knowledge, rather than innate knowledge that caused it.


(10) section 6, Kannada empirical facts -- So interesting! Every time I see this, I always have a quiet moment of goggling. It seems like such an interesting challenge to figure out what these facts could be learned from. Something about binding? Something about goal-prominence? I feel like the top of p.35 has a parameter-style proposal linking possession constructions and these ditransitive facts, which would then be model variables. The Viau & Lidz 2011 proposal that cares about what kind of NPs are in different places also seems like another model variable. Of course, these are very specific pieces of knowledge about model variables...but still, will this actually work (like, can we implement a model that uses these variables and run it)? And if it does, can the more specific model variables be derived from other aspects of the input, or do you really have to know about those specific model variables?


(11) Future Issues, p.47: Yes.  All of these. Because modeling. (Especially 5, but really, all of them.)



Friday, October 3, 2014

Next time on 10/15/14 @ 10:30am in SBSG 2221 = Lidz & Gagliardi 2014

Hi everyone,

It looks like a good collective time to meet will be Wednesdays at 10:30am for this quarter, so that's what we'll plan on.  Our first meeting will be on October 15, and our complete schedule is available on the webpage at 



On October 15, we'll be looking at a review article that discusses a particular learning model drawing on language-specific and domain-general knowledge to explain the process of acquisition. For modelers, it's especially useful to consider the specific implementations proposed, as these are theoretically and empirically motivated learning strategies that we can investigate via computational modeling.

Lidz, J. & Gagliardi, A. 2014 to appear. How Nature Meets Nurture: Universal Grammar and Statistical Learning. Annual Review of Linguistics.

http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/LidzGagliardi2014ToAppear_UGStats.pdf

See you on October 15!

Monday, June 2, 2014

Thanks and see you in the fall!

Thanks to everyone who was able to join us for our delightful discussion of Ramscar et al. 2013, and to everyone who's joined us this past academic year!

The CoLa Reading Group will be on hiatus this summer, and we'll resume again in the fall quarter.  As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Friday, May 30, 2014

Some thoughts about Ramscar et al. 2013

One of the things I really liked about this paper was that it implements a computational model that makes predictions, and then tests those predictions experimentally. It's becoming more of a trend to do both within a single paper, but often it's too involved to describe both parts, and so they end up in separate papers. Fortunately, here we see something concise enough to fit both in, and that's a lovely thing.

I also really liked that R&al investigate the logical problem of language acquisition (LPLA) by targeting one specific instance of that problem that's been held up (or used to be held up as recently as ten years ago) as an easily understood example of the LPLA. I'm definitely sympathetic to R&al's conclusions, but I don't think I believe the implication that this debunks the LPLA. I do believe it's a way to solve it for this particular instantiation, but the LPLA is about induction problems in general -- not just this one, not just subset problems, but all kinds of induction problems. And I do think that induction problems abound in language acquisition.

It was interesting to me how R&al talked about positive and negative evidence -- it almost seemed like they conflated two dimensions that are distinct: positive (something present) vs. negative (something absent), and direct (about that data point) vs. indirect (about related data points). For example, they equate positive evidence with "the reinforcement of successful predictions", but to me, that could be a successful prediction about what's supposed to be there (direct positive evidence) or a successful prediction about what's not supposed to be there (indirect negative evidence). Similarly, prediction error is equated with negative evidence, but a prediction error could be about predicting something should be there but it actually isn't (indirect negative evidence) or about predicting something shouldn't be there but it actually is (direct positive evidence -- and in particular, counterexamples).  However, I do agree with their point that indirect negative evidence is a reasonable thing for children to be using, because of children's prediction ability.

Another curious thing for me was that the particular learning story R&al implement forces them to commit to what children's semantic hypothesis space is for a word (since it hinges on selecting the appropriate semantic hypothesis for the word as well as the appropriate morphological form, and using that to make predictions). This seemed problematic, because the semantic hypothesis space is potentially vast, particularly if we're talking about what semantic features are associated with a word. And maybe the point is their story should work no matter what the semantic hypothesis space is, but that wasn't obviously true to me.

As an alternative, it seemed to me that the same general approach could be taken without having to make that semantic hypothesis space commitment. In particular, suppose the child is merely tracking the morphological forms, and recognizes the +s regular pattern from other plural forms. This causes them to apply this rule to "mouse" too. Children's behavior indicates there's a point where they use both "mice" and "mouses", so this is a morphological hypothesis that allows both forms (H_both). The correct hypothesis only allows "mice" (H_mice), so it's a subset-superset relationship of the hypotheses (H_mice is a subset of H_both). Using Bayesian inference (and the accompanying Size Principle) should produce the same results we see computationally (the learner converges on the H_mice hypothesis over time). It seems like it should also be capable of matching the experimental results: early on, examples of the regular rule indirectly boost the H_both hypothesis more, but later on when children have seen enough suspicious coincidences of "mice" input only, the indirect boost to H_both matters less because H_mice is much more probable.
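To make that alternative concrete, here's a minimal sketch of the Size-Principle calculation I'm imagining (the prior and the one-over-set-size likelihood assumption are illustrative choices, not anything R&al commit to):

```python
# H_mice licenses only "mice"; H_both licenses "mice" and "mouses", so
# H_mice is the subset hypothesis. Under the Size Principle, each observed
# "mice" token is more probable under H_mice, so H_mice wins with enough data.

def posterior_h_mice(n_mice_tokens, prior_mice=0.5):
    """P(H_mice | data) after observing n tokens, all of them 'mice'."""
    prior_both = 1.0 - prior_mice
    like_mice = (1.0 / 1.0) ** n_mice_tokens   # 1 licensed form under H_mice
    like_both = (1.0 / 2.0) ** n_mice_tokens   # 2 licensed forms under H_both
    evidence = prior_mice * like_mice + prior_both * like_both
    return prior_mice * like_mice / evidence

for n in [0, 1, 5, 10]:
    print(n, round(posterior_h_mice(n), 4))
# 0 -> 0.5, 1 -> 0.6667, 5 -> 0.9697, 10 -> 0.999: "mice" gradually wins out
```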

So then, I think the only reason to add on this semantic hypothesis space the way R&al's approach does is if you believe the learning story is necessarily semantic, and therefore must depend on the semantic features.

Some more specific thoughts:

(1) The U-shaped curve of development: R&al talk about the U-shaped curve of development in a way that seemed odd to me. In particular, in section 6 (p.767), they call the fact that "children who have been observed to produce mice in one context may still frequently produce overregularized forms such as mouses in another" a U-shaped trajectory. But this seems to me to be just one piece of the trajectory (the valley of the U, rather than the overall trajectory).

(2) The semantic cues issue comes back in an odd way in section 6.7, where R&al say that the "error rate of unreliable cues" will "help young speakers discriminate the appropriate semantic cues to irregulars" (p.776). What semantic cues would these be? (Aren't the semantics of "mouses" and "mice" the same? The difference is morphological, rather than semantic.)

(3) R&al promote the idea that a useful thing computational approaches to learning do is "discover structure in the data" rather than trying to "second-guess the structure of those data in advance" (section 7.4, p.782). That seems like a fine idea, but I don't think it's actually what they did in this particular computational model. In particular, didn't they predefine the hypothesis space of semantic cues? So yes, structure was discovered, but it was discovered in a hypothesis space that had already been constrained (and this is the main point of modern linguistic nativists, I think -- you need a well-defined hypothesis space to get the right generalizations out).

Monday, May 19, 2014

Next time on 6/2/14 @ 3:00pm in SBSG 2221 = Ramscar et al. 2013

Thanks to everyone who was able to join us for our delightful discussion of Kol et al. 2014! We had some really thoughtful commentary on model evaluation. Next time on Jun 2 @ 3:00pm in SBSG 2221, we'll be looking at an article that discusses how children recover from errors during learning, and how this relates to induction problems in language acquisition.

Ramscar, M., Dye, M., & McCauley, S. 2013. Error and expectation in language learning: The curious absence of mouses in adult speech. Language, 89(4), 760-793.

http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/RamscarEtAl2013_RecoveryFromOverreg.pdf

See you then!

Friday, May 16, 2014

Some thoughts on Kol et al. 2014

I completely love that this paper is highlighting the strength of computational models for precisely evaluating theories about language learning strategies (which is an issue near and dear to my heart). As K&al2014 so clearly note, a computational model forces you to implement all the necessary pieces of your theory and can show you where parts are underspecified. And then, when K&al2014 demonstrate the issues with the TBM, they can identify what parts seem to be causing the problem and where the theory needs to include additional information/constraints.

On a related note, I love that K&al2014 are worrying about how to evaluate model output — again, an issue I’ve been thinking about a lot lately.  They end up doing something like a bigger picture version of recall and precision — we don’t just want the model to generate all the true utterances (high recall). We want it to also not generate the bad utterances (high precision). And they demonstrate quite clearly that the TBM’s generative power is great…so great that it generates the bad utterances, too (and so has low precision from this perspective). Which is not so good after all.
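In miniature (with made-up utterance sets), the idea looks like this:

```python
# Treat the model's generable utterances and the attested (good) utterances
# as sets; recall = covering the good stuff, precision = avoiding the bad.

attested = {"I want juice", "she sleeps"}                    # the good utterances
generated = {"I want juice", "she sleeps", "want I juice"}   # what the model can produce

recall = len(generated & attested) / len(attested)      # 1.0: everything good is covered
precision = len(generated & attested) / len(generated)  # ~0.67: it overgenerates
overgenerated = generated - attested                    # {"want I juice"}
```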

But what was even more interesting to me was their mention of measures like perplexity to test the “quality of the grammars” learned, with the idea that good quality grammars make the real data less perplexing. Though they didn’t do it here, I wonder if there’s a reasonable way to do that for the TBM learning strategy — it’s not a grammar exactly, but it’s definitely a collection of units and operations that can be used to generate an output. So, as long as you have a generative model for how to produce a sequence of words, it seems like you could use a perplexity measure to compare this particular collection of units and operations against something like a context-free grammar (or even just various versions of the TBM learning strategy).
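A rough sketch of what that comparison could look like, assuming only that each candidate "grammar" can assign a probability to an utterance (the model interface and toy utterances here are hypothetical):

```python
# Per-word perplexity on held-out child-produced utterances; lower = the
# "grammar" (TBM units-and-operations, a CFG, etc.) finds the data less surprising.

import math

def perplexity(model_prob, utterances):
    """model_prob(utterance) -> probability the model assigns to it."""
    total_log_prob = 0.0
    total_words = 0
    for u in utterances:
        total_log_prob += math.log2(max(model_prob(u), 1e-12))  # floor avoids log(0)
        total_words += len(u.split())
    return 2 ** (-total_log_prob / total_words)

# Toy usage with a made-up model that assigns a fixed probability per word:
def toy_model_prob(u, p_word=0.05):
    return p_word ** len(u.split())

print(perplexity(toy_model_prob, ["I want juice", "she sleeps"]))  # -> 20.0 (= 1/0.05)
```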

Some more targeted thoughts:

(1) K&al2014 make a point in the introduction that simulations that “specifically implement definitions provided by cognitive models of language acquisition are rare”.  I found this a very odd thing to say — isn’t every model an implementation of some theory of a language strategy? Maybe the point is more that we have a lot of cognitive theories that don’t yet have computational simulations.

(2) There’s a certain level of arbitrariness that K&al2014 note for things like how many matching utterances have to occur for frames to be established (e.g., if it occurs twice, it’s established).  Similarly, the preference for choosing consecutive matches over non-consecutive matches is more important than choosing more frequent matches. It’s not clear there are principled reasons for this ordering (at least, not from the description here — and in fact, I don’t think the consecutive preference is even implemented in the model K&al2014 put together later on). So, in some sense, these are sort of free parameters in the cognitive theory.


(3) Something that struck me about having high recall on the child-produced utterances with the TBM model — K&al2014 find that the TBM approach can account for a large majority of the utterances (in the high 80s and sometimes 90s). But what about the rest of them (i.e., those 10 or 20% that aren’t so easily reconstructable)? Is it just a sampling issue (and so having denser data would show that you could construct these utterances too)? Or is it more what the linguistic camp tends to assume, where there are knowledge pieces that aren’t a direct/transparent translation of the input? In general, this reminds me of what different theoretical perspectives focus their efforts on — the usage-based camp (and often the NLP camp for computational linguistics) is interested in what accounts for most of everything out there (which can maybe be thought of as the “easy” stuff), while the UG-based camp is interested in accounting for the “hard” stuff (even though that may be a much smaller part of the data).

Monday, May 5, 2014

Next time on 5/19/14 @ 3:00pm in SBSG 2221 = Kol et al. 2014

Thanks to everyone who was able to join us for our thorough discussion of Orita et al. 2013! We had some really excellent ideas for how to extend the model to connect with children's interpretations of utterances. Next time on May 19 @ 3:00pm in SBSG 2221,  we'll be looking at an article that discusses how to evaluate formal models of acquisition, focusing on a particular model of early language acquisition as a case study:

Kol, S., Nir, B., & Wintner, S. 2014. Computational evaluation of the Traceback Method. Journal of Child Language, 41(1), 176-199.

http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/KolEtAl2014_CompEvalTraceback.pdf


See you then!

Friday, May 2, 2014

Some thoughts on Orita et al. 2013

There are several aspects of this paper that I really enjoyed. First, I definitely appreciate the clean and clear description of the circularity in this learning task, where you can learn about the syntax if you know the referents…and you can learn about the referents if you know the syntax (chicken and egg, check). 

I also love how hard the authors strive to ground their computational model in empirical data. Now granted, the human simulation paradigm may have its own issues (more on this below), but it’s a great way to try to get at least some approximation of the contextual knowledge children might have access to. 

I also really liked the demonstration of the utility of discourse/non-linguistic context information vs. strong syntactic prior knowledge — and how having the super-strong syntax knowledge isn’t enough. This is something that’s a really important point, I think: It’s all well and good to posit detailed, innate, linguistic knowledge as a necessary component for solving an acquisition problem, but it’s important to make sure that this component actually does solve the learning problem (and be aware of what else it might need in order to do so). This paper provides an excellent demonstration of why we need to check this…because in this case, that super-strong syntactic knowledge didn’t actually work on its own. (Side note: The authors are very aware that their model still relies on some less-strong syntactic knowledge, like the relevance of syntactic locality and c-command, but the super-strong syntactic knowledge was on top of that less-strong knowledge.)

More specific thoughts:

(1) The human simulation paradigm (HSP): 
In some sense, this task strikes me as similar to ideal learner computational models — we want to see what information is useful in the available input. For the HSP, we do this by seeing what a learner with adult-level cognitive resources can extract. For ideal learners, we do this by seeing what inferences a learner with unlimited computational resources can make, based on the information available. 

On the other hand, there’s definitely a sense in which the HSP is not really an ideal learner parallel. First, adult-level processing resources are not the same as unlimited processing resources (they’re just better than child-level processing resources). Second, the issue with adults is that they have a bunch of knowledge to build on about how to extract information from both linguistic and non-linguistic context…and that puts constraints on how they process the available information that children might not have. In effect, the adults may have biases that cause them to perceive the information differently, and this may actually be sub-optimal when compared to children (we don’t really know for sure…but it’s definitely different from children).

Something which is specific to this particular HSP task is that the stated goal is to “determine whether conversational context provides sufficient information for adults” to guess the intended referent.  But where does the knowledge about how to use the conversational context to interpret the blanked out NP (as either reflexive, non-reflexive, or lexical) come from? Presumably from adults’ prior experience with how these NPs are typically used. This isn’t something we think children would have access to, though, right? So this is a very specific case of that second issue above, where it’s not clear that the information adults extract is a fair representation of the information children extract, due to prior knowledge that adults have about the language.

Now to be fair, the authors are very aware of this (they have a nice discussion about it in the Experiment 1 discussion section), so again, this is about trying to get some kind of empirical estimate to base their computational model’s priors on. And maybe in the future we can come up with a better way to get this information.  For example, it occurs to me that the non-linguistic context (i.e., environment, visual scene info) might be usable. If the caretaker has just bumped her knee, saying “Oops, I hurt myself” is more likely than “Oops, I hurt you”. It may be that the conversational context approximated this to some extent for adults, but I wonder if this kind of thing could be extracted from the video samples we have on CHILDES. What you’d want to do is do a variant of the HSP where you show the video clip with the NP beeped out, so the non-linguistic context is available, along with the discourse information in the preceding and subsequent utterances.

(2) Figure 2: Though I’m fairly familiar with Bayesian models by now, I admit that I loved having text next to each level reminding me what each variable corresponded to. Yay, authors.


(3) General discussion point at the end about unambiguous data: This is a really excellent point, since we don’t like to have to rely on the presence of unambiguous data too much in real life (because typically when we go look for it in realistic input, it’s only very rarely there). Something I’d be interested in is how often unambiguous data for this pronoun categorization issue does actually occur. If it’s never (or almost never, relatively speaking), then this becomes a very nice selling point for this learning model.

Monday, April 14, 2014

Next time on 5/5/14 @ 2:15pm in SBSG 2221 = Orita et al. 2013

Thanks to everyone who was able to join us for our invigorating discussion of the Han et al. 2013 manuscript! Next time on May 5 @ 2:15pm in SBSG 2221,  we'll be looking at an article that presents a Bayesian learning model for pronoun acquisition, with a special focus on the role of discourse information:

Orita, N., McKeown, R., Feldman, N. H., Lidz, J., & Boyd-Graber, J. 2013. Discovering Pronoun Categories using Discourse Information. Proceedings of the Cognitive Science Society.

http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/OritaEtAl2013_Pronouns.pdf


See you on May 5!

Thursday, April 10, 2014

Some thoughts on the Han et al. 2013 Manuscript

One of the things I greatly enjoyed about this paper is that it really takes a tricky learning issue seriously: What happens if you don't get any indicative data about a certain hypothesis space (in this case, defined as a set of possible grammars related to verb-raising)? Do humans just remain permanently ambivalent (which is a rational thing to do, and what I think any Bayesian model would do), or do they pick one (somehow)? The super-tricky thing in testing this, of course, is how you find something that humans have no input about and actually ascertain what grammar they picked. If there's no input (i.e., a relevant example in the language) that discerns between the grammar options, how do you tell?

And that's how we find ourselves in some fairly complex syntax and semantics involving quantifiers and negation in Korean, and their relationship to verb-raising. I did find myself somewhat surprised by the (apparent) simplicity of the test sentences (e.g., the equivalent of "The man did not wash every car in front of his house"). Because the sentences are so simple, I'm surprised they wouldn't occur at all in the input with the appropriate disambiguating contexts (i.e., the subset of these sentences that occur in a neg>every-compatible context, like the man washing 2 out of 3 of the cars in the above example). Maybe this is more about their relative sparseness, with the idea that while they may appear, they're so infrequent that they're just not noticeable by a human learner during the lifespan.  But that starts to become a tricky argument when you get to adults -- you might think that adults encountering examples like these over time would eventually learn from them. (You might even argue that this happened between session one and session two for the adults that were tested one month apart: they learned (or solidified learning) from the examples in the first session and held onto that newly gained knowledge for the second session.)

One reason this matters is that there's a big difference between no data and sparse data for a Bayesian model. Nothing can be learned from no data, but something (even if it's only a very slight bias) can be learned from sparse data, assuming the learner pays attention to those data when they occur.  For this reason, I'd be interested in some kind of corpus analysis of realistic Korean input of how often these type of negation + object quantifier sentences occur (with the appropriate disambiguating context ideally, but at the very least, how often the sentence structure itself occurs).  If they really don't occur at all, this is interesting, since we have the idea described in the paper that humans are picking a hypothesis even in the absence of input (and then we have to investigate why). If these sentences do occur, but only very sparsely, this is still interesting, but speaks more about the frequency thresholds at which learning occurs.