Monday, March 11, 2013

See you in the spring!

Thanks so much to everyone who was able to join us for our feisty and fantastic discussion today, and to everyone who's joined us throughout the winter quarter! The CoLa Reading Group will resume in the spring quarter, when it will coincide with the seminar course "Computational Models of Language Learning", taught by me and Mark Steyvers.

See you then!

Friday, March 8, 2013

Some thoughts on Frank et al. (2012)

I definitely appreciate that the authors are trying to explore provocative ideas - specifically, it seems like they want to claim that hierarchical structure isn't required for language use (defined as production, comprehension, and acquisition). However, from what I can tell, the evidence they present is more about how they can lessen the amount of hierarchical structure required for any given aspect of language use, rather than eliminate it altogether (e.g., section 3b: "...evidence for the primacy of sequential processing...", not "sequential processing is the only type of processing going on"; section 4c: "...while a syntactic structure is only assigned at a later stage...", not "while a syntactic structure is never assigned"; section 5a: "...reanalyses that deemphasize hierarchical structure...", not "reanalyses that eliminate hierarchical structure"). This then seems to play into the continuing debate going on about exactly what kind of structure is required for language (for example, generativist representations vs. constructionist representations). And this debate isn't particularly new, as far as I'm aware.

The basic issue that kept occurring to me as I read this was that hierarchy, in its most basic conception, is the idea that you have units that are made out of other units (which can be made out of other units, etc.). Constructing these hierarchical units (or constituents, if you prefer) is one idea of how you derive meaning from a sequence of word-units, for example.  As far as I can tell, I don't think the authors would argue against this view of hierarchical structure. (And if they did, it's unclear to me what alternative they would propose to replace it.)

Also, the authors don't appear to be unhappy with the idea that hierarchy is part of the complete knowledge representation that's built for language (and so would therefore be the target of acquisition). Their claim seems to be more about how we don't need to use all that hierarchical knowledge all the time when we're producing or comprehending language (they try to claim this for acquisition as well, but that seems more tenuous if we think the point of acquisition is to acquire the target knowledge representation).  If we focus just on production and comprehension, I think they still need to be more explicit about how to get from a sequence of word units to the complex knowledge representation an entire sentence corresponds to (they say something like this in section 5c: "...if subjects are motivated to read for comprehension, if sentence meaning depends on the precise (hierarchical) sentence structure.."). They present a kind of idea about this with the parallel streams in Figure 1, but I think this doesn't really take care of the underlying problem of constructing compositional (and non-compositional) meaning (more on this below).

Some more targeted thoughts:

The issue of translating between linear pieces and the entire meaning of a sentence appears right at the beginning, with example 2 in particular. While it's true that the pieces can be chunked this way, if you don't have some kind of additional relationship between "sentences" and "can be analysed" (for example, IP if we think of these pieces as NP and VP), how do you know how to put them together to get the larger meaning corresponding to "sentences can be analysed"? And if you do have that relationship somewhere, isn't that equivalent to having hierarchical structure, since these two pieces are subsumed under a larger unit (called IP above)?
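To make that worry concrete, here's a minimal toy sketch (my own illustration, not anything from the paper) of the difference between a flat sequence of chunks and a representation that records how the chunks combine into a larger unit:

```python
# Toy illustration: a flat chunk sequence vs. a structure recording how chunks combine.
# The labels (NP, VP, IP) are just convenient placeholders.

flat = ["sentences", "can be analysed"]   # two chunks, no relation between them

# To interpret "sentences can be analysed", we need to know that the first chunk
# is the subject of the second - i.e., that some larger unit contains both.
tree = ("IP",
        ("NP", "sentences"),
        ("VP", "can be analysed"))

def yield_of(node):
    """Read the word string back off a (label, child, child, ...) tuple."""
    if isinstance(node, str):
        return node
    return " ".join(yield_of(child) for child in node[1:])

print(yield_of(tree))  # sentences can be analysed
```

The moment we store which chunk goes with which (the tuple structure above), we've admitted a part-whole relationship, which is all that hierarchy means in the basic sense.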

In section 2, they mention the idea that "the mechanisms employed for language learning and use are likely to be fundamentally sequential in nature, rather than hierarchical". I have no problem with talking about the mechanism this way - in fact, that makes perfect sense (incremental processing, etc.).  But isn't the mechanism distinct from the knowledge representations being manipulated?  And that's the part whose structure people generally argue about?

In section 3c, where they talk about some of the models that can learn different aspects of syntax by just using sequential information, do they believe that the target knowledge for these structures doesn't involve hierarchy at all? If they believe the target knowledge does in fact involve hierarchy, then this falls back onto the mechanism vs. knowledge distinction I mentioned above.  If they instead think there's no hierarchy even in the target knowledge, then I think they run into the basic problem of how you map words to sentential meaning without hierarchy (or dependency relations, etc.).  I think they're aiming towards the former idea where hierarchy is present in the target knowledge (section 4b on combining constructions: "...seems intuitive to regard a combination of constructions as a part-whole relation...").

This then brings me to the parallel sequential streams idea presented in Figure 1.  The fact that the pieces combine into a whole seems to be exactly what hierarchy accomplishes (part-whole relationships, etc.).  Beyond this, it seems one thing to slot pieces together in a parallel stream, and another to create a mental model from this (i.e., get the interpretation of the whole meaning once the pieces are composed together in particular ways).


~~~
Also, here's an excellent blog post by someone very knowledgeable who read this article last year, and had some very specific (and similar) things to say about it:
Norbert Hornstein's Faculty of Language: Three psychologists walk into a bar...



Monday, February 25, 2013

Next time on 3/11 @ 2:15 pm in SBSG 2221 = Frank et al. (2012)


Thanks to everyone who joined our meeting this week, where we had a very helpful discussion about some of the ideas in Thiessen & Pavlik (2012 forthcoming)! Next time on Monday March 11 @ 2:15pm in SBSG 2221, we'll be looking at an article that investigates the necessity of hierarchical structure in language comprehension, production, and acquisition:

Frank, S., Bod, R., & Christiansen, M. 2012. How hierarchical is language use? Proc. R. Soc. B, published online 12 September 2012. doi: 10.1098/rspb.2012.1741.


See you then!

-Lisa

Friday, February 22, 2013

Some thoughts on Thiessen & Pavlik 2013

I found this paper a very enjoyable read, and I like very much that it's looking at the building blocks of distributional learning.  This seems like the next step forward - we want to know not just that statistical/distributional learning works, but also what the underlying cognitive pieces are that make it work. It's a nice demonstration of a particular story of how cognitive pieces could fit together and make distributional learning work for a few different language acquisition tasks, and it definitely aims to be an algorithmic-level ("mechanistic") account of this process. One of the things that was really good is how clear the authors are that this is only one story - that is, it's an existence proof that this account could work.  It doesn't preclude other accounts, but it does shore up support for this account by showing that it does, indeed, work.

The model they propose seems like it can be very prone to initial snowballing, where small initial errors persist and cause larger errors later on.  (This may or may not be a bad thing, if we're concerned with actual human learning.) For example, in the first simulation, they mention how sometimes their bimodal input resulted in a unimodal representation, due to exactly this kind of thing. Also, it did seem like there were a fair number of free parameters involved - of course, the nice thing is that some of those parameters have explanatory power, since we can manipulate them to get different qualitative learning effects. Aiming for qualitative patterns rather than exact behavioral matches seems exactly right, though - there are other factors contributing to the observed output behavior, and the authors (quite reasonably) are only modeling some of them.

Something else notable about this model is that it's geared only towards tasks that involve abstraction.  Now, of course, many acquisition tasks are about some kind of abstraction, but some aren't (like word segmentation) - so it's worth remembering that even if this is how (some kinds of) distributional learning are implemented, we still need some explanation for how other non-abstraction tasks are accomplished.   I also like how much they grounded the underlying cognitive pieces of their model in existing models of (long-term) memory - this does my empirical heart good.

Some more targeted thoughts:

I like how they pointed out on p.3 that statistical learning doesn't just have to be about transitional probabilities. Sometimes, these really get equated, and it's a little unfair to the enterprise of statistical learning to talk about it as if it's just dealing with conditional relations. (Of course, much of the experimental work looking at children's inference capabilities involves testing conditional relationships, and many computational models assume conditional relationship tracking abilities.)
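For concreteness, here's a minimal sketch of the kind of conditional statistic that often gets equated with statistical learning as a whole - forward transitional probability over a toy syllable stream (my own example, not from the paper):

```python
from collections import Counter

def transitional_probs(syllables):
    """Forward transitional probability TP(y|x) = count(x, y) / count(x)."""
    unigrams = Counter(syllables[:-1])              # x counts (as a left context)
    bigrams = Counter(zip(syllables, syllables[1:]))  # (x, y) counts
    return {(x, y): bigrams[(x, y)] / unigrams[x] for (x, y) in bigrams}

# Toy stream in the style of artificial-language segmentation studies
stream = "pa bi ku go la tu pa bi ku da ro pi go la tu".split()
for pair, tp in sorted(transitional_probs(stream).items(), key=lambda kv: -kv[1]):
    print(pair, round(tp, 2))
```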

The discussion of making inferences from exemplars on p.4 seemed a little simplified to me.  For example, while I can imagine that it's often the case that exemplars occurring more frequently will be weighted more than exemplars occurring rarely, it's not obvious to me that this is always the case. Instead, it seems like it would depend on the learner's hypothesis space.  For instance, in a subset-superset hypothesis space, one counter-example seems like it could be very heavily weighted, even if it occurs rarely.  As another example, the authors talk about how exemplar similarity depends "at least in part upon the variability of the exemplars in the input set". I could imagine that this is true, but I think it also depends on the learner's biases about the hypothesis space - in effect, learner-subjective variability rather than objective variability.

The authors mention on p.7 that they selected the specific linguistic tasks they did because language is a domain where domain-specific mechanisms have often been argued to be at work. I wonder if domain-specific mechanisms have been proposed for the specific linguistic tasks they chose, though - it seems like the type of mechanism proposed depends very much on the task.  So, if the authors want to argue that their results show domain-specific mechanisms aren't needed, they do need to address the specific problems where domain-specific mechanisms have been proposed. It wasn't clear to me that this was done, which makes that argument a little weaker to me.

I thought the ability to explain why variable contexts facilitate the learning of phonetic distinctions (basically, due to having a holistic representation of input exemplars) was really excellent. In effect, the "irrelevant" part of the representation helps keep the "relevant" part distinct. This really argues for not just context-sensitive storage of data from the input, but holistic storage. And this also ties into the idea that minimal pairs are probably helpful to linguists, but not to children.

The basic components of the distributional statistical learning process seem quite reasonable: similarity-based activation of prior memories, strength-based learning of features, abstraction of irrelevant features, and memory decay.  The second and third components do implicitly assume that the learner has a reasonable set of features to begin with, though. This is a non-trivial assumption, especially when you start thinking about the hypothesis space of possible features.  For example, this shows up in simulation 1, where only certain phonetic features are picked out as even in the hypothesis space to begin with.
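Here's a very rough sketch of how those four components could fit together - my own toy exemplar model, loosely in the MINERVA spirit rather than the authors' actual iMinerva implementation, with made-up functions and parameter values purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def similarity(probe, trace):
    # Similarity via normalized dot product (cosine), just for illustration
    return float(probe @ trace / (np.linalg.norm(probe) * np.linalg.norm(trace) + 1e-9))

class ToyExemplarLearner:
    def __init__(self, decay=0.95, blend_rate=0.3):
        self.traces = []       # stored exemplar traces (feature vectors)
        self.strengths = []    # per-trace strengths, subject to decay
        self.decay = decay
        self.blend_rate = blend_rate

    def encode(self, exemplar):
        # (4) memory decay: existing traces weaken a little with each new exemplar
        self.strengths = [s * self.decay for s in self.strengths]
        # (1) similarity-based activation of prior memories, weighted by strength
        activations = [similarity(exemplar, t) * s
                       for t, s in zip(self.traces, self.strengths)]
        # (2)+(3) strength-based blending with activated traces: features shared
        # across similar traces get reinforced, idiosyncratic ones get washed out
        if activations:
            blended = sum(a * t for a, t in zip(activations, self.traces))
            exemplar = exemplar + self.blend_rate * blended
        self.traces.append(exemplar)
        self.strengths.append(1.0)

learner = ToyExemplarLearner()
for _ in range(20):
    learner.encode(rng.normal(loc=1.0, scale=0.2, size=4))
print(len(learner.traces), "traces stored")
```

Even in this toy version, the point about features is visible: the blending step only makes sense if the exemplars already come packaged as vectors over some feature set.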

The effectiveness of the learner really comes from being able to compare across exemplars, which means particular modeling assumptions - such as assuming the learner is memoryless or that the learner is limited to one exemplar at a time - become not so harmless.

I thought it was slightly unfair on p.38 to differentiate the current model from prior models by saying prior models "have been focused on acquiring relatively domain-specific kinds of knowledge...meaning they are not easily applied to other domains". It seemed to me that the current model can only be applied to different domains because the domain-specific knowledge has been built in as part of the feature descriptions. So maybe the point was simply that prior models didn't try to separate out the more-general components from the task-specific components.

I really appreciated the discussion on p.40 about what different parameter values for different learning tasks might actually imply for children. I don't think modelers are always so careful in evaluating what the model parameters & parameter values mean.

A nice aspect of this model discussed in Appendix A is how it can basically recover from spurious examples in the input. Because actual exemplars are kept around (in addition to increasingly more abstract interpretations), one spurious example and its created interpretation can be overrun by lots of non-spurious (i.e., good) exemplars.

Monday, January 28, 2013

Next time on 2/25/13 @ 2:15pm in SBSG 2221 = Thiessen & Pavlik (forthcoming)


Thanks to everyone who joined our meeting this week, where we had a very enlightening discussion about some of the ideas in Stabler (2009b)! Next time on Monday February 25 @ 2:15pm in SBSG 2221, we'll be looking at an article that investigates a single computational learning framework (and general distributional learning strategy) for multiple language learning tasks:

Thiessen, E., & Pavlik, P. 2012 forthcoming. iMinerva: A Mathematical Model of Distributional Statistical Learning. Cognitive Science.


See you then!

Friday, January 25, 2013

Some thoughts on Stabler (2009b)

One of the things I really appreciated about this article was the clear intention to connect the kind of computational models & problems learnability researchers typically worry about with the kind of realistic language acquisition and language use problems that linguistics & psychology researchers typically worry about. A nice example of this was the connection to syntactic bootstrapping, which showed up in some of the later sections.  I also found myself thinking a few times about the connection between some of these ideas and the issue of language evolution (more on this below), though I suspect this often comes up whenever language universals are discussed.

More targeted thoughts:

The connection with language evolution: I first thought about this in the introduction, where Stabler talks about the "special restrictions on the range of structural options" and the idea that some of the language universals "may guarantee that the whole class of languages with such properties is 'learnable' in a relevant sense." The basic thought was that if the universals didn't help language be learned, they probably wouldn't have survived through the generations of language speakers.  This could be because those universals take advantage of already existing cognitive biases humans have for learning, for example.

In section 1, Stabler mentions that it would be useful to care about the universals that apply before more complex abstract notions like "subject" are available. I can see the value of this, but I think most ideas about Universal Grammar (UG) that I'm aware of involve exactly these kinds of abstract concepts/symbols.  And this makes a little more sense once we remember that UG is meant to be (innate) language-specific learning biases, which would therefore involve symbols that only exist when we're talking about language. So maybe Stabler's point is more that language universals that apply to less abstract (and more perceptible) symbols are not necessarily based on UG biases.  They just happen to be used for language learning (and again, contributed to how languages evolved to take the shape that they do).

I'm very sympathetic to the view Stabler mentions at the end of section 1 which is concerned with how to connect computational description results to human languages, given the idealized/simplified languages for which those results are shown.

I like Stabler's point in section 2 about the utility of learnability results, specifically when talking about how a learner realizes that finite data does not mean that the language itself is finite. This connects very well to what I know about the human brain's tendency towards generalization (especially young human brains).

Later on in section 2, I think Stabler does a nice job of explaining why we should care about results that deal with properties in languages like reversibility (e.g., if it's known that the language has that property, the hypothesis space of possible languages is constrained - coupled with a bias for compact representations, this can really winnow the hypothesis space). My takeaway from that was that these kinds of results can tell us about what kind of knowledge is necessary to converge on one answer/representation, which is good. (The downside, of course, is that we can only use this new information if human languages actually have the properties that were explored.)  However, it seems like languages might have some of these properties, if we look in the domain of phonotactics.  And that makes this feel much more relevant to researchers interested in human language learning.

In section 3, where Stabler is discussing PAC learning, there's some mention of the time taken to converge on a language (i.e., whether the learner is "efficient").  One formal measure of this that's mentioned is polynomial time. I'm wondering how this connects to notions of a reasonable learning period for human language acquisition. (Maybe it doesn't, but it's a first pass attempt to distinguish "wow, totally beyond human capability" from "not".)
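For my own reference, the textbook PAC sample-complexity bound for a finite hypothesis class (a standard result, not a formula from Stabler's paper) is the kind of quantity that "efficient" gets cashed out with, alongside a polynomial-time requirement on the learner's computation:

```latex
% With probability at least 1 - \delta, any hypothesis consistent with m examples
% has true error at most \epsilon, provided (for a finite hypothesis class H):
m \;\geq\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```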

I really liked the exploration of the link between syntax and semantics in section 4. One takeaway point for me was evidence in the formal learnability domain for the utility of multiple sources of information (multiple cues). I wonder if there's any analog for solving multiple problems (i.e., learning multiple aspects of language) simultaneously (e.g., identifying individual words and grammatical categories at the same time, etc.). The potential existence of universal links between syntax and semantics again got me thinking about language evolution, too. Basically, if certain links are known, learning both syntax and semantics is much easier, so maybe these links take advantage of existing cognitive biases. That would then be why languages evolved to capitalize on these links, and how languages with these links got transmitted through the generations.

I also liked the discussion of syntactic bootstrapping in section 4, and the sort of "top-down" approach of inferring semantics, instead of always using the compositional bottom-up approach where you know the pieces before you understand the thing they make up. This seems right, given what we know about children's chunking and initial language productions.


Monday, January 14, 2013

Next time on 1/28/13 @ 2:15pm in SBSG 2221 = Stabler 2009b


Thanks to everyone who joined our meeting this week, where we had a very interesting discussion about some of the ideas in Stabler (2009)! Next time on Monday January 28 @ 2:15pm in SBSG 2221, we'll be looking at another article by Stabler. This time, it's one that reviews computational approaches to understanding language universals:

Stabler, E. 2009b. Computational models of language universals: Expressiveness, learnability and consequences. Revised version appears in M. H. Christiansen, C. Collins, and S. Edelman, eds., Language Universals, Oxford: Oxford University Press, 200-223. Note: Because this is a non-final version, please do not cite without permission from Ed Stabler.

http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/Stabler2009_CompModLangUni.pdf

See you then!
-Lisa

Friday, January 11, 2013

Some thoughts on Stabler (2009)

One of the things I most enjoyed about this paper was the way Stabler gives the intuitions behind the different approaches - in many cases, these are some of the most lucid descriptions I've seen about these different mathematical techniques. I also really appreciated the discussion about model selection - it certainly seems true to me that model selection is what many theoretical linguists are thinking about when they discuss different knowledge representations. Of course, this isn't to say that parameter setting once you know the model isn't worthy of investigation (I worry a lot about it myself!). But I also think it's easier to use existing mathematical techniques to investigate parameter setting (and model selection, when the models are known), as compared to model generation.


Some more targeted thoughts below:

I really liked the initial discussion of "abstraction from irrelevant factors", which is getting at the idealizations that we (as language science researchers) make. I don't think anyone would dispute that it's necessary to do that to get anything done, but the fights break out when we start talking about the specifics of what's irrelevant. A simple example would be frequency - I think some linguists would assume that frequency's not part of the linguistic knowledge that's relevant for talking about linguistic competence, while others would say that frequency is inherently part of that knowledge since linguistic knowledge includes how often various units are used.

I thought Stabler made very good points about the contributions from both the nativist and the empiricist perspectives (basically, constrained hypothesis spaces for the model types but also impressive rational learning abilities) - and he did it in multiple places, highlighting that both sides have very reasonable claims.

The example in the HMM section with the discovery of implicit syllable structure reminded me very much of UG parameter setting.  In particular, while it's true that the learner in this example has to discover the particulars of the unobserved syllable structure, there's still knowledge already (by the nature of the hidden units in the HMM) that there is hidden structure to be discovered (and perhaps even more specific, hidden syllabic structure).  I guess the real question is how much has to be specified in the hidden structure for the learner to succeed at discovering the correct syllable structure - is it enough to know that there's a level above consonants & vowels?  Or do the hidden units need to specify that this hidden structure is about syllables, and then it's just a question of figuring out exactly what about syllables is true for this language?
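To make "how much has to be specified" concrete, here's a toy HMM sketch (mine, not the one from the paper) where the hidden states are syllable positions and the observations are just consonant vs. vowel - notice how much the modeler decides up front: the number of hidden states, their allowed transitions, and what each can emit.

```python
import numpy as np

# Hidden states: syllable positions (chosen by the modeler in advance)
states = ["Onset", "Nucleus", "Coda"]
obs_symbols = ["C", "V"]   # observations: consonant vs. vowel

# Transition probabilities between hidden states (rows sum to 1)
#                Onset  Nucleus  Coda
A = np.array([[0.0,    1.0,     0.0],    # Onset -> Nucleus
              [0.3,    0.0,     0.7],    # Nucleus -> Onset (new syllable) or Coda
              [1.0,    0.0,     0.0]])   # Coda -> Onset (new syllable)

# Emission probabilities: what each hidden state can produce
#                 C     V
B = np.array([[1.0,  0.0],    # Onset emits consonants
              [0.0,  1.0],    # Nucleus emits vowels
              [1.0,  0.0]])   # Coda emits consonants

pi = np.array([1.0, 0.0, 0.0])  # sequences start at an Onset

def forward(obs):
    """Probability of an observation sequence under the HMM (forward algorithm)."""
    idx = [obs_symbols.index(o) for o in obs]
    alpha = pi * B[:, idx[0]]
    for t in idx[1:]:
        alpha = (alpha @ A) * B[:, t]
    return float(alpha.sum())

print(forward(list("CVCCV")))  # a CVC syllable followed by a CV syllable
```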

I was struck by Stabler's comment about whether it's methodologically appropriate for linguists to seek grammar formalisms that guarantee that human learners can, from any point on the hypothesis space, always reach the global optimum by using some sort of gradient descent. This reminds me very much of the tension between the complexity of language and the sophistication of language learning. First, if language isn't that complex, then the hypothesis space de facto probably can be traversed by some good domain-general learning algorithms. If, however, language is complex, the hypothesis space may not be so cleanly structured.  But, if children have innate learning biases that guide them through this "bumpy" hypothesis space, effectively restructuring the hypothesis space to become smooth, then this works out. So it wouldn't be so much that the hypothesis space must be smoothly structured on its own, but rather that it can be perceived as being smoothly structured, given the right learning biases. (This is the basic linguistic nativist tenet about UG, I think - UG are the biases that allow swift traversal of the "bumpy" hypothesis space.)

I also got to thinking about the idea mentioned in the section on perceptrons about how there are many facts about language that don't seem to naturally be Boolean (and so wouldn't lend themselves well to being learned by a perceptron). In a way, anything can be made into a Boolean - this is the basis of binary decomposition in categorization problems.  (If you have 10 categories, you first ask if it's category 1 or not, then category 2 or not, etc.) What you do need is a lot of knowledge about the space of possibilities so you know what yes or no questions to ask - and this reminds me of (binary) parameter setting, as it's usually discussed by linguists. The child has a lot of knowledge about the hypothesis space of language, and is making decisions about each parameter (effectively solving a categorization problem for each parameter - is it value a or value b?, etc.). So I guess the upshot of my thought stream was that perceptrons could be used to learn language, but at the level of implementing the actual parameter setting.
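Here's a minimal sketch (my own, not Stabler's) of that binary-decomposition idea: a three-way "parameter value" decision rebuilt out of one yes/no perceptron per value.

```python
import numpy as np

class Perceptron:
    """Classic perceptron: answers a single yes/no question."""
    def __init__(self, n_features, epochs=20, lr=1.0):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.epochs, self.lr = epochs, lr

    def fit(self, X, y):            # y in {+1, -1}
        for _ in range(self.epochs):
            for xi, yi in zip(X, y):
                if yi * (xi @ self.w + self.b) <= 0:   # misclassified: update
                    self.w += self.lr * yi * xi
                    self.b += self.lr * yi
        return self

    def score(self, x):
        return x @ self.w + self.b

def one_vs_rest(X, labels, categories):
    """Train one 'is it category k or not?' perceptron per category."""
    return {k: Perceptron(X.shape[1]).fit(X, np.where(labels == k, 1, -1))
            for k in categories}

# Toy 3-way problem (think: three values of some 'parameter')
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat([0, 1, 2], 20)
classifiers = one_vs_rest(X, labels, [0, 1, 2])
test = np.array([2.9, 0.1])
print(max(classifiers, key=lambda k: classifiers[k].score(test)))  # -> 1
```

The interesting part, to my mind, is everything outside the perceptrons: someone had to decide what the categories are and which features to feed in, which is exactly the "lot of knowledge about the space of possibilities" bit.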

It was very useful to be reminded that the representation of the problem and the initial values for neural networks are crucial for learning success. This of course implies that the correct structure and values for whatever language learning problem is being modeled must be known a priori (which is effectively a nativist claim, and if these values are specific to language learning, then a linguistic nativist claim). So, the fight between those who use neural networks to explain language learning behavior and those who hold the classic ideas about what's in UG isn't about whether there are some innate biases, or even if those biases are language-specific - it may just be about whether the biases are about the learning mechanism (values in neural networks, for example) or about the knowledge representation (traditional UG biases, but also potentially about network structure for neural nets).

Alas, the one part where I failed to get the intuition that Stabler offered was in the section on support vector machines.  This is probably due to my own inadequate knowledge of SVMs, but given how marvelous the other sections were with their intuitions, I really found myself struggling with this one.

Stabler notes in the section on model selection that model fit cannot be the only criterion for modeling success, since larger models tend to fit the data (and perhaps overfit the data) better than simpler models. MDL seems like one good attempt to deal with this, since it has a simple encoding length metric which it uses to compare models -  encoding not just the data, based on the model, but also the model itself. So, while a larger model may have a more compact data encoding, its larger size counts against it.  In this way, you get some of that nice balance between model complexity and data fit.
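In other words (this is the standard two-part MDL formulation, not anything specific to Stabler's presentation):

```latex
% Two-part MDL: choose the model M minimizing total description length
M^{*} \;=\; \arg\min_{M} \; \underbrace{L(M)}_{\text{bits to encode the model}}
\;+\; \underbrace{L(D \mid M)}_{\text{bits to encode the data given the model}}
```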


Tuesday, January 8, 2013

Winter meeting time set & Jan 14 = Stabler 2009 @ 2:15pm in SBSG 2221


Based on the responses, it seems like Mondays at 2:15pm will work best for everyone's schedules this quarter. Our complete schedule (with specific dates) can now be seen at

http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html

So, let's get kicking!  For our first meeting on Monday January 14 @ 2:15pm in SBSG 2221, we'll be looking at an article that surveys several mathematical approaches to language learning, as well as the assumptions inherent in these various approaches.

Stabler, E. 2009. Mathematics of language learning. Revised version appears in Histoire, Epistemologie, Langage, 31, 1, 127-145. Note: Since this is a non-final version, please do not cite without permission from Ed Stabler.

See you then!

Friday, January 4, 2013

Winter quarter planning

I hope everyone's had a good winter break - and now it's time to gear up for the winter quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading group webpage, including readings on mathematical language learning, statistical learning, and hierarchy in language:

http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html

Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here). Thanks and see you soon!

Wednesday, November 28, 2012

See you in the winter!

Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us throughout the fall quarter! The CoLa Reading Group will resume again in the winter quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Sunday, November 25, 2012

Some thoughts on Frank (2012)

I thought this was a really nice big picture piece about computational modeling work in language acquisition, and it tries (admirably!) to consolidate insights in different domains about the kind of learning assumptions/strategies that are useful. This is such an incredibly good thing to do, I think - one of the questions I get a lot is whether there's one general purpose style of computational model that's the right way to do things, and I'm usually left shrugging and saying, "Depends on what you're trying to do." And to some extent of course, this is right - but there's also something to be said about what the different useful models have in common.

Another note: Despite the empirical coverage, I did feel there was something of a disconnect between the phenomena generative linguists get excited about (w.r.t. poverty of the stimulus, for example - syntactic islands, case theory, etc.) and the phenomena modeled in the studies discussed here. There's nothing wrong with this, since everyone's goal is to understand language acquisition, and that means acquisition of a lot of different kinds of knowledge. But I did wonder how the insights discussed here could be applied to more sophisticated knowledge acquisition problems in language. Frank notes already that it's unclear what insights successful models of more sophisticated knowledge have in common.

Some more targeted thoughts:

Frank focuses on two metrics of model success: sufficiency (basically, acquisition success) and fidelity (fitting patterns of human behavior). I've seen other proposed metrics, such as formal sufficiency, developmental compatibility, and explanatory power (discussed, for example, in Pearl 2010, which is based on prior work by Yang). I feel like formal sufficiency maps pretty well to sufficiency (and actually may cover fidelity too). Developmental compatibility, though, is more about psychological plausibility, and explanatory power is about the ability of the model to give informative (explanatory) answers about what causes the acquisition process modeled. I think all of the studies discussed hold up on the explanatory power metric, so that's fine. It's unclear how well they hold up for developmental compatibility - it may not matter if they're computational-level analyses, for example. But I feel like that's something that should be mentioned as a more prominent thing to think about when judging a computational model. (But maybe that's my algorithmic bias showing through.)

Related point: Frank clearly is aware of the tension between computational-level and algorithmic-level approaches, and spends some time discussing things like incremental vs. batch learning. I admit, I was surprised to see this though: "Fully incremental learning prevents backtracking or re-evaluation of hypotheses in light of earlier data". If I'm understanding this correctly, the idea is that you can't use earlier data at all in a fully incremental model. I think this conflates incremental with memoryless - for example, you can have an incremental learner that has some memory of prior data (usually in some kind of compressed format, perhaps tallying statistics of some kind, etc.). For me, all incremental means is that the learner processes data as it comes in - it doesn't preclude the ability to remember prior data with some (or even a lot of) detail.
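As a toy illustration of the distinction I have in mind (my own sketch, not Frank's): a learner that processes data points strictly one at a time, but still carries a compressed summary of all earlier data in the form of running counts.

```python
from collections import Counter

class IncrementalCountLearner:
    """Processes data points one at a time (incremental), but keeps a
    compressed record of past data (running counts) -- so it is not memoryless."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, observation):
        # One data point at a time: no batch access to the full corpus
        self.counts[observation] += 1
        self.total += 1

    def probability(self, observation):
        # The current hypothesis still reflects *all* earlier data via the counts
        return self.counts[observation] / self.total if self.total else 0.0

learner = IncrementalCountLearner()
for word in ["doggie", "kitty", "doggie", "ball", "doggie"]:
    learner.update(word)
print(learner.probability("doggie"))  # 0.6
```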

Related point: Human memory constraints. In the word segmentation section, Frank mentions that experimental results suggest that "learners may not store the results of segmentation veridically, falsely interpolating memories that they have heard novel items that share all of their individual transitions within a set of observed items". At first, I thought this was about humans not storing the actual segmentations in memory (and I thought, well, of course not - they're storing the recovered word forms). But the second bit made me think this was actually even more abstract than that - it seems to suggest that artificial language participants were extracting probabilistic rules about word forms, rather than the word forms themselves. Maybe this is because the word forms were disconnected from meaning in the experiments described, so the most compact representation was of the rules for making word forms, rather than the word forms themselves?

I loved the Goldsmith (2010) quote: "...if you dig deep enough into any task in acquisition, it will become clear that in order to model that task effectively, a model of every other task is necessary". This is probably generally true, no matter what you're studying, actually - you always have to simplify and pretend things are disconnected when you start out in order to make any progress. But then, once you know a little something, you can relax the idealizations. And Frank notes the synergies in acquisition tasks, which seems like exactly the right way to think about it (at least, now that we think we know something about the individual acquisition tasks involved). It seems like a good chunk of the exciting work going on in acquisition modeling is investigating solving multiple tasks simultaneously, leveraging information from the different tasks to make solving all of them easier. However, once you start trying to do this, you then need to have a precise model of how that leveraging/integration process works.

Another great quote (this time from George Box): "all models are wrong, but some are useful". So true - and related to the point above. I think a really nice contribution Frank makes is in thinking about ways in which models can be useful - whether they provide a general framework or are formal demonstrations of simple principles, for example.

I think this quote might ruffle a few linguist feathers: "...lexicalized (contain information that is linked to individual word forms), the majority of language acquisition could be characterized as 'word learning'. Inferring the meaning of individual lexical items...". While technically this could be true (given really complex ideas about word "meaning"), the complexity of the syntactic acquisition task gets a little lost here, especially given what many people think about as "word meaning". In particular, the rules for putting words together aren't necessarily connected directly to lexical semantics (though of course, individual word meaning plays a part).

I think the Frank et al. work on intention inference when learning a lexicon demonstrates a nice sequence of research w.r.t. the utility of computational models. Basically, child behavior was best explained by a principle of mutual exclusivity. So, for a while, that was a placeholder, i.e., something like "Use mutual exclusivity to make your decision". Then, Frank et al. came along and hypothesized where mutual exclusivity could come from, and showed how it could arise from more basic learning biases (e.g., "use probabilistic learning this way"). That is, mutual exclusivity itself didn't have to be a basic unit. This reminds me of the Subset Principle in generative linguistics, which falls out nicely from the Size Principle of Bayesian inference.
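For anyone who hasn't run into the Size Principle: the relevant bit of math (standard Bayesian reasoning, not Frank et al.'s notation) is that under strong sampling, the likelihood of n consistent examples shrinks with the size of the hypothesis's extension, so the smaller (subset) hypothesis wins rapidly as examples accumulate:

```latex
% Size Principle: n examples sampled from hypothesis h with extension size |h|
P(d_1, \ldots, d_n \mid h) = \left(\frac{1}{|h|}\right)^{n},
\qquad
\frac{P(d_{1:n} \mid h_{\text{subset}})}{P(d_{1:n} \mid h_{\text{superset}})}
= \left(\frac{|h_{\text{superset}}|}{|h_{\text{subset}}|}\right)^{n}
\;\to\; \infty \text{ as } n \to \infty.
```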

It's an interesting idea that humans do best at learning when there are multiple (informationally redundant) cues available, as opposed to just one really informative cue. I'm not sure if the Mintz frequent frame is a really good example of this, though - it seems like a frame vs. a bigram is really just the same kind of statistical cue. Though maybe the point is more that the framing words provide more redundancy, rather than being different kinds of cues.
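To make the comparison concrete, here's a quick sketch (mine) of the two statistics side by side: a bigram conditions on one neighboring word, while a Mintz-style frequent frame conditions jointly on the word before and the word after the target slot.

```python
from collections import Counter, defaultdict

def frames_and_bigrams(tokens):
    """Collect a__b frames (preceding word, following word) and plain bigrams."""
    frames = defaultdict(Counter)   # (a, b) -> words seen in the middle slot
    bigrams = Counter(zip(tokens, tokens[1:]))
    for a, x, b in zip(tokens, tokens[1:], tokens[2:]):
        frames[(a, b)][x] += 1
    return frames, bigrams

tokens = "you want to go you want to eat you have to go".split()
frames, bigrams = frames_and_bigrams(tokens)
print(frames[("to", "you")])    # words appearing in the frame to__you: go, eat
print(frames[("want", "go")])   # words appearing in the frame want__go: to
print(bigrams[("want", "to")])  # plain bigram count
```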

It's also a really interesting idea to measure success by having the output of a model be an intermediate representation used in some other task that has an uncontroversial gold standard. Frank talks about it in the context of syntactic categories, but I could easily imagine the same thing applying to word segmentation. It's definitely a recurring problem that we don't want perfect segmentation for models of infant word segmentation - but then, what do we want? So maybe we can use the output of word segmentation as the input to word- (or morpheme-) meaning mapping.

It took me a little while to understand what "expressive" meant in this context. I think it relates to the informational content of some representation - so if a representation is expressive, it can cover a lot of data while being very compact (e.g., rule-based systems, instead of mappings between individual lexical items). A quote near the end gets at this more directly: "...it becomes possible to generate new sentences and to encode sentences more efficiently. At all levels of organization, language is non-random: it is characterized by a high degree of redundancy and hence there is a lot of room for compression." I think this is basically an information-theoretic motivation for having a grammar (which is great!). In a similar vein, it seems like this would be an argument in favor of Universal Grammar-style parameters, because they would be a very good compression of complex regularities and relationships in the data.

~~~

References

Pearl, L. 2010. Using computational modeling in language acquisition research. In E. Blom & S. Unsworth (eds.), Experimental Methods in Language Acquisition Research. John Benjamins.

Wednesday, November 14, 2012

Next time on 11/28/12 @ 2pm in SBSG 2221 = Frank (2012)


Thanks to everyone who participated in our vigorous and thoughtful discussion of Hsu et al. (2011)!  For our next meeting on Wednesday November 28th @ 2pm in SBSG 2221, we'll be looking at a paper that investigates the role of computational models in the study of early language acquisition and how to evaluate them.

Frank, M. 2012. Computational models of early language acquisition. Manuscript, Stanford University.

Monday, November 12, 2012

Some thoughts on Hsu et al. 2011

So this seems to be more of an overview paper showcasing how to apply a probabilistic learning framework at the computational level to problems in language acquisition, whether we're concerned with theoretical learnability results or predicting observable behavior. As a followup to Hsu & Chater (2010), which we discussed a few years back, this re-emphasized some of the nice intuitions in the MDL framework (such as "more compact representations are better").  I think a strength of this framework is its ability to identify linguistic knowledge pieces that are hard to learn from the available data, since this is exactly the sort of thing poverty of the stimulus (PoS) is all about. (Of course, the results rest on the particular assumptions made about the input, forms of the rules, etc., but that's true of all computational analyses, I think.)  On a related note, I did notice that nearly all the phenomena examined by Hsu et al. were based on lexical item classification (verb argument subcategorization) or contraction (what generativists might call "traces" in some cases). This is fine (especially the "wanna" case, which I have seen actually used in PoS arguments), but I was surprised that we're not really getting into the kind of complex sentential semantics or syntax that I usually see talked about in generativist circles (e.g., syntactic islands, case theory - see Crain & Pietroski (2002) for some examples on the semantic side). Also, even though Hsu et al.'s own analysis shows that wanna & that-traces are "practically" unlearnable (i.e., even with probabilistic learning, these look like PoS problems), it seems like they close this paper by sort of downplaying this ("probabilistic language learning is theoretically and computationally possible").

Some more targeted thoughts below:

I think my biggest issue with the computational learnability analyses (and proofs) is that I find it very hard to connect them to the psychological problem of language acquisition that I'm used to thinking about.  (In fact, Kent Johnson in UCI's LPS department has a really nice 2004 paper talking about how this connection probably shouldn't have been made with the (in)famous Gold (1967) learnability results.) I do understand that this type of argument is meant to combat the claim about the "logical problem of language acquisition", with the specific interpretation that the "logical problem" comes from computational learnability results (and the Gold paper in particular). However, I've also seen "logical problem of language acquisition" apply to the simple fact that there are induction problems in language acquisition, i.e., the data are compatible with multiple hypotheses, and "logically" any of them could be right, but only one actually is, so "logical problem".  This second interpretation still seems right to me, and I don't feel particularly swayed to change this view after reading the learnability results here (though maybe that's (again) because I have trouble connecting these results to the psychological problem).

Related to the point above - in section 2, where we see a brief description of the learnability proof, the process is described as an algorithm that "generates a sequence of guesses concerning the generative probabilistic model of the language".  Are these guesses probabilities over utterances, probabilities over the generative grammars that produce the utterances, something else?  It seems like we might want them to be probabilities over the generative grammars, but then don't we need some definition of the hypothesis space of possible generative grammars?

I had a little trouble understanding the distinction that Hsu et al. were making between discriminative and generative models in the introduction. Basically, it seemed to me that "discriminative" behavior could be the output of a generative model, so we could view a discriminative model as a special case of a generative model. So is the idea that we really want to emphasize that humans are identifying the underlying probability distribution, instead of just making binary classifications based on their grammars? That is, that there is no such thing as "grammatical" and "ungrammatical", but instead these are epiphenomena of thresholding a probabilistic system?
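For reference, the textbook version of the distinction I'm assuming here (not necessarily Hsu et al.'s exact framing): a generative model specifies a distribution over the data themselves, a discriminative model only the conditional for the judgment, and the latter can always be read off the former.

```latex
% Generative: model P(\text{sentence}) or P(\text{sentence}, \text{label});
% Discriminative: model only P(\text{label} \mid \text{sentence}).
% A classification can be recovered from the generative model via Bayes' rule:
P(\text{label} \mid \text{sentence}) =
\frac{P(\text{sentence} \mid \text{label})\, P(\text{label})}{P(\text{sentence})}
```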

In section 3, at the very end, Hsu et al. mention that the ideal statistical learner provides an "upper bound" on learnability.  I found this somewhat odd - I always thought of ideal learners as providing a lower bound in some sense, since they're not constrained by cognitive resource limitations, and are basically looking at the question of whether the data contain enough information to solve the problem in question.

The practical example in 3.2 with the "going to" contraction threw me for a bit, since I couldn't figure out how to interpret this: "Under the new grammar, going to contraction never occurs when to is a preposition and thus 0 bits are required to encode contraction." Clearly, the intent is that "no contraction" is cheaper to encode than the process of contraction, but why was that? Especially since the new grammar that has the "don't contract when to is a preposition" seems to require an extra rule.  Looking back to Hsu & Chater (2010), it seems to be that rules with probability 1 (like going to --> going to when to=prep) require 0 bits to encode.  So in effect, the new grammar that has a special exception when to is a preposition gets a data encoding boost, even though the actual grammar model is longer (since it has this exception explicitly encoded).  So,  "exceptions" that always apply (in a context-dependent way) are cheaper than general rules when the observable data appear in that context.
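The arithmetic I'm relying on here, as I understand the MDL setup: each rule choice with probability p costs -log2(p) bits, so a choice that is deterministic in its context costs nothing each time that context appears in the data.

```latex
\text{bits per choice} = -\log_2 p,
\qquad p = 1 \;\Rightarrow\; -\log_2 1 = 0 \text{ bits},
\qquad p = 0.5 \;\Rightarrow\; 1 \text{ bit}.
```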

I liked the idea that learnability should correlate with grammaticality judgments, with the idea that more "learnable" rules (i.e., ones with more data in the input) are encountered more and so their probabilities are stronger in whichever direction. In looking at the computational results though, I have to admit I was surprised that "going to" ranked 12th in learnability (Fig 2), maybe putting it on the order of 50 years to learn. That rule seems very easy, and I assume the grammaticality judgments are very strong for it. (My intuitions are at least.)

A small methodological quibble, section 4.1: "...because many constructions do not occur often enough for statistical significance [in child-directed speech]...we use...the full Corpus of Contemporary American English." Isn't this the point for PoS arguments, though?  There are differences between child-directed and adult-directed input (especially between child-directed speech and adult-directed written text), especially at this lexical item level that Hsu et al. are looking at (and also even at very abstract levels like wh-dependencies: Pearl & Sprouse (forthcoming)). So if we don't find these often enough in child-directed speech, and the thing we're concerned with is child acquisition of language, doesn't this also suggest there's a potential PoS problem?

I liked that Hsu et al. connect their work to entrenchment theory, and basically provide a formal (computational-level) instantiation of how/why entrenchment occurs.

~~~
References

Crain, S. & P. Pietroski. 2002. Why language acquisition is a snap. The Linguistic Review, 19, 163-183.

Gold, E. 1967. Language Identification in the Limit. Information and Control, 10, 447-474.

Hsu, A. & N. Chater. 2010. The Logical Problem of Language Acquisition: A Probabilistic Perspective. Cognitive Science, 34, 972-1016.

Johnson, K. 2004. Gold's Theorem and Cognitive Science. Philosophy of Science, 71, 571-592.

Pearl, L. & J. Sprouse. Forthcoming 2012. Syntactic islands and learning biases: Combining experimental  syntax and computational modeling to investigate the language acquisition problem. Language Acquisition.

Wednesday, October 24, 2012

Next time on 11/14 @ 2pm in SBSG 2221 = Hsu et al. 2011


Hi everyone, 
Thanks to everyone who participated in our thoughtful discussion of Gagliardi et al. (2012)!  For our next meeting on Wednesday November 14th @ 2pm in SBSG 2221, we'll be looking at an article that investigates a way to quantify natural language learnability and discusses the impact this has on the debate about the nature of the necessary learning biases for language:

Hsu, A., Chater, N., & Vitanyi, P. 2011. The probabilistic analysis of language acquisition: Theoretical, computational, and experimental analysis. Cognition, 120, 380-390.


See you then!
-Lisa

Monday, October 22, 2012

Some thoughts on Gagliardi et al. (2012)

I thought this was a really lovely Cog Sci paper showcasing how to combine experimental & computational methodologies (and still make it all fit in 6 pages).  The authors really tried to give the intuitions behind the modeling aspects, which makes this more accessible to a wider audience. The study does come off as a foundational one, given the many extensions that could be done (involving effects in younger word learners, cross-linguistic applications, etc.), but I think that's a perfectly reasonable approach (again, given the page limitations).  I also thought the empirical grounding was really lovely for the computational modeling part, especially as relating to the concept priors.  Granted, there are still some idealizations being made (more discussion of this below), but it's nice to see this being taken seriously.

Some more targeted thoughts:

--> One issue concerns the age of the children tested experimentally (4 years old) (and as Gagliardi et al. mention, a future study should look at younger word learners).  The reason is that 4-year-olds are fairly good word learners (and have a vocabulary of some size), and presumably have the link between concept and grammatical category (and maybe morphology and grammatical category for the adjectives) firmly established. So it maybe isn't so surprising that grammatical category information is helpful to them. What would be really nice is to know when that link is established, and the interaction between concept formation and recognition/mapping to grammatical categories.  I could certainly imagine a bootstrapping process, for instance, and it would be  useful to understand that more.

--> The generative model assumes a particular sequence, namely (1) choose the syntactic category, (2) choose the concept, and (3) choose instances of that concept.  This seems reasonable for the teaching scenario in the experimental setup, but what might we expect in a more realistic word-learning environment?  Would a generative model still have syntactic category first (probably not), or instead have a balance between syntactic environment and concept?  Or maybe it would be concept first?  And more importantly, how much would this matter? It would presumably change the probabilities that the learner needs to estimate at each point in the generative process.
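Spelled out, the sequence corresponds to a factorization roughly like the one below (my paraphrase, not the authors' notation), and reordering the steps just amounts to factorizing the same joint probability differently - so the question is which conditional probabilities the learner is actually estimating.

```latex
% Category -> concept -> instances, as in the teaching scenario:
P(\text{category}, \text{concept}, \text{instances}) =
P(\text{category})\; P(\text{concept} \mid \text{category})\; P(\text{instances} \mid \text{concept})
```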

--> I'd be very interested to see the exact way the Mechanical Turk survey was conducted for classifying things as examples of kinds, properties, or both (and which words were used).  Obviously, due to space limitations, this wasn't included here.  But I can imagine that many words might easily be described as both kind & property, if you think carefully enough (or maybe too carefully) about it.  Take "cookie", for example (a fairly common child word, I think): It's got both kind (ex: food) and property aspects (ex: sweet) that are fairly salient. So it really matters what examples you give the participants and how you explain the classification you're looking for. And even then, we're getting adult judgments, where child judgments might be more malleable (so maybe we want to try this exercise with children too, if we can).

--> Also, on a related note, the authors make a (reasonable) idealization that the distribution of noun and adjective dimensions in the 30-month-old CDIs are representative of the "larger and more varied set of words" that the child experimental participants know.  However, I do wonder about the impact of that assumption, since we are talking about priors (which drive the model to use grammatical category information in a helpful way). It's not too hard to imagine children whose vocabularies skew away from this sample (especially if they're older).  Going in the other direction though, if we want to try to extend this to younger word learners, then the CDIs start to become a very good estimate of the nouns and adjectives these children know, so that's very good.


Wednesday, October 10, 2012

Next time on Oct 24 @ 2pm in SBSG 2221 = Gagliardi et al. 2012


Thanks to everyone who participated in our thoughtful discussion of Feldman et al. (2012 Ms)!  For our next meeting on Wednesday October 24 @ 2pm in SBSG 2221, we'll be looking at an article that seeks to model learning of word meaning for specific grammatical categories:

Gagliardi, A., E. Bennett, J. Lidz, & N. Feldman. 2012. Children's Inferences in Generalizing Novel Nouns and Adjectives. In N. Miyake, D. Peebles, & R. Cooper (Eds), Proceedings of the 34th Annual Meeting of the Cognitive Science Society, 354-359.


See you then!

Monday, October 8, 2012

Some thoughts on Feldman et al. (2012 Ms)

So I'm definitely a huge fan of work that combines different levels of information when solving acquisition problems, and this is that kind of study. In particular, as Feldman et al. note themselves, they're making explicit an idea that came from Swingley (2009): Maybe identifying phonetic categories from the acoustic signal is easier if you keep word context in mind.  Another way of putting this is that infants realize that sounds are part of larger units, and so as they try to solve the problem of identifying their native sounds, they're also trying to solve the problem of what these larger units are.  This seems intuitively right to me (I had lots of notes in the margins saying "right!" and "yes!!"), though of course we need to grant that infants realize these larger units exist.

One thing I was surprised about, since I had read an earlier version of this study (Feldman et al. 2009): The learners here actually aren't solving word segmentation at the same time they're learning phonetic categories.  For some reason, I had assumed they were - maybe because the idea of identifying the lexicon items in a stream of speech seems similar to word segmentation.  But that's not what's going on here.  Feldman et al. emphasize that the words are presented with boundaries already in place, so this is a little easier than real life. (It's as if the infants are presented with a list of words, or just isolated words.)  Given the nature of the Bayesian model (and especially since one of the co-authors is Sharon Goldwater, who's done work on Bayesian segmentation), I wonder how difficult it would be to actually do word segmentation at the same time. It seems fairly similar to me, with the lexicon model already in place (geometric word length, Dirichlet process for lexicon item frequency in the corpus, etc.)

Anyway, on to some more targeted thoughts:

--> I thought the summary of categorization & the links between categorization in language acquisition and categorization in other areas of cognition was really well presented. Similarly, the summary of the previous phonetic category learning models was great - enough detail to know what happened, and how it compares to what Feldman et al. are doing.

--> Regarding the child-directed speech data used, I thought it was really great to see this kind of empirical grounding. I did wonder a bit about which corpora the CHILDES parental frequency count draws from though - since we're looking at processes that happen between 6 and 12 months, we might want to focus on data directed at children of that age. There are plenty of corpora in the American English section of CHILDES with at least some data in this range, so I don't think it would be too hard. The same conversion with the CMU pronouncing dictionary could then be used on those data. (Of course, getting the actual acoustic signal would be best, but I don't know how many CHILDES corpora have this information attached to them.  But if we had that, then we could get all the contextual/coarticulatory effects.) On a related note, I wonder how hard it would be to stack a coarticulatory model on top of the existing model, once you had that data.  Basically, this would involve hypothesizing different rules, perhaps based on motor constraints (rather than the more abstract rules that we see in phonology, such as those that Dillon et al. (forthcoming) look into in their learning model).  Also related, could a phonotactic model of some kind be stacked on top of this? (Blanchard et al. 2010 combine word segmentation & phonotactics.) A word could be made up of bigrams of phonetic categories, rather than just the unigrams in there now.
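To make that last idea concrete, here's a minimal sketch (mine, not from any of the cited models) of scoring a word by bigram probabilities over its segments instead of unigrams - the toy segment inventory and the crude +1 smoothing are just illustrative choices:

```python
from collections import Counter

def train_bigram_phonotactics(words):
    """Estimate counts for P(next segment | previous segment) from segmented words."""
    bigrams, contexts = Counter(), Counter()
    for word in words:
        segments = ["<w>"] + list(word) + ["</w>"]   # add word-boundary symbols
        for a, b in zip(segments, segments[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts

def word_score(word, bigrams, contexts):
    """Product of smoothed bigram probabilities over the word's segments."""
    p = 1.0
    segments = ["<w>"] + list(word) + ["</w>"]
    for a, b in zip(segments, segments[1:]):
        p *= (bigrams[(a, b)] + 1) / (contexts[a] + 1)   # crude +1 smoothing, avoids zeros
    return p

bigrams, contexts = train_bigram_phonotactics(["baba", "dada", "bada"])
print(word_score("baba", bigrams, contexts) > word_score("abab", bigrams, contexts))  # True
```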

--> I liked that they used both the number of categories recovered and the pairwise performance measures to gauge model performance.  While it seems obvious that we want to learn the categories that match the adult categories, some previous models only checked that the right number of categories were recovered.

--> The larger point about the failure of distributional learning on its own reminds me a bit of Gambell & Yang (2006), who essentially were saying that distributional learning works much better in conjunction with additional information (stress information in their case, since they were looking at word segmentation).  Feldman et al.'s point is that this additional information can be on a different level of representation, and depending on what you believe about stress w.r.t. word segmentation, Gambell & Yang would be saying the same thing.

--> The discussion of minimal pairs is very interesting (and this was one of the cool ideas from the original Feldman et al. 2009 paper) - minimal pairs can actually harm phonetic category acquisition in the absence of referents.  In particular, it's more parsimonious to just have one lexicon item whose vowel varies, and this in turn creates broader vowel categories than we want.  So, to succeed, the learner needs to have a fairly weak bias to have a small lexicon - this then leads to splitting minimal pairs into multiple lexicon items, which is actually the correct thing to do.  However, we then have to wonder how realistic it is to have such a weak bias for a small lexicon. (Given memory & processing constraints in infants, it might seem more realistic to have a strong bias for a small lexicon.) On a related note, Feldman et al. note later on that information about word referents actually seems to hinder infant ability to distinguish a minimal pair (citing Stager & Werker 1997). Traditionally, this was explained as something like "word learning is extra hard processing-wise, so infants fail to make the phonetic category distinctions that would separate minimal pairs." But the basic point is that word referent information isn't so helpful.  But maybe it's enough for infants to know that words are functionally different, even if the exact word-meaning mapping isn't established? This might be less cognitively taxing for infants, and allow them to use that information to separate minimal pairs.  Or instead, maybe we should be looking for evidence that infants are terrible at learning minimal pairs when they're first building their lexicons. Feldman et al. reference some evidence that non-minimal pairs are actually really helpful for category learning (more specifically, minimal pairs embedded in non-minimal pairs).

--> I thought the discussion of hierarchical models in general near the end was really nice, and was struck by the statement that "knowledge of sounds is nothing more than a type of general knowledge about words". From a communicative perspective, this seems right - words are the meaningful things, not individual sounds.  Moreover, if we translate this statement back over to syntax since Perfors et al. (2011) used hierarchical models to learn about hierarchical grammars, we get something like "knowledge of hierarchical grammar is nothing more than a type of general knowledge about individual parse tree structures", and that also seems right.  Going back to sounds and words, it's just a little odd at first blush to think of sounds as being the higher level of knowledge and words being the lower level of knowledge. But I think Feldman et al. argue for it effectively.

--> I thought this was an excellent statement describing the computational/rational approach: "...identifying which problem [children] are solving can give us clues to the types of strategies that are likely to be used."


~~~
References

Blanchard, D., J. Heinz, & R. Golinkoff. 2010. Modeling the contribution of phonotactic cues to the problem of word segmentation. Journal of Child Language, 37, 487-511.

Dillon, B., E. Dunbar, & W. Idsardi. forthcoming. A single stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science.

Feldman, N., T. Griffiths, & J. Morgan. 2009. Learning phonetic categories by learning a lexicon. Proceedings of the 31st Annual Conference on Cognitive Science.

Gambell, T. & C. Yang. 2006. Word Segmentation: Quick but not dirty. Manuscript, Yale University.

Perfors, A., J. Tenenbaum, & T. Regier. 2011. The learnability of abstract syntactic principles. Cognition, 118, 306-338.

Stager, C. & J. Werker. 1997. Infants listen for more phonetic detail in speech perception than in word-learning tasks. Nature, 388, 381-382.

Swingley, D. 2009. Contributions of infant word learning to language development. Philosophical Transactions of the Royal Society B, 364, 3617-3632.


Friday, September 28, 2012

Fall meeting times set & Oct 10 = Feldman et al. 2012

Based on the responses, it seems like Wednesdays at 2pm will work best for everyone's schedules. Our complete schedule (with specific dates) can now be seen at

http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html

So, let's get kicking!  For our first meeting on Wednesday October 10 @ 2pm in SBSG 2221, we'll be looking at an article that seeks to model learning of phonetic categories and word forms simultaneously, using hierarchical Bayesian inference:

Feldman, N., Griffiths, T., Goldwater, S., & Morgan, J. 2012. A role for the developing lexicon in phonetic category acquisition.  Manuscript, University of Maryland at College Park, University of California at Berkeley, University of Edinburgh, and Brown University. Note: Because this is a manuscript, please do not cite without permission from Naomi Feldman.

http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/FeldmanEtAl2012Manu_PhonCatLearning.pdf

See you then!

Sunday, September 23, 2012

Fall quarter planning


I hope everyone's had a good summer break - and now it's time to gear up for the fall quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading group webpage, including readings on the acquisition of sounds & words, and general learning & learnability:


Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here).

See you soon!