One of the things I really appreciated about this article was the clear intention to connect the kinds of computational models & problems learnability researchers typically worry about with the kinds of realistic language acquisition and language use problems that linguistics & psychology researchers typically worry about. A nice example of this was the connection to syntactic bootstrapping, which showed up in some of the later sections. I also found myself thinking a few times about the connection between some of these ideas and the issue of language evolution (more on this below), though I suspect this often comes up whenever language universals are discussed.
More targeted thoughts:
The connection with language evolution: I first thought about this in the introduction, where Stabler talks about the "special restrictions on the range of structural options" and the idea that some of the language universals "may guarantee that the whole class of languages with such properties is 'learnable' in a relevant sense." The basic thought was that if the universals didn't help languages get learned, they probably wouldn't have survived through the generations of language speakers. This could be because those universals take advantage of cognitive biases humans already have for learning, for example.
In section 1, Stabler mentions that it would be useful to care about the universals that apply before more complex abstract notions like "subject" are available. I can see the value of this, but I think most ideas about Universal Grammar (UG) that I'm aware of involve exactly these kinds of abstract concepts/symbols. And this makes a little more sense once we remember that UG is meant to be (innate) language-specific learning biases, which would therefore involve symbols that only exist when we're talking about language. So maybe Stabler's point is more that language universals that apply to less abstract (and more perceptible) symbols are not necessarily based on UG biases. They just happen to be used for language learning (and again, contributed to how languages evolved to take the shape that they do).
I'm very sympathetic to the view Stabler mentions at the end of section 1 which is concerned with how to connect computational description results to human languages, given the idealized/simplified languages for which those results are shown.
I like Stabler's point in section 2 about the utility of learnability results, specifically when talking about how a learner realizes that finite data does not mean that the language itself is finite. This connects very well to what I know about the human brain's tendency towards generalization (especially young human brains).
Later on in section 2, I think Stabler does a nice job of explaining why we should care about results that deal with properties in languages like reversibility (e.g., if it's known that the language has that property, the hypothesis space of possible languages is constrained - coupled with a bias for compact representations, this can really winnow the hypothesis space). My takeaway from that was that these kinds of results can tell us about what kind of knowledge is necessary to converge on one answer/representation, which is good. (The downside, of course, is that we can only use this new information if human languages actually have the properties that were explored.) However, it seems like languages might have some of these properties, if we look in the domain of phonotactics. And that makes this feel much more relevant to researchers interested in human language learning.
In section 3, where Stabler is discussing PAC learning, there's some mention of the time taken to converge on a language (i.e., whether the learner is "efficient"). One formal measure of this that's mentioned is polynomial time. I'm wondering how this connects to notions of a reasonable learning period for human language acquisition. (Maybe it doesn't, but it's a first pass attempt to distinguish "wow, totally beyond human capability" from "not".)
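For reference, here's the textbook PAC criterion as I understand it (a standard formulation, not necessarily Stabler's exact one): a learner A counts as a PAC learner for a class of targets if, for every target c, every data distribution D, and every choice of error tolerance \varepsilon and confidence \delta,

\Pr\big[\, \mathrm{err}_D(A(\text{sample})) \le \varepsilon \,\big] \;\ge\; 1 - \delta, \qquad \text{with sample size and runtime} \;\le\; \mathrm{poly}(1/\varepsilon,\ 1/\delta,\ |c|).

So "efficient" here is really about how resource demands scale with the desired accuracy, the desired confidence, and the size of the target grammar - which is at best an indirect proxy for "learnable within a childhood's worth of data and time".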
I really liked the exploration of the link between syntax and semantics in section 4. One takeaway point for me was evidence in the formal learnability domain for the utility of multiple sources of information (multiple cues). I wonder if there's any analog for solving multiple problems (i.e., learning multiple aspects of language) simultaneously (e.g., identifying individual words and grammatical categories at the same time, etc.). The potential existence of universal links between syntax and semantics again got me thinking about language evolution, too. Basically, if certain links are known, learning both syntax and semantics is much easier, so maybe these links take advantage of existing cognitive biases. That would then be why languages evolved to capitalize on these links, and how languages with these links got transmitted through the generations.
I also liked the discussion of syntactic bootstrapping in section 4, and the sort of "top-down" approach of inferring semantics, instead of always using the compositional bottom-up approach where you know the pieces before you understand the thing they make up. This seems right, given what we know about children's chunking and initial language productions.
Friday, January 25, 2013
Monday, January 14, 2013
Next time on 1/28/13 @ 2:15pm in SBSG 2221 = Stabler 2009b
Thanks to everyone who joined our meeting this week, where we had a very interesting discussion about some of the ideas in Stabler (2009)! Next time on Monday January 28 @ 2:15pm in SBSG 2221, we'll be looking at another article by Stabler. This time, it's one that reviews computational approaches to understanding language universals:
Stabler, E. 2009b. Computational models of language universals: Expressiveness, learnability and consequences. Revised version appears in M. H. Christiansen, C. Collins, and S. Edelman, eds., Language Universals, Oxford: Oxford University Press, 200-223. Note: Because this is a non-final version, please do not cite without permission from Ed Stabler.
http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/Stabler2009_CompModLangUni.pdf
See you then!
-Lisa
Friday, January 11, 2013
Some thoughts on Stabler (2009)
One of the things I most enjoyed about this paper was the way Stabler gives the intuitions behind the different approaches - in many cases, these are some of the most lucid descriptions I've seen about these different mathematical techniques. I also really appreciated the discussion about model selection - it certainly seems true to me that model selection is what many theoretical linguists are thinking about when they discuss different knowledge representations. Of course, this isn't to say that parameter setting once you know the model isn't worthy of investigation (I worry a lot about it myself!). But I also think it's easier to use existing mathematical techniques to investigate parameter setting (and model selection, when the models are known), as compared to model generation.
Some more targeted thoughts below:
I really liked the initial discussion of "abstraction from irrelevant factors", which is getting at the idealizations that we (as language science researchers) make. I don't think anyone would dispute that it's necessary to do that to get anything done, but the fights break out when we start talking about the specifics of what's irrelevant. A simple example would be frequency - I think some linguists would assume that frequency's not part of the linguistic knowledge that's relevant for talking about linguistic competence, while others would say that frequency is inherently part of that knowledge since linguistic knowledge includes how often various units are used.
I thought Stabler made very good points about the contributions from both the nativist and the empiricist perspectives (basically, constrained hypothesis spaces for the model types but also impressive rational learning abilities) - and he did it in multiple places, highlighting that both sides have very reasonable claims.
The example in the HMM section with the discovery of implicit syllable structure reminded me very much of UG parameter setting. In particular, while it's true that the learner in this example has to discover the particulars of the unobserved syllable structure, there's still knowledge already (by the nature of the hidden units in the HMM) that there is hidden structure to be discovered (and perhaps even more specific, hidden syllabic structure). I guess the real question is how much has to be specified in the hidden structure for the learner to succeed at discovering the correct syllable structure - is it enough to know that there's a level above consonants & vowels? Or do the hidden units need to specify that this hidden structure is about syllables, and then it's just a question of figuring out exactly what about syllables is true for this language?
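To make my question a bit more concrete, here's a toy sketch (my own construction with made-up numbers, not the example from the paper) in which the hidden states are pre-labeled as syllable positions and the transitions are already restricted; the weaker alternative I have in mind keeps the same machinery but leaves the states anonymous and unconstrained:

```python
# Toy HMM sketch (my own construction, not Stabler's example): hidden states
# are syllable positions, observations are just C (consonant) or V (vowel).

states = ["Onset", "Nucleus", "Coda"]
start = {"Onset": 0.7, "Nucleus": 0.3, "Coda": 0.0}
trans = {  # pre-specified: which positions can follow which
    "Onset":   {"Onset": 0.0, "Nucleus": 1.0, "Coda": 0.0},
    "Nucleus": {"Onset": 0.5, "Nucleus": 0.0, "Coda": 0.5},
    "Coda":    {"Onset": 1.0, "Nucleus": 0.0, "Coda": 0.0},
}
emit = {  # pre-specified: onsets/codas emit consonants, nuclei emit vowels
    "Onset":   {"C": 1.0, "V": 0.0},
    "Nucleus": {"C": 0.0, "V": 1.0},
    "Coda":    {"C": 1.0, "V": 0.0},
}

def likelihood(obs):
    """Forward algorithm: probability of a C/V string under this HMM."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())

print(likelihood(list("CVC")))  # a legal syllable gets nonzero probability
print(likelihood(list("CCC")))  # an all-consonant string gets probability 0

# The weaker assumption from my question: the same three hidden states, but
# anonymous ("hidden1/2/3") with unconstrained transitions/emissions, so that
# EM training would have to discover the onset/nucleus/coda behavior itself.
```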
I was struck by Stabler's comment about whether it's methodologically appropriate for linguists to seek grammar formalisms that guarantee that human learners can, from any point on the hypothesis space, always reach the global optimum by using some sort of gradient descent. This reminds me very much of the tension between the complexity of language and the sophistication of language learning. First, if language isn't that complex, then the hypothesis space de facto probably can be traversed by some good domain-general learning algorithms. If, however, language is complex, the hypothesis space may not be so cleanly structured. But, if children have innate learning biases that guide them through this "bumpy" hypothesis space, effectively restructuring the hypothesis space to become smooth, then this works out. So it wouldn't be so much that the hypothesis space must be smoothly structured on its own, but rather that it can be perceived as being smoothly structured, given the right learning biases. (This is the basic linguistic nativist tenet about UG, I think - UG are the biases that allow swift traversal of the "bumpy" hypothesis space.)
I also got to thinking about the idea mentioned in the section on perceptrons about how there are many facts about language that don't seem to naturally be Boolean (and so wouldn't lend themselves well to being learned by a perceptron). In a way, anything can be made into a Boolean - this is the basis of binary decomposition in categorization problems. (If you have 10 categories, you first ask if it's category 1 or not, then category 2 or not, etc.) What you do need is a lot of knowledge about the space of possibilities so you know what yes or no questions to ask - and this reminds me of (binary) parameter setting, as it's usually discussed by linguists. The child has a lot of knowledge about the hypothesis space of language, and is making decisions about each parameter (effectively solving a categorizing problem for each parameter - is it value a or value b?, etc.). So I guess the upshot of my thought stream was that perceptrons could be used to learn language, but at the level of implementing the actual parameter setting.
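Here's a minimal sketch of that binary-decomposition idea (one-vs-rest), with entirely made-up features and categories - just to show how a bank of Boolean questions implements a multi-way decision of the parameter-setting sort:

```python
# Minimal sketch of binary decomposition: one perceptron per yes/no question
# ("category k or not?"), so a multi-way decision reduces to a bank of
# Boolean decisions. Features, data, and categories are made up.

def train_perceptron(data, epochs=20, lr=1.0):
    """data: list of (feature_vector, label) pairs with label in {0, 1}."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:  # standard perceptron update on a mistake
                w = [wi + lr * (y - pred) * xi for wi, xi in zip(w, x)]
                b += lr * (y - pred)
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy multi-class problem: 3 "categories" over 2 binary features.
examples = [([1, 0], 0), ([1, 1], 1), ([0, 1], 2), ([1, 0], 0), ([0, 1], 2)]

# One-vs-rest: train one binary perceptron per category.
classifiers = {}
for k in range(3):
    binary_data = [(x, 1 if y == k else 0) for x, y in examples]
    classifiers[k] = train_perceptron(binary_data)

# Classify a new point by asking each Boolean question in turn.
x_new = [0, 1]
for k, (w, b) in classifiers.items():
    print(f"category {k}? ->", bool(predict(w, b, x_new)))
```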
It was very useful to be reminded that the representation of the problem and the initial values for neural networks are crucial for learning success. This of course implies that the correct structure and initial values for whatever language learning problem is being modeled must be known a priori (which is effectively a nativist claim, and if these values are specific to language learning, then a linguistic nativist claim). So, the fight between those who use neural networks to explain language learning behavior and those who hold the classic ideas about what's in UG isn't about whether there are some innate biases, or even if those biases are language-specific - it may just be about whether the biases are about the learning mechanism (values in neural networks, for example) or about the knowledge representation (traditional UG biases, but also potentially about network structure for neural nets).
Alas, the one part where I failed to get the intuition that Stabler offered was in the section on support vector machines. This is probably due to my own inadequate knowledge of SVMs, but given how marvelous the other sections were with their intuitions, I really found myself struggling with this one.
Stabler notes in the section on model selection that model fit cannot be the only criterion for modeling success, since larger models tend to fit the data (and perhaps overfit the data) better than simpler models. MDL seems like one good attempt to deal with this, since it has a simple encoding length metric which it uses to compare models - encoding not just the data, based on the model, but also the model itself. So, while a larger model may have a more compact data encoding, its larger size counts against it. In this way, you get some of that nice balance between model complexity and data fit.
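In symbols, the two-part MDL criterion I have in mind (the standard formulation, which I take to be what Stabler is describing):

M^{*} = \arg\min_{M}\ \big[\, L(M) + L(D \mid M) \,\big]

where L(M) is the number of bits needed to encode the model (e.g., the grammar) itself and L(D \mid M) the number needed to encode the data given that model. A bigger grammar can shrink L(D \mid M), but it pays for that in L(M).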
Tuesday, January 8, 2013
Winter meeting time set & Jan 14 = Stabler 2009 @ 2:15pm in SBSG 2221
Based on the responses, it seems like Mondays at 2:15pm will work best for everyone's schedules this quarter. Our complete schedule (with specific dates) can now be seen at
http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html
So, let's get kicking! For our first meeting on Monday January 14 @ 2:15pm in SBSG 2221, we'll be looking at an article that surveys several mathematical approaches to language learning, as well as the assumptions inherent in these various approaches.
Stabler, E. 2009. Mathematics of language learning. Revised version appears in Histoire, Epistemologie, Langage, 31, 1, 127-145. Note: Since this is a non-final version, please do not cite without permission from Ed Stabler.
See you then!
Friday, January 4, 2013
Winter quarter planning
I hope everyone's had a good winter break - and now it's time to gear up for the winter quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading group webpage, including readings on mathematical language learning, statistical learning, and hierarchy in language:
http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html
Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here). Thanks and see you soon!
Wednesday, November 28, 2012
See you in the winter!
Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us throughout the fall quarter!
The CoLa Reading Group will resume again in the winter quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!
Sunday, November 25, 2012
Some thoughts on Frank (2012)
I thought this was a really nice big picture piece about computational modeling work in language acquisition, and it tries (admirably!) to consolidate insights in different domains about the kind of learning assumptions/strategies that are useful. This is such an incredibly good thing to do, I think - one of the questions I get a lot is whether there's one general purpose style of computational model that's the right way to do things, and I'm usually left shrugging and saying, "Depends on what you're trying to do." And to some extent of course, this is right - but there's also something to be said about what the different useful models have in common.
Another note: Despite the empirical coverage, I did feel there was something of a disconnect between the phenomena generative linguists get excited about (w.r.t poverty of the stimulus, for example - syntactic islands, case theory, etc.) and the phenomena modeled in the studies discussed here. There's nothing wrong with this, since everyone's goal is to understand language acquisition, and that means acquisition of a lot of different kinds of knowledge. But I did wonder how the insights discussed here could be applied to more sophisticated knowledge acquisition problems in language. Frank notes already that it's unclear what insights successful models of more sophisticated knowledge have in common.
Some more targeted thoughts:
Frank focuses on two metrics of model success: sufficiency (basically, acquisition success) and fidelity (fitting patterns of human behavior). I've seen other proposed metrics, such as formal sufficiency, developmental compatibility, and explanatory power (discussed, for example, in Pearl 2010, which is based on prior work by Yang). I feel like formal sufficiency maps pretty well to sufficiency (and actually may cover fidelity too). Developmental compatibility, though, is more about psychological plausibility, and explanatory power is about the ability of the model to give informative (explanatory) answers about what causes the acquisition process modeled. I think all of the studies discussed hold up on the explanatory power metric, so that's fine. It's unclear how well they hold up for developmental compatibility - it may not matter if they're computational-level analyses, for example. But I feel like that's something that should be mentioned as a more prominent thing to think about when judging a computational model. (But maybe that's my algorithmic bias showing through.)
Related point: Frank clearly is aware of the tension between computational-level and algorithmic-level approaches, and spends some time discussing things like incremental vs. batch learning. I admit, I was surprised to see this though: "Fully incremental learning prevents backtracking or re-evaluation of hypotheses in light of earlier data". If I'm understanding this correctly, the idea is that you can't use earlier data at all in a fully incremental model. I think this conflates incremental with memoryless - for example, you can have an incremental learner that has some memory of prior data (usually in some kind of compressed format, perhaps tallying statistics of some kind, etc.). For me, all incremental means is that the learner processes data as it comes in - it doesn't preclude the ability to remember prior data with some (or even a lot of) detail.
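To make the distinction concrete, here's a toy sketch (mine, not Frank's) of an incremental-but-not-memoryless learner: it processes one word at a time and never stores the raw corpus, but it does keep running counts, so earlier data still inform any later re-evaluation of its hypotheses:

```python
# Toy sketch (mine, not Frank's): an "incremental but not memoryless" learner.
# It processes one observation at a time and never stores raw data, but keeps
# running counts (a compressed summary), so earlier data still shape later
# hypothesis re-evaluation.

from collections import Counter

class IncrementalBigramLearner:
    def __init__(self):
        self.unigrams = Counter()   # compressed memory of all past data
        self.bigrams = Counter()
        self.prev = None

    def observe(self, word):
        """Process one word as it comes in; no raw corpus is kept."""
        self.unigrams[word] += 1
        if self.prev is not None:
            self.bigrams[(self.prev, word)] += 1
        self.prev = word

    def prob(self, word, given):
        """Current hypothesis about P(word | given), re-evaluable at any
        time and reflecting all earlier data via the stored counts."""
        if self.unigrams[given] == 0:
            return 0.0
        return self.bigrams[(given, word)] / self.unigrams[given]

learner = IncrementalBigramLearner()
for w in "the dog saw the cat the dog ran".split():
    learner.observe(w)

print(learner.prob("dog", given="the"))   # 2/3: shaped by earlier data
```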
Related point: Human memory constraints. In the word segmentation section, Frank mentions that experimental results suggest that "learners may not store the results of segmentation veridically, falsely interpolating memories that they have heard novel items that share all of their individual transitions within a set of observed items". At first, I thought this was about humans not storing the actual segmentations in memory (and I thought, well, of course not - they're storing the recovered word forms). But the second bit made me think this was actually even more abstract than that - it seems to suggest that artificial language participants were extracting probabilistic rules about word forms, rather than the word forms themselves. Maybe this is because the word forms were disconnected from meaning in the experiments described, so the most compact representation was of the rules for making word forms, rather than the word forms themselves?
I loved the Goldsmith (2010) quote: "...if you dig deep enough into any task in acquisition, it will become clear that in order to model that task effectively, a model of every other task is necessary". This is probably generally true, no matter what you're studying, actually - you always have to simplify and pretend things are disconnected when you start out in order to make any progress. But then, once you know a little something, you can relax the idealizations. And Frank notes the synergies in acquisition tasks, which seems like exactly the right way to think about it (at least, now that we think we know something about the individual acquisition tasks involved). It seems like a good chunk of the exciting work going on in acquisition modeling is investigating solving multiple tasks simultaneously, leveraging information from the different tasks to make solving all of them easier. However, once you start trying to do this, you then need to have a precise model of how that leveraging/integration process works.
Another great quote (this time from George Box): "all models are wrong, but some are useful". So true - and related to the point above. I think a really nice contribution Frank makes is in thinking about ways in which models can be useful - whether they provide a general framework or are formal demonstrations of simple principles, for example.
I think this quote might ruffle a few linguist feathers: "...lexicalized (contain information that is linked to individual word forms), the majority of language acquisition could be characterized as 'word learning'. Inferring the meaning of individual lexical items...". While technically this could be true (given really complex ideas about word "meaning"), the complexity of the syntactic acquisition task gets a little lost here, especially given what many people think about as "word meaning". In particular, the rules for putting words together aren't necessarily connected directly to lexical semantics (though of course, individual word meaning plays a part).
I think the Frank et al. work on intention inference when learning a lexicon demonstrates a nice sequence of research w.r.t. the utility of computational models. Basically, child behavior was best explained by a principle of mutual exclusivity. So, for a while, that was a placeholder, i.e., something like "Use mutual exclusivity to make your decision". Then, Frank et al. came along and hypothesized where mutual exclusivity could come from, and showed how it could arise from more basic learning biases (e.g., "use probabilistic learning this way"). That is, mutual exclusivity itself didn't have to be a basic unit. This reminds me of the Subset Principle in generative linguistics, which falls out nicely from the Size Principle of Bayesian inference.
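For reference, the standard statement of the Size Principle (not specific to Frank et al.): if the n observed examples are assumed to be sampled uniformly and independently from the extension of hypothesis h, then

P(D \mid h) = \left(\frac{1}{|h|}\right)^{n} \quad \text{if every } d_i \in h, \text{ and } 0 \text{ otherwise,}

so smaller hypotheses consistent with the data are preferred, and exponentially more so as n grows - which is how Subset-Principle-like behavior can fall out of the inference rather than being stipulated.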
It's an interesting idea that humans do best at learning when there are multiple (informationally redundant) cues available, as opposed to just one really informative cue. I'm not sure if the Mintz frequent frame is a really good example of this, though - it seems like a frame vs. a bigram is really just the same kind of statistical cue. Though maybe the point is more that the framing words provide more redundancy, rather than being different kinds of cues.
It's also a really interesting idea to measure success by having the output of a model be an intermediate representation used in some other task that has an uncontroversial gold standard. Frank talks about it in the context of syntactic categories, but I could easily imagine the same thing applying to word segmentation. It's definitely a recurring problem that we don't want perfect segmentation for models of infant word segmentation - but then, what do we want? So maybe we can use the output of word segmentation as the input to word- (or morpheme-) meaning mapping.
It took me a little while to understand what "expressive" meant in this context. I think it relates to the informational content of some representation - so if a representation is expressive, it can cover a lot of data while being very compact (e.g., rule-based systems, instead of mappings between individual lexical items). A quote near the end gets at this more directly: "...it becomes possible to generate new sentences and to encode sentences more efficiently. At all levels of organization, language is non-random: it is characterized by a high degree of redundancy and hence there is a lot of room for compression." I think this is basically an information-theoretic motivation for having a grammar (which is great!). In a similar vein, it seems like this would be an argument in favor of Universal Grammar-style parameters, because they would be a very good compression of complex regularities and relationships in the data.
~~~
References
Pearl, L. 2010. Using computational modeling in language acquisition research. In E. Blom & S. Unsworth (eds.), Experimental Methods in Language Acquisition Research, John Benjamins.
Wednesday, November 14, 2012
Next time on 11/28/12 @ 2pm in SBSG 2221 = Frank (2012)
Thanks to everyone who participated in our vigorous and thoughtful discussion of Hsu et al. (2011)! For our next meeting on Wednesday November 28th @ 2pm in SBSG 2221, we'll be looking at a paper that investigates the role of computational models in the study of early language acquisition and how to evaluate them.
Frank, M. 2012. Computational models of early language acquisition. Manuscript, Stanford University.
Monday, November 12, 2012
Some thoughts on Hsu et al. 2011
So this seems to be more of an overview paper showcasing how to apply a probabilistic learning framework at the computational level to problems in language acquisition, whether we're concerned with theoretical learnability results or predicting observable behavior. As a follow-up to Hsu & Chater (2010), which we discussed a few years back, this re-emphasized some of the nice intuitions in the MDL framework (such as "more compact representations are better"). I think a strength of this framework is its ability to identify linguistic knowledge pieces that are hard to learn from the available data, since this is exactly the sort of thing poverty of the stimulus (PoS) is all about. (Of course, the results rest on the particular assumptions made about the input, forms of the rules, etc., but that's true of all computational analyses, I think.) On a related note, I did notice that nearly all the phenomena examined by Hsu et al. were based on lexical item classification (verb argument subcategorization) or contraction (what generativists might call "traces" in some cases). This is fine (especially the "wanna" case, which I have seen actually used in PoS arguments), but I was surprised that we're not really getting into the kind of complex sentential semantics or syntax that I usually see talked about in generativist circles (e.g., syntactic islands, case theory - see Crain & Pietroski (2002) for some examples on the semantic side). Also, even though Hsu et al.'s own analysis shows that wanna & that-traces are "practically" unlearnable (i.e., even with probabilistic learning, these look like PoS problems), it seems like they close this paper by sort of downplaying this ("probabilistic language learning is theoretically and computationally possible").
Some more targeted thoughts below:
I think my biggest issue with the computational learnability analyses (and proofs) is that I find it very hard to connect them to the psychological problem of language acquisition that I'm used to thinking about. (In fact, Kent Johnson in UCI's LPS department has a really nice 2004 paper talking about how this connection probably shouldn't have been made with the (in)famous Gold (1967) learnability results.) I do understand that this type of argument is meant to combat the claim about the "logical problem of language acquisition", with the specific interpretation that the "logical problem" comes from computational learnability results (and the Gold paper in particular). However, I've also seen "logical problem of language acquisition" apply to the simple fact that there are induction problems in language acquisition, i.e., the data are compatible with multiple hypotheses, and "logically" any of them could be right, but only one actually is, so "logical problem". This second interpretation still seems right to me, and I don't feel particularly swayed to change this view after reading the learnability results here (though maybe that's (again) because I have trouble connecting these results to the psychological problem).
Related to the point above - in section 2, where we see a brief description of the learnability proof, the process is described as an algorithm that "generates a sequence of guesses concerning the generative probabilistic model of the language". Are these guesses probabilities over utterances, probabilities over the generative grammars that produce the utterances, something else? It seems like we might want them to be probabilities over the generative grammars, but then don't we need some definition of the hypothesis space of possible generative grammars?
I had a little trouble understanding the distinction that Hsu et al. were making between discriminative and generative models in the introduction. Basically, it seemed to me that "discriminative" behavior could be the output of a generative model, so we could view a discriminative model as a special case of a generative model. So is the idea that we really want to emphasize that humans are identifying the underlying probability distribution, instead of just making binary classifications based on their grammars? That is, that there is no such thing as "grammatical" and "ungrammatical", but instead these are epiphenomena of thresholding a probabilistic system?
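In symbols, the way I was picturing it: if the generative model assigns joint probabilities P(x, \text{gram}) and P(x, \text{ungram}) to a string x, then the discriminative judgment is just

P(\text{gram} \mid x) = \frac{P(x, \text{gram})}{P(x, \text{gram}) + P(x, \text{ungram})},

and a categorical "grammatical/ungrammatical" response would be a threshold on that quantity - which is the sense in which discriminative behavior could be an epiphenomenon of the generative system.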
In section 3, at the very end, Hsu et al. mention that the ideal statistical learner provides an "upper bound" on learnability. I found this somewhat odd - I always thought of ideal learners as providing a lower bound in some sense, since they're not constrained by cognitive resource limitations, and are basically looking at the question of whether the data contain enough information to solve the problem in question.
The practical example in 3.2 with the "going to" contraction threw me for a bit, since I couldn't figure out how to interpret this: "Under the new grammar, going to contraction never occurs when to is a preposition and thus 0 bits are required to encode contraction." Clearly, the intent is that "no contraction" is cheaper to encode than the process of contraction, but why was that? Especially since the new grammar that has the "don't contract when to is a preposition" seems to require an extra rule. Looking back to Hsu & Chater (2010), it seems to be that rules with probability 1 (like going to --> going to when to=prep) require 0 bits to encode. So in effect, the new grammar that has a special exception when to is a preposition gets a data encoding boost, even though the actual grammar model is longer (since it has this exception explicitly encoded). So, "exceptions" that always apply (in a context-dependent way) are cheaper than general rules when the observable data appear in that context.
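The arithmetic I eventually settled on (my reading of the Hsu & Chater setup, so caveat lector): under an optimal code, an outcome with probability p costs -\log_2 p bits, so

p = 1 \Rightarrow -\log_2 1 = 0 \text{ bits}, \qquad p = 0.5 \Rightarrow -\log_2 0.5 = 1 \text{ bit per observation.}

The grammar with the explicit "to = preposition" exception is a few bits longer on the L(M) side, but it saves bits on every relevant datum on the L(D | M) side, so with enough data the exception grammar wins.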
I liked the idea that learnability should correlate with grammaticality judgments, with the idea that more "learnable" rules (i.e., ones with more data in the input) are encountered more and so their probabilities are stronger in whichever direction. In looking at the computational results though, I have to admit I was surprised that "going to" ranked 12th in learnability (Fig 2), maybe putting it on the order of 50 years to learn. That rule seems very easy, and I assume the grammaticality judgments are very strong for it. (My intuitions are at least.)
A small methodological quibble, section 4.1: "...because many constructions do not occur often enough for statistical significance [in child-directed speech]...we use...the full Corpus of Contemporary American English." Isn't this the point for PoS arguments, though? There are differences between child-directed and adult-directed input (especially between child-directed speech and adult-directed written text), especially at this lexical item level that Hsu et al. are looking at (and also even at very abstract levels like wh-dependencies: Pearl & Sprouse (forthcoming)). So if we don't find these often enough in child-directed speech, and the thing we're concerned with is child acquisition of language, doesn't this also suggest there's a potential PoS problem?
I liked that Hsu et al. connect their work to entrenchment theory, and basically provide a formal (computational-level) instantiation of how/why entrenchment occurs.
~~~
References
Crain, S. & P. Pietroski. 2002. Why language acquisition is a snap. The Linguistic Review, 19, 163-183.
Gold, E. 1967. Language Identification in the Limit. Information and Control, 10, 447-474.
Hsu, A. & N. Chater. 2010. The Logical Problem of Language Acquisition: A Probabilistic Perspective. Cognitive Science, 34, 972-1016.
Johnson, K. 2004. Gold's Theorem and Cognitive Science. Philosophy of Science, 71, 571-592.
Pearl, L. & J. Sprouse. Forthcoming 2012. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition.
Wednesday, October 24, 2012
Next time on 11/14 @ 2pm in SBSG 2221 = Hsu et al. 2011
Hi everyone,
Thanks to everyone who participated in our thoughtful discussion of Gagliardi et al. (2012)! For our next meeting on Wednesday November 14th @ 2pm in SBSG 2221, we'll be looking at an article that investigates a way to quantify natural language learnability and discusses the impact this has on the debate about the nature of the necessary learning biases for language:
Hsu, A., Chater, N., & Vitanyi, P. 2011. The probabilistic analysis of language acquisition: Theoretical, computational, and experimental analysis. Cognition, 120, 380-390.
See you then!
-Lisa
Monday, October 22, 2012
Some thoughts on Gagliardi et al. (2012)
I thought this was a really lovely Cog Sci paper showcasing how to combine experimental & computational methodologies (and still make it all fit in 6 pages). The authors really tried to give the intuitions behind the modeling aspects, which makes this more accessible to a wider audience. The study does come off as a foundational one, given the many extensions that could be done (involving effects in younger word learners, cross-linguistic applications, etc.), but I think that's a perfectly reasonable approach (again, given the page limitations). I also thought the empirical grounding was really lovely for the computational modeling part, especially as it relates to the concept priors. Granted, there are still some idealizations being made (more discussion of this below), but it's nice to see this being taken seriously.
Some more targeted thoughts:
--> One issue concerns the age of the children tested experimentally (4 years old) (and as Gagliardi et al. mention, a future study should look at younger word learners). The reason is that 4-year-olds are fairly good word learners (and have a vocabulary of some size), and presumably have the link between concept and grammatical category (and maybe morphology and grammatical category for the adjectives) firmly established. So it maybe isn't so surprising that grammatical category information is helpful to them. What would be really nice is to know when that link is established, and the interaction between concept formation and recognition/mapping to grammatical categories. I could certainly imagine a bootstrapping process, for instance, and it would be useful to understand that more.
--> The generative model assumes a particular sequence, namely (1) choose the syntactic category, (2) choose the concept, and (3) choose instances of that concept. This seems reasonable for the teaching scenario in the experimental setup, but what might we expect in a more realistic word-learning environment? Would a generative model still have syntactic category first (probably not), or instead have a balance between syntactic environment and concept? Or maybe it would be concept first? And more importantly, how much would this matter? It would presumably change the probabilities that the learner needs to estimate at each point in the generative process. (I've put a toy sketch of this generative sequence after these notes.)
--> I'd be very interested to see the exact way the Mechanical Turk survey was conducted for classifying things as examples of kinds, properties, or both (and which words were used). Obviously, due to space limitations, this wasn't included here. But I can imagine that many words might easily be described as both kind & property, if you think carefully enough (or maybe too carefully) about it. Take "cookie", for example (a fairly common child word, I think): It's got both kind (ex: food) and property aspects (ex: sweet) that are fairly salient. So it really matters what examples you give the participants and how you explain the classification you're looking for. And even then, we're getting adult judgments, where child judgments might be more malleable (so maybe we want to try this exercise with children too, if we can).
--> Also, on a related note, the authors make a (reasonable) idealization that the distribution of noun and adjective dimensions in the 30-month-old CDIs is representative of the "larger and more varied set of words" that the child experimental participants know. However, I do wonder about the impact of that assumption, since we are talking about priors (which drive the model to use grammatical category information in a helpful way). It's not too hard to imagine children whose vocabularies skew away from this sample (especially if they're older). Going in the other direction though, if we want to extend this to younger word learners, then the CDIs become a very good estimate of the nouns and adjectives those children know, which is encouraging.
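As a postscript, here's a toy sketch (in Python, just for my own intuition) of the generative sequence from the second bullet above, with the concept-type prior estimated from made-up CDI-style counts. All the numbers, names, and the instance generator are invented for illustration - this is not the authors' actual model or its parameters.

```python
import random

# Hypothetical CDI-style counts (invented for illustration): how many nouns vs.
# adjectives in a toddler vocabulary pick out kinds vs. properties.
CDI_COUNTS = {
    "noun":      {"kind": 180, "property": 20},
    "adjective": {"kind": 15,  "property": 85},
}

def concept_prior(category):
    """P(concept type | grammatical category), estimated from the toy CDI counts."""
    counts = CDI_COUNTS[category]
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}

def generate(category, n_instances=3):
    """Toy version of the assumed sequence: category -> concept type -> instances."""
    prior = concept_prior(category)
    concept = random.choices(list(prior), weights=list(prior.values()))[0]
    # Stand-in for real instances, which would be objects sharing a kind
    # (e.g., shape/taxonomy) or a property (e.g., color/texture).
    instances = [f"{concept}-consistent object {i + 1}" for i in range(n_instances)]
    return concept, instances

print(concept_prior("noun"))      # kinds dominate for nouns
print(generate("adjective"))      # properties dominate for adjectives
```

The only point of the sketch is that the prior P(concept type | category) is exactly where the CDI idealization from the last bullet enters the picture.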
Wednesday, October 10, 2012
Next time on Oct 24 @ 2pm in SBSG 2221 = Gagliardi et al. 2012
Thanks to everyone who participated in our thoughtful discussion of Feldman et al. (2012 Ms)! For our next meeting on Wednesday October 24 @ 2pm in SBSG 2221, we'll be looking at an article that seeks to model learning of word meaning for specific grammatical categories:
Gagliardi, A., E. Bennett, J. Lidz, & N. Feldman. 2012. Children's Inferences in Generalizing Novel Nouns and Adjectives. In N. Miyake, D. Peebles, & R. Cooper (Eds), Proceedings of the 34th Annual Meeting of the Cognitive Science Society, 354-359.
See you then!
Monday, October 8, 2012
Some thoughts on Feldman et al. (2012 Ms)
So I'm definitely a huge fan of work that combines different levels of information when solving acquisition problems, and this is that kind of study. In particular, as Feldman et al. note themselves, they're making explicit an idea that came from Swingley (2009): Maybe identifying phonetic categories from the acoustic signal is easier if you keep word context in mind. Another way of putting this is that infants realize that sounds are part of larger units, and so as they try to solve the problem of identifying their native sounds, they're also trying to solve the problem of what these larger units are. This seems intuitively right to me (I had lots of notes in the margins saying "right!" and "yes!!"), though of course we need to grant that infants realize these larger units exist.
One thing I was surprised about, since I had read an earlier version of this study (Feldman et al. 2009): The learners here actually aren't solving word segmentation at the same time they're learning phonetic categories. For some reason, I had assumed they were - maybe because the idea of identifying the lexicon items in a stream of speech seems similar to word segmentation. But that's not what's going on here. Feldman et al. emphasize that the words are presented with boundaries already in place, so this is a little easier than real life. (It's as if the infants are presented with a list of words, or just isolated words.) Given the nature of the Bayesian model (and especially since one of the co-authors is Sharon Goldwater, who's done work on Bayesian segmentation), I wonder how difficult it would be to actually do word segmentation at the same time. It seems fairly similar to me, with the lexicon model already in place (geometric word length, Dirichlet process for lexicon item frequency in the corpus, etc.).
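To spell out for myself what that kind of lexicon model looks like, here's a minimal generative sketch: word lengths drawn from a geometric distribution, and word-token frequencies handled by a Chinese restaurant process (the sequential view of the Dirichlet process). This is just the general setup as I understand it, not Feldman et al.'s actual implementation - the concentration parameter, stopping probability, and segment inventory are all made up.

```python
import random

ALPHA = 1.0                      # made-up DP concentration parameter
P_STOP = 0.4                     # made-up geometric stopping probability
PHONEMES = list("ptkbdgaeiou")   # toy segment inventory

def new_word_form():
    """Sample a new lexical item: length ~ Geometric(P_STOP), segments uniform."""
    segments = [random.choice(PHONEMES)]
    while random.random() > P_STOP:
        segments.append(random.choice(PHONEMES))
    return "".join(segments)

def generate_corpus(n_tokens):
    """Chinese restaurant process over lexical items: frequent words get reused."""
    lexicon, tokens = [], []
    for i in range(n_tokens):
        if not lexicon or random.random() < ALPHA / (i + ALPHA):
            lexicon.append(new_word_form())       # "new table": add a new word type
            tokens.append(lexicon[-1])
        else:
            tokens.append(random.choice(tokens))  # reuse, proportional to token frequency
    return lexicon, tokens

lexicon, tokens = generate_corpus(50)
print(len(lexicon), "lexical items for", len(tokens), "word tokens")
```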
Anyway, on to some more targeted thoughts:
--> I thought the summary of categorization & the links between categorization in language acquisition and categorization in other areas of cognition was really well presented. Similarly, the summary of the previous phonetic category learning models was great - enough detail to know what happened, and how it compares to what Feldman et al. are doing.
--> Regarding the child-directed speech data used, I thought it was really great to see this kind of empirical grounding. I did wonder a bit about which corpora the CHILDES parental frequency count draws from though - since we're looking at processes that happen between 6 and 12 months, we might want to focus on data directed at children of that age. There are plenty of corpora in the American English section of CHILDES with at least some data in this range, so I don't think it would be too hard. The same conversion with the CMU pronouncing dictionary could then be used on those data. (Of course, getting the actual acoustic signal would be best, but I don't know how many CHILDES corpora have this information attached to them. If we had that, though, then we could get all the contextual/coarticulatory effects.) On a related note, I wonder how hard it would be to stack a coarticulatory model on top of the existing model, once you had that data. Basically, this would involve hypothesizing different rules, perhaps based on motor constraints (rather than the more abstract rules that we see in phonology, such as those that Dillon et al. (forthcoming) look into in their learning model). Also related, could a phonotactic model of some kind be stacked on top of this? (Blanchard et al. 2010 combine word segmentation & phonotactics.) A word could be made up of bigrams of phonetic categories, rather than just the unigrams in there now.
--> I liked that they used both the number of categories recovered and the pairwise performance measures to gauge model performance (a quick sketch of the pairwise measure appears just before the references below). While it seems obvious that we want to learn the categories that match the adult categories, some previous models only checked that the right number of categories were recovered.
--> The larger point about the failure of distributional learning on its own reminds me a bit of Gambell & Yang (2006), who essentially were saying that distributional learning works much better in conjunction with additional information (stress information in their case, since they were looking at word segmentation). Feldman et al.'s point is that this additional information can be on a different level of representation, and depending on what you believe about stress w.r.t. word segmentation, Gambell & Yang would be saying the same thing.
--> The discussion of minimal pairs is very interesting (and this was one of the cool ideas from the original Feldman et al. 2009 paper) - minimal pairs can actually harm phonetic category acquisition in the absence of referents. In particular, it's more parsimonious to just have one lexicon item whose vowel varies, and this in turn creates broader vowel categories than we want. So, to succeed, the learner needs to have a fairly weak bias to have a small lexicon - this then leads to splitting minimal pairs into multiple lexicon items, which is actually the correct thing to do. However, we then have to wonder how realistic it is to have such a weak bias for a small lexicon. (Given memory & processing constraints in infants, it might seem more realistic to have a strong bias for a small lexicon.) On a related note, Feldman et al. note later on that information about word referents actually seems to hinder infants' ability to distinguish a minimal pair (citing Stager & Werker 1997). Traditionally, this was explained as something like "word learning is extra hard processing-wise, so infants fail to make the phonetic category distinctions that would separate minimal pairs." The basic point, though, is that word referent information isn't so helpful. But maybe it's enough for infants to know that words are functionally different, even if the exact word-meaning mapping isn't established? This might be less cognitively taxing for infants, and allow them to use that information to separate minimal pairs. Or instead, maybe we should be looking for evidence that infants are terrible at learning minimal pairs when they're first building their lexicons. Feldman et al. reference some evidence that non-minimal pairs are actually really helpful for category learning (more specifically, minimal pairs embedded in non-minimal pairs).
--> I thought the discussion of hierarchical models in general near the end was really nice, and was struck by the statement that "knowledge of sounds is nothing more than a type of general knowledge about words". From a communicative perspective, this seems right - words are the meaningful things, not individual sounds. Moreover, if we translate this statement back over to syntax (since Perfors et al. (2011) used hierarchical models to learn about hierarchical grammars), we get something like "knowledge of hierarchical grammar is nothing more than a type of general knowledge about individual parse tree structures", and that also seems right. Going back to sounds and words, it's just a little odd at first blush to think of sounds as being the higher level of knowledge and words being the lower level of knowledge. But I think Feldman et al. argue for it effectively.
--> I thought this was an excellent statement describing the computational/rational approach: "...identifying which problem [children] are solving can give us clues to the types of strategies that are likely to be used."
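One small addendum, related to the evaluation bullet above: here's how I understand the pairwise performance measure, sketched with made-up gold and learned category labels. This is my own reconstruction of the standard metric, not the authors' evaluation code.

```python
from itertools import combinations

def pairwise_f(gold, learned):
    """Pairwise precision, recall, and F over all pairs of tokens:
    a pair counts as 'same' if both tokens get the same category label."""
    same_gold    = {(i, j) for i, j in combinations(range(len(gold)), 2)
                    if gold[i] == gold[j]}
    same_learned = {(i, j) for i, j in combinations(range(len(learned)), 2)
                    if learned[i] == learned[j]}
    hits = same_gold & same_learned
    precision = len(hits) / len(same_learned) if same_learned else 1.0
    recall    = len(hits) / len(same_gold) if same_gold else 1.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

# Made-up example: 6 vowel tokens, gold has 3 categories, learner merged two of them.
gold    = ["i", "i", "e", "e", "a", "a"]
learned = ["A", "A", "A", "A", "B", "B"]
print(pairwise_f(gold, learned))  # recall is perfect, precision suffers from the merge
```

This is why checking only the number of categories recovered isn't enough: a learner could get the right count while grouping the wrong tokens together, and the pairwise measure catches that.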
~~~
References
Blanchard, D., J. Heinz, & R. Golinkoff. 2010. Modeling the contribution of phonotactic cues to the problem of word segmentation. Journal of Child Language, 37, 487-511.
Dillon, B., E. Dunbar, & W. Idsardi. forthcoming. A single stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science.
Feldman, N., T. Griffiths, & J. Morgan. 2009. Learning phonetic categories by learning a lexicon. Proceedings of the 31st Annual Conference of the Cognitive Science Society.
Gambell, T. & C. Yang. 2006. Word Segmentation: Quick but not dirty. Manuscript, Yale University.
Perfors, A., J. Tenenbaum, & T. Regier. 2011. The learnability of abstract syntactic principles. Cognition, 118, 306-338.
Stager, C. & J. Werker. 1997. Infants listen for more phonetic detail in speech perception than in word-learning tasks. Nature, 388, 381-382.
Swingley, D. 2009. Contributions of infant word learning to language development. Philosophical Transactions of the Royal Society B, 364, 3617-3632.
Friday, September 28, 2012
Fall meeting times set & Oct 10 = Feldman et al. 2012
Based on the responses, it seems like Wednesdays at 2pm will work best for everyone's schedules. Our complete schedule (with specific dates) can now be seen at
http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html
So, let's get kicking! For our first meeting on Wednesday October 10 @ 2pm in SBSG 2221, we'll be looking at an article that seeks to model learning of phonetic categories and word forms simultaneously, using hierarchical Bayesian inference:
Feldman, N., Griffiths, T., Goldwater, S., & Morgan, J. 2012. A role for the developing lexicon in phonetic category acquisition. Manuscript, University of Maryland at College Park, University of California at Berkeley, University of Edinburgh, and Brown University. Note: Because this is a manuscript, please do not cite without permission from Naomi Feldman.
http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/FeldmanEtAl2012Manu_PhonCatLearning.pdf
See you then!
Sunday, September 23, 2012
Fall quarter planning
I hope everyone's had a good summer break - and now it's time to gear up for the fall quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading Group webpage, including readings on the acquisition of sounds & words, and general learning & learnability:
http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html
Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here).
See you soon!
Wednesday, May 30, 2012
Have a good summer, and see you in the fall!
Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us this past academic year!
The CoLa Reading Group will be taking a hiatus this summer, and we'll resume again in the fall quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!
Monday, May 28, 2012
Some thoughts on Sonderegger & Niyogi (2010)
I think this paper is a really nice example of how to use real data for language change modeling, and why you would want to. I like this methodology in particular, where properties of the individual learner are explored and measured by their effects on the population dynamics. Interestingly, I think this is different from some of the other work I'm familiar with relating language acquisition and language change, since I'm not sure it restricts the learning period to the period of language acquisition, per se. In particular, the knowledge being modeled - stress patterns of lexical items, possibly based on influence from the rest of the lexicon - is something that seems like it can change after native language acquisition is over. That is, the learners here don't have to be children (which is something that Pearl & Weinberg (2007) assumed for the knowledge they looked at, and something that work by Lightfoot (1999, 2010) generally assumes). Based on some of the learning assumptions involved in this paper (e.g., probability matching when given noisy input, using the lexicon to determine the most likely stress pattern), I would say that the modeled learners probably aren't children. And that's totally fine. The only caveat is that the explanatory power of the learning story then becomes a little weaker, simply because other factors may be involved (language contact, synchronic change within the adults of a population, etc.), and these other factors aren't modeled here. So, when you get the population reproducing the observed behaviors, it's true that this learning behavior on its own could be the explanatory story - but it's also possible that a different learning behavior coupled with these other factors might be the true explanatory story. I think this is inherently a problem in explanatory models of language change, though - what you provide is an existence proof of a particular theory of how change happens. So then it's up to people who don't like your particular theory to provide an alternative. ;)
More targeted thoughts:
- I was definitely intrigued by the constrained variation observed in the stress patterns of English nouns and verbs together. Ross' generalization seems to describe it well enough (primary stress for nouns is further to the left than primary stress for verbs), but that doesn't explain where this preference comes from - it certainly seems quite arbitrary. Presumably, it could be an accident of history that a bunch of the "original" nouns happened to have that pattern while the verbs didn't, and that got passed along through the generations of speakers. The authors mention something later on about how nouns appear in trochaic-biasing contexts, while verbs appear in iambic-biasing contexts (based on work by Kelly and colleagues). This again seems like the result of some process, rather than the cause of it. Maybe it has something to do with the order of verbs and their arguments? I could imagine that there's some kind of preference for binary feet where stress occurs every other syllable, and then the stress context for nouns vs. verbs comes from that (somehow)...
- The authors mention that falling frequency (rather than low frequency) seems to be the trigger for change to {1,2}. This means that something could be highly frequent, but because its frequency lessens some (maybe lessens rapidly?), change is triggered. That seems odd to me. Instead, it seems more likely that both falling frequency and low frequency might be caused by the same underlying something, and that's the something that triggers change. (Caveat: I haven't read the work the authors mentioned, so maybe it's laid out more clearly there.) However, they restate it again at the end of this paper, relating to the last model they look at.
- The last model the authors explore (coupling by priors + mistransmission) is the one that does best at matching the desired behaviors, such as changing to {1,2} more often. I interpreted this model as something like the following: If enough examples are heard, the mistransmission bias encourages mis-hearing in the right direction, given the priors that come from the lexicon on overall stress patterns. However, the mistransmission also means that it goes towards that {1,2} pattern more slowly, so only higher frequencies can make it happen the way we want it to (and this is how it differs from the fourth model that just has coupling by priors).
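Just to check my own interpretation, I tried writing that down as a toy iterated map (emphasis on toy: the parameter values are invented and this is only a cartoon of the coupling-by-priors + mistransmission intuition, not Sonderegger & Niyogi's actual model): each generation estimates the {1,2} rate from slightly mis-heard productions of the previous generation, blended with a lexicon-wide prior, where lower-frequency words lean harder on that prior.

```python
# Toy iterated map: x = proportion of a word's productions with the {1,2} pattern.
# All parameter values below are invented for illustration.
MISHEAR = 0.05    # chance a non-{1,2} production is mis-heard as {1,2}
PRIOR_12 = 0.7    # lexicon-wide prior favoring the noun/verb {1,2} split

def next_generation(x, data_weight):
    """One generation: learn from mis-heard productions, blended with the prior.
    data_weight ~ how much the (frequency-dependent) data matter vs. the prior."""
    heard_12 = x + (1 - x) * MISHEAR      # mistransmission nudges estimates upward
    return data_weight * heard_12 + (1 - data_weight) * PRIOR_12

for label, weight in [("high-frequency word", 0.95), ("low-frequency word", 0.5)]:
    x = 0.1                               # start mostly non-{1,2}
    trajectory = []
    for _ in range(20):
        x = next_generation(x, weight)
        trajectory.append(round(x, 3))
    print(label, trajectory[:5], "...", trajectory[-1])
```

The only point of the sketch is to see how the trajectory toward {1,2} depends on the balance between what's actually heard and the lexicon-wide prior - the real model's dynamics (and the frequency effects they report) are of course richer than this.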
~~~
References
Lightfoot, D. (1999). The development of language: Acquisition, change, and evolution. Oxford, England: Blackwell.
Lightfoot, D. (2010). Language acquisition and language change. Wiley Interdisciplinary Reviews: Cognitive Science, 1, 677-684. doi: 10.1002/wcs.39.
Pearl, L. & Weinberg, A. (2007) Input Filtering in Syntactic Acquisition: Answers from Language Change Modeling, Language Learning and Development, 3(1), 43-72.
Wednesday, May 16, 2012
Next time on May 30: Sonderegger & Niyogi (2010)
Thanks to everyone who was able to join our rousing discussion today of Crain & Thornton's (2012) article on syntax acquisition! Next time on May 30 at 10:30am in SBSG 2221, we'll be looking at an article that examines the interplay of language acquisition and language change, looking at the role of mistransmission in a dynamical system:
Sonderegger, M. & Niyogi, P. (2010). Combining data and mathematical models of language change. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 1019-1029.
http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/SondreggerNiyogi2010_DataModelsLangChange.pdf
See you then!
Monday, May 14, 2012
Some thoughts on Crain & Thornton (2012)
Once again, I'm a fan of these kind of review articles because they often distill some of the arguments and assumptions that a particular perspective makes. It's quite clear that the authors come from a linguistic nativist perspective, and offer a set of phenomena that they think make the case for linguistic nativism very clearly. This is good for us as modelers because we can look at what the learning problems are that cause people to take the linguistic nativist perspective.
I admit that I do find some of their claims a little strong, given the evidence. This might be due to the fact that it is a review article, so they're just summarizing, rather than providing a detailed argument. However, I did find it a little ironic that they seem to make a particular assumption about what productivity is, and this kind of assumption is precisely what Yang (2010 Ms, 2011) took the usage-based folk to task for (more on this below). I also think the authors are a little overzealous in characterizing the weaknesses of the usage-based approach sometimes - in particular, they don't seem to want statistical learning to be part of the acquisition story at all. While I'm perfectly happy to say that statistical learning can't be the whole story (after all, we need a hypothesis space for it to operate over), I don't want to deny its usefulness.
More specific thoughts:
- I was surprised to find a conflation of nature (innate) vs. nurture (derived) with domain-specific vs. domain-general in the opening paragraph. To me, these are very different dimensions - for example, we could have an innate, domain-general learning process (say, statistical learning) and derived, domain-specific knowledge (say, phonemes).
- I thought this characterization of the usage-based approach was a little unfair: "...child language is expected to match that of adults, more or less". And then later on, "...children only (re)produce linguistic expressions they have experienced in the input..." Maybe on an extreme version, this is true. But I'm pretty sure the usage-based approach is meant to account for error patterns, too. And that doesn't "match" adult usage, per se, unless we're talking about a more abstract level of matching. This again comes up when they say the child "would not be expected to produce utterances that do not reflect the target language", later on in the section about child language vs. adult language.
- I thought the discussion of core vs. periphery was very good. I think this really is one way the two approaches (linguistic nativist vs. usage-based) significantly differ. For the usage-based folk, this is not a useful distinction - they expect everything to be accounted for the same way. For the linguistic nativist folk, this isn't necessarily true: Core phenomena may be learned in a different way than periphery phenomena.
- I was less impressed by the training study that showed 7-year-olds can't learn structure-independent rules. At that point in acquisition, it wouldn't surprise me at all if their hypothesis space was highly (insurmountably) biased towards structure-dependent rules, even if they had initially allowed structure-independent rules. However, the point I think the authors are trying to make here is that statistical learning needs a hypothesis space to operate over, and doesn't necessarily have anything to do with defining that hypothesis space. (And that, I can agree with.)
- This is the third time this quarter we've seen the structure-dependence of rules problem invoked. However, it's interesting to me that the fact there is still a learning problem seems to be glossed over. That is, let's suppose we know we're only supposed to use structure-dependent rules. It's still a question of which rule we should pick, given the input data, isn't it? This is an interesting learning problem, I think.
- The discussion about how children must avoid making overly broad generalizations (given ambiguous data) seems a bit old-fashioned to me. Bayesian inference is one really easy way to converge on the subset hypothesis, given ambiguous data, for example (a tiny sketch of this appears just before the references below). But I think this shows how techniques like Bayesian inference haven't really managed to penetrate the discussions of language acquisition in linguistic nativist circles.
- For the Principle C data, the authors make an assertion that 3-year-olds knowing the usage of names vs. pronouns indicates knowledge that they couldn't have learned. But this is an empirical question, I think - what other (and how many other) hypotheses might they have? What are the relevant data to learn from (utterances with names and pronouns in them?), and how often do these data appear in child-directed speech?
- The conjunction and disjunction stuff is definitely very cool - I get the sense that these kinds of data don't appear that often in children's input, so it again becomes a very interesting question about what kinds of generalizations are reasonable to make, given ambiguous data. Additionally, it's hard to observe interpretations the way we can observe the forms of utterances - in particular, it's unclear if the child gets the same interpretation the adult intends. This in general makes semantic acquisition stuff like this a very interesting problem.
- For the passives, I wonder if children's passive knowledge varies by verb semantics. I could imagine a situation where passives with physical verbs come first (easily observable), then internal state (like heard), and then mental (like thought). This ties into how observable the data are for each verb type.
- For explaining long-distance wh questions with wh-medial constructions (What do you think what does Cookie Monster like?), I think the authors are a touch hasty on dismissing a juxtaposition account simply because kids don't repeat the full NP (e.g., Which smurf) in the wh-medial position. It seems like this could be explained by a bit of pragmatic knowledge about pronoun vs. name usage, where kids don't like to say the full name after they've already said it earlier in the utterance (we know this from imitation tasks with young kids around 3 years old, I believe).
- The productivity assumption I mentioned in the intro to this post relates to this wh-medial question issue. The third argument against juxtaposition is that we should expect to see certain kinds of utterances regularly (like (41)), but we don't observe them that often. However, before assuming this means that children do not productively use these forms, we probably need to have an objective measure of how often we would expect them to use these forms (probably based on a Zipfian distribution, etc.).
- I love how elegant the continuity hypothesis is. I'm less convinced by the wh-medial questions as evidence, but it's potentially a support for it. However, I find the positive polarity stuff (and in particular, the different behavior in English vs. Japanese children, as compared to adults) to be more convincing support for it (the kids have an initial bias that they probably didn't pick up from the adults). The only issue (for me) with the PPI parameter is that it seems very narrow. Usually, we try to make parameters for things that connect to a lot of different linguistic phenomena. Maybe this parameter might connect to other logical operators, and not just AND and OR? On a related note, if it's just tied to AND and OR, what does the parameter really accomplish? That is, does it reduce the hypothesis space in a useful way? How many other hypotheses could there be otherwise for interpreting AND and OR?
- Related to the PPI stuff: I was less clear on their story about how children pick that initial bias: "...favor parameter values that generate scope relations that make sentences true in the narrowest range of circumstances...". This is very abstract indeed - kids are measuring an interpretation by how many hypothetical situations it would be true for. This really depends on their ability to imagine those other situations and actively be comparing them against a current interpretation...
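As promised in the bullet about overly broad generalizations, here's a tiny sketch of how Bayesian inference handles the subset problem via the size principle: if examples are assumed to be sampled from the true hypothesis, then data consistent with the smaller hypothesis increasingly favor it over the superset, with no negative evidence needed. The hypothesis sizes and prior here are made up purely for illustration.

```python
def posterior_subset(n_examples, subset_size=10, superset_size=100, prior_subset=0.5):
    """P(subset hypothesis | n examples consistent with the subset),
    assuming examples are sampled uniformly from whichever hypothesis is true."""
    prior_superset = 1 - prior_subset
    like_subset   = (1 / subset_size) ** n_examples    # size principle: tighter fit
    like_superset = (1 / superset_size) ** n_examples
    numerator = prior_subset * like_subset
    return numerator / (numerator + prior_superset * like_superset)

for n in [0, 1, 3, 5]:
    print(n, "examples:", round(posterior_subset(n), 4))
# With even a handful of subset-consistent examples, the subset hypothesis dominates -
# no explicit negative evidence required.
```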
~~~
References:
Yang, C. (2010 Ms.) Who's Afraid of George Kingsley Zipf? Unpublished Manuscript, University of Pennsylvania.
Yang, C. (2011). A Statistical Test for Grammar. Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, 30-38.
Wednesday, May 2, 2012
Next time on May 16: Crain & Thornton (2012)
Thanks to everyone who was able to join us for our thoughtful discussion of Bouchard (2012)! Next time on May 16, we'll be reading a survey article on syntactic acquisition that compares two opposing current approaches, and attempts to adjudicate between them. It's possible that the learning problems discussed can be good targets for computational modeling studies as well.
Crain, S. & Thornton, R. (2012). Syntax acquisition. WIREs Cogn Sci, doi: 10.1002/wcs.1158.
http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/CrainThornton2012_SyntaxAcquisition.pdf
See you then!
-Lisa