Wednesday, November 28, 2012

See you in the winter!

Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us throughout the fall quarter! The CoLa Reading Group will resume again in the winter quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Sunday, November 25, 2012

Some thoughts on Frank (2012)

I thought this was a really nice big picture piece about computational modeling work in language acquisition, and it tries (admirably!) to consolidate insights in different domains about the kind of learning assumptions/strategies that are useful. This is such an incredibly good thing to do, I think - one of the questions I get a lot is whether there's one general purpose style of computational model that's the right way to do things, and I'm usually left shrugging and saying, "Depends on what you're trying to do." And to some extent of course, this is right - but there's also something to be said about what the different useful models have in common.

Another note: Despite the empirical coverage, I did feel there was something of a disconnect between the phenomena generative linguists get excited about (w.r.t. poverty of the stimulus, for example - syntactic islands, case theory, etc.) and the phenomena modeled in the studies discussed here. There's nothing wrong with this, since everyone's goal is to understand language acquisition, and that means acquisition of a lot of different kinds of knowledge. But I did wonder how the insights discussed here could be applied to more sophisticated knowledge acquisition problems in language. Frank already notes that it's unclear what insights successful models of more sophisticated knowledge have in common.

Some more targeted thoughts:

Frank focuses on two metrics of model success: sufficiency (basically, acquisition success) and fidelity (fitting patterns of human behavior). I've seen other proposed metrics, such as formal sufficiency, developmental compatibility, and explanatory power (discussed, for example, in Pearl 2010, which is based on prior work by Yang). I feel like formal sufficiency maps pretty well to sufficiency (and actually may cover fidelity too). Developmental compatibility, though, is more about psychological plausibility, and explanatory power is about the ability of the model to give informative (explanatory) answers about what causes the acquisition process modeled. I think all of the studies discussed hold up on the explanatory power metric, so that's fine. It's unclear how well they hold up for developmental compatibility - it may not matter if they're computational-level analyses, for example. But I feel like that's something that should be mentioned as a more prominent thing to think about when judging a computational model. (But maybe that's my algorithmic bias showing through.)

Related point: Frank clearly is aware of the tension between computational-level and algorithmic-level approaches, and spends some time discussing things like incremental vs. batch learning. I admit, I was surprised to see this though: "Fully incremental learning prevents backtracking or re-evaluation of hypotheses in light of earlier data". If I'm understanding this correctly, the idea is that you can't use earlier data at all in a fully incremental model. I think this conflates incremental with memoryless - for example, you can have an incremental learner that has some memory of prior data (usually in some kind of compressed format, perhaps tallying statistics of some kind, etc.). For me, all incremental means is that the learner processes data as it comes in - it doesn't preclude the ability to remember prior data with some (or even a lot of) detail.
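To make the incremental-but-not-memoryless distinction concrete, here's a minimal sketch (my own illustration, not a model from Frank 2012): a learner that processes each word as it arrives, but keeps a compressed memory of all prior data as running tallies, so its hypotheses always reflect earlier data even though no raw data is stored.

```python
from collections import Counter

class IncrementalLearner:
    """Incremental but not memoryless: processes data one item at a
    time, while retaining a compressed summary (counts) of all prior data."""

    def __init__(self):
        self.counts = Counter()  # compressed memory: tallies, not raw input
        self.total = 0

    def observe(self, word):
        # Process each data point as it comes in (incremental)...
        self.counts[word] += 1
        self.total += 1

    def probability(self, word):
        # ...but estimates still reflect every data point seen so far.
        return self.counts[word] / self.total if self.total else 0.0

learner = IncrementalLearner()
for w in ["kitty", "ball", "kitty", "doggy"]:
    learner.observe(w)

print(learner.probability("kitty"))  # 0.5
```

A memoryless learner, by contrast, would only be able to consult the current data point (and perhaps its current hypothesis) at each step.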

Related point: Human memory constraints. In the word segmentation section, Frank mentions that experimental results suggest that "learners may not store the results of segmentation veridically, falsely interpolating memories that they have heard novel items that share all of their individual transitions within a set of observed items". At first, I thought this was about humans not storing the actual segmentations in memory (and I thought, well, of course not - they're storing the recovered word forms). But the second bit made me think this was actually even more abstract than that - it seems to suggest that artificial language participants were extracting probabilistic rules about word forms, rather than the word forms themselves. Maybe this is because the word forms were disconnected from meaning in the experiments described, so the most compact representation was of the rules for making word forms, rather than the word forms themselves?

I loved the Goldsmith (2010) quote: "...if you dig deep enough into any task in acquisition, it will become clear that in order to model that task effectively, a model of every other task is necessary". This is probably generally true, no matter what you're studying, actually - you always have to simplify and pretend things are disconnected when you start out in order to make any progress. But then, once you know a little something, you can relax the idealizations. And Frank notes the synergies in acquisition tasks, which seems like exactly the right way to think about it (at least, now that we think we know something about the individual acquisition tasks involved). It seems like a good chunk of the exciting work going on in acquisition modeling is investigating solving multiple tasks simultaneously, leveraging information from the different tasks to make solving all of them easier. However, once you start trying to do this, you then need to have a precise model of how that leveraging/integration process works.

Another great quote (this time from George Box): "all models are wrong, but some are useful". So true - and related to the point above. I think a really nice contribution Frank makes is in thinking about ways in which models can be useful - whether they provide a general framework or are formal demonstrations of simple principles, for example.

I think this quote might ruffle a few linguist feathers: "...lexicalized (contain information that is linked to individual word forms), the majority of language acquisition could be characterized as 'word learning'. Inferring the meaning of individual lexical items...". While technically this could be true (given really complex ideas about word "meaning"), the complexity of the syntactic acquisition task gets a little lost here, especially given what many people think about as "word meaning". In particular, the rules for putting words together aren't necessarily connected directly to lexical semantics (though of course, individual word meaning plays a part).

I think the Frank et al. work on intention inference when learning a lexicon demonstrates a nice sequence of research w.r.t. the utility of computational models. Basically, child behavior was best explained by a principle of mutual exclusivity. So, for a while, that was a placeholder, i.e., something like "Use mutual exclusivity to make your decision". Then, Frank et al. came along and hypothesized where mutual exclusivity could come from, and showed how it could arise from more basic learning biases (e.g., "use probabilistic learning this way"). That is, mutual exclusivity itself didn't have to be a basic unit. This reminds me of the Subset Principle in generative linguistics, which falls out nicely from the Size Principle of Bayesian inference.
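The Size Principle logic can be sketched in a few lines (a toy illustration of my own, not the Frank et al. model): under strong sampling, each observed example has likelihood 1/|h| for any hypothesis h that contains it, so the smallest consistent hypothesis gains with every example - the subset preference falls out rather than being stipulated.

```python
def posterior(hypotheses, data):
    """Size Principle: likelihood of each example is 1/|h| if h contains
    it, 0 otherwise; posterior is the normalized product (flat prior)."""
    scores = {}
    for name, h in hypotheses.items():
        p = 1.0
        for d in data:
            p = p * (1.0 / len(h)) if d in h else 0.0
        scores[name] = p
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

hypotheses = {
    "dalmatians": {"dal1", "dal2", "dal3"},            # subset hypothesis
    "dogs": {"dal1", "dal2", "dal3", "pug1", "lab1"},  # superset hypothesis
}

# Three examples, all dalmatians: the subset hypothesis wins,
# (1/3)^3 vs. (1/5)^3, even though both are consistent with the data.
post = posterior(hypotheses, ["dal1", "dal2", "dal1"])
print(post)
```

No explicit "prefer the subset" principle appears anywhere in the code; it emerges from the likelihoods.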

It's an interesting idea that humans do best at learning when there are multiple (informationally redundant) cues available, as opposed to just one really informative cue. I'm not sure if the Mintz frequent frame is a really good example of this, though - it seems like a frame vs. a bigram is really just the same kind of statistical cue. Though maybe the point is more that the framing words provide more redundancy, rather than being different kinds of cues.
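To see why a frame feels like "the same kind" of statistical cue as a bigram, here's a small sketch (hypothetical mini-corpus, my own illustration): a Mintz-style frequent frame A_x_B tracks the two words jointly surrounding x, where a bigram cue would track only the preceding word - same distributional machinery, just more (redundant) context.

```python
from collections import defaultdict

def frames(utterances):
    """Map each (preceding word, following word) frame to the set of
    words that appear inside it."""
    contexts = defaultdict(set)
    for utt in utterances:
        words = utt.split()
        for i in range(1, len(words) - 1):
            contexts[(words[i - 1], words[i + 1])].add(words[i])
    return contexts

corpus = [
    "you want to eat it",
    "you need to see it",
    "you have to go now",
]
ctx = frames(corpus)

# The frame you_X_to groups want/need/have (all verbs) together.
print(sorted(ctx[("you", "to")]))  # ['have', 'need', 'want']
```

A bigram version would condition on "you" alone, and so would also sweep in whatever else follows "you" - the second framing word just sharpens the same statistic.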

It's also a really interesting idea to measure success by having the output of a model be an intermediate representation used in some other task that has an uncontroversial gold standard. Frank talks about it in the context of syntactic categories, but I could easily imagine the same thing applying to word segmentation. It's definitely a recurring problem that we don't want perfect segmentation for models of infant word segmentation - but then, what do we want? So maybe we can use the output of word segmentation as the input to word- (or morpheme-) meaning mapping.

It took me a little to understand what "expressive" meant in this context. I think it relates to the informational content of some representation - so if a representation is expressive, it can cover a lot of data while being very compact (e.g., rule-based systems, instead of mappings between individual lexical items). A quote near the end gets at this more directly: "...becomes possible to generate new sentences and to encode sentences more efficiently. At all levels of organization, language is non-random: it is characterized by a high degree of redundancy and hence there is a lot of room for compression." I think this is basically an information-theoretic motivation for having a grammar (which is great!). In a similar vein, it seems like this would be an argument in favor of Universal Grammar-style parameters, because they would be a very good compression of complex regularities and relationships in the data.
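The "room for compression" point is easy to demonstrate empirically (an illustrative sketch with invented text; a general-purpose compressor stands in for a grammar here): redundant, language-like input compresses far better than low-redundancy input, and a grammar can be viewed as exploiting exactly that redundancy.

```python
import random
import zlib

random.seed(0)

# Highly redundant "language-like" input vs. a low-redundancy contrast.
redundant = ("the kitty sees the doggy . " * 40).encode()
random_ish = bytes(random.randrange(256) for _ in range(1024))

# Compressed size as a fraction of original size (lower = more compressible).
r_red = len(zlib.compress(redundant)) / len(redundant)
r_rand = len(zlib.compress(random_ish)) / len(random_ish)
print(r_red, r_rand)  # the redundant text compresses far better
```

The compressor is discovering and reusing regularities in the data - which is, at this abstract level, what a compact grammar does.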



Pearl, L. 2010. Using computational modeling in language acquisition research. In E. Blom & S. Unsworth (eds). Experimental Methods in Language Acquisition Research, John Benjamins.

Wednesday, November 14, 2012

Next time on 11/28/12 @ 2pm in SBSG 2221 = Frank (2012)

Thanks to everyone who participated in our vigorous and thoughtful discussion of Hsu et al. (2011)!  For our next meeting on Wednesday November 28th @ 2pm in SBSG 2221, we'll be looking at a paper that investigates the role of computational models in the study of early language acquisition and how to evaluate them.

Frank, M. 2012. Computational models of early language acquisition. Manuscript, Stanford University.

Monday, November 12, 2012

Some thoughts on Hsu et al. 2011

So this seems to be more of an overview paper showcasing how to apply a probabilistic learning framework at the computational level to problems in language acquisition, whether we're concerned with theoretical learnability results or predicting observable behavior. As a followup to Hsu & Chater (2010), which we discussed a few years back, this re-emphasized some of the nice intuitions in the MDL framework (such as "more compact representations are better"). I think a strength of this framework is its ability to identify linguistic knowledge pieces that are hard to learn from the available data, since this is exactly the sort of thing poverty of the stimulus (PoS) is all about. (Of course, the results rest on the particular assumptions made about the input, forms of the rules, etc., but that's true of all computational analyses, I think.) On a related note, I did notice that nearly all the phenomena examined by Hsu et al. were based on lexical item classification (verb argument subcategorization) or contraction (what generativists might call "traces" in some cases). This is fine (especially the "wanna" case, which I have actually seen used in PoS arguments), but I was surprised that we're not really getting into the kind of complex sentential semantics or syntax that I usually see talked about in generativist circles (e.g., syntactic islands, case theory - see Crain & Pietroski (2002) for some examples on the semantic side). Also, even though Hsu et al.'s own analysis shows that wanna & that-traces are "practically" unlearnable (i.e., even with probabilistic learning, these look like PoS problems), it seems like they close the paper by somewhat downplaying this: "probabilistic language learning is theoretically and computationally possible."

Some more targeted thoughts below:

I think my biggest issue with the computational learnability analyses (and proofs) is that I find it very hard to connect them to the psychological problem of language acquisition that I'm used to thinking about.  (In fact, Kent Johnson in UCI's LPS department has a really nice 2004 paper talking about how this connection probably shouldn't have been made with the (in)famous Gold (1967) learnability results.) I do understand that this type of argument is meant to combat the claim about the "logical problem of language acquisition", with the specific interpretation that the "logical problem" comes from computational learnability results (and the Gold paper in particular). However, I've also seen "logical problem of language acquisition" apply to the simple fact that there are induction problems in language acquisition, i.e., the data are compatible with multiple hypotheses, and "logically" any of them could be right, but only one actually is, so "logical problem".  This second interpretation still seems right to me, and I don't feel particularly swayed to change this view after reading the learnability results here (though maybe that's (again) because I have trouble connecting these results to the psychological problem).

Related to the point above - in section 2, where we see a brief description of the learnability proof, the process is described as an algorithm that "generates a sequence of guesses concerning the generative probabilistic model of the language".  Are these guesses probabilities over utterances, probabilities over the generative grammars that produce the utterances, something else?  It seems like we might want them to be probabilities over the generative grammars, but then don't we need some definition of the hypothesis space of possible generative grammars?

I had a little trouble understanding the distinction that Hsu et al. were making between discriminative and generative models in the introduction. Basically, it seemed to me that "discriminative" behavior could be the output of a generative model, so we could view a discriminative model as a special case of a generative model. So is the idea that we really want to emphasize that humans are identifying the underlying probability distribution, instead of just making binary classifications based on their grammars? That is, that there is no such thing as "grammatical" and "ungrammatical", but instead these are epiphenomena of thresholding a probabilistic system?
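Here's a tiny sketch of the reading I'm suggesting (invented sentences and probabilities, purely illustrative): a generative model assigns a probability to every string, and binary "grammatical"/"ungrammatical" judgments fall out as an epiphenomenon of thresholding that distribution.

```python
# Hypothetical probabilities a generative model might assign to strings;
# the wanna-contraction violation gets vanishingly small probability.
probs = {
    "you want to go": 1e-4,
    "you wanna go": 5e-5,
    "who do you wanna see": 2e-5,
    "who do you wanna win": 1e-12,  # illicit contraction across a "trace"
}

def grammatical(sentence, threshold=1e-9):
    """Discriminative judgment as a special case of the generative model:
    just threshold the probability the model assigns."""
    return probs[sentence] >= threshold

print([s for s in probs if grammatical(s)])
```

On this view the learner's real target is the underlying distribution; the categorical judgments are derived, not primitive.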

In section 3, at the very end, Hsu et al. mention that the ideal statistical learner provides an "upper bound" on learnability.  I found this somewhat odd - I always thought of ideal learners as providing a lower bound in some sense, since they're not constrained by cognitive resource limitations, and are basically looking at the question of whether the data contain enough information to solve the problem in question.

The practical example in 3.2 with the "going to" contraction threw me for a bit, since I couldn't figure out how to interpret this: "Under the new grammar, going to contraction never occurs when to is a preposition and thus 0 bits are required to encode contraction." Clearly, the intent is that "no contraction" is cheaper to encode than the process of contraction, but why was that? Especially since the new grammar that has the "don't contract when to is a preposition" seems to require an extra rule.  Looking back to Hsu & Chater (2010), it seems to be that rules with probability 1 (like going to --> going to when to=prep) require 0 bits to encode.  So in effect, the new grammar that has a special exception when to is a preposition gets a data encoding boost, even though the actual grammar model is longer (since it has this exception explicitly encoded).  So,  "exceptions" that always apply (in a context-dependent way) are cheaper than general rules when the observable data appear in that context.
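The arithmetic behind this can be sketched in a few lines (my own back-of-the-envelope numbers, not Hsu et al.'s): under an optimal code, encoding an outcome with probability p costs -log2(p) bits, so a rule that applies with probability 1 in its context costs 0 bits per data point, even though stating the exception lengthens the grammar itself.

```python
import math

def data_cost(outcome_probs):
    """Total bits to encode a sequence of outcomes, where each outcome
    has the probability the grammar assigns it: cost = -log2(p) per outcome."""
    return sum(-math.log2(p) for p in outcome_probs)

# Ten preposition contexts, all uncontracted, under two grammars:
# Old grammar: contraction is optional everywhere, so p(no-contract) = 0.5.
old = data_cost([0.5] * 10)   # 1 bit per data point
# New grammar: "never contract when to is a preposition", p(no-contract) = 1.0.
new = data_cost([1.0] * 10)   # 0 bits per data point
print(old, new)  # 10.0 0.0
```

So the new grammar pays a one-time cost in grammar length for the explicit exception, but recoups it on every data point - which is why context-dependent exceptions that always apply come out cheaper overall.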

I liked the idea that learnability should correlate with grammaticality judgments, with the idea that more "learnable" rules (i.e., ones with more data in the input) are encountered more and so their probabilities are stronger in whichever direction. In looking at the computational results though, I have to admit I was surprised that "going to" ranked 12th in learnability (Fig 2), maybe putting it on the order of 50 years to learn. That rule seems very easy, and I assume the grammaticality judgments are very strong for it. (My intuitions are at least.)

A small methodological quibble, section 4.1: "...because many constructions do not occur often enough for statistical significance [in child-directed speech]...we use...the full Corpus of Contemporary American English." Isn't this the point for PoS arguments, though?  There are differences between child-directed and adult-directed input (especially between child-directed speech and adult-directed written text), especially at this lexical item level that Hsu et al. are looking at (and also even at very abstract levels like wh-dependencies: Pearl & Sprouse (forthcoming)). So if we don't find these often enough in child-directed speech, and the thing we're concerned with is child acquisition of language, doesn't this also suggest there's a potential PoS problem?

I liked that Hsu et al. connect their work to entrenchment theory, and basically provide a formal (computational-level) instantiation of how/why entrenchment occurs.


Crain, S. & P. Pietroski. 2002. Why language acquisition is a snap. The Linguistic Review, 19, 163-183.

Gold, E. 1967. Language Identification in the Limit. Information and Control, 10, 447-474.

Hsu, A. & N. Chater. 2010. The Logical Problem of Language Acquisition: A Probabilistic Perspective. Cognitive Science, 34, 972-1016.

Johnson, K. 2004. Gold's Theorem and Cognitive Science. Philosophy of Science, 71, 571-592.

Pearl, L. & J. Sprouse. Forthcoming 2012. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition.

Wednesday, October 24, 2012

Next time on 11/14 @ 2pm in SBSG 2221 = Hsu et al. 2011

Hi everyone, 
Thanks to everyone who participated in our thoughtful discussion of Gagliardi et al. (2012)!  For our next meeting on Wednesday November 14th @ 2pm in SBSG 2221, we'll be looking at an article that investigates a way to quantify natural language learnability and discusses the impact this has on the debate about the nature of the necessary learning biases for language:

Hsu, A., Chater, N., & Vitanyi, P. 2011. The probabilistic analysis of language acquisition: Theoretical, computational, and experimental analysis. Cognition, 120, 380-390.

See you then!

Monday, October 22, 2012

Some thoughts on Gagliardi et al. (2012)

I thought this was a really lovely Cog Sci paper showcasing how to combine experimental & computational methodologies (and still make it all fit in 6 pages).  The authors really tried to give the intuitions behind the modeling aspects, which makes this more accessible to a wider audience. The study does come off as a foundational one, given the many extensions that could be done (involving effects in younger word learners, cross-linguistic applications, etc.), but I think that's a perfectly reasonable approach (again, given the page limitations).  I also thought the empirical grounding was really lovely for the computational modeling part, especially as relating to the concept priors.  Granted, there are still some idealizations being made (more discussion of this below), but it's nice to see this being taken seriously.

Some more targeted thoughts:

--> One issue concerns the age of the children tested experimentally (4 years old) (and as Gagliardi et al. mention, a future study should look at younger word learners).  The reason is that 4-year-olds are fairly good word learners (and have a vocabulary of some size), and presumably have the link between concept and grammatical category (and maybe morphology and grammatical category for the adjectives) firmly established. So it maybe isn't so surprising that grammatical category information is helpful to them. What would be really nice is to know when that link is established, and the interaction between concept formation and recognition/mapping to grammatical categories.  I could certainly imagine a bootstrapping process, for instance, and it would be  useful to understand that more.

--> The generative model assumes a particular sequence, namely (1) choose the syntactic category, (2) choose the concept, and (3) choose instances of that concept.  This seems reasonable for the teaching scenario in the experimental setup, but what might we expect in a more realistic word-learning environment?  Would a generative model still have syntactic category first (probably not), or instead have a balance between syntactic environment and concept?  Or maybe it would be concept first?  And more importantly, how much would this matter? It would presumably change the probabilities that the learner needs to estimate at each point in the generative process.
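The three-step sequence can be sketched as follows (a toy version with invented probabilities, not the actual Gagliardi et al. parameters): sample a syntactic category, then a concept conditioned on that category, then instances of the concept. Reordering the steps would change exactly which conditional distributions the learner has to estimate.

```python
import random

random.seed(0)

# Invented distributions, purely for illustration.
P_CATEGORY = {"noun": 0.6, "adjective": 0.4}
P_CONCEPT = {  # concept given syntactic category
    "noun": {"kind": 0.8, "property": 0.2},
    "adjective": {"kind": 0.1, "property": 0.9},
}
INSTANCES = {  # hypothetical instances of each concept type
    "kind": ["dax-animal", "dax-food"],
    "property": ["dax-red", "dax-soft"],
}

def sample(dist):
    """Draw one outcome from a discrete distribution given as a dict."""
    r, cum = random.random(), 0.0
    for outcome, p in dist.items():
        cum += p
        if r < cum:
            return outcome
    return outcome

def generate(n_instances=3):
    category = sample(P_CATEGORY)           # (1) choose syntactic category
    concept = sample(P_CONCEPT[category])   # (2) choose concept given category
    instances = [random.choice(INSTANCES[concept])
                 for _ in range(n_instances)]  # (3) choose instances of concept
    return category, concept, instances

print(generate())
```

Putting the concept first, say, would mean estimating P(category | concept) instead of P(concept | category) - a genuinely different inference problem even over the same data.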

--> I'd be very interested to see the exact way the Mechanical Turk survey was conducted for classifying things as examples of kinds, properties, or both (and which words were used).  Obviously, due to space limitations, this wasn't included here.  But I can imagine that many words might easily be described as both kind & property, if you think carefully enough (or maybe too carefully) about it.  Take "cookie", for example (a fairly common child word, I think): It's got both kind (ex: food) and property aspects (ex: sweet) that are fairly salient. So it really matters what examples you give the participants and how you explain the classification you're looking for. And even then, we're getting adult judgments, where child judgments might be more malleable (so maybe we want to try this exercise with children too, if we can).

--> Also, on a related note, the authors make a (reasonable) idealization that the distribution of noun and adjective dimensions in the 30-month-old CDIs are representative of the "larger and more varied set of words" that the child experimental participants know.  However, I do wonder about the impact of that assumption, since we are talking about priors (which drive the model to use grammatical category information in a helpful way). It's not too hard to imagine children whose vocabularies skew away from this sample (especially if they're older).  Going in the other direction though, if we want to try to extend this to younger word learners, then the CDIs start to become a very good estimate of the nouns and adjectives these children know, so that's very good.

Wednesday, October 10, 2012

Next time on Oct 24 @ 2pm in SBSG 2221 = Gagliardi et al. 2012

Thanks to everyone who participated in our thoughtful discussion of Feldman et al. (2012 Ms)!  For our next meeting on Wednesday October 24 @ 2pm in SBSG 2221, we'll be looking at an article that seeks to model learning of word meaning for specific grammatical categories:

Gagliardi, A., E. Bennett, J. Lidz, & N. Feldman. 2012. Children's Inferences in Generalizing Novel Nouns and Adjectives. In N. Miyake, D. Peebles, & R. Cooper (Eds), Proceedings of the 34th Annual Meeting of the Cognitive Science Society, 354-359.

See you then!

Monday, October 8, 2012

Some thoughts on Feldman et al. (2012 Ms)

So I'm definitely a huge fan of work that combines different levels of information when solving acquisition problems, and this is that kind of study. In particular, as Feldman et al. note themselves, they're making explicit an idea that came from Swingley (2009): Maybe identifying phonetic categories from the acoustic signal is easier if you keep word context in mind.  Another way of putting this is that infants realize that sounds are part of larger units, and so as they try to solve the problem of identifying their native sounds, they're also trying to solve the problem of what these larger units are.  This seems intuitively right to me (I had lots of notes in the margins saying "right!" and "yes!!"), though of course we need to grant that infants realize these larger units exist.

One thing I was surprised about, since I had read an earlier version of this study (Feldman et al. 2009): The learners here actually aren't solving word segmentation at the same time they're learning phonetic categories.  For some reason, I had assumed they were - maybe because the idea of identifying the lexicon items in a stream of speech seems similar to word segmentation.  But that's not what's going on here.  Feldman et al. emphasize that the words are presented with boundaries already in place, so this is a little easier than real life. (It's as if the infants are presented with a list of words, or just isolated words.)  Given the nature of the Bayesian model (and especially since one of the co-authors is Sharon Goldwater, who's done work on Bayesian segmentation), I wonder how difficult it would be to actually do word segmentation at the same time. It seems fairly similar to me, with the lexicon model already in place (geometric word length, Dirichlet process for lexicon item frequency in the corpus, etc.)
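The lexicon model pieces mentioned here can be sketched generatively (a simplified version with my own parameters, not the actual Feldman et al. model): word tokens are drawn via a Chinese Restaurant Process (a draw from a Dirichlet process), and a brand-new lexicon item gets a form whose length is geometrically distributed over its phones.

```python
import random

random.seed(1)
PHONES = list("aeiou") + list("ptkbdg")  # toy phone inventory

def new_word_form(p_stop=0.4):
    """Geometric word length: add phones until a stop 'coin flip' succeeds."""
    form = random.choice(PHONES)
    while random.random() > p_stop:
        form += random.choice(PHONES)
    return form

def generate_corpus(n_tokens, alpha=1.0):
    """Chinese Restaurant Process over lexicon items: reuse an existing
    word in proportion to its token frequency, or coin a new word with
    probability alpha / (i + alpha)."""
    lexicon, tokens = [], []
    for i in range(n_tokens):
        if lexicon and random.random() > alpha / (i + alpha):
            tokens.append(random.choice(tokens))  # frequency-proportional reuse
        else:
            word = new_word_form()
            lexicon.append(word)
            tokens.append(word)
    return lexicon, tokens

lexicon, tokens = generate_corpus(20)
print(len(lexicon), len(tokens))
```

The rich-get-richer reuse step is what produces the Zipf-like token frequencies such models rely on; adding joint word segmentation would mean inferring the token boundaries rather than being handed them, with this same lexicon model as the prior.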

Anyway, on to some more targeted thoughts:

--> I thought the summary of categorization & the links between categorization in language acquisition and categorization in other areas of cognition was really well presented. Similarly, the summary of the previous phonetic category learning models was great - enough detail to know what happened, and how it compares to what Feldman et al. are doing.

--> Regarding the child-directed speech data used, I thought it was really great to see this kind of empirical grounding. I did wonder a bit about which corpora the CHILDES parental frequency count draws from though - since we're looking at processes that happen between 6 and 12 months, we might want to focus on data directed at children of that age. There are plenty of corpora in the American English section of CHILDES with at least some data in this range, so I don't think it would be too hard. The same conversion with the CMU pronouncing dictionary could then be used on those data. (Of course, getting the actual acoustic signal would be best, but I don't know how many CHILDES corpora have this information attached to them.  But if we had that, then we could get all the contextual/coarticulatory effects.) On a related note, I wonder how hard it would be to stack a coarticulatory model on top of the existing model, once you had that data.  Basically, this would involve hypothesizing different rules, perhaps based on motor constraints (rather than the more abstract rules that we see in phonology, such as those that Dillon et al. (forthcoming) look into in their learning model).  Also related, could a phonotactic model of some kind be stacked on top of this? (Blanchard et al. 2010 combine word segmentation & phonotactics.) A word could be made up of bigrams of phonetic categories, rather than just the unigrams in there now.

--> I liked that they used both the number of categories recovered and the pairwise performance measures to gauge model performance.  While it seems obvious that we want to learn the categories that match the adult categories, some previous models only checked that the right number of categories were recovered.

--> The larger point about the failure of distributional learning on its own reminds me a bit of Gambell & Yang (2006), who essentially were saying that distributional learning works much better in conjunction with additional information (stress information in their case, since they were looking at word segmentation).  Feldman et al.'s point is that this additional information can be on a different level of representation, and depending on what you believe about stress w.r.t. word segmentation, Gambell & Yang would be saying the same thing.

--> The discussion of minimal pairs is very interesting (and this was one of the cool ideas from the original Feldman et al. 2009 paper) - minimal pairs can actually harm phonetic category acquisition in the absence of referents.  In particular, it's more parsimonious to just have one lexicon item whose vowel varies, and this in turn creates broader vowel categories than we want.  So, to succeed, the learner needs to have a fairly weak bias to have a small lexicon - this then leads to splitting minimal pairs into multiple lexicon items, which is actually the correct thing to do.  However, we then have to wonder how realistic it is to have such a weak bias for a small lexicon. (Given memory & processing constraints in infants, it might seem more realistic to have a strong bias for a small lexicon.) On a related note, Feldman et al. note later on that information about word referents actually seems to hinder infant ability to distinguish a minimal pair (citing Stager & Werker 1997). Traditionally, this was explained as something like "word learning is extra hard processing-wise, so infants fail to make the phonetic category distinctions that would separate minimal pairs." But the basic point is that word referent information isn't so helpful.  But maybe it's enough for infants to know that words are functionally different, even if the exact word-meaning mapping isn't established? This might be less cognitively taxing for infants, and allow them to use that information to separate minimal pairs.  Or instead, maybe we should be looking for evidence that infants are terrible at learning minimal pairs when they're first building their lexicons. Feldman et al. reference some evidence that non-minimal pairs are actually really helpful for category learning (more specifically, minimal pairs embedded in non-minimal pairs).

--> I thought the discussion of hierarchical models in general near the end was really nice, and was struck by the statement that "knowledge of sounds is nothing more than a type of general knowledge about words". From a communicative perspective, this seems right - words are the meaningful things, not individual sounds.  Moreover, if we translate this statement back over to syntax since Perfors et al. (2011) used hierarchical models to learn about hierarchical grammars, we get something like "knowledge of hierarchical grammar is nothing more than a type of general knowledge about individual parse tree structures", and that also seems right.  Going back to sounds and words, it's just a little odd at first blush to think of sounds as being the higher level of knowledge and words being the lower level of knowledge. But I think Feldman et al. argue for it effectively.

--> I thought this was an excellent statement describing the computational/rational approach: "...identifying which problem [children] are solving can give us clues to the types of strategies that are likely to be used."


Blanchard, D., J. Heinz, & R. Golinkoff. 2010. Modeling the contribution of phonotactic cues to the problem of word segmentation. Journal of Child Language, 37, 487-511.

Dillon, B., E. Dunbar, & W. Idsardi. forthcoming. A single stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science.

Feldman, N., T. Griffiths, & J. Morgan. 2009. Learning phonetic categories by learning a lexicon. Proceedings of the 31st Annual Conference of the Cognitive Science Society.

Gambell, T. & C. Yang. 2006. Word Segmentation: Quick but not dirty. Manuscript, Yale University.

Perfors, A., J. Tenenbaum, & T. Regier. 2011. The learnability of abstract syntactic principles. Cognition, 118, 306-338.

Stager, C. & J. Werker. 1997. Infants listen for more phonetic detail in speech perception than in word-learning tasks. Nature, 388, 381-382.

Swingley, D. 2009. Contributions of infant word learning to language development. Philosophical Transactions of the Royal Society B, 364, 3617-3632.

Friday, September 28, 2012

Fall meeting times set & Oct 10 = Feldman et al. 2012

Based on the responses, it seems like Wednesdays at 2pm will work best for everyone's schedules. Our complete schedule (with specific dates) can now be seen at

So, let's get kicking!  For our first meeting on Wednesday October 10 @ 2pm in SBSG 2221, we'll be looking at an article that seeks to model learning of phonetic categories and word forms simultaneously, using hierarchical Bayesian inference:

Feldman, N., Griffiths, T., Goldwater, S., & Morgan, J. 2012. A role for the developing lexicon in phonetic category acquisition. Manuscript, University of Maryland at College Park, University of California at Berkeley, University of Edinburgh, and Brown University. Note: Because this is a manuscript, please do not cite without permission from Naomi Feldman.

See you then!

Sunday, September 23, 2012

Fall quarter planning

I hope everyone's had a good summer break - and now it's time to gear up for the fall quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading group webpage, including readings on the acquisition of sounds & words, and general learning & learnability:

Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here ). 

See you soon!

Wednesday, May 30, 2012

Have a good summer, and see you in the fall!

Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us this past academic year!

The CoLa Reading Group will be taking a hiatus this summer, and we'll resume again in the fall quarter.  As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Monday, May 28, 2012

Some thoughts on Sonderegger & Niyogi (2010)

I think this paper is a really nice example of how to use real data for language change modeling, and why you would want to.  I like this methodology in particular, where properties of the individual learner are explored and measured by their effects on the population dynamics.  Interestingly, I think this is different from some of the other work I'm familiar with relating language acquisition and language change, since I'm not sure it restricts the learning period to the period of language acquisition, per se.  In particular, the knowledge being modeled - stress patterns of lexical items, possibly based on influence from the rest of the lexicon - is something that seems like it can change after native language acquisition is over.  That is, the learners here don't have to be children (which is something that Pearl & Weinberg (2007) assumed for the knowledge they looked at, and something that work by Lightfoot (1999, 2010) generally assumes).  Based on some of the learning assumptions involved in this paper (e.g., probability matching when given noisy input, using the lexicon to determine the most likely stress pattern), I would say that the modeled learners probably aren't children.  And that's totally fine. The only caveat is that the explanatory power of learning as an account of the observed changes then becomes a little weaker, simply because other factors may be involved (language contact, synchronic change within the adults of a population, etc.), and these other factors aren't modeled here.  So, when you get the population reproducing the observed behaviors, it's true that this learning behavior on its own could be the explanatory story - but it's also possible that a different learning behavior coupled with these other factors might be the true explanatory story.  I think this is inherently a problem in explanatory models of language change, though - what you provide is an existence proof of a particular theory of how change happens.  
So then it's up to people who don't like your particular theory to provide an alternative. ;)

More targeted thoughts:

- I was definitely intrigued by the constrained variation observed in the stress patterns of English nouns and verbs together.  Ross' generalization seems to describe it well enough (primary stress for nouns is further to the left than primary stress for verbs), but that doesn't explain where this preference comes from - it certainly seems quite arbitrary.  Presumably, it could be an accident of history that a bunch of the "original" nouns happened to have that pattern while the verbs didn't, and that got passed along through the generations of speakers.  The authors mention something later on about how nouns appear in trochaic-biasing contexts, while verbs appear in iambic-biasing contexts (based on work by Kelly and colleagues).  This again seems like the result of some process, rather than the cause of it.  Maybe it has something to do with the order of verbs and their arguments?  I could imagine that there's some kind of preference for binary feet where stress occurs every other syllable, and then the stress context for nouns vs. verbs comes from that (somehow)...

- The authors mention that falling frequency (rather than low frequency) seems to be the trigger for change to {1,2}.  This means that something could be highly frequent, but because its frequency drops somewhat (maybe drops rapidly?), change is triggered.  That seems odd to me. Instead, it seems more likely that both falling frequency and low frequency might be caused by the same underlying something, and that's the something that triggers change.  (Caveat: I haven't read the work the authors mentioned, so maybe it's laid out more clearly there.)  However, they restate it at the end of the paper, in relation to the last model they look at.

- The last model the authors explore (coupling by priors + mistransmission) is the one that does best at matching the desired behaviors, such as changing to {1,2} more often.  I interpreted this model as something like the following: If enough examples are heard, the mistransmission bias encourages mis-hearing in the right direction, given the priors that come from the lexicon on overall stress patterns.  However, the mistransmission also means that it goes towards that {1,2} pattern more slowly, so only higher frequencies can make it happen the way we want it to (and this is how it differs from the fourth model that just has coupling by priors).
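To make my interpretation concrete, here's a toy iterated-learning sketch in Python. This is entirely my own invention (not Sonderegger & Niyogi's actual equations, and all parameter values are made up): mistransmission nudges the heard data toward {1,2} each generation, and learners probability-match over that data blended with a lexicon-wide prior.

```python
def next_generation(p, mistransmit=0.05, prior=0.7, coupling=0.2):
    """One generation of a toy iterated-learning update (my own sketch,
    not Sonderegger & Niyogi's model).
    p: current proportion of speakers using the {1,2} stress pattern.
    mistransmit: rate at which the other variant is misheard as {1,2}.
    prior: lexicon-wide prior favoring {1,2}; coupling: its weight."""
    heard = p + (1 - p) * mistransmit                 # mishearing shifts the data
    return (1 - coupling) * heard + coupling * prior  # learners blend data + prior

p = 0.1
for _ in range(50):
    p = next_generation(p)
# p has drifted upward toward the {1,2} pattern across generations
```

Even this cartoon shows the qualitative point I took from the paper: because the per-generation mistransmission push is small, the drift toward {1,2} is slow, so only words heard often enough (relative to other pressures) complete the change.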


Lightfoot, D. (1999). The development of language: Acquisition, change, and evolution. Oxford, England: Blackwell.

Lightfoot, D. (2010). Language acquisition and language change. Wiley Interdisciplinary Reviews: Cognitive Science, 1, 677-684. doi: 10.1002/wcs.39.

Pearl, L. & Weinberg, A. (2007). Input Filtering in Syntactic Acquisition: Answers from Language Change Modeling. Language Learning and Development, 3(1), 43-72.

Wednesday, May 16, 2012

Next time on May 30: Sonderegger & Niyogi (2010)

Thanks to everyone who was able to join our rousing discussion today of Crain & Thornton's (2012) article on syntax acquisition!  Next time on May 30 at 10:30am in SBSG 2221, we'll be looking at an article that examines the interplay of language acquisition and language change, looking at the role of mistransmission in a dynamical system:

Sonderegger, M. & Niyogi, P. (2010). Combining data and mathematical models of language change. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 1019-1029.

See you then!

Monday, May 14, 2012

Some thoughts on Crain & Thornton (2012)

Once again, I'm a fan of these kinds of review articles because they often distill some of the arguments and assumptions that a particular perspective makes.  It's quite clear that the authors come from a linguistic nativist perspective, and offer a set of phenomena that they think make the case for linguistic nativism very clearly.  This is good for us as modelers because we can look at what the learning problems are that cause people to take the linguistic nativist perspective.

I admit that I do find some of their claims a little strong, given the evidence.  This might be due to the fact that it is a review article, so they're just summarizing, rather than providing a detailed argument.  However, I did find it a little ironic that they seem to make a particular assumption about what productivity is, and this kind of assumption is precisely what Yang (2010 Ms, 2011) took the usage-based folk to task for (more on this below).  I also think the authors are a little overzealous in characterizing the weaknesses of the usage-based approach sometimes - in particular, they don't seem like they want to have statistical learning be part of the acquisition story at all.  While I'm perfectly happy to say that statistical learning can't be the whole story (after all, we need a hypothesis space for it to operate over), I don't want to deny its usefulness.

More specific thoughts:

- I was surprised to find a conflation of nature (innate) vs. nurture (derived) with domain-specific vs. domain-general in the opening paragraph.  To me, these are very different dimensions - for example, we could have an innate, domain-general learning process (say, statistical learning) and derived, domain-specific knowledge (say, phonemes).

- I thought this characterization of the usage-based approach was a little unfair: "...child language is expected to match that of adults, more or less". And then later on, "...children only (re)produce linguistic expressions they have experienced in the input..." Maybe on an extreme version, this is true.  But I'm pretty sure the usage-based approach is meant to account for error patterns, too.  And that doesn't "match" adult usage, per se, unless we're talking about a more abstract level of matching.  This again comes up when they say the child "would not be expected to produce utterances that do not reflect the target language", later on in the section about child language vs. adult language.

- I thought the discussion of core vs. periphery was very good.  I think this really is one way the two approaches (linguistic nativist vs. usage-based) significantly differ.  For the usage-based folk, this is not a useful distinction - they expect everything to be accounted for the same way.  For the linguistic nativist folk, this isn't necessarily true: Core phenomena may be learned in a different way than periphery phenomena.

- I was less impressed by the training study that showed 7-year-olds can't learn structure-independent rules.  At that point in acquisition, it wouldn't surprise me at all if their hypothesis space was highly (insurmountably) biased towards structure-dependent rules, even if they had initially allowed structure-independent rules.  However, the point I think the authors are trying to make here is that statistical learning needs a hypothesis space to operate over, and doesn't necessarily have anything to do with defining that hypothesis space.  (And that, I can agree with.)

- This is the third time this quarter we've seen the structure-dependence of rules problem invoked.  However, it's interesting to me that the fact there is still a learning problem seems to be glossed over.  That is, let's suppose we know we're only supposed to use structure-dependent rules.  It's still a question of which rule we should pick, given the input data, isn't it?  This is an interesting learning problem, I think.

- The discussion about how children must avoid making overly-broad generalizations (given ambiguous data) seems a bit old-fashioned to me. Bayesian inference is one really easy way to learn the subset hypothesis, given ambiguous data, for example.  But I think this shows how techniques like Bayesian inference haven't really managed to penetrate the discussions of language acquisition in linguistic nativist circles.
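For instance, here's a minimal sketch of the size-principle logic that makes this easy in a Bayesian setting (my own toy numbers): under strong sampling, each example is drawn from the true hypothesis, so a smaller (subset) hypothesis assigns each consistent example a higher likelihood, and a handful of examples rapidly favors the subset.

```python
# Minimal illustration (my sketch) of how Bayesian inference learns the
# subset hypothesis from ambiguous data: smaller hypotheses assign higher
# likelihood to each example they contain.

def posterior_subset(n_examples, subset_size=10, superset_size=100):
    """Posterior probability of the subset hypothesis after n examples
    that are all consistent with both hypotheses (equal priors)."""
    like_sub = (1 / subset_size) ** n_examples
    like_sup = (1 / superset_size) ** n_examples
    return like_sub / (like_sub + like_sup)

p0 = posterior_subset(0)   # no data: the hypotheses are tied at 0.5
p3 = posterior_subset(3)   # three consistent examples: subset wins decisively
```

The sizes (10 vs. 100) are arbitrary; the point is just that ambiguous positive evidence alone can drive a learner to the narrower generalization, with no negative evidence needed.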

- For the Principle C data, the authors make an assertion that 3-year-olds knowing the usage of names vs. pronouns indicates knowledge that they couldn't have learned.  But this is an empirical question, I think - what other (and how many other) hypotheses might they have?  What are the relevant data to learn from (utterances with names and pronouns in them?), and how often do these data appear in child-directed speech?

- The conjunction and disjunction stuff is definitely very cool - I get the sense that these kinds of data don't appear that often in children's input, so it again becomes a very interesting question about what kinds of generalizations are reasonable to make, given ambiguous data.  Additionally, it's hard to observe interpretations the way we can observe the forms of utterances - in particular, it's unclear if the child gets the same interpretation the adult intends.  This in general makes semantic acquisition stuff like this a very interesting problem.

- For the passives, I wonder if children's passive knowledge varies by verb semantics.  I could imagine a situation where passives with physical verbs come first (easily observable), then internal state (like heard), and then mental (like thought).  This ties into how observable the data are for each verb type.

- For explaining long-distance wh questions with wh-medial constructions (What do you think what does Cookie Monster like?), I think the authors are a touch hasty on dismissing a juxtaposition account simply because kids don't repeat the full NP (e.g., Which smurf) in the wh-medial position.  It seems like this could be explained by a bit of pragmatic knowledge about pronoun vs. name usage, where kids don't like to say the full name after they've already said it earlier in the utterance (we know this from imitation tasks with young kids around 3 years old, I believe).

- The productivity assumption I mentioned in the intro to this post relates to this wh-medial question issue.  The third argument against juxtaposition is that we should expect to see certain kinds of utterances regularly (like (41)), but we don't observe them that often.  However, before assuming this means that children do not productively use these forms, we probably need to have an objective measure of how often we would expect them to use these forms (probably based on a Zipfian distribution, etc.).
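As a toy illustration of the kind of baseline I mean (my own made-up numbers, not Yang's actual productivity test): under a Zipfian distribution over forms, the expected count of a low-ranked form in a realistic sample can easily be below one, so its absence from a child corpus is uninformative on its own.

```python
# A hedged sketch of a Zipfian expectation baseline: how many tokens of the
# r-th most frequent form would we expect in a sample of a given size?

def zipf_probs(n_types, s=1.0):
    """Zipfian probabilities over n_types forms (exponent s)."""
    weights = [1 / r ** s for r in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_attestations(rank, n_types, sample_size):
    """Expected token count of the form at a given frequency rank."""
    return zipf_probs(n_types)[rank - 1] * sample_size

# With 50 candidate forms and a 100-token sample, the top-ranked form is
# expected many times, but the lowest-ranked form less than once:
common = expected_attestations(1, 50, 100)
rare = expected_attestations(50, 50, 100)
```

So before concluding that children lack a productive form, we'd want to check whether its expected count under a Zipfian baseline even exceeds one in a corpus of that size.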

- I love how elegant the continuity hypothesis is.  I'm less convinced by the wh-medial questions as evidence, but it's potentially a support for it.  However, I find the positive polarity stuff (and in particular, the different behavior in English vs. Japanese children, as compared to adults) to be more convincing support for it (the kids have an initial bias that they probably didn't pick up from the adults).  The only issue (for me) with the PPI parameter is that it seems very narrow.  Usually, we try to make parameters for things that connect to a lot of different linguistic phenomena.  Maybe this parameter might connect to other logical operators, and not just AND and OR?  On a related note, if it's just tied to AND and OR, what does the parameter really accomplish?  That is, does it reduce the hypothesis space in a useful way?  How many other hypotheses could there be otherwise for interpreting AND and OR?

- Related to the PPI stuff: I was less clear on their story about how children pick that initial bias: "...favor parameter values that generate scope relations that make sentences true in the narrowest range of circumstances...".  This is very abstract indeed - kids are measuring an interpretation by how many hypothetical situations it would be true for.  This really depends on their ability to imagine those other situations and actively be comparing them against a current interpretation...

Yang, C. (2010 Ms.) Who's Afraid of George Kingsley Zipf? Unpublished Manuscript, University of Pennsylvania.

Yang, C. (2011). A Statistical Test for Grammar. Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, 30-38.

Wednesday, May 2, 2012

Next time on May 16: Crain & Thornton (2012)

Thanks to everyone who was able to join us for our thoughtful discussion of Bouchard (2012)! Next time on May 16, we'll be reading a survey article on syntactic acquisition that compares two opposing current approaches, and attempts to adjudicate between them.  It's possible that the learning problems discussed can be good targets for computational modeling studies as well.

Crain, S. & Thornton, R. (2012). Syntax acquisition. WIREs Cogn Sci, doi: 10.1002/wcs.1158.

See you then!

Monday, April 30, 2012

Some thoughts on Bouchard (2012)

I think Bouchard (2012) actually takes a similar approach to Perfors et al. (2011) with respect to solving the structure-dependence problem, in the sense of redefining what the problem is and then stating that the solution to this problem does not involve UG learning biases.  It's at this point that the two studies part ways, but there is, in fact, the fundamental similarity.  Bouchard does believe that meaning is inextricably tied to the problem, but rejects the transformational approach that's traditionally assumed by Chomsky and colleagues.  Instead, meaning is more foundational in how the structures are generated.  One thing that isn't clear to me at all is whether the UG problem is solved, as the title would suggest.  It seems to me that the components that Bouchard assumes involve a lot of knowledge about interpretation (ISSUE and its structural relationship to Tense, incompleteness relating to a non-tensed utterance, etc.), and it's unclear where this knowledge comes from, if it's not meant to be innate.  Maybe "solving the UG problem" is just supposed to be about providing a complete specification of what's in UG?

Some more targeted thoughts:

- One of Bouchard's issues with the current ideas about UG is that the components of UG seem hard to explain evolutionarily.  That is, if we accept the current UG formulation, it's hard to explain why this would come to be for any kind of adaptive reasons.  This is a fair point, but I'm not sure the UG Bouchard proposes gets around this either.

- I think Bouchard does a nice review of the current approach to UG that's motivated by efficient computation.  In particular, it's fair to ask if "efficiency" is really the crucial factor - maybe "effectiveness" would be better, if we're trying to relate this to some kind of evolutionary story.

- I'm not sure it's fair to criticize the transformational account by saying that children may not encounter declarative utterances before they encounter interrogative utterances.  It should be enough that children recognize the common semantics between them, and assume they're related.

- I appreciate Bouchard's effort to specify the exact form of the rule that relates declarative and interrogative utterances (the four constraints on the rule).  This is useful if we were ever interested in making a hypothesis space of rules, and having the child learn which one is the right one (it reminds me a bit of Dillon, Dunbar, & Idsardi (2011), with their rule-learner).  Anyway, the main point is clear: The issue is that the actual rule is one of many that could be posited, even given the four constraints Bouchard describes, and we either need the right rule to fall out from other constraints or we need it to be learnable from the available possibilities.

- I agree with the basic point that "with a different order comes different meaning", but the point is that it's a related meaning.  Even in example (21), the utterances are still about the event of seeing and involve the actors Mary and John.

- "Question formation is not structure dependent, it is meaning dependent" - Well, sure, but meaning dependent, especially as it's described here, is all about the structure.  So "meaning dependent" is the same as saying "structure dependent", isn't it?

- The Coherence Condition of Coindexation (example 30): This sounds great, but don't we then need to specify what "coherent" means?  This seems to be an example of describing what's going on, rather than explaining what's going on.  For example, for (29), why do those two elements get coindexed, out of all the elements in the utterance?  Presumably, this has to do with the structure of the utterance...  This relates to a point slightly later on: "...due to the lexical specifications that determine which actant of the event mediates the link between the event and a point in time" - Where do these lexical specifications come from?  Are they learned?  This seems more a description than an explanation.

- p.25: "Whatever Learning Machine enables them to learn signs also enables them to learn combinatorial signs such as dedicated orders of signs" - This seems like a real simplification.  The whole enterprise of syntax is based on the idea that meaning is not the only thing determining syntactic form (otherwise, how do you get ungrammatical utterances that are intelligible, like "Where did Jack think the necklace from was expensive?").  So the Learning Machine needs to have something explicit in there about how combinatorial meaning links to form.

Wednesday, April 18, 2012

Next time on May 2: Bouchard (2012)

Thanks to everyone who was able to join us for an informative discussion of Perfors et al. (2011), along with the reply piece in Berwick et al. (2011)!  Next time on May 2, we'll be looking at a different approach to addressing the same problem in language acquisition (structure-dependent rules) by Bouchard (2012). Interestingly, Bouchard is coming from a very different perspective, where the issue is not that too much has been assumed to be part of UG, but rather that not enough has.

Bouchard, D. (2012). Solving the UG Problem. Biolinguistics, 6(1), 1-31.

See you then!

Monday, April 16, 2012

Some thoughts on Perfors et al. (2011) + Berwick et al. (2011)

I really like how straightforward Perfors et al's (2011) Bayesian model is - it's very easy to see how and why they get the results that they do from child-directed speech.  They're very careful to say precisely what their model is doing: Assuming there are hierarchical representations in the child's hypothesis space already, these representations can be selected as the ones that best match the child-directed input.  In addition, I think they highlight how previous approaches to this problem have tended to split along two distinct dimensions: domain-specific vs. domain-general, and structured vs. unstructured.  It's always useful to figure out where the current approach is adding to the existing discussion.
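As a cartoon of the Bayesian Occam's razor doing the selection work here (my own toy, vastly simpler than comparing grammars): a more flexible model spreads its prior probability over more datasets, so when a simpler model fits the data about as well, the simpler model wins on marginal likelihood.

```python
from math import comb

# Toy illustration (mine, not Perfors et al.'s setup): compare a one-parameter
# model against a more flexible two-parameter model on binary data, using
# exact marginal likelihoods under uniform priors on each parameter.

def bernoulli_evidence(heads, tails):
    """Marginal likelihood of a binary sequence under a uniform prior on the
    bias: integral of p^h (1-p)^t dp = 1 / ((n+1) * C(n, h)), n = h + t."""
    n = heads + tails
    return 1 / ((n + 1) * comb(n, heads))

# Two contexts with similar outcome rates (8/10 and 7/10):
shared = bernoulli_evidence(15, 5)                              # one shared parameter
separate = bernoulli_evidence(8, 2) * bernoulli_evidence(7, 3)  # one per context
# shared > separate: the simpler model is preferred when the data don't
# demand the extra flexibility
```

The same trade-off, scaled up to context-free vs. flat grammars over child-directed speech, is what lets Perfors et al.'s model prefer the hierarchical representation without a built-in bias for it.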

The only real issue I see is the one Berwick et al. (2011) pointed out: The (infamous) poverty of the stimulus (PoS)  problem relating to structure dependence is not the one Perfors et al. (2011) are addressing.  In particular, the traditional PoS problem has to do with hypothesizing what kind of rules will relate a declarative utterance (e.g., "I can have an opinion") to its interrogative equivalent (e.g., "Can I have an opinion?"). This relationship isn't addressed in Perfors et al.'s model - all that model is concerned with is the ability to assign structure to these utterances.  As far as it knows, there's no relationship between the two.  And this is where we see the real divergence from the traditional PoS problem, where it was assumed that the child is trying to generate an interrogative using the same semantic content that would be used to make the declarative.  This is why the "rules of transformation" were hypothesized in the first place (granted, with the assumption that the declarative version was more basic, and the interrogative version had to be created from that basic version).  So, long story short, the Perfors et al. model is learning something that is different from the original PoS problem.

However, it's fair to assume that knowing there are hierarchical structures is a prerequisite for creating rules that use those hierarchical structures.  In this sense, what Perfors et al. have shown is really great - it allows the building blocks of the rules (hierarchical structures) to be chosen from among other representations.  However, as Berwick et al. point out, it still remains to be shown how having structured building blocks leads you to create structure-dependent rules.  Perfors et al. assume that this is an automatic step: [end of section 1.2] "...any reasonable approach to inducing rules defined over constituent structure should result in appropriate structure-dependent rules".  Phrased that way, it does sound plausible - and yet, I think there's a real distinction, especially if we're concerned about relating the declarative and interrogative versions of an utterance.  Making a structure-dependent rule requires using the available structure as the context of the rule.  So this means you could make a structure-independent rule just by not using structure in the context of the rule - even if your building blocks are structured.

Example of a structure-independent rule using structure building blocks:
Move the auxiliary verb after the first NP.
Building blocks: auxiliary verb, NP (structured)
Context: first (not structured)
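Here's the same contrast as toy code (my own illustration): both rules operate over structured building blocks (the auxiliary, the NP), but only one consults structure in its context, and they diverge exactly on the classic test sentence.

```python
# Toy version of the structure-(in)dependent rule contrast. Sentences are
# token lists; subject_np_end records where the main-clause subject NP
# (which contains a relative clause) ends -- structural information that
# the structure-independent rule simply ignores.

sentence = ["the", "boy", "who", "is", "smiling", "is", "happy"]
subject_np_end = 5  # "the boy who is smiling" spans tokens 0..4

def front_first_aux(tokens):
    """Structure-independent: front the linearly first auxiliary."""
    i = tokens.index("is")
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

def front_main_aux(tokens, np_end):
    """Structure-dependent: front the auxiliary that follows the subject NP."""
    i = tokens.index("is", np_end)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

wrong = front_first_aux(sentence)                 # *"is the boy who smiling is happy"
right = front_main_aux(sentence, subject_np_end)  # "is the boy who is smiling happy"
```

Both functions manipulate the same structured vocabulary; the difference is entirely in whether `np_end` (the structural context) figures in picking which auxiliary moves.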

So again, I think that what Perfors et al. have shown is great in terms of understanding the stages of learning - it's important to know that the preference for hierarchical structure in language doesn't have to be innate (even if the ability to consider hierarchical structure in the hypothesis space may be).  However, I do think it falls short of addressing the PoS problem that linguists typically associate with structure dependence.  This isn't a failing of Perfors et al. - it just means that people really have to be careful about how they interpret these results.  It's very tempting to say that the structure-dependence PoS problem has been solved if you don't give this a very careful read and know what linguists think the problem actually is.

Wednesday, April 4, 2012

Next time on April 18: Perfors et al. (2011)

We'll have our first meeting of the CoLa Reading Group for this quarter on Wednesday April 18 at 10:30am in SBSG 2221. You can check out the schedule at 

for the rest of this quarter's meetings.  

For our first article of the quarter, we'll be looking at Perfors, Tenenbaum, & Regier (2011), who use hierarchical Bayesian modeling to examine structure dependence in syntax, which has often been used as an example of an induction problem (or poverty of the stimulus) in language acquisition.  I also recommend looking at a section in a recent response to this article by Berwick, Pietroski, Yankama, & Chomsky (2011), since it explicitly addresses the results of Perfors et al. (2011).

Perfors, A., Tenenbaum, J., & Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118, 306-338.

Berwick, R., Pietroski, P., Yankama, B., & Chomsky, N. (2011). Poverty of the Stimulus revisited. Cognitive Science, 35, 1207-1242. [Section 4.2]

See you then!

Friday, March 30, 2012

Gearing up for the spring - readings available!

I hope everyone's had a good spring break - and now it's time to gear up for the spring quarter of the reading group! :) The schedule of readings is now posted on the CoLa reading group webpage, following several suggestions of topics of interest to the group.

Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week.

Monday, March 12, 2012

Thanks for a great quarter!

Thanks to everyone who was able to join us for our discussion of O'Donnell et al. (2011)!  It was very useful to compare the models discussed to some existing models that we know about, and think about how to connect the representational issues to language acquisition.

For next quarter, let me know if you have any particular articles or topics that you would be interested in discussing - you're welcome to post them here or email them to me at

Have a good spring break!

Friday, March 9, 2012

Some Thoughts on O'Donnell et al. (2011)

I like that this paper is interested in big ideas of knowledge representation (basically, how big are the chunks that we store), and provides what seems like a sensible formalization of the idea that medium-size reusable chunks are probably the way to go.  Within the same framework, they also provide formalizations of other ideas for the unit of representation (basically, use the smallest units (full-parsing/generative), use the largest units (full-listing), and use all the units (exemplar)), which is nice for easy comparison purposes.  While the intuition that medium-size reusable chunks are best is perhaps unsurprising, I think this gives us a clear quantitative argument for that idea.  I do wish we had been given some sense of what exactly these medium-size chunks look like for the two different morphology problems though - at first I thought this was due to space limitation, but the tech report (O'Donnell et al. 2009) version doesn't really show us what these look like either.  I wonder how well they match (or don't match) current morphological theories of representation.  I know the full-parsing theory is a strong viewpoint for syntax currently, but I don't know how many linguists believe that's really a viable option for morphology.  On the flip side, the exemplar-based idea seems like it would make more sense for morphology (where we have a fairly small number of possible combinations), while it seems like that would be a harder sell for syntax (where there can be quite a lot of different parses, especially for longer sentences). Similarly, the full-listing approach seems intractable for syntax.  Of course, this only really matters if we think Fragment Grammars apply at multiple levels of linguistic representation (e.g., morphology and syntax).  I'm assuming this is what the authors intend, though.

Some more targeted thoughts:

- Exemplar-based Inference: I can't imagine a world where this would win out, compared to Fragment Grammars (FragGs).  At best, it has the same coverage as FragGs, but it has to store a heck of a lot more.  Perhaps this is included for completeness in model comparison, particularly since the DOP framework assumes this?

-  I thought it was very good to mention other models that have similar properties to FragGs.  However, given the descriptions provided, I really wondered how Parsimonious Data-Oriented Parsing differs from FragGs ("...explicitly eschews the all-subtree approach in favor of finding a set of subtrees which best explains the data.")  Maybe in the way inference is done?

- In terms of comparing this to our reading from last time (Yang 2010), I wonder what's actually being explained by the inference process behind FragGs.  Is this a way to assess which representation is likely to be correct for adult usage? If so, this makes it similar to Yang (2010), as that was an assessment of productivity in child speech.  Or is this instead a proposal for how adults actually come to have these medium-size chunks, and so it would be a computational level explanation of the actual process of chunk formation?

- A minor note on the past tense representation: I found it interesting that the rule for past tense formation was explicitly encoded in the "morphological representation".  This makes this representation seem much more similar to work by Yang on morphological productivity in the English past tense (e.g., Yang 2005), which talks about predictability of child behavior based on the rules used to form the past tense.

- The derivational morphology section:  I admit, I got a bit lost on some of the details here.

  • How do we take 10,000 "forms" as data, and have that yield 25,000 types and 7.2 million tokens?  What are these forms?
  • I like the P and P* measures, since those seem to correlate somewhat with the idea of precision and recall (P ~= how generalizable is this suffix, P* ~= how many novel words use this suffix).  But then, why are we looking for a correlation between them instead of using an F-score?  What does it mean in Table 1 to have a correlation for P, for example?  Is that P vs. P*?  Or P vs something else?
  • Table 2 left me similarly puzzled - I couldn't decipher this: "...the marginal probability that each suffix occurred first or second in such forms...Table 2 gives the Spearman rank correlation between the (log) ratio of the probability of appearing second to the probability of appearing first with the mean rank statistic..." So if we take a word with two suffixes, s1 and s2, what exactly is being computed?  Is it log(prob(s1 in first position & s2 in second position)/prob(s2 in second position & s1 in first position))?  And then that's being correlated with the empirical relative ranking of these two suffixes?  So we want that probability ratio to be greater than 1, which gives a positive value when you take the log.  And then we're trying to correlate that positive number with the mean rank of the two suffixes?  Why should this be correlated?
- In the conclusion, the authors talk about how the difference between FragGs and other models is that FragGs care about predictive ability - future novelty vs. future reuse.  But I'm not sure I understand how that differs from the computation vs. storage tradeoff (which they advocate replacing with future novelty vs. future reuse) - isn't future novelty based on computation while future reuse is based on storage?  If so, this seems like they're restating the tradeoff, but with an emphasis on future usage (i.e., "we care about computation vs. storage because we care about the ability to use language efficiently in the future").
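Since the P/P* bullet above treats the two measures as precision/recall analogues, here's a toy sketch (my own construction, with entirely made-up values - nothing here comes from the paper) of what combining them into an F-score might look like:

```python
# Hypothetical sketch: treating P (how generalizable a suffix is) and
# P* (how many novel words use it) as precision/recall analogues and
# combining them with a harmonic mean (an F-score), rather than
# correlating them.  All values below are invented for illustration.

def f_score(p, p_star):
    """Harmonic mean of the two productivity measures."""
    if p + p_star == 0:
        return 0.0
    return 2 * p * p_star / (p + p_star)

# Made-up productivity values for a few English suffixes.
suffixes = {"-ness": (0.9, 0.7), "-ity": (0.4, 0.2), "-th": (0.05, 0.01)}

for suffix, (p, p_star) in suffixes.items():
    print(f"{suffix}: F = {f_score(p, p_star):.3f}")
```

An F-score like this would reward suffixes that score well on *both* measures, which is a different question than whether the two measures rise and fall together across suffixes (which is what a correlation answers).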


O'Donnell, T., Goodman, N., & Tenenbaum, J. (2009). Fragment Grammars: Exploring Computation and Reuse in Language. Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2009-013.

Yang, C. (2005). On Productivity. Linguistic Variation Yearbook, 5, 265-302.

Yang, C. (2010 Ms.) Who's Afraid of George Kingsley Zipf? Unpublished Manuscript, University of Pennsylvania.

Monday, February 27, 2012

Next time on Mar 12: O'Donnell et al. (2011)

Thanks to everyone who was able to join us for our spirited discussion of Yang (2010).  I think we definitely clarified what that study accomplishes in the debate between the two theoretical viewpoints.  Next time on March 12, we'll be looking at a paper that also investigates productivity, examining it through the learning angle, in addition to the basic question of representation.

O'Donnell, T.J., Snedeker, J., Tenenbaum, J.B., & Goodman, N.D. (2011). Productivity and reuse in language. Proceedings of the Thirty-Third Annual Conference of the Cognitive Science Society. Boston, MA.

See you then!

Friday, February 24, 2012

Some thoughts on Yang (2010)

I found this paper a real delight to read - like many of Yang's other papers that we've looked at, it's very clear what was done and how this relates to the larger questions being examined.  In particular, I thought it was excellent to compare the item-based approach to a generative approach, based on what predictions each would make for children's productions.  As Yang pointed out, a lot of previous intuitions about what it means to have a generative (or productive) grammar didn't take into account the Zipfian nature of linguistic data.  So, by having a way to generate predictions about how much productivity (as measured by overlap) is expected under each viewpoint, we not only get support for the generative system viewpoint but also actually have support against (at least one version of) the item-based approach.  Given how popular the item-based approach is in some circles (e.g., a 2009 PNAS article by Bannard, Lieven, & Tomasello), I thought this was quite striking.  From my viewpoint, this is one great way to use mathematical & modeling techniques: to adjudicate between competing theoretical representations.
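To make the overlap logic concrete, here's a minimal simulation (my own sketch with made-up sizes, not Yang's actual calculation): even under a fully productive determiner rule, Zipfian noun frequencies keep the observed determiner overlap well below 100%, because most noun types occur only once and so can only ever appear with one determiner.

```python
# Minimal simulation of why a fully productive determiner rule still
# predicts low det-noun "overlap" under Zipfian noun frequencies.
# Sizes (S types, N tokens) are made up for illustration.
import random
from collections import defaultdict

random.seed(1)
S, N = 1000, 5000  # noun types, noun tokens

# Zipfian weights: probability of rank r proportional to 1/r.
weights = [1.0 / r for r in range(1, S + 1)]

dets_seen = defaultdict(set)
for noun in random.choices(range(S), weights=weights, k=N):
    # Fully productive rule: either determiner is always available.
    dets_seen[noun].add(random.choice(["a", "the"]))

sampled = len(dets_seen)
both = sum(1 for d in dets_seen.values() if len(d) == 2)
print(f"overlap: {both}/{sampled} = {both / sampled:.2f}")
```

The point of the exercise is that a "low" overlap rate in child speech is exactly what a generative system predicts once sample size and the Zipfian distribution are taken into account, so low overlap by itself is no argument for item-based storage.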

Some more targeted thoughts:

  • I really liked in section 1 where the quotes from Tomasello were presented - this gives a clear idea about what exactly is claimed by the item-based approach, and how they have previously used (apparently flawed) intuitions about expected productivity to support that approach. I thought a quote at the end of section 3.3 summed it up beautifully:  "...the advocates of item-based learning not only rejected the alternative hypothesis without adequate statistical tests, but also accepted the favored hypothesis without adequate statistical tests."
  • The remark in section 2.2 about how even adult usage isn't "productive" by the standard of the item-based crowd is a really nice point.  If adult usage isn't "productive", but we believe adults have a generative system, then this should make us question our assumption that "unproductive" child usage indicates a lack of a generative system.  Of course, I suppose one might argue that maybe we don't think adults have a fully generative system (this is the view of construction grammar, to some extent, I believe.)
  • In section 3.2, I thought Table 1 was a beautiful demonstration of the match between expected overlap for the generative system and the empirically observed overlap in children's speech. 
  • A minor point about the S/N threshold discussed in 3.2 - I get that S/ln N is a reasonable approximation for rank, especially as N gets very large.  However, I'm not quite sure I understand why S/N was chosen as the threshold.  I get that it's an upper bound kind of thing, but if S/ln N grows more slowly than S/N, why not just use S/ln N to get a more accurate threshold?  It's not as if ln N is hard to calculate.
  • In section 3.3, I get that this is merely an attempt to make the item-based approach explicit (and maybe the item-based folk would think it's not the right characterization), but I think it's a pretty good attempt.  It gets at the heart of what their theory predicts - you get lots of storage of individual lexical item combinations.  Then, of course, Table 2 shows how this representation doesn't match the empirically observed overlap rates nearly as well, so we have a point against that representation. 
  • Section 4 is nice in that it suggests that this way of testing theoretical representations should be a general-purpose one - do it for determiner usage, but also for verbal morphology and verb argument structure.  Though this analysis wasn't conducted for those other phenomena, I was very convinced that the data show a Zipfian distribution, and so we might expect a generative system to be compatible with them.
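On the S/N vs. S/ln N question in the bullets above, a quick numerical comparison (with made-up values for S and N - I'm not reproducing Yang's actual quantities) shows just how far apart the two thresholds get as N grows:

```python
# Quick numerical comparison (made-up S and N) of the two candidate
# thresholds discussed above: S/N versus S/ln N.
import math

S = 100  # hypothetical number of types
for N in (100, 1_000, 10_000, 100_000):
    print(f"N={N:>7}: S/N = {S / N:.4f}   S/ln N = {S / math.log(N):.2f}")
```

Since ln N is trivial to compute, the divergence makes the choice of the cruder S/N bound all the more puzzling to me.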

Bannard, C., Lieven, E., & Tomasello, M. (2009). Modeling children's early grammatical knowledge. Proceedings of the National Academy of Sciences, 106(41), 17284-17289.

Monday, February 6, 2012

Next time on Feb 27: Yang (2010)

Thanks to everyone who was able to join our extremely lively discussion on Waterfall et al. (2010), and their approach to learning generative grammars from realistic data!  Next time on February 27, we'll be looking at a paper that examines a way to quantify claims of linguistic productivity.

Yang, C. (2010 Ms.) Who's Afraid of George Kingsley Zipf? Unpublished Manuscript, University of Pennsylvania.

See you then!

Friday, February 3, 2012

Some thoughts on Waterfall et al (2010)

What I really like about this paper is the opening discussion where they sketch the broad ideas that motivated the studies discussed in the rest of the paper.  They explicitly talk about why the aim of language acquisition is a grammar, why we should care about the algorithmic level, what developmental computational psycholinguistics ought to be, why current computational models are still lacking because they miss out on the social situatedness of language, and what exactly is meant by "psychologically real" (and also how that differs from "algorithmically learnable").  I found this to be very valuable to just have all in one place.  And I admit, it got my hopes up for what kind of model they would actually be using.

Unfortunately (for me), the rest of the paper ended up being somewhat anti-climactic because they don't end up implementing a model that has all the features of interest.  Of course, that's a tall order, but they go through the process of running models that have the first three features, and then they talk about a lovely new discourse-related information type that seems like it should be incorporated into their model - and then they don't incorporate it.  I think I was expecting them to at least talk about how to incorporate it into the models they spent so much time on in the beginning, even if it was infeasible at the current time to actually implement (for whatever reason).  But that didn't seem to be what happened.

This isn't to say that the models they implemented and the identification of the "variation set" construct aren't interesting - it's just that I was expecting more based on the opening.  As it is, the paper ends up feeling a bit scattered to me - a lot of potentially useful pieces, but they're not tied together very well.

Some more targeted thoughts:

p.674: I like that they were questioning the use of a gold standard, given that our theories about what the syntactic structure might be may not necessarily match psychological reality.  I did find their definitions of recall and precision a bit hard to understand, though.  Like many other things in the paper, I would have found an explicit formula (and possibly an example) more helpful than the text description.  My best understanding of recall was something like the number of new generalizations divided by the size of the test set plus the number of new generalizations, while precision was something like the number of correct new generalizations divided by the total number of new generalizations.
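To make my uncertainty concrete, here's a sketch of the standard set-based definitions of recall and precision for this kind of grammar evaluation - these are my interpretation of the usual setup, and may well differ from whatever the paper actually computes:

```python
# Standard set-based recall/precision for grammar induction: compare the
# sentences the induced grammar generates against a held-out test set.
# (My interpretation; the paper's definitions may differ.)

def recall_precision(generated, test):
    """generated: set of sentences the induced grammar produces;
    test: set of held-out sentences it should be able to produce."""
    correct = generated & test
    recall = len(correct) / len(test) if test else 0.0
    precision = len(correct) / len(generated) if generated else 0.0
    return recall, precision

# Toy example sentences, invented for illustration.
generated = {"the dog ran", "a cat slept", "the cat ran", "dog the ran"}
test = {"the dog ran", "a cat slept", "the bird sang"}

r, p = recall_precision(generated, test)
print(f"recall={r:.2f}, precision={p:.2f}")
```

An explicit formula like this (plus a small example) is all I was hoping for from the paper's text description.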

p.676: They talk about how a strength of their models is that there's no preliminary knowledge of things like grammatical categories (parts-of-speech).  While it's nice to be able to say "Look what we can do with no knowledge!", I think this actually makes the problem less psychologically realistic. As far as I know, everyone's willing to grant that the child has some (at least rudimentary) knowledge of grammatical categories before the child starts positing syntactic structure.  This is the kind of thing we might get from a child using frequent frames, for instance.

The ADIOS algorithm: I admit, I found this description very difficult to decipher without accompanying examples.  Is it a batch algorithm or not (the graph appears to be "rewired" every time a new pattern is detected)?  What's an example of a bundle?  What's a local flow quantity that would act as a context-sensitive probabilistic criterion for a significant bundle?  How exactly does that work?  How dissimilar is this whole process from frequent frames, which also induce equivalence classes?  What are the basic abilities/knowledge required to make this algorithm work - the ability to create a graph, to identify bundles, to allow recursion of abstract patterns?

The ConText algorithm:  This was a little better, because they provided a simple example.  But again, I found myself wanting more explicit definitions for the different model components in order to understand how reasonable (or not) a model this was psychologically.  For example, there's a local context window of 2, which means in a sentence like "I really like cute penguins", we would get a context vector for "like" where the lefthand context is "I really" and the righthand context is "cute penguins".  Okay, great (though I worry about a window of 2 on each side in terms of data sparseness).  And in order to construct equivalence classes based on this, the algorithm operates in batch mode over the data.  Again, okay.  But then, some kind of distance measure is posited to compare different context vectors to each other involving the angle between context vectors - how is this instantiated?  What does the angle between "I really" and "But I" look like, for example? Presumably these are mapped into real numbers somehow...  On a related note, once the algorithm gets clusters based on these context vectors, it then seems to do something with rewriting sequences - but what are sequences?  Are these the utterances themselves, the partially abstracted representations the learner is forming, something else?
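One standard way the "angle between context vectors" could be instantiated (my guess, since the paper doesn't spell it out) is to map each context to a bag-of-words count vector over the vocabulary and take the cosine of the angle between those vectors:

```python
# A sketch of one standard instantiation of "angle between context
# vectors" (the paper may do something different): map each context to
# a bag-of-words count vector, compute cosine similarity, and recover
# the angle via arccos.
import math
from collections import Counter

def cosine(ctx1, ctx2):
    v1, v2 = Counter(ctx1.split()), Counter(ctx2.split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2)

# The example contexts from above share one word ("I") out of two each.
sim = cosine("I really", "But I")
print(f"cosine = {sim:.2f}, angle = {math.degrees(math.acos(sim)):.1f} deg")
```

So "I really" and "But I" would come out at 60 degrees under this instantiation - but note this toy version ignores word order within the context window, which a two-word window presumably shouldn't.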

p.681: ConText results - I thought it was interesting that the ConText model ends up with subcategorization (for example, eat and drink being in the same class).  This again reminds me of frequent frame results, and made me want an explicit compare-and-contrast.

p.683: Human judgments of acceptability of new sentences created by ConText learner - I thought it was a bit strange to ask the participants to judge the acceptability based on how likely it was to appear in child-directed speech.  Would the participants have a good sense of child-directed speech?  My experience with undergrads who parse utterances from child-directed speech is that they're utterly surprised by how "ungrammatical" and semi-nonsensical conversational speech (and especially child-directed speech) is.

Variation sets: This is something of real value to computational models, I think.  We have empirical evidence that children especially benefit from these particular data units, and we have a reasonable idea of how to automatically identify them, so we could reasonably expect a model to be extra sensitive to these kinds of data (perhaps give these data more weight).  There's an interesting comment on p.688 that variation sets with roughly 50% of the material changing are the most helpful to children.  My big question was why - what's so special about 50%?  Does this represent some optimal tradeoff in terms of recognition and contrast?  There's another interesting note on p.689 and in Table 2 on p.695, where they looked at how predictive the frequent n-grams in variation sets were of part-of-speech - some of them are pretty predictive, which is nice, and this shows that sometimes n-grams are useful, as opposed to needing framing elements (this was something a paper by Chemla et al. 2009 looked at).  I do wonder how this predictive quality would hold up cross-linguistically, though - what about languages where the wh-word doesn't move, or languages without auxiliary "do"?

Incremental learning (p.698): There's some discussion at the very end about how to transform ConText into an incremental learner, which I think is a good thing to think about.  However, I wonder about the motivation behind inserting the gaps automatically (i.e., "a furry marmot" presumably gets additional "frames" of "___ furry marmot", "a ___ marmot", and "a furry ___").  Is the idea that this will jumpstart the abstraction process, which otherwise would have to wait until it saw another instance that used two of those words?  (Or in the case of a context window of 2 on each side, 4 of the words?)
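The gap-insertion idea can be sketched very simply (my own toy implementation of the description above, not the paper's code): each n-gram contributes n extra "frames", one per word position replaced by a gap.

```python
# Toy sketch of automatic gap insertion: each n-gram yields n frames,
# one per word position replaced by a gap marker.

def gap_frames(ngram):
    words = ngram.split()
    return [" ".join(words[:i] + ["___"] + words[i + 1:])
            for i in range(len(words))]

print(gap_frames("a furry marmot"))
# → ['___ furry marmot', 'a ___ marmot', 'a furry ___']
```

If the point is to jumpstart abstraction, this triples the number of stored frames per trigram before any second instance has been observed - which seems like exactly the kind of storage cost the rest of the paper would want to justify.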


Chemla, E., Mintz, T., Bernal, S., & Christophe, A. (2009). Categorizing Words Using "Frequent Frames": What Cross-Linguistic Analyses Reveal About Distributional Acquisition Strategies. Developmental Science.