Monday, January 28, 2013

Next time on 2/25/13 @ 2:15pm in SBSG 2221 = Thiessen & Pavlik (forthcoming)

Thanks to everyone who joined our meeting this week, where we had a very enlightening discussion about some of the ideas in Stabler (2009b)! Next time on Monday February 25 @ 2:15pm in SBSG 2221, we'll be looking at an article that investigates a single computational learning framework (and general distributional learning strategy) for multiple language learning tasks:

Thiessen, E., & Pavlik, P. Forthcoming. iMinerva: A mathematical model of distributional statistical learning. Cognitive Science.

See you then!

Friday, January 25, 2013

Some thoughts on Stabler (2009b)

One of the things I really appreciated about this article was the clear intention to connect the kind of computational models & problems learnability researchers typically worry about with the kind of realistic language acquisition and language use problems that linguistics & psychology researchers typically worry about. A nice example of this was the connection to syntactic bootstrapping, which showed up in some of the later sections. I also found myself thinking a few times about the connection between some of these ideas and the issue of language evolution (more on this below), though I suspect this often comes up whenever language universals are discussed.

More targeted thoughts:

The connection with language evolution: I first thought about this in the introduction, where Stabler talks about the "special restrictions on the range of structural options" and the idea that some of the language universals "may guarantee that the whole class of languages with such properties is 'learnable' in a relevant sense." The basic thought was that if the universals didn't help languages be learned, they probably wouldn't have survived through the generations of language speakers. This could be because those universals take advantage of already-existing cognitive biases humans have for learning, for example.

In section 1, Stabler mentions that it would be useful to care about the universals that apply before more complex abstract notions like "subject" are available. I can see the value of this, but I think most ideas about Universal Grammar (UG) that I'm aware of involve exactly these kinds of abstract concepts/symbols. And this makes a little more sense once we remember that UG is meant to be a set of (innate) language-specific learning biases, which would therefore involve symbols that only exist when we're talking about language. So maybe Stabler's point is more that language universals that apply to less abstract (and more perceptible) symbols are not necessarily based on UG biases. They just happen to be used for language learning (and again, contributed to how languages evolved to take the shape that they do).

I'm very sympathetic to the view Stabler mentions at the end of section 1 which is concerned with how to connect computational description results to human languages, given the idealized/simplified languages for which those results are shown.

I like Stabler's point in section 2 about the utility of learnability results, specifically when talking about how a learner realizes that finite data does not mean that the language itself is finite. This connects very well to what I know about the human brain's tendency towards generalization (especially young human brains).

Later on in section 2, I think Stabler does a nice job of explaining why we should care about results that deal with properties of languages like reversibility (e.g., if it's known that the language has that property, the hypothesis space of possible languages is constrained - coupled with a bias for compact representations, this can really winnow the hypothesis space). My takeaway from that was that these kinds of results can tell us what kind of knowledge is necessary to converge on one answer/representation, which is good. (The downside, of course, is that we can only use this new information if human languages actually have the properties that were explored.) However, it seems like languages might have some of these properties, if we look in the domain of phonotactics. And that makes this feel much more relevant to researchers interested in human language learning.
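To make the winnowing idea concrete for myself, here's a toy Python sketch. It's entirely my own illustration, not from the paper: it loosely reinterprets "reversibility" as closure under string reversal (the paper's notion is more technical than this), and the candidate languages and the compactness metric are made up for the example.

```python
# Toy sketch (mine, not the paper's): hypotheses are finite languages over
# {a, b}. If the learner knows the target language has a certain property
# (here, closure under string reversal), any hypothesis lacking the property
# can be discarded; a compactness bias then picks among the survivors.

def closed_under_reversal(lang):
    """True if the reversal of every string in the language is also in it."""
    return all(w[::-1] in lang for w in lang)

hypotheses = [
    {"ab", "ba", "aa"},        # closed under reversal
    {"ab", "aa"},              # not closed: 'ba' is missing
    {"ab", "ba", "aa", "bb"},  # closed, but less compact
    {"ab"},                    # not closed
]

# Step 1: the known property winnows the hypothesis space.
survivors = [h for h in hypotheses if closed_under_reversal(h)]

# Step 2: a compactness bias (total symbol count) picks one survivor.
best = min(survivors, key=lambda h: sum(len(w) for w in h))
print(sorted(best))  # → ['aa', 'ab', 'ba']
```

The point of the two steps is just that neither alone suffices here: the property filter leaves two candidates, and the compactness bias alone would have picked a hypothesis without the property.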

In section 3, where Stabler is discussing PAC learning, there's some mention of the time taken to converge on a language (i.e., whether the learner is "efficient").  One formal measure of this that's mentioned is polynomial time. I'm wondering how this connects to notions of a reasonable learning period for human language acquisition. (Maybe it doesn't, but it's a first pass attempt to distinguish "wow, totally beyond human capability" from "not".)

I really liked the exploration of the link between syntax and semantics in section 4. One takeaway point for me was evidence in the formal learnability domain for the utility of multiple sources of information (multiple cues). I wonder if there's any analog for solving multiple problems (i.e., learning multiple aspects of language) simultaneously (e.g., identifying individual words and grammatical categories at the same time, etc.). The potential existence of universal links between syntax and semantics again got me thinking about language evolution, too. Basically, if certain links are known, learning both syntax and semantics is much easier, so maybe these links take advantage of existing cognitive biases. That would then be why languages evolved to capitalize on these links, and how languages with these links got transmitted through the generations.

I also liked the discussion of syntactic bootstrapping in section 4, and the sort of "top-down" approach of inferring semantics, instead of always using the compositional bottom-up approach where you know the pieces before you understand the thing they make up. This seems right, given what we know about children's chunking and initial language productions.

Monday, January 14, 2013

Next time on 1/28/13 @ 2:15pm in SBSG 2221 = Stabler 2009b

Thanks to everyone who joined our meeting this week, where we had a very interesting discussion about some of the ideas in Stabler (2009)! Next time on Monday January 28 @ 2:15pm in SBSG 2221, we'll be looking at another article by Stabler. This time, it's one that reviews computational approaches to understanding language universals:

Stabler, E. 2009b. Computational models of language universals: Expressiveness, learnability and consequences. Revised version appears in M. H. Christiansen, C. Collins, and S. Edelman, eds., Language Universals, Oxford: Oxford University Press, 200-223. Note: Because this is a non-final version, please do not cite without permission from Ed Stabler.

See you then!

Friday, January 11, 2013

Some thoughts on Stabler (2009)

One of the things I most enjoyed about this paper was the way Stabler gives the intuitions behind the different approaches - in many cases, these are some of the most lucid descriptions I've seen about these different mathematical techniques. I also really appreciated the discussion about model selection - it certainly seems true to me that model selection is what many theoretical linguists are thinking about when they discuss different knowledge representations. Of course, this isn't to say that parameter setting once you know the model isn't worthy of investigation (I worry a lot about it myself!). But I also think it's easier to use existing mathematical techniques to investigate parameter setting (and model selection, when the models are known), as compared to model generation.

Some more targeted thoughts below:

I really liked the initial discussion of "abstraction from irrelevant factors", which is getting at the idealizations that we (as language science researchers) make. I don't think anyone would argue that it's necessary to do that to get anything done, but the fights break out when we start talking about the specifics of what's irrelevant. A simple example would be frequency - I think some linguists would assume that frequency's not part of the linguistic knowledge that's relevant for talking about linguistic competence, while others would say that frequency is inherently part of that knowledge since linguistic knowledge includes how often various units are used.

I thought Stabler made very good points about the contributions from both the nativist and the empiricist perspectives (basically, constrained hypothesis spaces for the model types but also impressive rational learning abilities) - and he did it in multiple places, highlighting that both sides have very reasonable claims.

The example in the HMM section with the discovery of implicit syllable structure reminded me very much of UG parameter setting. In particular, while it's true that the learner in this example has to discover the particulars of the unobserved syllable structure, there's still knowledge already present (by the nature of the hidden units in the HMM) that there is hidden structure to be discovered (and perhaps even, more specifically, hidden syllabic structure). I guess the real question is how much has to be specified in the hidden structure for the learner to succeed at discovering the correct syllable structure - is it enough to know that there's a level above consonants & vowels? Or do the hidden units need to specify that this hidden structure is about syllables, and then it's just a question of figuring out exactly what about syllables is true for this language?
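Here's a tiny Python sketch of what I mean about pre-given hidden structure. This is my own toy, not the example from the paper: the hidden-state inventory (Onset/Nucleus/Coda) and all the probabilities are stipulated by hand, which is exactly the point - a learner would only be fitting the numbers inside a structure it was already given.

```python
# Minimal HMM sketch: hidden "syllable position" states over observed C/V
# symbols. The state inventory is given a priori; a real learner would only
# estimate the probabilities below (e.g., via EM). All values are hand-set
# and hypothetical.

states = ["Onset", "Nucleus", "Coda"]

start = {"Onset": 0.9, "Nucleus": 0.1, "Coda": 0.0}
trans = {
    "Onset":   {"Onset": 0.1, "Nucleus": 0.9, "Coda": 0.0},
    "Nucleus": {"Onset": 0.4, "Nucleus": 0.0, "Coda": 0.6},
    "Coda":    {"Onset": 0.9, "Nucleus": 0.0, "Coda": 0.1},
}
emit = {
    "Onset":   {"C": 0.95, "V": 0.05},
    "Nucleus": {"C": 0.05, "V": 0.95},
    "Coda":    {"C": 0.9,  "V": 0.1},
}

def forward(seq):
    """Total probability of an observation sequence under the HMM."""
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    for obs in seq[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

# A CV-shaped string scores higher than a consonant-cluster-heavy one.
print(forward(["C", "V", "C", "V"]))
print(forward(["C", "C", "C", "V"]))
```

Even this toy makes the question vivid: the model "knows" there are exactly three hidden positions arranged in a cycle before it sees any data, and the learning problem is only to fill in the numbers.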

I was struck by Stabler's comment about whether it's methodologically appropriate for linguists to seek grammar formalisms that guarantee that human learners can, from any point in the hypothesis space, always reach the global optimum by using some sort of gradient descent. This reminds me very much of the tension between the complexity of language and the sophistication of language learning. First, if language isn't that complex, then the hypothesis space probably can in fact be traversed by some good domain-general learning algorithms. If, however, language is complex, the hypothesis space may not be so cleanly structured. But if children have innate learning biases that guide them through this "bumpy" hypothesis space, effectively restructuring the hypothesis space to become smooth, then this works out. So it wouldn't be so much that the hypothesis space must be smoothly structured on its own, but rather that it can be perceived as smoothly structured, given the right learning biases. (This is the basic linguistic nativist tenet about UG, I think - UG is the set of biases that allows swift traversal of the "bumpy" hypothesis space.)
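A toy illustration of the bumpy-vs-smooth intuition (mine, not Stabler's): the very same gradient descent procedure that reliably finds the global optimum on a smooth objective gets trapped on a bumpy objective with the same global optimum. The particular functions are arbitrary choices for the demo.

```python
# Plain gradient descent on two 1-D objectives that share a global minimum
# at x = 3. The "bumpy" one adds ripples that create local minima; starting
# from x = 0, descent on it stalls far from the global optimum.

import math

def descend(f, x, lr=0.01, steps=5000, eps=1e-6):
    """Gradient descent using a central-difference numerical gradient."""
    for _ in range(steps):
        grad = (f(x + eps) - f(x - eps)) / (2 * eps)
        x -= lr * grad
    return x

def smooth(x):
    # One basin: descent from anywhere reaches the global minimum at x = 3.
    return (x - 3) ** 2

def bumpy(x):
    # Same global minimum, but the sine ripples create traps for descent.
    return (x - 3) ** 2 + 2 * math.sin(5 * x)

print(descend(smooth, 0.0))  # converges to (about) 3
print(descend(bumpy, 0.0))   # stalls at a local minimum far from 3
```

In the analogy, "restructuring the space with the right biases" would be something like replacing the bumpy objective with the smooth one before searching - the target is unchanged, but the path to it becomes traversable.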

I also got to thinking about the idea mentioned in the section on perceptrons about how there are many facts about language that don't seem to naturally be Boolean (and so wouldn't lend themselves well to being learned by a perceptron). In a way, anything can be made into a Boolean - this is the basis of binary decomposition in categorization problems. (If you have 10 categories, you first ask if it's category 1 or not, then category 2 or not, etc.) What you do need is a lot of knowledge about the space of possibilities so you know what yes-or-no questions to ask - and this reminds me of (binary) parameter setting, as it's usually discussed by linguists. The child has a lot of knowledge about the hypothesis space of language, and is making decisions about each parameter (effectively solving a categorization problem for each parameter: is it value a or value b?). So I guess the upshot of my thought stream was that perceptrons could be used to learn language, but at the level of implementing the actual parameter setting.
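Here's a small Python sketch of the binary decomposition idea, using the standard one-vs-rest scheme: a bank of Boolean perceptron classifiers, one per category. Everything here is a made-up toy - the three "categories", the 2-D features, and the data points are hypothetical stand-ins, not anything linguistic.

```python
# One-vs-rest with perceptrons: a multi-way choice is decomposed into one
# Boolean question per category ("is it category k, or not?"), analogous to
# treating each parameter as its own yes/no decision. All data are toy.

def train_perceptron(data, epochs=20):
    """data: list of (feature_vector, label) with label in {+1, -1}."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Three toy "categories" of points in 2-D, chosen to be linearly separable.
points = {0: [(0.0, 0.0), (0.1, 0.2)],
          1: [(1.0, 0.0), (1.2, 0.1)],
          2: [(0.0, 1.0), (0.1, 1.2)]}

# One Boolean classifier per category.
classifiers = {}
for k in points:
    data = [(x, 1 if c == k else -1) for c, xs in points.items() for x in xs]
    classifiers[k] = train_perceptron(data)

def classify(x):
    # Pick the category whose "yes" answer is most confident.
    return max(classifiers, key=lambda k: score(*classifiers[k], x))

print(classify((1.1, 0.05)))  # → 1
```

The prerequisite knowledge shows up clearly: someone had to decide in advance what the categories are and which Boolean questions to train, which is the analog of the child already knowing the inventory of parameters.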

It was very useful to be reminded that the representation of the problem and the initial values for neural networks are crucial for learning success. This of course implies that the correct structure and values for whatever language learning problem is being solved must be known a priori (which is effectively a nativist claim - and if these values are specific to language learning, then a linguistic nativist claim). So, the fight between those who use neural networks to explain language learning behavior and those who hold the classic ideas about what's in UG isn't about whether there are some innate biases, or even whether those biases are language-specific - it may just be about whether the biases are about the learning mechanism (values in neural networks, for example) or about the knowledge representation (traditional UG biases, but also potentially network structure for neural nets).

Alas, the one part where I failed to get the intuition that Stabler offered was in the section on support vector machines.  This is probably due to my own inadequate knowledge of SVMs, but given how marvelous the other sections were with their intuitions, I really found myself struggling with this one.

Stabler notes in the section on model selection that model fit cannot be the only criterion for modeling success, since larger models tend to fit the data (and perhaps overfit the data) better than simpler models. MDL seems like one good attempt to deal with this, since it has a simple encoding length metric that it uses to compare models - encoding not just the data, based on the model, but also the model itself. So, while a larger model may have a more compact data encoding, its larger size counts against it. In this way, you get some of that nice balance between model complexity and data fit.
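A toy MDL calculation, to spell out the tradeoff for myself. This is my own illustration: the model description costs (1 bit vs. 10 bits) are stipulated for the demo, not derived from any real encoding scheme.

```python
# Toy MDL sketch: total description length = bits to describe the model
# + bits to encode the data given the model. A better-fitting model
# compresses the data more, but pays for its own description.

import math

def data_bits(data, probs):
    """Bits needed to encode the data under the model's symbol probabilities."""
    return sum(-math.log2(probs[s]) for s in data)

def mdl(data, probs, model_bits):
    return model_bits + data_bits(data, probs)

sample = "aaaaaaab" * 2           # skewed data: mostly 'a'
simple = {"a": 0.5, "b": 0.5}     # small model; assumed cost: 1 bit
fitted = {"a": 7/8, "b": 1/8}     # closer-fitting model; assumed cost: 10 bits

# With little data, the simple model's cheap description wins.
print(mdl(sample, simple, 1), mdl(sample, fitted, 10))

# With lots of data, the fitted model's compression pays for its size.
print(mdl(sample * 50, simple, 1), mdl(sample * 50, fitted, 10))
```

A nice side effect of the toy: which model wins depends on how much data you have, since the one-time model cost gets amortized - which matches the intuition that bigger grammars should need more evidence to justify them.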

Tuesday, January 8, 2013

Winter meeting time set & Jan 14 = Stabler 2009 @ 2:15pm in SBSG 2221

Based on the responses, it seems like Mondays at 2:15pm will work best for everyone's schedules this quarter. Our complete schedule (with specific dates) can now be seen at

So, let's get kicking!  For our first meeting on Monday January 14 @ 2:15pm in SBSG 2221, we'll be looking at an article that surveys several mathematical approaches to language learning, as well as the assumptions inherent in these various approaches.

Stabler, E. 2009. Mathematics of language learning. Revised version appears in Histoire, Epistemologie, Langage, 31, 1, 127-145. Note: Since this is a non-final version, please do not cite without permission from Ed Stabler.

See you then!

Friday, January 4, 2013

Winter quarter planning

I hope everyone's had a good winter break - and now it's time to gear up for the winter quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading group webpage, including readings on mathematical language learning, statistical learning, and hierarchy in language:

Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here). Thanks and see you soon!