Tuesday, December 7, 2010

Winter 2011 Reading List Posted

After receiving some input on the kinds of topics people are interested in reading about, I've put together the reading schedule for winter 2011, available at the schedule page of the website. We'll be focusing on some fundamental principles behind models of language development, as well as the connections among language acquisition, language change, and language evolution.

Still to be done: setting an exact time for our every-other-week meeting. Please email me with the times during the week you're available for a 1-1.5 hour meeting. I'm leaning towards sometime on Wednesdays if possible. I realize that schedules are still a little in flux, but let me know when you get a chance.

Monday, November 29, 2010

Thoughts on Parisien & Stevenson (2010)

So I very much like the fact that they're combining aspects of previous models (one of which we looked at last time: Perfors, Tenenbaum, & Wonnacott (2010)) into a model that tackles a more realistic learning problem: not only grouping verbs into classes based on their usage in various constructions, but also identifying the relevant constructions themselves from among many different verbs and construction types. Some more targeted thoughts:

  • I think this is the first time I've seen "competency model" used this way. It seems to be basically the same as what we've been calling computational-level models, since this model is interested in whether the statistical information is present in the environment, rather than in how humans could actually extract that information.

  • Practical note: I didn't realize a MATLAB package (NPBayes) was available that does this kind of Bayesian inference, courtesy of Teh at UCL. This seems like a very nice option if WinBUGS isn't your thing.

  • Figure 3, generalization of novel dative verbs: The difference between the model with verb classes (Model 2) and the model without verb classes (Model 1) doesn't seem so great to me. While it's true there's a small change in the right direction for PD-only and DO-only verbs, it's unclear to me that this is really a huge advantage. Implication: knowing about verb classes isn't a big advantage at this stage of acquisition? (Which seems not quite right, given what we saw with Perfors, Tenenbaum, & Wonnacott (2010), where having those classes was a key feature of the good model.)

  • The comparison of the model's behavior to three-year-old generalization behavior, and why it's not the same: While it's entirely possible that they're right and the difference is due to the model learning from too small (and biased) a corpus, isn't the idea to try to use the data children have access to in order to make the kind of judgments children do? So whether the sample is biased or not compared to normal (adult?) corpora like the ICE-GB, the point is that the Manchester corpus is fairly large (1.5 million words of child-directed speech) and presumed to be a reasonable sample of the kind of data children hear - shouldn't the model generalize from these data the way children do if it's a competency model? I suppose it's possible that the Manchester corpus is particularly biased for some reason, but this corpus is made up of data from multiple children, so it would be somewhat surprising if the data sets all happened to be biased the same way by chance.



Wednesday, November 17, 2010

Next time: Parisien & Stevenson (2010)

Thanks to all who were able to join us for our discussion of Perfors, Tenenbaum, & Wonnacott (2010)! I think we definitely had some good ideas about how to incorporate semantics into the existing learning model in a more sophisticated way. Next time, we'll be looking at another Bayesian take on syntax learning, courtesy of Parisien & Stevenson (2010) (downloadable from the schedule section of the CoLa Reading Group webpage).

Monday, November 15, 2010

Thoughts on Perfors, Tenenbaum, & Wonnacott (2010)

I think the main innovation of this model (and this style of modeling in general) is to provide a formal representation of how multiple levels of knowledge can be learned simultaneously in a mathematical framework. This is also (unsurprisingly) something that appeals to me very much. Interestingly, this kind of thing seems to be very similar to the idea of linguistic parameters that nativists talk about. A linguistic parameter would most naturally correspond to hyper-parameters (or hyper-hyper-parameters, etc.), with individual linguistic items providing a way to identify the linguistic parameter value - which then allows generalization to items rarely or not yet seen.

The interesting claim (to me) that the authors make here, of course, is that these hyper-parameters have extremely weak priors and so (perhaps) domain-specific knowledge is not required to set their values. I think this may still leave open the question of how the learner knows what the parameters are, however. In this model, the parameters are domain-general things, but linguistic parameters are often assumed to be domain-specific (ex: head-directionality, subject-drop, etc.). Perhaps the claim would be then that linguistic parameters can be re-imagined as these domain-general parameters, and all the details the domain-specific parameters were originally created to explain would fall out from some interplay between the new domain-general parameters and the data.
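To make the hyper-parameter idea a bit more concrete, here's a minimal sketch of how knowledge at the hyper level lets a learner generalize about an item it has barely seen. This is a toy beta-binomial with a crude empirical-Bayes step, nothing like the paper's actual multi-level inference, and the verbs and counts are invented.

```python
import numpy as np

# Hypothetical counts of double-object (DO) vs. prepositional-dative (PD)
# uses for a few verbs; "rare_verb" has been seen only once.
counts = {
    "give": (40, 10),    # (DO uses, PD uses)
    "show": (35, 15),
    "tell": (30, 10),
    "rare_verb": (1, 0),
}

# Item level: each verb has its own theta = P(DO | verb).
# Hyper level: every theta is assumed to come from a shared Beta(a, b);
# a and b are the hyper-parameters. A crude empirical-Bayes step estimates
# them from the well-attested verbs by the method of moments.
thetas = [do / (do + pd) for v, (do, pd) in counts.items() if v != "rare_verb"]
mu, var = np.mean(thetas), np.var(thetas)
strength = mu * (1 - mu) / var - 1
a, b = mu * strength, (1 - mu) * strength

# The single observation of "rare_verb" is pulled toward the class-level
# expectation encoded in the hyper-parameters (generalization "from above").
do, pd = counts["rare_verb"]
print((a + do) / (a + b + do + pd))   # close to the class mean, ~0.75
```

The single observation of the sparse verb gets interpreted against what the other verbs have already established, which is the "generalization to items rarely or not yet seen" I mentioned above.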

Some more targeted thoughts:

p.2, dative alternation restrictions: I admit I have great curiosity about how easy it is to saturate on these kinds of restrictions. If they're easily mutable, then they're not necessarily the kind of phenomena that linguists often posit linguistic principles and parameters for. Instead, the alternations we see would be more an accident of usage than a reflection of deep underlying restrictions on language structure. This idea of "accident of usage" comes up again on p.5, where they mention that the distinctions in usage don't seem to be semantically driven (no completely reliable semantic cues).

p.12, footnote 4: This footnote mentions that the model doesn't have memory limitations the way humans do, which leads to my usual question when dealing with rational models: how do we convert this into a process model? Is it straightforward to add memory limitations and other processing effects? And once you have this, do the results found with the computational-level model still hold? This gets at the difference between "is it possible to do" (yes, apparently) and "is it possible for humans to do" (unclear).

p.15, related thought to the above one, involving the different epochs of training: If this process of re-running the model's inference engine after every so many data points were taken to its extreme, then it seems we could create an incremental version that runs inference after every data point encountered. This would be a first step towards creating a process model, I think.
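As a trivial illustration of the batch-versus-incremental contrast, here's a toy Beta-Bernoulli example of my own (not the paper's model). For conjugate models like this one the two are exactly equivalent, so the real work in building a process model is figuring out where they stop being equivalent once memory limitations or approximate inference enter the picture.

```python
# Toy contrast between "epoch" inference (re-estimate over all data seen so
# far) and fully incremental inference (update after every data point).
data = [1, 0, 1, 1, 0, 1]   # hypothetical binary observations
a0, b0 = 1.0, 1.0           # prior pseudo-counts for a Beta prior

def batch_posterior(observed, a=a0, b=b0):
    """Re-estimate from scratch over everything seen so far."""
    return a + sum(observed), b + len(observed) - sum(observed)

# Incremental: carry the posterior forward and fold in one point at a time.
a, b = a0, b0
for i, x in enumerate(data, start=1):
    a, b = a + x, b + (1 - x)
    assert (a, b) == batch_posterior(data[:i])   # identical for this model
    print(f"after {i} points: posterior mean = {a / (a + b):.3f}")
```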

p.23, related to the longer comment at the top of this post, with respect to the semantic features: This seems like a place where nativists might want to claim an innate bias to attend to certain kinds of semantic features over others. (And then they can think about whether the necessary bias is domain-specific or domain-general.)

Wednesday, November 3, 2010

Next time: Perfors, Tenenbaum, & Wonnacott (2010)

Thanks to everyone who was able to join us this time to discuss Bod (2009)! Next time on November 17, we'll be looking at Perfors, Tenenbaum, & Wonnacott (2010) (available at the CoLa reading group schedule page), who apply hierarchical Bayesian models to learning syntactic alternations.

Monday, November 1, 2010

Thoughts on Bod (2009)

I'm very fond of the general idea underlying this approach, where the structure of sentences is used explicitly to generalize appropriately during acquisition. The way that Bod's learner has access to all possible tree structures reminds me very much of work by Janet Fodor and William Sakas, who have papers from the early 2000s about their Structural Triggers Learner, which also has access to all possible tree structures. I think the interesting addition in Bod's work is that tree structures can be discontiguous in ways that don't necessarily have to do with dependencies (e.g., Fig 13, p.768, with the discontiguous subtree that involves the subject and part of the object). That being said, I don't know how reasonable or plausible it is for a child to keep track of these kinds of strange discontinuities. I also don't know how plausible it is to track statistics on all possible subtrees. I know Bod offers some techniques for making this tractable, but it seems trickier than the Bayesian hypothesis spaces, because those hypothesis spaces of structures are very large but importantly implicit: the learner doesn't actually deal with all the items in the hypothesis space. Bod's learner, on the other hand, seems to need to track all those possible subtrees explicitly.

More specific comments:

  • I admit my feathers got a little ruffled in section 6 with the poverty of the stimulus discussion. On p.777, Bod cites Crain (1991), who claimed (at the time; it's been 20 years since then) that complex yes/no questions were the "parade case of an innate constraint". Bod then goes on to show that the U-DOP learner can learn complex yes/no questions. This is all well and good, because the "innate constraint" the nativists claimed was needed to learn this is precisely what Bod's U-DOP learner uses: structure-dependence. So it would actually be really bad (and strange) if the U-DOP learner, with all its knowledge of language structure, couldn't learn how to form complex yes/no questions properly. It seems to me that what Bod has shown is a method that uses structure-dependence in order to learn complex yes/no questions from the input. Since his learner assumes the knowledge the nativists say children need to assume, I don't think he can claim that he's shown anything that should change nativists' views on the problem.

  • It seems like this learner is actually tackling a harder problem than necessary, since children likely have some grammatical category knowledge (even if they don't have it for all words yet). Given this, children may also be able to use simple probability information between grammatical categories to form initial groupings (constituents). So by allowing any fragment of a sentence to form a productive unit (e.g., "the book", a constituent, vs. "in the", a non-constituent), the U-DOP learner is considering a wider hypothesis space of possible tree structures than it may need to.

  • I found it interesting that binarity plays such an integral role for this learner. That property seems reminiscent of the binary-branching structures created by "Merge" (see the Wikipedia entry on Merge) in current generative linguistics.

  • It also seems like the overall process behind U-DOP is a formalization of the chunking or "algebraic learning" process that gets talked about a lot in the learning literature; in this case, it's chunking over tree structures. This struck me particularly in section 5.2, on p.774, with the "blow it up" example.

  • Smaller note: Why does U-DOP do so poorly on Chinese, compared to the German and English data in section 4? It makes me wonder if there's something language-specific about approaching the learning problem this way, or perhaps something language-specific about using this particular structural representation.

Wednesday, October 20, 2010

Next time: Bod (2009)

Thanks to everyone who was able to join us this time to discuss Yang (2010)! We had quite a rousing discussion. Next time on November 3, we'll be looking at Bod (2009) (available at the CoLa reading group schedule page), who has a differing viewpoint on models of syntactic acquisition.

Monday, October 18, 2010

Yang (2010): Some thoughts

One thing I always like about Yang's work: whether or not you agree with what he says, it's always very clear what his position is and what evidence he considers relevant to the question at hand. Because of this, his papers (for me) are very enjoyable reads.

One thing that stood out to me in this paper was his stance on computational-level vs. algorithmic-level models of syntactic acquisition. Right up front, he establishes his view that algorithmic-level models are the ones that make the greatest contribution (and this line of discussion seems to continue in section 4, where he seems dismissive of some existing computational-level models). I do have great sympathy for wanting to create algorithmic-level models, but I still believe computational-level models have something to offer. The basic idea for me is this: if you have an ideal learner that can't learn the required knowledge from the available data, that seems like a great starting point for a poverty of the stimulus claim. (It may turn out that some algorithmic-level model doesn't have the same issue, but then you know the "magic" is in the specific process that algorithmic-level model uses. And maybe that "magic" corresponds to some prior knowledge or innate bias in the learning procedure, etc. At any rate, the ideal learner model has contributed something.)

I also found Yang's discussion of the PAC learnability framework in section 3 enlightening. A couple of comments stood out to me:

  • p.6: The comment about how to make an infinite language finite by ignoring sufficiently long sentences (for example, sentences that contain lots of recursion). Yang notes that few language scientists would find the notion of a finite grammar appealing. On the other hand, I feel like we could have some sympathy for people who believe that arbitrarily long sentences are not really part of the language. Yes, they're part of the language by definition (of what recursion is, for example), but they seem not to be part of the language if we define language as something like "the strings that people could utter in order to communicate". I think Yang's larger point remains, though: the set of strings generated by the grammar of any language is infinite.
  • In that same paragraph, Yang seems dismissive of the hypothesis space of probabilistic context-free grammars (PCFGs) being realistic in current model implementations, specifically because the "prior probabilities of these grammars must be assumed". While it may be the case that some models take this approach, I don't think it's necessarily true. If you already have a PCFG, couldn't the prior for the grammar be derived from some combination of the rules' probabilities? (I feel like Hsu & Chater (2010) do something like this with their MDL framework, where the prior is effectively the encoding of the grammar. A sketch of what I mean appears below.)
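To spell out that last thought (this is my own framing, not Yang's and not necessarily Hsu & Chater's exact formulation), tying the prior over grammars to their encoding means nothing has to be stipulated per grammar beyond the encoding scheme itself:

$$ P(G) \;\propto\; 2^{-L(G)}, \qquad P(D \mid G) \;=\; \prod_{s \in D}\ \sum_{t\,:\,\mathrm{yield}(t)=s}\ \prod_{r \in t} \theta_r $$

Here $L(G)$ is the code length of the grammar's rules and their probabilities $\theta_r$, and the likelihood of each sentence $s$ in the data $D$ sums over the derivations $t$ that yield it. On this view, the prior isn't an arbitrary assumption attached to each grammar; it falls out of how compactly the grammar can be written down.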


Wednesday, October 6, 2010

Next time: Yang (2010)

Thanks to everyone who was able to come to our discussion this week! Next time we'll be reading a review of computational models of syntax acquisition by Charles Yang, available for download at the CoLa Reading group's schedule page.

Monday, October 4, 2010

Hsu & Chater (2010): Some Thoughts

So this was a bit longer of an article than the ones we've been reading, probably because it was trying to establish a framework for answering questions about language acquisition rather than tackling a single problem/phenomenon. I definitely appreciated how much effort the authors put in at the beginning to motivate the particular framework they advocate, including the extensive energy-saving appliance analogy. It's certainly true that applying the framework to a range of different phenomena shows its utility.

More targeted thoughts:
  • The authors do go out of their way to highlight that this is a framework for learnability (rather than the "acquirability" I'm fond of), since it assumes an ideal learner. They often mention that it represents an "upper bound on learnability", and note that they provide a "predicted order for the acquisition by children". I think it's important to remember that this upper bound and predicted order only apply if children view the problem the particular way the authors frame it. Looking through the supplemental material and the specifics of the phenomena they examine, sometimes very specific knowledge (or kinds of knowledge) is assumed (for example, knowledge of traces, "it-punctuation", and "non-trivial" parenthood, which may or may not be equivalent to the linguistic notion of c-command). In addition, I think they have to assume that no other information is available for these phenomena in order for the "predicted order for the acquisition by children" to hold. I have the same reservations about the data they use to evaluate their ideal MDL learners: some of it quite clearly isn't child-directed speech, apparently because the relevant frequencies in child-directed speech are too low for their MDL learner to function properly. But doesn't that say something about the data real children are exposed to, and about how there may not be sufficient data for various phenomena in child-directed speech? Relatedly, maybe the mismatch the authors see between their model's predictions and actual child behavior in figure 5 has to do with the fact that they didn't train their model on child-directed speech data.
  • On a related note, I really wonder if there's some way to translate the MDL framework into something more realistic for children (cognitively plausible, etc.). The intuition behind the framework is appealingly simple: you want a balance between the simplicity of the grammar and its coverage of the data (see the sketch of the tradeoff after this list). The Bayesian framework is a specific form of this balancing act, and can be adapted fairly easily into an online kind of process. Can the MDL framework? What would code length correspond to - efficient processing and representation of data? The authors do try to point out where the MDL evaluation would come in for learning, saying that it is the decision to add or not add a rule to the existing grammar (so the comparison is between two competing grammars that differ by one rule).
  • I also really want to be able to map some of the MDL-specific notions onto something psychological. (Though maybe this isn't possible.) For example, what is C[prob] (the constant for encoding probabilities)? Is it some psychological processing or evaluation cost? In some cases, the fact that a particular form is required in order to learn a restriction reminds me strongly of the notion of "unambiguous data" that's been around in generative grammar acquisition for a while.
  • Specific phenomena: I was surprised by some of the results the authors found. For example, "what is" (Table 5): the frequency in child-directed speech is over 12,000 occurrences per year, but in adult-directed speech it's less than 300 per year? That seems odd. The same happens for "who is" (Table 6). Turning to the dative alternation examples, "donate" apparently appears far more often per year (15 times) than "shout" (less than 4), "whisper" (less than 2), or "suggest" (less than 1). That also seems odd to me. And for Table 16, on the transitive/intransitive examples, how does an encoding "savings" of 0.0 bits lead to any kind of learning under this framework? Maybe this is a rounding error?
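For reference, and in my notation rather than necessarily the paper's exact formulation, the quantity the framework trades off is just total description length:

$$ G^{*} \;=\; \arg\min_{G}\ \big[\, L(G) \;+\; L(D \mid G) \,\big] $$

where $L(G)$ is the number of bits needed to encode the grammar itself and $L(D \mid G)$ is the number of bits needed to encode the observed data given that grammar. Adding a rule always increases $L(G)$ but can decrease $L(D \mid G)$, and the rule is adopted only if the sum goes down, which is exactly the two-grammar comparison the authors describe. My guess is that a constant like C[prob] lives inside $L(G)$, as the fixed cost of writing down a rule's probability, but that's my reading rather than something the authors state in those terms.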

Tuesday, September 21, 2010

Schedule and readings posted for Fall quarter

The CoLa reading group will meet every other Wednesday from 1:15pm to 2:30pm in SBSG 2341. The first meeting of the fall quarter will be October 6th, and we'll be discussing Hsu & Chater (2010). The readings and schedule are accessible via our lovely website. The focus this quarter is syntax acquisition, so all our readings will deal with this in some way. See you all on October 6th!

Tuesday, September 14, 2010

Jones, Johnson, & Frank (2010): Some thoughts

I'm definitely in favor of using information from multiple sources for acquisition, so I was intrigued when I saw that word-meaning mapping information was being used to constrain word segmentations. I found the model description very comprehensible. :) A couple of things:


  • p.504, "...most other words don't refer to the topic object...corresponding to a much larger [alpha]0 value" - While this seemed fine at first glance, I started thinking about the nature of the child-directed utterances. Take the example in figure 1: "Is that the pig?" The "Is that the" part would be classified as non-referential by this model, but I could see these being commonly reused words (and indeed, a commonly reused frame). The same goes for function words in general, like "the" and "is". I wonder what would happen if they allowed alpha0 to be smaller, so that there's more reuse among the non-referential words (see the CRP sketch after this list for why smaller alpha0 means more reuse). Part of the reason this integrated model seems to do better is that it has pressure at both the segmentation level and the word-meaning mapping level to posit fewer lexical items. Wouldn't forcing more reuse of the non-referential words help even more?

  • On a related note, it seems like the point of using the word-meaning mapping info (and having pressure there to make fewer items) is to correct undersegmentation that occurs (see the "book" example on 508). So maybe if there's too much pressure to make fewer lexical items (say, from forcing more reuse in non-referential words), there's a lot of undersegmentation? I'm not sure if that would follow for the "book" example they give, though. Let's suppose you have the following segmentation choices:

    • abook, yourbook, thebook
    • a book, your book, the book


    If you have more pressure to reuse non-referential words, then wouldn't you be even more likely to prefer the second segmentation over the first?

    Also, we know of other ways to fix undersegmentation - ex: using a bigram ("context") model for segmentation, instead of using a unigram model. If the model used the bigram assumption, would the word-meaning mapping information still improve segmentation?

  • A smaller nitpick question: I'm not quite sure I understand how accuracy in Table 2 (p.506) can be better for the Joint model when all the other measures are either equal to or worse than those of the Gold Seg model. Am I missing something in the discussion of what accuracy is (or maybe what the other measures are getting at)?
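Since a couple of these points turn on how the alpha0 concentration parameter trades reuse of existing lexical items against creation of new ones, here's a toy Chinese Restaurant Process simulation. This is my own sketch, not anything from the paper's implementation, and the token count and alpha values are made up; the point is just that the number of distinct items grows steeply with alpha, so a smaller alpha0 really does force more reuse.

```python
import random

def crp_num_types(n_tokens, alpha, seed=0):
    """Number of distinct lexical items a Chinese Restaurant Process creates
    when generating n_tokens word tokens with concentration parameter alpha."""
    random.seed(seed)
    counts = []                      # counts[k] = tokens assigned to item k
    for i in range(n_tokens):        # i = tokens generated so far
        if random.random() < alpha / (i + alpha):
            counts.append(1)         # create a brand-new lexical item
        else:                        # reuse an existing item, proportional to its count
            r = random.uniform(0, i)
            running = 0.0
            for k in range(len(counts)):
                running += counts[k]
                if r < running:
                    counts[k] += 1
                    break
    return len(counts)

for alpha in (1, 10, 100, 1000):
    print(alpha, crp_num_types(5000, alpha))
```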

Thursday, September 2, 2010

Sept 2: Recap & Planning for next time

Thanks to all who were able to join us today! Our discussion about the psychological reality of syllables vs. phonemes for infant word segmentation continued from last time, and we could easily see how to fit this into the adaptor grammar framework. For next time (Sept 16), we'll be looking at Jones, Johnson, & Frank (2010), which can be found on the schedule page. This one talks about the integration of word meaning information into the word segmentation problem.

Tuesday, August 31, 2010

Johnson & Goldwater (2009): Some comments

So I admit I found this paper a bit tougher going than the previous one, most likely due to how much information they had to fit into a limited space. Anyway, once I wrapped my head around what it meant for something to be "adapted", things started to make more sense.

Some more targeted thoughts:

(1) Given our discussion last time about the syllable as a likely basic unit of representation (based on neurological evidence), we had talked about implementing learning models that take the syllable as the basic unit. How similar is the collocation-syllable adaptor grammar Johnson & Goldwater use here to that idea? Clearly, the syllable is one unit of representation that matters in this model, but they also go below the syllable level to include properties of syllables that correspond (roughly) to phonotactic constraints on syllable-hood. Does this mean a learner would have to be able to analyze individual phonemes in order to use this model? If so, what happens if we get rid of any representation below the syllable level? Is there any place for phonotactic constraints then?

(2) I'd like to look more closely at Table 1 to try to understand what benefits the learner. Because there are so many conditions, it's a bit hard to pick apart the impact of any one condition. For example, J&G argue that table label resampling improves performance for models with rich hierarchical structure (like the collocation-syllable model), and point to Figure 1 to show this. But looking at the 3rd and 4th entries from the bottom of Table 1, it seems like performance actually worsens with table label resampling.

(3) The idea of maximum marginal decoding is interesting to me, because it reminds me of the difference between "weakly equivalent" grammars and "strongly equivalent" grammars (weakly equivalent = the output is the same, even if the internal structure isn't; strongly equivalent = the output and the internal structure are both the same). It seems here that aggregating over "weakly equivalent" word segmentations leads to better performance. A toy illustration of that aggregation idea is below.
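Here's that toy illustration (my own made-up analyses and probabilities, not anything from the paper): several sampled analyses that differ in their internal collocation structure can still share a surface segmentation, and summing probability over them can change which segmentation wins.

```python
from collections import defaultdict

# Each entry: (internal collocation structure, surface segmentation, probability).
analyses = [
    (("the dog", "barks"), ("the", "dog", "barks"), 0.30),
    (("the", "dog barks"), ("the", "dog", "barks"), 0.25),
    (("thedog", "barks"),  ("thedog", "barks"),     0.35),
]

# Single best analysis (internal structure and all): the undersegmented one wins.
best_analysis = max(analyses, key=lambda a: a[2])

# Marginalize over internal structure: sum probability for analyses that agree
# on the segmentation, then pick the best segmentation.
marginal = defaultdict(float)
for _, seg, p in analyses:
    marginal[seg] += p
best_segmentation = max(marginal, key=marginal.get)

print(best_analysis[1])    # ('thedog', 'barks')
print(best_segmentation)   # ('the', 'dog', 'barks')
```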

Thursday, August 19, 2010

First Meeting

The first meeting of the CoLa Reading group was quite fun - thanks to all who could make it! Next time (Sept 2), we'll be looking at the word segmentation paper of Johnson & Goldwater (2009), which can be downloaded from the CoLa Reading Group website schedule page.

Tuesday, August 17, 2010

Blanchard, Heinz, & Golinkoff (2010): Some Comments

As many of you know, I'm very sympathetic to this style of modeling, where there's an attempt to use learning algorithms that seem like they might be cognitively plausible. So, the short version of my thoughts on this: yay for algorithmic-level modeling (specifically, the algorithmic level of representation of Marr (1982)) that gets some very promising-looking results.

More specific things that occurred to me as I was reading:

  • p.2-3: The authors mention how they're not going to be tackling the segmentation of auditory linguistic stimuli (not unreasonable), but that "any word segmentation model could easily be plugged into a system that recognizes phonemes from speech". It's not so clear to me that the phoneme level of representation is right for modeling initial word segmentation, though it's a reasonable first step. Specifically, given what we know of the time course of acquisition, it seems like native language phoneme identification isn't fully online till about 10-12 months - but initial word segmentation is likely happening around 6 months. Given this, it seems more likely that infants may be working with a representation that's more abstract than the raw auditory signal, but less settled than the adult phonemic representation. For example, perhaps allophones might be perceived as separate sounds by the infant at this point in development. Anyway, this isn't a critique of this model in particular - most word seg models I've seen work with phonemes - but it'd be very interesting to see how any of the prominent word seg models would perform on input that's messier than the phonemic representation commonly used.

  • p.9, p.14, p.18: The authors emphasize that their target unit of extraction is the phonological word (and their exposition of different definitions of "word" was quite nice, I thought). Unfortunately, they have the problem of only having orthographic word corpora available. I wonder how hard it would be to convert the existing corpus into a phonological word corpus - they say it's a hard and time-consuming process, but perhaps there are some rewrite rules that could do a reasonable approximation? Or maybe it would be useful to note how many "mis-segmentations" of any model are actually viable phonological word segmentations.

  • Looking at figure 1 on p.10, and the exposition about the model: I wonder how the model actually chooses the most probable segmentation from all possible segmentations of an utterance. Initially, this is probably very easy because there's nothing in the lexicon. But once the lexicon is populated, it seems like there could be a lot of possibilities to choose from. Maybe some kind of heuristic choice? This part of learning is what the dynamic programming algorithms do in the Bayesian models of Pearl, Goldwater, & Steyvers (2010); a rough sketch of that kind of dynamic program appears after this list.

  • p.11 - the second phonotactic constraint: It's probably worth noting that requiring all words to have a syllabic sound means the learner must know beforehand (or somehow be able to derive) what a syllabic sound is. This seems like domain-specific knowledge (i.e., "all sounds with these properties are syllabic", etc.) - is there any way it wouldn't be? Supposing this is definitely domain-specific (though language-universal) knowledge, how plausible is it that humans have innate knowledge of the necessary properties for syllable-hood? I know there's some evidence that the syllable is a basic unit of infant perception, so this could be very reasonable after all.

  • p.17 - testing the "require syllabic" constraint on its own: The authors explain that the reason a learner with only this constraint fails is because longer words receive the same probability as shorter words. Maybe a slightly more informed version of this learner could assign each phoneme a small constant probability (rather than all unfamiliar words getting the same probability) - it seems like this would allow word length effects to emerge and could lead to better segmentation (the sketch after this list folds this idea in). Of course, such a learner might then prefer CV or V words (short, but still syllabic), which would lead to major oversegmentation. Still, I wonder how bad it would be, since so many English child-directed speech words are monosyllabic anyway.
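Since a couple of these bullets are about the same machinery, here's a rough sketch that covers both at once. This is not Blanchard et al.'s actual algorithm; the lexicon, the counts, and the UNK_PER_PHONE value are all invented. It's a Viterbi-style dynamic program that finds the single most probable segmentation without enumerating them all, and it scores unfamiliar words with a small per-phoneme probability so that length effects emerge.

```python
import math

# Letters stand in for phonemes here. Known words are scored by their relative
# frequency in a toy lexicon; unfamiliar words get UNK_PER_PHONE ** length, so
# longer novel words are dispreferred.
lexicon = {"the": 10, "dog": 4, "doggy": 2, "is": 8, "big": 3}
total = sum(lexicon.values())
UNK_PER_PHONE = 0.05

def word_score(w):
    if w in lexicon:
        return math.log(lexicon[w] / total)
    return len(w) * math.log(UNK_PER_PHONE)

def segment(u):
    # best[i] = (log prob, segmentation) of the best analysis of u[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(u)
    for i in range(1, len(u) + 1):
        for j in range(i):
            cand = best[j][0] + word_score(u[j:i])
            if cand > best[i][0]:
                best[i] = (cand, best[j][1] + [u[j:i]])
    return best[len(u)][1]

print(segment("thedoggyisbig"))   # -> ['the', 'doggy', 'is', 'big']
```

A real learner would also need to update the lexicon incrementally as utterances come in, but the point is that the double loop is only quadratic in the number of boundary positions, so "considering all possible segmentations" never has to be done explicitly.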



Wednesday, August 4, 2010

Schedule for rest of summer 2010

So, it looks like one of the best times to meet this summer would be Thursdays at noon. We'll have our first meeting Aug 19th, in SBSG 2200, starting with Blanchard et al. (2010). Hope to see you there! (And even if not, to see your comments here!)

Monday, July 26, 2010

First readings posted + availability check

So, the first few readings the CoLa reading group is going to look at are posted on the schedule section of the webpage. I thought we might start with some recent papers on word segmentation & word meaning identification. Also, feel free to post suggestions of topics and/or papers you're interested in here, or email them to me.

Schedule check time: What days & times are definitely not good for the third week of August (Aug 16 - Aug 20)? I'm hoping we'll be able to pick a convenient day & time for the first three sessions before fall quarter starts. After that, we'll do a schedule check again.

Thursday, June 3, 2010

Availability over the summer?

So....who's going to be around this August? I'm hoping we'll be able to find at least one day we can have a first meeting. I've already found a couple of papers I definitely would like to take a closer look at. ;)

Thursday, May 20, 2010

It lives! (aka Welcome to the reading group)

So the blog is up and running. Hurrah! Go Team CoLa Reading Group!