Friday, March 30, 2012
I hope everyone's had a good spring break - and now it's time to gear up for the spring quarter of the reading group! :) The schedule of readings is now posted on the CoLa reading group webpage, following several suggestions of topics of interest to the group.
Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week.
Monday, March 12, 2012
Thanks for a great quarter!
Thanks to everyone who was able to join us for our discussion of O'Donnell et al. (2011)! It was very useful to compare the models discussed to some existing models that we know about, and think about how to connect the representational issues to language acquisition.
For next quarter, let me know if you have any particular articles or topics that you would be interested in discussing - you're welcome to post them here or email them to me at lpearl@uci.edu.
Have a good spring break!
Friday, March 9, 2012
Some Thoughts on O'Donnell et al. (2011)
I like that this paper is interested in big ideas of knowledge representation (basically, how big are the chunks that we store), and provides what seems like a sensible formalization of the idea that medium-size reusable chunks are probably the way to go. Within the same framework, they also provide formalizations of other ideas for the unit of representation (basically, use the smallest units (full-parsing/generative), use the largest units (full-listing), and use all the units (exemplar)), which is nice for easy comparison purposes. While the intuition that medium-size reusable chunks are best is perhaps unsurprising, I think this gives us a clear quantitative argument for that idea. I do wish we had been given some sense of what exactly these medium-size chunks look like for the two different morphology problems though - at first I thought this was due to space limitation, but the tech report (O'Donnell et al. 2009) version doesn't really show us what these look like either. I wonder how well they match (or don't match) current morphological theories of representation. I know the full-parsing theory is a strong viewpoint for syntax currently, but I don't know how many linguists believe that's really a viable option for morphology. On the flip side, the exemplar-based idea seems like it would make more sense for morphology (where we have a fairly small number of possible combinations), while it seems like that would be a harder sell for syntax (where there can be quite a lot of different parses, especially for longer sentences). Similarly, the full-listing approach seems intractable for syntax. Of course, this only really matters if we think Fragment Grammars apply at multiple levels of linguistic representation (e.g., morphology and syntax). I'm assuming this is what the authors intend, though.
Some more targeted thoughts:
- Exemplar-based Inference: I can't imagine a world where this would win out, compared to Fragment Grammars (FragGs). At best, it has the same coverage as FragGs, but it has to store a heck of a lot more. Perhaps this is included for completeness in model comparison, particularly since the DOP framework assumes this?
- I thought it was very good to mention other models that have similar properties to FragGs. However, given the descriptions provided, I really wondered how Parsimonious Data-Oriented Parsing differs from FragGs ("...explicitly eschews the all-subtree approach in favor of finding a set of subtrees which best explains the data.") Maybe in the way inference is done?
- In terms of comparing this to our reading from last time (Yang 2010), I wonder what's actually being explained by the inference process behind FragGs. Is this a way to assess which representation is likely to be correct for adult usage? If so, this makes it similar to Yang (2010), as that was an assessment of productivity in child speech. Or is this instead a proposal for how adults actually come to have these medium-size chunks, and so it would be a computational level explanation of the actual process of chunk formation?
- A minor note on the past tense representation: I found it interesting that the rule for past tense formation was explicitly encoded in the "morphological representation". This makes this representation seem much more similar to work by Yang on morphological productivity in the English past tense (e.g., Yang 2005), which talks about predictability of child behavior based on the rules used to form the past tense.
- The derivational morphology section: I admit, I got a bit lost on some of the details here.
- How do we take 10,000 "forms" as data, and have that yield 25,000 types and 7.2 million tokens? What are these forms?
- I like the P and P* measures, since those seem to correlate somewhat with the idea of precision and recall (P ~= how generalizable is this suffix, P* ~= how many novel words use this suffix). But then, why are we looking for a correlation between them instead of using an F-score? What does it mean in Table 1 to have a correlation for P, for example? Is that P vs. P*? Or P vs. something else? (I sketch the F-score idea in the snippet after this list.)
- Table 2 left me similarly puzzled - I couldn't decipher this: "...the marginal probability that each suffix occurred first or second in such forms...Table 2 gives the Spearman rank correlation between the (log) ratio of the probability of appearing second to the probability of appearing first with the mean rank statistic..." So if we take a word with two suffixes, s1 and s2, what exactly is being computed? Is it, for each suffix, something like log(prob(that suffix appears second)/prob(that suffix appears first))? And then that per-suffix log ratio is being correlated with the suffix's empirical mean rank? So a suffix that tends to appear second should have a probability ratio greater than 1, which gives a positive value when you take the log. And then we're trying to correlate that number with the suffix's mean rank? Why should this be correlated?
- In the conclusion, the authors talk about how the difference between FragGs and other models is that FragGs care about predictive ability - future novelty vs. future reuse. But I'm not sure I understand how that differs from the computation vs. storage tradeoff (which they advocate replacing with future novelty vs. future reuse) - isn't future novelty based on computation while future reuse is based on storage? If so, this seems like they're restating the tradeoff, but with an emphasis on future usage (i.e., "we care about computation vs. storage because we care about the ability to use language efficiently in the future").
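Just to make my precision/recall analogy above concrete, here's a toy sketch of what I had in mind (my own framing, not the authors' definitions - the suffix scores below are invented purely for illustration): treat P as the precision-ish number and P* as the recall-ish number for each suffix, and combine them with a standard F-score instead of correlating them across suffixes.

def f_score(p, p_star, beta=1.0):
    """Harmonic-mean style combination of two [0,1] measures."""
    if p == 0 and p_star == 0:
        return 0.0
    return (1 + beta**2) * p * p_star / (beta**2 * p + p_star)

# Hypothetical per-suffix scores, just for illustration
suffixes = {"-ness": (0.9, 0.7), "-ity": (0.6, 0.2), "-th": (0.1, 0.05)}
for suffix, (p, p_star) in suffixes.items():
    print(suffix, round(f_score(p, p_star), 3))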
~~~
References
O'Donnell, T., Goodman, N., & Tenenbaum, J. (2009). Fragment Grammars: Exploring Computation and Reuse in Language. Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2009-013.
Yang, C. (2005). On Productivity. Linguistic Variation Yearbook, 5, 265-302.
Yang, C. (2010 Ms.) Who's Afraid of George Kingsley Zipf? Unpublished Manuscript, University of Pennsylvania.
Monday, February 27, 2012
Next time on Mar 12: O'Donnell et al. (2011)
Thanks to everyone who was able to join us for our spirited discussion of Yang (2010). I think we definitely clarified what that study accomplishes in the debate between the two theoretical viewpoints. Next time on March 12, we'll be looking at a paper that also investigates productivity, examining it through the learning angle, in addition to the basic question of representation.
O'Donnell, T.J., Snedeker, J., Tenenbaum, J.B., & Goodman, N.D. (2011). Productivity and reuse in language. Proceedings of the Thirty-Third Annual Conference of the Cognitive Science Society. Boston, MA.
See you then!
Friday, February 24, 2012
Some thoughts on Yang (2010)
I found this paper a real delight to read - like many of Yang's other papers that we've looked at, it's very clear what was done and how this relates to the larger questions that are being examined. In particular, I thought it was excellent to compare the item-based approach to a generative approach, based on what predictions they would make for children's productions. As Yang pointed out, a lot of previous intuitions about what it means to have a generative (or productive) grammar didn't take into account the Zipfian nature of linguistic data. So, by having a way to generate predictions about how much productivity (as measured by overlap) is expected under each viewpoint, we not only get support for the generative system viewpoint but also actually have support against (at least one version of) the item-based approach. Given how popular the item-based approach is in some circles (e.g., a 2009 PNAS article by Bannard, Lieven, & Tomasello), I thought this was quite striking. From my viewpoint, this is one great way to use mathematical & modeling techniques: to adjudicate between competing theoretical representations.
Some more targeted thoughts:
- I really liked in section 1 where the quotes from Tomasello were presented - this gives a clear idea about what exactly is claimed by the item-based approach, and how they have previously used (apparently flawed) intuitions about expected productivity to support that approach. I thought a quote at the end of section 3.3 summed it up beautifully: "...the advocates of item-based learning not only rejected the alternative hypothesis without adequate statistical tests, but also accepted the favored hypothesis without adequate statistical tests."
- The remark in section 2.2 about how even adult usage isn't "productive" by the standard of the item-based crowd is a really nice point. If adult usage isn't "productive", but we believe adults have a generative system, then this should make us question our assumption that "unproductive" child usage indicates a lack of a generative system. Of course, I suppose one might argue that maybe we don't think adults have a fully generative system (this is the view of construction grammar, to some extent, I believe.)
- In section 3.2, I thought Table 1 was a beautiful demonstration of the match between expected overlap for the generative system and the empirically observed overlap in children's speech. (I put together a toy simulation of this kind of overlap calculation in the snippet after this list.)
- A minor point about the S/N threshold discussed in 3.2 - I get that S/ln N is a reasonable approximation for rank, especially as N gets very large. However, I'm not quite sure I understand why S/N was chosen as the threshold. I get that it's an upper bound kind of thing, but if S/ln N grows more slowly than S/N, why not just use S/ln N to get a more accurate threshold? It's not as if ln N is hard to calculate.
- In section 3.3, I get that this is merely an attempt to make the item-based approach explicit (and maybe the item-based folk would think it's not the right characterization), but I think it's a pretty good attempt. It gets at the heart of what their theory predicts - you get lots of storage of individual lexical item combinations. Then, of course, Table 2 shows how this representation doesn't match the empirically observed overlap rates nearly as well, so we have a point against that representation.
- Section 4 is nice in that it suggests that this way of testing theoretical representations should be a general-purpose one - do it for determiner usage, but also for verbal morphology and verb argument structure. Though this analysis wasn't conducted for those other phenomena, I was very convinced that the data show a Zipfian distribution, and so we might expect a generative system to be compatible with them.
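To convince myself of the Zipf point, here's the little simulation I mentioned above (entirely my own toy, not Yang's actual calculation): sample noun tokens with Zipfian frequencies, have a fully productive "grammar" flip a biased coin between "a" and "the" on every token independently of the noun, and measure overlap as the proportion of sampled noun types that occurred with both determiners. Even though the toy grammar is fully productive, overlap comes out well below 100% at child-corpus sample sizes, because most noun types are only sampled once or twice.

import random

def zipf_weights(n_types, s=1.0):
    """Unnormalized Zipfian weights 1/rank^s for n_types noun types."""
    return [1.0 / (rank ** s) for rank in range(1, n_types + 1)]

def simulate_overlap(n_types=300, n_tokens=500, p_the=0.6, seed=1):
    """Fully productive toy learner: determiner choice is independent of the noun."""
    random.seed(seed)
    weights = zipf_weights(n_types)
    seen = {}  # noun rank -> set of determiners it occurred with
    nouns = random.choices(range(n_types), weights=weights, k=n_tokens)
    for noun in nouns:
        det = "the" if random.random() < p_the else "a"
        seen.setdefault(noun, set()).add(det)
    both = sum(1 for dets in seen.values() if len(dets) == 2)
    return both / len(seen)

for n_tokens in (100, 500, 2000):
    print(n_tokens, round(simulate_overlap(n_tokens=n_tokens), 3))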
~~~
References:
Bannard, C., Lieven, E., & Tomasello, M. (2009). Modeling children's early grammatical knowledge. Proceedings of the National Academy of Sciences, 106(41), 17284-17289.
Monday, February 6, 2012
Next time on Feb 27: Yang (2010)
Thanks to everyone who was able to join our extremely lively discussion on Waterfall et al. (2010), and their approach to learning generative grammars from realistic data! Next time on February 27, we'll be looking at a paper that examines a way to quantify claims of linguistic productivity.
Yang, C. (2010 Ms.) Who's Afraid of George Kingsley Zipf? Unpublished Manuscript, University of Pennsylvania.
See you then!
Friday, February 3, 2012
Some thoughts on Waterfall et al. (2010)
What I really like about this paper is the opening discussion where they sketch the broad ideas that motivated the studies discussed in the rest of the paper. They explicitly talk about why the aim of language acquisition is a grammar, why we should care about the algorithmic level, what developmental computational psycholinguistics ought to be, why current computational models are still lacking because they miss out on the social situatedness of language, and what exactly is meant by "psychologically real" (and also how that differs from "algorithmically learnable"). I found this to be very valuable to just have all in one place. And I admit, it got my hopes up for what kind of model they would actually be using.
Unfortunately (for me), the rest of the paper ended up being somewhat anti-climactic because they don't end up implementing a model that has all the features of interest. Of course, that's a tall order, but they go through the process of running models that have the first three features, and then they talk about a lovely new discourse-related information type that seems like it should be incorporated into their model - and then they don't incorporate it. I think I was expecting them to at least talk about how to incorporate it into the models they spent so much time on in the beginning, even if it was infeasible at the current time to actually implement (for whatever reason). But that didn't seem to be what happened.
This isn't to say that the models they implemented and the identification of the "variation set" construct aren't interesting - it's just that I was expecting more based on the opening. As it is, the paper ends up feeling a bit scattered to me - a lot of potentially useful pieces, but they're not tied together very well.
Some more targeted thoughts:
p.674: I like that they were questioning the use of a gold standard, given that our theories about what the syntactic structure might be may not necessarily match psychological reality. I did find their definitions of recall and precision a bit hard to understand, though. Like many other things in the paper, I would have found an explicit formula (and possibly an example) to be more helpful than the text description. My best understanding of recall was something like the number of new generalizations divided by the test set plus the number of new generalizations, while precision was something like the number of correct new generalizations over the total number of new generalizations.
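For my own benefit, here's that guess written out explicitly (this is my reconstruction of their prose, not a formula from the paper, and the numbers are invented):

# My guess at the definitions, formalized just so I can see what they imply
# (these are NOT formulas from the paper):

def my_guess_recall(n_new_generalizations, n_test_set):
    # "number of new generalizations divided by the test set plus the
    #  number of new generalizations"
    return n_new_generalizations / (n_test_set + n_new_generalizations)

def my_guess_precision(n_correct_new_generalizations, n_new_generalizations):
    # "correct new generalizations over the total number of new generalizations"
    return n_correct_new_generalizations / n_new_generalizations

# Hypothetical counts, for illustration only
print(round(my_guess_recall(n_new_generalizations=40, n_test_set=200), 3))                          # 0.167
print(round(my_guess_precision(n_correct_new_generalizations=30, n_new_generalizations=40), 3))     # 0.75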
p.676: They talk about how a strength of their models is that there's no preliminary knowledge of things like grammatical categories (parts-of-speech). While it's nice to be able to say "Look what we can do with no knowledge!", I think this actually makes the problem less psychologically realistic. As far as I know, everyone's willing to grant that the child has some (at least rudimentary) knowledge of grammatical categories before the child starts positing syntactic structure. This is the kind of thing we might get from a child using frequent frames, for instance.
The ADIOS algorithm: I admit, I found this description very difficult to decipher without accompanying examples. It appears to be a batch algorithm - or is it? (It seems that the graph is "rewired" every time a new pattern is detected.) What's an example of a bundle? What's a local flow quantity that would act as a context-sensitive probabilistic criterion for a significant bundle? How exactly does that work? How dissimilar is this whole process from frequent frames, which also induce equivalence classes? What are the basic abilities/knowledge required to make this algorithm work - the ability to create a graph, to identify bundles, to allow recursion of abstract patterns?
The ConText algorithm: This was a little better, because they provided a simple example. But again, I found myself wanting more explicit definitions for the different model components in order to understand how reasonable (or not) a model this was psychologically. For example, there's a local context window of 2, which means in a sentence like "I really like cute penguins", we would get a context vector for "like" where the lefthand context is "I really" and the righthand context is "cute penguins". Okay, great (though I worry about a window of 2 on each side in terms of data sparseness). And in order to construct equivalence classes based on this, the algorithm operates in batch mode over the data. Again, okay. But then, some kind of distance measure is posited to compare different context vectors to each other involving the angle between context vectors - how is this instantiated? What does the angle between "I really" and "But I" look like, for example? Presumably these are mapped into real numbers somehow... On a related note, once the algorithm gets clusters based on these context vectors, it then seems to do something with rewriting sequences - but what are sequences? Are these the utterances themselves, the partially abstracted representations the learner is forming, something else?
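Here's the kind of instantiation I was imagining (a guess at a standard vector-space treatment, not necessarily what ConText actually does): pool the words occurring in a target word's context windows into a count vector, and take the "angle" between two target words to be the arccosine of the cosine similarity of their context vectors. The toy contexts below are invented.

import math
from collections import Counter

def angle_between(contexts_a, contexts_b):
    """Angle (in degrees) between two bag-of-words context-count vectors."""
    vec_a, vec_b = Counter(contexts_a), Counter(contexts_b)
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    cos = dot / (norm_a * norm_b)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

# Hypothetical pooled context words for two target words (lefthand window of 2)
contexts_like = ["I", "really", "I", "just", "we", "really"]
contexts_want = ["I", "really", "you", "just"]
print(round(angle_between(contexts_like, contexts_want), 1))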
p.681: ConText results - I thought it was interesting that the ConText model ends up with subcategorization (for example, eat and drink being in the same class). This again reminds me of frequent frame results, and made me want an explicit compare and contrast.
p.683: Human judgments of acceptability of new sentences created by ConText learner - I thought it was a bit strange to ask the participants to judge the acceptability based on how likely it was to appear in child-directed speech. Would the participants have a good sense of child-directed speech? My experience with undergrads who parse utterances from child-directed speech is that they're utterly surprised by how "ungrammatical" and semi-nonsensical conversational speech (and especially child-directed speech) is.
Variation sets: This is something of real value to computational models, I think. We have empirical evidence that children especially benefit from these particular data units and we have a reasonable idea of how to automatically identify them, and so we could reasonably expect a model to be extra sensitive to these kinds of data (perhaps give these data more weight). There's an interesting comment on p.688 where variation sets with roughly 50% of the material changing are the most helpful to children. My big question was why - what's so special about 50%? Does this represent some optimal tradeoff in terms of recognition and contrast? Another interesting note on p.689 and Table 2 on p.695, where they looked at how predictive the frequent n-grams were in variation sets for part-of-speech - some of them are pretty predictive, which is nice, and this shows that sometimes n-grams are useful, as opposed to needing framing elements (this was something a paper by Chemla et al. 2009 looked at). I do wonder how this predictive quality would hold up cross-linguistically, though - what about languages where the wh-word doesn't move, or languages without auxiliary "do"?
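Just to make "a reasonable idea of how to automatically identify them" concrete for myself, here's one crude operationalization (mine, not necessarily Waterfall et al.'s): successive utterances within a small window that share at least one word get grouped into a variation set.

def variation_sets(utterances, window=2, min_shared=1):
    """Toy detector: group an utterance with the current set if it shares at
    least `min_shared` words with one of the previous `window` utterances."""
    sets, current = [], [utterances[0]]
    for i in range(1, len(utterances)):
        recent = utterances[max(0, i - window):i]
        if any(len(set(utterances[i]) & set(r)) >= min_shared for r in recent):
            current.append(utterances[i])
        else:
            if len(current) > 1:
                sets.append(current)
            current = [utterances[i]]
    if len(current) > 1:
        sets.append(current)
    return sets

# Invented child-directed mini-corpus
child_directed = [
    ["where", "is", "the", "doggie"],
    ["do", "you", "see", "the", "doggie"],
    ["the", "doggie", "is", "eating"],
    ["look", "at", "that"],
]
print(variation_sets(child_directed))  # groups the first three utterances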
Incremental learning (p.698): There's some discussion at the very end about how to transform ConText into an incremental learner, which I think is a good thing to think about. However, I wonder about the motivation behind using the gap automatically (i.e., a furry marmot gets additional "frames" of ___ furry marmot, a ____ marmot, and a furry _____ presumably). Is the idea that this will jumpstart the abstraction process, which otherwise would have to wait until it saw another instance that used two of those words? (Or in the case of a context window of 2 on each side, 4 of the words?)
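Here's my reading of the gap trick in toy form (my sketch, not their code): every observed word sequence spawns the one-slot frames you get by knocking out a single word, so abstraction doesn't have to wait for a second overlapping utterance to come along.

def gapped_frames(words, gap="___"):
    """All one-slot 'frames' you get by knocking out a single word."""
    return [" ".join(words[:i] + [gap] + words[i+1:]) for i in range(len(words))]

print(gapped_frames(["a", "furry", "marmot"]))
# ['___ furry marmot', 'a ___ marmot', 'a furry ___']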
References
Chemla, E., Mintz, T., Bernal, S., & Christophe, A. (2009). Categorizing Words Using "Frequent Frames": What Cross-Linguistic Analyses Reveal About Distributional Acquisition Strategies. Developmental Science.
Monday, January 23, 2012
Next time on Feb 6: Waterfall et al. (2010)
Thanks to everyone who was able to join our vigorous discussion of phonotactics and word segmentation today! Next time on Feb 6, we'll be looking at an article that focuses on syntactic acquisition, with an emphasis on learning generative grammars from realistic data.
Waterfall, H., Sandbank, B., Onnis, L., & Edelman, S. (2010). An empirical generative framework for computational modeling of language acquisition. Journal of Child Language, 37, 671-703.
See you then!
Friday, January 20, 2012
Some thoughts on Daland & Pierrehumbert (2011)
One of the first things that struck me about this paper was how wonderfully well-written I found it to be - it was so easy to follow the different ideas, and I really appreciated how careful it was to explain the details of pretty much everything involved. I kept saying to myself, "Yes, this is how a modeling paper should be written! So clear, so honest!" (And I'm not just saying this because one of the authors is in the reading group.) To be fair, it's likely that the pieces of this model are somewhat more transparent than pieces of other models we've looked at, and so lend themselves well to examination and explanation. Still, kudos to the authors on this - because they were very precise about both the modeling components and the ideas behind the model.
On a more content-related point, the authors were very clear to indicate that a diphone-based process couldn't occur until after most of the phones of the language were determined, which wouldn't happen till around 9 months. Since word segmentation starts earlier than this, this suggests diphone-based learning is presumably a later stage word segmentation strategy rather than an initial get-you-off-the-ground strategy. But I wondered if this was necessarily true. Suppose you have a learner who really would like to use diphone-based learning, but hasn't figured out her phones yet. Would she perhaps try to do it anyway, but simply using whatever fuzzy definition of phones she has (probably finer-grained distinctions than are actually present in the language)? For example, maybe she hasn't figured out that /b/ and /bh/ are the same phone in English because she's only 6 months old. (Or maybe that the /b/ in /bo/ is the same as the /b/ in /bi/.) This means that she just has more "phones" than the 9-month-old diphone learner has - but how much does that matter? My guess is this still leads to fewer overall units than a syllable-based learner has. Moreover, because there are more "phone" units for this 6-month-old diphone learner, maybe it takes longer to segment words out. A longer period of undersegmentation might occur, but still yield some useful units that could help bootstrap the lexical-diphone learner.
Some more specific thoughts:
- The lexical learner here reminded me quite a bit of the one by Blanchard, Heinz, & Golinkoff (2010), and I was wondering about a compare and contrast between them. They both clearly make use of phrasal units, and later on lexical units. I believe the BH&G learner also included knowledge of a syllable, which the lexical learner doesn't.
- With respect to syllables vs. diphones, I wonder how many syllables in English are diphones (CV syllables, really), and how informative they are for word segmentation. In some sense, this is getting at how different a syllable-based learner is from a diphone-based learner. I imagine it would vary from language to language - Japanese would have more overlap between syllables and diphones, while German maybe has less. This seems related to the point brought up on p.149, in section 7.4.3, where they mention that the diphone learner could apply to syllables in Japanese (presumably rather than phones).
- p.125: I like that they worry about the implausibility of word learning from a single (word segmentation?) exposure. I think there's something to the idea that it takes a couple of times of successfully segmenting the word form from fluent speech before it sticks around long enough to be entered into the lexicon (and hopefully get assigned a meaning later on). Related note on p.127, where they assume the segmentation mechanism has "no access to specific lexical forms" - this seems like the extreme view of this. Unless I misunderstood, it implies that segmentation doesn't really make use of individual word forms, so algebraic learning (ex: "morepenguins = more+penguins" if more is a known form) shouldn't occur. I'm not sure how early this kind of learning occurs, to be honest, but it's certainly true that a lot of models (including BH&G's, I believe), assume that this kind of information is available during segmentation.
-p.133, section 4.1.3: It's completely reasonable to use the CELEX as a source for phonetic pronunciation and leave out words that aren't in CELEX (like baba), but I wonder how these affect the token sequence probabilities. It would be nice to know if it was just a few types that frequently occurred that were left out, or if it was a number of different types (potentially with many different diphone sequences).
-p.141, section 5.5: I really liked that they explored what would happen if the learner's estimation of how often a word boundary occurred was off (and then found that it didn't really matter). However, I do wonder if the reason the learner was robust has anything to do with the fact that the highest value in the range they looked at was still smaller than the "hard decision" boundary of 0.5 (mentioned on p.135). (I sketch a toy version of this thresholded decision after this list.)
-p.142: I also really liked that they looked at more realistic conversational speech data, which included effects of coarticulation (Stanford --> Stamford), which would then be a good clue that the diphone sequence was part of the same word. I thought coarticulation occurred across word boundaries too in conversational speech, though - maybe it's just that it occurs more often within words.
-p.146, section 7.2.1: I'm not quite sure I followed that part that says "undersegmentation means sublexically identified word boundaries can generally be trusted." If you've undersegmented, how do you know about word boundaries inside the chunk you've picked out? By definition, you didn't put in those word boundaries.
-p.148, section 7.4.1: I think the idea of prosodic words is extremely applicable to the word segmentation process at this stage. Given what we know of function words and content words, is there some principled way to resegment an existing corpus so it's made up of prosodic words instead of orthographic words? Or maybe the thing to do is to look at the errors being made by existing word segmentation models and see how many of them could be explained by the model finding prosodic words instead of orthographic words. A model that has a lot of prosodic words is maybe closer to human infants?
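To make the "hard decision" idea concrete for myself, here's a toy sketch of a thresholded diphone segmenter (my own cartoon, not the paper's actual model - in particular, the real learner estimates its boundary probabilities without seeing word boundaries, e.g. from phrase edges, whereas I cheat and estimate them from a pre-segmented toy corpus, because the part I want to see is the 0.5 threshold).

from collections import defaultdict

# Invented toy corpus of segmented "utterances" (letters stand in for phones)
corpus = [["the", "dog"], ["the", "big", "dog"], ["a", "dog", "barked"]]

counts = defaultdict(lambda: [0, 0])  # diphone -> [word-internal, boundary-spanning]
for utterance in corpus:
    phones = []
    boundary_after = set()
    for word in utterance:
        phones.extend(word)
        boundary_after.add(len(phones) - 1)
    for i in range(len(phones) - 1):
        diphone = (phones[i], phones[i + 1])
        counts[diphone][1 if i in boundary_after else 0] += 1

def p_boundary(diphone):
    internal, spanning = counts[diphone]
    total = internal + spanning
    return 0.5 if total == 0 else spanning / total  # unseen diphone: no evidence

def segment(phones, threshold=0.5):
    out = [phones[0]]
    for i in range(len(phones) - 1):
        if p_boundary((phones[i], phones[i + 1])) > threshold:  # the hard decision
            out.append("|")
        out.append(phones[i + 1])
    return "".join(out)

print(segment(list("thebigdog")))  # -> the|big|dog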
Tuesday, January 10, 2012
Next time on 1/23: Daland & Pierrehumbert (2011)
Welcome back! This quarter, we'll be holding our lively reading group on Mondays at 10:30am in SBSG 2221, with our first meeting of the quarter happening on January 23rd. (Check out the schedule for the rest of the quarter's meetings.) We'll be looking at an article that explores a cognitive model of word segmentation that draws on phonotactics and is instantiated using Bayes' theorem:
Daland, R. & Pierrehumbert, J. (2011). Learning Diphone-Based Segmentation. Cognitive Science, 35, 119-155.
See you then!
Tuesday, December 6, 2011
Schedule for Winter 2012 available
The schedule of readings for winter 2012 is now available! We'll be looking at a variety of topics again, including word segmentation, morphology, and linguistic productivity.
Friday, November 18, 2011
Some thoughts on Mitchener & Becker (2011)
I really like that M&B are looking at a learning problem that would be interesting to both nativists and non-nativists (a lot of the time, it seems like the different sides are talking past each other on what problems they're trying to solve). I also really like that they're exploring a variety of different probabilistic learning models. It does seem that M&B are still approaching the learning problem from a strongly nativist perspective, given the way they've described the actual problem: the learner knows there are two classes of behavior that link syntactic structure to semantic interpretation (raising vs. control), and that there are specific cues the learner should use to figure out which behavior a given verb has (animacy & eventivity). Importantly, only those cues (and their distribution) are relevant. There also seems to be an implicit assumption (at least initially) that unambiguous data are required to distinguish the behavior of any given verb, and the learning problem results because unambiguous data aren't always available (this is a common way learnability problems are framed in a nativist perspective). One thing I wondered while reading this is what would happen if the behavior of these verbs was taken in the context of a larger system - that is, would it possibly be easier to recognize these distinct classes of verbs if other information were deemed relevant besides the two cues M&B look at? I believe they hint at this themselves in the paper - that it might be possible to look at the syntactic distribution of these verbs over all frames, rather than just the ambiguous frame that signals either raising or control (She VERBed to laugh). This doesn't solve the problem of knowing what the different linking rules are between structure and interpretation, but maybe it makes the classification problem (that there are distinct classes of verbs) easier.
Some more targeted thoughts:
- Footnote 2 talks about the issues of homophony, and I can certainly see that tend's meanings are pretty distinct between raising and regular transitive verb. However, happens looks like it means very similar things whether it's raising or regular transitive, so I wonder how children would make this distinction - or if they would at all. If not, then this looks like an additional class of verb that involves mixed behavior.
- The end of section 2 talks about how 3- and 4-year-olds are very sensitive to animacy when they interpret verbs in the ambiguous raising/control frame. I can completely believe that animacy might generally be a cue children use to help them figure out what things should mean (e.g., if a verb takes an agent or not).
- I really like the discussion/caveat that M&B do in the intro of section 4 about biological plausibility.
- I also really liked the discussion of the linear reward penalty (LRP) learner's issues in section 4.2.1. Not having an intermediate state equilibrium is problematic if you need there to be mixed behavior (e.g., something is ambiguous between raising and control). I admit, I was surprised by the saturating accumulator model M&B chose to implement to correct that problem. I had some trouble connecting the various pieces of it to the process in a child's mind - the intuitive mapping didn't work for me the way it does for the LRP learner. For example, the index they talk about right at the end of section 4.2.2 seems fairly ad-hoc and requires children to do abstracting over patterns of frames defined by these different semantic cues.
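For reference, here's the linear reward-penalty update as I understand it (a Bush-Mosteller-style scheme), in toy form. I'm simplifying by letting the reward depend only on the incoming datum rather than on which analysis the learner sampled, and the 70% figure is invented - the point is just to see the mechanics of the update.

import random

def lrp_update(p, rewarded, gamma=0.05):
    """Linear reward-penalty update: p = probability assigned to (say) the raising analysis."""
    if rewarded:
        return p + gamma * (1 - p)   # nudge toward 1
    return (1 - gamma) * p           # nudge toward 0

random.seed(0)
p = 0.5
for step in range(2000):
    # Hypothetical mixed evidence: 70% of relevant data reward the raising analysis
    p = lrp_update(p, rewarded=(random.random() < 0.7))
print(round(p, 3))  # with gamma > 0, p keeps fluctuating rather than freezing at a stable intermediate value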
Tuesday, November 8, 2011
Next time on 11/21: Mitchener & Becker (2011)
Thanks to those of you who were able to join our nicely in-depth discussion of Alishahi & Pyykkonen (2011)'s article on syntactic bootstrapping! I think we figured out some of the details that were glossed over, and these really helped to understand the contribution of the study.
Next time, on Nov 21 (@3pm in SBSG 2221), we'll be looking at an article that examines how a subtle syntactic distinction that has specific semantic implications (called the raising-control distinction) could be learned.
Mitchener, G. & Becker, M. (2011). Computational Models of Learning the Raising-Control Distinction. Research on Language and Computation, 8(2), 169-207.
See you then!
Friday, November 4, 2011
Some thoughts on Alishahi & Pyykkonen (2011)
I really like the investigation of syntactic bootstrapping in this kind of computational manner. While experimental approaches like the Human Simulation Paradigm (HSP) offer us certain insights about how (usually adult) humans use different kinds of information, they have certain limitations that the computational learner doesn't (such as the researcher knowing exactly what the internal knowledge state is, and how it changes). From my perspective, the HSP with adults (and maybe even with 7-year-olds) is a kind of ideal learner approach, because it asks what inferences can be made with maximal knowledge about (the native) language - so while it clearly involves human processing limitations, it's examining the best that humans could reasonably be expected to do in a task that's similar to what word-learners might be doing. The computational learner is much more limited in the knowledge it has access to a priori, and I think the researchers really tried to give it reasonable approximations of what very young children might know about different language aspects. In addition, as A & P mention, the ability to track the time course of learning is a nice feature (though with some caveats with respect to implementation limitations).
Some more targeted thoughts:
I thought the probabilistic accuracy was a clever measure for taking advantage of the distribution over words that the learner calculates.
As I said above, tracking learning over time is an admirable goal - however, the modeled learner here clearly is only qualitatively doing this, since there's such a spike in performance between 0 and 100 training examples. I'm assuming A & P would say that children's inference procedures are much noisier than this (and so it would take children longer), unless there's evidence that children really do learn the exact correct meaning in under 100 examples (possible, but seems unlikely to me).
I was a little surprised that A & P didn't discuss the difference in Figure 1 between the top and bottom panel with respect to the -LI condition. (This was probably due to the length constraints, but still.) It's a bit mystifying to me how absolute accuracy could be close to the +LI condition while verb improvement is much lower than the +LI condition. I guess this means the baseline for verb improvement was different between the +LI and -LI conditions somehow?
It was indeed interesting to see that having no linguistic information (-LI) was actually beneficial for noun-learning - I would have thought noun-learning would also be helped by linguistic context. A & P speculate that this is because early nouns refer to observable concepts (e.g., concrete objects) and/or the nature of the training corpus made the linguistic context for nouns more ambiguous than for verbs. (The latter reason ties into the linguistic context more.) I wonder if this effect would persist with a different training corpus (after all, there were some assumptions A & P made when constructing this corpus - they seemed reasonable, but there are still different ways to construct the corpus.)
Monday, October 17, 2011
Next time: Alishahi & Pyykkonen (2011)
Thanks to those of you who were able to join our nicely in-depth discussion today of Dillon et al. (2011)'s article on applying Bayesian models to phonological acquisition! Next time on 11/7 (@3:30pm in SBSG 2221), we'll be discussing an article that looks at the phenomenon of syntactic bootstrapping, which is the ability to infer word meaning and abstract structure associated with that word from the syntactic context of the word:
Alishahi, A. & Pyykkonen, P. (2011). The onset of syntactic bootstrapping in word learning: Evidence from a computational study. Proceedings of the 33rd Annual Conference of the Cognitive Science Society, Boston, MA.
See you then!
Friday, October 14, 2011
Some thoughts on Dillon et al. (2011)
I'm really fond of this paper - I love that they're tackling realistic problems (with realistic language data), that they're seriously looking at the state of the art with respect to computational models of it, and that they're finding a way to connect linguistic theory (e.g., "There are phonological rules") with this level of concreteness (e.g., "Let's make them linear models operating over acoustic space"). Because of all this, I think their point about the potential issues of two-stage models comes across very clearly. And I love that they can make a model that learns phonemes and the relationships between phonetic categories simultaneously. Moreover, the fact that they can do this without trying to learn a lexicon simultaneously (like Feldman, Griffiths, & Morgan (2009) do) is impressive to me, since that was the main thing that seemed to lead to good results for Feldman et al. (2009). Notably, they make use of the linguistic context (i.e., does a uvular consonant follow), which is something Swingley (2009) recently suggested looks really helpful for English phonemes in a review of infant phoneme learning.
A few more targeted thoughts:
- I really like that they note the three-vowel +allophones system is not just a special weirdness of Inuktitut, but rather something that occurs in a number of different languages. This makes it more important to be able to account for this kind of data, and bolsters support for the single stage model.
- I also thought it was useful to note that the EM approach follows the frequentist tradition. After a moment's reflection, this is clearly true, but it didn't occur to me until they pointed it out.
- Because of the nature of the Bayesian model, the more data that come in, the more the model is likely to prefer more categories over fewer (and the explanation they give for this just before the discussion of Expt 1 is entirely sensible). This carries over even for their cool Expt 3 model that learns categories and rules simultaneously (as we can see in Table 6) - the 12,000-data-point model is much more likely to posit 4 or 5 categories than the 1,000-data-point model. (There's a toy sketch of this tendency after this list.) I'm wondering what this means for actual acquisition. Should we expect that infants learn very quickly and so end up with 3 categories + rules? Or would we expect that infants might go through a stage where they have 4 or 5 categories, and have to recover (maybe based on doing word segmentation/lexicon item discovery)?
- For the one-stage model in Expt 3, they mention that they build in a bias for complementary distribution - is this an uncontroversial assumption (or one that's easy to derive from innate abilities we know infants have)? I honestly don't have strong intuitions about this. It'd be great if it were.
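(Here's the toy sketch mentioned above of the more-data-means-more-categories tendency. This is a hedged illustration only: I'm assuming a Chinese-restaurant-process-style prior over categories, which is in the same nonparametric Bayesian family but is not necessarily the exact prior Dillon et al. use, and the alpha value is invented.)

```python
# Hedged illustration: under a CRP-style prior, the expected number of
# occupied categories grows (roughly like alpha * log N) as more data come in.
import numpy as np

def crp_num_tables(n_points, alpha, rng):
    """Simulate one CRP seating arrangement; return the number of tables."""
    counts = []  # customers at each table (tokens per category)
    for _ in range(n_points):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        choice = rng.choice(len(probs), p=probs)
        if choice == len(counts):
            counts.append(1)      # open a new table (posit a new category)
        else:
            counts[choice] += 1   # join an existing table
    return len(counts)

rng = np.random.default_rng(1)
for n in (1_000, 12_000):
    tables = [crp_num_tables(n, alpha=1.0, rng=rng) for _ in range(20)]
    print(n, np.mean(tables))  # the 12,000-point runs posit more categories
```

Of course, in the real model the likelihood matters too; this just shows one prior-side pressure pushing in the direction the commentators note.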
References:
Feldman, N., Griffiths, T., & Morgan, J. (2009). Learning phonetic categories by learning a lexicon. Proceedings of the 31st Annual Conference of the Cognitive Science Society.
Swingley, D. (2009). Contributions of infant word learning to language development. Philosophical Transactions of the Royal Society B, 364, 3617-3632.
Monday, October 3, 2011
Thanks to those of you who were able to join our spirited discussion today of Dunbar et al. (2010)'s article on Bayesian reasoning in linguistics! Next time on 10/17 (@3pm in SBSG 2221), we'll be discussing an article by the same crew of authors that models the acquisition of specific phonological phenomena:
Dillon, B., Dunbar, E., & Idsardi, W. (2011 ms). A single stage approach to learning phonological categories: Insights from Inuktitut. University of Maryland, College Park and University of Massachusetts, Amherst.
See you then!
Friday, September 30, 2011
Some thoughts on Dunbar et al. (2010)
This is probably one of the more linguistically technical articles we've read in the group to date, but I think that even if the linguistic details aren't fully accessible to someone without a linguistics background, there's still a very good, basic point made about the simplicity of abstract structures, given principles of Bayesian reasoning. On the one hand, this might seem surprising, since adding another layer of representation might seem de facto more complex; on the other hand, there's something clearly simpler about having three basic units of representation instead of six (for instance).
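(To make that intuition concrete for myself, here's a toy numerical sketch - emphatically not Dunbar et al.'s actual computation, with invented data and a BIC-style penalty standing in for the real marginal likelihood - showing how "three categories + a rule" can beat "six categories" once the rule ties parameters together.)

```python
# Toy sketch (invented data/values): compare "6 free category means" vs.
# "3 means + 1 shared shift" on 1-D acoustic data actually generated from the
# latter. A BIC-style score stands in for the marginal likelihood.
import numpy as np

rng = np.random.default_rng(2)
sigma, n_per_group = 30.0, 200

# Three underlying categories; in the "rule" context (ctx = 1) each is
# shifted by the same amount.
true_means, true_shift = np.array([300.0, 500.0, 700.0]), 80.0
data = {(k, ctx): rng.normal(m + ctx * true_shift, sigma, n_per_group)
        for k, m in enumerate(true_means) for ctx in (0, 1)}
groups = list(data.values())
N = sum(len(g) for g in groups)

def log_lik(means_by_group):
    """Gaussian log-likelihood of each group under its assigned mean."""
    return sum(np.sum(-0.5 * ((x - mu) / sigma) ** 2
                      - np.log(sigma * np.sqrt(2 * np.pi)))
               for x, mu in zip(groups, means_by_group))

# Hypothesis A: six independent surface categories (6 free parameters).
means6 = [g.mean() for g in groups]
bic6 = -2 * log_lik(means6) + 6 * np.log(N)

# Hypothesis B: three underlying categories + one shared shift (4 parameters).
shift_hat = np.mean([data[(k, 1)].mean() - data[(k, 0)].mean() for k in range(3)])
means3 = [np.concatenate([data[(k, 0)], data[(k, 1)] - shift_hat]).mean()
          for k in range(3)]
means3_by_group = [means3[k] + ctx * shift_hat
                   for k in range(3) for ctx in (0, 1)]
bic3 = -2 * log_lik(means3_by_group) + 4 * np.log(N)

print(bic3 < bic6)  # typically True: the abstract analysis wins here
```

The point isn't the specific numbers - it's that the Occam-style penalty built into Bayesian model comparison can favor the analysis with the extra layer of abstraction, precisely because that layer reduces the number of independent things that have to be posited.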
Some more targeted thoughts:
p.7: The particular example they discuss involving phonemes (three phonemes plus derivational rules vs. six phonemes with no need for derivational rules) reminds me of Perfors et al. (2010), who were looking at recursion in language, also from a Bayesian perspective. In that case, the decision was between a non-recursive grammar, a partially-recursive grammar, and a fully recursive grammar. The outcome turned out to be that for different structures (subject embedding vs. object embedding), different grammars fit the data best, with one of the winners being the partially-recursive grammar. In essence, this is a "direct store + some computation" approach. For the phoneme example in Dunbar et al., the choice is between "directly store six" and "store three + some computation", and the "some computation" option ends up being the best. (Related note on p.30: I agree that it would be nice to have formal theoretical debates take place at this level when discussing learnability, rather than relying on intuitions about whether computation or direct storage is more complex/costly.)
p.9: Just a quick note about their justification for looking for a theoretically optimal solution (using the ideal learner paradigm, essentially) - I do agree that this has a place in acquisition studies. Basically, if you formulate a problem (and an accompanying hypothesis space), and then find that this problem is unsolvable by an ideal learner, this is a clue that something is not right - maybe it's the hypothesis space, maybe it's a missing learning bias on how to use the data, etc.
p.14: Another main message of the authors: "Probability theory...is simply a way...of formalizing reasoning under uncertainty." I get the impression that this is to persuade readers who aren't normally very fond of probability.
Monday, September 26, 2011
Welcome back!
The CoLa Reading Group will be holding its meetings on Mondays at 3pm in SBSG 2221 this quarter. We'll meet four times during the quarter, approximately every other week (schedule available here). Our first meeting will be held this coming Monday October 3rd, when we'll be looking at Dunbar, Dillon, & Idsardi (2010), who examine the utility of abstract linguistic representations viewed from a Bayesian perspective:
Dunbar, E., Dillon, B., & Idsardi, W. (2010 ms) A Bayesian Evaluation of the Cost of Abstractness. University of Maryland, College Park and University of Massachusetts, Amherst.
And remember: Even if you aren't able to come to the meeting in person, you're always welcome (and encouraged) to post on the reading group discussion board here!
See you next Monday!
Monday, May 23, 2011
Thanks and see you at the end of the summer!
Thanks to everyone who was able to join us for our discussion of Clark & Lappin's (2011) article - as usual, we had quite the rousing debate about various points! This concludes the reading group activities for the spring quarter. We'll be picking up again at the end of the summer, around late August. Have a good break!