Monday, April 30, 2012
Some thoughts on Bouchard (2012)
I think Bouchard (2012) actually takes a similar approach to Perfors et al. (2011) with respect to solving the structure-dependence problem, in the sense of redefining what the problem is and then stating that the solution to this problem does not involve UG learning biases. It's at this point that the two studies part ways, but that fundamental similarity is there. Bouchard does believe that meaning is inextricably tied to the problem, but he rejects the transformational approach that's traditionally assumed by Chomsky and colleagues. Instead, meaning is more foundational in how the structures are generated. One thing that isn't clear to me at all is whether the UG problem is solved, as the title would suggest. It seems to me that the components Bouchard assumes involve a lot of knowledge about interpretation (ISSUE and its structural relationship to Tense, incompleteness relating to a non-tensed utterance, etc.), and it's unclear where this knowledge comes from, if it's not meant to be innate. Maybe "solving the UG problem" is just supposed to be about providing a complete specification of what's in UG?
Some more targeted thoughts:
- One of Bouchard's issues with the current ideas about UG is that the components of UG seem hard to explain evolutionarily. That is, if we accept the current UG formulation, it's hard to explain why it would have come about for any kind of adaptive reason. This is a fair point, but I'm not sure the UG Bouchard proposes gets around it either.
- I think Bouchard does a nice review of the current approach to UG that's motivated by efficient computation. In particular, it's fair to ask if "efficiency" is really the crucial factor - maybe "effectiveness" would be better, if we're trying to relate this to some kind of evolutionary story.
- I'm not sure it's fair to criticize the transformational account by saying that children may not encounter declarative utterances before they encounter interrogative utterances. It should be enough that children recognize the common semantics between them, and assume they're related.
- I appreciate Bouchard's effort to specify the exact form of the rule that relates declarative and interrogative utterances (the four constraints on the rule). This would be useful if we were ever interested in making a hypothesis space of rules and having the child learn which one is the right one (it reminds me a bit of Dillon, Dunbar, & Idsardi (2011), with their rule-learner). Anyway, the main point is clear: the actual rule is one of many that could be posited, even given the four constraints Bouchard describes, and we either need the right rule to fall out from other constraints or we need it to be learnable from the available possibilities.
- I agree with the basic point that "with a different order comes different meaning", but the point is that it's a related meaning. Even in example (21), the utterances are still about the event of seeing and involve the actors Mary and John.
- "Question formation is not structure dependent, it is meaning dependent" - Well, sure, but meaning dependent, especially as it's described here, is all about the structure. So "meaning dependent" is the same as saying "structure dependent", isn't it?
- The Coherence Condition of Coindexation (example 30): This sounds great, but don't we then need to specify what "coherent" means? This seems to be an example of describing what's going on, rather than explaining what's going on. For example, for (29), why do those two elements get coindexed, out of all the elements in the utterance? Presumably, this has to do with the structure of the utterance... This relates to a point slightly later on: "...due to the lexical specifications that determine which actant of the event mediates the link between the event and a point in time" - Where do these lexical specifications come from? Are they learned? This seems more a description than an explanation.
- p.25: "Whatever Learning Machine enables them to learn signs also enables them to learn combinatorial signs such as dedicated orders of signs" - This seems like a real simplification. The whole enterprise of syntax is based on the idea that meaning is not the only thing determining syntactic form (otherwise, how do you get ungrammatical utterances that are intelligible, like "Where did Jack think the necklace from was expensive?"). So the Learning Machine needs to have something explicit in there about how combinatorial meaning links to form.
Wednesday, April 18, 2012
Next time on May 2: Bouchard (2012)
Thanks to everyone who was able to join us for an informative discussion of Perfors et al. (2011), along with the reply piece in Berwick et al. (2011)! Next time on May 2, we'll be looking at a different approach to addressing the same problem in language acquisition (structure-dependent rules) by Bouchard (2012). Interestingly, Bouchard is coming from a very different perspective, where the issue is not that too much has been assumed to be part of UG, but rather that not enough has.
Bouchard, D. (2012). Solving the UG Problem. Biolinguistics, 6(1), 1-31.
http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/Bouchard2012_UGStructDep.pdf
See you then!
-Lisa
Monday, April 16, 2012
Some thoughts on Perfors et al. (2011) + Berwick et al. (2011)
I really like how straightforward Perfors et al.'s (2011) Bayesian model is - it's very easy to see how and why they get the results that they do from child-directed speech. They're very careful to say precisely what their model is doing: Assuming there are hierarchical representations in the child's hypothesis space already, these representations can be selected as the ones that best match the child-directed input. In addition, I think they highlight how previous approaches to this problem have tended to split along two distinct dimensions: domain-specific vs. domain-general, and structured vs. unstructured. It's always useful to figure out where the current approach is adding to the existing discussion.
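To make that selection logic concrete, here's a minimal cartoon of the kind of Bayesian trade-off involved - this is my own toy illustration, not Perfors et al.'s actual model, and the grammar names, priors, likelihoods, and corpus counts are all invented: a more complex (hierarchical) grammar pays a prior cost, but earns it back in likelihood once the data contain enough hierarchically structured utterances.

```python
import math

# Invented corpus: counts of two kinds of "sentence patterns" in child-directed speech.
corpus = {"simple": 80, "embedded": 20}

# Two hypothetical grammars. The "flat" grammar is simpler (higher prior) but
# handles embedding poorly; the "hierarchical" grammar is penalized in the prior
# but assigns the embedded patterns much higher probability.
grammars = {
    "flat":         {"prior": 0.7, "lik": {"simple": 0.012, "embedded": 0.0001}},
    "hierarchical": {"prior": 0.3, "lik": {"simple": 0.010, "embedded": 0.0050}},
}

def log_posterior(g):
    """Unnormalized log posterior: log prior + summed log likelihood over tokens."""
    lp = math.log(g["prior"])
    for pattern, count in corpus.items():
        lp += count * math.log(g["lik"][pattern])
    return lp

for name, g in grammars.items():
    print(f"{name:13s} log posterior ~ {log_posterior(g):.1f}")
# With enough embedded data, the hierarchical grammar wins despite its lower prior.
```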
The only real issue I see is the one Berwick et al. (2011) pointed out: The (infamous) poverty of the stimulus (PoS) problem relating to structure dependence is not the one Perfors et al. (2011) are addressing. In particular, the traditional PoS problem has to do with hypothesizing what kind of rules will relate a declarative utterance (e.g., "I can have an opinion") to its interrogative equivalent (e.g., "Can I have an opinion?"). This relationship isn't addressed in Perfors et al.'s model - all that model is concerned with is the ability to assign structure to these utterances. As far as it knows, there's no relationship between the two. And this is where we see the real divergence from the traditional PoS problem, where it was assumed that the child is trying to generate an interrogative using the same semantic content that would be used to make the declarative. This is why the "rules of transformation" were hypothesized in the first place (granted, with the assumption that the declarative version was more basic, and the interrogative version had to be created from that basic version). So, long story short, the Perfors et al. model is learning something that is different from the original PoS problem.
However, it's fair to assume that knowing there are hierarchical structures is a prerequisite for creating rules that use those hierarchical structures. In this sense, what Perfors et al. have shown is really great - it allows the building blocks of the rules (hierarchical structures) to be chosen from among other representations. However, as Berwick et al. point out, it still remains to be shown how having structured building blocks leads you to create structure-dependent rules. Perfors et al. assume that this is an automatic step: [end of section 1.2] "...any reasonable approach to inducing rules defined over constituent structure should result in appropriate structure-dependent rules". Phrased that way, it does sound plausible - and yet, I think there's a real distinction, especially if we're concerned about relating the declarative and interrogative versions of an utterance. Making a structure-dependent rule requires using the available structure as the context of the rule. So this means you could make a structure-independent rule just by not using structure in the context of the rule - even if your building blocks are structured.
Example of a structure-independent rule using structured building blocks:
Move the auxiliary verb after the first NP.
Building blocks: auxiliary verb, NP (structured)
Context: first (not structured)
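To make the contrast concrete, here's a toy illustration of the two kinds of context - my own construction with hand-annotated sentences, not anything from Perfors et al. or Berwick et al. Both rules below use structured building blocks (aux, NP), but only one uses structure in its context, and they come apart on a sentence with a complex subject.

```python
# "The boy who is smiling can swim." (hand-annotated for illustration)
words = ["the", "boy", "who", "is", "smiling", "can", "swim"]
aux_positions = [3, 5]        # indices of the auxiliaries ("is", "can")
first_np_shallow = (0, 1)     # a local, linear NP: "the boy"
subject_np_full = (0, 4)      # the full hierarchical subject: "the boy who is smiling"

def front_aux(words, aux_index):
    """Form a question by moving the auxiliary at aux_index to the front."""
    aux = words[aux_index]
    rest = words[:aux_index] + words[aux_index + 1:]
    return " ".join([aux.capitalize()] + rest) + "?"

# Structure-INdependent context: the first aux after the first (shallow) NP.
linear_choice = min(i for i in aux_positions if i > first_np_shallow[1])
print(front_aux(words, linear_choice))      # "Is the boy who smiling can swim?" (wrong)

# Structure-DEpendent context: the first aux after the whole subject constituent.
structural_choice = min(i for i in aux_positions if i > subject_np_full[1])
print(front_aux(words, structural_choice))  # "Can the boy who is smiling swim?" (right)
```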
So again, I think that what Perfors et al. have shown is great in terms of understanding the stages of learning - it's important to know that the preference for hierarchical structure in language doesn't have to be innate (even if the ability to consider hierarchical structure in the hypothesis space may be). However, I do think it falls short of addressing the PoS problem that linguists typically associate with structure dependence. This isn't a failing of Perfors et al. - it just means that people really have to be careful about how they interpret these results. It's very tempting to say that the structure-dependence PoS problem has been solved if you don't give this a very careful read and know what linguists think the problem actually is.
Wednesday, April 4, 2012
Next time on April 18: Perfors et al. (2011)
We'll have our first meeting of the CoLa Reading Group for this quarter on Wednesday April 18 at 10:30am in SBSG 2221. You can check out the schedule on the CoLa reading group webpage for the rest of this quarter's meetings.
For our first article of the quarter, we'll be looking at Perfors, Tenenbaum, & Regier (2011), who use hierarchical Bayesian modeling to examine structure dependence in syntax, which has often been used as an example of an induction problem (or poverty of the stimulus) in language acquisition. I also recommend looking at a section in a recent response to this article by Berwick, Pietroski, Yankama, & Chomsky (2011), since it explicitly addresses the results of Perfors et al. (2011).
Perfors, A., Tenenbaum, J., & Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118, 306-338.
Berwick, R., Pietroski, P., Yankama, B., & Chomsky, N. (2011). Poverty of the Stimulus revisited. Cognitive Science, 35, 1207-1242. [Section 4.2]
Friday, March 30, 2012
Gearing up for the spring - readings available!
I hope everyone's had a good spring break - and now it's time to gear up for the spring quarter of the reading group! :) The schedule of readings is now posted on the CoLa reading group webpage, following several suggestions of topics of interest to the group.
Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week.
Monday, March 12, 2012
Thanks for a great quarter!
Thanks to everyone who was able to join us for our discussion of O'Donnell et al. (2011)! It was very useful to compare the models discussed to some existing models that we know about, and think about how to connect the representational issues to language acquisition.
For next quarter, let me know if you have any particular articles or topics that you would be interested in discussing - you're welcome to post them here or email them to me at lpearl@uci.edu.
Have a good spring break!
Friday, March 9, 2012
Some Thoughts on O'Donnell et al. (2011)
I like that this paper is interested in big ideas of knowledge representation (basically, how big are the chunks that we store), and provides what seems like a sensible formalization of the idea that medium-size reusable chunks are probably the way to go. Within the same framework, they also provide formalizations of other ideas for the unit of representation (basically, use the smallest units (full-parsing/generative), use the largest units (full-listing), and use all the units (exemplar)), which is nice for easy comparison purposes. While the intuition that medium-size reusable chunks are best is perhaps unsurprising, I think this gives us a clear quantitative argument for that idea. I do wish we had been given some sense of what exactly these medium-size chunks look like for the two different morphology problems though - at first I thought this was due to space limitation, but the tech report (O'Donnell et al. 2009) version doesn't really show us what these look like either. I wonder how well they match (or don't match) current morphological theories of representation. I know the full-parsing theory is a strong viewpoint for syntax currently, but I don't know how many linguists believe that's really a viable option for morphology. On the flip side, the exemplar-based idea seems like it would make more sense for morphology (where we have a fairly small number of possible combinations), while it seems like that would be a harder sell for syntax (where there can be quite a lot of different parses, especially for longer sentences). Similarly, the full-listing approach seems intractable for syntax. Of course, this only really matters if we think Fragment Grammars apply at multiple levels of linguistic representation (e.g., morphology and syntax). I'm assuming this is what the authors intend, though.
Some more targeted thoughts:
- Exemplar-based Inference: I can't imagine a world where this would win out, compared to Fragment Grammars (FragGs). At best, it has the same coverage as FragGs, but it has to store a heck of a lot more. Perhaps this is included for completeness in model comparison, particularly since the DOP framework assumes this?
- I thought it was very good to mention other models that have similar properties to FragGs. However, given the descriptions provided, I really wondered how Parsimonious Data-Oriented Parsing differs from FragGs ("...explicitly eschews the all-subtree approach in favor of finding a set of subtrees which best explains the data.") Maybe in the way inference is done?
- In terms of comparing this to our reading from last time (Yang 2010), I wonder what's actually being explained by the inference process behind FragGs. Is this a way to assess which representation is likely to be correct for adult usage? If so, this makes it similar to Yang (2010), as that was an assessment of productivity in child speech. Or is this instead a proposal for how adults actually come to have these medium-size chunks, and so it would be a computational level explanation of the actual process of chunk formation?
- A minor note on the past tense representation: I found it interesting that the rule for past tense formation was explicitly encoded in the "morphological representation". This makes this representation seem much more similar to work by Yang on morphological productivity in the English past tense (e.g., Yang 2005), which talks about predictability of child behavior based on the rules used to form the past tense.
- The derivational morphology section: I admit, I got a bit lost on some of the details here.
- How do we take 10,000 "forms" as data, and have that yield 25,000 types and 7.2 million tokens? What are these forms?
- I like the P and P* measures, since those seem to correlate somewhat with the idea of precision and recall (P ~= how generalizable is this suffix, P* ~= how many novel words use this suffix). But then, why are we looking for a correlation between them instead of using an F-score (see the sketch after these notes for what I mean)? What does it mean in Table 1 to have a correlation for P, for example? Is that P vs. P*? Or P vs something else?
- Table 2 left me similarly puzzled - I couldn't decipher this: "...the marginal probability that each suffix occurred first or second in such forms...Table 2 gives the Spearman rank correlation between the (log) ratio of the probability of appearing second to the probability of appearing first with the mean rank statistic..." So if we take a word with two suffixes, s1 and s2, what exactly is being computed? Is it log(prob(s1 in first position & s2 in second position)/prob(s2 in second position & s1 in first position))? And then that's being correlated with the empirical relative ranking of these two suffixes? So we want that probability ratio to be greater than 1, which gives a positive value when you take the log. And then we're trying to correlate that positive number with the mean rank of the two suffixes? Why should this be correlated?
- In the conclusion, the authors talk about how the difference between FragGs and other models is that FragGs care about predictive ability - future novelty vs. future reuse. But I'm not sure I understand how that differs from the computation vs. storage tradeoff (which they advocate replacing with future novelty vs. future reuse) - isn't future novelty based on computation while future reuse is based on storage? If so, this seems like they're restating the tradeoff, but with an emphasis on future usage (i.e., "we care about computation vs. storage because we care about the ability to use language efficiently in the future").
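To make the F-score suggestion above concrete, here's the trivial version of what I have in mind - the P and P* values per suffix are invented, and this assumes they really do behave like precision and recall:

```python
# Invented P / P* values for a few suffixes, just for illustration.
suffixes = {"-ness": (0.90, 0.75), "-ity": (0.40, 0.20), "-th": (0.05, 0.01)}

for suffix, (p, p_star) in suffixes.items():
    # Harmonic mean, exactly as in a standard F-score.
    f = 2 * p * p_star / (p + p_star) if (p + p_star) else 0.0
    print(f"{suffix:6s} P={p:.2f}  P*={p_star:.2f}  F={f:.2f}")
```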
~~~
References
O'Donnell, T., Goodman, N., & Tenenbaum, J. (2009). Fragment Grammars: Exploring Computation and Reuse in Language. Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2009-013.
Yang, C. (2005). On Productivity. Linguistic Variation Yearbook, 5, 265-302.
Yang, C. (2010 Ms.) Who's Afraid of George Kingsley Zipf? Unpublished Manuscript, University of Pennsylvania.
Monday, February 27, 2012
Next time on Mar 12: O'Donnell et al. (2011)
Thanks to everyone who was able to join us for our spirited discussion of Yang (2010). I think we definitely clarified what that study accomplishes in the debate between the two theoretical viewpoints. Next time on March 12, we'll be looking at a paper that also investigates productivity, examining it through the learning angle, in addition to the basic question of representation.
O'Donnell, T.J., Snedeker, J., Tenenbaum, J.B., & Goodman, N.D. (2011). Productivity and reuse in language. Proceedings of the Thirty-Third Annual Conference of the Cognitive Science Society. Boston, MA.
See you then!
Friday, February 24, 2012
Some thoughts on Yang (2010)
I found this paper a real delight to read - like many of Yang's other papers that we've looked at, it's very clear what was done and how this relates to the larger questions that are being examined. In particular, I thought it was excellent to compare the item-based approach to a generative approach, based on what predictions they would make for children's productions. As Yang pointed out, a lot of previous intuitions about what it means to have a generative (or productive) grammar didn't take into account the Zipfian distribution of linguistic data. So, by having a way to generate predictions about how much productivity (as measured by overlap) is expected under each viewpoint, we not only get support for the generative system viewpoint but also actually have support against (at least one version of) the item-based approach. Given how popular the item-based approach is in some circles (e.g., a 2009 PNAS article by Bannard, Lieven, & Tomasello), I thought this was quite striking. From my viewpoint, this is one great way to use mathematical & modeling techniques: to adjudicate between competing theoretical representations.
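(As an aside, here's a back-of-the-envelope Monte Carlo version of the kind of expected-overlap calculation involved. Yang does this analytically from the actual corpora; every number below - vocabulary size, sample size, determiner split - is invented, and it's only meant to show the shape of the argument.)

```python
import random

random.seed(0)

S = 300        # number of noun types in the toy vocabulary (invented)
N = 1000       # number of determiner+noun tokens in the toy sample (invented)
p_the = 0.6    # probability a fully productive grammar picks "the" over "a" (invented)

# Zipfian probabilities over noun types: p(rank r) proportional to 1/r.
weights = [1.0 / r for r in range(1, S + 1)]

def sample_overlap():
    """Fraction of sampled noun types that occur with BOTH determiners."""
    seen = {}  # noun rank -> set of determiners it occurred with
    for noun in random.choices(range(S), weights=weights, k=N):
        det = "the" if random.random() < p_the else "a"
        seen.setdefault(noun, set()).add(det)
    return sum(1 for dets in seen.values() if len(dets) == 2) / len(seen)

runs = [sample_overlap() for _ in range(20)]
print(f"mean overlap under a fully productive grammar: {sum(runs) / len(runs):.2f}")
# Even with full productivity, most noun types occur too rarely to show up with
# both determiners, so overlap stays well below 1 - which is Yang's point about
# why naive intuitions about "expected" productivity go wrong on Zipfian data.
```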
Some more targeted thoughts:
- I really liked in section 1 where the quotes from Tomasello were presented - this gives a clear idea about what exactly is claimed by the item-based approach, and how they have previously used (apparently flawed) intuitions about expected productivity to support that approach. I thought a quote at the end of section 3.3 summed it up beautifully: "...the advocates of item-based learning not only rejected the alternative hypothesis without adequate statistical tests, but also accepted the favored hypothesis without adequate statistical tests."
- The remark in section 2.2 about how even adult usage isn't "productive" by the standard of the item-based crowd is a really nice point. If adult usage isn't "productive", but we believe adults have a generative system, then this should make us question our assumption that "unproductive" child usage indicates a lack of a generative system. Of course, I suppose one might argue that maybe we don't think adults have a fully generative system (this is the view of construction grammar, to some extent, I believe.)
- In section 3.2, I thought Table 1 was a beautiful demonstration of the match between expected overlap for the generative system and the empirically observed overlap in children's speech.
- A minor point about the S/N threshold discussed in 3.2 - I get that S/ln N is a reasonable approximation for rank, especially as N gets very large. However, I'm not quite sure I understand why S/N was chosen as the threshold. I get that it's an upper bound kind of thing, but if S/ln N grows more slowly than S/N, why not just use S/ln N to get a more accurate threshold? It's not as if ln N is hard to calculate.
- In section 3.3, I get that this is merely an attempt to make the item-based approach explicit (and maybe the item-based folk would think it's not the right characterization), but I think it's a pretty good attempt. It gets at the heart of what their theory predicts - you get lots of storage of individual lexical item combinations. Then, of course, Table 2 shows how this representation doesn't match the empirically observed overlap rates nearly as well, so we have a point against that representation.
- Section 4 is nice in that it suggests that this way of testing theoretical representations should be a general-purpose one - do it for determiner usage, but also for verbal morphology and verb argument structure. Though this analysis wasn't conducted for those other phenomena, I was very convinced that the data show a Zipfian distribution, and so we might expect a generative system to be compatible with them.
~~~
References:
Bannard, C., Lieven, E. & Tomasello, M (2009). Modeling children's early grammatical knowledge. Proc Natl Acad Sci U S A, 106(41), 17284-9.
Monday, February 6, 2012
Next time on Feb 27: Yang (2010)
Thanks to everyone who was able to join our extremely lively discussion on Waterfall et al. (2010), and their approach to learning generative grammars from realistic data! Next time on February 27, we'll be looking at a paper that examines a way to quantify claims of linguistic productivity.
Yang, C. (2010 Ms.) Who's Afraid of George Kingsley Zipf? Unpublished Manuscript, University of Pennsylvania.
See you then!
Friday, February 3, 2012
Some thoughts on Waterfall et al (2010)
What I really like about this paper is the opening discussion where they sketch the broad ideas that motivated the studies discussed in the rest of the paper. They explicitly talk about why the aim of language acquisition is a grammar, why we should care about the algorithmic level, what developmental computational psycholinguistics ought to be, why current computational models are still lacking because they miss out on the social situatedness of language, and what exactly is meant by "psychologically real" (and also how that differs from "algorithmically learnable"). I found this to be very valuable to just have all in one place. And I admit, it got my hopes up for what kind of model they would actually be using.
Unfortunately (for me), the rest of the paper ended up being somewhat anti-climactic because they don't end up implementing a model that has all the features of interest. Of course, that's a tall order, but they go through the process of running models that have the first three features, and then they talk about a lovely new discourse-related information type that seems like it should be incorporated into their model - and then they don't incorporate it. I think I was expecting them to at least talk about how to incorporate it into the models they spent so much time on in the beginning, even if it was infeasible at the current time to actually implement (for whatever reason). But that didn't seem to be what happened.
This isn't to say that the models they implemented and the identification of the "variation set" construct aren't interesting - it's just that I was expecting more based on the opening. As it is, the paper ends up feeling a bit scattered to me - a lot of potentially useful pieces, but they're not tied together very well.
Some more targeted thoughts:
p.674: I like that they were questioning the use of a gold standard, given that our theories about what the syntactic structure might be may not necessarily match psychological reality. I did find their definitions of recall and precision a bit hard to understand, though. Like many other things in the paper, I would have found an explicit formula (and possibly an example) more helpful than the text description. My best understanding of recall was something like the number of new generalizations divided by the size of the test set plus the number of new generalizations, while precision was something like the number of correct new generalizations over the total number of new generalizations.
p.676: They talk about how a strength of their models is that there's no preliminary knowledge of things like grammatical categories (parts-of-speech). While it's nice to be able to say "Look what we can do with no knowledge!", I think this actually makes the problem less psychologically realistic. As far as I know, everyone's willing to grant that the child has some (at least rudimentary) knowledge of grammatical categories before the child starts positing syntactic structure. This is the kind of thing we might get from a child using frequent frames, for instance.
The ADIOS algorithm: I admit, I found this description very difficult to decipher without accompanying examples. Is it a batch algorithm, or not (it appears that the graph is "rewired" every time a new pattern is detected)? What's an example of a bundle? What's a local flow quantity that would act as a context-sensitive probabilistic criterion for a significant bundle? How exactly does that work? How dissimilar is this whole process from frequent frames, which also induce equivalence classes? What are the basic abilities/knowledge required to make this algorithm work - the ability to create a graph, to identify bundles, to allow recursion of abstract patterns?
The ConText algorithm: This was a little better, because they provided a simple example. But again, I found myself wanting more explicit definitions for the different model components in order to understand how reasonable (or not) a model this was psychologically. For example, there's a local context window of 2, which means in a sentence like "I really like cute penguins", we would get a context vector for "like" where the lefthand context is "I really" and the righthand context is "cute penguins". Okay, great (though I worry about a window of 2 on each side in terms of data sparseness). And in order to construct equivalence classes based on this, the algorithm operates in batch mode over the data. Again, okay. But then, some kind of distance measure is posited to compare different context vectors to each other involving the angle between context vectors - how is this instantiated? What does the angle between "I really" and "But I" look like, for example? Presumably these are mapped into real numbers somehow... On a related note, once the algorithm gets clusters based on these context vectors, it then seems to do something with rewriting sequences - but what are sequences? Are these the utterances themselves, the partially abstracted representations the learner is forming, something else?
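Here's my guess at how those context vectors and the angle between them might be instantiated - a sketch under my own assumptions (plain bag-of-words counts within the window, compared with cosine similarity), not necessarily what ConText actually does:

```python
import math
from collections import Counter

# Tiny invented corpus; ConText of course uses real child-directed speech.
corpus = [
    "i really like cute penguins".split(),
    "i really like furry marmots".split(),
    "we truly adore cute penguins".split(),
]

def context_vector(target, sentences, window=2):
    """Bag-of-words counts of everything within `window` words of the target."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == target:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                counts.update(sent[lo:i] + sent[i + 1:hi])
    return counts

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

v_like, v_adore = context_vector("like", corpus), context_vector("adore", corpus)
print(f"cos(like, adore) = {cosine(v_like, v_adore):.2f}")
# Words whose contexts overlap get a small angle (cosine closer to 1), which is
# the sort of signal you'd use to cluster them into the same equivalence class.
```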
p.681: ConText results - I thought it was interesting that the ConText model ends up with subcategorization (for example, eat and drink being in the same class). This again reminds me of frequent frame results, and made me want an explicit compare and contrast.
p.683: Human judgments of acceptability of new sentences created by ConText learner - I thought it was a bit strange to ask the participants to judge the acceptability based on how likely it was to appear in child-directed speech. Would the participants have a good sense of child-directed speech? My experience with undergrads who parse utterances from child-directed speech is that they're utterly surprised by how "ungrammatical" and semi-nonsensical conversational speech (and especially child-directed speech) is.
Variation sets: This is something of real value to computational models, I think. We have empirical evidence that children especially benefit from these particular data units and we have a reasonable idea of how to automatically identify them, and so we could reasonably expect a model to be extra sensitive to these kinds of data (perhaps give these data more weight). There's an interesting comment on p.688 where variation sets with roughly 50% of the material changing are the most helpful to children. My big question was why - what's so special about 50%? Does this represent some optimal tradeoff in terms of recognition and contrast? Another interesting note on p.689 and Table 2 on p.695, where they looked at how predictive the frequent n-grams were in variation sets for part-of-speech - some of them are pretty predictive, which is nice, and this shows that sometimes n-grams are useful, as opposed to needing framing elements (this was something a paper by Chemla et al. 2009 looked at). I do wonder how this predictive quality would hold up cross-linguistically, though - what about languages where the wh-word doesn't move, or languages without auxiliary "do"?
Incremental learning (p.698): There's some discussion at the very end about how to transform ConText into an incremental learner, which I think is a good thing to think about. However, I wonder about the motivation behind using the gap automatically (i.e., a furry marmot gets additional "frames" of ___ furry marmot, a ____ marmot, and a furry _____ presumably). Is the idea that this will jumpstart the abstraction process, which otherwise would have to wait until it saw another instance that used two of those words? (Or in the case of a context window of 2 on each side, 4 of the words?)
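For what it's worth, here's the kind of gap-frame expansion I take them to mean - a sketch of my reading, not their actual implementation:

```python
def gap_frames(ngram):
    """Each n-gram yields n frames, one with each position replaced by a gap."""
    words = ngram.split()
    return [" ".join(w if j != i else "___" for j, w in enumerate(words))
            for i in range(len(words))]

print(gap_frames("a furry marmot"))
# ['___ furry marmot', 'a ___ marmot', 'a furry ___']
```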
References
Chemla, E., Mintz, T., Bernal, S., & Christophe, A. (2009). Categorizing Words Using "Frequent Frames": What Cross-Linguistic Analyses Reveal About Distributional Acquisition Strategies. Developmental Science.
Monday, January 23, 2012
Next time on Feb 6: Waterfall et al. (2010)
Thanks to everyone who was able to join our vigorous discussion of phonotactics and word segmentation today! Next time on Feb 6, we'll be looking at an article that focuses on syntactic acquisition, with an emphasis on learning generative grammars from realistic data.
Waterfall, H., Sandbank, B., Onnis, L., & Edelman, S. (2010). An empirical generative framework for computational modeling of language acquisition. Journal of Child Language, 37, 671-703.
See you then!
Friday, January 20, 2012
Some thoughts on Daland & Pierrehumbert (2011)
One of the first things that struck me about this paper was how wonderfully well-written I found it to be - it was so easy to follow the different ideas, and I really appreciated how careful it was to explain the details of pretty much everything involved. I kept saying to myself, "Yes, this is how a modeling paper should be written! So clear, so honest!" (And I'm not just saying this because one of the authors is in the reading group.) To be fair, it's likely that the pieces of this model are somewhat more transparent than pieces of other models we've looked at, and so lend themselves well to examination and explanation. Still, kudos to the authors on this - because they were very precise about both the modeling components and the ideas behind the model.
On a more content-related point, the authors were very clear to indicate that a diphone-based process couldn't occur until after most of the phones of the language were determined, which wouldn't happen till around 9 months. Since word segmentation starts earlier than this, this suggests diphone-based learning is presumably a later stage word segmentation strategy rather than an initial get-you-off-the-ground strategy. But I wondered if this was necessarily true. Suppose you have a learner who really would like to use diphone-based learning, but hasn't figured out her phones yet. Would she perhaps try to do it anyway, but simply using whatever fuzzy definition of phones she has (probably finer-grained distinctions than are actually present in the language)? For example, maybe she hasn't figured out that /b/ and /bh/ are the same phone in English because she's only 6 months old. (Or maybe that the /b/ in /bo/ is the same as the /b/ in /bi/.) This means that she just has more "phones" than the 9-month-old diphone learner has - but how much does that matter? My guess is this still leads to fewer overall units than a syllable-based learner has. Moreover, because there are more "phone" units for this 6-month-old diphone learner, maybe it takes longer to segment words out. A longer period of undersegmentation might occur, but still yield some useful units that could help bootstrap the lexical-diphone learner.
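For concreteness, here's a minimal sketch of the flavor of diphone-based segmentation as I understand it - my own simplification, not the authors' actual Bayesian model: estimate P(boundary | diphone) from material where the boundaries are known, then posit a boundary in new input wherever that probability crosses the 0.5 "hard decision" threshold they mention.

```python
from collections import defaultdict

# Invented toy "transcriptions" with known word boundaries (characters stand in for phones).
training = [["the", "dog"], ["the", "big", "dog"], ["a", "big", "cat"]]

# For each phone pair, count how often a word boundary falls between the two phones.
boundary = defaultdict(lambda: [0, 0])   # diphone -> [no-boundary count, boundary count]
for utt in training:
    phones, breaks = [], set()
    for word in utt:
        phones.extend(word)
        breaks.add(len(phones))          # index of the boundary after this word
    for i in range(1, len(phones)):
        pair = (phones[i - 1], phones[i])
        boundary[pair][1 if i in breaks else 0] += 1

def segment(utterance, threshold=0.5):
    """Insert a boundary wherever P(boundary | diphone) exceeds the threshold."""
    out = [utterance[0]]
    for i in range(1, len(utterance)):
        no, yes = boundary[(utterance[i - 1], utterance[i])]
        p = yes / (yes + no) if (yes + no) else 0.0
        out.append(" " if p > threshold else "")
        out.append(utterance[i])
    return "".join(out)

print(segment("thedog"))   # -> "the dog"
```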
Some more specific thoughts:
- The lexical learner here reminded me quite a bit of the one by Blanchard, Heinz, & Golinkoff (2010), and I was wondering about a compare and contrast between them. They both clearly make use of phrasal units, and later on lexical units. I believe the BH&G learner also included knowledge of a syllable, which the lexical learner doesn't.
- With respect to syllables vs. diphones, I wonder how many syllables in English are diphones (CV syllables, really), and how informative they are for word segmentation. In some sense, this is getting at how different a syllable-based learner is from a diphone-based learner. I imagine it would vary from language to language - Japanese would have more overlap between syllables and diphones, while German maybe has less. This seems related to the point brought up on p.149, in section 7.4.3, where they mention that the diphone learner could apply to syllables in Japanese (presumably rather than phones).
- p.125: I like that they worry about the implausibility of word learning from a single (word segmentation?) exposure. I think there's something to the idea that it takes a couple of times of successfully segmenting the word form from fluent speech before it sticks around long enough to be entered into the lexicon (and hopefully get assigned a meaning later on). Related note on p.127, where they assume the segmentation mechanism has "no access to specific lexical forms" - this seems like an extreme version of that view. Unless I misunderstood, it implies that segmentation doesn't really make use of individual word forms, so algebraic learning (ex: "morepenguins = more+penguins" if more is a known form) shouldn't occur. I'm not sure how early this kind of learning occurs, to be honest, but it's certainly true that a lot of models (including BH&G's, I believe) assume that this kind of information is available during segmentation.
-p.133, section 4.1.3: It's completely reasonable to use CELEX as a source for phonetic pronunciation and leave out words that aren't in CELEX (like baba), but I wonder how these omissions affect the token sequence probabilities. It would be nice to know if it was just a few types that frequently occurred that were left out, or if it was a number of different types (potentially with many different diphone sequences).
-p.141, section 5.5: I really liked that they explored what would happen if the learner's estimation of how often a word boundary occurred was off (and then found that it didn't really matter). However, I do wonder if the reason the learner was robust has anything to do with the fact that the highest value in the range they looked at was still smaller than the "hard decision" boundary of 0.5 (mentioned on p.135).
-p.142: I also really liked that they looked at more realistic conversational speech data, which included effects of coarticulation (Stanford --> Stamford), which would then be a good clue that the diphone sequence was part of the same word. I thought coarticulation occurred across word boundaries too in conversational speech, though - maybe it's just that it occurs more often within words.
-p.146, section 7.2.1: I'm not quite sure I followed that part that says "undersegmentation means sublexically identified word boundaries can generally be trusted." If you've undersegmented, how do you know about word boundaries inside the chunk you've picked out? By definition, you didn't put in those word boundaries.
-p.148, section 7.4.1: I think the idea of prosodic words is extremely applicable to the word segmentation process at this stage. Given what we know of function words and content words, is there some principled way to resegment an existing corpus so it's made up of prosodic words instead of orthographic words? Or maybe the thing to do is to look at the errors being made by existing word segmentation models and see how many of them could be explained by the model finding prosodic words instead of orthographic words. A model that has a lot of prosodic words is maybe closer to human infants?
Tuesday, January 10, 2012
Next time on 1/23: Daland & Pierrehumbert (2011)
Welcome back! This quarter, we'll be holding our lively reading group on Mondays at 10:30am in SBSG 2221, with our first meeting of the quarter happening on January 23rd. (Check out the schedule for the rest of the quarter's meetings.) We'll be looking at an article that explores a cognitive model of word segmentation that draws on phonotactics and is instantiated using Bayes' theorem:
Daland, R. & Pierrehumbert, J. (2011). Learning Diphone-Based Segmentation. Cognitive Science, 35, 119-155.
See you then!
Tuesday, December 6, 2011
Schedule for Winter 2012 available
The schedule of readings for winter 2012 is now available! We'll be looking at a variety of topics again, including word segmentation, morphology, and linguistic productivity.
Friday, November 18, 2011
Some thoughts on Mitchener & Becker (2011)
I really like that M&B are looking at a learning problem that would be interesting to both nativists and non-nativists (a lot of the time, it seems like the different sides are talking past each other on what problems they're trying to solve). I also really like that they're exploring a variety of different probabilistic learning models. It does seem that M&B are still approaching the learning problem from a strongly nativist perspective, given the way they've described the actual problem: the learner knows there are two classes of behavior that link syntactic structure to semantic interpretation (raising vs. control), and that there are specific cues the learner should use to figure out which behavior a given verb has (animacy & eventivity). Importantly, only those cues (and their distribution) are relevant. There also seems to be an implicit assumption (at least initially) that unambiguous data are required to distinguish the behavior of any given verb, and the learning problem results because unambiguous data aren't always available (this is a common way learnability problems are framed in a nativist perspective). One thing I wondered while reading this is what would happen if the behavior of these verbs was taken in the context of a larger system - that is, would it possibly be easier to recognize these distinct classes of verbs if other information were deemed relevant besides the two cues M&B look at? I believe they hint at this themselves in the paper - that it might be possible to look at the syntactic distribution of these verbs over all frames, rather than just the ambiguous frame that signals either raising or control (She VERBed to laugh). This doesn't solve the problem of knowing what the different linking rules are between structure and interpretation, but maybe it makes the classification problem (that there are distinct classes of verbs) easier.
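To make that framing concrete, here's a deliberately tiny sketch of what tracking cues per verb could look like. This is not any of M&B's actual models - the class, the smoothing, and the example sentences are all hypothetical, and it only tracks subject animacy - but it shows the shape of the problem: per-verb cue counts feeding a graded raising-vs-control estimate.

from collections import defaultdict

class CueTracker:
    """Hypothetical cue tracker: counts how often each verb appears with an
    inanimate subject (a near-unambiguous raising cue) vs. an animate subject
    (ambiguous between raising and control) in the "She VERBed to laugh" frame."""

    def __init__(self, prior_raising=0.5):
        self.prior_raising = prior_raising
        self.inanimate = defaultdict(int)
        self.animate = defaultdict(int)

    def observe(self, verb, subject_is_animate):
        if subject_is_animate:
            self.animate[verb] += 1
        else:
            self.inanimate[verb] += 1

    def p_raising(self, verb):
        # Crude smoothed estimate: each inanimate-subject use counts as strong
        # evidence for raising; animate-subject uses pull only weakly toward
        # control (they sit in the denominator with a small weight).
        strong = self.inanimate[verb]
        weak = self.animate[verb]
        return (strong + self.prior_raising) / (strong + 0.2 * weak + 1.0)

tracker = CueTracker()
tracker.observe("seem", subject_is_animate=False)  # "The rock seemed to roll"
tracker.observe("want", subject_is_animate=True)   # "She wanted to laugh"
print(round(tracker.p_raising("seem"), 2))  # ~0.75, leaning raising
print(round(tracker.p_raising("want"), 2))  # ~0.42, leaning control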
Some more targeted thoughts:
- Footnote 2 talks about the issues of homophony, and I can certainly see that tend's meanings are pretty distinct between its raising and regular transitive uses. However, happen looks like it means very similar things whether it's raising or not, so I wonder how children would make this distinction - or if they would at all. If not, then this looks like an additional class of verb that involves mixed behavior.
- The end of section 2 talks about how 3- and 4-year-olds are very sensitive to animacy when they interpret verbs in the ambiguous raising/control frame. I can completely believe that animacy might generally be a cue children use to help them figure out what things should mean (e.g., if a verb takes an agent or not).
- I really like the discussion/caveat that M&B do in the intro of section 4 about biological plausibility.
- I also really liked the discussion of the linear reward-penalty (LRP) learner's issues in section 4.2.1. Not having an intermediate-state equilibrium is problematic if you need there to be mixed behavior (e.g., something is ambiguous between raising and control) - a toy illustration of this issue is sketched below. I admit, I was surprised by the saturating accumulator model M&B chose to implement to correct that problem. I had some trouble connecting its various pieces to the process in a child's mind - the intuitive mapping didn't work for me the way it does for the LRP learner. For example, the index they talk about right at the end of section 4.2.2 seems fairly ad hoc and requires children to abstract over patterns of frames defined by these different semantic cues.
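Here's the toy illustration of the absorbing-endpoint issue mentioned above. To be clear, this is a linear reward-inaction style learner that I'm substituting for illustration, not M&B's actual LRP implementation, and the parameter values are made up. The point it demonstrates is just that when 0 and 1 are the only stable states, a verb with genuinely mixed evidence still ends up treated categorically on any given run.

import random

def reward_inaction_run(p_raising_data=0.5, gamma=0.05, trials=20000, seed=0):
    """Simulate one learner. p is the probability of analyzing the verb as
    raising; the input is mixed (p_raising_data of the evidence favors raising)."""
    rng = random.Random(seed)
    p = 0.5
    for _ in range(trials):
        chose_raising = rng.random() < p
        datum_raising = rng.random() < p_raising_data
        if chose_raising and datum_raising:
            p += gamma * (1.0 - p)   # reward the raising analysis
        elif not chose_raising and not datum_raising:
            p -= gamma * p           # reward the control analysis
        # mismatches: no update (inaction), so p = 0 and p = 1 are absorbing
    return p

# Even with 50/50 evidence, each run drifts out to (very near) 0 or 1
print([round(reward_inaction_run(seed=s), 3) for s in range(8)])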
Tuesday, November 8, 2011
Next time on 11/21: Mitchener & Becker (2011)
Thanks to those of you who were able to join our nicely in-depth discussion of Alishahi & Pyykkonen (2011)'s article on syntactic bootstrapping! I think we figured out some of the details that were glossed over, and these really helped to understand the contribution of the study.
Next time, on Nov 21 (@3pm in SBSG 2221), we'll be looking at an article that examines how a subtle syntactic distinction that has specific semantic implications (called the raising-control distinction) could be learned.
Mitchener, G. & Becker, M. (2011). Computational Models of Learning the Raising-Control Distinction. Research on Language and Computation, 8(2), 169-207.
See you then!
Friday, November 4, 2011
Some thoughts on Alishahi & Pyykkonen (2011)
I really like the investigation of syntactic bootstrapping in this kind of computational manner. While experimental approaches like the Human Simulation Paradigm (HSP) offer us certain insights about how (usually adult) humans use different kinds of information, they have certain limitations that the computational learner doesn't (for example, with a computational learner the researcher knows exactly what the internal knowledge state is and how it changes). From my perspective, the HSP with adults (and maybe even with 7-year-olds) is a kind of ideal learner approach, because it asks what inferences can be made with maximal knowledge about (the native) language - so while it clearly involves human processing limitations, it's examining the best that humans could reasonably be expected to do in a task that's similar to what word-learners might be doing. The computational learner is much more limited in the knowledge it has access to a priori, and I think the researchers really tried to give it reasonable approximations of what very young children might know about different language aspects. In addition, as A & P mention, the ability to track the time course of learning is a nice feature (though with some caveats with respect to implementation limitations).
Some more targeted thoughts:
I thought the probabilistic accuracy was a clever measure for taking advantage of the distribution over words that the learner calculates (a guess at the general shape of such a measure is sketched after this list).
As I said above, tracking learning over time is an admirable goal - however, the modeled learner here clearly is only qualitatively doing this, since there's such a spike in performance between 0 and 100 training examples. I'm assuming A & P would say that children's inference procedures are much noisier than this (and so it would take children longer), unless there's evidence that children really do learn the exact correct meaning in under 100 examples (possible, but seems unlikely to me).
I was a little surprised that A & P didn't discuss the difference in Figure 1 between the top and bottom panel with respect to the -LI condition. (This was probably due to the length constraints, but still.) It's a bit mystifying to me how absolute accuracy could be close to the +LI condition while verb improvement is much lower than the +LI condition. I guess this means the baseline for verb improvement was different between the +LI and -LI conditions somehow?
It was indeed interesting to see that having no linguistic information (-LI) was actually beneficial for noun-learning - I would have thought noun-learning would also be helped by linguistic context. A & P speculate that this is because early nouns refer to observable concepts (e.g., concrete objects) and/or the nature of the training corpus made the linguistic context for nouns more ambiguous than for verbs. (The latter reason ties into the linguistic context more.) I wonder if this effect would persist with a different training corpus (after all, there were some assumptions A & P made when constructing this corpus - they seemed reasonable, but there are still different ways to construct the corpus.)
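For what it's worth, here's my guess at the general shape of a "probabilistic accuracy" style measure. This is not A & P's exact definition - treat the functions and the example distribution as hypothetical - it's just the contrast between scoring only the top guess and scoring the probability mass assigned to the correct meaning.

def hard_accuracy(learner_dist, gold_meaning):
    """1 if the learner's single best guess is the gold meaning, else 0."""
    best = max(learner_dist, key=learner_dist.get)
    return 1.0 if best == gold_meaning else 0.0

def probabilistic_accuracy(learner_dist, gold_meaning):
    """Probability mass the learner puts on the gold meaning."""
    return learner_dist.get(gold_meaning, 0.0)

# Hypothetical learner state for the word "penguin"
dist = {"PENGUIN": 0.6, "BIRD": 0.3, "FISH": 0.1}
print(hard_accuracy(dist, "PENGUIN"))           # 1.0
print(probabilistic_accuracy(dist, "PENGUIN"))  # 0.6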
Monday, October 17, 2011
Next time: Alishahi & Pyykkonen (2011)
Thanks to those of you who were able to join our nicely in-depth discussion today of Dillon et al. (2011)'s article on applying Bayesian models to phonological acquisition! Next time on 11/7 (@3:30pm in SBSG 2221), we'll be discussing an article that looks at the phenomenon of syntactic bootstrapping, which is the ability to infer word meaning and abstract structure associated with that word from the syntactic context of the word:
Alishahi, A. & Pyykkonen, P. (2011). The onset of syntactic bootstrapping in word learning: Evidence from a computational study. Proceedings of the 33rd Annual Conference of the Cognitive Science Society, Boston, MA.
See you then!
Friday, October 14, 2011
Some thoughts on Dillon et al. (2011)
I'm really fond of this paper - I love that they're tackling realistic problems (with realistic language data), that they're seriously looking at the state of the art in computational models of phonological acquisition, and that they're finding a way to connect linguistic theory (e.g., "There are phonological rules") with this level of concreteness (e.g., "Let's make them linear models operating over acoustic space"). Because of all this, I think their point about the potential issues of two-stage models comes across very clearly. And I love that they can make a model that learns both the phonemes and the relationships between the phonetic categories (the allophonic rules) simultaneously. Moreover, the fact that they can do this without trying to learn a lexicon simultaneously (like Feldman, Griffiths, & Morgan (2009) do) is impressive to me, since that was the main thing that seemed to lead to good results for Feldman et al. (2009). Notably, they make use of the linguistic context (i.e., does a uvular consonant follow), which is something Swingley (2009) recently suggested looks really helpful for English phonemes in a review of infant phoneme learning.
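As a schematic of the kind of generative story at issue (definitely not Dillon et al.'s actual model or parameter values - everything numeric here is invented), you can think of the underlying vowel categories as Gaussians in formant space, with the allophonic rule implemented as a linear shift that applies when a uvular consonant follows:

import random

# Hypothetical underlying three-vowel system: mean (F1, F2) plus a shared sd
CATEGORIES = {
    "i": (300.0, 2300.0),
    "a": (750.0, 1300.0),
    "u": (320.0, 800.0),
}
SD = 60.0

# Hypothetical allophonic rule: retract/lower the vowel before a uvular,
# modeled as a fixed linear shift in formant space
UVULAR_SHIFT = (120.0, -200.0)  # (delta F1, delta F2)

def sample_token(vowel, before_uvular, rng=random):
    """Generate one acoustic token of `vowel`, applying the allophonic
    shift when a uvular consonant follows."""
    f1_mu, f2_mu = CATEGORIES[vowel]
    if before_uvular:
        f1_mu += UVULAR_SHIFT[0]
        f2_mu += UVULAR_SHIFT[1]
    return (rng.gauss(f1_mu, SD), rng.gauss(f2_mu, SD))

# A two-stage learner that clusters raw tokens first may find extra surface
# categories; a one-stage learner can explain the before-uvular tokens as
# shifted versions of the same three underlying categories.
print(sample_token("i", before_uvular=False))
print(sample_token("i", before_uvular=True))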
A few more targeted thoughts:
- I really like that they note the three-vowel + allophones system is not just a special weirdness of Inuktitut, but rather something that occurs in a number of different languages. This makes it more important to be able to account for this kind of data, and bolsters support for the single-stage model.
- I also thought it was useful to note that the EM approach follows the frequentist tradition. After a moment's reflection, this is clearly true, but it didn't occur to me until they pointed it out.
- Because of the nature of the Bayesian model, the more data that come in, the more the model is likely to prefer more categories over fewer (and the explanation they give for this just before the discussion of Expt 1 is entirely sensible) - see the worked sketch after this list. This carries over even for their cool Expt 3 model that learns categories and rules simultaneously (as we can see in Table 6) - the 12000-data-point model is much more likely to posit 4 or 5 categories than the 1000-data-point model. I'm wondering what this means for actual acquisition. Should we expect that infants learn very quickly and so end up with 3 categories + rules? Or would we expect that infants might go through a stage where they have 4 or 5 categories, and have to recover (maybe based on doing word segmentation/lexical item discovery)?
- For the one-stage model in Expt 3, they mention that they build in a bias for complementary distribution - is this an uncontroversial assumption (or easy to derive from innate abilities we know infants do have)? I honestly don't have strong intuitions about this. It'd be great if it was.
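As a worked version of the more-data-more-categories point above: if the prior over categories is in the Chinese restaurant process / Dirichlet process family (an assumption on my part about where the effect comes from, not a claim about Dillon et al.'s exact prior), then the expected number of categories under the prior alone grows roughly logarithmically with the number of data points:

def expected_crp_tables(n_points, alpha=1.0):
    """Expected number of clusters after n_points draws from a CRP(alpha):
    the i-th point starts a new cluster with probability alpha / (alpha + i - 1)."""
    return sum(alpha / (alpha + i) for i in range(n_points))

for n in (1000, 12000):
    print(n, round(expected_crp_tables(n, alpha=1.0), 2))
# roughly 7.5 expected clusters at 1000 points vs. roughly 10 at 12000 (alpha = 1)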
References:
Feldman, N., Griffiths, T., & Morgan, J. (2009). Learning phonetic categories by learning a lexicon. Proceedings of the 31st Annual Conference of the Cognitive Science Society.
Swingley, D. (2009). Contributions of infant word learning to language development. Philosophical Transactions of the Royal Society B, 364, 3617-3632.