Wednesday, December 4, 2013

See you in the winter!

Thanks so much to everyone who was able to join us for our thoughtful, spirited discussion today, and to everyone who's joined us throughout the fall quarter! The CoLa Reading Group will resume again in the winter quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Monday, December 2, 2013

Some thoughts on Nematzadeh et al. 2013

So, I can start off by saying that there are many things about this paper that warmed the cockles of my heart.  First, I love that modeling is highlighted as an explanatory tool. To me, that's one of the best things about computational modeling - the ability to identify an explanation for observed behavior, in addition to being able to produce said behavior. I also love that psychological constraints and biases were being incorporated into the model. This is that algorithmic/process-level-style model that I really enjoy working with, since it focuses on the connection between the abstract representation of what's going on and what people actually are doing. Related to both of the above, I was very happy to see how the model made assumptions concrete and thus isolated (potential) explanatory factors within the model. Now, maybe we don't always agree with how an assumption has been instantiated (see the note on novelty below) - but at least we know it's an assumption and we can see that version of it in action. And that is definitely a good thing, in my (admittedly biased) opinion.

Some more specific thoughts:

I found the general behavioral result from Vlach et al. 2008 about the "spacing effect" to be interesting, where learning was better when items are distributed over a period of time, rather than occurring one right after another. This is the opposite of "burstiness", which (I thought) is supposed to facilitate other types of learning (e.g., word segmentation). Maybe this has to do with the complexity of the thing being learned, or what existing framework there is for learning it (since I believe the Vlach et al. experiments were with adults)?

I thought the semantic representation of the scene as a collection of features was a nice step towards what the learner's representation probably is like (rather than just individual referent objects). When dealing with novel objects and more mature learners, this seems much more likely to me. On the other hand, I was a little fuzzy on how exactly the features and their feature weights were derived for the novel objects. (It's mentioned briefly in the Input Generation section that each word's true meaning is a vector of semantic features, but I missed completely how those are selected.)
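
Just to make that representation concrete for myself, here's a minimal sketch of what a feature-based meaning/scene setup might look like (my own toy construction - the feature names, weights, and pooling are invented, not N&al's actual input-generation procedure):

    # Toy sketch (mine, not N&al's input generation): a word's "true meaning"
    # is a set of weighted semantic features, and a scene is the pooled
    # features of whichever referents are present.
    true_meanings = {
        "apple": {"fruit": 0.6, "red": 0.3, "round": 0.1},
        "ball":  {"toy": 0.5, "round": 0.4, "red": 0.1},
    }

    def scene_features(referents):
        """Pool the features of all referents present in a scene."""
        pooled = {}
        for word in referents:
            for feature, weight in true_meanings[word].items():
                pooled[feature] = pooled.get(feature, 0.0) + weight
        return pooled

    print(scene_features(["apple", "ball"]))
    # e.g. {'fruit': 0.6, 'red': 0.4, 'round': 0.5, 'toy': 0.5}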

Novelty: Nematzadeh et al. (N&al) implement novelty as an inverse function of recency. There's something obviously right about this, but I wonder about other definitions of novelty, like something that taps into overall frequency of this item's appearance (so, novel because it's fairly rare in the input). I'm not sure how this other definition (or a novelty implementation that incorporates both recency and overall frequency) would jibe with the experimental results N&al are trying to explain.

Technical side note, related to the above: I had some trouble interpreting equation (2) - is the difference between t and t_last(w) a fraction of some kind? Maybe because time is measured in minutes, but the presentation durations are in seconds? Otherwise, novelty could become negative, which seems a bit weird.
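
To keep my own reading straight, here's roughly how I'm picturing the two novelty definitions in play (my formulation, not necessarily N&al's actual equation (2)); the unit-mismatch worry above is exactly the kind of thing that matters for whether a recency-based version stays in [0, 1]:

    import math

    # My sketch of two possible novelty definitions (not N&al's equation (2)).
    def novelty_recency(t, t_last_w, scale=1.0):
        """Recency-based: grows with the time since word w last appeared.
        Stays in [0, 1] as long as t and t_last_w are in the same units;
        mixing minutes and seconds is where negative values could sneak in
        for a simpler '1 minus difference' formulation."""
        return 1.0 - math.exp(-(t - t_last_w) / scale)

    def novelty_frequency(count_w, total_tokens):
        """Frequency-based alternative: novel because w is rare overall."""
        return 1.0 - (count_w / total_tokens)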


I was thinking some about the predictions of the model, based on figure 4 and the discussion following it, where N&al are trying to make the model replicate certain experimental results. I think their model would predict that if learners had longer to learn the simplest condition (2 x 2), i.e., the duration of presentation was longer so the semantic representations didn't decay so quickly, that condition should then be the one best learned. That is, the "desirable difficulty" benefit is really about how memory decay doesn't happen so quickly for the 3 x 3 condition, as compared to the 2 x 2 condition.

I found it incredibly interesting that the behavioral experiment Vlach & Sandhofer 2010 (V&S) conducted just happened to have exactly the right item spacing/ordering/something else to yield the interesting results they found, but other orderings of those same items would be likely to yield different (perhaps less interesting) results. You sort of have to wonder how V&S happened upon just the right order - good experiment piloting, I guess?  Though at the end of the discussion section, N&al seem to back off from claiming it's all about the order of item presentation, since none of the obvious variables potentially related to order (average spacing, average time since last presentation, average context familiarity) seemed to correlate with the output scores.

Wednesday, November 20, 2013

Next time on 12/4/13 @ 2:30pm in SBSG 2221 = Nematzadeh et al. 2013

Thanks to everyone who was able to join us for our feisty and thoughtful discussion of Lewis & Frank 2013! Next time on December 4 at 2:30pm in SBSG 2221, we'll be looking at an article that explores the kinds of difficulties in word-learning that can paradoxically help long-term learning and why they help, using a computational modeling approach:

Nematzadeh, A., Fazly, A., & Stevenson, S. 2013. Desirable Difficulty in Learning: A Computational Investigation. Proceedings of the 35th Annual Meeting of the Cognitive Science Society.


See you then!

Monday, November 18, 2013

Some thoughts on Lewis & Frank 2013

I'm always a fan of learning models that involve solving different problems simultaneously, with the idea of leveraging information from one problem to help solve the other (Feldman et al. 2013 and Dillon et al. 2013 are excellent examples of this, IMHO). For Lewis & Frank (L&F), the two problems are related to word learning: how to pick the referent from a set of referents and how to pick which concept class that referent belongs to (which they relate to how to generalize that label appropriately).  I have to say that I struggled to understand how they incorporated the second problem, though -- it doesn't seem like the concept generalization w.r.t. subordinate vs. superordinate classes maps in a straightforward way to the feature analysis they're describing.  (More on this below.) I was also a bit puzzled by their assumption of where the uncertainty in learning originates from and the link they describe between what they did and the origin/development of complex concepts (more on these below, too).

On generalization & features:  If we take the example in their Figure 1, it seems like the features could be something like f1 = "fruit", f2 = "red", and f3 = "apple". The way they talk about generalization is as underspecification of feature values, which feels right.  So if we say f1 is the only important feature, then this corresponds nicely to the idea of "fruit" as a superordinate class.  But what if we allow f2 to be the important feature? Is "red" the superordinate class of "red" things?  Well, in a sense, I suppose. But this falls outside of the noun-referent system that they're working in - "red" spans many referents, because it's a property.  Maybe this is my misunderstanding in trying to map this whole thing to subordinate and superordinate classes, like Xu & Tenenbaum 2007 talk about, but it felt like that's what L&F intended, given the model in Figure 2 that's grounded in Objects at the observable level and the behavioral experiment they actually ran.
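
Here's the toy version of "generalization as underspecification" I have in my head (my own illustration with invented features and objects, not L&F's actual hypothesis space), which also shows why the "red" case bugs me - leaving f1 and f3 unspecified picks out things that cut across referent kinds:

    # Toy sketch (mine, not L&F's model): a hypothesis fixes some feature
    # values and leaves others unspecified (None); its extension is every
    # object that agrees on the specified features.
    objects = {
        "this_apple":  {"f1_kind": "fruit",   "f2_color": "red",   "f3_type": "apple"},
        "green_apple": {"f1_kind": "fruit",   "f2_color": "green", "f3_type": "apple"},
        "fire_truck":  {"f1_kind": "vehicle", "f2_color": "red",   "f3_type": "truck"},
    }

    def extension(hypothesis):
        return [name for name, feats in objects.items()
                if all(feats[f] == v for f, v in hypothesis.items() if v is not None)]

    print(extension({"f1_kind": "fruit", "f2_color": None, "f3_type": None}))
    # superordinate-ish: ['this_apple', 'green_apple']
    print(extension({"f1_kind": None, "f2_color": "red", "f3_type": None}))
    # "red" things: ['this_apple', 'fire_truck'] -- a property, not a noun class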

On where the uncertainty comes from: L&F mention in the Design of the Model section that the learning model assumes "the speaker could in principle have been mistaken about their referent or misspoken". From a model building perspective, I understand that this is easier to incorporate and allows graded predictions (which are necessary to match the empirical data), but from the cognitive perspective, this seems really weird to me. Do we have reason to believe children assume their speakers are unreliable? I was under the impression children assume their speakers are reliable as a default. Maybe there's a better place to work this uncertainty in - approximated inference from a sub-optimal learner or something like that. Also, as a side note, it seems really important to understand how the various concepts/features are weighted by the learner. Maybe that's where uncertainty could be worked in at the computational level.
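
For what it's worth, here's the minimal sketch of how I understand the role of that assumption (my own toy likelihood with an invented epsilon, not L&F's actual model): building in a small probability of speaker error keeps any hypothesis from being ruled out absolutely, which is what lets the predictions stay graded.

    # My toy sketch of a "speaker might be wrong" likelihood (not L&F's).
    def likelihood(observed_referent, extension, all_referents, epsilon=0.05):
        """P(observed referent | hypothesis): mass (1 - epsilon) spread over the
        hypothesis's extension, epsilon spread over everything outside it."""
        outside = [r for r in all_referents if r not in extension]
        if observed_referent in extension:
            return (1.0 - epsilon) / len(extension) if outside else 1.0 / len(extension)
        return epsilon / len(outside)

    # With epsilon = 0, a single inconsistent observation would zero out the
    # hypothesis; with epsilon > 0, it just becomes much less probable.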

On the origin/development of concepts: L&F mention in the General Discussion that "the features are themselves concepts that can be considered as primitives in the construction of more complex concepts", and then state that their model "describes how a learner might bootstrap from these primitives to infer more and complex concepts". This sounds great, but I was unclear how exactly to do that. Taking the f1, f2, and f3 from above, for example, I get that those are primitive features. So the concepts are then things that can be constructed out of some combination of their values (whether specified or unspecified)? And then where does the development come in? Where is the combination (presumably novel) that allows the construction of new features? I understand that these could be the building units for such a model, but I didn't see how the current model shows us something about that.

Behavioral experiment implementation: I'm definitely a fan of matching a model to controlled behavioral data, but I wonder about the specific kind of labeling they gave their subjects. It seems like they intended "dax bren nes" to be the label for one object shown (it's just unclear which it is - but basically, this might as well be a trisyllabic word "daxbrennes"). This is a bit different from standard cross-situational experiments, where multiple words are given for multiple objects. Given that subjects are tested with that same label, I guess the idea is that it simplifies the learning situation.

Results:  I struggled a bit to decipher the results in Figure 5 - I'm assuming the model predictions are for the different experimental contexts, ordered by human uncertainty about how much to generalize to the superordinate class. Is the lexicon posited by the model just how many concepts to map to "dax-bren-nes", where concept = referent?

~~~
References

Dillon, B., Dunbar, E., & Idsardi, W. 2013. A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37, 344-377.

Feldman, N. H., Griffiths, T. L., Goldwater, S., & Morgan, J. L. 2013. A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4), 751-778.

Xu, F., & Tenenbaum, J. 2007. Word Learning as Bayesian Inference. Psychological Review, 114(2), 245-272.



Wednesday, November 6, 2013

Next time on 11/20/13 @ 2:30pm in SBSG 2221 = Lewis & Frank 2013

Thanks to everyone who was able to join us for our vigorous and thoughtful discussion of Marcus & Davis 2013! Next time on November 20 at 2:30pm in SBSG 2221, we'll be looking at an article that discusses how to solve two problems related to word learning simultaneously, using hierarchical Bayesian modeling and evaluating against human behavioral data:

Lewis, M., & Frank, M. 2013. An integrated model of concept learning and word-concept mapping. Proceedings of the 35th Annual Meeting of the Cognitive Science Society.



See you then!

Monday, November 4, 2013

Some thoughts on Marcus & Davis (2013)

(...and a little also on Jones & Love 2011)

One of the things that struck me about Marcus & Davis (2013) [M&D] is that they seem to be concerned with identifying what the priors are for learning. But what I'm not sure of is how you distinguish the following options:

(a) sub-optimal inference over optimal priors
(b) optimal inference over sub-optimal priors
(c) sub-optimal inference over sub-optimal priors

M&D seem to favor option (a), but I'm not sure there's an obvious reason to do so. Jones & Love 2011 [J&L] mention the possibility of "bounded rationality", which is something like "be as optimal as possible in your inference, given the prior and the processing limitations you have". That sounds an awful lot like (c), and seems like a pretty reasonable option to explore. The general concern with what the priors actually are dovetails quite nicely with traditional linguistic explorations of how to define (constrain) the learner's hypothesis space appropriately to make successful inference possible. J&L are quite aware of this too, and underscore the importance of selecting the priors appropriately.

That being said, no matter what priors and inference processes end up working, there's clear utility in being explicit about all the assumptions that yield a match to human behavior, which M&D want (and I'm a huge fan of this myself: see my commentary on a recent article here where I happily endorse this). Once you've identified the necessary pieces that make a learning strategy work, you can then investigate (or at least discuss) which assumptions are necessarily optimal.  That may not be an easy task, but it seems like a step in the right direction.

M&D seem to be unhappy with probabilistic models as a default assumption - and okay, that's fine. But it does seem important to recognize that probabilistic reasoning is a legitimate option. And maybe some of cognition is probabilistic and some isn't - I don't think there's a compelling reason to believe that cognition has to be all one or all the other. (I mean, after all, cognition is made up of a lot of different things.) In this vein, I think a reasonable thing that M&D would like is for us to not just toss out non-probabilistic options that work really well solely because they're non-probabilistic.

On a related note, I very much agree with one of the last things M&D note, which is that we should be explicit about "what would constitute evidence that a probabilistic approach is not appropriate for a particular task or domain".  I'm not sure myself what that evidence would look like, since even categorical behavior can be simulated by a probabilistic model that just thresholds. Maybe if it's more "economical" (however we define that) to not have a probabilistic model, and there exists a non-probabilistic model that accomplishes the same thing?
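
Just to spell out the thresholding point with a toy example (mine, not anything from M&D): a model can be probabilistic under the hood and still look perfectly categorical in what it outputs, which is why categorical behavior on its own can't be the distinguishing evidence.

    # Toy sketch: graded probabilities inside, categorical judgments outside.
    def posterior_acceptability(evidence_strength):
        """Stand-in for some probabilistic model's graded output in [0, 1]."""
        return min(1.0, max(0.0, evidence_strength))

    def reported_judgment(evidence_strength, threshold=0.5):
        """What we actually observe if the model only reports a thresholded decision."""
        return "acceptable" if posterior_acceptability(evidence_strength) >= threshold else "unacceptable"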

~~~
A few comments about Jones & Love 2011 [J&L]:

J&L seem very concerned with the recent focus in the Bayesian modeling world on existence proofs for various aspects of cognition.  They do mention later in their article (around section 6, I think) that existence proofs are a useful starting point, however -- they just don't want research to stop there. An existence proof that a Bayesian learning strategy can work for some problem should be the first step for getting a particular theory on the table as a real possibility worth considering (e.g., whatever's in the priors for that particular learning strategy that allowed Bayesian inference to succeed, as well as the Bayesian inference process itself).

Overall, J&L seem to make a pretty strong call for process models (i.e., algorithmic-level models, instead of just computational-level models). Again, this seems like a natural follow-up once you have a computational-level model you're happy with.  So the main point is simply not to rest on your Bayesian inference laurels once you have your existence proof at the computational level for some problem in cognition.  The Chater et al. 2011 commentary on J&L notes that many Bayesian modelers are moving in this direction already, creating "rational process" models.

~~~
References

Chater, N., Goodman, N., Griffiths, T., Kemp, C., Oaksford, M., & Tenenbaum, J. 2011. The imaginary fundamentalists: The unshocking truth about Bayesian cognitive science. Behavioral and Brain Sciences, 34 (4), 194-196.

Jones, M. & Love, B. C. 2011. Bayesian Fundamentalism or Enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34 (4), 169-188.

Pearl, L. 2013. Evaluating strategy components: Being fair.  [lingbuzz]

Wednesday, October 23, 2013

Next time on 11/6/13 @ 2:30pm in SBSG 2221 = Marcus & Davis 2013


Thanks to everyone who was able to join us for our lively and informative discussion of Ambridge et al. (in press)! Next time on November 6 at 2:30pm in SBSG 2221, we'll be looking at an article that discusses how probabilistic models of higher-level cognition (including language) are used in cognitive science:

Marcus, G. & Davis, E. 2013. How Robust Are Probabilistic Models of Higher-Level Cognition? Psychological Science, published online Oct 1, 2013, doi:10.1177/095679761349541.

I would also strongly recommend a target article and commentary related to this topic that were written fairly recently:

Jones, M. & Love, B. C. 2011. Bayesian Fundamentalism or Enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34 (4), 169-188.

Chater, N., Goodman, N., Griffiths, T., Kemp, C., Oaksford, M., & Tenenbaum, J. 2011. The imaginary fundamentalists: The unshocking truth about Bayesian cognitive science. Behavioral and Brain Sciences, 34 (4), 194-196.


(Both target article and commentary are included in the pdf file linked above.)

See you then!

Monday, October 21, 2013

Some thoughts on Ambridge et al. in press

This article really hit home for me, since it talks about things I worry about a fair bit with respect to Universal Grammar and language learning in general -- so much so, that I ended up writing a lot more about it than I typically do for the articles we read. Conveniently, this is a target article that's asking for commentaries, so I'm going to put some of my current thoughts here as a sort of teaser for the commentary I plan to submit.

~~~


The basic issue that the authors (AP&L) highlight about proposed learning strategies seems exactly right: What will actually work, and what exactly makes it work? They note that "…nothing is gained by positing components of innate knowledge that do not simplify the problem faced by language learners" (p.56, section 7.0), and this is absolutely true. To examine how well several current learning strategy proposals that involve innate, linguistic knowledge actually work, AP&L present evidence from a commendable range of linguistic phenomena, from what might be considered fairly fundamental knowledge (e.g., grammatical categories) to fairly sophisticated knowledge (e.g., subjacency and binding). In each case, AP&L identify the shortcomings of some existing Universal Grammar (UG) proposals, and observe that these proposals don't seem to fare very well in realistic scenarios. The challenge at the very end underscores this -- AP&L contend (and I completely agree) that a learning strategy proposal involving innate knowledge needs to show "precisely how a particular type of innate knowledge would help children acquire X" (p.56, section 7.0).

More importantly, I believe this should be a metric that any component of a learning strategy is measured by.  Namely, for any component (whether innate or derived, whether language-specific or domain-general), we need to not only propose that this component could help children learn some piece of linguistic knowledge but also demonstrate at least "one way that a child could do so" (p.57, section 7.0). To this end, I think it's important to highlight how computational modeling is well suited for doing precisely this: for any proposed component embedded in a learning strategy, modeling allows us to empirically test that strategy in a realistic learning scenario. It's my view that we should test all potential learning strategies, including the ones AP&L themselves propose as alternatives to the UG-based ones they find lacking.  An additional and highly useful benefit of the computational modeling methodology is that it forces us to recognize hidden assumptions within our proposed learning strategies, a problem that AP&L rightly recognize with many existing proposals.

This leads me to suggest certain criteria that any learning strategy should satisfy, relating to its utility in principle and practice, as well as its usability by children. Once we have a promising learning strategy that satisfies these criteria, we can then concern ourselves with the components comprising that strategy.  With respect to this, I want to briefly discuss the type of components AP&L find unhelpful, since several of the components they would prefer might still be reasonably classified as UG components. The main issue they have is not with components that are innate and language-specific, but rather components of this kind that in addition involve very precise knowledge. This therefore does not rule out UG components that involve more general knowledge, including (again) the components AP&L themselves propose. In addition, AP&L ask for explicit examples of UG components that actually do work. I think one component that's potentially UG and is part of a successful learning strategy for syntactic islands (described in Pearl & Sprouse 2013) is a nice example of this: the bias to characterize wh-dependencies at a specific level of granularity. It's not obvious where this bias would come from (i.e., how it would be derived or what innate knowledge would lead to it), but it's crucial for the learning strategy it's a part of to work. As a bonus, that learning strategy also satisfies the criteria I suggest for evaluating learning strategies more generally (utility and usability).


Tuesday, October 1, 2013

Next time on 10/23/13 @ 2:30pm in SBSG 2221 = Ambridge et al. in press


It looks like the best collective time to meet will be Wednesdays at 2:30pm for this quarter, so that's what we'll plan on.  Due to some of my own scheduling conflicts, our first meeting will be in a few weeks on October 23.  Our complete schedule is available on the webpage at http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html


On Oct 23, we'll be looking at an article that examines the utility of Universal Grammar based learning strategies in several different linguistic domains, arguing that they're not all that helpful at the moment:

Ambridge, B., Pine, J., & Lieven, E. 2013 in press. Child language acquisition: Why Universal Grammar doesn't help. Language.




See you then!

Wednesday, September 25, 2013

Fall quarter planning


I hope everyone's had a good summer break - and now it's time to gear up for the fall quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading group webpage, including readings on Universal Grammar, Bayesian modeling, and word learning:

http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html

Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here).

Thanks and see you soon!

Friday, June 7, 2013

Some thoughts on Parisien and Stevenson 2010


Overall, this paper is concerned with the extent to which children possess abstract knowledge of syntax, and more specifically, children’s ability to acquire generalizations about verb alternations. The authors present two models for the purpose of illustrating that information relevant to verb alternations can be acquired through observations of how verbs occur with individual arguments in the input.

My main point of confusion in this article was and still is about the features used to represent the lowest level of abstraction in the models. The types of features used seem to me to already assume a lot of prior abstract syntactic knowledge… The authors state, "We make the assumption that children at this developmental stage can distinguish various syntactic arguments in the input, but may not yet recognize recurring patterns such as transitive and double-object constructions", but this assumption still does not quite make sense to me. In order to have a feature such as "OBJ", don't you have to have some abstract category for objects? Some abstract representation of what it means to be an object? This seems like more than just a general chunking of the input into constituents because for something to be an object, it has to be in a specific relationship with a verb. So how can you have this feature without already having abstract knowledge of the relationship of the object to the verb? If this type of generalized knowledge is not what is meant, maybe it is just the labels given to these features that bother me. It seems to me that once a learner has figured out what type each constituent is (OBJ, OBJ2, COMP, PP, etc.), the problem of learning generalizations of constructions becomes simple – just find all the verbs that have OBJ and OBJ2 after them and put them into a category together. Even after reading this article twice and discussing it with the class, I am still really missing something essential about the logic behind this assumption.
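
To spell out the intuition behind that last complaint, here's a minimal sketch (my construction with invented examples, not the authors' actual hierarchical Bayesian clustering): if every constituent already arrives labeled as OBJ, OBJ2, PP, and so on, then grouping verbs by their argument frames looks close to a bookkeeping exercise.

    from collections import defaultdict

    # Invented verb usages, each a verb plus the labeled arguments it appeared with.
    usages = [
        ("give", ("OBJ", "OBJ2")),
        ("send", ("OBJ", "OBJ2")),
        ("give", ("OBJ", "PP")),
        ("fall", ()),
    ]

    frames = defaultdict(set)
    for verb, args in usages:
        frames[args].add(verb)

    print(frames[("OBJ", "OBJ2")])   # {'give', 'send'} -- the double-object pattern falls out immediately

Obviously the actual models do something far more graded and probabilistic than this; the point is just that the hard part seems to have already been done by whatever produced the OBJ/OBJ2 labels.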

A few points regarding verb argument preferences:
  1. The comparison of the two models in the results for verb argument preferences seems completely unsurprising… Is this not what Model 1 was made to do? If so, then I would not expect any added benefit from Model 2, but it is unclear what the authors’ expectations were regarding this result.
  2. What is the point of comparing two very similar constructions (prepositional dative and benefactive)? The only difference between these two is the preposition used, so being able to distinguish one from the other does not require abstract syntactic knowledge… as far as I can tell, the differences occur at the phonological level and at the semantic level.
  3. I am curious about the fact that both models acquired approximately 20 different constructions… What were these other constructions and why did they only look at the datives? 
A few points regarding novel verb generalization:
  1. I found the comparison of the two models in the results for novel verb generalization to be rather difficult to interpret… In particular, I think organizing the graph in a different way could have made it much more visually interpretable – one in which the bars for model 1 and model 2 were side-by-side on the same graph rather than on separate graphs displayed one above the other. I also would have liked some discussion of the significance of the differences discussed – They say that in comparing Model 2 with Model 1, the PD frame is now more likely than the SC frame, although only slightly. Perhaps just because I’m not used to looking at log likelihood graphs, it is unclear to me whether this difference is significant enough to even bother mentioning because it is barely noticeable on the graph.
  2. On the topic of the behavior observed in children, the authors note that high-frequency verbs tend to be biased toward the double-object form. However, children tend to be biased toward the prepositional dative form. But even in the larger corpus, only about half of the verbs are prepositional-biased, and it is suggested that these are low frequency. So, what is a potential explanation for the observed bias in children? Why would they be biased toward the prepositional dative form if it is the low-frequency verbs that are biased this way? This doesn't make intuitive sense if children are doing some sort of pattern-matching. I would expect children to behave like the model – to more closely match the biases of the high-frequency verbs and therefore prefer to generalize to the double-object construction from the prepositional dative. I think that rather than simply running the model on a larger corpus, it would be useful to construct a strong theory for why children might have this bias and then construct a model that is able to test that theory.




Thursday, June 6, 2013

Some thoughts on Carlson et al. 2010

I really liked how this paper tackled a really big problem head-on. Its inclusion in subsequent works speaks strongly for the interest in this kind of research. I would really like to see more language papers set a high bar like this and establish a framework for achieving it.

My largest concern about this paper is the fact that the authors seemed to feel that human-guided learning can overcome some of the deficits in the model framework. The large drop off in precision (from 90% to 57%) is not surprising as methods such as the Coupled SEAL and Coupled Morphological Classifier are not robust in the face of locally optimal solutions; it is inevitable that as more and more data is added, the fitness will decline, because the models are already anchored to their fit of previous data. Errors will beget errors, and human intervention will only limit this inherent multiplication.

These errors are further compounded by the fact that the framework does not take into account the degree of independence between its various models. Using group and individual model thresholds for decision making is a decent heuristic, but it is unworkable as an architecture because guaranteeing each model's independence is a hard constraint on the number and types of models that can be used. I believe the framework would be better served by combining the underlying information in a proper, hierarchical model. By including more models that can inform each other, perhaps the necessity of human-supervised learning can be kept to a minimum.

Tuesday, May 28, 2013

Have a good summer, and see you in the fall!


Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us this past academic year!

The CoLa Reading Group will be on hiatus this summer, and we'll resume again in the fall quarter.  As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Friday, May 24, 2013

Some thoughts on Kwiatkowski et al 2012

One of the things I really enjoyed about this paper was that it was a much fuller syntax & semantics system than anything I've seen in a while, which means we get to see the nitty-gritty of the assumptions that are required to make it all work. Having seen the assumptions, though, I did find it a little unfair for the authors to claim that no language-specific knowledge was required - as far as I could tell, the "language-universal" rules between syntax and semantics at the very least seem to be a language-specific kind of knowledge (in the sense of domain-specific vs. domain-general). In this respect, whatever learning algorithms they might explore, the overall approach seems similar to other learning models I've seen that are predicated on very precise theoretical linguistic knowledge (e.g., the parameter-setting systems of Yang 2002, Sakas & Fodor 2001, Niyogi & Berwick 1996, Gibson & Wexler 1994, among others). It just so happens here that CCG assumes different primitives/principles than those other systems - but domain-specific primitives/principles are still there a priori.

Getting back to the semantic learning - I'm a big fan of them learning words besides nouns, and connecting with the language acquisition behavioral literature on syntactic bootstrapping and fast mapping.  That being said, the actual semantics they seemed to learn was a bit different than what I think the fast mapping people generally intend.  In particular, if we look at Figure 5, while three different quantifier meanings are learned, it's more about the form the meaning takes, rather than the actual lexical meaning of the word (i.e., the form for a, another, and any looks identical, so any differences in meaning are not recognized, even though these words clearly do differ in meaning). I think lexical meaning is what people are generally talking about for fast mapping, though. What this seems like is almost grammatical categorization, where knowing the grammatical category means you know the general form the meaning will have (due to those linking rules between syntactic category and semantic form) rather than the precise meaning - that's very in line with syntactic bootstrapping, where the syntactic context might point you towards verb-y meanings or preposition-y meanings, for example.

More specific thoughts:

I found it interesting that the authors wanted to explicitly respond to a criticism that statistical learning models can't generate sudden step-like behavior changes.  I think it's certainly an unspoken view by many in linguistics that statistical learning implies more gradual learning (which was usually seen as a bonus, from what I understood, given how noisy data are). It's also unclear to me whether the data taken as evidence for step-wise changes really reflect a step-wise change, or instead only seem to be step-wise because of how often the samples were taken and how much learning happened in between.  It's interesting that the model here can generate step-like changes for learning word order (in Figure 6), though I think the only case that really stands out for me is the 5-meaning example, around 400 utterances.

I could have used a bit more unpacking of the CCG framework in Figure 2. I know there were space limitations, but the translation from semantic type to the example logical form wasn't always obvious to me. For example, the first and last examples (S_dcl and PP) have the same semantic type but not the same lambda calculus form. Is the semantic type what's linked to the syntactic category (presumably), and then there are additional rules for how to generate the lambda form for any given semantic type?
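
Here's the kind of toy case I was tripping over, written out (my own made-up example, not the actual entries in their Figure 2): two categories can share a semantic type while having different logical forms, so the type alone can't be what generates the lambda term.

    % Toy illustration (mine, not Figure 2): same semantic type, different logical forms.
    % An intransitive-verb category and a predicative PP can both have type <e,t>:
    S\backslash NP \;:\; \langle e, t\rangle \;\leadsto\; \lambda x.\,\mathit{sleep}(x)
    \qquad
    PP \;:\; \langle e, t\rangle \;\leadsto\; \lambda x.\,\mathit{on}(x, \mathit{table})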

This provides a nice example where the information that's easily available in dependency structures appears more useful, since the authors describe (in section 6) how they created a deterministic procedure for using the primitive labels in the dependency structures to create the lambda forms. (Though as a side note, I was surprised how this mapping only worked for a third of the child-directed speech examples, leaving out not only fragments but also imperatives and nouns with prepositional phrase modifiers. I guess it's not unreasonable to try to first get your system working on a constrained subset of the data, though.)

I wish they had told us a bit more about the guessing procedure they used for parsing unseen utterances, since it had a clear beneficial impact throughout the learning period. Was it random (and so guessing at all was better than not, since sometimes you'd be right as opposed to always being penalized for not having a representation for a given word)?  Was it some kind of probabilistic sampling?  Or maybe just always picking the most probable hypothesis?




Wednesday, May 15, 2013

Some thoughts on Frank et al. 2010

So what I liked most about this article was the way in which they chose to explore the space of possibilities in a very computational-level way. I think this is a great example of what I'd like to see more of. As someone also interested in cross-linguistic viability for our models, I have to also commend them for testing on not just one foreign language, but on three.

So there were a number of aspects of the model that I think could have been more clearly specified. For instance, I don't believe they ever explicitly say that the model presumes knowledge of the number of states to be learned. Actual infants don't get that number handed to them, so it would be nice to know what would happen if you inferred it from the data. It turns out there's a well-specified model to do exactly that, but I'll get to it later. Another problem with their description of the model has to do with how their hyperparameters are sampled. They apparently simplify the process by resampling only once per iteration of the Gibbs sampler. I'm happy with this, although I'm going to assume it was a typo that they say they run their model for 2000 iterations (Goldwater seems to prefer 20,000). Gibbs samplers tend to converge more slowly on time-dependent models, so it would be nice to have some evidence that the sampler has actually converged. Splitting the data by sentence type seems to increase the size of their confidence intervals by quite a lot, which may be an artifact of having less data per parameter, but could also be due to a lack of convergence.

Typically I have to chastise modelers who attempt to use VI or V-measure, but fortunately they are not doing anything technically wrong here. They are correct in that comparing these scores across corpora is hazardous at best. Both of these measures are biased: VI prefers small numbers of tags and V-measure prefers large numbers of tags (they claim at some point that it is "invariant" to different numbers of tags, but this is not true!). It turns out that another measure, V-beta, is more useful than either of these two in that it is unbiased with respect to the number of categories. So there's my rant about the wonders of V-beta.
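
If anyone wants to see the bias for themselves, here's the quick-and-dirty check I'd run (my own sketch: the gold tags and the random clusterings are simulated, and the VI computation is built from entropies and mutual information rather than any packaged implementation):

    import numpy as np
    from collections import Counter
    from sklearn.metrics import mutual_info_score, v_measure_score

    def entropy(labels):
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log(p)).sum())

    def variation_of_information(true_labels, pred_labels):
        # VI = H(C) + H(K) - 2 * I(C; K), everything in nats
        return (entropy(true_labels) + entropy(pred_labels)
                - 2.0 * mutual_info_score(true_labels, pred_labels))

    # Score random clusterings of different sizes against the same simulated gold
    # tags: lower VI is "better", higher V-measure is "better", so any systematic
    # drift with k is pure bias.
    rng = np.random.default_rng(0)
    gold = rng.integers(0, 10, size=5000)
    for k in (2, 10, 50, 200):
        pred = rng.integers(0, k, size=5000)
        print(k, round(variation_of_information(gold, pred), 3),
              round(v_measure_score(gold, pred), 3))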

What I really would have liked to see would be an infinite HMM for this data, which is a well-specified, very similar model which can infer the number of grammatical categories in the data. It has an efficient sampler (as of 2008) so there's no reason they couldn't run that model over their corpus. It's very useful for us to know what the space of possibilities is, but to what extent would their results change if they gave up the assumption that you knew from the get-go how many categories there were? There's really no reason they couldn't run it and I'd be excited to see how well it performed.

The one problem with the models they show here, as well as the IHMM, is that none of them allows for shared information about transition probabilities or emission probabilities (depending on the model) across sentence types. The sentence types are treated as entirely different. They mention this in their conclusion, but I wonder if there's any way to share that information in a useful way without hand-coding it somehow.

Overall, I'm really happy someone is doing this. I liked the use of some very salient information to help tackle a hard problem, but I would've liked to have seen it made a little more realistic by inferring the number of grammatical categories. I might've also liked to have seen better evidence of convergence (perhaps a beam sampler instead of Gibbs; at the very least, I hope they ran it for more than 2000 iterations).

Tuesday, May 14, 2013

Next time on 5/28/13 @ 2pm in SBSG 2200 = Kwiatkowski et al. 2012

Thanks to everyone who joined our meeting this week, where we had a very thoughtful discussion about the experimental design for investigating "less is more" and the implications of the computational modeling in Perfors 2012.  Next time on Tuesday May 28 @ 2pm in SBSG 2200, we'll be looking at an article that presents an incremental learning model that incorporates both syntactic and semantic information during learning:

Kwiatkowski, T., Goldwater, S., Zettlemoyer, L., & Steedman, M. 2012. A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.



See you then!
-Lisa


Monday, May 13, 2013

Some thoughts on Perfors 2012 (JML)

One of the things I quite liked about this paper was the description of the intuitions behind the different model parameters and capacity limitations. As a computational modeler who's seen ideal Bayesian learners before, could I have just as easily decoded this from a standard graphical model representation? Sure.  Did I like to have the intuitions laid out for me anyway?  You bet. Moreover, if we want these kinds of models to be recognized and used within language research, it's good to know how to explain them like this. On a related note, I also appreciated that Perfors explicitly recognized the potential issues involved in extending her results to actual language learning. As with most models, hers is a simplification, but it may be a useful simplification, and there are probably useful ways to un-simplify it.

It was also good to see the discussion of the relationship between the representations this model used for memory and the existing memory literature. (Given the publication venue, this probably isn't so surprising, but given that my knowledge of memory models is fairly limited, it was helpful to see this spelled out.)

I think the most surprising thing for me was how much memory loss was required for the regularization bias to be able to come into play and allow the model to show regularization. Do we really think children only remember 10-20% of what they hear? (Maybe they do, though, especially in more realistic scenarios.)

More specific thoughts:

Intro: I found the distinctions made between different "less is more" hypothesis variants to be helpful, in particular the difference between a "starting small" version that imposes explicit restrictions on the input (because of attention, memory, etc.) to identify useful units in the input vs. a general regularization tendency (which may be the byproduct of cognitive limitations, but isn't specifically about ignoring some of the input) which is about "smoothing" the input in some sense.

Section 2.1.2: The particular task Perfors chooses to investigate experimentally is based on previous tasks that have been done with children and adults to test regularization, but I wonder what kind of task it seemed like to the adult subjects. Since the stimuli were presented orally, did the subjects think of each one as a single word that had some internal inconsistency (and so might be treating the variable part as morphology tacked onto a noun) or would they have thought of each one as one consistent word plus a separate determiner-like thing (making this more of a combinatorial syntax task)?  I guess it doesn't really matter for the purposes of regularization - if children can regularize syntax (creoles, Nicaraguan sign language, Simon), then presumably they regularize morphology (e.g., children's overregularization of the past tense in English, like goed), and it's not an unreasonable assumption that the same regularization process would apply to both. Perfors touches again on the issue of how adults perceived the task a little in the discussion (p.40) - she mentions that mutual exclusivity might come into play if adults viewed this as a word learning task, and cause more of a bias for regularization.  Whether it's a morphology task or a combinatorial syntax task, I'm not sure I agree with that - mutual exclusivity seems like it would only apply if adults assumed the entire word was the name of the object (as opposed to the determiner-thing being an actual determiner like the or a or morphology like -ed or -ing). Because only a piece of the entire "word" would change with each presentation of the object, it doesn't seem like adults would make that assumption.

Section 3.0.6: For the Prior bias, it seems like the prior is constructed from the global frequency of the determiner (based on the CRP). This seems reasonable, but I wonder if it would matter at all to have a lexical-item-based prior (maybe in addition to the global prior)? I could imagine that the forgotten data for any individual item might be quite high (even if others are low) when memory loss is less than 80-90% globally, which might allow the regularization effects to show up without needing to forget 80-90% of all the data.
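
Here's roughly the contrast I'm imagining, sketched out (my formulation with invented counts and a generic CRP-style predictive probability, not necessarily the exact prior Perfors uses):

    # My sketch of global vs. lexical-item-based priors over determiner choice.
    def crp_predictive(counts, determiner, alpha=1.0):
        """CRP-style predictive probability: proportional to the determiner's
        observed count, with alpha's worth of mass reserved for something new."""
        total = sum(counts.values())
        return counts.get(determiner, 0) / (total + alpha)

    global_counts = {"det1": 70, "det2": 30}              # pooled over all nouns
    item_counts = {"noun3": {"det1": 2, "det2": 5}}       # invented per-item counts

    print(crp_predictive(global_counts, "det1"))          # global prior for det1
    print(crp_predictive(item_counts["noun3"], "det1"))   # per-item prior for det1 with noun3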

Section 4: It's an interesting observation that the previous experiments that found regularization effects conducted the experiment over multiple days, where consolidation during sleep would have presumably occurred. Perfors mentions this as a potential memory distortion that occurs not during encoding itself or retrieval, but rather during the processes of memory maintenance. If this is true, running the experiments again with adults, but over multiple days, should presumably allow this effect to show up.

Tuesday, April 30, 2013

Next time on 5/14 @ 2pm in SBSG 2200 = Perfors 2012 JML


Thanks to everyone who joined our meeting this week, where we had a very spirited and enlightening discussion about Lignos 2012 and the ideas behind it! Next time on Tuesday May 14 @ 2pm in SBSG 2200, we'll be looking at an article that investigates the interplay between memory limitations and overregularization behavior in learners, providing both experimental and computational modeling results:



See you then!

Monday, April 29, 2013

Some thoughts on Lignos 2012

I found the simplicity of the proposed algorithm in this paper very attractive (especially when compared to some of the more technically involved papers we've read that come from the machine learning literature). The goal of connecting to known experimental and developmental data of course warmed my cognitive modeler's heart, and I certainly sympathized with the aim of pushing the algorithm to be more cognitively plausible.  I did think some of the criticisms of previous approaches were a touch harsh, given what's actually implemented here (more on this below), but that may be more of a subjective interpretation thing.  I did find it curious that the evaluation metrics chosen were about word boundary identification, rather than about lexicon items (in particular, measuring boundary accuracy and word token accuracy, but not lexicon accuracy).  Given the emphasis on building a quality lexicon (which seems absolutely right to me if we're talking about the goal of word segmentation), why not have lexicon item scores as well to get a sense of how good a lexicon this strategy can create?
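
Concretely, the extra evaluation I'm wishing for is something like the following (my own toy scoring code and invented lexicons, not the paper's evaluation script): lexicon precision/recall over word types, alongside the boundary and token scores that are already reported.

    # Toy sketch of lexicon-level scoring: what fraction of learned word types
    # are real words (precision), and what fraction of real word types were
    # learned (recall). The same function works for boundary positions too.
    def precision_recall(found, gold):
        found, gold = set(found), set(gold)
        true_pos = len(found & gold)
        precision = true_pos / len(found) if found else 0.0
        recall = true_pos / len(gold) if gold else 0.0
        return precision, recall

    gold_lexicon = {"the", "big", "drum", "is", "here"}
    learned_lexicon = {"the", "big", "drum", "ishere", "thebig"}   # invented output
    print(precision_recall(learned_lexicon, gold_lexicon))         # (0.6, 0.6)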

Some more specific thoughts:

Section 2.1, discussing the 9-month-old English-learning infants who couldn't segment Italian words from transitional probabilities alone unless they had already been presented with words in isolation: Lignos is using this to argue against transitional probabilities as a useful metric at all, but isn't another way to interpret it simply that transitional probabilities (TPs) can't do it all on their own?  That is, if you initialize a proto-lexicon with a few words, TPs would work alright - they just can't work right off the bat with no information.  Relatedly, the discussion of the Shukla et al. 2011 (apparently 6-month-old) infants who couldn't use TPs unless they were aligned with a prosodic boundary made me think more that TPs are useful, just not useful in isolation.  They need to be layered on top of some existing knowledge (however small that knowledge might be).  But I think it just may be Lignos's stance that TPs aren't that useful - they seem to be left out as something a model of word segmentation should pay attention to in section 2.4.
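
As a reference point for what "TPs on their own" means here, this is the standard forward transitional probability computation (my own toy stream of syllables, just to show that word-internal transitions come out higher than transitions across word boundaries):

    from collections import Counter

    def transitional_probabilities(syllables):
        """Forward TP(a -> b) = count(a followed by b) / count(a)."""
        pair_counts = Counter(zip(syllables, syllables[1:]))
        first_counts = Counter(syllables[:-1])
        return {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}

    # Invented stream with "pabiku" and "golatu" as the underlying words.
    stream = "pa bi ku pa bi ku go la tu pa bi ku".split()
    tps = transitional_probabilities(stream)
    print(tps[("pa", "bi")])   # 1.0  (word-internal)
    print(tps[("ku", "go")])   # 0.5  (across a word boundary)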

Of course, I (and I'm assuming Lawrence as well, given Phillips & Pearl 2012) was completely sympathetic to the criticism in section 2.3 about how phonemes aren't the right unit of perception for the initial stages of word segmentation. They may be quite appropriate if you're talking about 10-month-olds, though - of course, at that point, infants probably have a much better proto-lexicon, not to mention other cues (e.g., word stress). I was a little less clear about the criticism (of Johnson & Goldwater) about using collocations as a level of representation.  Even though this doesn't necessarily connect to adult knowledge of grammatical categories and phrases, there doesn't seem to be anything inherently wrong with assuming infants initially learn chunks that span categories and phrases, like "thatsa" or "couldI". They would have to fix them later, but that doesn't seem unreasonable.

One nice aspect of the Lignos strategy is that it's incremental, rather than a batch algorithm.  However, I think it's more a modeling decision than an empirical fact to not allow memory of recent utterances to affect the segmentation of the current utterance (section 3 Intro).  It may well turn out to be right, but it's not obviously true at this point that this is how kids are constrained.  On a related note, the implementation of considering multiple segmentations seems a bit more memory-intensive, so what's the principled reason for allowing memory for that but not allowing memory for recent utterances? Conceptually, I understand the motivation for wanting to explore multiple segmentations (and I think it's a good idea - I'm actually not sure why the algorithm here is limited to 2) - I'm just not sure it's quite fair to criticize other models for essentially allowing more memory for one thing when the model here allows more memory for another.

I was a little confused about how the greedy subtractive segmentation worked in section 3.2.  At first, I thought it was an incremental greedy thing - so if your utterance was "syl1 syl2 syl3", you would start with "syl1" and see if that's in your lexicon; if not, try "syl1 syl2", and so on. But this wouldn't run into ambiguity then: "...whenever multiple words in the lexicon could be subtracted from an utterance, the entry with the highest score will be deterministically used". So something else must be meant. Later on when the beam search is described, it makes sense that there would be ambiguity - but I thought ambiguity was supposed to be present even without multiple hypotheses being considered.
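
For what it's worth, here's the reading of greedy subtractive segmentation I ended up with while puzzling over this (purely my reconstruction, with invented lexicon scores - it may well not be what Lignos actually implements): the ambiguity comes from multiple lexicon entries matching at the same position, and "deterministically used" just means the highest-scoring match wins.

    # My reconstruction (not necessarily Lignos's actual algorithm).
    lexicon = {("big",): 5, ("big", "drum"): 2, ("drum",): 3}   # invented scores

    def segment(syllables):
        words, i = [], 0
        while i < len(syllables):
            # Every lexicon entry matching at position i is a candidate --
            # this is where ambiguity shows up even without a beam.
            candidates = [w for w in lexicon
                          if tuple(syllables[i:i + len(w)]) == w]
            if candidates:
                best = max(candidates, key=lambda w: lexicon[w])
                words.append(best)
                i += len(best)
            else:
                words.append((syllables[i],))   # no match: one fallback of several possible
                i += 1
        return words

    print(segment(["big", "drum", "here"]))
    # [('big',), ('drum',), ('here',)] -- ("big",) outscores ("big", "drum") at position 0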

The "Trust" feature described in 3.3 seemed like an extra type of knowledge that might be more easily integrated into the existing counts, rather than added on as an additional binary feature.  I get that the idea was to basically use it to select the subset of words to add to the lexicon, but couldn't a more gradient version of this implemented, where the count for words at utterance boundaries gets increased by 1, while the count for words that are internal gets increased by less than 1? I guess you could make an argument either way about which approach is more naturally intuitive (i.e., just ignore words not at utterance boundaries vs. be less confident about words not at utterance boundaries).

I think footnote 7 is probably the first argument I've seen in favor of using orthographic words as the target state, instead of an apology for not having prosodic words as the target state. I appreciate the viewpoint, but I'm not quite convinced that prosodic words wouldn't be useful as proto-lexicon items (ex: "thatsa" and "couldI" come to mind). Of course, these would have to be segmented further eventually, but they're probably not completely destructive to have in the proto-lexicon (and do feel more intuitively plausible as an infant's target state).

In Table 1, it seems like we see a good example of why precision and recall may be better than hit (H) rate and false alarm (FA) rate: The Syllable learner (which puts a boundary at every syllable) clearly oversegments and does not achieve the target state, but you would never know that from the H and FA scores.  Do we get additional information from H & FA that we don't get from precision and recall? (I guess it would have to be mostly from the FA rate, since H = recall?)

I thought seeing the error analyses in Tables 2 and 3 was helpful, though I was a little surprised Table 3 didn't show the breakdown between undersegmentation and oversegmentation errors, in addition to the breakdown between function and content words.  (Or maybe I just would have liked to have seen that, given the claim that early errors should mostly be undersegmentations. We see plenty of function words as errors, but how many of them are already oversegmentations?)