Wednesday, November 20, 2013

Next time on 12/4/13 @ 2:30pm in SBSG 2221 = Nematzadeh et al. 2013

Thanks to everyone who was able to join us for our feisty and thoughtful discussion of Lewis & Frank 2013! Next time on December 4 at 2:30pm in SBSG 2221, we'll be looking at an article that explores the kinds of difficulties in word-learning that can paradoxically help long-term learning and why they help, using a computational modeling approach:

Nematzadeh, A., Fazly, A., & Stevenson, S. 2013. Desirable Difficulty in Learning: A Computational Investigation. Proceedings of the 35th Annual Meeting of the Cognitive Science Society.


See you then!

Monday, November 18, 2013

Some thoughts on Lewis & Frank 2013

I'm always a fan of learning models that involve solving different problems simultaneously, with the idea of leveraging information from one problem to help solve the other (Feldman et al. 2013 and Dillon et al. 2013 are excellent examples of this, IMHO). For Lewis & Frank (L&F), the two problems are related to word learning: how to pick the referent from a set of referents and how to pick which concept class that referent belongs to (which they relate to how to generalize a label appropriately). I have to say that I struggled to understand how they incorporated the second problem, though -- it doesn't seem like the concept generalization w.r.t. subordinate vs. superordinate classes maps in a straightforward way to the feature analysis they're describing. (More on this below.) I was also a bit puzzled by their assumption about where the uncertainty in learning comes from and the link they describe between what they did and the origin/development of complex concepts (more on these below, too).
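Just to make my own reading of the joint-inference idea concrete, here's a minimal sketch (my toy construction with made-up numbers, not L&F's actual model): the learner hears a label in a context with several candidate referents and jointly infers both the intended referent and the concept behind the label, so evidence about one constrains the other.

    # Toy joint inference over (concept, referent) pairs -- my own sketch,
    # not L&F's model. All probabilities are made up for illustration.

    referents = ["red_apple", "green_pear"]
    concepts = ["FRUIT", "RED_THING", "APPLE"]

    # P(concept): a made-up prior over candidate concepts for the label.
    p_concept = {"FRUIT": 0.4, "RED_THING": 0.2, "APPLE": 0.4}

    # P(referent | concept): which referents each concept picks out (toy values).
    p_ref_given_concept = {
        "FRUIT":     {"red_apple": 0.5, "green_pear": 0.5},
        "RED_THING": {"red_apple": 1.0, "green_pear": 0.0},
        "APPLE":     {"red_apple": 1.0, "green_pear": 0.0},
    }

    # Joint posterior over (concept, referent) pairs, assuming the label
    # applies to whatever the concept picks out.
    joint = {}
    for c in concepts:
        for r in referents:
            joint[(c, r)] = p_concept[c] * p_ref_given_concept[c][r]
    total = sum(joint.values())
    posterior = {pair: p / total for pair, p in joint.items()}

    for (c, r), p in sorted(posterior.items(), key=lambda kv: -kv[1]):
        print(f"concept={c:9s} referent={r:10s} p={p:.2f}")

Summing over concepts favors the red apple as the referent, and conditioning on a referent reshapes the distribution over concepts -- that's the kind of leveraging I mean.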

On generalization & features:  If we take the example in their Figure 1, it seems like the features could be something like f1 = "fruit", f2 = "red", and f3 = "apple". The way they talk about generalization is as underspecification of feature values, which feels right.  So if we say f1 is the only important feature, then this corresponds nicely to the idea of "fruit" as a superordinate class.  But what if we allow f2 to be the important feature? Is "red" the superordinate class of "red" things?  Well, in a sense, I suppose. But this falls outside of the noun-referent system that they're working in - "red" spans many referents, because it's a property.  Maybe this is my misunderstanding in trying to map this whole thing to subordinate and superordinate classes, like Xu & Tenenbaum 2007 talk about, but it felt like that's what L&F intended, given the model in Figure 2 that's grounded in Objects at the observable level and the behavioral experiment they actually ran.
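To make the underspecification idea concrete (again, my own toy rendering rather than L&F's implementation): think of a concept as a partial assignment of feature values, where the fewer features it specifies, the more objects it covers. This also makes my worry about f2 visible, since a "red"-based concept cuts across the object categories.

    # Concepts as partial feature specifications -- my own toy rendering.
    # Fewer specified features = broader generalization.

    objects = {
        "red_apple":   {"kind": "fruit", "color": "red",   "species": "apple"},
        "green_apple": {"kind": "fruit", "color": "green", "species": "apple"},
        "red_ball":    {"kind": "toy",   "color": "red",   "species": "ball"},
    }

    # Each concept specifies some features and leaves the rest open.
    concepts = {
        "APPLE (subordinate)":   {"kind": "fruit", "species": "apple"},
        "FRUIT (superordinate)": {"kind": "fruit"},
        "RED (property-based)":  {"color": "red"},
    }

    def covers(concept, obj):
        """An object falls under a concept if it matches every specified feature."""
        return all(obj.get(feat) == val for feat, val in concept.items())

    for name, concept in concepts.items():
        extension = [o for o in objects if covers(concept, objects[o])]
        print(f"{name:22s} -> {extension}")

The "RED" hypothesis is the troublesome one for me: it's a perfectly fine underspecified concept, but its extension crosses the object categories that the noun-referent mapping seems to presuppose.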

On where the uncertainty comes from: L&F mention in the Design of the Model section that the learning model assumes "the speaker could in principle have been mistaken about their referent or misspoken". From a model building perspective, I understand that this is easier to incorporate and allows graded predictions (which are necessary to match the empirical data), but from the cognitive perspective, this seems really weird to me. Do we have reason to believe children assume their speakers are unreliable? I was under the impression children assume their speakers are reliable as a default. Maybe there's a better place to work this uncertainty in - approximate inference by a sub-optimal learner or something like that. Also, as a side note, it seems really important to understand how the various concepts/features are weighted by the learner. Maybe that's where uncertainty could be worked in at the computational level.
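For what it's worth, here's the standard way I'd expect that "speaker might have misspoken" assumption to enter a model (a generic noisy-speaker likelihood, my sketch rather than L&F's actual equations): a small noise parameter epsilon mixes the intended labeling with a uniform "mistake" distribution, and that's what buys you graded rather than all-or-none posteriors.

    # A generic "noisy speaker" likelihood -- my sketch of the usual move,
    # not L&F's actual equations. epsilon = probability the speaker misspoke.

    def label_likelihood(label, referent, lexicon, labels, epsilon=0.05):
        """P(label | referent): probability mass (1 - epsilon) on the lexicon's
        label for the referent, epsilon spread uniformly over all labels."""
        correct = 1.0 if lexicon.get(referent) == label else 0.0
        return (1 - epsilon) * correct + epsilon * (1.0 / len(labels))

    labels = ["dax", "blicket"]
    lexicon = {"red_apple": "dax", "green_pear": "blicket"}

    # With epsilon = 0, an unexpected label gets probability 0 and the posterior
    # collapses; with a small epsilon, it's merely downweighted.
    for eps in (0.0, 0.05):
        p = label_likelihood("dax", "green_pear", lexicon, labels, epsilon=eps)
        print(f"epsilon={eps}: P('dax' | green_pear) = {p:.3f}")

My cognitive worry stands, though: this implementation locates the graded behavior in distrust of the speaker, when it could just as well come from noise in the learner's own inference process.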

On the origin/development of concepts: L&F mention in the General Discussion that "the features are themselves concepts that can be considered as primitives in the construction of more complex concepts", and then state that their model "describes how a learner might bootstrap from these primitives to infer more and complex concepts". This sounds great, but I was unclear on how exactly that would work. Taking the f1, f2, and f3 from above, for example, I get that those are primitive features. So the concepts are then things that can be constructed out of some combination of their values (whether specified or unspecified)? And then where does the development come in? Where is the (presumably novel) combination that allows the construction of new, more complex concepts? I understand that these could be the building units for such a model, but I didn't see how the current model shows us something about that.
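Here's the sort of thing I would have wanted to see spelled out -- and this is purely my own guess at what "bootstrapping from primitives" could look like, not anything in L&F's paper: complex concepts as conjunctions of primitive feature values, with "development" amounting to adding new combinations to the hypothesis space.

    # One guess at "primitives -> complex concepts": complex concepts as
    # conjunctions of primitive feature values. Entirely my speculation,
    # not what L&F implemented.
    from itertools import combinations, product

    # Primitive features and their possible values (toy inventory).
    primitives = {
        "kind":  ["fruit", "toy"],
        "color": ["red", "green"],
    }

    def complex_concepts(primitives, max_size=2):
        """Concepts = conjunctions of primitive feature values; features left
        out of the conjunction stay unspecified (i.e., generalized over)."""
        concepts = []
        feats = list(primitives)
        for size in range(1, max_size + 1):
            for chosen in combinations(feats, size):
                for values in product(*(primitives[f] for f in chosen)):
                    concepts.append(dict(zip(chosen, values)))
        return concepts

    for c in complex_concepts(primitives):
        print(c)
    # e.g. {'kind': 'fruit'}, {'color': 'red'}, ..., {'kind': 'fruit', 'color': 'red'}

Even granting something like this, my question stands: the current model seems to rank pre-given hypotheses rather than show us where the novel combinations come from.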

Behavioral experiment implementation: I'm definitely a fan of matching a model to controlled behavioral data, but I wonder about the specific kind of labeling they gave their subjects. It seems like they intended "dax bren nes" to be the label for one object shown (it's just unclear which one - basically, this might as well be a trisyllabic word "daxbrennes"). This is a bit different from standard cross-situational experiments, where multiple words are given for multiple objects. Given that subjects are tested with that same label, I guess the idea is that it simplifies the learning situation.

Results: I struggled a bit to decipher the results in Figure 5 - I'm assuming the model predictions are for the different experimental contexts, ordered by human uncertainty about how much to generalize to the superordinate class. Is the lexicon posited by the model just how many concepts to map to "dax bren nes", where concept = referent?

~~~
References

Dillon, B., Dunbar, E., & Idsardi, W. 2013. A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37, 344-377.

Feldman, N. H., Griffiths, T. L., Goldwater, S., & Morgan, J. L. 2013. A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4), 751-778.

Xu, F., & Tenenbaum, J. 2007. Word Learning as Bayesian Inference.  Psychological Review, 114(2), 245-272.



Wednesday, November 6, 2013

Next time on 11/20/13 @ 2:30pm in SBSG 2221 = Lewis & Frank 2013

Thanks to everyone who was able to join us for our vigorous and thoughtful discussion of Marcus & Davis 2013! Next time on November 20 at 2:30pm in SBSG 2221, we'll be looking at an article that discusses how to solve two problems related to word learning simultaneously, using hierarchical Bayesian modeling and evaluating against human behavioral data:

Lewis, M., & Frank, M. 2013. An integrated model of concept learning and word-concept mapping. Proceedings of the 35th Annual Meeting of the Cognitive Science Society.



See you then!

Monday, November 4, 2013

Some thoughts on Marcus & Davis (2013)

(...and a little also on Jones & Love 2011)

One of the things that struck me about Marcus & Davis (2013) [M&D] is that they seem to be concerned with identifying what the priors are for learning. But what I'm not sure of is how you distinguish the following options:

(a) sub-optimal inference over optimal priors
(b) optimal inference over sub-optimal priors
(c) sub-optimal inference over sub-optimal priors

M&D seem to favor option (a), but I'm not sure there's an obvious reason to do so. Jones & Love 2011 [J&L] mention the possibility of "bounded rationality", which is something like "be as optimal as possible in your inference, given the prior and the processing limitations you have". That sounds an awful lot like (c), and seems like a pretty reasonable option to explore. The general concern with what the priors actually are dovetails quite nicely with traditional linguistic explorations of how to define (constrain) the learner's hypothesis space appropriately to make successful inference possible. J&L are quite aware of this too, and underscore the importance of selecting the priors appropriately.
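To make the identifiability worry concrete, here's a toy simulation (my own, not from either paper): a learner doing exact inference with a skewed prior and a learner doing crude few-sample inference with a flat prior can both deviate from the ideal answer, and from the deviation alone it's not obvious which situation you're looking at.

    # Toy illustration (mine, not M&D's or J&L's): sub-optimal inference over a
    # reasonable prior vs. exact inference over a skewed prior can be hard to
    # tell apart from the outputs alone.
    import random

    random.seed(0)

    # Data: 7 heads out of 10 coin flips.
    heads, n = 7, 10

    # Exact answer under a flat Beta(1, 1) prior, for reference.
    post_mean_exact = (1 + heads) / (1 + 1 + n)

    # Option (b): exact Beta-Binomial inference with a skewed prior Beta(9, 1).
    post_mean_bad_prior = (9 + heads) / (9 + 1 + n)

    # Option (a): flat prior, but sloppy inference -- approximate the posterior
    # mean with only 3 samples instead of computing it exactly.
    samples = [random.betavariate(1 + heads, 1 + (n - heads)) for _ in range(3)]
    post_mean_few_samples = sum(samples) / len(samples)

    print(f"exact inference, flat prior:    {post_mean_exact:.2f}")
    print(f"exact inference, skewed prior:  {post_mean_bad_prior:.2f}")
    print(f"3-sample inference, flat prior: {post_mean_few_samples:.2f}")

Both learners can miss the ideal answer, but for different reasons, and nothing in the outputs by themselves says which mismatch (prior or inference, or both, as in (c)) is responsible.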

That being said, no matter what priors and inference processes end up working, there's clear utility in being explicit about all the assumptions that yield a match to human behavior, which is what M&D want (and I'm a huge fan of this myself: see my commentary on a recent article here where I happily endorse this). Once you've identified the necessary pieces that make a learning strategy work, you can then investigate (or at least discuss) which of those assumptions are in fact optimal. That may not be an easy task, but it seems like a step in the right direction.

M&D seem to be unhappy with probabilistic models as a default assumption - and okay, that's fine. But it does seem important to recognize that probabilistic reasoning is a legitimate option. And maybe some of cognition is probabilistic and some isn't - I don't think there's a compelling reason to believe that cognition has to be all one or all the other. (I mean, after all, cognition is made up of a lot of different things.) In this vein, I think a reasonable thing that M&D would like is for us to not just toss out non-probabilistic options that work really well solely because they're non-probabilistic.

On a related note, I very much agree with one of the last things M&D note, which is that we should be explicit about "what would constitute evidence that a probabilistic approach is not appropriate for a particular task or domain".  I'm not sure myself what that evidence would look like, since even categorical behavior can be simulated by a probabilistic model that just thresholds. Maybe if it's more "economical" (however we define that) to not have a probabilistic model, and there exists a non-probabilistic model that accomplishes the same thing?
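Just to illustrate the thresholding point, here's a minimal sketch (my own) of how a graded probabilistic model produces categorical-looking responses once you put a decision rule on top of it.

    # Minimal sketch (mine): graded posteriors plus a threshold decision rule
    # yield categorical-looking behavior.

    def categorical_response(posterior, threshold=0.9):
        """Respond with the most probable hypothesis only if its probability
        clears the threshold; otherwise withhold a response."""
        best = max(posterior, key=posterior.get)
        return best if posterior[best] >= threshold else "no response"

    for p_h1 in (0.55, 0.92, 0.99):
        posterior = {"h1": p_h1, "h2": 1 - p_h1}
        print(f"P(h1)={p_h1:.2f} -> {categorical_response(posterior)}")

Which is the problem: if a thresholded probabilistic model can mimic categorical behavior, then categorical-looking data alone can't be the evidence M&D are asking for.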

~~~
A few comments about Jones & Love 2011 [J&L]:

J&L seem very concerned with the recent focus in the Bayesian modeling world on existence proofs for various aspects of cognition. They do mention later in their article (around section 6, I think) that existence proofs are a useful starting point, however -- they just don't want research to stop there. An existence proof that a Bayesian learning strategy can work for some problem should be the first step for getting a particular theory on the table as a real possibility worth considering (e.g., whatever's in the priors for that particular learning strategy that allowed Bayesian inference to succeed, as well as the Bayesian inference process itself).

Overall, J&L seem to make a pretty strong call for process models (i.e., algorithmic-level models, instead of just computational-level models). Again, this seems like a natural follow-up once you have a computational-level model you're happy with. So the main point is simply not to rest on your Bayesian inference laurels once you have your existence proof at the computational level for some problem in cognition. The Chater et al. 2011 commentary on J&L notes that many Bayesian modelers are moving in this direction already, creating "rational process" models.

~~~
References

Chater, N., Goodman, N., Griffiths, T., Kemp, C., Oaksford, M., & Tenenbaum, J. 2011. The imaginary fundamentalists: The unshocking truth about Bayesian cognitive science. Behavioral and Brain Sciences, 34(4), 194-196.

Jones, M., & Love, B. C. 2011. Bayesian Fundamentalism or Enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34(4), 169-188.

Pearl, L. 2013. Evaluating strategy components: Being fair.  [lingbuzz]