Friday, January 24, 2014

Next time on 2/14/14 @ 3pm in SBSG 2221 = Omaki & Lidz 2013 Manuscript

Thanks to everyone who was able to join us for our thorough and thoughtful discussion of the Meylan et al. 2014 manuscript! Next time on Friday February 14 at 3pm in SBSG 2221, we'll be looking at an article manuscript that argues for the need to consider the development of children's processing abilities at the same time as we consider their acquisition of knowledge. This is particularly relevant to computational modelers who must explicitly model what the child's input looks like and how that input is used, for example.

Omaki, A. & Lidz, J. 2013. Linking parser development to acquisition of linguistic knowledge. Manuscript, Johns Hopkins University and University of Maryland, College Park. Please do not cite without permission from Akira Omaki. 

Wednesday, January 22, 2014

Some thoughts on Meylan et al. 2014 Manuscript

One of the things I really enjoyed about this paper was the framing they give to explain why we should care about the emergence of grammatical categories, with respect to the existing debate between (some of) the nativists and (some of) the constructivists.  Of course I'm always a fan of clever applications of Bayesian inference to problems in language acquisition, but I sometimes really miss the level of background story that we get here. (So hurrah!)

That being said, I was somewhat surprised to see the conclusion M&al2014 drew from their results, namely that (some kind of) nativist view wasn't supported. To me, the fact that we see very early, rapid development of this grammatical category knowledge is unexpected under the "gradual emergence based on data" story (i.e., the constructivist perspective). So, what's causing the rapid development? I know it's not the focus of M&al2014's work here, but positing some kind of additional learning guidance seems necessary to explain these results. And until we have a story for how that guidance would be learned, the "it's innate" answer is a pretty good placeholder.  So, for me, that places the results on the nativist side, though maybe not the strict "grammatical categories are innate" version. Maybe I'm being unfair to the constructivist side, though -- would they have an explanation for the rapid, early development?

Another very cool thing was the application of this approach to the Speechome data set. It's been around for a while, but we don't have a lot of studies that use it and it's such an amazing resource. One of the things I wondered, though, was whether the evaluation metric M&al2014 propose can only work if you have this density of data. It seems like that might be true, given the issues with confidence intervals on the CHILDES datasets. If so, this is different from Yang's metric [Yang 2013], which can be used on much smaller datasets. (My understanding is that as long as you have enough data to form a Zipfian distribution, you have enough for Yang's metric to be applied.)
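
Since Yang's metric came up, here's a minimal Monte Carlo sketch (my own toy construction, not Yang 2013's actual formula or M&al2014's model) of the underlying idea: assume Zipfian noun frequencies, simulate the determiner-noun overlap a fully productive speaker would show in a sample of a given size, and note how strongly the expected overlap depends on sample size. All names and parameter values below are illustrative assumptions.

```python
# Hedged sketch (not Yang 2013's exact formula): Monte Carlo estimate of the
# determiner+noun overlap a fully productive speaker would show in a sample
# of a given size, assuming Zipfian noun frequencies. All parameter values
# here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def expected_overlap(n_tokens, n_nouns=100, p_the=0.6, n_sims=2000):
    """Average proportion of sampled noun types that occur with both 'a' and 'the'."""
    ranks = np.arange(1, n_nouns + 1)
    zipf = (1.0 / ranks) / np.sum(1.0 / ranks)    # Zipfian noun frequencies
    overlaps = []
    for _ in range(n_sims):
        nouns = rng.choice(n_nouns, size=n_tokens, p=zipf)
        dets = rng.random(n_tokens) < p_the       # True = 'the', False = 'a'
        both = set(nouns[dets]) & set(nouns[~dets])
        overlaps.append(len(both) / len(set(nouns)))
    return float(np.mean(overlaps))

# Even a fully productive grammar predicts low overlap in small samples,
# so raw observed overlap only becomes diagnostic once it can be compared
# against this kind of expectation for the same sample size.
for n in (50, 200, 800):
    print(n, round(expected_overlap(n), 3))
```

If that's right, the sparser CHILDES samples would make observed and expected overlap hard to tell apart, which fits the confidence-interval issue just mentioned.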

One thing I didn't quite follow was the argument about why only a developmental analysis is possible, rather than both a developmental and a comparative analysis. I completely understand that adults may have different values for their generalized determiner preferences, but we assume that they realize determiners are a grammatical class. So, given this, whatever range of values adults have is the target state for acquisition, right? And this should allow a comparative analysis between wherever the child is and wherever the adult is. (Unless I'm missing something about this.)

Some more targeted thoughts:

As a completely nit-picky thing that probably doesn't matter, it took me a second to get used to calling grammatical categories syntactic abstractions. I get that they're the basis for (many) syntactic generalizations, but I wouldn't have thought of them as syntactic, per se.  (Clearly, this is just a terminology issue, and other researchers that M&al2014 cite definitely have called it syntactic knowledge, too.)

M&al2014 state in the previous work section that Yang's metric is "not well-suited to discovering if a child could be less than fully productive at a given stage of development". I'm not sure I understand why this is so - if the observed overlap in the child's output is less than the expected overlap from a fully productive system, isn't that exactly the indicator of a less than fully productive system?

In the generative model M&al2014 use, they have a latent variable that represents the unrecorded caregiver input (DA), which is assumed to be drawn from the same distribution as the observed caregiver input (dA). I don't follow what this variable contributes, especially if it follows the same distribution as the observed data.
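
To make my confusion concrete, here's a toy calculation -- my own guess at the kind of work such a variable could do, not M&al2014's actual model, and all numbers are made up: with partial recording, a determiner+noun pair that never appears in the observed input dA may still have been heard in the unrecorded portion DA, and assuming DA follows the same distribution as dA at least lets the model say how likely that is.

```python
# Toy calculation (my own construction, not M&al2014's model; all numbers are
# illustrative assumptions): with partial recording, how likely is it that a
# determiner+noun pair absent from the observed input dA was nonetheless heard
# by the child in the unrecorded portion DA, assuming DA is drawn from the same
# (here, Zipfian) distribution as dA?
import numpy as np

rng = np.random.default_rng(1)

n_pair_types = 500                                   # determiner+noun pair types
ranks = np.arange(1, n_pair_types + 1)
pair_probs = (1.0 / ranks) / np.sum(1.0 / ranks)     # assumed Zipfian frequencies
total_input = 20000                                  # pairs the child actually heard

for coverage in (0.25, 0.10, 0.02):                  # fraction of input recorded
    n_recorded = int(total_input * coverage)
    recorded = rng.choice(n_pair_types, size=n_recorded, p=pair_probs)
    unseen = np.setdiff1d(np.arange(n_pair_types), recorded)
    n_unrecorded = total_input - n_recorded
    if len(unseen) == 0:
        print(f"coverage={coverage:.2f}: every pair type was recorded")
        continue
    # probability each pair type missing from dA was heard somewhere in DA
    p_heard_anyway = 1.0 - (1.0 - pair_probs[unseen]) ** n_unrecorded
    print(f"coverage={coverage:.2f}: {len(unseen):3d} pair types unseen in dA, "
          f"mean P(heard in DA) = {p_heard_anyway.mean():.2f}")
```

So one thing the variable plausibly buys you is a principled way to discount "novel" child combinations when the recording is sparse -- but I'd still like to hear whether that's the role it actually plays in their model.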

The table just below figure 4:  I'm not sure I followed this. What would rich morphology be for English data, for example? And are the values for "Current" the v value inferred for the child? Are the Yang 2013 values calculated based on his expected overlap metric?

I wonder if the reason there were developmental changes found in the Speechome corpus is more about having enough data in the appropriate age range (i.e., < 2 years old). The other corpora had a much wider range of ages, and it could very well be that the ones that included younger-than-2-year-old data had older-age data included in the earliest developmental window investigated.

There's a claim made in the discussion that "no previous analysis has taken into account the input that individual children hear in judging whether their subsequent determiner usage has changed its productivity". I think what M&al2014 intend is something related to the explicit modeling of how much of the productions are imitated chunks, and if so, that seems completely fine (though one could argue that the Yang 2010 manuscript goes into quite some detail modeling this option). However, the way the current sentence reads, it seems a bit odd to say no previous analysis has cared about the input -- certainly Yang's metric can be used to assess productivity in child-directed speech utterances, which are the children's input. This is how a comparative analysis would presumably be made using Yang's metric.

Similarly, there's a claim near the end that the Bayesian analysis "makes inferences regarding developmental change of continuity in a single child possible". While it's true that this can be done with the Bayesian analysis, there seems to be an implicit claim that the other metrics can't do this. But I'm pretty sure it can also be done with the other metrics out there (e.g., Yang's). You basically apply the metric to data at multiple time points, and track the change, just as M&al2014 did here with the Bayesian metric.


~~~
References

Yang, C. 2013. Ontogeny and phylogeny of language. Proceedings of the National Academy of Sciences, 110(16). doi:10.1073/pnas.1216803110.

Thursday, January 9, 2014

Next time on 1/24/14 @ 3pm in SBSG 2221 = Meylan et al. 2014 Manuscript

It looks like the best collective time to meet will be Fridays at 3pm for this quarter, so that's what we'll plan on.  Our first meeting will be in a few weeks on January 24.  Our complete schedule is available on the webpage at 



On Jan 24, we'll be looking at an article that examines a formal metric to gauge productivity for grammatical categories, based on hierarchical Bayesian modeling.

UPDATE for Jan 24: Michael Frank was kind enough to provide us with an updated version of the 2013 paper (2013 version linked below), which they're intending to submit for journal publication. It's already received some outside feedback, and they'd be delighted to hear any thoughts we had on it.  Michael preferred that the manuscript not be posted publicly, however, so I've sent it around as an attachment to the mailing list.

Meylan, S., Frank, M. C., & Levy, R. 2013. Modeling the development of determiner productivity in children's early speech. Proceedings of the 35th Annual Meeting of the Cognitive Science Society.

http://www.socsci.uci.edu/~lpearl/colareadinggroup/readings/MeylanEtAl2013_Productivity.pdf

Categorical productivity is typically used to determine when the abstract knowledge that a category actually exists has been acquired (think "VERB exists, not just see and kiss and want! Woweee! Who knew?"), which is a fundamental building block for more complex linguistic knowledge.


I think the metric proposed in this session's article is particularly useful to compare and contrast against the metric that's been proposed recently by Yang (which is based on straight probability calculations), so I encourage you to have a look at that one as well:

Yang, C. 2013. Ontogeny and phylogeny of language. Proceedings of the National Academy of Sciences, 110(16). doi:10.1073/pnas.1216803110.



See you on Jan 24!

Wednesday, December 4, 2013

See you in the winter!

Thanks so much to everyone who was able to join us for our thoughtful, spirited discussion today, and to everyone who's joined us throughout the fall quarter! The CoLa Reading Group will resume again in the winter quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Monday, December 2, 2013

Some thoughts on Nematzadeh et al. 2013

So, I can start off by saying that there are many things about this paper that warmed the cockles of my heart.  First, I love that modeling is highlighted as an explanatory tool. To me, that's one of the best things about computational modeling - the ability to identify an explanation for observed behavior, in addition to being able to produce said behavior. I also love that psychological constraints and biases were being incorporated into the model. This is that algorithmic/process-level-style model that I really enjoy working with, since it focuses on the connection between the abstract representation of what's going on and what people actually are doing. Related to both of the above, I was very happy to see how the model made assumptions concrete and thus isolated (potential) explanatory factors within the model. Now, maybe we don't always agree with how an assumption has been instantiated (see the note on novelty below)- but at least we know it's an assumption and we can see that version of it in action. And that is definitely a good thing, in my (admittedly biased) opinion.

Some more specific thoughts:

I found the general behavioral result from Vlach et al. 2008 about the "spacing effect" to be interesting, where learning was better when items were distributed over a period of time rather than occurring one right after another. This is the opposite of "burstiness", which (I thought) is supposed to facilitate other types of learning (e.g., word segmentation). Maybe this has to do with the complexity of the thing being learned, or what existing framework there is for learning it (since I believe the Vlach et al. experiments were with adults)?

I thought the semantic representation of the scene as a collection of features was a nice step towards what the learner's representation probably is like (rather than just individual referent objects). When dealing with novel objects and more mature learners, this seems much more likely to me. On the other hand, I was a little fuzzy on how exactly the features and their feature weights were derived for the novel objects. (It's mentioned briefly in the Input Generation section that each word's true meaning is a vector of semantic features, but I missed completely how those are selected.)

Novelty: Nematzadeh et al. (N&al) implement novelty as an inverse function of recency. There's something obviously right about this, but I wonder about other definitions of novelty, like something that taps into overall frequency of this item's appearance (so, novel because it's fairly rare in the input). I'm not sure how this other definition (or a novelty implementation that incorporates both recency and overall frequency) would jibe with the experimental results N&al are trying to explain.
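
Just to make the contrast concrete, here's a hedged sketch of the two definitions; neither is N&al's actual equation (2), and the mixing weight alpha is an assumption of mine.

```python
# Hedged sketch of the two novelty notions above; neither is N&al's actual
# equation (2), and the mixing weight alpha is an assumption of mine.
def novelty_recency(t_now, t_last):
    """Novelty as an inverse function of recency: long-unseen items are more novel."""
    return 1.0 - 1.0 / (1.0 + (t_now - t_last))

def novelty_combined(t_now, t_last, count_so_far, total_so_far, alpha=0.5):
    """Mix recency-based novelty with rarity (low overall frequency in the input)."""
    rarity = 1.0 - count_so_far / max(total_so_far, 1)
    return alpha * novelty_recency(t_now, t_last) + (1.0 - alpha) * rarity

# A frequent word last seen long ago vs. a rare word seen recently come apart
# under the two definitions:
print(novelty_recency(100, 10), novelty_combined(100, 10, 40, 50))   # frequent, not recent
print(novelty_recency(100, 95), novelty_combined(100, 95, 1, 50))    # rare, recent
```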

Technical side note, related to the above: I had some trouble interpreting equation (2) - is the difference between t and t_last(w) a fraction of some kind? Maybe because time is measured in minutes, but the presentation durations are in seconds? Otherwise, novelty could become negative, which seems a bit weird.


I was thinking some about the predictions of the model, based on figure 4 and the discussion following it, where N&al are trying to make the model replicate certain experimental results. I think their model would predict that if learners had longer to learn the simplest condition (2 x 2), i.e., the duration of presentation was longer so the semantic representations didn't decay so quickly, that condition should then be the one best learned. That is, the "desirable difficulty" benefit is really about how memory decay doesn't happen so quickly for the 3 x 3 condition, as compared to the 2 x 2 condition.

I found it incredibly interesting that the behavioral experiment Vlach & Sandhofer 2010 (V&S) conducted just happened to have exactly the right item spacing/ordering/something else to yield the interesting results they found, but other orderings of those same items would be likely to yield different (perhaps less interesting) results. You sort of have to wonder how V&S happened upon just the right order - good experiment piloting, I guess?  Though at the end of the discussion section, N&al seem to back off from claiming it's all about the order of item presentation, since none of the obvious variables potentially related to order (average spacing, average time since last presentation, average context familiarity) seemed to correlate with the output scores.

Wednesday, November 20, 2013

Next time on 12/4/13 @ 2:30pm in SBSG 2221 = Nematzadeh et al. 2013

Thanks to everyone who was able to join us for our feisty and thoughtful discussion of Lewis & Frank 2013! Next time on December 4 at 2:30pm in SBSG 2221, we'll be looking at an article that explores the kinds of difficulties in word-learning that can paradoxically help long-term learning and why they help, using a computational modeling approach:

Nematzadeh, A., Fazly, A., & Stevenson, S. 2013. Desirable Difficulty in Learning: A Computational Investigation. Proceedings of the 35th Annual Meeting of the Cognitive Science Society.


See you then!

Monday, November 18, 2013

Some thoughts on Lewis & Frank 2013

I'm always a fan of learning models that involve solving different problems simultaneously, with the idea of leveraging information from one problem to help solve the other (Feldman et al. 2013 and Dillon et al. 2013 are excellent examples of this, IMHO). For Lewis & Frank (L&F), the two problems are related to word learning: how to pick the referent from a set of referents and how to pick which concept class that referent belongs to (which they relate to how to generalize that label appropriately).  I have to say that I struggled to understand how they incorporated the second problem, though -- it doesn't seem like the concept generalization w.r.t. subordinate vs. superordinate classes maps in a straightforward way to the feature analysis they're describing.  (More on this below.) I was also a bit puzzled by their assumption of where the uncertainty in learning originates from and the link they describe between what they did and the origin/development of complex concepts (more on these below, too).

On generalization & features:  If we take the example in their Figure 1, it seems like the features could be something like f1 = "fruit", f2 = "red", and f3 = "apple". The way they talk about generalization is as underspecification of feature values, which feels right.  So if we say f1 is the only important feature, then this corresponds nicely to the idea of "fruit" as a superordinate class.  But what if we allow f2 to be the important feature? Is "red" the superordinate class of "red" things?  Well, in a sense, I suppose. But this falls outside of the noun-referent system that they're working in - "red" spans many referents, because it's a property.  Maybe this is my misunderstanding in trying to map this whole thing to subordinate and superordinate classes, like Xu & Tenenbaum 2007 talk about, but it felt like that's what L&F intended, given the model in Figure 2 that's grounded in Objects at the observable level and the behavioral experiment they actually ran.

On where the uncertainty comes from: L&F mention in the Design of the Model section that the learning model assumes "the speaker could in principle have been mistaken about their referent or misspoken". From a model building perspective, I understand that this is easier to incorporate and allows graded predictions (which are necessary to match the empirical data), but from the cognitive perspective, this seems really weird to me. Do we have reason to believe children assume their speakers are unreliable? I was under the impression children assume their speakers are reliable as a default. Maybe there's a better place to work this uncertainty in - approximate inference from a sub-optimal learner or something like that. Also, as a side note, it seems really important to understand how the various concepts/features are weighted by the learner. Maybe that's where uncertainty could be worked in at the computational level.

On the origin/development of concepts: L&F mention in the General Discussion that "the features are themselves concepts that can be considered as primitives in the construction of more complex concepts", and then state that their model "describes how a learner might bootstrap from these primitives to infer more and complex concepts". This sounds great, but I was unclear how exactly to do that. Taking the f1, f2, and f3 from above, for example, I get that those are primitive features. So the concepts are then things that can be constructed out of some combination of their values (whether specified or unspecified)? And then where does the development come in? Where is the combination (presumably novel) that allows the construction of new features? I understand that these could be the building units for such a model, but I didn't see how the current model shows us something about that.

Behavioral experiment implementation: I'm definitely a fan of matching a model to controlled behavioral data, but I wonder about the specific kind of labeling they gave their subjects. It seems like they intended "dax bren nes" to be the label for one object shown (it's just unclear which it is - but basically, this might as well be a trisyllabic word "daxbrennes" ). This is a bit different from standard cross-situational experiments, where multiple words are given for multiple objects. Given that subjects are tested with that same label, I guess the idea is that it simplifies the learning situation.

Results:  I struggled a bit to decipher the results in Figure 5 - I'm assuming the model predictions are for the different experimental contexts, ordered by human uncertainty about how much to generalize to the superordinate class. Is the lexicon posited by the model just how many concepts to map to "dax-bren-nes", where concept = referent?

~~~
References

Dillon, B., Dunbar, E., & Idsardi, W. 2013. A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37, 344-377.

Feldman, N. H., Griffiths, T. L., Goldwater, S., & Morgan, J. L. 2013. A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4), 751-778.

Xu, F., & Tenenbaum, J. 2007. Word Learning as Bayesian Inference.  Psychological Review, 114(2), 245-272.



Wednesday, November 6, 2013

Next time on 11/20/13 @ 2:30pm in SBSG 2221 = Lewis & Frank 2013

Thanks to everyone who was able to join us for our vigorous and thoughtful discussion of Marcus & Davis 2013! Next time on November 20 at 2:30pm in SBSG 2221, we'll be looking at an article that discusses how to solve two problems related to word learning simultaneously, using hierarchical Bayesian modeling and evaluating against human behavioral data:

Lewis, M. & Frank. M. 2013. An integrated model of concept learning and word-concept mapping.Proceedings of the 35th Annual Meeting of the Cognitive Science Society.



See you then!

Monday, November 4, 2013

Some thoughts on Marcus & Davis (2013)

(...and a little also on Jones & Love 2011)

One of the things that struck me about Marcus & Davis (2013) [M&D] is that they seem to be concerned with identifying what the priors are for learning. But what I'm not sure of is how you distinguish the following options:

(a) sub-optimal inference over optimal priors
(b) optimal inference over sub-optimal priors
(c) sub-optimal inference over sub-optimal priors

M&D seem to favor option (a), but I'm not sure there's an obvious reason to do so. Jones & Love 2011 [J&L] mention the possibility of "bounded rationality", which is something like "be as optimal as possible in your inference, given the prior and the processing limitations you have". That sounds an awful lot like (c), and seems like a pretty reasonable option to explore. The concern in general with what the priors are actually dovetails quite nicely with traditional linguistic explorations of how to define (constrain) the learner's hypothesis space appropriately to make successful inference possible. Also, J&L are quite aware of this too, and underscore the importance of selecting the priors appropriately.

That being said, no matter what priors and inference processes end up working, there's clear utility in being explicit about all the assumptions that yield a match to human behavior, which M&D want (and I'm a huge fan of this myself: see my commentary on a recent article here where I happily endorse this). Once you've identified the necessary pieces that make a learning strategy work, you can then investigate (or at least discuss) which assumptions are necessarily optimal.  That may not be an easy task, but it seems like a step in the right direction.

M&D seem to be unhappy with probabilistic models as a default assumption - and okay, that's fine. But it does seem important to recognize that probabilistic reasoning is a legitimate option. And maybe some of cognition is probabilistic and some isn't - I don't think there's a compelling reason to believe that cognition has to be all one or all the other. (I mean, after all, cognition is made up of a lot of different things.) In this vein, I think a reasonable thing that M&D would like is for us to not just toss out non-probabilistic options that work really well solely because they're non-probabilistic.

On a related note, I very much agree with one of the last things M&D note, which is that we should be explicit about "what would constitute evidence that a probabilistic approach is not appropriate for a particular task or domain".  I'm not sure myself what that evidence would look like, since even categorical behavior can be simulated by a probabilistic model that just thresholds. Maybe if it's more "economical" (however we define that) to not have a probabilistic model, and there exists a non-probabilistic model that accomplishes the same thing?
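
Here's the thresholding point as a minimal sketch (purely illustrative, not any particular model from M&D or the papers they critique): the internal quantities are graded, but the overt responses are perfectly categorical.

```python
# Minimal sketch (purely illustrative): graded posterior beliefs, categorical
# overt responses via thresholding.
import math

def posterior_yes(evidence, prior=0.5):
    """Toy posterior probability of a 'yes' response, logistic in the evidence."""
    return 1.0 / (1.0 + math.exp(-(evidence + math.log(prior / (1.0 - prior)))))

def response(evidence, threshold=0.5):
    """Thresholding the graded posterior yields all-or-nothing behavior."""
    return "yes" if posterior_yes(evidence) > threshold else "no"

for e in (-2.0, -0.1, 0.1, 2.0):
    print(e, round(posterior_yes(e), 2), response(e))
```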

~~~
A few comments about Jones & Love 2011 [J&L]:

J&L seem very concerned with the recent focus in the Bayesian modeling world on existence proofs for various aspects of cognition.  They do mention later in their article (around section 6, I think), that existence proofs are a useful starting point, however -- they just don't want research to stop there. An existence proof that a Bayesian learning strategy can work for some problem should be the first step for getting a particular theory on the table as a real possibility worth considering (e.g., whatever's in the priors for that particular learning strategy that allowed Bayesian inference to succeed, as well as the Bayesian inference process itself).

Overall, J&L seem to make a pretty strong call for process models (i.e., algorithmic-level models, instead of just computational-level models). Again, this seems like a natural follow-up once you have a computational-level model you're happy with.  So the main point is simply not to rest on your Bayesian inference laurels once you have your existence proof at the computational level for some problem in cognition.  The Chater et al. 2011 commentary to J&L note that many Bayesian modelers are moving in this direction already, creating "rational process" models.

~~~
References

Chater, N., Goodman, N., Griffiths, T., Kemp, C., Oaksford, M., & Tenenbaum, J. 2011. The imaginary fundamentalists: The unshocking truth about Bayesian cognitive science. Behavioral and Brain Sciences, 34 (4), 194-196.

Jones, M. & Love, M. 2011. Bayesian Fundamentalism or Enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34 (4), 169-188.

Pearl, L. 2013. Evaluating strategy components: Being fair.  [lingbuzz]

Wednesday, October 23, 2013

Next time on 11/6/13 @ 2:30pm in SBSG 2221 = Marcus & Davis 2013


Thanks to everyone who was able to join us for our lively and informative discussion of Ambridge et al. (in press)! Next time on November 6 at 2:30pm in SBSG 2221, we'll be looking at an article that discusses how probabilistic models of higher-level cognition (including language) are used in cognitive science:

Marcus, G. & Davis, E. 2013. How Robust Are Probabilistic Models of Higher-Level Cognition? Psychological Science, published online Oct 1, 2013, doi:10.1177/095679761349541.

I would also strongly recommend a target article and commentary related to this topic that were written fairly recently:

Jones, M. & Love, M. 2011. Bayesian Fundamentalism or Enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34 (4), 169-188.

Chater, N., Goodman, N., Griffiths, T., Kemp, C., Oaksford, M., & Tenenbaum, J. 2011. The imaginary fundamentalists: The unshocking truth about Bayesian cognitive science. Behavioral and Brain Sciences, 34 (4), 194-196.


(Both target article and commentary are included in the pdf file linked above.)

See you then!

Monday, October 21, 2013

Some thoughts on Ambridge et al. in press

This article really hit home for me, since it talks about things I worry about a fair bit with respect to Universal Grammar and language learning in general -- so much so, that I ended up writing a lot more about it than I typically do for the articles we read. Conveniently, this is a target article that's asking for commentaries, so I'm going to put some of my current thoughts here as a sort of teaser for the commentary I plan to submit.

~~~


The basic issue that the authors (AP&L) highlight about proposed learning strategies seems exactly right: What will actually work, and what exactly makes it work? They note that “…nothing is gained by positing components of innate knowledge that do not simplify the problem faced by language learners” (p.56, section 7.0), and this is absolutely true. To examine how well several current learning strategy proposals that involve innate linguistic knowledge actually work, AP&L present evidence from a commendable range of linguistic phenomena, from what might be considered fairly fundamental knowledge (e.g., grammatical categories) to fairly sophisticated knowledge (e.g., subjacency and binding). In each case, AP&L identify the shortcomings of some existing Universal Grammar (UG) proposals, and observe that these proposals don’t seem to fare very well in realistic scenarios. The challenge at the very end underscores this -- AP&L contend (and I completely agree) that a learning strategy proposal involving innate knowledge needs to show “precisely how a particular type of innate knowledge would help children acquire X” (p.56, section 7.0).

More importantly, I believe this should be a metric that any component of a learning strategy is measured by.  Namely, for any component (whether innate or derived, whether language-specific or domain-general), we need to not only propose that this component could help children learn some piece of linguistic knowledge but also demonstrate at least “one way that a child could do so” (p.57, section 7.0). To this end, I think it's important to highlight how computational modeling is well suited for doing precisely this: for any proposed component embedded in a learning strategy, modeling allows us to empirically test that strategy in a realistic learning scenario. It’s my view that we should test all potential learning strategies, including the ones AP&L themselves propose as alternatives to the UG-based ones they find lacking.  An additional and highly useful benefit of the computational modeling methodology is that it forces us to recognize hidden assumptions within our proposed learning strategies, a problem that AP&L rightly recognize with many existing proposals.

This leads me to suggest certain criteria that any learning strategy should satisfy, relating to its utility in principle and practice, as well as its usability by children. Once we have a promising learning strategy that satisfies these criteria, we can then concern ourselves with the components comprising that strategy.  With respect to this, I want to briefly discuss the type of components AP&L find unhelpful, since several of the components they would prefer might still be reasonably classified as UG components. The main issue they have is not with components that are innate and language-specific, but rather components of this kind that in addition involve very precise knowledge. This therefore does not rule out UG components that involve more general knowledge, including (again) the components AP&L themselves propose. In addition, AP&L ask for explicit examples of UG components that actually do work. I think one potential UG component that’s part of a successful learning strategy for syntactic islands (described in Pearl & Sprouse 2013) is a nice example of this: the bias to characterize wh-dependencies at a specific level of granularity. It's not obvious where this bias would come from (i.e., how it would be derived or what innate knowledge would lead to it), but it's crucial for the learning strategy it's a part of to work. As a bonus, that learning strategy also satisfies the criteria I suggest for evaluating learning strategies more generally (utility and usability).

~~~
Reference:

Pearl, L., & Sprouse, J. 2013. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20, 23-68.

Tuesday, October 1, 2013

Next time on 10/23/13 @ 2:30pm in SBSG 2221 = Ambridge et al. in press


It looks like the best collective time to meet will be Wednesdays at 2:30pm for this quarter, so that's what we'll plan on.  Due to some of my own scheduling conflicts, our first meeting will be in a few weeks on October 23.  Our complete schedule is available on the webpage at 


On Oct 23, we'll be looking at an article that examines the utility of Universal Grammar based learning strategies in several different linguistic domains, arguing that they're not all that helpful at the moment:

Ambridge, B., Pine, J., & Lieven, E. 2013 in press. Child language acquisition: Why Universal Grammar doesn't help. Language.




See you then!

Wednesday, September 25, 2013

Fall quarter planning


I hope everyone's had a good summer break - and now it's time to gear up for the fall quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading group webpage, including readings on Universal Grammar, Bayesian modeling, and word learning:

http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html

Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here).

Thanks and see you soon!

Friday, June 7, 2013

Some thoughts on Parisien and Stevenson 2010


Overall, this paper is concerned with the extent to which children possess abstract knowledge of syntax, and more specifically, children’s ability to acquire generalizations about verb alternations. The authors present two models for the purpose of illustrating that information relevant to verb alternations can be acquired through observations of how verbs occur with individual arguments in the input.

My main point of confusion in this article was and still is about the features used to represent the lowest level of abstraction in the models. The types of features used seem to me to already assume a lot of prior abstract syntactic knowledge… The authors state, “We make the assumption that children at this developmental stage can distinguish various syntactic arguments in the input, but may not yet recognize recurring patterns such as transitive and double-object constructions”, but this assumption still does not quite make sense to me. In order to have a feature such as “OBJ”, don’t you have to have some abstract category for objects? Some abstract representation of what it means to be an object? This seems like more than just a general chunking of the input into constituents because for something to be an object, it has to be in a specific relationship with a verb. So how can you have this feature without already having abstract knowledge of the relationship of the object to the verb? If this type of generalized knowledge is not what is meant, maybe it is just the labels given to these features that bother me. It seems to me that once a learner has figured out what type each constituent is (OBJ, OBJ2, COMP, PP, etc.), the problem of learning generalizations of constructions becomes simple – just find all the verbs that have OBJ and OBJ2 after them and put them into a category together. Even after reading this article twice and discussing it with the class, I am still really missing something essential about the logic behind this assumption.

A few points regarding verb argument preferences:
  1. The comparison of the two models in the results for verb argument preferences seems completely unsurprising… Is this not what Model 1 was made to do? If so, then I would not expect any added benefit from Model 2, but it is unclear what the authors’ expectations were regarding this result.
  2. What is the point of comparing two very similar constructions (prepositional dative and benefactive)? The only difference between these two is the preposition used, so being able to distinguish one from the other does not require abstract syntactic knowledge… as far as I can tell, the differences occur at the phonological level and at the semantic level.
  3. I am curious about the fact that both models acquired approximately 20 different constructions… What were these other constructions and why did they only look at the datives? 
A few points regarding novel verb generalization:
  1. I found the comparison of the two models in the results for novel verb generalization to be rather difficult to interpret… In particular, I think organizing the graph in a different way could have made it much more visually interpretable – one in which the bars for model 1 and model 2 were side-by-side on the same graph rather than on separate graphs displayed one above the other. I also would have liked some discussion of the significance of the differences discussed – They say that in comparing Model 2 with Model 1, the PD frame is now more likely than the SC frame, although only slightly. Perhaps just because I’m not used to looking at log likelihood graphs, it is unclear to me whether this difference is significant enough to even bother mentioning because it is barely noticeable on the graph.
  2. On the topic of the behavior observed in children, the authors note that high-frequency verbs tend to be biased toward the double-object form. However, children tend to be biased toward the prepositional dative form. But even in the larger corpus, only about half of the verbs are prepositional-biased, and it is suggested that these are low frequency. So, what is a potential explanation for the observed bias in children? Why would they be biased toward the prepositional dative form if it is the low-frequency verbs that are biased this way? This doesn’t make intuitive sense if children are doing some sort of pattern-matching. I would expect children to behave like the model – to more closely match the biases of the high-frequency verbs and therefore prefer to generalize to the double-object construction from the prepositional dative. I think that rather than simply running the model on a larger corpus, it would be useful to construct a strong theory for why children might have this bias and then construct a model that is able to test that theory.




Thursday, June 6, 2013

Some thoughts on Carlson et al. 2010

I really liked how this paper tackled a really big problem head-on. Its inclusion in subsequent works speaks strongly for the interest in this kind of research. I would really like to see more language papers set a high bar like this and establish a framework for achieving it.

My largest concern about this paper is the fact that the authors seemed to feel that human-guided learning can overcome some of the deficits in the model framework. The large drop off in precision (from 90% to 57%) is not surprising as methods such as the Coupled SEAL and Coupled Morphological Classifier are not robust in the face of locally optimal solutions; it is inevitable that as more and more data is added, the fitness will decline, because the models are already anchored to their fit of previous data. Errors will beget errors, and human intervention will only limit this inherent multiplication.

These errors are further compounded by the fact that the framework does not take into account the degree of independence between its various models. Using group and individual model thresholds for decision making is a decent heuristic, but it is unworkable as an architecture because guaranteeing each model's independence is a hard constraint on the number and types of models that can be used. I believe the framework would be better served by combining the underlying information in a proper, hierarchical framework. By including more models that can inform each other, perhaps the necessity of human-supervised learning can be kept to a minimum.

Tuesday, May 28, 2013

Have a good summer, and see you in the fall!


Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us this past academic year!

The CoLa Reading Group will be on hiatus this summer, and we'll resume again in the fall quarter.  As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Friday, May 24, 2013

Some thoughts on Kwiatkowski et al. 2012

One of the things I really enjoyed about this paper was that it was a much fuller syntax & semantics system than anything I've seen in awhile, which means we get to see the nitty gritty in the assumptions that are required to make it all work. Having seen the assumptions, though, I did find it a little unfair for the authors to claim that no language-specific knowledge was required - as far as I could tell, the "language-universal" rules between syntax and semantics at the very least seem to be a language-specific kind of knowledge (in the sense of domain-specific vs. domain-general). In this respect, whatever learning algorithms they might explore, the overall approach seems similar to other learning models I've seen that are predicated on very precise theoretical linguistic knowledge (e.g., the parameter-setting systems of Yang 2002, Sakas & Fodor 2001, Niyogi & Berwick 1996, Gibson & Wexler 1994, among others.) It just so happens here that CCG assumes different primitives/principles than those other systems - but domain-specific primitives/principles are still there a priori.

Getting back to the semantic learning - I'm a big fan of them learning words besides nouns, and connecting with the language acquisition behavioral literature on syntactic bootstrapping and fast mapping.  That being said, the actual semantics they seemed to learn was a bit different than what I think the fast mapping people generally intend.  In particular, if we look at Figure 5, while three different quantifier meanings are learned, it's more about the form the meaning takes, rather than the actual lexical meaning of the word (i.e., the form for a, another, and any looks identical, so any differences in meaning are not recognized, even though these words clearly do differ in meaning). I think lexical meaning is what people are generally talking about for fast mapping, though. What this seems like is almost grammatical categorization, where knowing the grammatical category means you know the general form the meaning will have (due to those linking rules between syntactic category and semantic form) rather than the precise meaning - that's very in line with syntactic bootstrapping, where the syntactic context might point you towards verb-y meanings or preposition-y meanings, for example.

More specific thoughts:

I found it interesting that the authors wanted to explicitly respond to a criticism that statistical learning models can't generate sudden step-like behavior changes.  I think it's certainly an unspoken view by many in linguistics that statistical learning implies more gradual learning (which was usually seen as a bonus, from what I understood, given how noisy data are). It's also unclear to me that the data taken as evidence for step-wise changes really reflect a step-wise change or instead only seem to be step-wise because of how often the samples were taken and how much learning happened in between.  It's interesting that the model here can generate it for learning word order (in Figure 6), though I think the only case that really stands out for me is the 5 meaning example, around 400 utterances.

I could have used a bit more unpacking of the CCG framework in Figure 2. I know there were space limitations, but the translation from semantic type to the example logical form wasn't always obvious to me. For example, the first and last examples (S_dcl and PP) have the same semantic type but not the same lambda calculus form. Is the semantic type what's linked to the syntactic category (presumably), and then there are additional rules for how to generate the lambda form for any given semantic type?

This provides a nice example where the information that's easily available in dependency structures appears more useful, since the authors describe (in section 6) how they created a deterministic procedure for using the primitive labels in the dependency structures to create the lambda forms. (Though as a side note, I was surprised how this mapping only worked for a third of the child-directed speech examples, leaving out not only fragments but also imperatives and nouns with prepositional phrase modifiers. I guess it's not unreasonable to try to first get your system working on a constrained subset of the data, though.)

I wish they had told us a bit more about the guessing procedure they used for parsing unseen utterances, since it had a clear beneficial impact throughout the learning period. Was it random (and so guessing at all was better than not, since sometimes you'd be right as opposed to always being penalized for not having a representation for a given word)?  Was it some kind of probabilistic sampling?  Or maybe just always picking the most probable hypothesis?




Wednesday, May 15, 2013

Some thoughts on Frank et al. 2010

So what I liked most about this article was the way in which they chose to explore the space of possibilities in a very computational-level way. I think this is a great example of what I'd like to see more of. As someone also interested in cross-linguistic viability for our models, I have to also commend them for testing on not just one foreign language, but on three.

So there were a number of aspects of the model that I think could have been more clearly specified. For instance, I don't believe they ever explicitly say that the model presumes knowledge of the number of states to be learned. Actual infants presumably don't get that number handed to them, so it would be nice to know what would happen if you inferred it from the data. It turns out there's a well-specified model to do that, but I'll get to that later. Another problem with their description of the model has to do with how their hyperparameters are sampled. They apparently simplify the process by resampling only once per iteration of the Gibbs sampler. I'm happy with this, although I'm going to assume it was a typo that they say they run their model for 2000 iterations (Goldwater seems to prefer 20,000). Gibbs samplers tend to converge more slowly on time-dependent models, so it would be nice to have some evidence that the sampler has actually converged. Splitting the data by sentence type seems to increase the size of their confidence intervals by quite a lot, which may be an artifact of having less data per parameter, but could also be due to a lack of convergence.
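
For what it's worth, the kind of convergence evidence I'd want is cheap to produce: run a few chains, log a scalar summary each iteration (e.g., the log joint probability of the current state assignments), and report something like the Gelman-Rubin statistic. The sketch below uses simulated traces, not output from the paper's model.

```python
# Cheap convergence check: run several chains, log a scalar summary per Gibbs
# iteration, and compute the Gelman-Rubin statistic on the post-burn-in traces.
# The traces below are simulated stand-ins, not output from the paper's model.
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) over an (n_chains, n_iters) array."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(3)
# three fake post-burn-in traces centered on (nearly) the same value
chains = rng.normal(loc=[[0.0], [0.05], [-0.05]], scale=1.0, size=(3, 2000))
print("R-hat:", round(gelman_rubin(chains), 3))   # values near 1 suggest convergence
```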

Typically I have to chastise modelers who attempt to use VI or V-measure, but fortunately they are not doing anything technically wrong here. They are correct in that comparing these scores across corpora is hazardous at best. Both of these measures are biased: VI prefers small numbers of tags and V-measure prefers large numbers of tags (they claim at some point that it is "invariant" to different numbers of tags, but this is not true!). It turns out that another measure, V-beta, is more useful than either of these two in that it is unbiased with respect to the number of categories. So there's my rant about the wonders of V-beta.
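
Since V-beta isn't in every toolkit, here's how I'd compute the generalized V score from homogeneity and completeness. The sklearn scores are real; tying beta to the ratio of induced to gold categories is my paraphrase of the V-beta idea, so treat it as a sketch rather than the official recipe.

```python
# Generalized V score from homogeneity and completeness. The sklearn scores are
# real; the particular choice of beta below (ratio of induced to gold category
# counts, to counteract the cluster-count bias) is my paraphrase of V-beta.
from sklearn.metrics import completeness_score, homogeneity_score

def v_beta(gold, induced, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness (beta = 1 is V-measure)."""
    h = homogeneity_score(gold, induced)
    c = completeness_score(gold, induced)
    return (1 + beta) * h * c / (beta * h + c) if (h + c) > 0 else 0.0

gold    = [0, 0, 0, 1, 1, 1, 2, 2, 2]        # gold tags
induced = [0, 0, 1, 1, 2, 2, 3, 3, 4]        # an over-clustered induced tagging
beta = len(set(induced)) / len(set(gold))    # weight completeness more when over-clustering
print(round(v_beta(gold, induced), 3), round(v_beta(gold, induced, beta=beta), 3))
```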

What I really would have liked to see would be an infinite HMM for this data, which is a well-specified, very similar model which can infer the number of grammatical categories in the data. It has an efficient sampler (as of 2008) so there's no reason they couldn't run that model over their corpus. It's very useful for us to know what the space of possibilities is, but to what extent would their results change if they gave up the assumption that you knew from the get-go how many categories there were? There's really no reason they couldn't run it and I'd be excited to see how well it performed.

The one problem with the models they show here as well as the IHMM is that neither allows for there to be shared information about transition probabilities or emission probabilities (depending on the model) across sentence types. They're treated as entirely different. They mention this in their conclusion, but I wonder if there's any way to share that information in a useful way without hand coding it somehow.

Overall, I'm really happy someone is doing this. I liked the use of some very salient information to help tackle a hard problem, but I would've liked to have seen it a little more realistic by inferring the number of grammatical categories. I might've also liked to have seen better evidence of convergence (perhaps a beam sampler instead of Gibbs, at the very least I hope they ran it for more than 2000 iterations).