Monday, November 4, 2013

Some thoughts on Marcus & Davis (2013)

(...and a little also on Jones & Love 2011)

One of the things that struck me about Marcus & Davis (2013) [M&D] is that they seem to be concerned with identifying what the priors are for learning. But what I'm not sure of is how you distinguish the following options:

(a) sub-optimal inference over optimal priors
(b) optimal inference over sub-optimal priors
(c) sub-optimal inference over sub-optimal priors

M&D seem to favor option (a), but I'm not sure there's an obvious reason to do so. Jones & Love 2011 [J&L] mention the possibility of "bounded rationality", which is something like "be as optimal as possible in your inference, given the prior and the processing limitations you have". That sounds an awful lot like (c), and seems like a pretty reasonable option to explore. The general concern with what the priors actually are dovetails quite nicely with traditional linguistic explorations of how to define (constrain) the learner's hypothesis space appropriately to make successful inference possible. J&L are quite aware of this too, and underscore the importance of selecting the priors appropriately.
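
To make the three options above concrete, here's a minimal beta-binomial sketch (my own toy, not anything from M&D or J&L): the same data can be run through exact ("optimal") or resource-limited ("sub-optimal") inference, paired with a prior that either does or doesn't match how the data were actually generated. All the specific numbers (the true bias, the priors, the 20% retention rate) are made up purely for illustration.

    # Toy illustration of options (a)-(c): which piece is "sub-optimal"?
    import random

    random.seed(0)
    data = [1 if random.random() < 0.7 else 0 for _ in range(50)]  # coin with true bias 0.7

    def posterior_mean(data, prior_a, prior_b):
        """Exact Bayesian posterior mean under a Beta(prior_a, prior_b) prior."""
        heads = sum(data)
        return (prior_a + heads) / (prior_a + prior_b + len(data))

    def limited_posterior_mean(data, prior_a, prior_b, keep=0.2):
        """'Bounded' inference: the same update rule, but only a random 20% of the
        data actually gets used (a stand-in for memory/processing limits)."""
        kept = [x for x in data if random.random() < keep]
        return posterior_mean(kept, prior_a, prior_b)

    good_prior = (7, 3)   # roughly matched to how the data were generated
    bad_prior = (1, 9)    # strongly expects a low bias

    print("(a) sub-optimal inference over optimal priors:    ", limited_posterior_mean(data, *good_prior))
    print("(b) optimal inference over sub-optimal priors:    ", posterior_mean(data, *bad_prior))
    print("(c) sub-optimal inference over sub-optimal priors:", limited_posterior_mean(data, *bad_prior))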

That being said, no matter what priors and inference processes end up working, there's clear utility in being explicit about all the assumptions that yield a match to human behavior, which is what M&D want (and I'm a huge fan of this myself: see my commentary on a recent article here where I happily endorse this). Once you've identified the necessary pieces that make a learning strategy work, you can then investigate (or at least discuss) which of those assumptions are actually optimal and which aren't. That may not be an easy task, but it seems like a step in the right direction.

M&D seem to be unhappy with probabilistic models as a default assumption - and okay, that's fine. But it does seem important to recognize that probabilistic reasoning is a legitimate option. And maybe some of cognition is probabilistic and some isn't - I don't think there's a compelling reason to believe that cognition has to be all one or all the other. (I mean, after all, cognition is made up of a lot of different things.) In this vein, I think a reasonable thing that M&D would like is for us to not just toss out non-probabilistic options that work really well solely because they're non-probabilistic.

On a related note, I very much agree with one of the last things M&D note, which is that we should be explicit about "what would constitute evidence that a probabilistic approach is not appropriate for a particular task or domain".  I'm not sure myself what that evidence would look like, since even categorical behavior can be simulated by a probabilistic model that just thresholds. Maybe if it's more "economical" (however we define that) to not have a probabilistic model, and there exists a non-probabilistic model that accomplishes the same thing?
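
As a tiny illustration of the thresholding point (the numbers here are entirely made up, just to show the shape of the problem): a model with graded, probabilistic internals can still produce all-or-nothing responses once a decision threshold sits on top of it, so categorical-looking behavior by itself doesn't rule the probabilistic machinery out.

    def posterior_prob_grammatical(evidence_strength):
        """Stand-in for some probabilistic model's graded output in [0, 1]."""
        return evidence_strength  # imagine this came from a real model

    def categorical_response(evidence_strength, threshold=0.5):
        """A decision rule on top of the graded output yields categorical behavior."""
        return "acceptable" if posterior_prob_grammatical(evidence_strength) >= threshold else "unacceptable"

    for strength in [0.1, 0.4, 0.49, 0.51, 0.6, 0.9]:
        print(strength, "->", categorical_response(strength))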

~~~
A few comments about Jones & Love 2011 [J&L]:

J&L seem very concerned with the recent focus in the Bayesian modeling world on existence proofs for various aspects of cognition.  They do mention later in their article (around section 6, I think) that existence proofs are a useful starting point, however -- they just don't want research to stop there. An existence proof that a Bayesian learning strategy can work for some problem should be the first step for getting a particular theory on the table as a real possibility worth considering (e.g., whatever's in the priors for that particular learning strategy that allowed Bayesian inference to succeed, as well as the Bayesian inference process itself).

Overall, J&L seem to make a pretty strong call for process models (i.e., algorithmic-level models, instead of just computational-level models). Again, this seems like a natural follow-up once you have a computational-level model you're happy with.  So the main point is simply not to rest on your Bayesian inference laurels once you have your existence proof at the computational level for some problem in cognition.  The Chater et al. 2011 commentary on J&L notes that many Bayesian modelers are moving in this direction already, creating "rational process" models.

~~~
References

Chater, N., Goodman, N., Griffiths, T., Kemp, C., Oaksford, M., & Tenenbaum, J. 2011. The imaginary fundamentalists: The unshocking truth about Bayesian cognitive science. Behavioral and Brain Sciences, 34 (4), 194-196.

Jones, M. & Love, B. 2011. Bayesian Fundamentalism or Enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34 (4), 169-188.

Pearl, L. 2013. Evaluating strategy components: Being fair.  [lingbuzz]

Wednesday, October 23, 2013

Next time on 11/6/13 @ 2:30pm in SBSG 2221 = Marcus & Davis 2013


Thanks to everyone who was able to join us for our lively and informative discussion of Ambridge et al. (in press)! Next time on November 6 at 2:30pm in SBSG 2221, we'll be looking at an article that discusses how probabilistic models of higher-level cognition (including language) are used in cognitive science:

Marcus, G. & Davis, E. 2013. How Robust Are Probabilistic Models of Higher-Level Cognition? Psychological Science, published online Oct 1, 2013, doi:10.1177/095679761349541.

I would also strongly recommend a target article and commentary related to this topic that were written fairly recently:

Jones, M. & Love, B. 2011. Bayesian Fundamentalism or Enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34 (4), 169-188.

Chater, N., Goodman, N., Griffiths, T., Kemp, C., Oaksford, M., & Tenenbaum, J. 2011. The imaginary fundamentalists: The unshocking truth about Bayesian cognitive science. Behavioral and Brain Sciences, 34 (4), 194-196.


(Both target article and commentary are included in the pdf file linked above.)

See you then!

Monday, October 21, 2013

Some thoughts on Ambridge et al. in press

This article really hit home for me, since it talks about things I worry about a fair bit with respect to Universal Grammar and language learning in general -- so much so, that I ended up writing a lot more about it than I typically do for the articles we read. Conveniently, this is a target article that's asking for commentaries, so I'm going to put some of my current thoughts here as a sort of teaser for the commentary I plan to submit.

~~~


The basic issue that the authors (AP&L) highlight about proposed learning strategies seems exactly right: What will actually work, and what exactly makes it work? They note that “…nothing is gained by positing components of innate knowledge that do not simplify the problem faced by language learners” (p.56, section 7.0), and this is absolutely true. To examine how well several current learning strategy proposals that involve innate, linguistic knowledge actually work, AP&L present evidence from a commendable range of linguistic phenomena, from what might be considered fairly fundamental knowledge (e.g., grammatical categories) to fairly sophisticated knowledge (e.g., subjacency and binding). In each case, AP&L identify the shortcomings of some existing Universal Grammar (UG) proposals, and observe that these proposals don’t seem to fare very well in realistic scenarios. The challenge at the very end underscores this -- AP&L contend (and I completely agree) that a learning strategy proposal involving innate knowledge needs to show “precisely how a particular type of innate knowledge would help children acquire X” (p.56, section 7.0).

More importantly, I believe this should be a metric that any component of a learning strategy is measured by.  Namely, for any component (whether innate or derived, whether language-specific or domain-general), we need to not only propose that this component could help children learn some piece of linguistic knowledge but also demonstrate at least “one way that a child could do so” (p.57, section 7.0). To this end, I think it's important to highlight how computational modeling is well suited for doing precisely this: for any proposed component embedded in a learning strategy, modeling allows us to empirically test that strategy in a realistic learning scenario. It’s my view that we should test all potential learning strategies, including the ones AP&L themselves propose as alternatives to the UG-based ones they find lacking.  An additional and highly useful benefit of the computational modeling methodology is that it forces us to recognize hidden assumptions within our proposed learning strategies, a problem that AP&L rightly recognize with many existing proposals.

This leads me to suggest certain criteria that any learning strategy should satisfy, relating to its utility in principle and practice, as well as its usability by children. Once we have a promising learning strategy that satisfies these criteria, we can then concern ourselves with the components comprising that strategy.  With respect to this, I want to briefly discuss the type of components AP&L find unhelpful, since several of the components they would prefer might still be reasonably classified as UG components. The main issue they have is not with components that are innate and language-specific, but rather with components of this kind that in addition involve very precise knowledge. This therefore does not rule out UG components that involve more general knowledge, including (again) the components AP&L themselves propose. In addition, AP&L ask for explicit examples of UG components that actually do work. I think one potential UG component that’s part of a successful learning strategy for syntactic islands (described in Pearl & Sprouse 2013) is a nice example of this: the bias to characterize wh-dependencies at a specific level of granularity. It's not obvious where this bias would come from (i.e., how it would be derived or what innate knowledge would lead to it), but it's crucial for the learning strategy it's a part of to work. As a bonus, that learning strategy also satisfies the criteria I suggest for evaluating learning strategies more generally (utility and usability).

~~~
Reference:

Pearl, L. & Sprouse, J. 2013. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20, 23-68.

Tuesday, October 1, 2013

Next time on 10/23/13 @ 2:30pm in SBSG 2221 = Ambridge et al. in press


It looks like the best collective time to meet will be Wednesdays at 2:30pm for this quarter, so that's what we'll plan on.  Due to some of my own scheduling conflicts, our first meeting will be in a few weeks on October 23.  Our complete schedule is available on the webpage at http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html


On Oct 23, we'll be looking at an article that examines the utility of Universal Grammar based learning strategies in several different linguistic domains, arguing that they're not all that helpful at the moment:

Ambridge, B., Pine, J., & Lieven, E. 2013 in press. Child language acquisition: Why Universal Grammar doesn't help. Language.




See you then!

Wednesday, September 25, 2013

Fall quarter planning


I hope everyone's had a good summer break - and now it's time to gear up for the fall quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading group webpage, including readings on Universal Grammar, Bayesian modeling, and word learning:

http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html

Now all we need to do is converge on a specific day and time - please let me know what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here).

Thanks and see you soon!

Friday, June 7, 2013

Some thoughts on Parisien and Stevenson 2010


Overall, this paper is concerned with the extent to which children possess abstract knowledge of syntax, and more specifically, children’s ability to acquire generalizations about verb alternations. The authors present two models for the purpose of illustrating that information relevant to verb alternations can be acquired through observations of how verbs occur with individual arguments in the input.

My main point of confusion in this article was and still is about the features used to represent the lowest level of abstraction in the models. The types of features used seem to me to already assume a lot of prior abstract syntactic knowledge… The authors state, “We make the assumption that children at this developmental stage can distinguish various syntactic arguments in the input, but may not yet recognize recurring patterns such as transitive and double-object constructions”, but this assumption still does not quite make sense to me. In order to have a feature such as “OBJ”, don’t you have to have some abstract category for objects? Some abstract representation of what it means to be an object? This seems like more than just a general chunking of the input into constituents because for something to be an object, it has to be in a specific relationship with a verb. So how can you have this feature without already having abstract knowledge of the relationship of the object to the verb? If this type of generalized knowledge is not what is meant, maybe it is just the labels given to these features that bother me. It seems to me that once a learner has figured out what type each constituent is (OBJ, OBJ2, COMP, PP, etc.), the problem of learning generalizations of constructions becomes simple – just find all the verbs that have OBJ and OBJ2 after them and put them into a category together. Even after reading this article twice and discussing it with the class, I am still really missing something essential about the logic behind this assumption.
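
To illustrate the worry, here's a minimal sketch (hypothetical frame labels, not the authors' actual feature set) of how trivial the categorization step looks once every argument already comes labeled as OBJ, OBJ2, PP, and so on: finding dative-alternating verbs reduces to collecting verbs that have been seen with both frames.

    from collections import defaultdict

    # (verb, observed argument frame) pairs, as if the labeling problem were already solved
    observations = [
        ("give", ("OBJ", "OBJ2")), ("give", ("OBJ", "PP")),
        ("send", ("OBJ", "OBJ2")), ("send", ("OBJ", "PP")),
        ("fall", ()), ("eat", ("OBJ",)),
    ]

    frames_by_verb = defaultdict(set)
    for verb, frame in observations:
        frames_by_verb[verb].add(frame)

    # "Dative-alternating" verbs are just those seen with both frames
    alternators = {v for v, frames in frames_by_verb.items()
                   if ("OBJ", "OBJ2") in frames and ("OBJ", "PP") in frames}
    print(sorted(alternators))  # ['give', 'send']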

A few points regarding verb argument preferences:
  1. The comparison of the two models in the results for verb argument preferences seems completely unsurprising… Is this not what Model 1 was made to do? If so, then I would not expect any added benefit from Model 2, but it is unclear what the authors’ expectations were regarding this result.
  2. What is the point of comparing two very similar constructions (prepositional dative and benefactive)? The only difference between these two is the preposition used, so being able to distinguish one from the other does not require abstract syntactic knowledge… as far as I can tell, the differences occur at the phonological level and at the semantic level.
  3. I am curious about the fact that both models acquired approximately 20 different constructions… What were these other constructions and why did they only look at the datives? 
A few points regarding novel verb generalization:
  1. I found the comparison of the two models in the results for novel verb generalization to be rather difficult to interpret… In particular, I think organizing the graph in a different way could have made it much more visually interpretable – one in which the bars for model 1 and model 2 were side-by-side on the same graph rather than on separate graphs displayed one above the other. I also would have liked some discussion of the significance of the differences discussed – They say that in comparing Model 2 with Model 1, the PD frame is now more likely than the SC frame, although only slightly. Perhaps just because I’m not used to looking at log likelihood graphs, it is unclear to me whether this difference is significant enough to even bother mentioning because it is barely noticeable on the graph.
  2. On the topic of the behavior observed in children, the authors note that high-frequency verbs tend to be biased toward the double-object form. However, children tend to be biased toward the prepositional dative form. But even in the larger corpus, only about half of the verbs are prepositional-biased, and it is suggested that these are low frequency. So, what is a potential explanation for the observed bias in children? Why would they be biased toward the prepositional dative form if it is the low-frequency verbs that are biased this way? This doesn’t make intuitive sense if children are doing some sort of pattern-matching. I would expect children to behave like the model – to more closely match the biases of the high-frequency verbs and therefore prefer to generalize to the double-object construction from the prepositional dative. I think that rather than simply running the model on a larger corpus, it would be useful to construct a strong theory for why children might have this bias and then construct a model that is able to test that theory.




Thursday, June 6, 2013

Some thoughts on Carlson et al. 2010

I really liked how this paper tackled a really big problem head on. Its inclusion in subsequent works speaks strongly for the interest in this kind of research. I would really like to see more language papers set a high bar like this and establish a framework for achieving it.

My largest concern about this paper is the fact that the authors seemed to feel that human-guided learning can overcome some of the deficits in the model framework. The large drop off in precision (from 90% to 57%) is not surprising as methods such as the Coupled SEAL and Coupled Morphological Classifier are not robust in the face of locally optimal solutions; it is inevitable that as more and more data is added, the fitness will decline, because the models are already anchored to their fit of previous data. Errors will beget errors, and human intervention will only limit this inherent multiplication.

These errors are further compounded by the fact that the framework does not take into account the degree of independence between its various models. Using group and individual model thresholds for decision making is a decent heuristic, but it is unworkable as an architecture because guaranteeing each model's independence is a hard constraint on the number and types of models that can be used. I believe the framework would be better served by combining the underlying information in a proper, hierarchical framework. By including more models that can inform each other, perhaps the necessity of human-supervised learning can be kept to a minimum.
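
To make the decision-making contrast concrete, here's a toy sketch (invented candidate fact, scores, and thresholds, and definitely not the actual promotion machinery in the paper): the threshold heuristic described above versus a naive log-odds combination whose correctness depends on the component models actually being independent.

    import math

    # each component model's estimated probability that some candidate fact is true
    model_scores = {"pattern_learner": 0.9, "morphology_classifier": 0.85, "list_extractor": 0.6}

    def promote_by_thresholds(scores, individual=0.8, n_required=2):
        """Promote a candidate if at least n_required models individually clear a threshold."""
        return sum(s >= individual for s in scores.values()) >= n_required

    def promote_by_logodds(scores, prior=0.5, threshold=0.9):
        """Naive-Bayes-style combination: sum the models' log-odds. This double-counts
        evidence whenever the models aren't actually independent."""
        logit = math.log(prior / (1 - prior)) + sum(math.log(s / (1 - s)) for s in scores.values())
        return 1 / (1 + math.exp(-logit)) >= threshold

    print(promote_by_thresholds(model_scores), promote_by_logodds(model_scores))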

Tuesday, May 28, 2013

Have a good summer, and see you in the fall!


Thanks so much to everyone who was able to join us for our lively discussion today, and to everyone who's joined us this past academic year!

The CoLa Reading Group will be on hiatus this summer, and we'll resume again in the fall quarter.  As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

Friday, May 24, 2013

Some thoughts on Kwiatkowski et al 2012

One of the things I really enjoyed about this paper was that it was a much fuller syntax & semantics system than anything I've seen in a while, which means we get to see the nitty-gritty of the assumptions that are required to make it all work. Having seen the assumptions, though, I did find it a little unfair for the authors to claim that no language-specific knowledge was required - as far as I could tell, the "language-universal" rules between syntax and semantics at the very least seem to be a language-specific kind of knowledge (in the sense of domain-specific vs. domain-general). In this respect, whatever learning algorithms they might explore, the overall approach seems similar to other learning models I've seen that are predicated on very precise theoretical linguistic knowledge (e.g., the parameter-setting systems of Yang 2002, Sakas & Fodor 2001, Niyogi & Berwick 1996, Gibson & Wexler 1994, among others.) It just so happens here that CCG assumes different primitives/principles than those other systems - but domain-specific primitives/principles are still there a priori.

Getting back to the semantic learning - I'm a big fan of them learning words besides nouns, and connecting with the language acquisition behavioral literature on syntactic bootstrapping and fast mapping.  That being said, the actual semantics they seemed to learn was a bit different than what I think the fast mapping people generally intend.  In particular, if we look at Figure 5, while three different quantifier meanings are learned, it's more about the form the meaning takes, rather than the actual lexical meaning of the word (i.e., the form for a, another, and any looks identical, so any differences in meaning are not recognized, even though these words clearly do differ in meaning). I think lexical meaning is what people are generally talking about for fast mapping, though. What this seems like is almost grammatical categorization, where knowing the grammatical category means you know the general form the meaning will have (due to those linking rules between syntactic category and semantic form) rather than the precise meaning - that's very in line with syntactic bootstrapping, where the syntactic context might point you towards verb-y meanings or preposition-y meanings, for example.

More specific thoughts:

I found it interesting that the authors wanted to explicitly respond to a criticism that statistical learning models can't generate sudden step-like behavior changes.  I think it's certainly an unspoken view by many in linguistics that statistical learning implies more gradual learning (which was usually seen as a bonus, from what I understood, given how noisy data are). It's also unclear to me whether the data taken as evidence for step-wise changes really reflect a step-wise change or instead only seem to be step-wise because of how often the samples were taken and how much learning happened in between.  It's interesting that the model here can generate it for learning word order (in Figure 6), though I think the only case that really stands out for me is the 5 meaning example, around 400 utterances.

I could have used a bit more unpacking of the CCG framework in Figure 2. I know there were space limitations, but the translation from semantic type to the example logical form wasn't always obvious to me. For example, the first and last examples (S_dcl and PP) have the same semantic type but not the same lambda calculus form. Is the semantic type what's linked to the syntactic category (presumably), and then there are additional rules for how to generate the lambda form for any given semantic type?

This provides a nice example where the information that's easily available in dependency structures appears more useful, since the authors describe (in section 6) how they created a deterministic procedure for using the primitive labels in the dependency structures to create the lambda forms. (Though as a side note, I was surprised how this mapping only worked for a third of the child-directed speech examples, leaving out not only fragments but also imperatives and nouns with prepositional phrase modifiers. I guess it's not unreasonable to try to first get your system working on a constrained subset of the data, though.)

I wish they had told us a bit more about the guessing procedure they used for parsing unseen utterances, since it had a clear beneficial impact throughout the learning period. Was it random (and so guessing at all was better than not, since sometimes you'd be right as opposed to always being penalized for not having a representation for a given word)?  Was it some kind of probabilistic sampling?  Or maybe just always picking the most probable hypothesis?
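
For what it's worth, here's a small sketch of the three guessing procedures I'm wondering about (uniformly random choice, probability-matched sampling, and always taking the most probable hypothesis); the hypothesis names and probabilities are invented for illustration.

    import random
    random.seed(1)

    hypotheses = {"parse_A": 0.6, "parse_B": 0.3, "parse_C": 0.1}

    def guess_random(hyps):
        """Pick any hypothesis uniformly at random."""
        return random.choice(list(hyps))

    def guess_sample(hyps):
        """Sample a hypothesis in proportion to its probability."""
        return random.choices(list(hyps), weights=list(hyps.values()), k=1)[0]

    def guess_argmax(hyps):
        """Always pick the most probable hypothesis."""
        return max(hyps, key=hyps.get)

    print(guess_random(hypotheses), guess_sample(hypotheses), guess_argmax(hypotheses))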




Wednesday, May 15, 2013

Some thoughts on Frank et al. 2010

So what I liked most about this article was the way in which they chose to explore the space of possibilities in a very computational-level way. I think this is a great example of what I'd like to see more of. As someone also interested in cross-linguistic viability for our models, I have to also commend them for testing on not just one foreign language, but on three.

So there were a number of aspects of the model which I think could have been more clearly specified. For instance, I don't believe they ever explicitly say that the model presumes knowledge of the number of states to be learned. Actual infants don't have the benefit of the doubt in this regard, so it would be nice to know what would happen if you inferred that from the data. It turns out there's a well specified model to do that, but I'll get to that later. Another problem with their description of the model has to do with how their hyperparameters are sampled. They apparently simplify the process by resampling only once per iteration of the Gibbs sampler. I'm happy with this although I'm going to assume that it was a typo that they say they run their model for 2000 iterations (Goldwater seems to prefer 20,000). Gibbs samplers tend to converge more slowly on time-dependent models so it would be nice to have some evidence that the sampler has actually converged. Splitting the data by sentence type seems to increase the size of their confidence intervals by quite a lot, which may be an artifact of having less data per parameter, but could also be due to a lack of convergence.

Typically I have to chastise modelers who attempt to use VI or V-measure, but fortunately they are not doing anything technically wrong here. They are correct in that comparing these scores across corpora is hazardous at best. Both of these measures are biased: VI prefers small numbers of tags and V-measure prefers large numbers of tags (they claim at some point that it is "invariant" to different numbers of tags, but this is not true!). It turns out that another measure, V-beta, is more useful than either of these in that it is unbiased with respect to the number of categories. So there's my rant about the wonders of V-beta.
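
For anyone who wants to poke at these measures themselves, here's a short sketch of how VI and V-measure are typically computed on a toy tagging (the labels are made up). I haven't shown V-beta since its exact definition isn't spelled out here, though scikit-learn's homogeneity_completeness_v_measure does take a beta argument that reweights homogeneity against completeness.

    import numpy as np
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score, homogeneity_completeness_v_measure

    gold    = [0, 0, 1, 1, 2, 2]      # gold tags, e.g. N, N, V, V, DET, DET
    induced = [0, 0, 1, 2, 3, 3]      # an induced clustering that splits one gold tag

    def variation_of_information(a, b):
        """VI(a, b) = H(a) + H(b) - 2 * I(a, b), in nats."""
        h_a = entropy(np.unique(a, return_counts=True)[1])
        h_b = entropy(np.unique(b, return_counts=True)[1])
        return h_a + h_b - 2 * mutual_info_score(a, b)

    print("VI:", variation_of_information(gold, induced))
    print("homogeneity, completeness, V-measure:",
          homogeneity_completeness_v_measure(gold, induced))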

What I really would have liked to see would be an infinite HMM for this data, which is a well-specified, very similar model which can infer the number of grammatical categories in the data. It has an efficient sampler (as of 2008) so there's no reason they couldn't run that model over their corpus. It's very useful for us to know what the space of possibilities is, but to what extent would their results change if they gave up the assumption that you knew from the get-go how many categories there were? There's really no reason they couldn't run it and I'd be excited to see how well it performed.

The one problem with the models they show here, as well as the iHMM, is that none of them allows information about transition probabilities or emission probabilities (depending on the model) to be shared across sentence types. They're treated as entirely different. They mention this in their conclusion, but I wonder if there's any way to share that information in a useful way without hand-coding it somehow.

Overall, I'm really happy someone is doing this. I liked the use of some very salient information to help tackle a hard problem, but I would've liked to have seen it a little more realistic by inferring the number of grammatical categories. I might've also liked to have seen better evidence of convergence (perhaps a beam sampler instead of Gibbs, at the very least I hope they ran it for more than 2000 iterations).

Tuesday, May 14, 2013

Next time on 5/28/13 @ 2pm in SBSG 2200 = Kwiatkowski et al. 2012

Thanks to everyone who joined our meeting this week, where we had a very thoughtful discussion about the experimental design for investigating "less is more" and the implications of the computational modeling in Perfors 2012.  Next time on Tuesday May 28 @ 2pm in SBSG 2200, we'll be looking at an article that presents an incremental learning model that incorporates both syntactic and semantic information during learning:

Kwiatkowski, T., Goldwater, S., Zettlemoyer, L., & Steedman, M. 2012. A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.



See you then!
-Lisa


Monday, May 13, 2013

Some thoughts on Perfors 2012 (JML)

One of the things I quite liked about this paper was the description of the intuitions behind the different model parameters and capacity limitations. As a computational modeler who's seen ideal Bayesian learners before, could I have just as easily decoded this from a standard graphical model representation? Sure.  Did I like to have the intuitions laid out for me anyway?  You bet. Moreover, if we want these kinds of models to be recognized and used within language research, it's good to know how to explain them like this. On a related note, I also appreciated that Perfors explicitly recognized the potential issues involved in extending her results to actual language learning. As with most models, hers is a simplification, but it may be a useful simplification, and there are probably useful ways to un-simplify it.

It was also good to see the discussion of the relationship between the representations this model used for memory and the existing memory literature. (Given the publication venue, this probably isn't so surprising, but given that my knowledge of memory models is fairly limited, it was helpful to see this spelled out.)

I think the most surprising thing for me was how much memory loss was required for the regularization bias to be able to come into play and allow the model to show regularization. Do we really think children only remember 10-20% of what they hear? (Maybe they do, though, especially in more realistic scenarios.)

More specific thoughts:

Intro: I found the distinctions made between different "less is more" hypothesis variants to be helpful, in particular the difference between a "starting small" version that imposes explicit restrictions on the input (because of attention, memory, etc.) to identify useful units in the input vs. a general regularization tendency (which may be the byproduct of cognitive limitations, but isn't specifically about ignoring some of the input) which is about "smoothing" the input in some sense.

Section 2.1.2: The particular task Perfors chooses to investigate experimentally is based on previous tasks that have been done with children and adults to test regularization, but I wonder what kind of task it seemed like to the adult subjects. Since the stimuli were presented orally, did the subjects think of each one as a single word that had some internal inconsistency (and so might be treating the variable part as morphology tacked onto a noun) or would they have thought of each one as one consistent word plus a separate determiner-like thing (making this more of a combinatorial syntax task)?  I guess it doesn't really matter for the purposes of regularization - if children can regularize syntax (creoles, Nicaraguan sign language, Simon), then presumably they regularize morphology (e.g., children's overregularization of the past tense in English, like goed), and it's not an unreasonable assumption that the same regularization process would apply to both. Perfors touches again on the issue of how adults perceived the task a little in the discussion (p.40) - she mentions that mutual exclusivity might come into play if adults viewed this as a word learning task, and cause more of a bias for regularization.  Whether it's a morphology task or a combinatorial syntax task, I'm not sure I agree with that - mutual exclusivity seems like it would only apply if adults assumed the entire word was the name of the object (as opposed to the determiner-thing being an actual determiner like the or a or morphology like -ed or -ing). Because only a piece of the entire "word" would change with each presentation of the object, it doesn't seem like adults would make that assumption.

Section 3.0.6: For the Prior bias, it seems like the prior is constructed from the global frequency of the determiner (based on the CRP). This seems reasonable, but I wonder if it would matter any to have a lexical-item-based prior (maybe in addition to the global prior)? I could imagine that the forgotten data for any individual item might be quite high (even if others are low) when memory loss is less than 80-90% globally, which might allow the regularization effects to show up without needing to forget 80-90% of all the data.
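
Here's a rough sketch of the contrast I'm imagining (toy counts and a made-up concentration parameter, not Perfors's actual model): a predictive probability computed from the global determiner counts versus one computed from the counts remembered for a single lexical item, which could be far sparser even when global memory loss is mild.

    from collections import Counter

    alpha = 1.0  # concentration parameter; value made up for illustration

    global_counts = Counter({"det1": 80, "det2": 20})        # determiner counts across all nouns
    item_counts = {"noun_17": Counter({"det1": 2})}          # counts remembered for one noun

    def predictive(counts, det, alpha, n_options=2):
        """Dirichlet-multinomial (CRP-style) predictive probability of seeing det next:
        observed counts plus pseudo-mass alpha spread over the n_options determiners."""
        n = sum(counts.values())
        return (counts[det] + alpha / n_options) / (n + alpha)

    print("global P(det1):    ", predictive(global_counts, "det1", alpha))
    print("item-based P(det1):", predictive(item_counts["noun_17"], "det1", alpha))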

Section 4: It's an interesting observation that the previous experiments that found regularization effects conducted the experiment over multiple days, where consolidation during sleep would have presumably occurred. Perfors mentions this as a potential memory distortion that doesn't occur during encoding itself, or retrieval, but rather with the processes of memory maintenance. If this is true, running the experiments again with adults, but over multiple days, should presumably allow this effect to show up.

Tuesday, April 30, 2013

Next time on 5/14 @ 2pm in SBSG 2200 = Perfors 2012 JML


Thanks to everyone who joined our meeting this week, where we had a very spirited and enlightening discussion about Lignos 2012 and the ideas behind it! Next time on Tuesday May 14 @ 2pm in SBSG 2200, we'll be looking at an article that investigates the interplay between memory limitations and overregularization behavior in learners, providing both experimental and computational modeling results:

Perfors, A. 2012. When do memory limitations lead to regularization? An experimental and computational investigation. Journal of Memory and Language, 67(4), 486-506.



See you then!

Monday, April 29, 2013

Some thoughts on Lignos 2012

I found the simplicity of the proposed algorithm in this paper very attractive (especially when compared to some of the more technically involved papers we've read that come from the machine learning literature). The goal of connecting to known experimental and developmental data of course warmed my cognitive modeler's heart, and I certainly sympathized with the aim of pushing the algorithm to be more cognitively plausible.  I did think some of the criticisms of previous approaches were a touch harsh, given what's actually implemented here (more on this below), but that may be more of a subjective interpretation thing.  I did find it curious that the evaluation metrics chosen were about word boundary identification, rather than about lexicon items (in particular, measuring boundary accuracy and word token accuracy, but not lexicon accuracy).  Given the emphasis on building a quality lexicon (which seems absolutely right to me if we're talking about the goal of word segmentation), why not have lexicon item scores as well to get a sense of how good a lexicon this strategy can create?

Some more specific thoughts:

Section 2.1, discussing the 9-month-old English-learning infants who couldn't segment Italian words from transitional probabilities alone unless they had already been presented with words in isolation: Lignos is using this to argue against transitional probabilities as a useful metric at all, but isn't another way to interpret it simply that transitional probabilities (TPs) can't do it all on their own?  That is, if you initialize a proto-lexicon with a few words, TPs would work alright - they just can't work right off the bat with no information.  Relatedly, the discussion of the Shukla et al. 2011 (apparently 6-month-old) infants who couldn't use TPs unless they were aligned with a prosodic boundary made me think more that TPs are useful, just not useful in isolation.  They need to be layered on top of some existing knowledge (however small that knowledge might be).  But I think it just may be Lignos's stance that TPs aren't that useful - they seem to be left out as something a model of word segmentation should pay attention to in section 2.4.
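
Since TPs keep coming up, here's a minimal sketch of how forward transitional probabilities over syllables are usually computed (TP of a syllable pair = its count divided by the count of the first syllable), with boundaries then posited at local dips. The toy "utterances" are invented, and this is not Lignos's algorithm (which deliberately doesn't rely on TPs).

    from collections import Counter

    utterances = [["pre", "ti", "ba", "by"], ["pre", "ti", "dog", "gy"], ["ba", "by", "dog", "gy"]]

    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        unigrams.update(utt)
        bigrams.update(zip(utt, utt[1:]))

    def tp(x, y):
        """Forward transitional probability P(y | x) = count(xy) / count(x)."""
        return bigrams[(x, y)] / unigrams[x]

    utt = ["pre", "ti", "ba", "by"]
    tps = [tp(x, y) for x, y in zip(utt, utt[1:])]
    print(tps)  # [1.0, 0.5, 1.0] -> the dip after "ti" is where a boundary would be posited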

Of course, I (and I'm assuming Lawrence as well, given Phillips & Pearl 2012) was completely sympathetic to the criticism in section 2.3 about how phonemes aren't the right unit of perception for the initial stages of word segmentation. They may be quite appropriate if you're talking about 10-month-olds, though - of course, at that point, infants probably have a much better proto-lexicon, not to mention other cues (e.g., word stress). I was a little less clear about the criticism (of Johnson & Goldwater) about using collocations as a level of representation.  Even though this doesn't necessarily connect to adult knowledge of grammatical categories and phrases, there doesn't seem to be anything inherently wrong with assuming infants initially learn chunks that span categories and phrases, like "thatsa" or "couldI". They would have to fix them later, but that doesn't seem unreasonable.

One nice aspect of the Lignos strategy is that it's incremental, rather than a batch algorithm.  However, I think it's more a modeling decision than an empirical fact to not allow memory of recent utterances to affect the segmentation of the current utterance (section 3 Intro).  It may well turn out to be right, but it's not obviously true at this point that this is how kids are constrained.  On a related note, the implementation of considering multiple segmentations seems a bit more memory-intensive, so what's the principled reason for allowing memory for that but not allowing memory for recent utterances? Conceptually, I understand the motivation for wanting to explore multiple segmentations (and I think it's a good idea - I'm actually not sure why the algorithm here is limited to 2) - I'm just not sure it's quite fair to criticize other models for essentially allowing more memory for one thing when the model here allows more memory for another.

I was a little confused about how the greedy subtractive segmentation worked in section 3.2.  At first, I thought it was an incremental greedy thing - so if your utterance was "syl1 syl2 syl3", you would start with "syl1" and see if that's in your lexicon; if not, try "syl1 syl2", and so on. But this wouldn't run into ambiguity then: "...whenever multiple words in the lexicon could be subtracted from an utterance, the entry with the highest score will be deterministically used". So something else must be meant. Later on when the beam search is described, it makes sense that there would be ambiguity - but I thought ambiguity was supposed to be present even without multiple hypotheses being considered.

The "Trust" feature described in 3.3 seemed like an extra type of knowledge that might be more easily integrated into the existing counts, rather than added on as an additional binary feature.  I get that the idea was to basically use it to select the subset of words to add to the lexicon, but couldn't a more gradient version of this be implemented, where the count for words at utterance boundaries gets increased by 1, while the count for words that are internal gets increased by less than 1? I guess you could make an argument either way about which approach is more naturally intuitive (i.e., just ignore words not at utterance boundaries vs. be less confident about words not at utterance boundaries).
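
Here's a quick sketch of the gradient alternative I have in mind (the 0.5 weight is arbitrary): boundary-aligned words contribute a full count to the lexicon and utterance-internal words contribute a partial one, instead of the binary trust feature.

    from collections import Counter

    lexicon = Counter()

    def credit(word, at_utterance_boundary, internal_weight=0.5):
        """Add a full count for boundary-aligned words, a partial count otherwise."""
        lexicon[word] += 1.0 if at_utterance_boundary else internal_weight

    credit("kitty", at_utterance_boundary=True)
    credit("kitty", at_utterance_boundary=False)
    credit("the", at_utterance_boundary=False)
    print(lexicon)  # Counter({'kitty': 1.5, 'the': 0.5})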

I think footnote 7 is probably the first argument I've seen in favor of using orthographic words as the target state, instead of an apology for not having prosodic words as the target state. I appreciate the viewpoint, but I'm not quite convinced that prosodic words wouldn't be useful as proto-lexicon items (ex: "thatsa" and "couldI" come to mind). Of course, these would have to be segmented further eventually, but they're probably not completely destructive to have in the proto-lexicon (and do feel more intuitively plausible as an infant's target state).

In Table 1, it seems like we see a good example of why precision and recall may be better than hit (H) rate and false alarm (FA) rate: The Syllable learner (which puts a boundary at every syllable) clearly oversegments and does not achieve the target state, but you would never know that from the H and FA scores.  Do we get additional information from H & FA that we don't get from precision and recall? (I guess it would have to be mostly from the FA rate, since H = recall?)
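
Just to make the relationship concrete with made-up boundary positions: the boundary-everywhere learner gets a perfect hit rate (H really is just recall), and it's precision -- or, in this toy setup, the FA rate -- that exposes the oversegmentation, which fits the parenthetical guess that the extra information in H & FA is mostly carried by the FA rate.

    true_boundaries = {3, 7}                 # positions of the real word boundaries
    positions = set(range(1, 10))            # all potential boundary positions in the utterance
    posited = positions                      # "Syllable learner": a boundary at every position

    hits = true_boundaries & posited
    false_alarms = posited - true_boundaries

    recall = len(hits) / len(true_boundaries)                    # = hit rate H
    precision = len(hits) / len(posited)
    fa_rate = len(false_alarms) / len(positions - true_boundaries)

    print(f"H/recall={recall:.2f}, precision={precision:.2f}, FA rate={fa_rate:.2f}")
    # H/recall=1.00, precision=0.22, FA rate=1.00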

I thought seeing the error analyses in Tables 2 and 3 was helpful, though I was a little surprised Table 3 didn't show the breakdown between undersegmentation and oversegmentation errors, in addition to the breakdown between function and content words.  (Or maybe I just would have liked to have seen that, given the claim that early errors should mostly be undersegmentations. We see plenty of function words as errors, but how many of them are already oversegmentations?)

Tuesday, April 16, 2013

Next time on 4/30/13 @ 2pm in SBSG 2200 = Lignos 2012


Thanks to everyone who joined our meeting this week, where we had a very helpful discussion about the empirical basis and learning model in Martin 2011, as well as some ideas for how to extend this model in interesting ways. Next time on Tuesday April 30 @ 2pm in SBSG 2200, we'll be looking at an article that develops an algorithmic model of word segmentation, using experimental evidence from infant learning to ground itself:

Lignos, C. 2012. Infant Word Segmentation: An Incremental, Integrated Model. Proceedings of the 30th West Coast Conference on Formal Linguistics, ed. Nathan Arnett and Ryan Bennett, 237-247. Somerville, MA: Cascadilla Proceedings Project.


See you then!
-Lisa

Monday, April 15, 2013

Some thoughts on Martin (2011)

I really liked how compact this paper was - there was quite a bit of material included without it feeling like a part of the discussion was missing. I appreciated the connections made between the implementation of the model and the cognitive learning biases that implementation represented.

As a researcher with a soft spot for empirically-grounded modeling, I was also pleased to see the connections to English and Navajo phonotactic variation. (I admit, I would have liked a bit less abstraction for some of the modeling demonstrations once the basic principle had been illustrated, but that's probably why it was a 20 page paper instead of a 40 page paper.)  One of the things that really struck me was how much the MaxEnt framework discussed seemed similar to hierarchical Bayesian models (HBMs) - I kept wanting to map the different frameworks to each other (prior = prefer simpler grammars, likelihood = maximize probability of input data, etc.). It seemed like the MaxEnt framework included an overhypothesis (dislike geminate consonants in general [structure-blind]), and then some more specific instantiations (dislike them within words, but don't care about them as much across words [structure-sensitive]).  This would be the "leaking" that the title refers to - the leaking of specific constraints back up to the overhypothesis. This also ties into the idea on p.763 where Martin mentions that structure-blind constraints may be a hold-over from very early learning (Perfors, Tenenbaum and colleagues often talk about the "blessing of abstraction" for overhypotheses, where the more abstract thing can be learned earlier because it's instantiated in so many things. And so perhaps the overhypothesis is reinforced more than any individual instantiation of it, making it more resistant to change later on.) But instead of having them arranged in this kind of hierarchy (or maybe it's more like two factors interacting - (1) geminate preference + (2) within vs. across words?), the constraints were specified explicitly by the modeler. This is a great first step to show that all of these constraints are needed, but it does feel like some more-general representation is missing.

I also thought it was a very interesting hypothesis that marked forms (i.e., geminates across word boundaries in compounds) persist because new compounds are formed that are not drawn from the existing phonotactic distribution of geminates.  Martin suggests this is because semantic factors play a role in compound formation, and they have nothing to do with phonotactics. This seems reasonable, but really, the main empirical finding is simply that something besides the existing phonotactic distribution matters.  Something I would have liked to have seen was how far away the new-compound-formation distribution has to be from the existing distribution in order for these forms to persist - in the demonstration Martin does, this distribution is simply 0.5 (half the time new compounds contain geminates).  But one might easily imagine that new compounds are formed from the existing words in the lexicon, and this might be less than 0.5, depending on the actual words in the lexicon.  Do these forms persist if the new-compound-formation distribution is 0.25 geminates, for instance?

Specific comments:

Section 4: I was unsure how to map the learning model to Universal Grammar (UG), especially since Martin makes it a point to connect the model to UG in the first paragraph here. I think he's saying that the "entanglement" of the constraints (which reads to me like overhypothesis + more specific constraints) is not part of UG.  This is fine, if we think of the general structure of overhypotheses as not being a UG thing. But what does seem to then be a UG thing is what the overhypothesis actually is - in this case, it's knowing that geminates are a thing to pay attention to, and that word structure may matter for them. (In the same way, if we think of UG parameters as overhypotheses, the UG part is what the content of the overhypothesis/parameter is, not the fact that there is actually an overhypothesis.) So would Martin be happy to claim that both the "entanglement" structure and the content of the constraints themselves aren't part of UG?  If so, where does the focus on geminates and word structure come from?  Does the attention to geminates and word structure logically arise in some way?

Section 4.2, p.760, discussing the tradeoff between modeling the data as accurately as possible and having as general a grammar as possible: This tradeoff is completely fine, of course, as that's exactly the sort of thing Bayesian models do.  But Martin also equates a "general" grammar to a uniform distribution grammar - I was trying to think if that's the right connection to draw. In one sense, it may be, if we think about how much data each grammar is compatible with - a grammar with a uniform distribution doesn't really give much importance to any of the constraints (if I'm understanding this correctly), so it would presumably be fine with the entire set of input data. This then makes it more general than grammars that do place priority on some constraints, and so don't allow in some of the data.

Section 4.2, p.760: The learning described, where the constraints are assigned arbitrary weights, and then the constraint weights are updated using the SGA update rule, reminds me a lot of neural net updating.  How similar are these? On a more specific note, I was trying to figure out how to interpret C_i(x) and C_i(y) in the rule in (7) - are these simply binary (1 or 0)? (This would make sense, since the constraints themselves are things like "allow geminates".)
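
Since I was trying to work out what rule (7) might be doing, here's a generic sketch of a sampled stochastic-gradient update for MaxEnt constraint weights. The candidate set, constraint names, weights, and learning rate are all invented, and the sign conventions may well differ from Martin's actual rule, but it shows one standard way the C_i(x) and C_i(y) terms (violation counts, often just 0 or 1) can enter the update.

    import math, random
    random.seed(0)

    candidates = ["word-internal geminate", "cross-word geminate", "no geminate"]
    # constraint violation vectors C(x): [*GEMINATE (structure-blind), *GEMINATE-WITHIN-WORD]
    violations = {
        "word-internal geminate": [1, 1],
        "cross-word geminate":    [1, 0],
        "no geminate":            [0, 0],
    }
    weights = [0.0, 0.0]
    eta = 0.1

    def probs(weights):
        """P(x) proportional to exp(-sum_i w_i * C_i(x)), normalized over the candidates."""
        scores = {x: math.exp(-sum(w * c for w, c in zip(weights, violations[x]))) for x in candidates}
        z = sum(scores.values())
        return {x: s / z for x, s in scores.items()}

    def sga_step(observed):
        """One sampled gradient step on an observed form: sample y from the current grammar
        and shift the weights so the observed form's violations become relatively less penalized."""
        p = probs(weights)
        sampled = random.choices(candidates, weights=[p[x] for x in candidates], k=1)[0]
        for i in range(len(weights)):
            weights[i] += eta * (violations[sampled][i] - violations[observed][i])

    for _ in range(200):
        sga_step("no geminate")
    print([round(w, 2) for w in weights], probs(weights))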



Wednesday, April 10, 2013

Some thoughts on Mohamed et al. (2011)


This brief article focuses on the principles of how deep belief networks (DBN) achieve good speech recognition performance, while glossing over many of the details. Therefore, it seems to me that this article can be approached with two levels of rigor. For the novice with a more leisurely approach, the article provides some very clear and concise descriptions of what a DBN model has that sets it apart from other types of competing models. For the experimentalist who wants to replicate the actual models used in the paper, good luck. Nevertheless, there are more extensive treatments of the technical details elsewhere in the literature, and even the novice will probably wish to consult some of these sources to appreciate the nuances in the method that receive short shrift here.

The three main things that make DBNs an attractive modeling choice:
1) They are neural networks. Neural networks are an efficient way to estimate the states of hidden Markov models (HMMs), compared to mixtures of Gaussians.
2) They are deep. More hidden layers allow for more complicated correlations between the input and the model states, so more structure can be extracted from the data.
3) They are generatively pre-trained. This is a neat pre-optimization algorithm that places the model in a good starting point, so that back-propagation is more likely to find a good local optimum (a minimal sketch of one pre-training step appears after this list). Without this pre-optimization, models with many hidden layers are unlikely to converge on a good solution.
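
Here's that sketch: a single contrastive-divergence (CD-1) update for one binary RBM layer, which is the building block that gets stacked up during the layer-wise generative pre-training. The layer sizes, fake input frames, and learning rate are all made up, and this is not the configuration used in the paper.

    import numpy as np
    rng = np.random.default_rng(0)

    n_visible, n_hidden, batch = 6, 4, 10
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    v0 = (rng.random((batch, n_visible)) > 0.5).astype(float)  # fake binary input frames

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Positive phase: hidden activations driven by the data
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase ("fantasy"): reconstruct the visibles, then re-infer the hiddens
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)

    # CD-1 update: data-driven correlations minus reconstruction-driven correlations
    eta = 0.1
    W += eta * (v0.T @ ph0 - pv1.T @ ph1) / batch
    b_v += eta * (v0 - pv1).mean(axis=0)
    b_h += eta * (ph0 - ph1).mean(axis=0)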

The idea of using a "generative" procedure to pre-optimize a system seems like it may have immediate applicability for psychologists and linguists who also study "generative" phenomena. After all, the training algorithm is even called the "wake-sleep" algorithm, where the model generates "fantasies" during its pre-training. While the parallels are certainly interesting, without appreciating the details of the algorithm, it's difficult to know how deep these similarities actually are. In his IPAM lecture, Hinton notes that while some neuroscientists such as Friston do believe the model is directly applicable to the brain, he remains skeptical.

Ignoring psychological applications for the moment, I'm still left wondering about how "good" DBNs actually perform. The best performing model in this paper still only achieves a Phoneme Error Rate of 20%, and the variability attributable to feature types, number of hidden layers, or pre-training appears small, affecting performance by only a few percentage points. Again, the evaluation procedure is not entirely clear to me, so it's difficult to know how these values translate into real-world performance. I would believe that current voice-recognition technology does much better than 80%, and in far more adverse conditions than those tested here. It was also interesting to note that DBNs appear to have a problem with ignoring irrelevant input.

The multidimensional reduction visualization (t-SNE) was pretty cool, plotting data points that are near to each other in high-dimensional space close together in 2-dimensional space. It would be nice to have some way to quantify the revealed structures using this visualization technique. The distinctions between Figs 3-4 and 7-8 are visually obvious, but I think we just have to take the authors at their word when they describe differences in Figs 5-6. Perhaps another way to visualize the hidden structure in the model, particularly comparing different individual hidden layers as in Figs 7-8, would be to provide dendrograms that cluster inputs based on the hidden vectors that are generated.

Overall, DBNs seem like they can do quite a bit of work for speech recognition systems, and the psychological implications of these models seem to be promising avenues for research. It would be really nice to see some more elaborate demonstrations of DBNs in action.