Tuesday, April 30, 2013

Next time on 5/14 @ 2pm in SBSG 2200 = Perfors 2012 JML

Thanks to everyone who joined our meeting this week, where we had a very spirited and enlightening discussion about Lignos 2012 and the ideas behind it! Next time on Tuesday May 14 @ 2pm in SBSG 2200, we'll be looking at an article that investigates the interplay between memory limitations and overregularization behavior in learners, providing both experimental and computational modeling results:

See you then!

Monday, April 29, 2013

Some thoughts on Lignos 2012

I found the simplicity of the proposed algorithm in this paper very attractive (especially when compared to some of the more technically involved papers we've read that come from the machine learning literature). The goal of connecting to known experimental and developmental data of course warmed my cognitive modeler's heart, and I certainly sympathized with the aim of pushing the algorithm to be more cognitively plausible.  I did think some of the criticisms of previous approaches were a touch harsh, given what's actually implemented here (more on this below), but that may be more of a subjective interpretation thing.  I did find it curious that the evaluation metrics chosen were about word boundary identification, rather than about lexicon items (in particular, measuring boundary accuracy and word token accuracy, but not lexicon accuracy).  Given the emphasis on building a quality lexicon (which seems absolutely right to me if we're talking about the goal of word segmentation), why not have lexicon item scores as well to get a sense of how good a lexicon this strategy can create?

Some more specific thoughts:

Section 2.1, discussing the 9-month-old English-learning infants who couldn't segment Italian words from transitional probabilities alone unless they had already been presented with words in isolation: Lignos is using this to argue against transitional probabilities as a useful cue at all, but isn't another way to interpret it simply that transitional probabilities (TPs) can't do it all on their own?  That is, if you initialize a proto-lexicon with a few words, TPs would work all right - they just can't work right off the bat with no information.  Relatedly, the discussion of the Shukla et al. 2011 (apparently 6-month-old) infants who couldn't use TPs unless they were aligned with a prosodic boundary made me think more that TPs are useful, just not useful in isolation.  They need to be layered on top of some existing knowledge (however small that knowledge might be).  But I think it just may be Lignos's stance that TPs aren't that useful - they seem to be left out as something a model of word segmentation should pay attention to in section 2.4.
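(For anyone who hasn't worked with TPs directly, here's a quick sketch of the standard forward computation - TP(B|A) = count(AB)/count(A) - which is all the experimental work above presupposes. The syllable stream is made up for illustration.)

```python
from collections import Counter

def transitional_probs(syllable_stream):
    """Forward transitional probabilities over a syllable stream:
    TP(B|A) = count(A followed by B) / count(A)."""
    pairs = Counter(zip(syllable_stream, syllable_stream[1:]))
    singles = Counter(syllable_stream[:-1])
    return {(a, b): c / singles[a] for (a, b), c in pairs.items()}

# toy stream: "pabiku" twice, then a novel stretch
stream = ["pa", "bi", "ku", "pa", "bi", "ku", "go", "la"]
tps = transitional_probs(stream)
print(tps[("pa", "bi")], tps[("ku", "pa")])  # → 1.0 0.5
```

The usual move is then to posit word boundaries at TP dips (here, after "ku"), which is exactly what can't get off the ground with no prior information about where to anchor.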

Of course, I (and I'm assuming Lawrence as well, given Phillips & Pearl 2012) was completely sympathetic to the criticism in section 2.3 about how phonemes aren't the right unit of perception for the initial stages of word segmentation. They may be quite appropriate if you're talking about 10-month-olds, though - of course, at that point, infants probably have a much better proto-lexicon, not to mention other cues (e.g., word stress). I was a little less clear about the criticism (of Johnson & Goldwater) about using collocations as a level of representation.  Even though this doesn't necessarily connect to adult knowledge of grammatical categories and phrases, there doesn't seem to be anything inherently wrong with assuming infants initially learn chunks that span categories and phrases, like "thatsa" or "couldI". They would have to fix them later, but that doesn't seem unreasonable.

One nice aspect of the Lignos strategy is that it's incremental, rather than a batch algorithm.  However, I think it's more a modeling decision than an empirical fact to not allow memory of recent utterances to affect the segmentation of the current utterance (section 3 Intro).  It may well turn out to be right, but it's not obviously true at this point that this is how kids are constrained.  On a related note, the implementation of considering multiple segmentations seems a bit more memory-intensive, so what's the principled reason for allowing memory for that but not allowing memory for recent utterances? Conceptually, I understand the motivation for wanting to explore multiple segmentations (and I think it's a good idea - I'm actually not sure why the algorithm here is limited to 2) - I'm just not sure it's quite fair to criticize other models for essentially allowing more memory for one thing when the model here allows more memory for another.

I was a little confused about how the greedy subtractive segmentation worked in section 3.2.  At first, I thought it was an incremental greedy thing - so if your utterance was "syl1 syl2 syl3", you would start with "syl1" and see if that's in your lexicon; if not, try "syl1 syl2", and so on. But this wouldn't run into ambiguity then: "...whenever multiple words in the lexicon could be subtracted from an utterance, the entry with the highest score will be deterministically used". So something else must be meant. Later on when the beam search is described, it makes sense that there would be ambiguity - but I thought ambiguity was supposed to be present even without multiple hypotheses being considered.
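To make my own confusion concrete, here's the reading I initially had of subtractive segmentation, as a toy sketch - this is my reconstruction, not Lignos's actual implementation (in particular, how the unsegmented remainder is handled here is my guess):

```python
def subtract_segment(syllables, lexicon):
    """Greedy subtractive segmentation sketch (my reconstruction):
    repeatedly subtract the highest-scoring lexicon word matching a
    prefix of the remaining syllables; a syllable with no lexical
    match is emitted as a novel one-syllable word."""
    words, i = [], 0
    while i < len(syllables):
        # all lexicon entries (tuples of syllables) matching at position i
        matches = [w for w in lexicon
                   if tuple(syllables[i:i + len(w)]) == w]
        if matches:
            # ambiguity: several entries match, so pick the highest score
            best = max(matches, key=lambda w: lexicon[w])
            words.append(best)
            i += len(best)
        else:
            words.append((syllables[i],))
            i += 1
    return words

lex = {("ba", "by"): 5.0, ("ba",): 2.0}
print(subtract_segment(["ba", "by", "go"], lex))
# → [('ba', 'by'), ('go',)]
```

On this reading, the only ambiguity is between competing lexicon entries at the same position (here "baby" vs. "ba"), resolved deterministically by score - which is why I wasn't sure where the additional ambiguity the paper mentions is supposed to come from before the beam search enters the picture.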

The "Trust" feature described in 3.3 seemed like an extra type of knowledge that might be more easily integrated into the existing counts, rather than added on as an additional binary feature.  I get that the idea was to basically use it to select the subset of words to add to the lexicon, but couldn't a more gradient version of this be implemented, where the count for words at utterance boundaries gets increased by 1, while the count for words that are internal gets increased by less than 1? I guess you could make an argument either way about which approach is more naturally intuitive (i.e., just ignore words not at utterance boundaries vs. be less confident about words not at utterance boundaries).
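The gradient version I have in mind is just a fractional count increment - a sketch of my suggestion, not anything in the paper (the 0.5 internal weight is an arbitrary placeholder):

```python
def update_counts(counts, word, at_utterance_boundary,
                  internal_weight=0.5):
    """Gradient alternative to a binary 'trust' feature (my sketch):
    boundary-adjacent words get a full count, utterance-internal
    words a fractional one, rather than being ignored entirely."""
    counts[word] = counts.get(word, 0.0) + (
        1.0 if at_utterance_boundary else internal_weight)
    return counts

counts = {}
update_counts(counts, "kitty", at_utterance_boundary=True)
update_counts(counts, "kitty", at_utterance_boundary=False)
print(counts)  # → {'kitty': 1.5}
```

The binary feature then falls out as the special case internal_weight=0, so the two approaches could even be compared directly by sweeping that one parameter.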

I think footnote 7 is probably the first argument I've seen in favor of using orthographic words as the target state, instead of an apology for not having prosodic words as the target state. I appreciate the viewpoint, but I'm not quite convinced that prosodic words wouldn't be useful as proto-lexicon items (ex: "thatsa" and "couldI" come to mind). Of course, these would have to be segmented further eventually, but they're probably not completely destructive to have in the proto-lexicon (and do feel more intuitively plausible as an infant's target state).

In Table 1, it seems like we see a good example of why precision and recall may be better than hit (H) rate and false alarm (FA) rate: The Syllable learner (which puts a boundary at every syllable) clearly oversegments and does not achieve the target state, but you would never know that from the H and FA scores.  Do we get additional information from H & FA that we don't get from precision and recall? (I guess it would have to be mostly from the FA rate, since H = recall?)
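To make the H = recall relationship concrete, here's a toy scoring sketch using the standard definitions (my own construction, not the paper's evaluation code; n_positions is the number of possible word-internal boundary slots):

```python
def boundary_scores(proposed, gold, n_positions):
    """Boundary-level scores. proposed, gold: sets of internal
    boundary positions; n_positions: number of candidate positions."""
    hits = len(proposed & gold)
    fas = len(proposed - gold)          # false alarms
    misses = len(gold - proposed)
    crs = n_positions - hits - fas - misses  # correct rejections
    precision = hits / len(proposed)
    recall = hits / len(gold)           # identical to the hit rate
    h_rate = hits / (hits + misses)
    fa_rate = fas / (fas + crs) if fas + crs else 0.0
    return precision, recall, h_rate, fa_rate

# "the | kit ty | sleeps": gold boundaries after syllables 1 and 3;
# the Syllable learner posits a boundary at every position (1, 2, 3)
print(boundary_scores({1, 2, 3}, {1, 3}, 3))
# precision drops (≈ 0.67) while recall = H = 1.0
```

So recall and H are the same number by definition, and the oversegmentation only shows up in precision (or in the FA rate, which is the one quantity precision/recall don't directly encode).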

I thought seeing the error analyses in Tables 2 and 3 was helpful, though I was a little surprised Table 3 didn't show the breakdown between undersegmentation and oversegmentation errors, in addition to the breakdown between function and content words.  (Or maybe I just would have liked to have seen that, given the claim that early errors should mostly be undersegmentations. We see plenty of function words as errors, but how many of them are already oversegmentations?)

Tuesday, April 16, 2013

Next time on 4/30/13 @ 2pm in SBSG 2200 = Lignos 2012

Thanks to everyone who joined our meeting this week, where we had a very helpful discussion about the empirical basis and learning model in Martin 2011, as well as some ideas for how to extend this model in interesting ways. Next time on Tuesday April 30 @ 2pm in SBSG 2200, we'll be looking at an article that develops an algorithmic model of word segmentation, using experimental evidence from infant learning to ground itself:

Lignos, C. 2012. Infant Word Segmentation: An Incremental, Integrated Model. Proceedings of the 30th West Coast Conference on Formal Linguistics, ed. Nathan Arnett and Ryan Bennett, 237-247. Somerville, MA: Cascadilla Proceedings Project.

See you then!

Monday, April 15, 2013

Some thoughts on Martin (2011)

I really liked how compact this paper was - there was quite a bit of material included without it feeling like a part of the discussion was missing. I appreciated the connections made between the implementation of the model and the cognitive learning biases that implementation represented.

As a researcher with a soft spot for empirically-grounded modeling, I was also pleased to see the connections to English and Navajo phonotactic variation. (I admit, I would have liked a bit less abstraction for some of the modeling demonstrations once the basic principle had been illustrated, but that's probably why it was a 20 page paper instead of a 40 page paper.)

One of the things that really struck me was how much the MaxEnt framework discussed seemed similar to hierarchical Bayesian models (HBMs) - I kept wanting to map the different frameworks to each other (prior = prefer simpler grammars, likelihood = maximize probability of input data, etc.). It seemed like the MaxEnt framework included an overhypothesis (dislike geminate consonants in general [structure-blind]), and then some more specific instantiations (dislike them within words, but don't care about them as much across words [structure-sensitive]).  This would be the "leaking" that the title refers to - the leaking of specific constraints back up to the overhypothesis. This also ties into the idea on p.763 where Martin mentions that structure-blind constraints may be a hold-over from very early learning (Perfors, Tenenbaum and colleagues often talk about the "blessing of abstraction" for overhypotheses, where the more abstract thing can be learned earlier because it's instantiated in so many things. And so perhaps the overhypothesis is reinforced more than any individual instantiation of it, making it more resistant to change later on.) But instead of having them arranged in this kind of hierarchy (or maybe it's more like two factors interacting - (1) geminate preference + (2) within vs. across words?), the constraints were specified explicitly by the modeler. This is a great first step to show that all of these constraints are needed, but it does feel like some more-general representation is missing.

I also thought it was a very interesting hypothesis that marked forms (i.e., geminates across word boundaries in compounds) persist because new compounds are formed that are not drawn from the existing phonotactic distribution of geminates.  Martin suggests this is because semantic factors play a role in compound formation, and they have nothing to do with phonotactics. This seems reasonable, but really, the main empirical finding is simply that something besides the existing phonotactic distribution matters.  Something I would have liked to have seen was how far away the new-compound-formation distribution has to be from the existing distribution in order for these forms to persist - in the demonstration Martin does, this distribution is simply 0.5 (half the time new compounds contain geminates).  But one might easily imagine that new compounds are formed from the existing words in the lexicon, and this might be less than 0.5, depending on the actual words in the lexicon.  Do these forms persist if the new-compound-formation distribution is 0.25 geminates, for instance?
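Just to make that question concrete, here's a toy iteration (entirely my construction, with made-up decay and inflow rates - not Martin's actual model): the lexicon's geminate rate is pushed down by the phonotactic preference each generation, while new compounds enter with geminate probability p_new.

```python
def iterate(g, p_new, decay=0.2, birth=0.1, generations=200):
    """Toy dynamics (not Martin's simulation): each generation,
    the existing geminate rate g is regularized downward at rate
    `decay`, while a fraction `birth` of the lexicon is replaced
    by new compounds containing geminates with probability p_new."""
    for _ in range(generations):
        g = (1 - birth) * g * (1 - decay) + birth * p_new
    return g

for p in (0.5, 0.25, 0.0):
    print(p, round(iterate(0.5, p), 3))
# geminates settle at a level proportional to p_new (here p_new/2.8);
# they only die out entirely when p_new = 0
```

Even this crude version suggests an answer of the "it depends" variety: marked forms persist at any nonzero new-compound rate, just at a lower equilibrium level - which is why it would have been nice to see the demonstration run at values other than 0.5.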

Specific comments:

Section 4: I was unsure how to map the learning model to Universal Grammar (UG), especially since Martin makes it a point to connect the model to UG in the first paragraph here. I think he's saying that the "entanglement" of the constraints (which reads to me like overhypothesis + more specific constraints) is not part of UG.  This is fine, if we think about the structure of overhypotheses in general not being a UG thing. But what does seem to then be a UG thing is what the overhypothesis actually is - in this case, it's knowing that geminates are a thing to pay attention to, and that word structure may matter for them. (In the same way, if we think of UG parameters as overhypotheses, the UG part is what the content of the overhypothesis/parameter is, not the fact that there is actually an overhypothesis.) So would Martin be happy to claim that both the "entanglement" structure and the content of the constraints themselves aren't part of UG?  If so, where does the focus on geminates and word structure come from?  Does the attention to geminates and word structure logically arise in some way?

Section 4.2, p.760, discussing the tradeoff between modeling the data as accurately as possible and having as general a grammar as possible: This tradeoff is completely fine, of course, as that's exactly the sort of thing Bayesian models do.  But Martin also equates a "general" grammar to a uniform distribution grammar - I was trying to think if that's the right connection to draw. In one sense, it may be, if we think about how much data each grammar is compatible with - a grammar with a uniform distribution doesn't really give much importance to any of the constraints (if I'm understanding this correctly), so it would presumably be fine with the entire set of input data. This then makes it more general than grammars that do place priority on some constraints, and so don't allow in some of the data.

Section 4.2, p.760: The learning described, where the constraints are assigned arbitrary weights, and then the constraint weights are updated using the SGA update rule, reminds me a lot of neural net updating.  How similar are these? On a more specific note, I was trying to figure out how to interpret C_i(x) and C_i(y) in the rule in (7) - are these simply binary (1 or 0)? (This would make sense, since the constraints themselves are things like "allow geminates".)
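If I'm reading rule (7) correctly, one update step would look something like this sketch (my interpretation, including the assumption that the C_i are binary violation indicators; eta and the toy forms are made up):

```python
def sga_update(weights, constraints, observed, sampled, eta=0.1):
    """One stochastic gradient ascent step on MaxEnt constraint
    weights, as I read an update like (7): move each weight by
    eta * (C_i(observed form) - C_i(form sampled from the grammar))."""
    return [w + eta * (c(observed) - c(sampled))
            for w, c in zip(weights, constraints)]

# toy binary constraint: 1 if the form contains a geminate, else 0
has_geminate = lambda form: 1 if "tt" in form else 0
w = sga_update([0.0], [has_geminate], observed="kita", sampled="kitta")
print(w)
# the grammar's sample violates the constraint while the observed
# form doesn't, so the weight moves to dispreferring geminates
```

Which is also why it reminded me of neural-net updating: it's the same error-driven shape (observed minus predicted) as the perceptron/delta rule, just operating on constraint violations rather than unit activations.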

Wednesday, April 10, 2013

Some thoughts on Mohamed et al. (2011)

This brief article focuses on the principles of how deep belief networks (DBN) achieve good speech recognition performance, while glossing over many of the details. Therefore, it seems to me that this article can be approached with two levels of rigor. For the novice with a more leisurely approach, the article provides some very clear and concise descriptions of what a DBN model has that sets it apart from other types of competing models. For the experimentalist who wants to replicate the actual models used in the paper, good luck. Nevertheless, there are more extensive treatments of the technical details elsewhere in the literature, and even the novice will probably wish to consult some of these sources to appreciate the nuances in the method that receive short shrift here.

The three main things that make DBNs an attractive modeling choice:
1) They are neural networks. Neural networks are an efficient way to estimate the states of hidden Markov models (HMMs), compared to mixtures of Gaussians.
2) They are deep. More hidden layers allow for more complicated correlations between the input and the model states, so more structure can be extracted from the data.
3) They are generatively pre-trained. This is a neat pre-optimization algorithm that places the model in a good starting point for back-propagation to discover local maxima. Without this pre-optimization, models with many hidden layers are unlikely to converge on a good solution.
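For readers who (like me) wanted a bit more of the detail the article glosses over, the core of generative pre-training is typically a contrastive-divergence step on each layer's restricted Boltzmann machine. Here's a minimal numpy sketch of one such step (a standard CD-1 recipe, not necessarily the paper's exact training setup; all sizes and rates are placeholders):

```python
import numpy as np

def cd1_step(v0, W, a, b, lr=0.1, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step for a binary RBM.
    v0: batch of visible vectors; W: visible-to-hidden weights;
    a, b: visible and hidden biases."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    h0 = sig(v0 @ W + b)                       # infer hidden units from data
    h_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = sig(h_sample @ W.T + a)               # "fantasy" reconstruction
    h1 = sig(v1 @ W + b)
    # nudge parameters so the data becomes more probable than the fantasies
    W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (h0 - h1).mean(axis=0)
    return W, a, b

# stacking for depth: train one RBM, then feed its hidden activations
# forward as the "visible" data for the next layer's RBM, and so on
```

The "fantasy" reconstruction v1 is what makes the procedure generative: the model is trained to make its own generated data look like the real input, which is the part that invites the psychological parallels discussed below.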

The idea of using a "generative" procedure to pre-optimize a system seems like it may have immediate applicability for psychologists and linguists who also study "generative" phenomena. After all, the training algorithm is even called the "wake-sleep" algorithm, where the model generates "fantasies" during its pre-training. While the parallels are certainly interesting, without appreciating the details of the algorithm, it's difficult to know how deep these similarities actually are. In his IPAM lecture, Hinton notes that while some neuroscientists such as Friston do believe the model is directly applicable to the brain, he remains skeptical.

Ignoring psychological applications for the moment, I'm still left wondering about how "good" DBNs actually perform. The best performing model in this paper still only achieves a Phoneme Error Rate of 20%, and the variability attributable to feature types, number of hidden layers, or pre-training appears small, affecting performance by only a few percentage points. Again, the evaluation procedure is not entirely clear to me, so it's difficult to know how these values translate into real-world performance. I would believe that current voice-recognition technology does much better than 80%, and in far more adverse conditions than those tested here. It was also interesting to note that DBNs appear to have a problem with ignoring irrelevant input.

The dimensionality-reduction visualization (t-SNE) was pretty cool, plotting data points that are near to each other in high-dimensional space close together in 2-dimensional space. It would be nice to have some way to quantify the revealed structures using this visualization technique. The distinctions between Figs 3-4 and 7-8 are visually obvious, but I think we just have to take the authors at their word when they describe differences in Figs 5-6. Perhaps another way to visualize the hidden structure in the model, particularly comparing different individual hidden layers as in Figs 7-8, would be to provide dendrograms that cluster inputs based on the hidden vectors that are generated.

Overall, DBNs seem like they can do quite a bit of work for speech recognition systems, and the psychological implications of these models seem to be promising avenues for research. It would be really nice to see some more elaborate demonstrations of DBNs in action.