I found the simplicity of the proposed algorithm in this paper very attractive (especially when compared to some of the more technically involved papers we've read that come from the machine learning literature). The goal of connecting to known experimental and developmental data of course warmed my cognitive modeler's heart, and I certainly sympathized with the aim of pushing the algorithm to be more cognitively plausible. I did think some of the criticisms of previous approaches were a touch harsh, given what's actually implemented here (more on this below), but that may be more of a subjective interpretation thing. I did find it curious that the evaluation metrics chosen were about word boundary identification, rather than about lexicon items (in particular, measuring boundary accuracy and word token accuracy, but not lexicon accuracy). Given the emphasis on building a quality lexicon (which seems absolutely right to me if we're talking about the goal of word segmentation), why not have lexicon item scores as well to get a sense of how good a lexicon this strategy can create?
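To make concrete what I mean by lexicon item scores, here's a rough sketch (mine, not anything from the paper; the function name and representation are just illustrative): score the learned lexicon as a set of word types against the set of true word types.

```python
def lexicon_scores(learned_lexicon, true_types):
    """Type-level lexicon evaluation (my sketch, not the paper's metric):
    precision = how many learned types are real words,
    recall = how many real word types the learner has stored."""
    learned, truth = set(learned_lexicon), set(true_types)
    correct = learned & truth
    precision = len(correct) / len(learned) if learned else 1.0
    recall = len(correct) / len(truth) if truth else 1.0
    return precision, recall
```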
Some more specific thoughts:
Section 2.1, discussing the 9-month-old English-learning infants who couldn't segment Italian words from transitional probabilities alone unless they had already been presented with words in isolation: Lignos uses this to argue that transitional probabilities aren't a useful cue at all, but isn't another way to interpret it simply that transitional probabilities (TPs) can't do it all on their own? That is, if you initialize a proto-lexicon with a few words, TPs would work alright - they just can't work right off the bat with no information. Relatedly, the discussion of the Shukla et al. 2011 (apparently 6-month-old) infants who couldn't use TPs unless they were aligned with a prosodic boundary made me think more that TPs are useful, just not useful in isolation. They need to be layered on top of some existing knowledge (however small that knowledge might be). But it may just be Lignos's stance that TPs aren't that useful - in section 2.4, they're left off the list of things a model of word segmentation should pay attention to.
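Just to make the "TPs need a foothold" intuition concrete, here's the kind of computation I have in mind (my own sketch of the classic TP strategy, not anything implemented in the paper; the function names and the local-minimum rule are illustrative):

```python
from collections import Counter

def transitional_probs(utterances):
    """Estimate TP(B|A) = count(A followed by B) / count(A followed by anything)
    over syllable sequences (each utterance is a list of syllables)."""
    follows, totals = Counter(), Counter()
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            follows[(a, b)] += 1
            totals[a] += 1
    return {pair: n / totals[pair[0]] for pair, n in follows.items()}

def segment_at_tp_minima(utt, tp):
    """Posit a word boundary wherever the between-syllable TP is a local minimum
    (lower than the TPs on either side), the classic statistical-learning move."""
    scores = [tp.get(pair, 0.0) for pair in zip(utt, utt[1:])]
    words, current = [], [utt[0]]
    for i, syl in enumerate(utt[1:], start=1):
        here = scores[i - 1]
        before = scores[i - 2] if i >= 2 else float("inf")
        after = scores[i] if i < len(scores) else float("inf")
        if here < before and here < after:  # dip in predictability
            words.append(current)
            current = []
        current.append(syl)
    words.append(current)
    return words
```

Nothing here uses a proto-lexicon or prosody, which is exactly the all-on-their-own setting I'm skeptical of.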
Of course, I (and I'm assuming Lawrence as well, given Phillips & Pearl 2012) was completely sympathetic to the criticism in section 2.3 about how phonemes aren't the right unit of perception for the initial stages of word segmentation. They may be quite appropriate if you're talking about 10-month-olds, though - of course, at that point, infants probably have a much better proto-lexicon, not to mention other cues (e.g., word stress). I was a little less clear about the criticism (of Johnson & Goldwater) about using collocations as a level of representation. Even though this doesn't necessarily connect to adult knowledge of grammatical categories and phrases, there doesn't seem to be anything inherently wrong with assuming infants initially learn chunks that span categories and phrases, like "thatsa" or "couldI". They would have to fix them later, but that doesn't seem unreasonable.
One nice aspect of the Lignos strategy is that it's incremental, rather than a batch algorithm. However, I think it's more of a modeling decision than an empirical fact to not allow memory of recent utterances to affect the segmentation of the current utterance (section 3 Intro). It may well turn out to be right, but it's not obviously true at this point that this is how kids are constrained. On a related note, the implementation of considering multiple segmentations seems a bit more memory-intensive, so what's the principled reason for allowing memory for that but not allowing memory for recent utterances? Conceptually, I understand the motivation for wanting to explore multiple segmentations (and I think it's a good idea - I'm actually not sure why the algorithm here is limited to a beam size of 2) - I'm just not sure it's quite fair to criticize other models for essentially allowing more memory for one thing when the model here allows more memory for another.
I was a little confused about how the greedy subtractive segmentation worked in section 3.2. At first, I thought it was an incremental greedy thing - so if your utterance was "syl1 syl2 syl3", you would start with "syl1" and see if that's in your lexicon; if not, try "syl1 syl2", and so on. But then this wouldn't run into the ambiguity described: "...whenever multiple words in the lexicon could be subtracted from an utterance, the entry with the highest score will be deterministically used". So something else must be meant. Later on, when the beam search is described, it makes sense that there would be ambiguity - but I thought ambiguity was supposed to be present even without multiple hypotheses being considered.
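Here's my best reconstruction of what subtractive segmentation might look like, just to pin down where the ambiguity could come from (this is my sketch under those assumptions, not the paper's exact algorithm; the scoring here is just lexicon counts):

```python
def subtract_greedy(utt, lexicon):
    """One possible reading of greedy subtractive segmentation: scan left to
    right; whenever one or more lexicon entries match a prefix of the remaining
    syllables, subtract the highest-scoring one (this is where deterministic
    tie-breaking by score would matter); otherwise absorb syllables into a
    novel word until a known word or the end of the utterance is reached.
    `lexicon` maps syllable tuples to scores (e.g., counts)."""
    words, novel, i = [], [], 0
    while i < len(utt):
        matches = [entry for entry in lexicon
                   if tuple(utt[i:i + len(entry)]) == entry]
        if matches:
            if novel:                 # close off any pending novel material
                words.append(tuple(novel))
                novel = []
            best = max(matches, key=lambda e: lexicon[e])
            words.append(best)
            i += len(best)
        else:
            novel.append(utt[i])
            i += 1
    if novel:
        words.append(tuple(novel))
    return words
```

On this reading, the ambiguity is multiple lexicon entries matching at the same position (e.g., both "kitty" and "kittycat"), which a single greedy pass resolves by score and which a beam of two could keep alive as competing hypotheses. But I'm not certain that's the intended reading.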
The "Trust" feature described in 3.3 seemed like an extra type of knowledge that might be more easily integrated into the existing counts, rather than added on as an additional binary feature. I get that the idea was to basically use it to select the subset of words to add to the lexicon, but couldn't a more gradient version of this implemented, where the count for words at utterance boundaries gets increased by 1, while the count for words that are internal gets increased by less than 1? I guess you could make an argument either way about which approach is more naturally intuitive (i.e., just ignore words not at utterance boundaries vs. be less confident about words not at utterance boundaries).
I think footnote 7 is probably the first argument I've seen in favor of using orthographic words as the target state, instead of an apology for not having prosodic words as the target state. I appreciate the viewpoint, but I'm not quite convinced that prosodic words wouldn't be useful as proto-lexicon items (ex: "thatsa" and "couldI" come to mind). Of course, these would have to be segmented further eventually, but they're probably not completely destructive to have in the proto-lexicon (and they do feel more intuitively plausible as an infant's target state).
In Table 1, it seems like we see a good example of why precision and recall may be better than hit (H) rate and false alarm (FA) rate: The Syllable learner (which puts a boundary at every syllable) clearly oversegments and does not achieve the target state, but you would never know that from the H and FA scores. Do we get additional information from H & FA that we don't get from precision and recall? (I guess it would have to be mostly from the FA rate, since H = recall?)
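Writing out the boundary measures the way I understand them (my own formulation, with FA normalized by the non-boundary sites, as in standard signal detection):

```python
def boundary_scores(true_bounds, pred_bounds, n_sites):
    """Boundary-level evaluation. Boundaries are sets of between-syllable
    positions; n_sites is the number of potential utterance-internal boundary
    sites. Note that recall and hit rate are the same quantity; FA is the only
    measure that looks at the non-boundary sites."""
    hits = len(true_bounds & pred_bounds)
    false_alarms = len(pred_bounds - true_bounds)
    precision = hits / len(pred_bounds) if pred_bounds else 1.0
    recall = hits / len(true_bounds) if true_bounds else 1.0   # == hit rate H
    non_bounds = n_sites - len(true_bounds)
    fa_rate = false_alarms / non_bounds if non_bounds else 0.0
    return {"precision": precision, "recall/H": recall, "FA": fa_rate}
```

A learner that posits a boundary at every site gets H = 1.0 by construction, so whatever extra information there is has to come from how FA penalizes the non-boundary sites.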
I thought seeing the error analyses in Tables 2 and 3 was helpful, though I was a little surprised Table 3 didn't show the breakdown between undersegmentation and oversegmentation errors, in addition to the breakdown between function and content words. (Or maybe I just would have liked to have seen that, given the claim that early errors should mostly be undersegmentations. We see plenty of function words as errors, but how many of them are already oversegmentations?)
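This is the kind of breakdown I mean, sketched for concreteness (my own representation, not the paper's error analysis): classify each erroneous predicted token by whether it spans a true word boundary or sits inside a single true word.

```python
def classify_errors(true_words, pred_words):
    """Split erroneous predicted tokens into under- vs over-segmentations.
    Words are (start, end) spans over the syllable string; a predicted token
    that isn't a true word counts as an undersegmentation if it crosses an
    utterance-internal true boundary, and as an oversegmentation otherwise
    (it must then be a fragment of a single true word)."""
    true_set = set(true_words)
    internal_bounds = {end for _, end in true_words[:-1]}
    counts = {"undersegmentation": 0, "oversegmentation": 0}
    for start, end in pred_words:
        if (start, end) in true_set:
            continue
        if any(start < b < end for b in internal_bounds):
            counts["undersegmentation"] += 1
        else:
            counts["oversegmentation"] += 1
    return counts
```

Crossed with the function/content word split, that would answer the question directly.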