More specific things that occurred to me as I was reading:
- p.2-3: The authors mention how they're not going to be tackling the segmentation of auditory linguistic stimuli (not unreasonable), but that "any word segmentation model could easily be plugged into a system that recognizes phonemes from speech". It's not so clear to me that the phoneme level of representation is right for modeling initial word segmentation, though it's a reasonable first step. Specifically, given what we know of the time course of acquisition, it seems like native language phoneme identification isn't fully online till about 10-12 months - but initial word segmentation is likely happening around 6 months. Given this, it seems more likely that infants may be working with a representation that's more abstract than the raw auditory signal, but less settled than the adult phonemic representation. For example, perhaps allophones might be perceived as separate sounds by the infant at this point in development. Anyway, this isn't a critique of this model in particular - most word seg models I've seen work with phonemes - but it'd be very interesting to see how any of the prominent word seg models would perform on input that's messier than the phonemic representation commonly used.
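  As a concrete (if crude) illustration of what I mean by messier input, here is a sketch of one way a phonemic corpus could be perturbed with allophone-like variation before being fed to a segmenter. The allophone map and the variation probability are purely made up for illustration (and real allophony is of course context-dependent), so this is just a sketch of the idea, not anything from the paper:

  ```python
  import random

  # Purely illustrative, context-free allophone map; real allophonic variation
  # is conditioned on phonological context, which this ignores.
  ALLOPHONES = {
      "t": ["t", "t_h", "t_?"],   # e.g., plain, aspirated, glottalized variants
      "l": ["l", "l_w"],          # e.g., clear vs. dark /l/
  }

  def noisify(utterance, p_variant=0.5, seed=None):
      """Replace phonemes with a random surface variant with probability p_variant."""
      rng = random.Random(seed)
      noisy = []
      for phone in utterance:
          if phone in ALLOPHONES and rng.random() < p_variant:
              noisy.append(rng.choice(ALLOPHONES[phone]))
          else:
              noisy.append(phone)
      return noisy

  # The same phonemic string can now surface differently across tokens, so the
  # learner no longer sees a stable phonemic inventory.
  print(noisify(["k", "ae", "t", "l", "ih", "k"], seed=1))
  ```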
- p.9, p.14, p.18: The authors emphasize that their target unit of extraction is the phonological word (and their exposition of different definitions of "word" was quite nice, I thought). Unfortunately, they're limited to corpora segmented into orthographic words. I wonder how hard it would be to convert the existing corpus into a phonological word corpus - they say it's a hard and time-consuming process, but perhaps some rewrite rules could produce a reasonable approximation? Or maybe it would be useful to note how many "mis-segmentations" of any model are actually viable phonological word segmentations.
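  To make the rewrite-rule idea concrete, here is the kind of rough approximation I'm imagining. The function-word list and the greedy cliticization rule are my own toy assumptions, not anything from the paper, and a serious version would need stress and prosodic information:

  ```python
  # Toy list of function words that tend to lack their own stress; this only
  # shows the flavor of the rewrite rules, not a linguistically serious set.
  FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "do", "you", "your", "is"}

  def to_phonological_words(orthographic_words):
      """Greedily attach function words to the following word, approximating
      phonological words (e.g., 'the doggie' -> 'thedoggie')."""
      out, pending = [], ""
      for w in orthographic_words:
          if w.lower() in FUNCTION_WORDS:
              pending += w              # clitic-like: lean on the next word
          else:
              out.append(pending + w)
              pending = ""
      if pending:                        # utterance-final function words stand alone
          out.append(pending)
      return out

  print(to_phonological_words("do you see the doggie".split()))
  # -> ['doyousee', 'thedoggie']
  ```

  Even something this crude might be enough to estimate how many of a model's "mis-segmentations" are actually plausible phonological words.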
- Looking at figure 1 on p.10, and the exposition about the model: I wonder how the model actually chooses the most probable segmentation from all possible segmentations of an utterance. Initially, this is probably very easy because there's nothing in the lexicon. But once the lexicon is populated, it seems like there could be a lot of possibilities to choose from. Maybe some kind of heuristic choice? This part of learning is what the dynamic programming algorithms do in the Bayesian models of Pearl, Goldwater, & Steyvers (2010).
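  For reference, here is roughly the dynamic program I have in mind, in the spirit of the Viterbi-style decoding used in those Bayesian segmenters; `word_prob` is a stand-in for whatever scoring function the model's lexicon defines, so this is a sketch of the general recipe rather than this paper's actual procedure:

  ```python
  import math

  def best_segmentation(utterance, word_prob):
      """Find the highest-probability segmentation of `utterance` (a string),
      assuming word probabilities multiply. best[i] holds the best log-probability
      of a segmentation of utterance[:i]; back[i] records the split point used."""
      n = len(utterance)
      best = [0.0] + [float("-inf")] * n
      back = [0] * (n + 1)
      for i in range(1, n + 1):
          for j in range(i):
              p = word_prob(utterance[j:i])
              if p > 0 and best[j] + math.log(p) > best[i]:
                  best[i] = best[j] + math.log(p)
                  back[i] = j
      # Recover the best segmentation by following the back-pointers.
      words, i = [], n
      while i > 0:
          words.append(utterance[back[i]:i])
          i = back[i]
      return list(reversed(words))
  ```

  The point of the dynamic programming is that only the O(n^2) substrings of an utterance are scored, rather than the exponentially many segmentations being enumerated.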
- p.11 - the second phonotactic constraint: It's probably worth noting that requiring all words to have a syllabic sound means the learner must know beforehand (or somehow be able to derive) what a syllabic sound is. This seems like domain-specific knowledge (e.g., "all sounds with these properties are syllabic") - is there any way it wouldn't be? Supposing this is indeed domain-specific (though language-universal) knowledge, how plausible is it that humans have innate knowledge of the properties necessary for syllable-hood? I know there's some evidence that the syllable is a basic unit of infant perception, so this could be very reasonable after all.
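  Just to spell out how small (but still domain-specific) that piece of knowledge is, something like the following is presumably what the constraint amounts to; the inventory of syllabic segments here is my own assumption about an English-like system:

  ```python
  # Assumed inventory of syllabic segments: vowels plus English-style syllabic
  # consonants. This set is exactly the domain-specific knowledge at issue.
  SYLLABIC = set("aeiou") | {"l=", "m=", "n=", "r="}

  def satisfies_syllabic_constraint(candidate):
      """A candidate word is only entertained if it contains a syllabic segment."""
      return any(seg in SYLLABIC for seg in candidate)
  ```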
- p.17 - testing the "require syllabic" constraint on its own: The authors explain that a learner with only this constraint fails because longer words receive the same probability as shorter words. Maybe a slightly more informed version of this learner could assign each phoneme a small constant probability (rather than giving all unfamiliar words the same probability) - it seems like this would allow word-length effects to emerge and could lead to better segmentation. Maybe this learner would prefer CV or V words (due to their short length + still being syllabic) - which would lead to major oversegmentation. Still, I wonder how bad it would be, since so many English child-directed speech words are monosyllabic anyway.
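  Here is a sketch of the "slightly more informed" learner I'm imagining; the particular per-phoneme constant is arbitrary, and the geometric-style length penalty is my assumption about how such a score would be set up:

  ```python
  def unfamiliar_word_score(word, per_phoneme_prob=0.05):
      """Score an unfamiliar candidate as a product of small constant per-phoneme
      probabilities, so longer candidates are automatically penalized."""
      return per_phoneme_prob ** len(word)

  # Word-length effects emerge immediately: a CV candidate outscores a CVCV one,
  # which is where the worry about over-preferring short syllabic words comes from.
  print(unfamiliar_word_score("da"))    # 0.0025
  print(unfamiliar_word_score("dada"))  # 6.25e-06
  ```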