One of the first things that struck me about this paper was how well-written I found it - it was so easy to follow the different ideas, and I really appreciated how careful it was to explain the details of pretty much everything involved. I kept saying to myself, "Yes, this is how a modeling paper should be written! So clear, so honest!" (And I'm not just saying this because one of the authors is in the reading group.) To be fair, the pieces of this model are likely somewhat more transparent than those of other models we've looked at, and so lend themselves well to examination and explanation. Still, kudos to the authors, because they were very precise about both the modeling components and the ideas behind the model.
On a more content-related point, the authors were careful to indicate that a diphone-based process couldn't occur until after most of the phones of the language were determined, which wouldn't happen until around 9 months. Since word segmentation starts earlier than this, diphone-based learning is presumably a later-stage word segmentation strategy rather than an initial get-you-off-the-ground strategy. But I wondered if this was necessarily true. Suppose you have a learner who really would like to use diphone-based learning but hasn't figured out her phones yet. Would she perhaps try it anyway, simply using whatever fuzzy definition of phones she has (probably finer-grained distinctions than are actually present in the language)? For example, maybe she hasn't figured out that /b/ and /bh/ are the same phone in English because she's only 6 months old. (Or maybe that the /b/ in /bo/ is the same as the /b/ in /bi/.) This just means she has more "phones" than the 9-month-old diphone learner has - but how much does that matter? My guess is this still leads to fewer overall units than a syllable-based learner has. On the other hand, because there are more "phone" units for this 6-month-old diphone learner, maybe it takes longer to segment words out. A longer period of undersegmentation might occur, but it could still yield some useful units that help bootstrap the later lexical-diphone learner.
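To make this concrete, here's a toy sketch of my own (not from the paper) counting diphone types under a coarse, adult-like phone inventory versus a hypothetical finer-grained 6-month-old inventory; the labels b0/b1 are invented allophone-like units, and the one-utterance "corpus" is purely illustrative:

```python
from collections import Counter

def diphone_counts(utterances):
    """Count word-internal vs. boundary-straddling diphone tokens.
    Each utterance is a list of words; each word is a tuple of phone labels."""
    internal, straddling = Counter(), Counter()
    for words in utterances:
        for word in words:
            for a, b in zip(word, word[1:]):
                internal[(a, b)] += 1
        for w1, w2 in zip(words, words[1:]):
            straddling[(w1[-1], w2[0])] += 1
    return internal, straddling

# Toy utterance "ba oba" under a coarse, adult-like phone inventory...
coarse = [[("b", "a"), ("o", "b", "a")]]
# ...and under a hypothetical finer-grained inventory where the learner
# treats word-initial and word-medial /b/ as distinct "phones"
# (b0 vs b1; invented labels, not from the paper).
fine = [[("b0", "a"), ("o", "b1", "a")]]

coarse_types = set().union(*diphone_counts(coarse))
fine_types = set().union(*diphone_counts(fine))
print(len(coarse_types), len(fine_types))  # the finer inventory yields more diphone types
```

The point of the toy example is just that splitting one coarse phone into context-specific variants can only multiply the diphone types whose boundary statistics have to be estimated, which is one way to cash out the intuition that the 6-month-old version of the learner would be slower.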
Some more specific thoughts:
- The lexical learner here reminded me quite a bit of the one by Blanchard, Heinz, & Golinkoff (2010), and I found myself wanting a comparison between the two. They both clearly make use of phrasal units, and later on lexical units. I believe the BH&G learner also included knowledge of syllable structure, which the lexical learner here doesn't.
- With respect to syllables vs. diphones, I wonder how many syllables in English are diphones (CV syllables, really), and how informative those are for word segmentation. In some sense, this gets at how different a syllable-based learner is from a diphone-based learner. I imagine it would vary from language to language - Japanese would have more overlap between syllables and diphones, while German might have less. This seems related to the point brought up on p.149, in section 7.4.3, where they mention that the diphone learner could apply to syllables in Japanese (presumably rather than to phones).
- p.125: I like that they worry about the implausibility of word learning from a single (word segmentation?) exposure. I think there's something to the idea that it takes a few successful segmentations of a word form from fluent speech before it sticks around long enough to be entered into the lexicon (and hopefully get assigned a meaning later on). Related note on p.127, where they assume the segmentation mechanism has "no access to specific lexical forms" - this seems like the extreme version of that view. Unless I misunderstood, it implies that segmentation doesn't make use of individual word forms at all, so algebraic learning (e.g., "morepenguins" = "more" + "penguins" if "more" is a known form) shouldn't occur. I'm not sure how early this kind of learning occurs, to be honest, but it's certainly true that a lot of models (including BH&G's, I believe) assume this kind of information is available during segmentation.
- p.133, section 4.1.3: It's completely reasonable to use CELEX as a source for phonetic pronunciations and leave out words that aren't in CELEX (like baba), but I wonder how these omissions affect the token sequence probabilities. It would be nice to know whether what was left out was just a few frequently occurring types, or a number of different types (potentially containing many different diphone sequences).
- p.141, section 5.5: I really liked that they explored what would happen if the learner's estimate of how often a word boundary occurs was off (and then found that it didn't really matter). However, I do wonder if the learner's robustness has anything to do with the fact that the highest value in the range they looked at was still smaller than the "hard decision" boundary of 0.5 (mentioned on p.135).
- p.142: I also really liked that they looked at more realistic conversational speech data, which included effects of coarticulation (Stanford --> Stamford) - which would then be a good clue that the diphone sequence was part of the same word. I thought coarticulation occurred across word boundaries too in conversational speech, though - maybe it just occurs more often within words.
- p.146, section 7.2.1: I'm not quite sure I followed the part that says "undersegmentation means sublexically identified word boundaries can generally be trusted." If you've undersegmented, how do you know about word boundaries inside the chunk you've picked out? By definition, you didn't posit those word boundaries.
- p.148, section 7.4.1: I think the idea of prosodic words is extremely applicable to the word segmentation process at this stage. Given what we know of function words and content words, is there some principled way to resegment an existing corpus so it's made up of prosodic words instead of orthographic words? Or maybe the thing to do is to look at the errors made by existing word segmentation models and see how many of them could be explained by the model finding prosodic words instead of orthographic words. Maybe a model that finds a lot of prosodic words is closer to what human infants are actually doing?
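On the syllables-vs-diphones question above, here's one way the overlap could be probed, as a toy sketch with entirely invented mini-lexicons (not real data, and using a deliberately naive syllable representation): count what share of syllables are plain CV, and hence expressible as a single diphone.

```python
# Toy syllabified mini-lexicons: each word is a list of syllables,
# each syllable a tuple of phone labels. Entirely invented for illustration.
japanese_like = [[("k", "a"), ("t", "a")], [("n", "e"), ("k", "o")]]
german_like = [[("m", "a", "r", "k", "t")], [("sh", "p", "o", "r", "t")]]

VOWELS = {"a", "e", "i", "o", "u"}

def cv_share(lexicon):
    """Fraction of syllables that are exactly CV: one consonant, one vowel."""
    sylls = [syll for word in lexicon for syll in word]
    cv = [s for s in sylls if len(s) == 2 and s[0] not in VOWELS and s[1] in VOWELS]
    return len(cv) / len(sylls)

print(cv_share(japanese_like), cv_share(german_like))
```

Run over a real syllabified lexicon instead of these toys, a number near 1.0 would mean the syllable-based and diphone-based learners are working with largely overlapping units, while a low number would mean they genuinely diverge.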
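And on the p.141 robustness point, here's a schematic Bayes-rule sketch (my own stand-in, not the paper's actual estimator, with made-up likelihood numbers) of why the boundary decision can be insensitive to the assumed prior: when a diphone's likelihood ratio is far from 1, the posterior stays on the same side of the 0.5 hard-decision threshold across the whole range of priors below 0.5.

```python
def p_boundary(lik_boundary, lik_no_boundary, prior):
    """Posterior probability of a word boundary given a diphone, via Bayes'
    rule: P(#|xy) = P(xy|#)P(#) / (P(xy|#)P(#) + P(xy|no #)(1 - P(#)))."""
    num = lik_boundary * prior
    return num / (num + lik_no_boundary * (1 - prior))

priors = (0.1, 0.2, 0.3, 0.4)  # a range of assumed boundary rates, all below 0.5

# Hypothetical likelihoods: a diphone 20x likelier across a boundary,
# and one 20x likelier word-internally.
strong = [p_boundary(0.20, 0.01, p) > 0.5 for p in priors]
weak = [p_boundary(0.01, 0.20, p) > 0.5 for p in priors]
print(strong, weak)  # strong: all True; weak: all False
```

If most diphones in the data are like these (strongly diagnostic one way or the other), misestimating the prior within that sub-0.5 range never flips a decision, which would be one explanation for the robustness they report.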