Friday, January 20, 2012

Some thoughts on Daland & Pierrehumbert (2011)

One of the first things that struck me about this paper was how wonderfully well-written I found it to be - it was so easy to follow the different ideas, and I really appreciated how careful it was to explain the details of pretty much everything involved. I kept saying to myself, "Yes, this is how a modeling paper should be written!  So clear, so honest!" (And I'm not just saying this because one of the authors is in the reading group.)  To be fair, it's likely that the pieces of this model are somewhat more transparent than the pieces of other models we've looked at, and so lend themselves well to examination and explanation. Still, kudos to the authors on this - they were very precise about both the modeling components and the ideas behind the model.

On a more content-related point, the authors were careful to indicate that a diphone-based process couldn't occur until after most of the phones of the language were determined, which wouldn't happen until around 9 months.  Since word segmentation starts earlier than this, diphone-based learning is presumably a later-stage word segmentation strategy rather than an initial get-you-off-the-ground strategy.  But I wondered if this was necessarily true.  Suppose you have a learner who really would like to use diphone-based learning, but hasn't figured out her phones yet. Would she perhaps try to do it anyway, simply using whatever fuzzy definition of phones she has (probably finer-grained distinctions than are actually present in the language)?  For example, maybe she hasn't figured out that /b/ and /bh/ are the same phone in English because she's only 6 months old. (Or that the /b/ in /bo/ is the same as the /b/ in /bi/.) This means she just has more "phones" than the 9-month-old diphone learner has - but how much does that matter?  My guess is this still leads to fewer overall units than a syllable-based learner has.  Moreover, because there are more "phone" units for this 6-month-old diphone learner, maybe it takes longer to segment words out.  A longer period of undersegmentation might occur, but still yield some useful units that could help bootstrap the lexical-diphone learner.
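To put a rough number on the "more units" worry: the count of possible diphone types grows roughly quadratically with the phone inventory, so a finer-grained inventory is costlier than it might first seem. A back-of-the-envelope sketch (the inventory sizes are hypothetical round numbers, not figures from the paper):

```python
# Back-of-the-envelope sketch: if a learner's phone inventory is
# finer-grained than the adult one, the number of diphone types she
# must track grows roughly quadratically.  Inventory sizes below are
# hypothetical round numbers, not values from the paper.

def n_diphone_types(n_phones):
    """Upper bound on distinct diphone types for an inventory of n phones."""
    return n_phones * n_phones

adult_inventory = 40   # ballpark for English phones
fuzzy_inventory = 80   # a 6-month-old splitting e.g. /b/ vs. /bh/

print(n_diphone_types(adult_inventory))   # 1600
print(n_diphone_types(fuzzy_inventory))   # 6400
```

Even the fuzzy learner's few thousand diphone types plausibly compares favorably with the many thousands of distinct syllable types an English syllable-based learner would have to track.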

Some more specific thoughts:
- The lexical learner here reminded me quite a bit of the one by Blanchard, Heinz, & Golinkoff (2010), and I found myself wanting a compare-and-contrast between them.  They both clearly make use of phrasal units, and later on lexical units.  I believe the BH&G learner also included knowledge of syllables, which the lexical learner here doesn't.

- With respect to syllables vs. diphones, I wonder how many syllables in English are diphones (CV syllables, really), and how informative they are for word segmentation. In some sense, this gets at how different a syllable-based learner is from a diphone-based learner.  I imagine it would vary from language to language - Japanese would have more overlap between syllables and diphones, while German might have less.  This seems related to the point brought up on p.149, in section 7.4.3, where they mention that the diphone learner could apply to syllables in Japanese (presumably rather than phones).
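One could get a rough handle on this overlap by counting over a syllabified lexicon. A toy sketch, where the mini-lexicon and the crude CV test are hypothetical illustration data, not the CELEX material the paper works with:

```python
# Toy sketch: what share of syllable types are exactly CV, i.e. also a
# single diphone?  The mini-lexicon and vowel set are hypothetical
# illustration data, not CELEX counts.

VOWELS = set("aeiou")

def is_cv(syllable):
    """Crude test: exactly one non-vowel followed by one vowel."""
    return (len(syllable) == 2
            and syllable[0] not in VOWELS
            and syllable[1] in VOWELS)

toy_lexicon = [["ba", "by"], ["pen", "guin"], ["go"], ["cat"]]  # syllabified forms

syllable_types = {s for word in toy_lexicon for s in word}
cv_share = sum(is_cv(s) for s in syllable_types) / len(syllable_types)
print(round(cv_share, 2))   # 0.33
```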

- p.125: I like that they worry about the implausibility of word learning from a single (word segmentation?) exposure.  I think there's something to the idea that it takes a few successful segmentations of a word form from fluent speech before it sticks around long enough to be entered into the lexicon (and hopefully get assigned a meaning later on). A related note on p.127, where they assume the segmentation mechanism has "no access to specific lexical forms" - this seems like the extreme version of that view.  Unless I misunderstood, it implies that segmentation doesn't really make use of individual word forms, so algebraic learning (ex: "morepenguins = more+penguins" if more is a known form) shouldn't occur.  I'm not sure how early this kind of learning occurs, to be honest, but it's certainly true that a lot of models (including BH&G's, I believe) assume that this kind of information is available during segmentation.
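The algebraic learning idea can be sketched concretely. The lexicon below is a toy one (the more+penguins case is the example above), and this is emphatically not a cue the paper's model uses:

```python
# Toy sketch of algebraic segmentation: greedily strip known word forms
# off the front of an unsegmented stretch.  The lexicon is a toy
# example; the paper's model deliberately does NOT use this cue.

def algebraic_split(utterance, lexicon):
    """Peel known forms off the front; return (known words, remainder)."""
    known = []
    changed = True
    while changed:
        changed = False
        for word in sorted(lexicon, key=len, reverse=True):  # longest first
            if utterance.startswith(word):
                known.append(word)
                utterance = utterance[len(word):]
                changed = True
                break
    return known, utterance

print(algebraic_split("morepenguins", {"more"}))  # (['more'], 'penguins')
```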

- p.133, section 4.1.3: It's completely reasonable to use CELEX as a source for phonetic pronunciations and leave out words that aren't in it (like baba), but I wonder how these omissions affect the token sequence probabilities.  It would be nice to know whether it was just a few frequently occurring types that were left out, or a number of different types (potentially with many different diphone sequences).

- p.141, section 5.5: I really liked that they explored what would happen if the learner's estimate of how often a word boundary occurs was off (and then found that it didn't really matter).  However, I do wonder whether the learner's robustness has anything to do with the fact that the highest value in the range they looked at was still smaller than the "hard decision" boundary of 0.5 (mentioned on p.135).

- p.142: I also really liked that they looked at more realistic conversational speech data, including effects of coarticulation (Stanford --> Stamford) that would then be a good clue that the diphone sequence was part of the same word. I thought coarticulation occurred across word boundaries too in conversational speech, though - maybe it just occurs more often within words.

- p.146, section 7.2.1: I'm not quite sure I followed the part that says "undersegmentation means sublexically identified word boundaries can generally be trusted."  If you've undersegmented, how do you know about word boundaries inside the chunk you've picked out?  By definition, you didn't put in those word boundaries.

- p.148, section 7.4.1: I think the idea of prosodic words is extremely applicable to the word segmentation process at this stage. Given what we know about function words and content words, is there some principled way to resegment an existing corpus so it's made up of prosodic words instead of orthographic words?  Or maybe the thing to do is to look at the errors made by existing word segmentation models and see how many of them could be explained by the model finding prosodic words instead of orthographic words.  Maybe a model that finds a lot of prosodic words is closer to what human infants do?

1 comment:

  1. Hiya Lisa,

    A couple of point responses to your post.

    (1) Most of the evidence I am aware of suggests that it normally takes about 7 repetitions to learn a new word. I don't mean it is impossible to learn a new word from a single presentation; I just mean that people's success level tends to rise monotonically on various reasonable criteria of word knowledge (being able to say they've heard the form before, to match the form to one of several meanings, to produce the form from a meaning), up until 7-10 presentations. For example, this number pops up in Storkel's work, and also in unpublished word-learning results of mine.

    (2) "undersegmentation means sublexically identified word boundaries can generally be trusted."
    What I meant was that when you *have* posited a word boundary, you can generally trust that it really is there. In other words, you don't need to worry about false alarms.
    You are right that this means the model fails to find some boundaries, and especially boundaries within the prosodic word. My general belief is that lexical access takes care of this in adulthood.
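    One way to restate this is in terms of boundary precision vs. recall: an undersegmenting learner posits few boundaries, but the ones it does posit are correct. A toy illustration (the gold and predicted segmentations are made-up examples, not the paper's evaluation):

```python
# "Posited boundaries can be trusted," restated as precision vs. recall
# over boundary positions.  Gold and predicted segmentations are toy
# examples: the undersegmenter misses boundaries (low recall) but never
# false-alarms (perfect precision).

def boundary_positions(words):
    """Word-internal boundary indices of a segmentation, as a set."""
    positions, i = set(), 0
    for word in words[:-1]:
        i += len(word)
        positions.add(i)
    return positions

gold = boundary_positions(["look", "at", "the", "big", "doggie"])
pred = boundary_positions(["lookatthe", "big", "doggie"])   # undersegmented

precision = len(gold & pred) / len(pred)
recall = len(gold & pred) / len(gold)
print(precision, recall)   # 1.0 0.5
```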

    (3) "Unless I misunderstood, it implies that segmentation doesn't really make use of individual word forms, so algebraic learning (ex: "morepenguins = more+penguins" if more is a known form) shouldn't occur."
    You are correct that the model itself does *not* make use of this segmentation cue. However, that is not because I believe that infants do not make use of lexical access. Rather, it is because I thought it would be a more productive research strategy to ask, "How much can we get from diphones alone?"
    If I had done a model that used both, *without* doing the diphones alone, we wouldn't know how much came from phonotactics and how much from lexical access. In fact, in my dissertation I did do a model that had a prelexical/phonotactic stage and then a lexical stage. What happened was: *IF* the model was equipped with the correct words to begin with, then nearly flawless segmentation was obtained. However, that was a big if, because if the model tried to learn words using a dumb strategy like "if a unit has been segmented n times, it must be a word", then the whole system went into an error snowball.
    I did not include those results in the Cognitive Science paper for two reasons. First, there was limited space, and I thought the results we did report were important for substantiating the phonotactic approach. Second, the 2-stage model stuff just didn't feel done -- there were so many parametric variations that could have been tried, the assumptions were much trickier, and so on. The possibility of error snowballs was the most solid result I had to offer, and what people normally want is solutions, not even more problems.