Monday, April 29, 2013

Some thoughts on Lignos 2012

I found the simplicity of the proposed algorithm in this paper very attractive (especially when compared to some of the more technically involved papers we've read that come from the machine learning literature). The goal of connecting to known experimental and developmental data of course warmed my cognitive modeler's heart, and I certainly sympathized with the aim of pushing the algorithm to be more cognitively plausible.  I did think some of the criticisms of previous approaches were a touch harsh, given what's actually implemented here (more on this below), but that may be more of a subjective interpretation thing.  I did find it curious that the evaluation metrics chosen were about word boundary identification, rather than about lexicon items (in particular, measuring boundary accuracy and word token accuracy, but not lexicon accuracy).  Given the emphasis on building a quality lexicon (which seems absolutely right to me if we're talking about the goal of word segmentation), why not have lexicon item scores as well to get a sense of how good a lexicon this strategy can create?

Some more specific thoughts:

Section 2.1, discussing the 9-month-old English-learning infants who couldn't segment Italian words from transitional probabilities alone unless they had already been presented with words in isolation: Lignos is using this to argue against transitional probabilities as a useful metric at all, but isn't another way to interpret it simply that transitional probabilities (TPs) can't do it all on their own?  That is, if you initialize a proto-lexicon with a few words, TPs would work alright - they just can't work right off the bat with no information.  Relatedly, the discussion of the Shukla et al. 2011 (apparently 6-month-old) infants who couldn't use TPs unless they were aligned with a prosodic boundary made me think more that TPs are useful, just not useful in isolation.  They need to be layered on top of some existing knowledge (however small that knowledge might be).  But I think it just may be Lignos's stance that TPs aren't that useful - they seem to be left out as something a model of word segmentation should pay attention to in section 2.4.
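(For reference, the TP cue I have in mind is the standard one: estimate TP(B|A) = count(AB)/count(A) over adjacent syllables and posit boundaries at local TP minima. A toy sketch of that idea - my own, not anything from the paper - is below.)

from collections import Counter

def transitional_probs(utterances):
    """utterances: list of lists of syllables. Returns TP(s2 | s1) for attested pairs."""
    pair_counts, first_counts = Counter(), Counter()
    for utt in utterances:
        for s1, s2 in zip(utt, utt[1:]):
            pair_counts[(s1, s2)] += 1
            first_counts[s1] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def boundaries_at_tp_minima(utt, tps):
    """Return positions j (a boundary before syllable j) wherever the TP dips to a local minimum."""
    scores = [tps.get(pair, 0.0) for pair in zip(utt, utt[1:])]
    return {i + 1 for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]}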

Of course, I (and I'm assuming Lawrence as well, given Phillips & Pearl 2012) was completely sympathetic to the criticism in section 2.3 about how phonemes aren't the right unit of perception for the initial stages of word segmentation. They may be quite appropriate if you're talking about 10-month-olds, though - of course, at that point, infants probably have a much better proto-lexicon, not to mention other cues (e.g., word stress). I was a little less clear about the criticism (of Johnson & Goldwater) regarding the use of collocations as a level of representation.  Even though this doesn't necessarily connect to adult knowledge of grammatical categories and phrases, there doesn't seem to be anything inherently wrong with assuming infants initially learn chunks that span categories and phrases, like "thatsa" or "couldI". They would have to fix them later, but that doesn't seem unreasonable.

One nice aspect of the Lignos strategy is that it's incremental, rather than a batch algorithm.  However, I think it's more a modeling decision than an empirical fact to not allow memory of recent utterances to affect the segmentation of the current utterance (section 3 Intro).  It may well turn out to be right, but it's not obviously true at this point that this is how kids are constrained.  On a related note, the implementation of considering multiple segmentations seems a bit more memory-intensive, so what's the principled reason for allowing memory for that but not allowing memory for recent utterances? Conceptually, I understand the motivation for wanting to explore multiple segmentations (and I think it's a good idea - I'm actually not sure why the algorithm here is limited to 2) - I'm just not sure it's quite fair to criticize other models for essentially allowing more memory for one thing when the model here allows more memory for another.

I was a little confused about how the greedy subtractive segmentation worked in section 3.2.  At first, I thought it was an incremental greedy thing - so if your utterance was "syl1 syl2 syl3", you would start with "syl1" and see if that's in your lexicon; if not, try "syl1 syl2", and so on. But this wouldn't run into ambiguity then: "...whenever multiple words in the lexicon could be subtracted from an utterance, the entry with the highest score will be deterministically used". So something else must be meant. Later on when the beam search is described, it makes sense that there would be ambiguity - but I thought ambiguity was supposed to be present even without multiple hypotheses being considered.

The "Trust" feature described in 3.3 seemed like an extra type of knowledge that might be more easily integrated into the existing counts, rather than added on as an additional binary feature.  I get that the idea was to basically use it to select the subset of words to add to the lexicon, but couldn't a more gradient version of this implemented, where the count for words at utterance boundaries gets increased by 1, while the count for words that are internal gets increased by less than 1? I guess you could make an argument either way about which approach is more naturally intuitive (i.e., just ignore words not at utterance boundaries vs. be less confident about words not at utterance boundaries).

I think footnote 7 is probably the first argument I've seen in favor of using orthographic words as the target state, instead of an apology for not having prosodic words as the target state. I appreciate the viewpoint, but I'm not quite convinced that prosodic words wouldn't be useful as proto-lexicon items (ex: "thatsa" and "couldI" come to mind). Of course, these would have to be segmented further eventually, but they're probably not completely destructive to have in the proto-lexicon (and do feel more intuitively plausible as an infant's target state).

In Table 1, it seems like we see a good example of why precision and recall may be better than hit (H) rate and false alarm (FA) rate: The Syllable learner (which puts a boundary at every syllable) clearly oversegments and does not achieve the target state, but you would never know that from the H and FA scores.  Do we get additional information from H & FA that we don't get from precision and recall? (I guess it would have to be mostly from the FA rate, since H = recall?)
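(Here's a toy calculation - my own, not from the paper - laying the definitions side by side for a 6-syllable utterance with true boundaries after syllables 2 and 5.)

def boundary_metrics(gold, predicted, n_positions):
    """gold, predicted: sets of internal boundary positions; n_positions: number of candidate positions."""
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = n_positions - tp - fp - fn
    hit_rate = tp / (tp + fn)       # identical to recall
    false_alarm = fp / (fp + tn)    # computed over the true non-boundaries
    precision = tp / (tp + fp)      # computed over the posited boundaries
    return hit_rate, false_alarm, precision

gold = {2, 5}                  # true boundaries after syllables 2 and 5
everywhere = {1, 2, 3, 4, 5}   # the Syllable learner: a boundary at every candidate position
print(boundary_metrics(gold, everywhere, n_positions=5))   # (1.0, 1.0, 0.4)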

I thought seeing the error analyses in Tables 2 and 3 was helpful, though I was a little surprised Table 3 didn't show the breakdown between undersegmentation and oversegmentation errors, in addition to the breakdown between function and content words.  (Or maybe I just would have liked to have seen that, given the claim that early errors should mostly be undersegmentations. We see plenty of function words as errors, but how many of them are already oversegmentations?)

15 comments:

  1. I also found the lit review a bit harsh. I felt a general tone of, "I am following the infant acquisition literature, but no one else does". At best, this is a failure to appreciate the intent behind some other research.

    -----------------------------------
    Comment 1: SYLLABLES VS. ALLOPHONES VS. MIS-REPRESENTING YOUR COMPETITORS' POSITIONS
    -----------------------------------

    For example, the paper attributed a phoneme-level representation to my own paper, whereas we spent over a page explaining the difference between phonemes and allophones, and arguing that an allophonic-level representation was more likely to be appropriate for infants. We quite explicitly made the point that we were using an allophonic transcription, not a phonemic one as Lignos claimed.

    Lignos' lit review seemed to regard this as inexcusably ignoring the acquisition data, whereas in my own opinion it is the position which is the most *faithful* to the acquisition data.
    To expand upon this point, the Lignos paper stated uncritically that the syllable is the unit of perception for infants. Many people believe this, but it must be acknowledged that there is no incontrovertible proof. Lignos cited three papers which adopt this interpretation, and whose results are consistent with it. However, none of them is knock-down proof.
    For example, the Bertoncini and Mehler results cannot be explained unless we accept that infants know the difference between two and three vowels (and do not in all cases care about the difference between two and three consonants). This does not automatically imply that they are assigning syllabic mental representations, or that they are failing to regard onset consonants as distinct acoustic events. You could as easily get the same result by counting overt vowels and disregarding coda consonants, which are known to be perceptually difficult even for adults. The three papers which are cited *argue* that the syllable is a perceptual unit for infants, but they do not prove it.
    Moreover, the theoretical status of syllables is unclear. Personally, I find Steriade's licensing-by-cue theory a much more satisfactory account altogether. In this theory, syllables are an epiphenomenon of speakers trying to produce each vowel as a mini-word. It correctly accounts for the ambiguity in items like ?PA.STA/?PAS.TA, and for the contrast between DE.MON/*DEM.ON vs. ?LE.MON/?LEM.ON. Like the syllabification theory, it correctly accounts for why assimilation processes are generally regressive; unlike the syllabification theory, it also correctly accounts for the fact that retroflexion assimilation is progressive.

    Yet another point. It is simply a fact that ***native speakers of different languages syllabify the same string in different ways***. It follows straightforwardly from this fact that ***syllabification must be learned***. The type of phonotactic knowledge that you would need to learn syllabification is the kind that infants seem to be just coming into at 9 months. Therefore, it seems quite strange to me to assert that syllabification is the primitive unit of perception at 6 months. How could infants have syllables as a unit when they haven't yet learned the phonotactic properties of their language that tell them how to segment strings into syllables??
    Yet another point. It is possible in English and in many other languages for people to make up new syllables that are grammatical. For example, ZILF. How can this be a primitive unit of perception when we have never heard it before?? It seems far more sensible to me that we attempt to interpret this novel acoustic sequence combinatorially, as a sequence of familiar segments. I do not really understand how the "syllable as primitive unit" can account for the perception of novel syllable types.

    Replies
    1. Thanks for the discussion of the infant syllabification lit review - very helpful!

      I definitely agree that some part of syllabification must be learned (e.g., the official syllabification for ambiguous words like "pasta"), but I wonder about that initial percept, from which we construct the rest. My extremely limited understanding of the adult neuroliterature is that some syllable-like thing (maybe it's a syllable nucleus plus some surrounding phonetic material) is the basic percept. From that, we derive phonemes (and maybe more concrete syllabifications). This takes me to the ZILF example - I completely agree that this is a novel syllable which we can interpret as being made up of 4 phones. But is there some coarser representation that gets picked out first when we first hear it (e.g., [+sibilant]IL[+fricative], or whatever)? If so, this first coarse percept might be reasonable to assume as the unit for infant word segmentation.

    2. I agree that the label "phoneme-based" for Daland & Pierrehumbert 2011 and Adriaans & Kager 2010 can be misinterpreted. It was not intended to draw a contrast between allophonic and phonemic, but rather between segments and larger perceptual units such as syllables. In my opinion the discussion of phonemic/allophonic is largely pointless unless the model generalizes over it coherently, as is the case for Adriaans & Kager (which I think is an exemplary study in this area).

    3. I think there is some terminological confusion.

      When I hear people say "the syllable is the basic perceptual unit", what I understand is the following:
      -- there is no internal structure that is perceived
      -- at best there is some kind of phonetic resemblance

      Thus, the claim is that the mental representations of [ba] and [bi] crucially don't include the knowledge that they start with the same type of articulatory/perceptual event. An even stronger claim, which I believe is also entailed, is that the mental representations of [gi] and [bi] do not include the knowledge that they *end* with the same type of articulatory/perceptual event. At best, the claim that "the syllable is the primitive perceptual unit" could only predict that the acoustic similarity between [bi] and [gi] is higher than between many other pairs of syllables.

      Lisa, as for the representation that you proposed ("[+sibilant]IL[+fricative]"), it has internal structure. So, to me, an internal representation like that is in direct conflict with the claim that the syllable is the basic unit. If the listener perceives some segment-like constituents of a syllable, then those are some basic units.
      The time when it crucially matters is when you have a consonant cluster between vowels. Under a syllabic theory, the infant is required to partition that input, with multiple partitions allowed. If the syllable is a primitive unit, what governs the partition? This question is the locus of cross-linguistic variability, which is exactly why I am convinced it must be learned as a result of some phonotactics, rather than the other way around.

      Another thought, which I would not like to expand on here, is that it is not logically necessary that there be a *single*, privileged unit of perception.

      There are two more reasons, in addition to the ones I previously mentioned, that I am opposed to the idea of syllables as a primitive unit in word segmentation. The first is that it fails to explain how listeners could identify alternations (such as [mam]~[ma.mi]); indeed, it seems hard to reconcile with the typological fact that syllabification is never distinctive within a language. This fact is standardly interpreted as evidence that syllable structure is not stored in the lexicon, which provides a straightforward account of how the relationship is identified in pairs like [mam]~[mami]. The second reason I do not like using syllables in word segmentation models is that the syllable boundaries give you many non-boundaries "for free". They also prevent you from recovering the correct parse in a small minority of cases, notably when a consonant-final function word precedes a vowel-initial content word.

      I do not mean to be a pest about this, but I have thought about it extensively, and feel quite strongly about it. I have not seen any powerful arguments to the contrary.

    4. Constantine, in the discussion section of my paper we discussed the distributional properties of diphones at length. It turns out that you don't actually need to generalize very much for diphones, since you encounter the frequent ones pretty frequently, and you encounter the infrequent ones, well, infrequently. The generalization procedure that we adopted in that paper was the conservative policy of positing a boundary whenever encountering a novel diphone. This was done as an operational matter, since we needed to do something, and that choice stacked the odds against the undersegmentation story we were pushing. However, in practice, the actual number of previously unseen diphones is negligible. Though we made a big deal that there are more than zero, generalization isn't really an issue at that scale.
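      (Schematically, that conservative policy amounts to something like the sketch below. This is a paraphrase, not our actual code, and the threshold clause is just a stand-in for the model's real statistical decision.)

def diphone_boundaries(phones, attested, boundary_score, threshold=0.5):
    """phones: the segments of one utterance.
    attested: set of diphones observed so far.
    boundary_score: dict mapping a diphone to an estimated boundary probability."""
    cuts = []
    for i, diphone in enumerate(zip(phones, phones[1:])):
        if diphone not in attested:                         # novel diphone: posit a boundary
            cuts.append(i + 1)
        elif boundary_score.get(diphone, 0.0) > threshold:  # otherwise, a statistical decision
            cuts.append(i + 1)
    return cuts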

      As for generalizing "properly", the right metric is a matter of some debate. I am not remembering the exact figures, but I believe the actual performance of the Adriaans model was somewhere in the vicinity of 40% correct boundary recognition. The Daland and Pierrehumbert model was in excess of 90%. So, while it is fine to think of it as a triumph for linguistic theory that the Adriaans model got better with generalization, it must be acknowledged that it is still making orders of magnitude more errors than any other model on the market.

      If we ask ourselves why this happens, there is a likely possibility. The Daland & Pierrehumbert model doesn't care about segmental generalizations, but it efficiently exploits distributional information at phrase/word edges and gets >90%. The Adriaans & Kager model cares about segmental generalization, but doesn't pay attention to distributional information at phrase/word edges, and gets <50%. For me, the implication of this contrast is clear. Properly leveraging the information at boundaries is far more important than segmental generalization.

    5. @LA Denizen: About the perceptual unit issue, I definitely see your point that imposing [+sibilant] and [+fricative] already constitutes some structure. I guess what I was thinking was exactly the thing you mentioned about phonetic resemblance (so ZILF would be similar to SILV and ZILP, etc. because of the phonetic similarity, not because the infant actually initially perceives the individual phones/phonemes/phonetic features.)

      "Under a syllabic theory, the infant is required to partition that input, with multiple partitions allowed. If the syllable is a primitive unit, what governs the partition?"

      Putting on my devil's/syllable's advocate hat, I suppose you would have to say that there's some kind of regular/predictable acoustic distinction between the things infants perceive as syllables, and infants use that. (Caveat: I have absolutely no idea if something like that exists or even could exist, given acoustic variability.)

      "The first is that it fails to explain how listeners could identify alternations (such as [mam]~[ma.mi]); indeed, it seems hard to reconcile with the typological fact that syllabification is never distinctive within a language."

      :: syllable's advocate hat on :: While I'm quite willing to agree that adult listeners can relate these pairs, do we know that infants can?


      "The second reason I do not like using syllables in word segmentation models is because the syllable boundaries give you many non-boundaries "for free". They also prevent you from recovering the correct parse in a small minority of cases, notably when a consonant-final function word precedes a vowel-initial content word."

      This is most definitely a problem - but I think it also gets into the issue of what the target state ought to be for the initial stages of word segmentation. I'm assuming no one wants syllables to be the basic unit of segmentation once phonemes are known, so the question is really what to do before phonemes are known. Do we have evidence that indicates whether infants correctly segment these cases you mention (like the consonant-final function word preceding a vowel-initial content word)?



  2. -------------------------------------------
    Comment 2: BATCH VS. INCREMENTAL IS A RED HERRING
    -------------------------------------------
    Along the lines of following the infant literature, the Lignos review disses numerous other papers by stating that they are not incremental. There are several problems with this.
    First, Eleanor Batchelder's BOOTLEX (Cognition, 2001 I believe) was incremental. This paper was not reviewed. At best, Lignos missed this reference and stated something factually inaccurate as a result.
    Second, many of the other papers reviewed are *executed* as batch learners for efficiency, but are implementable as incremental learners. For example, the Daland & Pierrehumbert paper did incremental learning, in batches of 1 day. Heinz and colleagues have implemented an incremental version of the Goldwater paper. Lignos dismisses these papers as non-incremental, but a careful read shows that they are equivalent to incremental models. For example, my own model makes a boundary decision on the segment after the potential boundary. That is about as non-batch as you can get.
    I agree with Lisa that the paper seems to be mixing up the implementation with the algorithm specification. I would add that the error in this case seems to be especially self-serving.

    ----------------------------------------------
    Comment 3: COLLOCATIONS
    ----------------------------------------------
    The Lignos paper ignores two really nice pieces of evidence that infants exploit collocational information for segmentation.
    To be clear, the equation that Lignos gives is fully equivalent to probability maximization of a unigram word frequency model. Unigram = assuming statistical independence between words.
    The first piece of evidence is Goldwater's work -- in her paper with Griffiths and Johnson she shows that language exhibits strong collocational tendencies, and that infants would actually do *better* if they were to exploit this. Of course, this does not show that infants *do* use collocational information, but it certainly suggests that ignoring it would be a bad design decision.
    Second, and even more compellingly, there is infant research that suggests that infants do exploit collocations. For example, Mintz's frequent frames experiments suggest that infants only a few months older assign proto-syntactic categories to words based on collocation with function words.

    ----------------------------------------------
    Comment 4: UNCLARITY OF THE BEAM SEARCH
    ----------------------------------------------
    There were not enough technical details about how the beam search worked for another person to replicate it. This is especially unfortunate, since the beam search was the primary original contribution of this paper.
    The issue of how infants recover from under-/over- segmentation errors is under-researched, no doubt in part because (contra the paper's confident assertions to the contrary) we do not actually know very much about what segmentation errors young infants make.
    It really would have been nice if this paper provided more details about how the beam search worked, for example a fully worked-out example of how "isthat" is initially under-segmented and then correctly segmented.

    -------------------------------------------
    Comment 5: WORD LEARNING
    -------------------------------------------
    Even adults do not learn most wordforms in a single shot. In the child and adult studies I have seen, the probability of correctly learning a wordform after one exposure seems to be about 20%, while the probability of correctly learning a wordform after 7 exposures in a single session seems to be about 80%.
    Our understanding of the memory processes involved in word learning is highly imperfect. As modelers we have to do something. But I find it highly unsatisfactory to simply add a form to the lexicon the first time it is encountered (see the toy sketch at the end of this comment).
    In fact, I do not think that word segmentation and word learning need to be modeled together, for this reason. We do not actually know that much about the wordform-learning process.
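    (For what it's worth, even something as crude as the following would count as "doing something" other than one-shot learning. It is purely illustrative; the exponential form and the rate are made up to roughly match the 20%/80% figures above.)

import math
import random

def recall_probability(exposures, rate=0.2):
    """Probability that a wordform with this many exposures is available for segmentation."""
    return 1.0 - math.exp(-rate * exposures)   # ~0.18 after 1 exposure, ~0.75 after 7

def available_lexicon(exposure_counts, rng=random):
    """exposure_counts: dict mapping wordform -> number of exposures.
    Returns the subset of wordforms the learner successfully recalls on this trial."""
    return {w for w, n in exposure_counts.items() if rng.random() < recall_probability(n)}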

    Replies
    1. Re: Collocations

      Are you referring to the Goldwater, Griffiths, & Johnson (GGJ) work that shows the bigram assumption is helpful (for a phoneme-based learner, but still)? So in that sense, a bigram would be the useful collocation? In effect, the collocation would serve as the implicit input to the word seg process GGJ use.

      Also, very interesting point about the frequent frames (FFs) evidence! The basic idea would be that a collocation XYZ serves as input to the FF process, with X_Z acting as the frame for Y.

      Re: Word learning

      I'm completely with you that perfect word learning after one exposure seems to be the exception rather than the rule. So, I'd be happy to implement a more gradient wordform hypothesis process for sure. But I do think there's some utility in strategies that create proto-lexicons, even if we don't know a lot about the wordform-learning process. At the very least, they tend to get a leg up performance-wise simply because they can use known wordforms to segment incoming data.

    2. Some technical comments:
      1. "To be clear, the equation that Lignos gives is fully equivalent to probability maximization of a unigram word frequency model. Unigram = assuming statistical independence between words." While the metrics are similar, maximizing the unigram probability (the product of the probabilities of words in the utterance) and maximizing the geometric mean (the geometric mean of the probabilities of words in the utterance) are not mathematically equivalent. (I agree that for many inputs they will produce the same output.) Nor is the independence assumption the same as in GGJ work; this learner does not try to reach an optimal unigram segmentation. It uses the geometric mean to guide it in local decisions within a small search space, not as the global goal. This fact and the penalization are why it does not seem to display the same error pattern as other unigram models. That said, I of course agree that bigram/contextual information would be an important extension to the model.

      2. The proceedings format didn't provide for enough room, but regarding comment 4, see the cited 2011 CoNLL paper, as noted in the paper under discussion ("Full implementation details, including pseudocode for all variants, are given in Lignos 2011"). Comment 5 is discussed for a different audience in that paper as well (and in my 2010 CoNLL paper which it cites in turn); in short, I use a simple model of probabilistic word recall to avoid the idea that words are instantly learned. I think there are probably better ways of doing it than I did there, but in short removing the assumption that "words" are instantly learned has little impact on the system. In order to press this point harder, at one point I *only* reported performance numbers that reflected a probabilistic model of recalling words; so in short I am very sympathetic to your view in comment 5.
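      (The toy illustration promised in 1., with made-up numbers: the raw product prefers the single-word analysis, while the geometric mean prefers the two-word one.)

import math

def product(probs):
    return math.prod(probs)

def geometric_mean(probs):
    return math.prod(probs) ** (1.0 / len(probs))

one_word = [0.1]         # hypothetical score for leaving "isthat" unsegmented
two_words = [0.2, 0.2]   # hypothetical scores for "is" + "that"

print(product(one_word), product(two_words))                 # 0.1 vs 0.04: the product prefers one word
print(geometric_mean(one_word), geometric_mean(two_words))   # 0.1 vs 0.2:  the geometric mean prefers two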

    3. Constantine, I think the Trust feature in your model was a nice first pass. In my dissertation I experimented with something similar. In that case I used a frequency threshold cutoff.

      The claim that "removing the assumption that "words" are instantly learned has little impact on the system" is an interesting one. Establishing this fully and carefully for one particular lexical model would be a nice contribution.

      For various reasons I have become convinced that it is not actually desirable to model word learning and word segmentation together. That is, I think word learning is sensitive to a whole set of factors (like caregiver-infant joint attention, pragmatic information) that we segmentation-modelers don't have access to. Obviously we want segmentation to feed word-learning, but it would be quite nice to know that overly optimistic estimates of word-learning abilities do not actually hurt or help much for the segmentation end.

  3. I find it a bit surreal that this paper devoted more than a page to discussing the (de)merits of computational modeling while having less than half a page of discussion and conclusions. I think I'll join them in spirit and leave it at that!

    Replies
    1. Welcome to the surreal world of trying to fit it all in the length limits of a standard linguistics proceedings paper!

  4. Hi Lisa,
    It's great to see your thoughts here, and I hope your group had a lively discussion. Just a few minor technical comments below:

    1. I'm not sure what you mean by hit rate and false alarm rate not giving you information about what the syllable baseline is doing when it is inserting a boundary at every possible location. Hit rate and false alarm rate make it clear that it labels every word boundary as a word boundary (same as recall = 1.0) and labels every word-internal position as a word boundary too. I find this slightly more informative than providing precision (which will just tell you the balance of classes, i.e., what % of your data is made up of word boundaries), but the real point of the discrimination metrics is that the learner "knows" nothing, since HR = 1.0 and FA = 1.0 implies zero discrimination (A-prime is undefined, but effectively nil).

    2. If anything's unclear about the algorithm, the cited CoNLL paper I did the previous year has all of the details and pseudocode. I wish more could have been fit in that paper! I do agree that a gradient or probabilistic version of Trust would be interesting. It's hard to imagine what it would look like without overfitting the data we have.

    3. I didn't have any room in that paper to talk about why the lexicon metric is strange for online learners evaluated in the way described in that paper: there is no training before evaluation begins, but learning is allowed during evaluation. I'm adding additional evaluations in the thesis that include that metric in a way that makes for a reasonable comparison between batch and online learners. I'll send that thesis chapter over to you when I have a chance.

    Replies
    1. Actually, we're about to have the discussion this afternoon, so your comments are wonderfully timely!

      1. Ah, I see what you're saying now about the HR vs. FA and A' - my apologies about my misinterpretation before. In my too-fast reading of it, I was thinking both being 1.0 was equivalent to a perfect score - but that's not right, of course, since a high FA rate is bad. So is there an easy way to relate them individually or collectively to general under vs. oversegmentation, the way you can for Precision and Recall? Maybe a relatively high FA means more oversegmentation, while a relatively low HR means more undersegmentation?

      2. ;) I definitely hear you on space limitations. I took a look at the CoNLL paper, and I think I get both the main algorithm and the beam search extension now. Both of them effectively use some kind of lookahead on the utterance to determine if more than one lexicon item will fit (so they're not totally greedy, looking at things syllable by syllable). For the beam search, is there ever a case where more than two lexicon items would fit, and you have to make some kind of choice to cull the alternatives to 2? (I'm thinking of your "part of an apple" example, where maybe you have "part", "partof", "ofa", and "partofa" in your lexicon at some point.)

      3. Again, I hear you on the space limitations. :) I'd definitely be interested in the lexicon metrics you come up with which seem fair comparisons.

    2. 1. Yes, you can do more to interpret HR and FA in the desired fashion. Bias metrics are the way to go for getting the over/undersegmentation analysis, for example B'' ("B double prime"). I wanted to discuss that in this paper, but ran out of room; it'll be in the thesis chapter I'm working on.

      2. Correct, the segmentation is not greedy in the syllable-by-syllable sense, and there's a further question of how to do lookahead fairly. The segmenter is able to look ahead to the end of the utterance, which is why I do not consider this a realtime processing model. Just like in every other decision in this model, my inclination is to make the assumption that creates the fewest degrees of freedom; similarly, even though we know children must have some ability to recall previous utterances, I go for the "no utterance memory" model as a starting point to see what happens. Practically speaking, the case where there are more than two possible subtractions at a single point rarely occurs (even if you do not limit the beam size, the average beam is only slightly larger than the 1.05 you get when it is limited to 2, IIRC). Once you have hit a beam of two (for example, you are exploring segmentations of "part of an apple" that start with "partof" and "part"), each hypothesis unfolds greedily, independently of the other; that is, at each possibly ambiguous point the highest-scoring word is used (a rough sketch of this behavior follows below).

      There are so many interesting variables related to the beam search; I only claim that I have given one simple strategy that will give rise to the type of behavioral changes we see. It is interestingly better at picking segmentations under the objective I define than exhaustive search is. (That is, it avoids hypotheses that are wrong but more attractive to the objective than the right one.) Sometimes I think of algorithmic-level models as something like the "Rifleman's creed": "This is my algorithm. There are many like it, but this one is mine." Ultimately, at the moment we don't know enough about the realtime processing characteristics of word segmentation to know more about what the algorithm should be, but I think the flavor of the algorithm proposed is right, regardless of whether all the details we cannot evaluate are.
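      (Very roughly, the behavior described above looks something like the sketch below. It is a simplification, not the actual implementation - it only considers a split at the start of the utterance and uses the geometric mean of word scores as a stand-in for the real objective; the real pseudocode is in the CoNLL paper.)

import math

def greedy_continue(utt, i, lexicon, words):
    """Continue greedily from syllable position i in utt (a tuple of syllables):
    at each point, subtract the highest-scoring matching lexicon entry,
    or carry unmatched syllables along as a new chunk."""
    leftover = []
    while i < len(utt):
        matches = [w for w in lexicon if utt[i:i + len(w)] == w]
        if matches:
            if leftover:
                words.append(tuple(leftover))
                leftover = []
            best = max(matches, key=lambda w: lexicon[w])
            words.append(best)
            i += len(best)
        else:
            leftover.append(utt[i])
            i += 1
    if leftover:
        words.append(tuple(leftover))
    return words

def beam2_segment(utt, lexicon):
    """lexicon: dict mapping a word (a tuple of syllables) to its score.
    Keep at most two hypotheses if more than one entry matches at the start of utt;
    each hypothesis then unfolds greedily on its own and the better-scoring one wins."""
    starts = sorted((w for w in lexicon if utt[:len(w)] == w),
                    key=lambda w: lexicon[w], reverse=True)[:2]
    if len(starts) < 2:
        return greedy_continue(utt, 0, lexicon, [])
    hypotheses = [greedy_continue(utt, len(w), lexicon, [w]) for w in starts]
    return max(hypotheses,
               key=lambda seg: math.prod(lexicon.get(w, 1e-6) for w in seg) ** (1.0 / len(seg)))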

      On a wider note that may be relevant to your discussion but is easy to misunderstand from that (short) paper, I do believe many other factors will play a role in a "complete" model of segmentation: there is something to be said about phonotactics, as there is for transitional probabilities, and similarly for word context (even if I don't believe GGJ offers any evidence relevant to what real learners actually use). What's interesting to me is starting with these minimal, cognitively plausible models and building them out from there. I think you and I are in good alignment there about the practice of modeling; for me the question isn't "how do we build the model that uses everyone's favorite cue at once?" (although I acknowledge that has a benefit for trying to get a paper published) but "what characteristics of a model are sufficient to explain the behavioral patterns we observe?".
