Computational Models of Language (at UC Irvine): Some thoughts on Lignos 2012

Monday, April 29, 2013

Some thoughts on Lignos 2012

I found the simplicity of the proposed algorithm in this paper very attractive (especially when compared to some of the more technically involved papers we've read that come from the machine learning literature). The goal of connecting to known experimental and developmental data of course warmed my cognitive modeler's heart, and I certainly sympathized with the aim of pushing the algorithm to be more cognitively plausible. I did think some of the criticisms of previous approaches were a touch harsh, given what's actually implemented here (more on this below), but that may be more of a subjective interpretation thing. I did find it curious that the evaluation metrics chosen were about word boundary identification, rather than about lexicon items (in particular, measuring boundary accuracy and word token accuracy, but not lexicon accuracy). Given the emphasis on building a quality lexicon (which seems absolutely right to me if we're talking about the goal of word segmentation), why not have lexicon item scores as well to get a sense of how good a lexicon this strategy can create?

Some more specific thoughts:

Section 2.1, discussing the 9-month-old English-learning infants who couldn't segment Italian words from transitional probabilities alone unless they had already been presented with words in isolation: Lignos is using this to argue against transitional probabilities as a useful metric at all, but isn't another way to interpret it simply that transitional probabilities (TPs) can't do it all on their own? That is, if you initialize a proto-lexicon with a few words, TPs would work alright - they just can't work right off the bat with no information. Relatedly, the discussion of the Shukla et al. 2011 (apparently 6-month-old) infants who couldn't use TPs unless they were aligned with a prosodic boundary made me think more that TPs are useful, just not useful in isolation. They need to be layered on top of some existing knowledge (however small that knowledge might be). But I think it just may be Lignos's stance that TPs aren't that useful - they seem to be left out as something a model of word segmentation should pay attention to in section 2.4.

Of course, I (and I'm assuming Lawrence as well, given Phillips & Pearl 2012) was completely sympathetic to the criticism in section 2.3 about how phonemes aren't the right unit of perception for the initial stages of word segmentation. They may be quite appropriate if you're talking about 10-month-olds, though - of course, at that point, infants probably have a much better proto-lexicon, not to mention other cues (e.g., word stress). I was a little less clear about the criticism (of Johnson & Goldwater) about using collocations as a level of representation. Even though this doesn't necessarily connect to adult knowledge of grammatical categories and phrases, there doesn't seem anything inherently wrong with assuming infants initially learn chunks that span categories and phrases, like "thatsa" or "couldI". They would have to fix them later, but that doesn't seem unreasonable.

One nice aspect of the Lignos strategy is that it's incremental, rather than a batch algorithm. However, I think it's more a modeling decision rather than an empirical fact to not allow memory of recent utterances to affect the segmentation of the current utterance (section 3 Intro). It may well turn out to be right, but it's not obviously true at this point that this is how kids are constrained. On a related note, the implementation of considering multiple segmentations seems a bit more memory-intensive, so what's the principled reason for allowing memory for that but not allowing memory for recent utterances? Conceptually, I understand the motivation for wanting to explore multiple segmentations (and I think it's a good idea - I'm actually not sure why the algorithm here is limited to 2) - I'm just not sure it's quite fair to criticize other models for essentially allowing more memory for one thing when the model here allows more memory for another.

I was a little confused about how the greedy subtractive segmentation worked in section 3.2. At first, I thought it was an incremental greedy thing - so if your utterance was "syl1 syl2 syl3", you would start with "syl1" and see if that's in your lexicon; if not, try "syl1 syl2", and so on. But this wouldn't run into ambiguity then: "...whenever multiple words in the lexicon could be subtracted from an utterance, the entry with the highest score will be deterministically used". So something else must be meant. Later on when the beam search is described, it makes sense that there would be ambiguity - but I thought ambiguity was supposed to be present even without multiple hypotheses being considered.

The "Trust" feature described in 3.3 seemed like an extra type of knowledge that might be more easily integrated into the existing counts, rather than added on as an additional binary feature. I get that the idea was to basically use it to select the subset of words to add to the lexicon, but couldn't a more gradient version of this implemented, where the count for words at utterance boundaries gets increased by 1, while the count for words that are internal gets increased by less than 1? I guess you could make an argument either way about which approach is more naturally intuitive (i.e., just ignore words not at utterance boundaries vs. be less confident about words not at utterance boundaries).

I think footnote 7 is probably the first argument I'm seen in favor of using orthographic words as the target state, instead of an apology for not having prosodic words as the target state. I appreciate the viewpoint, but I'm not quite convinced that prosodic words wouldn't be useful as proto-lexicon items (ex: "thatsa" and "couldI" come to mind). Of course, these would have to be segmented further eventually, but they're probably not completely destructive to have in the proto-lexicon (and do feel more intuitively plausible as an infant's target state).

In Table 1, it seems like we see a good example of why precision and recall may be better than hit (H) rate and false alarm (FA) rate: The Syllable learner (which puts a boundary at every syllable) clearly oversegments and does not achieve the target state, but you would never know that from the H and FA scores. Do we get additional information from H & FA that we don't get from precision and recall? (I guess it would have to be mostly from the FA rate, since H = recall?)

I thought seeing the error analyses in Tables 2 and 3 was helpful, though I was a little surprised Table 3 didn't show the breakdown between undersegmentation and oversegmentation errors, in addition to the breakdown between function and content words. (Or maybe I just would have liked to have seen that, given the claim that early errors should mostly be undersegmentations. We see plenty of function words as errors, but how many of them are already oversegmentations?)

15 comments:

LA DenizenApril 29, 2013 at 1:54 PM
I also found the lit review a bit harsh. I felt a general tone of, "I am following the infant acquisition literature, but no one else does". At best, this is a failure to appreciate the intent behind some other research.

-----------------------------------
Comment 1: SYLLABLES VS. ALLOPHONES VS. MIS-REPRESENTING YOUR COMPETITORS' POSITIONS
-----------------------------------

For example, the paper attributed a phoneme-level representation to my own paper, whereas we spent over a page explaining the difference between phonemes and allophones, and arguing that an allophonic-level representation was more likely to be appropriate for infants. We quite explicitly made the point that we were using an allophonic transcription, not a phonemic one as Lignos claimed.

Lignos' lit review seemed to regard this as inexcusably ignoring the acquisition data, whereas in my own opinion it is the position which is the most *faithful* to the acquisition data.
To expand upon this point, the Lignos paper stated uncritically that the syllable is the unit of perception for infants. It must be acknowledged that many people believe this, but it also must be acknowledged that there is no incontrovertible proof. Lignos cited 3 papers which adopt this interpretation, and whose results are consistent with it. However, none of them is knock-down proof.
For example, the Bertoncini and Mehler results cannot be explained unless we accept that infants know the difference between 2 and 3 vowels (and do not in all cases care about the difference between two and 3 consonants). This does not automatically imply that they are assigning syllabic mental representations, or that they are failing to regard onset consonants as distinct acoustic events. You could as easily get the same result by counting overt vowels and disregarding coda consonants, which are known to be perceptually difficult even for adults. The three papers which are cited *argue* that the syllable is a perceptual unit for infants, but they do not prove it.
Moreover, the theoretical status of syllables is unclear. Personally, I find Steriade's licensing-by-cues theory a much more satisfactory account altogether. In this theory, syllables are an epiphenomenon of speakers trying to produce each vowel as a mini-word. It correctly accounts for the ambiguity in items like ?PA.STA/?PAS.TA, and the contrast between DE.MON/*DEM.ON vs. ?LE.MON/?LEM.ON. It correctly accounts for why assimilation processes are generally regressive, which the syllabification theory does also, but unlike the syllabification theory, licensing-by-cue correctly accounts for the fact that retroflexion assimilation is progressive.

Yet another point. It is simply a fact that ***native speakers of different languages syllabify the same string in different ways***. It follows straightforwardly from this fact that ***syllabification must be learned***. The type of phonotactic knowledge that you would need to learn syllabification is the kind that infants seem to be just coming into at 9 months. Therefore, it seems quite strange to me to assert that syllabification is the primitive unit of perception at 6 months. How could infants have syllables as a unit when they haven't yet learned the phonotactic properties of their language that tell them how to segment strings into syllables??
Yet another point. It is possible in English and in many other languages for people to make up new syllables that are grammatical. For example, ZILF. How can this be a primitive unit of perception when we have never heard it before?? It seems far more sensible to me that we attempt to interpret this novel acoustic sequence combinatorially, as a sequence of familiar segments. I do not really understand how the "syllable as primitive unit" can account for the perception of novel syllable types.
ReplyDelete
Replies
LA DenizenApril 29, 2013 at 1:55 PM
-------------------------------------------
Comment 2: BATCH VS. INCREMENTAL IS A RED HERRING
-------------------------------------------
Along the lines of following the infant literature, the Lignos review disses numerous other papers by stating that they are not incremental. There are several problems with this.
First, Eleanor Batchelder's BOOTLEX (Cognition, 2001 I believe) was incremental. This paper was not reviewed. At best, Lignos missed this reference and stated something factually inaccurate as a result.
Second, many of the other papers reviewed are *executed* as batch learners for efficiency, but are implementable as incremental learners. For example, the Daland & Pierrehumbert paper did incremental learning, in batches of 1 day. Heinz and colleagues have implemented an incremental version of the Goldwater paper. Lignos dismisses these papers as non-incremental, but a careful read shows that they are equivalent to incremental models. For example, my own model makes a boundary decision on the segment after the potential boundary. That is about as non-batch as you can get.
I agree with Lisa that the paper seems to be mixing up the implementation with the algorithm specification. I would add that the error in this case seems to be especially self-serving.

----------------------------------------------
Comment 3: COLLOCATIONS
----------------------------------------------
The Lignos paper ignores two really nice pieces of evidence that infants exploit collocational information for segmentation.
To be clear, the equation that Lignos gives is fully equivalent to probability maximization of a unigram word frequency model. Unigram = assuming statistical independence between words.
The first piece of evidence is Goldwater's work -- in her paper with Griffiths and Johnson she shows that language exhibits strong collocational tendencies, and that infants would actually do *better* if they were to exploit this. Of course, this does not show that infants *do* use collocational information, but it certainly suggests that ignoring it would be a bad design decision.
Second, and even more compellingly, there is infant research that suggests that infants do exploit collocations. For example, Mintz's frequent frames experiments suggest that infants only a few months older assign proto-syntactic categories to words based on collocation with function words.

----------------------------------------------
COMMENT 4: Unclarity of the beam search
----------------------------------------------
There were not enough technical details about how the beam search worked for another person to replicate it. This is especially unfortunate, since the beam search was the primary original contribution of this paper.
The issue of how infants recover from under-/over- segmentation errors is under-researched, no doubt in part because (contra the paper's confident assertions to the primary) we do not actually know very much about what segmentation errors young infants make.
It really would have been nice if this paper provided more details about how the beam search worked, for example a fully worked-out example of how "isthat" is initially under-segmented and then correctly segmented.

-------------------------------------------
Comment 5: Word learning
-------------------------------------------
Even adults do not learn most wordforms in a single shot. In the child and adult studies I have seen, the probability of correctly learning a wordform after one exposure seems to be about 20%, while the probability of correctly learning a wordform after 7 exposures in a single session seems to be about 80%.
Our understanding of the memoric processes involved in word learning is highly imperfect. As modelers we have to do something. But I find it highly unsatisfactory to simply add a form to the lexicon the first time it is encountered.
In fact, I do not think that word segmentation and word learning need to be modeled together, for this reason. We do not actually know that much about the wordform-learning process.
ReplyDelete
Replies
BrentApril 29, 2013 at 10:24 PM
I find it a bit surreal that this paper devoted more than a page to discussing the (de)merits of computational modeling while having less than half a page of discussion and conclusions. I think I'll join them in spirit and leave it at that!
ReplyDelete
Replies
Constantine LignosApril 30, 2013 at 7:22 AM
Hi Lisa,
It's great to see your thoughts here, and I hope your group had a lively discussion. Just a few minor technical comments below:

1. I'm not sure what you mean by hit rate and false alarm rate not giving you information about what the syllable baseline is doing when it is inserting a boundary at every possible location. Hit rate and false alarm rate make it clear that it labels every word boundary as a word boundary (same as recall = 1.0) and labels every word-internal boundary as a word boundary too. I find this slightly more informative than providing precision (which will just tell you the balance of classes, what % of your data is made up of word boundaries), but the real point of the discrimination metrics is that learner "knows" nothing as HR = 1.0 and FA = 1.0 implies zero discrimination (A-prime is undefined, but effectively nil).

2. If anything's unclear about the algorithm, the cited CoNLL paper I did the previous year has all of the details and pseudocode. I wish more could have been fit in that paper! I do agree that a gradient or probabilistic version of Trust would be interesting. It's hard to imagine what it would look like without overfitting the data we have.

3. I didn't have any room in that paper to talk about why the lexicon metric is strange for online learners evaluated in the way described in that paper: no training before evaluation begins but allows learning during evaluation. I'm adding additional evaluations in the thesis which add on that metric in a way that makes for reasonable comparison between batch and online learners. I'll send you over that thesis chapter when I have a chance.
ReplyDelete
Replies

Add comment

Computational Models of Language (at UC Irvine)

Monday, April 29, 2013

Some thoughts on Lignos 2012

15 comments:

People who think this blog is awesome

Members