Some more targeted thoughts:
- On p.5, Yu mentions how the learning problem can effectively be thought of as a poverty of the stimulus (PoS) problem, because the learner has to go from a finite set of data points to generalizations that cover infinite sets. I do get this, but it does seem like this might be an easier generalization to make (in this particular case) than some of the problems that are traditionally held up as poverty of the stimulus (say, in syntax). This is because the acoustic data points available might be best fit by a generalization that's close enough to the truth - not every data point appears, but enough appear, and they are sufficiently spread out. On the other hand, a harder PoS problem would be one where the data points that appear are most compatible with a generalization that is in fact the wrong one (here, for example, if the proper ellipse were actually much bigger than the observed data suggest, and only extended along a particular dimension).
- On p.6, in footnote 1, we can really see the differences in approach to (morpho)syntax taken by linguistics vs. computational linguistics. I believe it's standard in computational linguistics to assume a probability distribution over whatever units you're working with, which has to map them to real values, while in linguistics it's more standard to assume a categorical (discrete) approach. (Though of course there are linguists who adopt a probabilistic approach by default - I just think they're not in the majority in generative linguistic circles.)
- On p.12, where Yu notes that there are distinctions between adult-directed and child-directed speech, and justifies the decision to use adult-directed speech: While I can certainly understand the practical motivations for doing this, it would be really good to know how different adult-directed speech is from child-directed speech, particularly for the acoustic properties that Yu is interested in with respect to tone. I have the (possibly mistaken) impression that there might be quite significant differences.
I definitely enjoyed this paper. I'm very interested in the idea of using raw acoustic data to model language acquisition and this definitely plays into that.
Pitch/Tones have had such a sad history in linguistics. They've essentially been ignored wherever possible because they're extremely messy. Tones influence nearby tones, you have to take into account both relative pitch and pitch velocity, and on top of it all most linguists don't speak a tonal language natively so it's incredibly hard to hear the acoustic differences you're trying to describe. So it's refreshing to see a paper on the computational theory of tone acquisition.
One nit-pick about the model: I'm not so impressed by its performance. It seems like the model's performance peaks at about 67% accuracy in tone identification (Fig 5). How does this compare to human performance? Speakers of tone languages have a lot of difficulty with tones in isolation, but I'm not so sure their accuracy is as low as 67%. Perhaps part of the issue is the interplay between tones that occurs in natural language, I'm not sure.
What I'd really like to see now are some ideas about how children might be acquiring tonal information. I want an algorithmic answer. Obviously they can do some rough sampling to get the job done, and should be paying attention to pitch and pitch-velocity. But I wonder if we could come up with an explicit model of something like word learning that took those things into account and see how it does.
So my outline:
1) Get an annotated tonal language corpus
2) Using Praat, grab the pitch for each word in the corpus.
3) For a model that takes n samples per word, simply input the corpus with the phonemes and n pitches.
4) Add each word to the lexicon and attach those specific pitch values.
5) Then the goal of the model would be to determine for each word what tonal class it belongs to.
Something like this should be pretty doable although this is obviously a very rough sketch. But this seems like the next step to me for sure.
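To make the outline a bit more concrete, here is a very rough Python sketch of steps 3-5. Lots of assumptions here: the pitch tracks are taken to be already extracted (say, exported from Praat) as plain lists of Hz values, the toy corpus values are invented for illustration, the number of tone classes is handed to the model, and plain k-means is just a stand-in for whatever learning algorithm one would actually want to argue children use. All function and variable names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One word in the lexicon, with the pitch samples of every token seen."""
    phonemes: str
    contours: list = field(default_factory=list)


def resample(track, n):
    """Step 3: reduce a pitch track (Hz values) to n evenly spaced samples."""
    if n == 1:
        return [track[len(track) // 2]]
    step = (len(track) - 1) / (n - 1)
    return [track[round(i * step)] for i in range(n)]


def features(contour, baseline):
    """Relative pitch (offset from a baseline) plus pitch velocity."""
    relative = [p - baseline for p in contour]
    velocity = [b - a for a, b in zip(contour, contour[1:])]
    return relative + velocity


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def learn_tone_classes(corpus, n_samples, n_tones):
    """Steps 3-5: build the lexicon, then give each word a tone class."""
    lexicon, tokens = {}, []
    for word_id, phonemes, track in corpus:
        entry = lexicon.setdefault(word_id, LexicalEntry(phonemes))
        contour = resample(track, n_samples)
        entry.contours.append(contour)          # step 4: attach pitch values
        tokens.append((word_id, contour))
    # Assumption: the corpus-wide mean pitch stands in for speaker normalisation.
    all_pitches = [p for _, c in tokens for p in c]
    baseline = sum(all_pitches) / len(all_pitches)
    labels = kmeans([features(c, baseline) for _, c in tokens], n_tones)
    votes = {}
    for (word_id, _), label in zip(tokens, labels):
        votes.setdefault(word_id, []).append(label)
    # Step 5: each word takes the majority class of its tokens.
    return lexicon, {w: max(set(ls), key=ls.count) for w, ls in votes.items()}


# Invented pitch tracks (Hz), loosely modelled on Mandarin "ma"/"ba" words.
TOY_CORPUS = [
    ("mother", "ma", [220, 221, 220, 219, 220, 221, 220]),  # high level
    ("mother", "ma", [224, 223, 224, 225, 224, 223, 224]),
    ("hemp",   "ma", [170, 178, 186, 195, 204, 212, 220]),  # rising
    ("hemp",   "ma", [165, 174, 183, 192, 200, 209, 218]),
    ("scold",  "ma", [230, 217, 204, 190, 177, 164, 150]),  # falling
    ("scold",  "ma", [235, 221, 208, 195, 181, 168, 155]),
    ("eight",  "ba", [218, 219, 218, 217, 218, 219, 218]),  # high level
    ("father", "ba", [228, 215, 202, 189, 176, 163, 150]),  # falling
]

lexicon, tone_of = learn_tone_classes(TOY_CORPUS, n_samples=3, n_tones=3)
```

On this toy data the words with the same contour shape end up in the same class, regardless of their phonemes. The class numbers themselves are arbitrary cluster indices, so the interesting evaluation on a real corpus would be how well the induced classes line up with the annotated tones as n varies.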