It’s really interesting to see someone coming at language development from a very different perspective (here: statistical physics). Different terminology means different ways of talking about the same ideas, and this highlighted for me how comfortable I’ve become with my own terminology and how foreign the same ideas can seem when someone describes them in other terms (see comments on long-range dependency below).
Specific thoughts:
(1) Implications for language development
(a) I don’t find it all that surprising that early child productions have these long-range correlation properties. This may be because of my naive understanding of power-law relationships, but basically, power-law relationships aren’t a language-specific thing, so why shouldn’t they appear in early child productions too? It made me smile, though, to see this author then use the existence of long-range correlation as an argument for an “innate mechanism of the human language faculty”. I didn’t really see that thought cashed out later though, and maybe that’s for the best.
(b) In the discussion section, the author says “This would require more exhaustive knowledge of long-range memory in natural language, and the model would have to integrate more complex schemes that possibly introduce n-grams or grammar models.” This made me smile, too. You mean we might need syntactic structure to explain language development? Couldn’t be.
(2) Equation 1, which is correlation at a distance s: I think it’s worth thinking about the intuition here. It captures the similarity of two subsequences a distance s apart, measured by how their deviations from the mean value line up. Interpretation for word frequency: a word whose frequency differs from the mean by some amount tends to be followed, s words later, by a word whose frequency differs from the mean in the same direction. So long-range correlation means that this correlation itself falls off as a power law in s. That is, it’s a power law in time for word usage by frequency, not just in overall frequency irrespective of time.
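To make that intuition concrete for myself, here is a minimal sketch. It assumes Equation 1 is the standard autocorrelation function (the covariance between the sequence and itself shifted by s, divided by the variance). The function name and the toy input are mine, not the author’s, and the paper’s exact normalization may differ.

```python
def autocorrelation(x, s):
    """Autocorrelation of the numeric sequence x at lag s.

    Assumed form (standard definition, not copied from the paper):
    C(s) = mean over i of (x[i] - mu) * (x[i + s] - mu), divided by the variance.
    """
    n = len(x)
    if not 0 <= s < n:
        raise ValueError("lag s must satisfy 0 <= s < len(x)")
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    if var == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    # Average the product of deviations over all pairs that are s apart.
    cov = sum((x[i] - mu) * (x[i + s] - mu) for i in range(n - s)) / (n - s)
    return cov / var


# Toy sequence that flips sign at every step: neighbours always deviate
# from the mean in opposite directions, elements two apart in the same one.
toy = [1, -1] * 4
lag_values = [autocorrelation(toy, s) for s in range(3)]
```

Positive values at lag s mean that above-average values tend to be followed, s steps later, by above-average values. The long-range claim is then about how slowly these values shrink as s grows.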
(3) Working with kid data
(a) The author talks about the analysis of one child’s utterances and how things are still under development, but the analysis is effectively over word use in sequences, so it’s not clear how complex the syntactic and semantic knowledge needs to be for this to occur. That is, it’s not surprising that a swath of data between ages two and five shows this relationship. More interesting would have been this analysis at two vs. three vs. four. Later in the paper, the author says “In early childhood speech, utterances are still lacking in full vocabulary, ungrammatical, and full of mistakes. Therefore, the long-range correlation of such speech must be based on a simple mechanism other than linguistic features such as grammar that we generally consider.” This comes back to assumptions about what knowledge develops at what age. “Ungrammatical” isn’t very accurate, especially when we’re talking about four- and five-year-olds.
(b) I love seeing the author leverage cross-linguistic data, but how old were these kids? Age matters a bunch. And how many words were in these datasets?
(4) Understanding the different generative models
(a) The Simon model is described as “the rich get richer”, which seems like the intuition for the Chinese Restaurant Process (CRP). I definitely understand that this is uniform sampling from previous elements (in time, this means sampling from the past), plus a little for a new element. Except then the Pitman-Yor can reduce to a CRP when a is 0, and Pitman-Yor is meant to be different from Simon. Based on Figure 10, there’s clearly a major difference (the autocorrelation isn’t there for Pitman-Yor), but the intuition of what’s different is hard to grasp.
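To pin down where the two processes differ, here is a minimal sketch of both generators. It assumes the textbook parametrizations (a constant new-word probability a for Simon; discount a and concentration b for Pitman-Yor), which may not match the paper’s exact setup. The function names are hypothetical.

```python
import random


def simon_sequence(n, a, rng):
    """Simon model: with constant probability a emit a brand-new word,
    otherwise copy an element chosen uniformly from the sequence so far."""
    seq = [0]
    next_word = 1
    for _ in range(n - 1):
        if rng.random() < a:
            seq.append(next_word)
            next_word += 1
        else:
            # Uniform over past elements = proportional to each word's count.
            seq.append(rng.choice(seq))
    return seq


def pitman_yor_sequence(n, a, b, rng):
    """Pitman-Yor process with discount a and concentration b
    (assumed standard form; a = 0 gives the Chinese Restaurant Process)."""
    seq = [0]
    counts = [1]  # counts[k] = number of times word k has been used
    for t in range(1, n):
        k = len(counts)
        # New-word probability depends on t and shrinks as the sequence grows.
        p_new = (b + a * k) / (t + b)
        if rng.random() < p_new:
            counts.append(1)
            seq.append(k)
        else:
            # Existing word k is chosen with weight (count - a).
            weights = [c - a for c in counts]
            word = rng.choices(range(k), weights=weights)[0]
            counts[word] += 1
            seq.append(word)
    return seq


rng = random.Random(0)
simon_vocab = len(set(simon_sequence(5000, 0.1, rng)))
crp_vocab = len(set(pitman_yor_sequence(5000, 0.0, 1.0, rng)))
```

Written this way, the reuse step is the same in both when a is 0: an old word is picked in proportion to its count. What differs is the innovation step, which has a fixed probability in Simon but becomes b/(t + b) in the CRP and so dies off over time. Whether that alone accounts for the difference in Figure 10 is not something this sketch settles.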
(b) I’m not sure I understand the issue described here for the Simon model: “the vocabulary growth (proven to have exponent 1.0) is too fast”. Isn’t the Simon model meant to be about sequences in time? Or is the author referring to people who have tried to match child vocabulary development to a Simon model? Or maybe this refers to the left panel of the figures where sometimes we see divergence from a strict Zipf’s law?
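One possible reading of the quoted claim, sketched under the assumption that “vocabulary growth” means the number of distinct words V(t) in the first t elements of the generated sequence, and that the Simon model introduces a new word with a constant probability a at every step after the first (the symbols V and \beta are mine):

```latex
\mathbb{E}[V(t)] \;=\; 1 + a\,(t-1) \;\propto\; t^{1.0}
\qquad\text{versus}\qquad
V(t) \;\propto\; t^{\beta}, \quad \beta < 1 .
```

The left-hand side follows because each step adds a new word with the same probability regardless of history. The right-hand side is the sublinear growth usually reported for real text (Heaps’ law). On this reading the complaint is still about sequences in time: the model’s sequence accumulates new word types faster than real text does.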