Monday, May 29, 2017

Thoughts on Meylan et al. 2017 in press

I really like how M&al2017 have mathematically cashed out the competing hypotheses (and given the intuitions of how Bayesian data analysis works — nice!). But something that I don’t quite understand is this: The way the model is implemented, isn’t it a test for Noun-hood rather than Determiner-hood (more on this below)? There’s nothing wrong with testing Noun-hood, of course, but all the previous debates involving analysis of Determiners and Nouns have been arguing over Determiner-hood, as far as I understand.

Something that also really struck me when trying to connect these results to the generativist vs. constructivist debate: How early is earlier than expected for abstract category knowledge to develop? This seems like the important thing to know if you’re trying to interpret the results for/against either perspective (more on this below too).

Specific thoughts:

(1) How early is earlier than expected?
(a) In the abstract, we already see the generativist perspective pitched as “the presence of syntactic abstractions” in children’s “early language”. So how early is early?  Are the generativists really all that unhappy with the results we end up seeing here where a rapid increase in noun-class homogeneity starts happening around 24 months? Interestingly, this timing correlates nicely with what Alandi, Sue, and I have been finding using overlap-based metrics for assessing categories in VPs like negation and auxiliary (presence of these in adult-ish form between 20 and 24 months for our sample child).

(b) Just looking at Figure 3 with the split-half data treatment, it doesn’t look like there’s much of an increase in noun-class-ness (productivity) in this age range. Interestingly, the values for several kids seem to go down (though I know this isn’t true if we’re using the 99.9% cutoff criterion). Which team is happy with these results, if we squint and just use the visualizations as proxies for the qualitative trends? While the generativists would be happy with no change, they’d also be surprised by negative changes for some of these kids. The constructivists wouldn’t be surprised by those (they can chalk it up to “still learning”), but then they’d expect more non-zero change, I think.

(c) The overregularization hypothesis is how M&al2017 explain the positive changes in younger kids and the negative changes for older kids. In particular, they say older kids have really nailed the NP → Det N rule, and so use more determiner noun combinations that are rare for adults. So, in the terms of the model, what would be happening is that more nouns get their determiner preferences skewed towards 0.5 than really ought to be, I think. If that happens, then shouldn’t the distribution be more peaked around 0.5 in Figure 1? If so, that would lead to higher values of v. So wouldn’t we expect even higher v values (i.e., a really big increase) if this is what’s going on, rather than a decrease to lower v values? Maybe the idea is that the peak in v is happening because of overregularization, and then the decrease is when kids settle back down. That is, adult-like knowledge of a noun category existing is when we get the peak v value (which may in fact be higher than actual adult values). Judging from Figure 4, it looks like this is happening between 2 and 3. Which is pretty young. Which would make generativists happy, I think? So the conclusion that “these results are broadly consistent with constructivist hypotheses” is somewhat surprising to me. I guess it all comes back to how early is earlier than expected.

(d) Sliding-window results: If we continue with this idea that a peak in v value is the indication kids have hit adult-like awareness of a category (and may be overregularizing), what we should be looking for is when that first big peak happens (or maybe big drop after a peak). Judging from the beautifully rainbow-colored Figure 5, it looks like this happens pretty early on for a bunch of these kids (Speechome, Eve, Lily, Nina, Aran, Thomas, Alex, Adam, Sarah).  So the real question again: how early is earlier than expected? (I feel like this is exactly the question that pops up for standard induction problem/Poverty of the Stimulus arguments.)

(2) Modeling:  
(a) I like how their model involves both categories, rather than just Determiner. This is exactly what Alandi and I realized when we started digging into the assumptions behind the different quantitative metrics we examined.

(b) I also like that M&al2017 explicitly do a side-by-side model comparison (direct experience = memorized amalgam from the input vs. productive inference from abstracted knowledge). Bayesian data analysis is definitely suited for this, and then you can compare which representational hypothesis best fits over different developmental stages for a given child. Bonus for this modeling approach: the ability to estimate confidence intervals.

We can see this in the model parameters, too: n = impact of input (described as ability to “match the variability in her input” = memorized amalgams), v = application of knowledge across all nouns (novel productions = “produce determiner-noun pairs for which she has not received sufficient evidence from caregiver input” = abstract category). It’s really nice to see the two endpoint hypotheses cashed out mathematically.

(c) Testing noun-hood: If I understand this correctly, each noun has a determiner preference (0 = all “a”, 1 = all “the”, 0.5 = half each). Cross-noun variability is then drawn from an underlying common noun distribution if all nouns are from the same class (testing whether “nouns behave in a more class-like fashion”). So, this seems like testing for Noun-hood based on Determiner usage, which I quite like. But it’s interesting that M&al2017 describe this as testing “generalization of determiner use across nouns”, which makes it seem like they’re testing for Determiner instead. I would think that if they want to test for Determiner, they’d swap which one they’re testing classhood for (i.e., have determiners with a noun preference, and look for individual determiner values to all be drawn from the same underlying Determiner distribution).
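
Here's a minimal sketch of the noun-preference setup as I read it. This is my own toy illustration and not M&al2017's code. It assumes the shared noun distribution is a Beta distribution parameterized by a mean (mu) and a concentration (v). The function name and the specific numbers are hypothetical. The point is only to show the intuition: a higher v means the individual noun preferences cluster more tightly around the shared mean, so the nouns look more class-like.

```python
import random
import statistics


def sample_noun_preferences(mu, v, n_nouns, seed=0):
    """Draw one determiner preference per noun (0 = all "a", 1 = all "the").

    Assumption: preferences come from Beta(mu * v, (1 - mu) * v), so mu is
    the shared mean and v controls how tightly nouns cluster around it.
    """
    rng = random.Random(seed)
    alpha, beta = mu * v, (1.0 - mu) * v
    return [rng.betavariate(alpha, beta) for _ in range(n_nouns)]


# Low v: nouns behave idiosyncratically (many preferences near 0 or 1).
low_v_prefs = sample_noun_preferences(mu=0.5, v=0.5, n_nouns=2000)
# High v: nouns behave alike (preferences bunch up around 0.5).
high_v_prefs = sample_noun_preferences(mu=0.5, v=20.0, n_nouns=2000)

spread_low = statistics.pvariance(low_v_prefs)
spread_high = statistics.pvariance(high_v_prefs)
print(f"variance of preferences, low v:  {spread_low:.3f}")
print(f"variance of preferences, high v: {spread_high:.3f}")
```

Under this parameterization, the across-noun variance shrinks as v grows, which is why a distribution that is more peaked around 0.5 corresponds to a higher v.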

(3) Critiquing previous approaches involving the overlap score: 
M&al2017 say that overlap might increase simply because children heard more determiner+noun pairs in the input (i.e., it’s due to memorization, and not abstraction). I’m not sure I follow this critique, though — I’m more familiar with Yang’s metrics, of course, and those do indeed take a snapshot of whether the current data are compatible with fully productive categories vs. memorized amalgams from the input. The memorized amalgam assessment seems like it would indeed capture whether children’s output is compatible with memorized amalgams (i.e., more determiner+noun pairs in their input). 
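
To make the sample-size worry concrete, here's a small sketch of a basic overlap score: the proportion of nouns that occur with both "a" and "the", out of all nouns that occur with at least one of them. The function name and the toy determiner+noun pairs are made up for illustration, and this is the simple overlap measure, not Yang's expected-overlap calculation.

```python
from collections import defaultdict


def overlap_score(pairs, determiners=("a", "the")):
    """Proportion of nouns seen with every determiner in `determiners`."""
    seen = defaultdict(set)
    for det, noun in pairs:
        if det in determiners:
            seen[noun].add(det)
    if not seen:
        return 0.0
    with_both = sum(1 for dets in seen.values() if len(dets) == len(determiners))
    return with_both / len(seen)


# Hypothetical early sample: "moon" and "bath" each appear with one determiner.
early_pairs = [
    ("a", "dog"), ("the", "dog"),
    ("the", "moon"),
    ("a", "cookie"), ("the", "cookie"),
    ("a", "bath"),
]
# Same child, more data: one extra token is enough to raise the score.
later_pairs = early_pairs + [("a", "moon")]

early_overlap = overlap_score(early_pairs)
later_overlap = overlap_score(later_pairs)
print(early_overlap, later_overlap)
```

The score goes up simply because more tokens were observed, with no change in the underlying knowledge, which is the kind of confound the critique seems to be pointing at.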

(4) Data extraction (from appendix): 
(a) I like that they checked whether the nouns should be collapsed together or if instead morphological variants should be treated separately (e.g., dog/dogs as one or two nouns). In most analyses I’ve seen, these would be treated as two separate nouns. 
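
A quick toy sketch of why this choice matters for anything overlap-like. The lemma table and the pairs below are hypothetical, and this is not the extraction procedure from the appendix. It just shows that collapsing dog/dogs can change both how many noun types there are and which determiners each one is seen with.

```python
from collections import defaultdict

# Hypothetical lemma table; a real analysis would need a proper lemmatizer
# or the corpus's own morphological annotation.
LEMMAS = {"dogs": "dog", "cookies": "cookie"}


def determiners_by_noun(pairs, collapse_variants):
    """Map each noun (optionally lemmatized) to the determiners seen with it."""
    seen = defaultdict(set)
    for det, noun in pairs:
        key = LEMMAS.get(noun, noun) if collapse_variants else noun
        seen[key].add(det)
    return dict(seen)


toy_pairs = [("a", "dog"), ("the", "dogs"), ("a", "cookie"), ("the", "cookie")]

separate = determiners_by_noun(toy_pairs, collapse_variants=False)
collapsed = determiners_by_noun(toy_pairs, collapse_variants=True)
print(separate)   # "dog" and "dogs" are distinct noun types
print(collapsed)  # "dog" now counts as seen with both determiners
```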

(b) Also, the supplementary material really highlights the interesting splits in PNAS articles, where all the stuff you’d want to know to actually replicate the work isn’t in the main text. 

(c) Also, yay for github code! (Thanks, M&al2017 — this is excellent research practice.)

(5) M&al2017 highlight the need for dense naturalistic corpora in their discussion. I feel like this is an awesome advertisement for the Braunwald corpus. Seriously. It may not have as much child-directed speech as Speechome, but it has a wealth of longitudinal child-produced data. (Our sample from 20 to 24 months has 2154 child-produced VPs, for example, which doesn’t sound too bad when compared to Speechome’s 4300 NPs.)
