Monday, May 29, 2017

Thoughts on Meylan et al. 2017 in press

I really like how M&al2017 have mathematically cashed out the competing hypotheses (and given the intuitions of how Bayesian data analysis works — nice!). But something that I don’t quite understand is this: The way the model is implemented, isn’t it a test for Noun-hood rather than Determiner-hood (more on this below)? There’s nothing wrong with testing Noun-hood, of course, but all the previous debates involving analysis of Determiners and Nouns have been arguing over Determiner-hood, as far as I understand.

Something that also really struck me when trying to connect these results to the generativist vs. constructivist debate: How early is earlier than expected for abstract category knowledge to develop? This seems like the important thing to know if you’re trying to interpret the results for/against either perspective (more on this below too).

Specific thoughts:

(1) How early is earlier than expected?
(a) In the abstract, we already see the generativist perspective pitched as “the presence of syntactic abstractions” in children’s “early language”. So how early is early? Are the generativists really all that unhappy with the results we end up seeing here, where a rapid increase in noun-class homogeneity starts happening around 24 months? Interestingly, this timing lines up nicely with what Alandi, Sue, and I have been finding using overlap-based metrics for assessing categories in VPs like negation and auxiliary (presence of these in adult-ish form between 20 and 24 months for our sample child).

(b) Just looking at Figure 3 with the split-half data treatment, it doesn’t look like there’s a lot of increase in noun-class-ness (productivity) in this age range. Interestingly, it seems like several children’s estimates actually go down (though I know this isn’t true if we’re using the 99.9% cutoff criterion). Which team is happy with these results, if we squint and just use the visualizations as proxies for the qualitative trends? While the generativists would be happy with no change, they’d also be surprised by negative changes for some of these kids. The constructivists wouldn’t be (they can chalk it up to “still learning”), but then they’d expect more non-zero change, I think.

(c) The overregularization hypothesis is how M&al2017 explain the positive changes in younger kids and the negative changes for older kids. In particular, they say older kids have really nailed the NP → Det N rule, and so use more determiner-noun combinations that are rare for adults. So, in the terms of the model, what would be happening is that more nouns get their determiner preferences skewed towards 0.5 than really ought to, I think. If that happens, then shouldn’t the distribution be more peaked around 0.5 in Figure 1? If so, that would lead to higher values of v. So wouldn’t we expect even higher v values (i.e., a really big increase) if this is what’s going on, rather than a decrease to lower v values? Maybe the idea is that the peak in v is happening because of overregularization, and then the decrease is when kids settle back down. That is, the peak v value (which may in fact be higher than actual adult values) is when adult-like knowledge of a noun category first shows up. Judging from Figure 4, it looks like this is happening between 2 and 3. Which is pretty young. Which would make generativists happy, I think? So the conclusion that “these results are broadly consistent with constructivist hypotheses” is somewhat surprising to me. I guess it all comes back to how early is earlier than expected.
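To convince myself of the direction of that prediction, here’s a minimal sketch (mine, not the paper’s actual model; a method-of-moments Beta fit stands in for the real inference over v): when per-noun determiner preferences pile up near 0.5, the implied concentration goes up, not down.

```python
# Minimal sketch (not the paper's model): if per-noun determiner preferences
# cluster tightly around 0.5, the implied Beta concentration (the analogue of
# v here) goes up. Method-of-moments fit as a stand-in for the real inference.
import numpy as np

rng = np.random.default_rng(0)

def implied_concentration(prefs):
    # For Beta(a, b): var = m(1 - m) / (a + b + 1), so a + b = m(1 - m) / var - 1
    m, var = prefs.mean(), prefs.var()
    return m * (1 - m) / var - 1

spread_out  = rng.beta(2, 2, size=200)    # preferences spread across [0, 1]
overregular = rng.beta(40, 40, size=200)  # preferences piled up near 0.5

print(implied_concentration(spread_out))   # roughly 4 (low concentration)
print(implied_concentration(overregular))  # roughly 80 (high concentration)
```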

(d) Sliding-window results: If we continue with this idea that a peak in v value is the indication that kids have hit adult-like awareness of a category (and may be overregularizing), what we should be looking for is when that first big peak happens (or maybe the big drop after a peak). Judging from the beautifully rainbow-colored Figure 5, it looks like this happens pretty early on for a bunch of these kids (Speechome, Eve, Lily, Nina, Aran, Thomas, Alex, Adam, Sarah). So the real question again: how early is earlier than expected? (I feel like this is exactly the question that pops up for standard induction problem/Poverty of the Stimulus arguments.)
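If I were hunting for that moment in the sliding-window trajectories, the simplest version is a first-local-peak check over per-window estimates. The numbers below are made up for illustration, not taken from Figure 5:

```python
# Sketch: find the first local peak in a sequence of per-window v estimates.
# The values below are hypothetical, just to illustrate the idea.
def first_peak(values):
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            return i
    return None

v_by_window = [2.1, 2.4, 3.0, 5.2, 4.1, 3.8, 4.0]  # hypothetical estimates
print(first_peak(v_by_window))                      # -> 3 (the 5.2 window)
```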


(2) Modeling:  
(a) I like how their model involves both categories, rather than just Determiner. This is exactly what Alandi and I realized when we started digging into the assumptions behind the different quantitative metrics we examined.

(b) I also like that M&al2017 explicitly do a side-by-side model comparison (direct experience = memorized amalgam from the input vs. productive inference from abstracted knowledge). Bayesian data analysis is definitely suited for this, and then you can compare which representational hypothesis best fits over different developmental stages for a given child. Bonus for this modeling approach: the ability to estimate confidence intervals.

We can see this in the model parameters, too: n = impact of input (described as ability to “match the variability in her input” = memorized amalgams), v = application of knowledge across all nouns (novel productions = “produce determiner-noun pairs for which she has not received sufficient evidence from caregiver input” = abstract category). It’s really nice to see the two endpoint hypotheses cashed out mathematically.

(c) Testing noun-hood: If I understand this correctly, each noun has a determiner preference (0 = all “a”, 1 = all “the”, 0.5 = half each). Cross-noun variability is then drawn from an underlying common noun distribution if all nouns are from the same class (testing whether “nouns behave in a more class-like fashion”). So, this seems like testing for Noun-hood based on Determiner usage, which I quite like. But it’s interesting that M&al2017 describe this as testing “generalization of determiner use across nouns”, which makes it seem like they’re testing for Determiner instead. I would think that if they wanted to test for Determiner, they’d swap which category they’re testing classhood for (i.e., have determiners with a noun preference, and check whether the individual determiner values are all drawn from the same underlying Determiner distribution).
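Here’s a sketch of that generative structure as I read it (mine, not the authors’ code; all the numbers are invented): one class-level distribution over determiner preference, one preference drawn per noun, and then each noun’s observed “a”/“the” tokens come from that noun’s preference.

```python
# Sketch of the generative structure as I read it (not the authors' code):
# a class-level Beta over determiner preference, one draw per noun, then each
# noun's "the" (vs. "a") counts come from a Binomial with that preference.
import numpy as np

rng = np.random.default_rng(1)

mu, v = 0.5, 10.0                    # class-level mean preference and concentration
alpha, beta = mu * v, (1 - mu) * v

n_nouns = 20
tokens_per_noun = rng.integers(5, 50, size=n_nouns)  # det+noun tokens per noun

theta = rng.beta(alpha, beta, size=n_nouns)          # each noun's "the" preference
the_counts = rng.binomial(tokens_per_noun, theta)    # observed "the" counts

for i in range(5):
    print(f"noun {i}: {the_counts[i]}/{tokens_per_noun[i]} tokens with 'the' "
          f"(latent preference {theta[i]:.2f})")
```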

(3) Critiquing previous approaches involving the overlap score: 
M&al2017 say that overlap might increase simply because children heard more determiner+noun pairs in the input (i.e., it’s due to memorization, and not abstraction). I’m not sure I follow this critique, though. I’m more familiar with Yang’s metrics, of course, and those do indeed take a snapshot of whether the current data are compatible with fully productive categories vs. memorized amalgams from the input. The memorized amalgam assessment seems like it would indeed capture whether children’s output is compatible with memorized amalgams (i.e., with simply having heard more determiner+noun pairs in the input).
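For concreteness, here’s roughly how I think of an overlap-style measure (my gloss, not Yang’s or M&al2017’s exact formula): of the nouns that occur with “a” or “the” at all, what fraction occur with both?

```python
# Rough overlap-style measure (my gloss, not anyone's exact formula):
# of the nouns occurring with "a" or "the", what fraction occur with both?
from collections import defaultdict

def overlap(det_noun_pairs):
    dets_by_noun = defaultdict(set)
    for det, noun in det_noun_pairs:
        dets_by_noun[noun].add(det)
    eligible = [n for n, dets in dets_by_noun.items() if dets & {"a", "the"}]
    both = [n for n in eligible if {"a", "the"} <= dets_by_noun[n]]
    return len(both) / len(eligible) if eligible else 0.0

# Toy child sample: "dog" occurs with both determiners, "ball" and "cat" with one each.
child = [("a", "dog"), ("the", "dog"), ("the", "ball"), ("a", "cat")]
print(overlap(child))  # 1 of 3 nouns -> 0.333...
```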

(4) Data extraction (from appendix): 
(a) I like that they checked whether the nouns should be collapsed together or if instead morphological variants should be treated separately (e.g., dog/dogs as one or two nouns). In most analyses I’ve seen, these would be treated as two separate nouns. 

(b) Also, the supplementary material really highlights the odd split you get in PNAS articles, where all the stuff you’d want to know to actually replicate the work isn’t in the main text.

(c) Also, yay for github code! (Thanks, M&al2017 — this is excellent research practice.)


(5) M&al2017 highlight the need for dense naturalistic corpora in their discussion. I feel like this is an awesome advertisement for the Branwauld corpus: http://ucispace.lib.uci.edu/handle/10575/11954. Seriously. It may not have as much child-directed speech as the Speechome corpus, but it has a wealth of longitudinal child-produced data. (Our sample from 20 to 24 months has 2154 child-produced VPs, for example, which doesn’t sound too bad when compared to Speechome’s 4300 NPs.)

Monday, May 15, 2017

Thoughts on Yang et al. 2017

I feel like Universal Grammar (UG) was better defined by the end of this exposition (thanks, Y&al2017!), but now I want to have a heart-to-heart about the difference between “hierarchy” and “combination”. Still, I appreciated this convenient synthesis of evidence from the generative grammar tradition, especially as it relates to the kind of considerations I have as an acquisition modeler. 

Specific thoughts:

(1) Hierarchy vs. combination:
Part 2.2: While I’m a fan of hierarchical structures being everywhere in language, I wasn’t sure how connected the newborn n-syllable tasks were to the point about hierarchy. Why does being sensitive to the number of vowels (“vowel centrality”) indicate there must be hierarchical structure? For example, what if newborns hadn’t inferred hierarchy yet, but were simply sensitive to the more acoustically salient cue of vowels — wouldn’t we see the same results, even if all they really perceived was something like V V for “baku” and “alprim”?

Similarly with the babbling examples: How do we know these are hierarchical (vs. say, linear) structures? 

Similarly with the prosodic contour distinctions for the 6- to 12-week-olds: We know they perceive the prosodic contours, but not that they recognize the words and phrases in these languages. (In fact, we assume they don’t — they haven’t really managed reliable speech segmentation yet.) So how does recognizing prosodic contour distinctions over acoustic units relate to the hierarchical structure Merge gives?

My main issue comes down to “combinatorial” vs. “hierarchical”. I think you can make combinations of things without those things being combined hierarchically. So these two terms don’t mean the same thing to me, which is why the evidence in section 2.2 doesn’t seem as compelling about hierarchy (though it is for combinations). Contrast this with the section 2.3 examples of syntactic development, where c-command definitely is about hierarchy.


(2) UG: Initially, UG is described as domain-specific principles of language knowledge, without specifying whether these are innate principles or not (and also seeming to focus on knowledge about language, rather than, say, knowledge about how to learn language (= learning mechanism)). But then, we see UG described as “internal constraints that hold across all linguistic structures”. Though this highlights the innate component, it now doesn’t seem to indicate that these constraints have to be just about language. That is, they could be constraints that apply to language as well as other things, e.g., hierarchy, which they talk about as Merge. I’m thinking visual scene parsing is similar, where you have hierarchical chunks; that would be a vision-system version of Merge.

A little later on, we see “Universal Grammar” as the “initial state of language development” that’s “determined by our genetic endowment”, which reinforces the innate component, but hedges on whether this is innate knowledge of the structure of language, or innate knowledge about how to learn language. This latter interpretation becomes more salient when they describe UG as what leads infants to interpret parts of the environment as linguistic experience. This seems to be about the perceptual intake, and is less about knowledge of language than knowledge about what could count as language (= learning mechanism). Maybe that’s a broader definition of what it means to be a “principle of language”?

Later on in part 3.2, we get to more canonical UG examples, which are the linguistic parameters. These feel much more obviously language-specific. If they’re meant to be innate (which is how they’re typically talked about), then there we go. 

Side note: I would dearly love to figure out whether specific linguistic parameters like these are derivable from other more basic linguistic building blocks. I think this is where the Minimalist Program (MP) and the Principles & Parameters (P&P) representations can meet, with MP providing the core building blocks that generate the P&P variables. I just haven’t seen it explicitly done yet. But it feels very similar to the implicit vs. explicit hypothesis space distinction that Perfors (2012) discusses: the MP building blocks define the implicit hypothesis space (everything they are capable of generating), and the linguistic parameters are the explicit hypotheses actually generated from those building blocks.

Perfors, A. (2012). Bayesian models of cognition: what's built in after all? Philosophy Compass, 7(2), 127-138.


(3) Efficient computation: I really like seeing this term here as a core factor, though I’m tempted to make it “efficient enough computation”, especially if we’re going to eventually tie this kind of thing back to evolution.

(4) Rhetorical device danger: Section 3.1 has this statement that I think can get us into hot water later on: “[I]t follows that language learners never witness the whole conjugation table…fully fleshed out, for even a single verb.” Now we’ve just thrown down the gauntlet for some corpus analyst to hunt through a large enough sample and find just one verb that does. It doesn’t affect the main point at all, but it’s the kind of thing that can be easily misunderstood (cf. aux inversion input being used to argue against Poverty of the Stimulus).

(5) Section 3.3: “…linguistic principles such as Structure Dependence and the constraint on co-reference [c-command]…are most likely accessible to children innately” — Yes! In the sense that these principles are allowed into the hypothesis space. “Accessible” is definitely the right (hedgy) word, rather than saying these are the only options, period.

(6) Section 3.3, on Bayesian models of indirect negative evidence: “…for this reason, most recent models of indirect negative evidence explicitly disavow claims of psychological realism”. I find this a bit tricksy. Reading it, you might think: “Oh! The issue is that indirect negative evidence isn’t psychologically plausible to use.” But in actuality, the “disavowal” is about whether the inference algorithm used by these computational-level models is psychologically real. As far as I know, there are no claims that the computation itself isn’t psychologically real; rather, the assumption is that humans approximate that computation (which uses indirect negative evidence).

Related is the stated computational “intractability” of using indirect negative evidence: I admit, I find this weird. If we’re happy to posit alternative hypotheses in a subset-superset relationship, why is it so hard to compute predictions from those two hypotheses? The hard part seems to be defining the hypotheses so explicitly in the first place, and that doesn’t seem to be the part that’s targeted as “psychologically intractable”. If anything, it seems to be the psychologically necessary part. (The description that follows this bit in section 3.3 seems to highlight this, where Y&al2017 talk about the superset grammar existing, even if the default is the subset grammar.)
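Just to make my puzzlement concrete, here’s a toy version of indirect negative evidence via the size principle (my gloss, with made-up numbers, nothing from Y&al2017): once the subset and superset hypotheses are defined, comparing them on data consistent with both is cheap.

```python
# Toy size-principle comparison (my gloss, made-up numbers): a subset grammar
# licensing fewer forms assigns each observed form higher probability, so data
# consistent with both grammars increasingly favors the subset one.
subset_size, superset_size = 10, 100   # hypothetical numbers of licensed forms
n_observed = 20                        # forms heard, all licensed by both grammars

likelihood_subset = (1 / subset_size) ** n_observed
likelihood_superset = (1 / superset_size) ** n_observed

print(likelihood_subset / likelihood_superset)  # 1e+20: strongly favors the subset
```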


(7) Section 4.1, on the importance of empirical details: I really appreciate the pitch to make proposals account for specific empirical details. This is something near and dear to my heart. Don’t just tell me your $beautiful_theory will solve all my language acquisition problems; show me exactly how it solves them, one by one. (Minimalism, I’m looking at you. And to be fair, that’s exactly what the next-to-last sentence of section 4.1 says.)

Monday, May 1, 2017

Thoughts on Han et al. 2016 + Piantadosi & Kidd 2016 + Lidz et al. 2016

As with our previous reading, I really appreciate the clarity with which the arguments are laid out by H&al2016, P&K2016’s reply, and L&al2016’s reply-to-the-reply. I can also see where some confusion is arising in the debates surrounding this — there seems to be genuine ambiguity in the way terminology is used to describe the different perspectives about the source of linguistic knowledge (e.g., what “endogenous” actually refers to — more on this below). I also really like seeing a clear, concrete example of solving an induction problem that involves fairly abstract knowledge, and using knowledge internal to the learner to do so.

Specific thoughts:

(1) Endogenous: 
It’s interesting that the basic distinction drawn in the opening paragraph of H&al2016 is between domain-general vs. language-specific innate mechanisms, which is different from simply endogenous vs. not (that is, it’s a question of which kind of endogenous it is): “…did the data…allow for construction of knowledge through general cognitive mechanisms…or did that experience play more of a triggering role, facilitating the expression of abstract core knowledge…”

I think the reply by P&K2016 hits on an interesting terminology issue. For H&al2016, endogenous means “internal to the child”; in contrast, P&K2016 seem to go with the narrower definition of “genetically specified with no external influence”. This then makes P&K2016 question what to make of parents having different grammars than their kids. For H&al2016, I think the point is simply that something internal to the child (and not solely genetic) is responsible. It’s possible that the internal something developed from a combination of genetics and experience with other data, but it’s clearly something that can differ between parents and children. (General point: Just because something’s genetic doesn’t mean it doesn’t interact with the environment to produce the observed result. Concrete example: Height depends on both genetics and nutrition.)

This issue about what kind of endogenous knowledge it is (rather than simply whether it’s endogenous) is also something P&K2016 pick up on in their reply. They specifically bring up domain-general endogenous factors as possibilities (“differences in memory, motivation, or attention”) and note that the “root cause of the variation may not even be linguistic”. This, as far as I can tell, doesn’t go against H&al2016’s original point. So, it seems like P&K2016 are targeting a more specific position than H&al2016 argued in their paper, though H&al2016’s initial introductory wording suggested that more specific position.

I think L&al2016’s reply-to-the-reply reflects the ambiguity in this position — they note that their paper provides evidence for “endogenous linguistic content”. While the basic reading of this is simply “knowledge about language that’s internal” (and so silent about whether the origin of this knowledge is domain-specific or domain-general), I think it’s easy to interpret this as arguing for the origin of that knowledge to also be language-specific. The final paragraph of L&al2016’s reply underscores this interpretation, as they argue against domain-general mechanisms like memory, attention, and executive function being the source of the endogenous linguistic knowledge. And that, of course, is what P&K2016 (and many others) aren’t fond of. 


(2) Empiricism, P&K2016’s closing: What’s a “reasonable version” of empiricism? My (perhaps naive) understanding was that empiricism holds that everything is learned and nothing is innate, which I didn’t think anyone believed anymore. I thought that as soon as you believe even one thing is innate (no matter what flavor of innate it is), you’re by definition a nativist. Maybe this is another example of terminology being used differently by the different perspectives.


(3) One of the interesting things about the experiments in H&al2016 is that the experimental stimuli could be the driving force of grammatical choice. That is, there’s a possibility that people did have multiple grammars before the experiment, but selected one during the course of the experiment and then learned it. This is one way that could happen:

(a) When finally presented with data that require a choice in the verb-raising parameter, participants make that choice. 
(b) Primed by the previous choice (which may have involved some internal computation that was effortful and which they don’t want to repeat), participants stick with it throughout the first test session, thereby reinforcing that choice. 
(c) This prior experience is then reactivated in the second test session a month later, and used as a prior in favor of whichever option was previously chosen. 

If this is what happened, then by the act of testing people, we enable the convergence on a single option where there were previously multiple ones. How quantum mechanics of us…