Tuesday, February 5, 2019

Some thoughts on Fitz & Chang 2017 + Bonus thoughts on McCoy et al. 2018

(Just a quick note that I had a lot of thoughts about these papers, so this is a lengthy post.)

***F&C2017 general thoughts:

This paper tackles one of the cases commonly held up to argue for innate, language-specific knowledge: structure-dependent rules for syntax (and more specifically, complex yes/no questions that require such rules). The key: learn how to produce these question forms without ever seeing (m)any informative examples of them. There have been a variety of solutions to this problem, including recent Bayesian modeling work (Perfors et al. 2011) demonstrating how this knowledge can be inferred as long as the child has the ability to consider structure-dependent rules in her hypothesis space. Here, the approach is to broaden the relevant information beyond just the form of language (traditionally the focus of syntactic learning) to also include the meaning. This reminds me of CCG, which naturally links the form of something to its meaning during learning, and gets great bootstrapping power from that (see Abend et al. 2017 for an example and my forthcoming book chapter for a handy summary).

Perfors, A., Tenenbaum, J. B., & Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118(3), 306-338.

Abend, O., Kwiatkowski, T., Smith, N. J., Goldwater, S., & Steedman, M. (2017). Bootstrapping language acquisition. Cognition, 164, 116-143.

Pearl, L. (forthcoming). Modeling syntactic acquisition. In J. Sprouse (ed.), Oxford Handbook of Experimental Syntax.

Interestingly, with respect to what's built into the child, it's not clear to me that F&C2017 aren't still advocating for innate, language-specific knowledge (which is what Universal Grammar is typically taken to be). This knowledge just doesn't happen to be *syntactic*. Instead, the required knowledge is about how concepts are structured. This reminds me of my comments in Pearl (2014) about exactly this point. It seems that non-generativist folks aren't opposed to the idea of innate, language-specific knowledge -- they just prefer it not be syntactic (and preferably not labeled as Universal Grammar). Here, it seems that innate, language-specific knowledge about structured concepts is one way to accomplish the learning goal. More on this below in the specific thoughts section.

Pearl, L. (2014). Evaluating learning-strategy components: Being fair (Commentary on Ambridge, Pine, and Lieven). Language, 90(3), e107-e114.

***Bonus general thoughts on McCoy et al. 2018:
In contrast to F&C2017, M&al2018 are using only syntactic info to learn from. However, it seems like they're similar to prior work in using smaller building blocks (i.e., indirect positive evidence) to generate hierarchical structure (i.e., structure-dependent representations) as the favored hypothesis. This is also similar to Perfors et al. (2011) -- the main difference is that M&al2018 are using a non-symbolic model, while Perfors et al. (2011) are using a symbolic one. This then leads into the interpretation issue for M&al2018 -- when you find an RNN that works, why does it work? You have to do much more legwork to figure it out, compared to a symbolic model. However, F&C2017 had to do this too for their connectionist model, and I think they demonstrated how you can infer what may be going on quite well (in particular, which factors matter and how).

M&al2018 end up using machine learning classifiers to figure it out, and this seems like a great technique for trying to understand what’s going on in these distributed representations. It’s also something I’m seeing in the neuroscience realm when they try to interpret the distributed contents of, for instance, an fMRI scan.
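
To make the probing idea concrete, here's a minimal sketch of a diagnostic classifier in the general spirit of what M&al2018 do -- this is my toy illustration, not their actual setup, and the hidden states, labels, and dimensions below are all hypothetical. The logic: train a simple linear classifier to decode some structural property from the network's hidden states, and take above-chance accuracy as evidence the network represents that property somehow.

```python
# Minimal sketch of a diagnostic ("probing") classifier. Illustrative only:
# the hidden states and labels here are random stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical hidden states: one 50-dim RNN state per sentence.
hidden_states = rng.normal(size=(1000, 50))
# Hypothetical labels, e.g., 1 if the main-clause auxiliary was fronted.
labels = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0)

# If a simple linear probe can decode the property from the states, the
# network is plausibly representing it somewhere. (With the random states
# above, accuracy should hover near chance -- that's the null case.)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```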


**Specific thoughts on F&C2017:
(1) The key idea seems to be that nonlinguistic propositions are structured and this provides the crucial scaffolding that allows children to infer structure-dependent rules for the syntactic forms. Doesn’t this still rely on children having the ability to allow structure-dependence into their hypothesis space? Then, this propositional structure can push them towards the structure-dependent rules. But then, that’s no different than the Perfors et al. (2011) approach, where the syntactic forms from the language more broadly pointed towards structured representations that would naturally form the building blocks of structure-dependent rules.

The point that F&C2017 seem to want to make: The necessary information isn’t in the linguistic input at all, but rather in the non-linguistic input. So, this differs from linguistic nativists, who believe it’s not in the input (i.e., the necessary info is internal to the child) and from emergentists/constructionists, who believe it’s in the input (though I think they also allow it to not be the linguistic input specifically). But then, we come back to what prior knowledge/abilities the child needs to harness the information available if it’s in the input (of whatever kind) somewhere. How does the child know to view the input in the crucial way in order to be able to extract the relevant information? Isn’t that based on prior knowledge, which at some point has to be innate? (And where all the disagreement happens is how specific that innate knowledge is.)

Also related: In the discussion, F&C2017 say “Input is the oil that lubes the acquisition machinery, but it is not the machinery itself.” Exactly! And what everyone argues about is what the machinery consists of that uses that input. Here, F&C2017 say “the structure of meaning can constrain the way the language system interacts with experience and restrict the space of learnable grammar.” Great! So, now we just have to figure out where knowledge of that meaning structure originates.

(2) This description of the generativist take on structure dependence seemed odd to me: “consider only rules where auxiliaries do not move out of their S domains”. Well, sure, in this case we’re talking about (S)entences as the relevant structure. But the bias is more general than that (which is why it’s applicable to all kinds of structures and transformations, not just yes/no questions): only consider rules that use structures (like S) as building blocks/primitives. The reliance on linguistic structures, rather than other building blocks, is what makes this bias language-specific. (Though I could imagine an argument where the bias itself is actually a domain-general thing like “use the salient chunks in your system as building blocks for rules”, and that gets implemented in this domain with “salient chunks” = “linguistic structures like S”.)

(3) I quite liked Figure 1, with its visual representation of what a child’s hypothesis space looks like under each approach. I think it’s fair to say the linguistic nativist approach has traditionally ruled out structure-independent grammars from the hypothesis space, while the constructivist approach hasn’t. Of course, there are far more nuanced ways to implement the linguistic nativist idea (e.g., a low, but non-zero, prior on structure-independent grammars), but this certainly serves as the extreme endpoint.

(4) In 1.2, F&C2017 comment on the Perfors et al. 2011 Bayesian model, saying that it doesn’t “explain how grammars are acquired in the first place”. I think this must be referring to the fact that the hypothesis space of the Bayesian learner included possible grammars and the modeled learner was choosing among them. But how else is learning supposed to work? There’s a hypothesis space that’s defined implicitly, and the learner draws/constructs some explicit hypothesis from that implicit hypothesis space to evaluate (Perfors 2012 talks about this very helpfully). Maybe F&C2017 want a learner that constructs the building blocks of the implicit hypothesis space too? (In which case, sure, I’d love to have a model of conceptual change like that. But no one has that yet, as far as I’m aware.)

Perfors, A. (2012). Bayesian models of cognition: What's built in after all? Philosophy Compass, 7(2), 127-138.
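
To illustrate that implicit/explicit distinction (with a toy of my own, not anything from Perfors 2012): the hypothesis space can be defined implicitly by a set of building blocks plus a procedure for assembling them, while the learner only ever holds a few explicitly constructed candidates at a time.

```python
# Toy illustration (mine) of an implicitly defined hypothesis space:
# the learner never enumerates all grammars -- it has primitives plus
# a procedure for constructing candidate grammars on demand.
import random

PRIMITIVES = ["S", "NP", "VP", "AUX"]

def sample_rule():
    # A candidate rule rewrites one category as a short sequence of others.
    lhs = random.choice(PRIMITIVES)
    rhs = tuple(random.choices(PRIMITIVES, k=random.randint(1, 3)))
    return (lhs, rhs)

def sample_grammar(n_rules=4):
    # The space of all grammars exists implicitly in this procedure;
    # only the sampled grammars are ever explicit.
    return frozenset(sample_rule() for _ in range(n_rules))

# The learner evaluates a handful of explicit draws, not the whole space.
candidates = [sample_grammar() for _ in range(3)]
print(candidates)
```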

F&C2017 also note in that same part that it’s problematic that children don’t seem to be as optimal as the computational-level Bayesian model. Again, sure, in the same way that any computational-level model needs to be translated to an algorithmic-level version that approximates the inference with child limitations. But this doesn’t seem such a big problem -- or rather, if it is, it’s *everyone’s* problem who works at the computational level of modeling.
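
For concreteness, here's a toy sketch of that computational/algorithmic contrast (entirely mine -- a hypothetical two-grammar problem, not any paper's model): the computational-level learner computes the full posterior over hypotheses, while a resource-limited learner tracks one hypothesis at a time, in the spirit of win-stay, lose-sample approximations.

```python
# Toy contrast: ideal Bayesian update vs. a child-plausible approximation.
import random

PRIOR = {"structure-dependent": 0.5, "structure-independent": 0.5}

def likelihood(hypothesis, datum):
    # Toy likelihoods: both grammars handle simple declaratives, but only
    # the structure-dependent grammar handles complex yes/no questions.
    if datum == "complex-question" and hypothesis == "structure-independent":
        return 0.01
    return 1.0

def exact_posterior(data):
    # Computational level: an ideal learner's full Bayesian update over
    # every hypothesis at once.
    scores = dict(PRIOR)
    for d in data:
        for h in scores:
            scores[h] *= likelihood(h, d)
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}

def win_stay_lose_sample(data):
    # Algorithmic level: a resource-limited learner keeps its current
    # hypothesis while it explains the data, and resamples from the
    # prior when it fails -- no full posterior ever computed.
    hyps, weights = zip(*PRIOR.items())
    current = random.choices(hyps, weights=weights)[0]
    for d in data:
        if random.random() > likelihood(current, d):  # hypothesis "loses"
            current = random.choices(hyps, weights=weights)[0]
    return current

data = ["declarative"] * 20 + ["complex-question"] * 2
print(exact_posterior(data))
ends = [win_stay_lose_sample(data) for _ in range(1000)]
print(ends.count("structure-dependent") / len(ends))
```

Run enough times, the simple learner's end states land near where the ideal posterior points, which is the general shape of the answer to "children aren't ideal Bayesian reasoners."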

(5) I really like the point F&C2017 make about the need to integrate meaning with these kinds of learning problems. As they rightly note, what things mean is a very salient source of information. Traditionally, syntactic learning approaches in the generativist world have assumed the child only considers syntactic information when learning about syntactic knowledge. But precisely because syntax is a conduit through which meaning is expressed, and meaning transfer is the heart of communication, it seems exactly right that the child could care about information coming from meaning even when learning something syntactic. This again is where the Abend et al. (2017) model gets some of its bootstrapping power. (Also, the Pearl & Mis 2016 model for anaphoric one -- another traditional example of poverty of the stimulus -- integrates meaning information when learning something ostensibly syntactic.)

Pearl, L. & Mis, B. (2016). The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Language, 92(1), 1-30.

(6) The Dual-path connectionist model, which uses thematic role & tense info: Importantly, the need for this information is motivated by production in F&C’s model -- you’re trying to express some particular meaning with the form you choose, and that’s part of what’s motivating the form. In theory, this should also be relevant for comprehension, of course. But what’s nice about this approach is that it gets at one of the key criticisms generativists (e.g., Berwick et al. 2011) had of prior modeling approaches -- namely, the disconnect between the form and the meaning.

Berwick, R. C., Pietroski, P., Yankama, B., & Chomsky, N. (2011). Poverty of the stimulus revisited. Cognitive Science, 35(7), 1207-1242.

(7) The dual path architecture: It’s interesting to see the use of a compression layer here, which forces the model to abstract away from details -- i.e., to form internal categories like we believe humans do. (Here, this means abstracting away from individual words and forming syntactic categories of some kind). I think this forced abstraction is one of the key motivations for current autoencoder approaches in machine learning.
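
To give a sense of the analogy, here's a toy PyTorch sketch of my own (not the Dual-path architecture itself): squeezing inputs through a narrow bottleneck layer forces the network to find a compact shared code, which is the same pressure that pushes word representations toward category-like structure.

```python
# Toy sketch (mine) of compression-forces-abstraction: an autoencoder
# that must squeeze 100 one-hot "words" through an 8-unit bottleneck.
import torch
import torch.nn as nn

vocab_size, bottleneck = 100, 8

model = nn.Sequential(
    nn.Linear(vocab_size, bottleneck),  # the compression layer
    nn.Tanh(),
    nn.Linear(bottleneck, vocab_size),  # reconstruct the word back out
)

words = torch.eye(vocab_size)           # one one-hot vector per "word"
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    logits = model(words)
    # The bottleneck rules out a localist code (one unit per word); the
    # model must use a compact distributed code instead. In a model trained
    # on word sequences, that same squeeze is what pushes words used in
    # similar contexts toward category-like clusters.
    loss = loss_fn(logits, torch.arange(vocab_size))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```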

(8) Encoding complex utterances: If I’m understanding this correctly, here’s where we see the structure explicitly -- we have one complete proposition connected to the agent concept of another proposition. So, the structured representation is available to the learner a priori via the conceptual structure. So, we might reasonably call this domain-specific knowledge, just not domain-specific syntactic knowledge. Then, experience with the language input tells the child how to translate that structured concept into a sequence of words, in this case, via the use of relative clauses. In particular, the child needs to see relative clauses used for embedded conceptual structures like this.
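
If I have the right picture, the message representation looks something like the sketch below -- a hypothetical rendering of mine, where the class and field names aren't F&C2017's: one proposition sits inside a concept that fills a role in another proposition, and expressing that embedding is exactly what relative clauses are for.

```python
# Hypothetical sketch (mine) of a structured message where one proposition
# is embedded under the agent concept of another.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Proposition:
    action: str
    agent: "Concept"
    patient: Optional["Concept"] = None

@dataclass
class Concept:
    category: str                            # e.g., BOY, DOG
    modifier: Optional[Proposition] = None   # an embedded proposition

# "The boy who chased the dog laughs":
# the agent concept BOY carries a whole proposition, CHASE(BOY, DOG).
embedded = Proposition("CHASE", agent=Concept("BOY"), patient=Concept("DOG"))
message = Proposition("LAUGH", agent=Concept("BOY", modifier=embedded))
print(message)

# To produce this message, the learner has to discover that an embedded
# proposition like this one maps onto a relative clause in English.
```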

(9) Input distribution: I really appreciate F&C2017’s attention to realistic input distributions for training their model. This makes their model connect more to the actual problem children face, and so it makes their modeling results more informative.

(10) I think it's really informative to see these results where the model can recreate specific observed differences in the developmental trajectory, and explain them in terms of how the input is viewed. That is, the power of the learning approach is basically in viewing the input the right way, with the right scaffolding knowledge (here, about links between structured concepts and syntactic forms). Once that input lens is on, the input much more transparently reflects the observed behavior patterns in children. And this is what good computational modeling can do: make a learning theory specific enough to evaluate (here, about how to use that input), and then evaluate it by giving it realistic input and seeing if it can generate realistic output.

(11) It seems like F&C2017’s characterization of the hypothesis space aligns with other prior approaches like Perfors et al. 2011: the prior knowledge is a soft constraint on possible grammars, rather than absolutely ruling out structure-independent grammars. (In fact, Perfors et al. 2011 went further and used a simplicity prior, which is biased against the more complex structure-dependent grammars.) But the basic point is that there’s no need to categorically restrict the hypothesis space a priori. Instead, children can use their input and prior knowledge to restrict their hypotheses appropriately over time to structure-dependent rules.
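
Here's how concrete that contrast is, with toy numbers of my own (just to make the point, not taken from either paper): with a hard constraint, structure-independent grammars get zero prior; with a soft constraint, they start out possible -- even favored, under a simplicity prior -- and the data do the work of ruling them out.

```python
# Toy contrast (mine) between hard and soft constraints on the
# hypothesis space of grammars.
def posterior(prior, likelihoods):
    # Standard Bayesian update: posterior proportional to prior * likelihood.
    scores = {g: prior[g] * likelihoods[g] for g in prior}
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}

# Hard constraint (the classic linguistic-nativist endpoint):
# structure-independent grammars never enter the hypothesis space.
hard_prior = {"structure-dependent": 1.0, "structure-independent": 0.0}

# Soft constraint (F&C2017 and Perfors et al. 2011 in spirit): disfavored
# or even favored a priori, but never impossible.
soft_prior = {"structure-dependent": 0.3, "structure-independent": 0.7}

# Toy likelihoods: how well each grammar fits a stretch of input
# (the structure-dependent grammar fits far better).
fit = {"structure-dependent": 1e-4, "structure-independent": 1e-7}

print(posterior(hard_prior, fit))  # structure-dependent wins by fiat
print(posterior(soft_prior, fit))  # structure-dependent wins via the data
```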

**Bonus thoughts on M&al2018:
(B1) So, as cognitive scientists, should we spend more research time on the architecture that worked (i.e., the GRU with attention)? It does a very non-human thing, while also doing human things. And we don’t know why it’s doing either of those things, compared with other similar-seeming architectures that don’t. I should note that this is my existential issue with non-symbolic models, not a criticism specifically for M&al2018. I think they did a great job for a first pass at this question. Also, I really appreciate how careful they were about giving caveats when it comes to interpreting their results.
