Tuesday, February 19, 2019

Some thoughts on Tessler & Franke 2018

This is a great example of theoretically motivated computational modeling coupled with behavioral experiments, here in the realm of negated antonyms (e.g., "not unhappy"). My main qualm is with the paper length — there’s a lot of interesting stuff going on, and we just don’t get the space to see it fully discussed (more specifics on this below). This of course isn’t the authors’ fault — it just highlights the difficulty of explaining work like this in the space you normally get for conference proceedings.

Specific comments:
(1) The case study here with negated antonyms (which involve double negations like “not unhappy”) seems very relevant for sentiment analysis, where we still struggle to deal precisely with negated expressions. So this is one particular case where I can see the NLP community paying closer attention to and taking inspiration from cognitive work. For example, based on the results here for single utterances ("unhappy" = "not happy"), the antonym dictionary approach to negation (where "not happy" gets rewritten as "unhappy" or "sad") may not be a bad move in non-contrastive utterances.
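
Just to make that dictionary move concrete, here's a minimal sketch of what it might look like in practice -- my own toy illustration, with made-up antonym and sentiment lexicons, not anything from T&F2018:

```python
# Toy antonym-dictionary approach to negation for sentiment scoring.
# Both lexicons are invented for illustration.
ANTONYMS = {"happy": "sad", "good": "bad", "clean": "dirty"}
SENTIMENT = {"happy": 1.0, "sad": -1.0, "good": 1.0, "bad": -1.0,
             "clean": 0.5, "dirty": -0.5}

def rewrite_negation(tokens):
    """Replace 'not X' with X's antonym when the dictionary lists one."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "not" and i + 1 < len(tokens) and tokens[i + 1] in ANTONYMS:
            out.append(ANTONYMS[tokens[i + 1]])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def score(sentence):
    tokens = rewrite_negation(sentence.lower().split())
    return sum(SENTIMENT.get(t, 0.0) for t in tokens)

print(score("not happy"))  # -1.0, i.e., treated the same as "sad"
```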

(2) I love the clear-cut hypothesis space, and the building blocks of contrary (tall vs. short) vs. contradictory (even vs. odd) adjectives. My own sense is that my prior experience consists mostly of contrary adjectives, but I wonder if that’s true. (Helloooo, corpus analysis. Also, what do we know about children’s development of these types of fine semantic distinctions?)

(3) I wish there had been a bit more space to explain why we see the modeling results we do. For the full uncertain negation hypothesis, we get some mileage from a single utterance because it’s unnecessarily costly to say “not unhappy” unless it has a different meaning from “happy”, which makes sense. When there are multiple utterances, we see a complete separation of all four options because...there are four different individuals who presumably have different states (or else why use different expressions)?

For the more restricted hypothesis of bona fide contraries that connects morphological negation explicitly to an opposite valence, we see separation for both single and multiple utterances, but much more so for the multiple utterances. This is definitely a case of a more restricted hypothesis yielding stronger generalizations from ambiguous data, but I don’t quite see how we’re getting it. Certainly, “not unhappy” is more costly to produce than “happy”, so we get separation between those two terms, just as with the full uncertain negation hypothesis. But why, in the single utterance case, do we also get separation between “unhappy” and “not happy”?

For the most restricted hypothesis of logical negation, I get why we never get any separation — by definition, “unhappy” = “not happy” = not(happy), and so “not unhappy” = not(not(happy)) = “happy”.
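
(Side note, mostly to make the cost intuition concrete for myself: here's a toy RSA-style speaker. This is my own drastic simplification -- not T&F2018's actual model, and the thresholds, costs, and rationality parameter are all invented -- but it shows why a speaker essentially never says "not unhappy" when, as under the logical negation hypothesis, it's truth-conditionally identical to the cheaper "happy".)

```python
# Toy RSA-style speaker under the logical negation hypothesis,
# where unhappy = not(happy), so "not unhappy" = "happy".
import math

states = [0, 1, 2, 3]  # happiness degrees: 0 = miserable, 3 = elated
utterances = ["happy", "unhappy", "not happy", "not unhappy"]
cost = {"happy": 1.0, "unhappy": 2.0, "not happy": 2.0, "not unhappy": 3.0}

def meaning(u, s):
    """Invented threshold semantics: 'happy' is true of states >= 2."""
    happy = s >= 2
    return {"happy": happy, "unhappy": not happy,
            "not happy": not happy, "not unhappy": happy}[u]

def literal_listener(u):
    """L0: uniform over the states where the utterance is true."""
    true_states = [s for s in states if meaning(u, s)]
    return {s: (1 / len(true_states) if s in true_states else 0.0)
            for s in states}

def speaker(s, alpha=3.0):
    """S1: trades off informativity against production cost."""
    scores = {}
    for u in utterances:
        p = literal_listener(u)[s]
        scores[u] = math.exp(alpha * (math.log(p) - cost[u])) if p > 0 else 0.0
    z = sum(scores.values())
    return {u: v / z for u, v in scores.items()}

print(speaker(3))  # "not unhappy" gets ~0 probability: same truth
                   # conditions as "happy", but costlier to produce
```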

Tuesday, February 5, 2019

Some thoughts on Fitz & Chang 2017 + Bonus thoughts on McCoy et al. 2018

(Just a quick note that I had a lot of thoughts about these papers, so this is a lengthy post.)

***F&C2017 general thoughts:

This paper tackles one of the cases commonly held up to argue for innate, language-specific knowledge: structure-dependent rules for syntax (and more specifically, complex yes/no questions that require such rules). The key: learn how to produce these question forms without ever seeing (m)any informative examples of them. There have been a variety of solutions to this problem, including recent Bayesian modeling work (Perfors et al. 2011) demonstrating how this knowledge can be inferred as long as the child has the ability to consider structure-dependent rules in her hypothesis space. Here, the approach is to broaden the relevant information beyond just the form of language (which is traditionally what syntactic learning has focused on) to also include the meaning. This reminds me of CCG, which naturally links the form of something to its meaning during learning, and gets great bootstrapping power from that (see Abend et al. 2017 for an example and my forthcoming book chapter for a handy summary).

Perfors, A., Tenenbaum, J. B., & Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118(3), 306-338.

Abend, O., Kwiatkowski, T., Smith, N. J., Goldwater, S., & Steedman, M. (2017). Bootstrapping language acquisition. Cognition, 164, 116-143.

Pearl, L. (forthcoming). Modeling syntactic acquisition. In J. Sprouse (ed.), Oxford Handbook of Experimental Syntax.

Interestingly enough, with respect to what’s built into the child, it’s not clear to me that F&C2017 aren’t still advocating for innate, language-specific knowledge (which is what Universal Grammar is typically thought of as). This knowledge just doesn’t happen to be *syntactic*. Instead, the required knowledge is about how concepts are structured. This reminds me of my comments in Pearl (2014) about exactly this point. It seems that non-generativist folks aren’t opposed to the idea of innate, language-specific knowledge -- they just prefer it not be syntactic (and preferably not labeled as Universal Grammar). Here, it seems that innate, language-specific knowledge about structured concepts is one way to accomplish the learning goal. More on this below in the specific thoughts section.

Pearl, L. (2014). Evaluating learning-strategy components: Being fair (Commentary on Ambridge, Pine, and Lieven). Language, 90(3), e107-e114.

***Bonus general thoughts on McCoy et al. 2018:
In contrast to F&C2017, M&al2018 are using only syntactic info to learn from. However, it seems like they’re similar to prior work in using smaller building blocks (i.e., indirect positive evidence) to generate hierarchical structure (i.e., structure-dependent representations) as the favored hypothesis. This is also similar to Perfors et al. (2011) - the main difference is that M&al2018 are using a non-symbolic model, while Perfors et al. (2011) are using a symbolic one. This then leads into the interpretation issue for M&al2018 -- when you find an RNN that works, why does it work? You have to do much more legwork to figure it out, compared to a symbolic model. However, F&C2017 had to do this too for their connectionist model, and I think they demonstrated how you can infer what may be going on quite well (in particular, which factors matter and how).

M&al2018 end up using machine learning classifiers to figure it out, and this seems like a great technique for trying to understand what’s going on in these distributed representations. It’s also something I’m seeing in the neuroscience realm, where researchers try to interpret the distributed contents of, for instance, an fMRI scan.
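
To make the probing idea concrete, here's a minimal sketch of a diagnostic classifier. This isn't M&al2018's actual pipeline -- the "hidden states" and the probed property below are simulated stand-ins -- but it's the general shape of the technique:

```python
# Diagnostic-classifier sketch: train a simple probe to predict some
# property of interest from a network's hidden states. High probe
# accuracy suggests the property is (linearly) decodable from them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 64))  # stand-in for RNN encodings
# Toy "property": depends only on the first 8 hidden dimensions.
labels = (hidden_states[:, :8].sum(axis=1) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(probe.score(X_te, y_te))  # near-perfect here, since the property
                                # really is encoded in the vectors
```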


**Specific thoughts on F&C2017:
(1) The key idea seems to be that nonlinguistic propositions are structured and this provides the crucial scaffolding that allows children to infer structure-dependent rules for the syntactic forms. Doesn’t this still rely on children having the ability to allow structure-dependence into their hypothesis space? Then, this propositional structure can push them towards the structure-dependent rules. But then, that’s no different than the Perfors et al. (2011) approach, where the syntactic forms from the language more broadly pointed towards structured representations that would naturally form the building blocks of structure-dependent rules.

The point that F&C2017 seem to want to make: The necessary information isn’t in the linguistic input at all, but rather in the non-linguistic input. So, this differs from linguistic nativists, who believe it’s not in the input (i.e., the necessary info is internal to the child), and from emergentists/constructionists, who believe it’s in the input (though I think they also allow it to not be the linguistic input specifically). But then, we come back to the prior knowledge/abilities the child needs in order to harness the available information, if it’s in the input (of whatever kind) somewhere. How does the child know to view the input in the crucial way in order to extract the relevant information? Isn’t that based on prior knowledge, which at some point has to be innate? (And where all the disagreement happens is how specific that innate knowledge is.)

Also related: In the discussion, F&C2017 say “Input is the oil that lubes the acquisition machinery, but it is not the machinery itself.” Exactly! And what everyone argues about is what the machinery consists of that uses that input. Here, F&C2017 say “the structure of meaning can constrain the way the language system interacts with experience and restrict the space of learnable grammar.” Great! So, now we just have to figure out where knowledge of that meaning structure originates.

(2) This description of the generativist take on structure dependence seemed odd to me: “consider only rules where auxiliaries do not move out of their S domains”. Well, sure, in this case we’re talking about (S)entences as the relevant structure. But the bias is more general than that (which is why it’s applicable to all kinds of structures and transformations, not just yes/no questions): only consider rules that use structures (like S) as building blocks/primitives. The reliance on linguistic structures, rather than other building blocks, is what makes this bias language-specific. (Though I could imagine an argument where the bias itself is actually a domain-general thing like “use the salient chunks in your system as building blocks for rules”, and that gets implemented in this domain with “salient chunks” = “linguistic structures like S”.)

(3) I quite liked Figure 1, with its visual representation of what a child’s hypothesis space looks like under each approach. I think it’s fair to say the linguistic nativist approach has traditionally ruled out structure-independent grammars from the hypothesis space, while the constructivist approach hasn’t. Of course, there are far more nuanced ways to implement the linguistic nativist idea (e.g., a low, but non-zero, prior on structure-independent grammars), but this certainly serves as the extreme endpoint.

(4) In 1.2, F&C2017 comment on the Perfors et al. 2011 Bayesian model, saying that it doesn’t “explain how grammars are acquired in the first place”. I think this must be referring to the fact that the hypothesis space of the Bayesian learner included possible grammars and the modeled learner was choosing among them. But how else is learning supposed to work? There’s a hypothesis space that’s defined implicitly, and the learner draws/constructs some explicit hypothesis from that implicit hypothesis space to evaluate (Perfors 2012 talks about this very helpfully). Maybe F&C2017 want a learner that constructs the building blocks of the implicit hypothesis space too? (In which case, sure, I’d love to have a model of conceptual change like that. But no one has that yet, as far as I’m aware.)

Perfors, A. (2012). Bayesian models of cognition: what's built in after all?. Philosophy Compass, 7(2), 127-138.

F&C2017 also note in that same part that it’s problematic that children don’t seem to be as optimal as the computational-level Bayesian model. Again, sure, in the same way that any computational-level model needs to be translated to an algorithmic-level version that approximates the inference with child limitations. But this doesn’t seem such a big problem -- or rather, if it is, it’s *everyone’s* problem who works at the computational level of modeling.

(5) I really like the point F&C2017 make about the need to integrate meaning with these kinds of learning problems. As they rightly note, what things mean is a very salient source of information. Traditionally, syntactic learning approaches in the generativist world have assumed the child only considers syntactic information when learning about syntactic knowledge. But precisely because syntax is a conduit through which meaning is expressed, and meaning transfer is the heart of communication, it seems exactly right that the child could care about information coming from meaning even when learning something syntactic. This again is where the Abend et al. (2017) model gets some of its bootstrapping power. (Also, Pearl & Mis 2016 for anaphoric one -- another traditional example of poverty of the stimulus -- integrates meaning information when learning something ostensibly syntactic.)

Pearl, L. & Mis, B. (2016). The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Language, 92(1), 1-30.

(6) The Dual-path connectionist model, which uses thematic role & tense info: Importantly, the need for this information is motivated by production in F&C’s model -- you’re trying to express some particular meaning with the form you choose, and that’s part of what’s motivating the form. In theory, this should also be relevant for comprehension, of course. But what’s nice about this approach is that it gets at one of the key criticisms generativists (e.g., Berwick et al. 2011) had of prior modeling approaches -- namely, the disconnect between the form and the meaning.

Berwick, R. C., Pietroski, P., Yankama, B., & Chomsky, N. (2011). Poverty of the stimulus revisited. Cognitive Science, 35(7), 1207-1242.

(7) The dual-path architecture: It’s interesting to see the use of a compression layer here, which forces the model to abstract away from details -- i.e., to form internal categories like we believe humans do. (Here, this means abstracting away from individual words and forming syntactic categories of some kind.) I think this forced abstraction is one of the key motivations for current autoencoder approaches in machine learning.
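
For concreteness, here's a minimal bottleneck sketch -- my own illustration of the forced-abstraction idea, not F&C2017's dual-path architecture:

```python
# A narrow bottleneck forces the network to discard word-specific
# detail, so words used similarly end up with similar internal codes
# -- emergent category-like representations.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, vocab_size=50, bottleneck=4):
        super().__init__()
        self.encode = nn.Linear(vocab_size, bottleneck)  # the compression
        self.decode = nn.Linear(bottleneck, vocab_size)

    def forward(self, x):
        return self.decode(torch.tanh(self.encode(x)))

model = Autoencoder()
words = torch.eye(50)  # one-hot "words"
loss = nn.functional.mse_loss(model(words), words)
loss.backward()  # training minimizes reconstruction error through
                 # the 4-unit bottleneck, not by memorizing 50 words
```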

(8) Encoding complex utterances: If I’m understanding this correctly, here’s where we see the structure explicitly -- we have one complete proposition connected to the agent concept of another proposition. So, the structured representation is available to the learner a priori via the conceptual structure. That means we might reasonably call this domain-specific knowledge, just not domain-specific syntactic knowledge. Then, experience with the language input tells the child how to translate that structured concept into a sequence of words, in this case, via the use of relative clauses. In particular, the child needs to see relative clauses used for embedded conceptual structures like this.
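
Here's one hypothetical way to write down that kind of structured message -- my own notation, not F&C2017's actual encoding:

```python
# A message where a complete proposition hangs off the agent concept
# of another proposition; the child's job is learning that English
# realizes the attached proposition as a relative clause.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Concept:
    name: str
    modifier: Optional["Proposition"] = None  # attached proposition

@dataclass
class Proposition:
    action: str
    agent: Concept
    patient: Optional[Concept] = None
    tense: str = "present"

# Roughly: "The boy who chased the dog laughs."
chase = Proposition("chase", agent=Concept("boy"),
                    patient=Concept("dog"), tense="past")
msg = Proposition("laugh", agent=Concept("boy", modifier=chase))
```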

(9) Input distribution: I really appreciate F&C2017’s attention to realistic input distributions for training their model. This makes their model connect more to the actual problem children face, and so it makes their modeling results more informative.

(10) I think it’s really informative to see these results where the model can recreate specific observed differences in the developmental trajectory, and explain them by means of how the input is viewed. That is, the power of the learning approach is basically in viewing the input the right way, with the right scaffolding knowledge (here, about links between structured concepts and syntactic forms). Once that input lens is on, the input much more transparently reflects the observed behavior patterns in children. And this is what good computational modeling can do: make a learning theory specific enough to evaluate (here, about how to use that input), and then evaluate it by giving it realistic input and seeing if it can generate realistic output.

(11) It seems like F&C2017’s characterization of the hypothesis space aligns with other prior approaches like Perfors et al. 2011: the prior knowledge is a soft constraint on possible grammars, rather than absolutely ruling out structure-independent grammars. (In fact, Perfors et al. 2011 went further and used a simplicity prior, which is biased against the more complex structure-dependent grammars.) But the basic point is that there’s no need to categorically restrict the hypothesis space a priori. Instead, children can use their input and prior knowledge to restrict their hypotheses appropriately over time to structure-dependent rules.
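
To see how a soft constraint like that plays out, here's a toy Bayesian update where all the numbers are invented: even with a simplicity-style prior biased against structure-dependent grammars, the data swamp the prior pretty quickly.

```python
# Toy posterior update over two grammar hypotheses. The prior favors
# the structure-independent grammar (a stand-in for a simplicity
# bias), but the likelihoods (invented) favor structure dependence.
prior = {"structure_dependent": 0.3, "structure_independent": 0.7}
likelihood = {"structure_dependent": 0.9, "structure_independent": 0.5}

posterior = dict(prior)
for _ in range(20):  # 20 informative utterances later...
    posterior = {g: posterior[g] * likelihood[g] for g in posterior}
    z = sum(posterior.values())
    posterior = {g: p / z for g, p in posterior.items()}
print(posterior)  # structure_dependent ends up with ~all the mass
```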

**Bonus thoughts on M&al2018:
(B1) So, as cognitive scientists, should we spend more research time on the architecture that worked (i.e., the GRU with attention)? It does a very non-human thing, while also doing human things. And we don’t know why it’s doing either of those things, compared with other similar-seeming architectures that don’t. I should note that this is my existential issue with non-symbolic models, not a criticism specific to M&al2018. I think they did a great job for a first pass at this question. Also, I really appreciate how careful they were about giving caveats when it comes to interpreting their results.