Tuesday, June 4, 2019

Some thoughts on Potts 2019 + Berent & Marcus 2019

I really appreciate Potts sketching out how vectors of numbers as the core meaning could impact semantics more broadly. This is the kind of broader speculation that’s helpful for people trying to see the effects of this key assumption on things they know and love. Moreover, Potts is aware of the current shortcomings of the “DL semantics” approach, but focuses on where it could be a useful tool for semantic theory. (This is my own inclination as well, so I’m very sympathetic to this point of view.) Interestingly, I think Berent & Marcus also end up sympathetic to a hybrid approach, despite their concerns about the relationship between symbolic and non-symbolic approaches to language. A key difference seems to be where each commentary focuses — Potts zooms in on semantics, while Berent & Marcus mostly seem to think about phonology and syntax. And previously, non-symbolic approaches seem to have left a poor impression on Berent & Marcus.

Other thoughts:
(1) Potts: The idea that machine learning is equivalent to neural networks still momentarily confuses me. In my head, machine learning is the learning part (so it could be non-neural, like SVMs). Another important component is then feature selection, which would correspond to the embedding into that vector of numbers in Potts’s terminology. I guess this just goes to show how terminology changes over time.

(2) Potts: I totally get the analogy of how to do function application with an n-dimensional array. But how do we know that this concatenation and multiplication by a new matrix (W) yields the correct compositional meaning of two elements? Maybe the idea is that we have to find the right function application for our n-dimensional vectors? Potts basically says this by saying we have to learn the values for W from the data, so we have to use supervised learning to get the right W so that compositional meaning results. Okay. But what guarantee do we have that there is in fact a W for all the compositional meaning we might want? Of course, maybe that’s a problem for current semantic theory’s function application as well.
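To make the concatenate-then-transform idea concrete, here’s a minimal sketch in Python. The 2-dimensional vectors and the particular values in W are invented for illustration; in Potts’s setup, W would be learned from data via supervised training.

```python
# Sketch of "function application" over vectors: concatenate the two
# meanings, then multiply by a matrix W. The values here are toy
# stand-ins; in practice W would be learned from data.

def compose(u, v, W):
    """Compose word vectors u and v as W @ concat(u, v)."""
    x = u + v  # list concatenation, length len(u) + len(v)
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# 2-dimensional toy "meanings" (hypothetical values)
gray = [1.0, 0.0]
cat  = [0.0, 1.0]

# W maps the 4-dim concatenation back to a 2-dim meaning;
# these particular weights just average the two inputs.
W = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]

gray_cat = compose(gray, cat, W)
print(gray_cat)  # [0.5, 0.5]
```

The worry in the text then amounts to: nothing in this setup guarantees that a single learned W exists that makes every such composition come out semantically right.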

(3) Potts, on how the dataset used to optimize the system will be a collection of utterances, rather than I-language abstractions: So, because of this, it’d be including aspects of both representation and use (like frequency info) together, rather than just the representation part. This isn’t a bad thing necessarily, as long as we don’t explicitly care about the representation part separately. It seems like linguists often do care about this while the NLP community doesn’t. I think Potts’s example with the A but B construction highlights this difference nicely. Potts notes that this would make “use phenomena” more natural to study than they currently are under an intensional semantics approach, and I can see this. I just worry about how we derive explanations from a DL approach (i.e., what do we do with the weight matrix W, once we learn it via supervised machine learning approaches?).

(4) Potts, on how the goal in machine learning is generalization, however that’s accomplished (with compositionality just one way to do this): Maybe compositionality is what humans ended up with due to bottleneck issues during processing and learning over time? This is the kind of stuff Kirby (e.g., Kirby 2017) has modeled with his language evolution simulations.

Kirby, S. (2017). Culture and biology in the origins of linguistic structure. Psychonomic Bulletin & Review, 24(1), 118-137.

(5) Potts, on how having any representation for lexical meaning is better than not: I totally agree with this. A hard-to-interpret vector of numbers encoding helpful aspects of the representation and use of “kitty” is still more informative than an unanalyzed [[kitty]]. It just doesn’t help us explain things in the symbolic terms we use to verbalize our theories.

(6) Berent & Marcus, on how the algebraic hypothesis assumes an innate capacity to operate on abstract categories: Sure! Hello, Bayesian inference, for example. Yet another reason why I’m always confused when generative folks don’t like Bayesian inference.

(7) Berent & Marcus, “mental operations are structure-sensitive -- they operate only on the form of representations and ignore their meaning”: It seems like this is a syntax-specific view -- surely semantic operations would operate over meaning? Or is this the difference between lexical semantics and higher-order semantics?

(8) Berent & Marcus, on how we could tell if neural networks (NNs) generated algebraic approaches: I’m not sure I quite follow the train of logic presented. If an NN does manage to capture human behavior correctly, why would we assume that it had spontaneously created algebraic representations? Wouldn’t associationists naturally assume that it didn’t have to (unless explicitly proven otherwise)?

(9) Berent & Marcus, on previous connectionist studies: I definitely understand Berent & Marcus’s frustration with previous connectionist networks and their performance, but it seems like there have been vast improvements since 2001. I’d be surprised if you couldn’t make an LSTM of some kind that could capture some of the generalizations Marcus investigated before, provided enough data was supplied. Granted, part of the cool thing about small humans is that they don’t get all of Wikipedia to learn from, and yet can still make broad generalizations.

(10) Berent & Marcus: Kudos to Berent & Marcus for being clear that they don’t actually know for sure the scope of human generalizations in online language processing -- they’ve been assuming humans behave a particular way that current NNs can’t seem to capture, but this is yet to be empirically validated. If humans don’t actually behave that way, then maybe the algebraic commitment needs some adjustment.

(11) Berent & Marcus: It’s a fascinating observation that a resistance to the idea of innate ideas itself might be an innate bias (the Berent et al. 2019 reference). This is the first I’ve heard of this. I always thought the resistance was an Occam’s Razor sort of thing, where building in innate stuff is more complex than not building in innate stuff.

Tuesday, May 21, 2019

Some thoughts on Linzen 2019 + Rawski & Heinz 2019

I’m totally with Linzen on linguistic theory providing better evaluation items for RNNs. (Hurrah for linguistic theory contributions!) In contrast, I’m just not sold yet on the utility of RNNs for modeling human language development or processing. The interpretability issue just kills it for me (as it does for Rawski & Heinz) -- how can we know if the RNN is or isn’t representing something? And if we have a concrete idea about what it should be representing vs. not, why not use a symbolic model? (More on this below in the “Other thoughts” section.)

I find it heartening to hear that other folks like Rawski & Heinz are also talking about the ML revolution with deep learning techniques as “alchemy”, longing for the “rigor police” to return. I sympathize with the rigor police.

Rawski & Heinz offer their take on the rigor police, highlighting the contributions that computational learnability (CL) investigations can make with respect to the problems that RNNs are currently being pitched at. In particular, Rawski & Heinz note how CL approaches can answer the question of “Is it possible to learn this thing at all, given this characterization of the learning problem?” The major selling point is that CL results are easily interpretable (“analytically transparent”). This is a key difference that matters a lot for understanding what’s going on. That said, I tend to have concerns with different CL implementations (basically, if they don’t characterize the learning problem in a way that maps well to children’s language acquisition, I don’t know why I should care as a developmental linguist). But, this is a different, solvable problem (i.e., investigate characterizations that do map well) — in contrast, interpretability of RNNs isn’t as immediately solvable.

Other thoughts:

(1) Linzen, on RNNs for testing what constraints are needed for learning different things: So far, I haven’t been convinced that it’s helpful to use neural networks to test what innate knowledge is required. All we know when we stumble upon a neural network that can learn something is that it hasn’t explicitly encoded knowledge beforehand in a way that’s easy to interpret; who knows what the implicit knowledge is that’s encoded in the architecture and initialization values? (As Rawski & Heinz note, ignorance of bias doesn’t mean absence of bias.)

(2) Linzen, “language model” = “estimating how likely a particular word is to occur given the words that have preceded it”. I was surprised by this definition. What about other language tasks? I honestly thought “language model” referred to the representation of language knowledge, rather than the evaluation task. So, the language model is the thing that allows you to predict the next word, given the previous words, not the prediction itself. Richard Futrell says this definition of “language model” is right for current ML use, though. (Thanks, Richard!)
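For concreteness, here’s the current ML sense of “language model” as a minimal bigram model over an invented toy corpus: the model is the thing that estimates P(next word | preceding word), and “prediction” is just reading off that estimate.

```python
from collections import Counter, defaultdict

# A minimal bigram "language model" in the ML sense Linzen describes:
# it estimates P(next word | previous word) from corpus counts.
# The toy corpus is invented for illustration.

corpus = "the cat sat on the mat the cat ate".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def p_next(prev, word):
    """Estimated probability that `word` follows `prev`."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

print(p_next("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
```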

(3) Linzen, on using psycholinguistic materials designed to identify linguistic knowledge in humans in order to identify implicit linguistic knowledge in RNNs: This approach makes a lot of sense to me. The human mind is a black box, just like the RNN, and we have decades of materials designed to identify the nature of the knowledge inside that black box. So, I think the key is to start with the most basic tests, since the more complex tests build in assumptions about human knowledge due to the results from the basic ones.

(4) Linzen, noting the importance of having baseline models that are known not to be able to represent the linguistic properties of interest: But how do we know they can’t? Aren’t RNNs universal function approximators, so they can (theoretically) capture any behavior, given enough data? Maybe the point is to use one where we know it’s failed on the linguistic knowledge in question somehow…

(5) Linzen, on the Gulordava et al. RNNs that did better at capturing long-distance agreement when semantic information was helpful: “This suggests that the models did learn some of the syntactic principles underlying subject-verb agreement.” Does it? Maybe if we think “syntactic principles” = something based on the sequence of words, rather than word meaning (i.e., a very broad definition of “the syntactic principles”). But I have no idea how we could tell that the RNN used anything like the syntactic principles we think humans use.

(6) Linzen, on using RNNs for learnability tests: “First, is it indeed the case that the linguistic phenomenon in question cannot be learned from child-directed speech without the proposed constraint?” -- I’m sympathetic to this, but how do we know the RNN isn’t implicitly encoding that constraint in its distributed vectors?

“Second, and equally important, does the proposed constraint in fact aid acquisition?” -- Again, I’m very sympathetic, but why not use a symbolic model for this? Then you can easily tell the model has vs. doesn’t have the proposed constraint. (To be fair, Linzen notes this explicitly: “...the inductive biases of most neural network architectures are not well characterized.”)

(7) Linzen, on building in structural knowledge by giving that structural knowledge as part of the RNN’s input (e.g., “the man” together, then “eats pizza” together = structural knowledge that those two chunks are meaningful chunks): If this is an example of building in a proposed constraint, how do we know the RNN is using those chunks the way we think? Why couldn’t it be doing something wild and wacky with those chunks, instead of treating them as “structured units”? I guess by having chunks at all, it counts as doing something structural? But then how do we make the equivalent of an overhypothesis, where the model likes structured units, but we let the model pick out which structured units it wants?

(8) Linzen, “...neural networks replicate a behavioral result from psycholinguistics without the theoretical machinery...suggest that the human behavior...might arise from statistical patterns in the input.”  Plus whatever implicit biases the RNN has, right? It’s not just statistical patterns working over a blank slate. For example, in the agreement attraction case Linzen discusses, how do we know the RNN didn’t encode some kind of markedness thing for plurals in its distributed representation?

Related to that same study, if the RNNs then show they’re not behaving like humans in other respects, how can we be sure that the behavior which looks human-like actually has the same underlying cause/representation as it does in humans? And if it doesn’t, what have we learned from the RNNs about how humans represent it?

(9) Rawski & Heinz, taking a grammar as target of acquisition, because it’s something of finite size with a symbolic, generative structure: Learning is then a problem of “grammatical inference”. This clearly differs from Linzen’s characterization, where the target of acquisition is something (a function) that can generate accurate predictions, and who cares what it looks like? Note that grammars can make predictions too — and we know what they look like and how they work to make those predictions. (Rigor police, check!)

(10) Rawski & Heinz, on typological arguments for learnability: I have a slight concern with their typological argument. In particular, just because we don’t see certain patterns across existing human languages doesn’t mean they’re impossible. It seems like we should couple typological observations with experimental studies of what generalizations are possible for humans to make when the data are available to support those generalizations.

A related thought regarding typological predictions, though: this seems like a useful evaluation metric for RNNs. In particular, any RNN that’s successful on one language can be applied to other languages’ input to see if it makes the right cross-linguistic generalizations.

(11) Rawski & Heinz, on Weiss et al. 2018, which extracted a (symbolic) deterministic FSA representation from an RNN: This seems like exactly what we want for interpretability, though it’s more about identifying a symbolic representation that makes the same predictions as the RNN, rather than reading off the symbolic representation from the RNN. But I guess it doesn’t really matter, as long as you’re sure the symbolic representation really is doing exactly what the RNN is?

Tuesday, May 7, 2019

Some thoughts on Pearl 2019 + Dunbar 2019

I think these two commentaries (mine and Dunbar’s) pair together pretty nicely -- my key thought can be summed up as “if we can interpret neural networks, maybe they can build things we didn’t think to build with the same pieces and that would be cool”; Dunbar’s key thought is something like “we really need to think carefully about how interpretable those networks are…” So, we both seem to agree that it’s great to advance linguistic theory with neural networks, but only if you can in fact interpret them.

More specific thoughts on Dunbar 2019:
(1) Dunbar highlights what he calls the “implementational mapping problem”, which is basically the interpretability problem. How do we draw “a correspondence between an abstract linguistic representational system and an opaque parameter vector”? (Of course, neurolinguists the world over are nodding their heads vigorously in agreement because exactly the same interpretability problem arises with human neural data.)

To draw this correspondence, Dunbar suggests that we need to know what representations are meant to be there. What’s the set of things we should be looking for in those hard-to-interpret network innards? How do we know if a new something is a reasonable something (where reasonable may be “useful for understanding human representations”)?

(2) For learnability:  Dunbar notes that to the extent we believe networks have approximated a theory well enough, we can test learnability claims (such as whether the network can learn from the evidence children learn from or instead requires additional information). I get this, but I still don’t see why it’s better to use this over a symbolic modeling approach (i.e., an approach where the theory is transparent).

Maybe if we don’t have an explicit theory, we generate a network that seems to be human-like in its behavior. Then, we can use the network as a good-enough theory approximation to test learnability claims, even if we can’t exactly say what theory it’s implementing? So, this would focus on the “in principle” learnability claims (i.e., can whatever knowledge be learned from the data children learn from, period).

Tuesday, April 16, 2019

Some thoughts on Pater 2019

As you might imagine, a lot of my thoughts are covered by my commentary that we’re reading as one of the selections next time. But here’s the briefer version: I love seeing the fusion of linguistic representations with statistical methods. The real struggle for me as a cognitive modeler is when using RNNs is better than symbolic models that are more easily interpretable (e.g., hierarchical Bayesian models that allow overhypotheses to define a wider space of latent hypotheses).

At the very end of Pater’s article, I see a potentially exciting path forward with the advent of RNNs (or other models with distributed representations) that are interpretable. I’m definitely a fan of techniques that allow the learning of hidden structure without it being explicitly encoded — this is the same thing I see in hierarchical Bayesian overhypotheses. More on this below (and in my commentary for next time).

Specific thoughts:

(1) I couldn’t agree more with the importance of incorporating statistical approaches more thoroughly into learning/acquisition theories, but I have yet to be sold on the neural networks side. It really depends on what kind of network: are they matching neurobiology (e.g., see Avery and Krichmar 2017, Beyeler, Rounds, Carlson, Dutt, & Krichmar 2017, Krichmar, Conrad, & Asada 2015; Neftci, Augustine, Paul, & Detorakis 2017, Neftci, Binas, Rutishauser, Chicca, Indiveri, & Douglas 2013) or are they a computational-level distributed representations approach (I think this is what most RNNs are), which seems hard to decipher, and so less useful for exploring symbolic theories more completely? Maybe the point is to explore non-symbolic theories.

Pater notes the following about non-symbolic approaches: “...it is hard to escape the conclusion that a successful theory of learning from realistic data will have a neural component.” If by neural, Pater means an implementational-level description, sure. But I’m not sold on distributed representations as being necessary for a successful theory of learning -- a theory can operate at the computational or algorithmic levels.

(2) I completely agree that structure-independent representations (statistical sequences that don’t involve phrases, etc.) can only get you so far. The interesting thing from an NLP standpoint, of course, is exactly how far they can get you — which often turns out to be surprisingly far. In fact, it’s often much further than I would have expected — e.g., n-grams over words (not even syntactic categories!!) work remarkably well as features for opinion spam detection, with near 90% classification accuracy (Ott et al. 2011, 2013). Though I guess n-grams do heuristically encode some local structure.
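As a sketch of how structure-independent these features are, here’s word-bigram extraction over a made-up review-like sentence (in a pipeline like Ott et al.’s, counts like these would feed a classifier such as an SVM):

```python
# Word n-gram feature extraction, structure-independent in the sense
# above: no phrases or syntactic categories, just surface sequences.
# The example sentence is invented.

def ngram_features(text, n=2):
    """Return a dict mapping each word n-gram in `text` to its count."""
    words = text.lower().split()
    feats = {}
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        feats[gram] = feats.get(gram, 0) + 1
    return feats

print(ngram_features("the room was very very clean"))
# {'the room': 1, 'room was': 1, 'was very': 1, 'very very': 1, 'very clean': 1}
```

Note that bigrams like “very very” do heuristically capture a bit of local structure, which is part of why such shallow features go as far as they do.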

(3) RNNs seem to need to incorporate hierarchical representations to work (e.g., the Recurrent Neural Network Grammars of Dyer et al. 2016, and incorporating hierarchical structure into current neural network approaches in AI/NLP). But, sequence-to-sequence models do pretty well without explicit structure encoded in. So, if sequence-to-sequence models can handle aux-inversion (e.g., as in McCoy, Frank, & Linzen 2018...well, at least sort of -- it’s not clear they handle it the way humans do), what do we make of it from the linguistic cognition perspective?

This comes back to the question of model interpretation. With symbolic models, it’s usually clear what theory of representation is being evaluated. For RNNs, do we know what the distributed representations/continuous hypotheses are encoding? (This of course is less a problem from the engineering perspective -- we’re happy if we can get the machines to do it as well or better than humans.) As Pater noted, some read-out can be done with clever model comparisons, and some distributed representations (e.g., Palangi et al’s (2017) Tensor Product Recurrent Networks) may in fact encode syntactic structures we recognize. So then, the question is what we’re getting from the distributed representation.

Pater: “...it is given the building blocks of symbols and their roles, but must learn their configurations”. This starts to sound like the latent vs. explicit hypothesis space construction of Perfors (2012), which can be implemented in a variety of ways (e.g., variational learning as in Yang 2002). That is, RNNs allow the modeler to specify the building blocks but let the model construct the explicit hypotheses that get evaluated, based on its prior biases (RNN architecture, Bayesian overhypothesis hyperparameters, etc.). Something that could be interesting: the RNN version allows construction of explicit hypotheses from the building blocks that are outside what the modeler would have built in to the overhypothesis parameters; that is, they may be perfectly reasonable hypotheses from the given building blocks, but go against the natural overhypothesis-style parametric biases and so would get a low probability of being generated (and subsequently evaluated).

Since the RNN generates hypotheses with whatever architectural biases mold the explicit hypothesis construction, it may give higher probability to hypotheses that were lower-probability for a hierarchical Bayesian model.  That is, the Bayesian overhypotheses may be quite general (especially if we back off to over-over-hypotheses, and so on), but still require an explicit bias at some level for how hypotheses are generated from overhypotheses. That has to be specified by the modeler. This may cause Bayesian modelers to miss ways that certain building blocks can generate the kinds of linguistic hypotheses we want to generate.

An analogy: Genetic algorithms can be used to identify solutions that humans didn’t think of because they employ a much wider search of the latent hypothesis space; humans are fettered by their biases for what an optimal solution is going to look like.  Here: symbolic modelers may be fettered by ideas about how building blocks can be used to generate explicit hypotheses; RNNs may allow a wider search of the latent hypothesis space because they’re bound by different (implicit) ideas, via the RNN architecture. So, the solution an RNN comes up with (assuming you can interpret it) may provide a novel representational option, based on the building blocks given to it.

Bigger point: RNNs and distributed representations may provide a novel way of exploratory theorizing (especially for syntactic learning), to the extent that their innards are interpretable. For theory evaluation, on the other hand, it’s better to go with a symbolic model that’s already easy to understand….unless your theory is about the building blocks, leaving the explicit hypotheses they build and evaluate unspecified.

Tuesday, March 5, 2019

Some thoughts on Nordmeyer & Frank 2018

This is exactly the kind of behavioral work that serves as a good target of developmental modeling. (Thanks, N&F2018!) Moreover, the particular experiment lends itself very naturally to RSA modeling, given the importance of context manipulation (and then the RSA model allows us to be more concrete about what those contextual variables could be and what exactly they could do). 

More generally, this work also falls in a larger body of work that underscores the importance of pragmatic felicity when doing child language experiments. This was the basis for the Truth Value Judgment paradigm (Crain & Thornton 1998) -- it’s important to give supportive contexts if you want kids to show you their linguistic knowledge. They’re not as good as adults at “test-taking” -- i.e., compensating for a lack of supportive context by implicitly supplying their own. So, if kids aren’t behaving like they have adult-like linguistic knowledge, check if pragmatic (or processing) factors might be getting in the way.

Crain, S., & Thornton, R. (1998). The truth value judgment task: Fundamentals of design. University of Maryland working papers in linguistics, 6, 61-70.

Some other thoughts:

(1) The Kim (1985) child behavioral setup, which involved (for example) someone pointing at an apple and saying “This is not a banana”. The child would reply “wrong!”, but of course we don’t know why she’s saying it’s wrong. Is it the wrong meaning (semantic issue) or the wrong thing to say (pragmatic issue)? This reminds me of recent work on children’s (non-)endorsements when it comes to quantifier scope ambiguity (Viau, Lidz, & Musolino 2010, Savinelli et al. 2017). The key idea is that they weren’t saying no because they couldn’t get the interpretation; they were saying no because it wasn’t a very informative interpretation, given the prior context. This also seems to be a main factor in English children’s pronoun interpretation behavior being wonky (Conroy et al. 2009). Also similar to the Conroy et al. (2009) study is how N&F2018 are explicitly manipulating the context to show a replication of prior behavior and then how to fix it with supportive pragmatic context.

Savinelli, K. J., Scontras, G., & Pearl, L. (2017). Modeling scope ambiguity resolution as pragmatic inference: Formalizing differences in child and adult behavior. In CogSci.

Viau, J., Lidz, J., & Musolino, J. (2010). Priming of abstract logical representations in 4-year-olds. Language Acquisition, 17(1-2), 26-50.

Conroy, A., Takahashi, E., Lidz, J., & Phillips, C. (2009). Equal treatment for all antecedents: How children succeed with Principle B. Linguistic Inquiry, 40(3), 446-486.

(2) Varying the linguistic form: “has no X” vs. “doesn’t have an X”. Corpus analysis could tell us how often negation appears in each of these forms, given that “has no X” was rated worse than “doesn’t have an X”. Then we would know whether that’s just a frequency effect or whether something more interesting is happening.
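As a sketch of what that corpus count might look like (the “corpus” here is an invented four-sentence sample, and the patterns are deliberately crude):

```python
import re

# Tally "has no X" vs. "doesn't/don't have a(n) X" in a toy corpus.
# A real analysis would run over child-directed speech transcripts
# (e.g., CHILDES) with proper tokenization.

corpus = ("She has no umbrella. He doesn't have an umbrella. "
          "They don't have a car. It has no engine.")

has_no = len(re.findall(r"\bhas no \w+", corpus, re.IGNORECASE))
doesnt_have = len(re.findall(r"\b(?:doesn't|don't) have an? \w+", corpus,
                             re.IGNORECASE))

print(has_no, doesnt_have)  # 2 2
```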

(3) With adults, “has no X” is better when everyone else has an X. The pragmatic reason for this is that there’s a more informative utterance when the referent has a Y instead of an X (i.e., “has a Y”) -- this seems like something that could be captured in an RSA model’s cost function. Basically, it costs more to say “has no X” compared with “has Y” when both are true.
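To make the cost idea concrete, here’s a minimal RSA-style speaker sketch. The states, utterances, and cost values are all invented; a real model of N&F2018’s contexts would need a richer state space.

```python
import math

# Minimal RSA-style speaker: utterance probability trades off literal
# informativity against production cost. All values here are toy
# assumptions, chosen so "has no apple" is costlier than "has a banana".

states = ["has_banana_only", "has_nothing"]
utterances = {
    # utterance: set of states where it's literally true
    "has no apple": {"has_banana_only", "has_nothing"},
    "has a banana": {"has_banana_only"},
}
cost = {"has no apple": 2.0, "has a banana": 1.0}

def literal_listener(utt):
    """Uniform distribution over states where the utterance is true."""
    true_states = utterances[utt]
    return {s: (1 / len(true_states) if s in true_states else 0.0)
            for s in states}

def speaker(state, alpha=1.0):
    """P(utt | state) ∝ exp(alpha * (log L0(state | utt) - cost(utt)))."""
    scores = {}
    for utt in utterances:
        p = literal_listener(utt)[state]
        scores[utt] = math.exp(alpha * (math.log(p) - cost[utt])) if p > 0 else 0.0
    total = sum(scores.values())
    return {u: v / total for u, v in scores.items()}

probs = speaker("has_banana_only")
# "has a banana" wins: it's both more informative and cheaper here.
print(probs["has a banana"] > probs["has no apple"])  # True
```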

(4) In general, kudos for getting kids to give ratings. This is super-hard to do well, since it requires young children to think metalinguistically. I also really appreciate seeing the histogram of responses in Figure 4. Here, we can see that there are still quite a number of kids who, in the unsupportive none context (where no one else has anything) think that “Abby doesn’t have an apple” is fine (>50); however, many more kids (>100) think it’s terrible. Similarly, there are >50 kids who think “Abby doesn’t have an apple” is terrible in the target context (where everyone else has an apple), though many more (>100) think it’s fine. Hello, child data messiness -- and bless your hearts, child behavioral researchers.

I wish we could see an equivalent histogram for adults, though. I wonder how much of this messiness is because we’re dealing with kids vs. dealing with a felicity scale vs. dealing with a phenomenon that’s inherently messy in the target state.

Tuesday, February 19, 2019

Some thoughts on Tessler & Franke 2018

This is a great example of theoretically-motivated computational modeling coupled with behavioral experiments, here in the realm of negated antonyms (e.g., "not unhappy"). My main qualm is with the paper length — there’s a lot of interesting stuff going on, and we just don’t get the space to see it fully discussed (more specifics on this below). This of course isn’t the authors’ fault — it just highlights the difficulty of explaining work like this in the space you normally get for conference proceedings.

Specific comments:
(1) The case study here with negated antonyms (which involve double negations like “not unhappy”) seems very relevant for sentiment analysis, where we still struggle to deal precisely with negated expressions. So, more generally, this is a particular case where I can see the NLP community paying closer attention and taking inspiration from cognitive work. For example, based on the results here for single utterances ("unhappy" = "not happy"), the antonym dictionary approach to negation (where "not happy" = "unhappy" or "sad") may not be a bad move in non-contrastive utterances.

(2) I love the clearcut hypothesis space, and the building blocks of contrary (tall vs. short) vs. contradictory (even vs. odd) adjectives. My own sense is that my prior experience consists mostly of contrary adjectives, but I wonder if that’s true. (Helloooo, corpus analysis. Also, what do we know about children’s development of these types of fine semantic distinctions?)

(3) I wish there had been a bit more space to explain why we see the modeling results we do. For the full uncertain negation, we get some mileage from a single utterance because it’s unnecessarily costly to say “not unhappy” unless it had a different meaning from "happy", which makes sense. When there are multiple utterances, we see a complete separation of all four options because...there are four different individuals who presumably have different states (or else why use different expressions)?

For the more restricted hypothesis of bonafide contraries that connects morphological negation explicitly to an opposite valence, we see separation for both single and multiple utterances, but much more so for the multiple utterances. This is definitely a case of a more restricted hypothesis yielding stronger generalizations from ambiguous data, but I don’t quite see how we’re getting it. Certainly, “not unhappy” is more costly to produce than “happy”, so we get separation between those two terms, just as with the full uncertain negation hypothesis. But why, in the single utterance case, do we also get separation between “unhappy” and “not happy”?

For the most restricted hypothesis of logical negation, I get why we never get any separation — by definition, “unhappy” = “not happy” = not(happy), and so “not unhappy” = not(not(happy)) = “happy”.
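This collapse is easy to see if we treat both “un-” and “not” as plain truth-functional negation over toy predicate meanings (a sketch, with an invented threshold semantics):

```python
# Under the logical-negation hypothesis, "un-" and "not" are both
# ordinary truth-functional negation, so "not unhappy" collapses to
# "happy". Meanings here are predicates over toy individuals; the
# mood-threshold semantics is an invented stand-in.

def neg(p):
    return lambda x: not p(x)

happy = lambda x: x["mood"] > 0    # toy threshold semantics

unhappy = neg(happy)               # "unhappy" = not(happy)
not_happy = neg(happy)             # "not happy" = not(happy)
not_unhappy = neg(unhappy)         # "not unhappy" = not(not(happy))

alice = {"mood": 1}
print(not_unhappy(alice) == happy(alice))  # True: double negation cancels
```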

Tuesday, February 5, 2019

Some thoughts on Fitz & Chang 2017 + Bonus thoughts on McCoy et al. 2018

(Just a quick note that I had a lot of thoughts about these papers, so this is a lengthy post.)

***F&C2017 general thoughts:

This paper tackles one of the cases commonly held up to argue for innate, language-specific knowledge: structure-dependent rules for syntax (and more specifically, complex yes/no questions that require such rules). The key: learn how to produce these question forms without ever seeing (m)any informative examples of them. There have been a variety of solutions to this problem, including recent Bayesian modeling work (Perfors et al. 2011) demonstrating how this knowledge can be inferred as long as the child has the ability to consider structure-dependent rules in her hypothesis space. Here, the approach is to broaden the relevant information from just the form of language (which is traditionally what syntactic learning focused on) and also include the meaning. This reminds me of CCG, which naturally links the form of something to its meaning during learning, and gets great bootstrapping power from that (see Abend et al. 2017 for an example and my forthcoming book chapter for a handy summary).

Perfors, A., Tenenbaum, J. B., & Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118(3), 306-338.

Abend, O., Kwiatkowski, T., Smith, N. J., Goldwater, S., & Steedman, M. (2017). Bootstrapping language acquisition. Cognition, 164, 116-143.

Pearl, L. (forthcoming). Modeling syntactic acquisition. In J. Sprouse (ed.), Oxford Handbook of Experimental Syntax.

Interestingly, with respect to what’s built into the child, it’s not clear to me that F&C2017 aren’t still advocating for innate, language-specific knowledge (which is what Universal Grammar is typically thought of as). This knowledge just doesn’t happen to be *syntactic*. Instead, the required knowledge is about how concepts are structured. This reminds me of my comments in Pearl (2014) about exactly this point. It seems that non-generativist folks aren’t opposed to the idea of innate, language-specific knowledge -- they just prefer it not be syntactic (and preferably not labeled as Universal Grammar). Here, it seems that innate, language-specific knowledge about structured concepts is one way to accomplish the learning goal. More on this below in the specific thoughts section.

Pearl, L. (2014). Evaluating learning-strategy components: Being fair (Commentary on Ambridge, Pine, and Lieven). Language, 90(3), e107-e114.

***Bonus general thoughts on McCoy et al. 2018:
In contrast to F&C2017, M&al2018 are using only syntactic info to learn from. However, it seems like they’re similar to prior work in using smaller building blocks (i.e., indirect positive evidence) to generate hierarchical structure (i.e., structure-dependent representations) as the favored hypothesis. This is also similar to Perfors et al. (2011) - the main difference is that M&al2018 are using a non-symbolic model, while Perfors et al. (2011) are using a symbolic one. This then leads into the interpretation issue for M&al2018 -- when you find an RNN that works, why does it work? You have to do much more legwork to figure it out, compared to a symbolic model. However, F&C2017 had to do this too for their connectionist model, and I think they demonstrated how you can infer what may be going on quite well (in particular, which factors matter and how).

M&al2018 end up using machine learning classifiers to figure it out, and this seems like a great technique for trying to understand what’s going on in these distributed representations. It’s also something I’m seeing in the neuroscience realm when they try to interpret the distributed contents of, for instance, an fMRI scan.
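For concreteness, here’s a minimal sketch of that diagnostic-classifier idea (the vectors and the decoded property are entirely invented for illustration; M&al2018’s actual setup differs): train a simple linear classifier on a network’s hidden-state vectors and check whether some property is linearly decodable from them.

```python
# Minimal sketch of a "diagnostic classifier": train a linear classifier
# to predict a property (say, which auxiliary is the main one) from a
# network's hidden-state vectors. All data below is made up.

def train_perceptron(examples, epochs=20, lr=0.1):
    """examples: list of (vector, label) with label in {0, 1}."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred  # perceptron update only on mistakes
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy "hidden states": if a linear classifier separates them well, the
# property is (roughly) linearly decodable from the representation.
hidden = [([0.9, 0.1, 0.2], 1), ([0.8, 0.0, 0.3], 1),
          ([0.1, 0.9, 0.7], 0), ([0.2, 0.8, 0.6], 0)]
w, b = train_perceptron(hidden)
accuracy = sum(predict(w, b, x) == y for x, y in hidden) / len(hidden)
```

The interpretive logic is the inverse of normal classification: high probe accuracy is evidence about the *representation*, not about the probe.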

**Specific thoughts on F&C2017:
(1) The key idea seems to be that nonlinguistic propositions are structured and this provides the crucial scaffolding that allows children to infer structure-dependent rules for the syntactic forms. Doesn’t this still rely on children having the ability to allow structure-dependence into their hypothesis space? Then, this propositional structure can push them towards the structure-dependent rules. But then, that’s no different than the Perfors et al. (2011) approach, where the syntactic forms from the language more broadly pointed towards structured representations that would naturally form the building blocks of structure-dependent rules.

The point that F&C2017 seem to want to make: The necessary information isn’t in the linguistic input at all, but rather in the non-linguistic input. So, this differs from linguistic nativists, who believe it’s not in the input (i.e., the necessary info is internal to the child) and from emergentists/constructionists, who believe it’s in the input (though I think they also allow it to not be the linguistic input specifically). But then, we come back to what prior knowledge/abilities the child needs to harness the information available if it’s in the input (of whatever kind) somewhere. How does the child know to view the input in the crucial way in order to be able to extract the relevant information? Isn’t that based on prior knowledge, which at some point has to be innate? (And where all the disagreement happens is how specific that innate knowledge is.)

Also related: In the discussion, F&C2017 say “Input is the oil that lubes the acquisition machinery, but it is not the machinery itself.” Exactly! And what everyone argues about is what the machinery consists of that uses that input. Here, F&C2017 say “the structure of meaning can constrain the way the language system interacts with experience and restrict the space of learnable grammar.” Great! So, now we just have to figure out where knowledge of that meaning structure originates.

(2) This description of the generativist take on structure dependence seemed odd to me: “consider only rules where auxiliaries do not move out of their S domains”. Well, sure, in this case we’re talking about (S)entences as the relevant structure. But the bias is more general than that (which is why it’s applicable to all kinds of structures and transformations, not just yes/no questions): only consider rules that use structures (like S) as building blocks/primitives. The reliance on linguistic structures, rather than other building blocks, is what makes this bias language-specific. (Though I could imagine an argument where the bias itself is actually a domain-general thing like “use the salient chunks in your system as building blocks for rules”, and that gets implemented in this domain with “salient chunks” = “linguistic structures like S”.)

(3) I quite liked Figure 1, with its visual representation of what a child’s hypothesis space looks like under each approach. I think it’s fair to say the linguistic nativist approach has traditionally ruled out structure-independent grammars from the hypothesis space, while the constructivist approach hasn’t. Of course, there are far more nuanced ways to implement the linguistic nativist idea (e.g., a low, but non-zero, prior on structure-independent grammars), but this certainly serves as the extreme endpoint.

(4) In 1.2, F&C2017 comment on the Perfors et al. 2011 Bayesian model, saying that it doesn’t “explain how grammars are acquired in the first place”. I think this must be referring to the fact that the hypothesis space of the Bayesian learner included possible grammars and the modeled learner was choosing among them. But how else is learning supposed to work? There’s a hypothesis space that’s defined implicitly, and the learner draws/constructs some explicit hypothesis from that implicit hypothesis space to evaluate (Perfors 2012 talks about this very helpfully). Maybe F&C2017 want a learner that constructs the building blocks of the implicit hypothesis space too? (In which case, sure, I’d love to have a model of conceptual change like that. But no one has that yet, as far as I’m aware.)

Perfors, A. (2012). Bayesian models of cognition: what's built in after all?. Philosophy Compass, 7(2), 127-138.

F&C2017 also note in that same part that it’s problematic that children don’t seem to be as optimal as the computational-level Bayesian model. Again, sure, in the same way that any computational-level model needs to be translated to an algorithmic-level version that approximates the inference with child limitations. But this doesn’t seem such a big problem -- or rather, if it is, it’s *everyone’s* problem who works at the computational level of modeling.

(5) I really like the point F&C2017 make about the need to integrate meaning with these kinds of learning problems. As they rightly note, what things mean is a very salient source of information. Traditionally, syntactic learning approaches in the generativist world have assumed the child only considers syntactic information when learning about syntactic knowledge. But precisely because syntax is a conduit through which meaning is expressed and meaning transfer is the heart of communication, it seems exactly right that the child could care about information coming from meaning even when learning something syntactic. This again is where the Abend et al. (2017) model gets some of its bootstrapping power. (Also, Pearl & Mis 2016 for anaphoric one -- another traditional example of poverty of the stimulus -- integrates meaning information when learning something ostensibly syntactic.)

Pearl, L. & Mis, B. (2016). The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Language, 92(1), 1-30.

(6) The Dual-path connectionist model, which uses thematic role & tense info: Importantly, the need for this information is motivated by production in F&C’s model -- you’re trying to express some particular meaning with the form you choose, and that’s part of what’s motivating the form. In theory, this should also be relevant for comprehension, of course. But what’s nice about this approach is that it gets at one of the key criticisms generativists (e.g., Berwick et al. 2011) had of prior modeling approaches -- namely, the disconnect between the form and the meaning.

Berwick, R. C., Pietroski, P., Yankama, B., & Chomsky, N. (2011). Poverty of the stimulus revisited. Cognitive Science, 35(7), 1207-1242.

(7) The dual path architecture: It’s interesting to see the use of a compression layer here, which forces the model to abstract away from details -- i.e., to form internal categories like we believe humans do. (Here, this means abstracting away from individual words and forming syntactic categories of some kind). I think this forced abstraction is one of the key motivations for current autoencoder approaches in machine learning.
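As a concrete (if highly simplified) illustration of that bottleneck intuition: a purely linear autoencoder with a one-unit compression layer is known to recover the first principal component of its inputs, so similar inputs get similar codes -- de facto categories. The word vectors below are invented for illustration:

```python
# Sketch of why a bottleneck forces abstraction: the optimal one-unit
# linear autoencoder projects inputs onto the first principal component,
# computed here by power iteration. Toy word vectors, invented.

def top_component(rows, iters=100):
    """Leading eigenvector (and mean) of the data's covariance."""
    dim = len(rows[0])
    mean = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    centered = [[r[i] - mean[i] for i in range(dim)] for r in rows]
    v = [1.0] + [0.0] * (dim - 1)
    for _ in range(iters):
        # multiply the covariance (sum of outer products) by v
        new = [0.0] * dim
        for row in centered:
            dot = sum(ri * vi for ri, vi in zip(row, v))
            for i in range(dim):
                new[i] += dot * row[i]
        norm = sum(x * x for x in new) ** 0.5
        v = [x / norm for x in new]
    return mean, v

words = {"cat": [1.0, 0.9, 0.1], "dog": [0.9, 1.0, 0.0],
         "run": [0.1, 0.0, 1.0], "walk": [0.0, 0.1, 0.9]}
mean, v = top_component(list(words.values()))
# One-dimensional "compressed" code for each word:
code = {w: sum((xi - mi) * vi for xi, mi, vi in zip(x, mean, v))
        for w, x in words.items()}
# Noun-like vectors land near each other, verb-like vectors near each
# other: the bottleneck has induced two categories.
```

Nonlinear autoencoders (and the Dual-path model’s compression layer) are richer than this, but the pressure is the same: fewer units than input types means distinct inputs must share representations.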

(8) Encoding complex utterances: If I’m understanding this correctly, here’s where we see the structure explicitly -- we have one complete proposition connected to the agent concept of another proposition. So, the structured representation is available to the learner a priori via the conceptual structure. So, we might reasonably call this domain-specific knowledge, just not domain-specific syntactic knowledge. Then, experience with the language input tells the child how to translate that structured concept into a sequence of words, in this case, via the use of relative clauses. In particular, the child needs to see relative clauses used for embedded conceptual structures like this.

(9) Input distribution: I really appreciate F&C2017’s attention to realistic input distributions for training their model. This makes their model connect more to the actual problem children face, and so it makes their modeling results more informative.

(10) I think it’s really informative to see these results where the model can recreate specific observed differences in the developmental trajectory, and explain them by means of how the input is viewed. That is, the power of the learning approach is basically in viewing the input the right way, with the right scaffolding knowledge (here, about links between structured concepts and syntactic forms). Once that input lens is on, the input much more transparently reflects the observed behavior patterns in children. And this is what good computational modeling can do: make a learning theory specific enough to evaluate (here, about how to use that input), and then evaluate it by giving it realistic input and seeing if it can generate realistic output.

(11) It seems like F&C2017’s characterization of the hypothesis space aligns with other prior approaches like Perfors et al. 2011: the prior knowledge is a soft constraint on possible grammars, rather than absolutely ruling out structure-independent grammars. (In fact, Perfors et al. 2011 went further and used a simplicity prior, which is biased against the more complex structure-dependent grammars.) But the basic point is that there’s no need to categorically restrict the hypothesis space a priori. Instead, children can use their input and prior knowledge to restrict their hypotheses appropriately over time to structure-dependent rules.
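A toy Bayesian update (my numbers, not Perfors et al.’s) shows how such a soft constraint works: even a low prior on structure-dependent grammars gets overturned once each datum fits them slightly better.

```python
# Toy Bayesian update: a soft (non-zero) prior against a grammar class
# can still be overturned by data. All numbers are invented.

def posterior(prior_dep, like_dep, like_indep, n):
    """Posterior probability of the structure-dependent grammar after
    n observations, each with the given per-datum likelihoods."""
    p_dep = prior_dep * like_dep ** n
    p_indep = (1 - prior_dep) * like_indep ** n
    return p_dep / (p_dep + p_indep)

# A simplicity-style prior disfavors the structure-dependent grammar,
# but that grammar fits each datum slightly better.
prior_dep = 0.2               # low, but non-zero, prior
like_dep, like_indep = 0.6, 0.5

p10 = posterior(prior_dep, like_dep, like_indep, 10)
p100 = posterior(prior_dep, like_dep, like_indep, 100)
# The posterior on structure dependence grows with more data, despite
# the prior against it -- no categorical restriction needed.
```

The point is just the qualitative shape: the prior is a starting bias, and the likelihood advantage compounds with every datum.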

**Bonus thoughts on M&al2018:
(B1) So, as cognitive scientists, should we spend more research time on the architecture that worked (i.e., the GRU with attention)? It does a very non-human thing, while also doing human things. And we don’t know why it’s doing either of those things, compared with other similar-seeming architectures that don’t. I should note that this is my existential issue with non-symbolic models, not a criticism specifically of M&al2018. I think they did a great job for a first pass at this question. Also, I really appreciate how careful they were about giving caveats when it comes to interpreting their results.