Friday, November 1, 2019

Some thoughts on Gauthier et al. 2019

General thoughts:
I really enjoy seeing this kind of computational cognitive model, where the model is not only generating general patterns of behavior (like the ability to get the right interpretation for a novel utterance), but specifically matching a set of child behavioral results. I think it’s easier to believe in the model’s informativity when you see it able to account for a specific set of results. And those results then provide a fair benchmark for future models. (So, yay, good developmental modeling practice!)

Other thoughts:
(1) It’s always great to show what can be accomplished “from scratch” (as G&al2019 note), though this is probably harder than the child’s actual task. Presumably, by the time children are using syntactic bootstrapping to learn harder lexical items, they already have a lexicon seeded with some concrete noun items. But this is fine for a proof of concept -- basically, if we can get success on the harder task of starting from scratch, then we should also get success when we start with a head start in the lexicon. (Caveat: unless a concrete noun bias in the early lexicon somehow skews the learning in the wrong direction.)

(2) It’s a pity that the Abend et al. 2017 study wasn’t discussed more thoroughly -- that’s another one using a CCG representation for the semantics, a loose idea of what the available meaning elements are from the scene, and doing this kind of rational search over possible syntactic rules, given naturalistic input. That model achieves syntactic bootstrapping, along with a variety of other features like one-shot learning, accelerated learning of individual vocabulary items corresponding to specific syntactic categories, and easier learning of nouns (thereby creating a noun bias in early lexicons). It seems like a compare & contrast with that Bayesian model would have been really helpful, especially noting what about those learning scenarios was simplified, compared with the one used here. 

For instance, “naturalistic” for G&al2019 means utterances which make reference to abstract events and relations. This isn’t what’s normally meant by naturalistic, because these utterances are still idealized (i.e., artificial). That said, these idealized data have more complex pieces in them that make them similar to naturalistic language data. I have no issue with this, per se -- it’s often a very reasonable first step, especially for cognitive models that take a while to run.

(3) Figure 4: It looks like there’s a dependency where meaning depends on syntactic form, but not the other way around -- I guess that’s the linking rule? But I wonder why that direction and not the other. Shouldn’t form depend on meaning, too, especially if we’re thinking about this as a generative model where the output is the utterance? That is, we start with a meaning and generate the language form for it, which suggests the arrow should go from meaning to syntactic form. Certainly, it seems like you need something connecting syntactic type to meaning if you’re going to get syntactic bootstrapping, and I can see from their description of the inference process why it’s helpful to have the meaning depend on the structure: they infer the meaning from the structure for a novel verb via P(m_w | s_w), which only works if the arrow goes from s_w to m_w.

(4) It took me a little bit to understand what was going on in equations 2 and 3, so let me summarize what I think I got here: if we want to get the probability of a particular meaning (which is comprised of several independent predicates), we have to multiply the probability of each of those predicates together (that’s equation 3). To get the probability of each predicate, we sum over all instances of that predicate that are associated with that syntactic type (that’s equation 2).
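To check my understanding, here’s a toy version of that computation (my own invented data structures and predicate names, not the paper’s actual implementation):

```python
def predicate_prob(predicate, syn_type, lexicon):
    """Equation 2, as I read it: sum the weights of lexical entries
    with this syntactic type whose meanings contain the predicate,
    normalized over all entries of that type."""
    total = sum(w for (s, preds, w) in lexicon if s == syn_type)
    hits = sum(w for (s, preds, w) in lexicon
               if s == syn_type and predicate in preds)
    return hits / total if total > 0 else 0.0

def meaning_prob(predicates, syn_type, lexicon):
    """Equation 3, as I read it: predicates are independent,
    so multiply their individual probabilities together."""
    p = 1.0
    for pred in predicates:
        p *= predicate_prob(pred, syn_type, lexicon)
    return p

# Toy lexicon: (syntactic type, set of predicates in the meaning, weight)
lexicon = [("V/N", {"cause", "move"}, 2.0),
           ("V/N", {"move"}, 1.0),
           ("N", {"object"}, 3.0)]

print(meaning_prob({"cause", "move"}, "V/N", lexicon))
```

So the probability of the {cause, move} meaning for a V/N item here is (2/3) * (3/3) = 2/3 -- the “cause” predicate shows up in only two-thirds of the V/N weight, while “move” shows up in all of it.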

(5) The learner is constrained to encode only a limited number of entries per word at all times (i.e., only the l highest-weight lexical entries per wordform are retained): I love the ability to constrain the number of entries per word form. This seems exactly right from what I know of the kid word-learning literature, and I wonder how often a limit of two is best… from Figure 7, it looks like 2 is pretty darned good (pretty much overlapping 7, and better than 3 or 5, if I’m reading those colors correctly).
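Just to make that constraint concrete, here’s a toy sketch of the pruning step (my own invented data structure and wordform, not the paper’s):

```python
def prune_lexicon(lexicon, l=2):
    """lexicon: dict mapping a wordform to a list of (meaning, weight)
    pairs. Keep only the l highest-weight entries per wordform."""
    return {word: sorted(entries, key=lambda e: e[1], reverse=True)[:l]
            for word, entries in lexicon.items()}

# Toy wordform with four competing meaning hypotheses
lex = {"gorp": [("JUMP", 0.5), ("RUN", 0.3), ("SLEEP", 0.1), ("EAT", 0.05)]}
pruned = prune_lexicon(lex, l=2)
print(pruned["gorp"])  # only the two highest-weight hypotheses survive
```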

Friday, October 18, 2019

Some thoughts on Lavi-Rotbain & Arnon 2019

I’m very sympathetic to the difficulties of creating experimental stimuli (like artificial languages) that don’t idealize away from important aspects of actual language data. So, LR&A2019’s main point about the importance of ecologically valid stimuli is certainly one I can get behind. That said, the trick is figuring out what we want to find out from the experiment -- if we’re interested in children’s ability to use, say, statistical cues alone for segmentation (in the absence of any other information) just to show children have this ability, then we specifically don’t want ecologically valid stimuli.

LR&A2019’s main point about the utility of higher entropy for language acquisition tasks like segmentation and object-label mapping is also one I’m sympathetic to. I’m just less clear on how this relates to what (I thought) we already knew about children’s language acquisition abilities. For instance, if children are sensitive to entropy, doesn’t this just mean that children can tell the difference between probability distributions of different types, like uniform vs. somewhat skewed vs. highly skewed? (So, I thought we already knew that.) For example, I’m thinking of some of the work on how children (vs adults) respond to input that’s inconsistent (work by Hudson Kam and by Newport), and the thing that varies is what the exact probability distribution is. It’s possible I’m missing something more subtle about entropy and information rates, which is touched on in the discussion near the end.

Some other thoughts:
(1) What we can conclude about early native language acquisition from studies with 10-year-olds: I’m always hesitant to conclude anything about early stages of acquisition (here, tasks that start happening before the child is a year old) from studies conducted on older participants. Often it’s a good way to start, in order to get a developmental trajectory of whatever it is we’re studying or provide a proof of experimental concept. But, for example, it’s tricky to conclude something about infant abilities from the performance of 10-year-olds. LR&A2019 do note that they intend to test younger children (7-year-olds, I believe, given their previous work). But even then, I don’t quite know how to extrapolate from 7-year-olds to infants.

(2) Something that comes to mind when considering the specific stimuli setup LR&A2019 went with: the work on how children of different ages vs. adults respond to input with a highly skewed vs. not highly skewed distribution seems really important to think about for comparison purposes. I’m thinking of work by Hudson Kam and Newport, where they see differences in the generalizations made when the input is something like 90-5-5 vs. 60-30-10 vs. other splits. So, the fact that LR&A2019 have a super-frequent option and the rest evenly infrequent (80-7-7-7) might yield different results than different sorts of skew would.
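Just to put numbers on what these splits mean entropy-wise (a quick back-of-the-envelope calculation of Shannon entropy in bits, nothing from either paper):

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

skewed  = [0.80, 0.07, 0.07, 0.07]  # the LR&A2019-style split
uniform = [0.25, 0.25, 0.25, 0.25]  # uniform over the same four items
hk      = [0.90, 0.05, 0.05]        # a Hudson Kam & Newport-style split

print(round(entropy(skewed), 2),   # ~1.06 bits
      round(entropy(uniform), 2),  # 2.0 bits
      round(entropy(hk), 2))       # ~0.57 bits
```

So an 80-7-7-7 split carries about half the entropy of the uniform split over the same four items.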

Related materials question: Not that I have particular expectations about this mattering, but why not make it so the exposure in minutes was the same for the two different entropy conditions? One could reasonably argue that better performance happened for the one kids heard for longer (even if they heard certain word forms less frequently -- they still had more time on the task). And it doesn’t seem that difficult to create an 80-7-7-7 split for the low entropy condition that lasts the same amount of time as the high entropy condition.

(3) The general scaffolding story that LR&A2019 put forth in the discussion about why higher entropy is helpful makes good sense to me. There’s a bunch of infant segmentation work showing that anchor words (e.g., familiar words) facilitate segmentation of other words. So, if kids here in the high entropy condition can segment the frequent word, that allows them to have a familiar word they can use to segment the other words. Once segmentation is off to a good start, then they have a solid set of labels that they can use for object-label mapping. So, this study would be additional supportive evidence for scaffolding in these two particular tasks.

Tuesday, June 4, 2019

Some thoughts on Potts 2019 + Berent & Marcus 2019

I really appreciate Potts sketching out how vectors of numbers as the core meaning representation could impact semantics more broadly. This is the kind of broader speculation that’s helpful for people trying to see the effects of this key assumption on things they know and love. Moreover, Potts is aware of the current shortcomings of the “DL semantics” approach, but focuses on where it could be a useful tool for semantic theory. (This is my own inclination too, so I’m very sympathetic to this point of view.) Interestingly, I think Berent & Marcus also end up sympathetic to a hybrid approach, despite their concerns about the relationship between symbolic and non-symbolic approaches to language. A key difference seems to be where each commentary focuses -- Potts zooms in on semantics, while Berent & Marcus mostly seem to think about phonology and syntax. And non-symbolic approaches seem to have left a poor impression on Berent & Marcus in the past.

Other thoughts:
(1) Potts: The idea that machine learning is equivalent to neural networks still trips me up momentarily. In my head, machine learning is the learning part (so it could be symbolic, like SVMs). Another important component is then feature selection, which would correspond to the embedding into that vector of numbers in Potts’s terminology. I guess this just goes to show how terminology changes over time.

(2) Potts: I totally get the analogy of how to do function application with an n-dimensional array. But how do we know that this concatenation and multiplication by a new matrix (W) yields the correct compositional meaning of the two elements? Maybe the idea is that we have to find the right function application for our n-dimensional vectors? Potts basically says as much: we have to learn the values for W from the data, using supervised learning to get a W that yields the right compositional meaning. Okay. But what guarantee do we have that there is in fact a W for all the compositional meanings we might want? Of course, maybe that’s a problem for current semantic theory’s function application as well.
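For concreteness, here’s the composition scheme as I understand it, with toy numpy values (the dimensions are made up, and the real W would be learned from data rather than sampled randomly):

```python
import numpy as np

d = 3                                # toy embedding dimension
u = np.array([0.2, 0.5, 0.1])        # e.g., the vector for "gray"
v = np.array([0.4, 0.3, 0.9])        # e.g., the vector for "cat"

rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d))  # stand-in for the learned matrix W

# Concatenate the two word vectors, then multiply by W
composed = W @ np.concatenate([u, v])
print(composed.shape)  # same dimensionality as the inputs
```

The worry above is then whether supervised learning can always find a W such that `composed` is the right meaning for “gray cat” (and every other combination we care about).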

(3) Potts, on how the dataset used to optimize the system will be a collection of utterances, rather than I-language abstractions: So, because of this, it’d be including aspects of both representation and use (like frequency info) together, rather than just the representation part. This isn’t a bad thing necessarily, as long as we don’t explicitly care about the representation part separately. It seems like linguists often do care about this while the NLP community doesn’t. I think Potts’s example with the A but B construction highlights this difference nicely. Potts notes that this would make “use phenomena” more natural to study than they currently are under an intensional semantics approach, and I can see this. I just worry about how we derive explanations from a DL approach (i.e., what do we do with the weight matrix W, once we learn it via supervised machine learning?).

(4) Potts, on how the goal in machine learning is generalization, however that’s accomplished (with compositionality just one way to do this): Maybe compositionality is what humans ended up with due to bottleneck issues during processing and learning over time? This is the kind of stuff Kirby (e.g., Kirby 2017) has modeled with his language evolution simulations.

Kirby, S. (2017). Culture and biology in the origins of linguistic structure. Psychonomic Bulletin & Review, 24(1), 118-137.

(5) Potts, on how having any representation for lexical meaning is better than not: I totally agree with this. A hard-to-interpret vector of numbers encoding helpful aspects of the representation and use of “kitty” is still better than [[kitty]]. It just doesn’t help us explain things in the symbolic terms we use when we verbalize explanations.

(6) Berent & Marcus, on how the algebraic hypothesis assumes an innate capacity to operate on abstract categories: Sure! Hello, Bayesian inference, for example. Yet another reason why I’m always confused when generative folks don’t like Bayesian inference.

(7) Berent & Marcus, “mental operations are structure-sensitive -- they operate only on the form of representations and ignore their meaning”: It seems like this is a syntax-specific view -- surely semantic operations would operate over meaning? Or is this the difference between lexical semantics and higher-order semantics?

(8) Berent & Marcus, on how we could tell if neural networks (NNs) generated algebraic approaches: I’m not sure I quite follow the train of logic presented. If an NN does manage to capture human behavior correctly, why would we assume that it had spontaneously created algebraic representations? Wouldn’t associationists naturally assume that it didn’t have to (unless explicitly proven otherwise)?

(9) Berent & Marcus, on previous connectionist studies: I definitely understand Berent & Marcus’s frustration with previous connectionist networks and their performance, but it seems like there have been vast improvements since 2001. I’d be surprised if you couldn’t make an LSTM of some kind that captures some of the generalizations Marcus investigated before, provided enough data was supplied. Granted, part of the cool thing about small humans is that they don’t get all of Wikipedia to learn from, and yet can still make broad generalizations.

(10) Berent & Marcus: Kudos to Berent & Marcus for being clear that they don’t actually know for sure the scope of human generalizations in online language processing -- they’ve been assuming humans behave a particular way that current NNs can’t seem to capture, but this is yet to be empirically validated. If humans don’t actually behave that way, then maybe the algebraic commitment needs some adjustment.

(11) Berent & Marcus: It’s a fascinating observation that a resistance to the idea of innate ideas itself might be an innate bias (the Berent et al. 2019 reference). This is the first I’ve heard of this. I always thought the resistance was an Occam’s Razor sort of thing, where building in innate stuff is more complex than not building in innate stuff.

Tuesday, May 21, 2019

Some thoughts on Linzen 2019 + Rawski & Heinz 2019

I’m totally with Linzen on linguistic theory providing better evaluation items for RNNs. (Hurrah for linguistic theory contributions!) In contrast, I’m just not sold yet on the utility of RNNs for modeling human language development or processing. The interpretability issue just kills it for me (as it does for Rawski & Heinz) -- how can we know if the RNN is or isn’t representing something? And if we have a concrete idea about what it should be representing vs. not, why not use a symbolic model? (More on this below in the “Other thoughts” section.)

I find it heartening to hear that other folks like Rawski & Heinz are also talking about the ML revolution with deep learning techniques as “alchemy”, longing for the “rigor police” to return. I sympathize with the rigor police.

Rawski & Heinz offer their take on the rigor police, highlighting the contributions that computational learnability (CL) investigations can make with respect to the problems that RNNs are currently being pitched at. In particular, Rawski & Heinz note how CL approaches can answer the question of “Is it possible to learn this thing at all, given this characterization of the learning problem?” The major selling point is that CL results are easily interpretable (“analytically transparent”). This is a key difference that matters a lot for understanding what’s going on. That said, I tend to have concerns with particular CL implementations (basically, if they don’t characterize the learning problem in a way that maps well to children’s language acquisition, I don’t know why I should care as a developmental linguist). But this is a different, solvable problem (i.e., investigate characterizations that do map well) -- in contrast, the interpretability of RNNs isn’t as immediately solvable.

Other thoughts:

(1) Linzen, on RNNs for testing what constraints are needed for learning different things: So far, I haven’t been convinced that it’s helpful to use neural networks to test what innate knowledge is required. All we know when we stumble upon a neural network that can learn something is that it hasn’t explicitly encoded knowledge beforehand in a way that’s easy to interpret; who knows what the implicit knowledge is that’s encoded in the architecture and initialization values? (As Rawski & Heinz note, ignorance of bias doesn’t mean absence of bias.)

(2) Linzen, “language model” = “estimating how likely a particular word is to occur given the words that have preceded it”. I was surprised by this definition. What about other language tasks? I honestly thought “language model” referred to the representation of language knowledge, rather than the evaluation task. So, the language model is the thing that allows you to predict the next word, given the previous words, not the prediction itself. Richard Futrell says this definition of “language model” is right for current ML use, though. (Thanks, Richard!)
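To make that definition concrete, here’s the most bare-bones “language model” in this ML sense -- bigram counts with maximum-likelihood estimates over a toy corpus I made up:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # (previous word, next word) counts
context = Counter(corpus[:-1])              # how often each word is a context

def p_next(word, prev):
    """P(word | prev), estimated by maximum likelihood."""
    return bigrams[(prev, word)] / context[prev] if context[prev] else 0.0

print(p_next("cat", "the"))  # "the" is followed by "cat" 2 out of 3 times
```

The Counter tables are the representation; `p_next` is the prediction that representation licenses -- which is exactly the distinction I was trying to draw above.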

(3) Linzen, on using psycholinguistic materials designed to identify linguistic knowledge in humans in order to identify implicit linguistic knowledge in RNNs: This approach makes a lot of sense to me. The human mind is a black box, just like the RNN, and we have decades of materials designed to identify the nature of the knowledge inside that black box. So, I think the key is to start with the most basic tests, since the more complex tests build in assumptions about human knowledge due to the results from the basic ones.

(4) Linzen, noting the importance of having baseline models that are known not to be able to represent the linguistic properties of interest: But how do we know they can’t? Aren’t RNNs universal function approximators, so they can (theoretically) capture any behavior, given enough data? Maybe the point is to use one where we know it’s failed on the linguistic knowledge in question somehow…

(5) Linzen, on the Gulordava et al. RNNs that did better at capturing long-distance agreement when semantic information was helpful: “This suggests that the models did learn some of the syntactic principles underlying subject-verb agreement.” Does it? Maybe if we think “syntactic principles” = something based on the sequence of words, rather than word meaning (i.e., a very broad definition of “the syntactic principles”). But I have no idea how we could tell that the RNN used anything like the syntactic principles we think humans use.

(6) Linzen, on using RNNs for learnability tests: “First, is it indeed the case that the linguistic phenomenon in question cannot be learned from child-directed speech without the proposed constraint?” -- I’m sympathetic to this, but how do we know the RNN isn’t implicitly encoding that constraint in its distributed vectors?

“Second, and equally important, does the proposed constraint in fact aid acquisition?” -- Again, I’m very sympathetic, but why not use a symbolic model for this? Then you can easily tell the model has vs. doesn’t have the proposed constraint. (To be fair, Linzen notes this explicitly: “...the inductive biases of most neural network architectures are not well characterized.”)

(7) Linzen, on building in structural knowledge by giving that structural knowledge as part of the RNN’s input (e.g., “the man” together, then “eats pizza” together = structural knowledge that those two chunks are meaningful chunks): If this is an example of building in a proposed constraint, how do we know the RNN is using those chunks the way we think? Why couldn’t it be doing something wild and wacky with those chunks, instead of treating them as “structured units”? I guess by having chunks at all, it counts as doing something structural? But then how do we make the equivalent of an overhypothesis, where the model likes structured units, but we let the model pick out which structured units it wants?

(8) Linzen, “...neural networks replicate a behavioral result from psycholinguistics without the theoretical machinery...suggest that the human behavior...might arise from statistical patterns in the input.”  Plus whatever implicit biases the RNN has, right? It’s not just statistical patterns working over a blank slate. For example, in the agreement attraction case Linzen discusses, how do we know the RNN didn’t encode some kind of markedness thing for plurals in its distributed representation?

Related to that same study, if the RNNs then show they’re not behaving like humans in other respects, how can we be sure that the behavior which looks human-like actually has the same underlying cause/representation as it does in humans? And if it doesn’t, what have we learned from the RNNs about how humans represent it?

(9) Rawski & Heinz, taking a grammar as target of acquisition, because it’s something of finite size with a symbolic, generative structure: Learning is then a problem of “grammatical inference”. This clearly differs from Linzen’s characterization, where the target of acquisition is something (a function) that can generate accurate predictions, and who cares what it looks like? Note that grammars can make predictions too — and we know what they look like and how they work to make those predictions. (Rigor police, check!)

(10) Rawski & Heinz, on typological arguments for learnability: I have a slight concern with their typological argument. In particular, just because we don’t see certain patterns across existing human languages doesn’t mean they’re impossible. It seems like we should couple typological observations with experimental studies of what generalizations are possible for humans to make when the data are available to support those generalizations.

A related thought regarding typological predictions, though: this seems like a useful evaluation metric for RNNs. In particular, any RNN that’s successful on one language can be applied to other languages’ input to see if it makes the right cross-linguistic generalizations.

(11) Rawski & Heinz, on Weiss et al. 2018, which extracted a (symbolic) deterministic FSA representation from an RNN: This seems like exactly what we want for interpretability, though it’s more about identifying a symbolic representation that makes the same predictions as the RNN, rather than reading off the symbolic representation from the RNN. But I guess it doesn’t really matter, as long as you’re sure the symbolic representation really is doing exactly what the RNN is?
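A toy version of that last worry (the automaton and the stand-in “network” here are both invented; Weiss et al.’s actual method queries the RNN much more systematically):

```python
# A tiny DFA over {a, b} that accepts strings with an even number of a's,
# encoded as a dict-based transition table
DFA = {("even", "a"): "odd", ("even", "b"): "even",
       ("odd", "a"): "even", ("odd", "b"): "odd"}
ACCEPT = {"even"}

def dfa_accepts(s):
    state = "even"
    for ch in s:
        state = DFA[(state, ch)]
    return state in ACCEPT

# Stand-in for the trained network's judgments (here, the ground truth)
def net_accepts(s):
    return s.count("a") % 2 == 0

# Checking agreement on a finite sample only shows agreement on that sample
samples = ["", "a", "ab", "aab", "baba"]
print(all(dfa_accepts(s) == net_accepts(s) for s in samples))
```

Agreement on sampled strings is evidence, not proof, that the symbolic representation is doing exactly what the network is -- which is the worry above.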

Tuesday, May 7, 2019

Some thoughts on Pearl 2019 + Dunbar 2019

I think these two commentaries (mine and Dunbar’s) pair together pretty nicely -- my key thought can be summed up as “if we can interpret neural networks, maybe they can build things we didn’t think to build with the same pieces and that would be cool”; Dunbar’s key thought is something like “we really need to think carefully about how interpretable those networks are…” So, we both seem to agree that it’s great to advance linguistic theory with neural networks, but only if you can in fact interpret them.

More specific thoughts on Dunbar 2019:
(1) Dunbar highlights what he calls the “implementational mapping problem”, which is basically the interpretability problem. How do we draw “a correspondence between an abstract linguistic representational system and an opaque parameter vector”? (Of course, neurolinguists the world over are nodding their heads vigorously in agreement because exactly the same interpretability problem arises with human neural data.)

To draw this correspondence, Dunbar suggests that we need to know what representations are meant to be there. What’s the set of things we should be looking for in those hard-to-interpret network innards? How do we know if a new something is a reasonable something (where reasonable may be “useful for understanding human representations”)?

(2) For learnability:  Dunbar notes that to the extent we believe networks have approximated a theory well enough, we can test learnability claims (such as whether the network can learn from the evidence children learn from or instead requires additional information). I get this, but I still don’t see why it’s better to use this over a symbolic modeling approach (i.e., an approach where the theory is transparent).

Maybe if we don’t have an explicit theory, we generate a network that seems to be human-like in its behavior. Then, we can use the network as a good-enough theory approximation to test learnability claims, even if we can’t exactly say what theory it’s implementing? So, this would focus on the “in principle” learnability claims (i.e., can whatever knowledge be learned from the data children learn from, period).

Tuesday, April 16, 2019

Some thoughts on Pater 2019

As you might imagine, a lot of my thoughts are covered by my commentary that we’re reading as one of the selections next time. But here’s the briefer version: I love seeing the fusion of linguistic representations with statistical methods. The real struggle for me as a cognitive modeler is knowing when using RNNs is better than using symbolic models that are more easily interpretable (e.g., hierarchical Bayesian models that allow overhypotheses to define a wider space of latent hypotheses).

At the very end of Pater’s article, I see a potentially exciting path forward with the advent of RNNs (or other models with distributed representations) that are interpretable. I’m definitely a fan of techniques that allow the learning of hidden structure without it being explicitly encoded — this is the same thing I see in hierarchical Bayesian overhypotheses. More on this below (and in my commentary for next time).

Specific thoughts:

(1) I couldn’t agree more with the importance of incorporating statistical approaches more thoroughly into learning/acquisition theories, but I have yet to be sold on the neural networks side. It really depends on what kind of network: are they matching neurobiology (e.g., see Avery & Krichmar 2017; Beyeler, Rounds, Carlson, Dutt, & Krichmar 2017; Krichmar, Conrad, & Asada 2015; Neftci, Augustine, Paul, & Detorakis 2017; Neftci, Binas, Rutishauser, Chicca, Indiveri, & Douglas 2013), or are they a computational-level distributed-representations approach (I think this is what most RNNs are), which seems hard to decipher, and so less useful for exploring symbolic theories more completely? Maybe the point is to explore non-symbolic theories.

Pater notes the following about non-symbolic approaches: “...it is hard to escape the conclusion that a successful theory of learning from realistic data will have a neural component.” If by neural, Pater means an implementational-level description, sure. But I’m not sold on distributed representations as being necessary for a successful theory of learning -- a theory can operate at the computational or algorithmic levels.

(2) I completely agree that structure-independent representations (statistical sequences that don’t involve phrases, etc.) can only get you so far. The interesting thing from an NLP standpoint, of course, is exactly how far they can get you -- which often turns out to be surprisingly far. In fact, it’s often much further than I would have expected -- e.g., n-grams over words (not even syntactic categories!!) work remarkably well as features for opinion spam detection, with near 90% classification accuracy (Ott et al. 2011, 2013). Though I guess n-grams do heuristically encode some local structure.

(3) RNNs seem to need to incorporate hierarchical representations to work (e.g., the Recurrent Neural Network Grammars of Dyer et al. 2016, and incorporating hierarchical structure into current neural network approaches in AI/NLP). But, sequence-to-sequence models do pretty well without explicit structure encoded in. So, if sequence-to-sequence models can handle aux-inversion (e.g., as in McCoy, Frank, & Linzen 2018...well, at least sort of -- it’s not clear they handle it the way humans do), what do we make of it from the linguistic cognition perspective?

This comes back to the question of model interpretation. With symbolic models, it’s usually clear what theory of representation is being evaluated. For RNNs, do we know what the distributed representations/continuous hypotheses are encoding? (This is of course less of a problem from the engineering perspective -- we’re happy if we can get the machines to do it as well as or better than humans.) As Pater noted, some read-out can be done with clever model comparisons, and some distributed representations (e.g., Palangi et al.’s (2017) Tensor Product Recurrent Networks) may in fact encode syntactic structures we recognize. So then, the question is what we’re getting from the distributed representation.

Pater: “...it is given the building blocks of symbols and their roles, but must learn their configurations”. This starts to sound like the latent vs. explicit hypothesis space construction of Perfors (2012), which can be implemented in a variety of ways (e.g., variational learning as in Yang 2002). That is, RNNs allow the modeler to specify the building blocks but let the model construct the explicit hypotheses that get evaluated, based on its prior biases (RNN architecture, Bayesian overhypothesis hyperparameters, etc.). Something that could be interesting: the RNN version allows construction of explicit hypotheses from the building blocks that are outside what the modeler would have built in to the overhypothesis parameters; that is, they may be perfectly reasonable hypotheses from the given building blocks, but go against the natural overhypothesis-style parametric biases and so would get a low probability of being generated (and subsequently evaluated).

Since the RNN generates hypotheses with whatever architectural biases mold the explicit hypothesis construction, it may give higher probability to hypotheses that were lower-probability for a hierarchical Bayesian model.  That is, the Bayesian overhypotheses may be quite general (especially if we back off to over-over-hypotheses, and so on), but still require an explicit bias at some level for how hypotheses are generated from overhypotheses. That has to be specified by the modeler. This may cause Bayesian modelers to miss ways that certain building blocks can generate the kinds of linguistic hypotheses we want to generate.

An analogy: Genetic algorithms can be used to identify solutions that humans didn’t think of because they employ a much wider search of the latent hypothesis space; humans are fettered by their biases for what an optimal solution is going to look like.  Here: symbolic modelers may be fettered by ideas about how building blocks can be used to generate explicit hypotheses; RNNs may allow a wider search of the latent hypothesis space because they’re bound by different (implicit) ideas, via the RNN architecture. So, the solution an RNN comes up with (assuming you can interpret it) may provide a novel representational option, based on the building blocks given to it.

Bigger point: RNNs and distributed representations may provide a novel way of exploratory theorizing (especially for syntactic learning), to the extent that their innards are interpretable. For theory evaluation, on the other hand, it’s better to go with a symbolic model that’s already easy to understand… unless your theory is about the building blocks, leaving the explicit hypotheses they build and evaluate unspecified.

Tuesday, March 5, 2019

Some thoughts on Nordmeyer & Frank 2018

This is exactly the kind of behavioral work that serves as a good target of developmental modeling. (Thanks, N&F2018!) Moreover, the particular experiment lends itself very naturally to RSA modeling, given the importance of context manipulation (and then the RSA model allows us to be more concrete about what those contextual variables could be and what exactly they could do). 

More generally, this work also falls in a larger body of work that underscores the importance of pragmatic felicity when doing child language experiments. This was the basis for the Truth Value Judgment paradigm (Crain & Thornton 1998) -- it’s important to give supportive contexts if you want kids to show you their linguistic knowledge. They’re not as good as adults at “test-taking” -- i.e., compensating for a lack of supportive context by implicitly supplying their own. So, if kids aren’t behaving like they have adult-like linguistic knowledge, check if pragmatic (or processing) factors might be getting in the way.

Crain, S., & Thornton, R. (1998). The truth value judgment task: Fundamentals of design. University of Maryland working papers in linguistics, 6, 61-70.

Some other thoughts:

(1) The Kim (1985) child behavioral setup involved (for example) someone pointing at an apple and saying “This is not a banana”. The child would reply “wrong!”, but of course we don’t know why she’s saying it’s wrong. Is it the wrong meaning (semantic issue) or the wrong thing to say (pragmatic issue)? This reminds me of recent work on children’s (non-)endorsements when it comes to quantifier scope ambiguity (Viau et al. 2010, Savinelli et al. 2017). The key idea is that they weren’t saying no because they couldn’t get the interpretation; they were saying no because it wasn’t a very informative interpretation, given the prior context. This also seems to be a main factor in English children’s wonky pronoun interpretation behavior (Conroy et al. 2009). Also similar to the Conroy et al. (2009) study is how N&F2018 explicitly manipulate the context, first to replicate the prior behavior and then to show how to fix it with supportive pragmatic context.

Savinelli, K. J., Scontras, G., & Pearl, L. (2017). Modeling scope ambiguity resolution as pragmatic inference: Formalizing differences in child and adult behavior. In CogSci.

Viau, J., Lidz, J., & Musolino, J. (2010). Priming of abstract logical representations in 4-year-olds. Language Acquisition, 17(1-2), 26-50.

Conroy, A., Takahashi, E., Lidz, J., & Phillips, C. (2009). Equal treatment for all antecedents: How children succeed with Principle B. Linguistic Inquiry, 40(3), 446-486.

(2) Varying the linguistic form: “has no X” vs. “doesn’t have an X”. Since “has no X” is rated worse than “doesn’t have an X”, a corpus analysis could tell us how often negation appears in each of these forms. Then we would know whether that’s just a frequency effect or whether something more interesting is happening.

(3) With adults, “has no X” is better when everyone else has an X. The pragmatic reason for this is that when the referent has a Y instead of an X, there’s a more informative utterance available (i.e., “has a Y”) -- this seems like something that could be captured in an RSA model’s cost function. Basically, it costs more to say “has no X” than “has a Y” when both are true.
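A sketch of how that could look -- to be clear, the toy lexicon, worlds, and cost values below are my own assumptions for illustration, not N&F2018’s actual model. The speaker soft-maxes over informativity minus cost, so the cheap positive utterance beats the costly negation whenever both are true:

```python
import math

# Toy RSA speaker: the referent either has an X or has a Y (hypothetical
# two-world setup). Truth conditions and costs below are illustrative
# assumptions; negation is assigned a higher production cost.

worlds = ["has_X", "has_Y"]

meanings = {
    "has a Y": {"has_Y"},
    "has no X": {"has_Y"},   # true whenever the referent lacks an X
    "has an X": {"has_X"},
}

costs = {"has a Y": 0.0, "has no X": 1.0, "has an X": 0.0}

def literal_listener(utterance):
    """Uniform distribution over worlds consistent with the utterance."""
    true_worlds = meanings[utterance]
    return {w: (1.0 / len(true_worlds) if w in true_worlds else 0.0)
            for w in worlds}

def speaker(world, alpha=1.0):
    """Soft-max speaker: utility = log P_L0(world | u) - cost(u)."""
    utilities = {}
    for u in meanings:
        p = literal_listener(u)[world]
        if p > 0:
            utilities[u] = alpha * (math.log(p) - costs[u])
    z = sum(math.exp(v) for v in utilities.values())
    return {u: math.exp(v) / z for u, v in utilities.items()}

probs = speaker("has_Y")
# Both utterances are true of this world, but the costly negation loses.
assert probs["has a Y"] > probs["has no X"]
```

With equal informativity (both utterances pick out the same single world for the literal listener), the cost term alone drives the preference, which is exactly the intuition in the paragraph above.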

(4) In general, kudos for getting kids to give ratings. This is super-hard to do well, since it requires young children to think metalinguistically. I also really appreciate seeing the histogram of responses in Figure 4. Here, we can see that there are still quite a number of kids (>50) who, in the unsupportive none context (where no one else has anything), think that “Abby doesn’t have an apple” is fine; however, many more kids (>100) think it’s terrible. Similarly, there are >50 kids who think “Abby doesn’t have an apple” is terrible in the target context (where everyone else has an apple), though many more (>100) think it’s fine. Hello, child data messiness -- and bless your hearts, child behavioral researchers.

I wish we could see an equivalent histogram for adults, though. I wonder how much of this messiness is because we’re dealing with kids vs. dealing with a felicity scale vs. dealing with a phenomenon that’s inherently messy in the target state.

Tuesday, February 19, 2019

Some thoughts on Tessler & Franke 2018

This is a great example of theoretically-motivated computational modeling coupled with behavioral experiments, here in the realm of negated antonyms (e.g., "not unhappy"). My main qualm is with the paper length — there’s a lot of interesting stuff going on, and we just don’t get the space to see it fully discussed (more specifics on this below). This of course isn’t the authors’ fault — it just highlights the difficulty of explaining work like this in the space you normally get for conference proceedings.

Specific comments:
(1) The case study here with negated antonyms (which involve double negations like “not unhappy”) seems very relevant for sentiment analysis, where we still struggle to deal precisely with negated expressions. So, more generally, this is a particular case where I can see the NLP community paying closer attention and taking inspiration from cognitive work. For example, based on the results here for single utterances ("unhappy" = "not happy"), the antonym dictionary approach to negation (where "not happy" = "unhappy" or "sad") may not be a bad move in non-contrastive utterances.

(2) I love the clearcut hypothesis space, and the building blocks of contrary (tall vs. short) vs. contradictory (even vs. odd) adjectives. My own sense is that my prior experience consists mostly of contrary adjectives, but I wonder if that’s true. (Helloooo, corpus analysis. Also, what do we know about children’s development of these types of fine semantic distinctions?)

(3) I wish there had been a bit more space to explain why we see the modeling results we do. For the full uncertain negation, we get some mileage from a single utterance because it’s unnecessarily costly to say “not unhappy” unless it had a different meaning from "happy", which makes sense. When there are multiple utterances, we see a complete separation of all four options because...there are four different individuals who presumably have different states (or else why use different expressions)?

For the more restricted hypothesis of bona fide contraries that connects morphological negation explicitly to an opposite valence, we see separation for both single and multiple utterances, but much more so for the multiple utterances. This is definitely a case of a more restricted hypothesis yielding stronger generalizations from ambiguous data, but I don’t quite see how we’re getting it. Certainly, “not unhappy” is more costly to produce than “happy”, so we get separation between those two terms, just as with the full uncertain negation hypothesis. But why, in the single utterance case, do we also get separation between “unhappy” and “not happy”?

For the most restricted hypothesis of logical negation, I get why we never get any separation — by definition, “unhappy” = “not happy” = not(happy), and so “not unhappy” = not(not(happy)) = “happy”.
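That collapse can be checked mechanically. Here’s a minimal sketch in which “happy” denotes the upper region of a discretized degree scale and, under the logical negation hypothesis, both “not” and morphological “un-” are pure set complement (the scale and the 0.5 threshold are illustrative assumptions, not the paper’s actual parameterization):

```python
# Logical-negation hypothesis sketch: degrees of happiness on a
# discretized 0.0-1.0 scale (illustrative), "happy" as the upper region,
# and both "not" and "un-" as pure set complement.

scale = [round(0.1 * i, 1) for i in range(11)]   # degrees 0.0 .. 1.0

happy = {d for d in scale if d > 0.5}            # upper half of the scale

def neg(meaning):
    """Both 'not' and 'un-' denote the same complement operation."""
    return set(scale) - meaning

unhappy = neg(happy)          # "unhappy"     = not(happy)
not_happy = neg(happy)        # "not happy"   = not(happy)
not_unhappy = neg(unhappy)    # "not unhappy" = not(not(happy))

assert unhappy == not_happy        # these can never separate
assert not_unhappy == happy        # double negation collapses to "happy"
```

Since every negation is the same complement operation, the four expressions reduce to exactly two meanings, so no amount of data could pull “unhappy” apart from “not happy” under this hypothesis -- which is why no separation is ever observed.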