Sunday, December 1, 2019

Some thoughts on Ud Deen & Timyam 2018

I really appreciate seeing a traditional Universal Grammar (UG) approach to learnability spelled out so clearly, especially coupled with clear behavioral data about the phenomenon in question (condition C). It makes it easier to see where I agree vs. where I’m concerned that we’re not being fair to alternative accounts (or even how we’re characterizing linguistic nativist vs. non-linguistic nativist accounts). What really struck me after all the learnability discussions at various points (more on this below) is that I really want a computational cognitive model that tries to learn the condition C facts underlying child behavior in English and Thai. When we have a concrete model, we can be explicit about what we’re building in that makes the model behave (or not) like children in these languages. And then it makes sense to have a discussion about the nature of the built-in stuff.

Specific thoughts:
(1) Intro, the traditional UG learnability approach: The traditional claim is that if we test kids as young as we can (here, that’s Thai four-year-olds, though in English we’ve apparently tested 2.5-year-olds), and they show a certain type of knowledge, we assume that knowledge is innate. For me, I think this is a good placeholder — we have to explain how kids have this knowledge by that age. That then means a careful analysis of the input and a reasonable investigation of how that knowledge could be derived from more fundamental building blocks (maybe language-specific building blocks, but maybe not). This goal of course depends on how the knowledge of condition C is represented, which is part of what often evolves in theoretical debates.

(2) Section 2.2, learnability again: “the negative properties of a language are downright impossible to acquire from the input because children never get evidence of what is impossible in the language.” -- Of course, it all depends on the representation we think children are learning, and what building blocks go into that representation. This is what Pearl & Sprouse (2013) did for syntactic islands (i.e., subjacency constraints) -- and it turns out you can get away with much more general-purpose knowledge, which may or may not be UG. 


UD&T2018 walk through how the learning process might work for condition C knowledge (and kudos to them for doing an input analysis! Much more convincing when you have actual counts in children’s input data of the phenomena you’re talking about for learnability). They highlight that children basically hear all the viable combinations of pronoun and name with both co-indexed and non-coindexed readings, but only hear the problematic structure with the non-coindexed reading. They then ask why a child wouldn’t just assume the co-indexed reading is fine for this one, too. (And that would lead to a condition C violation.) 

But the flip side is that if children get fairly strong evidence for all the other options allowing coindexed readings but this structure doesn’t, why would they assume it can be coindexed? This really comes back to expectations and indirect evidence -- if you keep expecting something to occur, and it keeps not occurring, you start shifting probability towards the option that it can’t in fact occur. To make this indirect evidence account work, children would have to expect that coindexed reading to occur and keep seeing it not occur. This doesn’t seem implausible to me, but it helps to have an existence proof. (For instance, what causes them to have this expectation and what are the restrictions on the structures they expect to allow co-indexing?)
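
To make the "keep expecting it, keep not seeing it" dynamic concrete, here’s a minimal Bayesian sketch -- entirely my own toy example, not anything from UD&T2018, with a made-up rate for how often the coindexed reading would surface if it were allowed:

```python
# Toy sketch of indirect negative evidence (my illustration, not UD&T2018's proposal).
# H_allowed: the coindexed reading is possible for the problematic structure
#            (and would surface some proportion of the time in relevant contexts).
# H_banned:  the coindexed reading is impossible (it never surfaces).

p_allowed, p_banned = 0.5, 0.5          # flat prior over the two hypotheses
rate_if_allowed = 0.3                   # assumed rate of coindexed readings if allowed

for n_absences in range(1, 21):
    # Each observation: the structure occurs WITHOUT the coindexed reading.
    unnorm_allowed = p_allowed * (1 - rate_if_allowed)   # absence is merely possible
    unnorm_banned = p_banned * 1.0                       # absence is guaranteed
    total = unnorm_allowed + unnorm_banned
    p_allowed, p_banned = unnorm_allowed / total, unnorm_banned / total
    if n_absences % 5 == 0:
        print(f"after {n_absences} absences: P(banned) = {p_banned:.3f}")
```

The probability shift here is exactly the expectation-keeps-failing intuition: nothing negative is ever observed directly, but the repeated absence of the expected coindexed reading keeps favoring the hypothesis that bans it.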

All that said, I’m completely with UD&T2018 that children aren’t unbiased learners. Children clearly have constraints on the hypotheses they entertain. What I’m not sure of is the origin of those constraints as being language-specific and innate vs. derived from prior experience. As I mentioned above, it really depends what building blocks underlie the explicit hypotheses the child is entertaining. It’s perfectly possible for the implicit hypothesis space to be really (infinitely) large, because of the way building blocks can combine (especially if there’s any kind of recursion). But domain-general biases (e.g., prefer explicit hypotheses that require fewer building blocks) can skew the prior over this implicit hypothesis space in a useful way.

I’m also not very convinced about the need for a language-specific Subset Principle -- this same preference for a narrower hypothesis, given ambiguous data, falls out from standard Bayesian inference. (Basically, a narrower hypothesis has a higher likelihood for generating an ambiguous data point than a wider hypothesis does, so boom -- we have a preference for a narrower hypothesis that comes from a domain-general learning mechanism.) But okay, if a bias for the narrower hypothesis helps children navigate the condition C acquisition problem, great.
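
Here’s the size-principle logic in miniature, just to show where the preference for the narrower hypothesis comes from in vanilla Bayesian inference (a toy numerical example with made-up hypothesis sizes, nothing specific to condition C):

```python
# Size principle in miniature (toy numbers, nothing specific to condition C).
# Suppose the narrower hypothesis licenses 10 possible observations and the
# wider one licenses 100, each sampled uniformly if that hypothesis is true.
# An ambiguous data point (compatible with both) then favors the narrower one.

prior_narrow, prior_wide = 0.5, 0.5
like_narrow, like_wide = 1 / 10, 1 / 100     # likelihood of one ambiguous data point

posterior_narrow = (prior_narrow * like_narrow) / (
    prior_narrow * like_narrow + prior_wide * like_wide)
print(f"P(narrow | one ambiguous data point) = {posterior_narrow:.2f}")   # ~0.91
```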

(3) Section 2.4, the nativist vs. non-nativist discussion: It’s illuminating to see the discussion of what would count as “nativist” vs. “non-nativist” here. (Side note: I have a whole discussion about this particular terminology issue in a recent manuscript about Poverty of the Stimulus: Pearl 2019: Pearl, L. (under review). Poverty of the Stimulus Without Tears. https://ling.auf.net/lingbuzz/004646.)

Using the terms I prefer, UD&T2018 are thinking about a linguistic nativist approach (=their “nativist”) where explicit knowledge of condition C is innately available vs. a non-linguistic nativist approach (=their “non-nativist”) where this knowledge is instead derived from the input. Of course, it’s very possible that explicit knowledge of condition C is derived from the input over time by *still* using some innate, language-specific knowledge (maybe something more general-purpose). If so, then we still have a linguistic nativist position, but now it’s one that’s utilizing the input more than UD&T2018 think linguistic nativist approaches do. What would make an approach non-linguistic nativist is the following: the way explicit knowledge of condition C was derived from the input never involved language-specific innate knowledge, but rather only domain-general innate knowledge (or mechanisms, like Bayesian inference).

Relatedly, on children initially obeying condition C everywhere (even for bare nominals): This behavior would accord with (relatively) rapid acquisition of explicit condition C knowledge (which initially gets overapplied for Thai). But it’s not clear to me that we can definitively claim the necessary knowledge is linguistic vs. non-linguistic nativist. Again, it all depends on how condition C knowledge is represented and what building blocks kids use (and could track in the input) to construct that knowledge.

Also, with respect to age of acquisition, without an explicit theory of how learning occurs, how do we know that derived approaches to learning condition C couldn’t yield child judgment behavior at four? That is, how do we know that acquisition wouldn’t be fast enough this way? (This is one of my main concerns with claims that young children knowing something means they were innately endowed with that knowledge.)

(4) Section 4, discussion of Thai children overapplying condition C: UD&T2018 talk about this as children allowing the most restrictive grammar. But honestly, thinking about this in terms of Yang’s Tolerance Principle, couldn’t it be about learning the rule (here: condition C) in the presence of exceptions? So it’s the same grammar; Thai just has exceptions, while languages like English don’t. If so, then it’s not clear that considerations about more restrictive vs. less restrictive grammars apply.
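
For concreteness, the Tolerance Principle threshold is N / ln N: a rule over N items survives as long as the exceptions don’t exceed that. A quick back-of-the-envelope calculation (illustrative N values only, not actual Thai counts):

```python
import math

# Yang's Tolerance Principle: a rule over N items tolerates up to N / ln(N)
# exceptions. The N values below are purely illustrative, not actual Thai counts.
def tolerance_threshold(n_items):
    return n_items / math.log(n_items)

for n in (50, 200, 1000):
    print(f"N = {n:4d}: rule survives with up to {tolerance_threshold(n):.1f} exceptions")
```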

(5) Section 4, on minimalist approaches to condition C: I like the idea of considering the building blocks of condition C a lot. But then, isn’t it an open question whether the explicit condition C knowledge constructed from these building blocks has to be innate (and perhaps more interestingly, as constrained as UD&T2018 initially posit it to be)? I’m very willing to believe some building blocks are innate (I’m a nativist, after all), but it seems yet-to-be-determined whether the necessary building blocks are language-specific. And again, even if they are, how constrained do they make the child’s implicit hypothesis space? (This is where it really hit me that I wanted a computational cognitive model of learning condition C facts — and really, a model that replicates adult and child judgment behavior.)

Friday, November 15, 2019

Some thoughts on Villata et al. 2019

I appreciated seeing up front the traditional argument about economy of representation because it’s often at the heart of theoretical debate. What’s interesting to me is the assumption that something is more economical just based on intuition, without having some formal way to evaluate how economical it is. So, good on V&al2019 for thinking about this issue explicitly. More generally, when I hear this kind of debate about categorical grammar + external gradience vs. gradience in the grammar, I often wonder how on earth you could tell the difference. V&al2019 are approaching this from an economy angle, rather than a behavioral fit angle, and showing a proof-of-concept with the SOSP model. That said, it’s interesting to note that the SOSP implementation of a gradient grammar clearly includes both syntactic and semantic features -- and that’s where its ability to handle some of the desiderata comes from.

Other comments:
(1) Model implementation involving the differential equations: If I understand this correctly, the model is using a computational-level way to accomplish the inference about which treelets to choose. (This is because it seems like it requires a full enumeration of the possible structures that can be formed, which is typically a massively parallel process and not something we think humans are doing on a word-by-word processing basis.) Computational-level inference for me is exactly like this: we think this is the inference computation humans are trying to do, but they approximate it somehow. So, we’re not committed to this being the algorithm humans use to accomplish that inference. 

That said, the way V&al2019 describe it here isn’t an optimal inference mechanism, the way Gibbs sampling is, since it seems to allow the equivalent of probabilistic sampling (where a sub-optimal option can win out in the long run). So, this differs from the way I often see computational-level inference in Bayesian modeling land, because there the goal is to identify the optimal result of the desired computation.
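
To illustrate the contrast I have in mind (my own toy setup with made-up harmony values, not the actual SOSP dynamics): deterministically committing to the highest-harmony treelet vs. sampling treelets in proportion to their harmony, where the sub-optimal option sometimes wins.

```python
import random

# Toy contrast (mine, not the actual SOSP dynamics): committing to the
# highest-harmony option vs. sampling options in proportion to their harmony.
harmonies = {"island_parse": 0.6, "coerced_think_parse": 0.4}   # made-up values

# Optimal commitment: the same winner every time.
optimal = max(harmonies, key=harmonies.get)

# Probabilistic commitment: the sub-optimal option wins some proportion of the time.
options, weights = zip(*harmonies.items())
samples = [random.choices(options, weights=weights)[0] for _ in range(1000)]

print("optimal pick:", optimal)
print("sub-optimal win rate when sampling:",
      samples.count("coerced_think_parse") / len(samples))
```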

(2) Generating grammaticality vs. acceptability judgments from model output: I often hear “acceptability” used when grammaticality is one (major) aspect of human judgment, but there are other things in there too (like memory constraints or lexical choice, etc.). I originally thought the point of this model was that we’re trying to generate the judgment straight from the grammar, rather than from other factors (so it would be a grammaticality judgment). But maybe because a word’s feature vector also includes semantic features (whatever those look like), then this is why the judgment is getting termed acceptability rather than grammaticality?

(3) I appreciate the explanation in the Simulations section about the difference between whether islands and subject islands -- basically, for subject islands, there are no easy lexical alternatives that would allow sub-optimal treelets to persist that will eventually allow a link between the wh-phrase and the gap. But something I want to clear up is the issue of parallel vs. greedy parsing. I had thought that the SOSP approach does greedy parsing because it finds what it considers the (noisy) local maximum option for any given word, and proceeds on from there. So, for whether islands, it picks either the wonder option that’s an island or the wonder option that gets coerced to think, because both of those are somewhat close in how harmonic they are. (For subject islands, there’s only one option -- the island one -- so that’s the one that gets picked). Given this, how can we talk about the whether island as having two options at all? Is it that on any given parse, we pick one, and it’s just that sometimes it’s the one that allows a dependency to form? That would be fine. We’d just expect to see individual variation from instance to instance, and the aggregate effect would be that D-linked whether islands are better, basically because sometimes they’re okay-ish and sometimes they’re not.

Friday, November 1, 2019

Some thoughts on Gauthier et al. 2019

General thoughts:
I really enjoy seeing this kind of computational cognitive model, where the model is not only generating general patterns of behavior (like the ability to get the right interpretation for a novel utterance), but specifically matching a set of child behavioral results. I think it’s easier to believe in the model’s informativity when you see it able to account for a specific set of results. And those results then provide a fair benchmark for future models. (So, yay, good developmental modeling practice!)

Other thoughts:
(1) It’s always great to show what can be accomplished “from scratch” (as G&al2019 note), though this is probably harder than the child’s actual task. Presumably, by the time children are using syntactic bootstrapping to learn harder lexical items, they already have a lexicon seeded with some concrete noun items. But this is fine for a proof of concept -- basically, if we can get success on the harder task of starting from scratch, then we should also get success when we start with a headstart in the lexicon. (Caveat: Unless a concrete noun bias in the early lexicon somehow skews the learning the wrong direction for some reason.)

(2) It’s a pity that the Abend et al. 2017 study wasn’t discussed more thoroughly -- that’s another one using a CCG representation for the semantics, a loose idea of what the available meaning elements are from the scene, and doing this kind of rational search over possible syntactic rules, given naturalistic input. That model achieves syntactic bootstrapping, along with a variety of other features like one-shot learning, accelerated learning of individual vocabulary items corresponding to specific syntactic categories, and easier learning of nouns (thereby creating a noun bias in early lexicons). It seems like a compare & contrast with that Bayesian model would have been really helpful, especially noting what about those learning scenarios was simplified, compared with the one used here. 

For instance, “naturalistic” for G&al2019 means utterances which make reference to abstract events and relations. This isn’t what’s normally meant by naturalistic, because these utterances are still idealized (i.e., artificial). That said, these idealized data have more complex pieces in them that make them similar to naturalistic language data. I have no issue with this, per se -- it’s often a very reasonable first step, especially for cognitive models that take a while to run.

(3) Figure 4: It looks like there’s a dependency where meaning depends on syntactic form, but not the other way around -- I guess that’s the linking rule? But I wonder why that direction and not the other. Shouldn’t form depend on meaning, too, especially if we’re thinking about this as a generative model where the output is the utterance? That is, we start with a meaning and get the language form for that meaning, which suggests the arrow should go from meaning to syntactic form. Certainly, it seems like you need something connecting syntactic type to meaning if you’re going to get syntactic bootstrapping, and I can see in their description of the inference process why it’s helpful to have the meaning depend on the structure (i.e., because they infer the meaning from the structure for a novel verb: P(m_w | s_w), which only works if you have the arrow going from s_w to m_w).

(4) It took me a little bit to understand what was going on in equations 2 and 3, so let me summarize what I think I got here: if we want to get the probability of a particular meaning (which is composed of several independent predicates), we have to multiply the probabilities of those predicates together (that’s equation 3). To get the probability of each predicate, we sum over all instances of that predicate that are associated with that syntactic type (that’s equation 2).
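
In code-ish form, here’s my reading of those two equations -- a paraphrase with made-up data structures and weights, definitely not G&al2019’s actual implementation:

```python
# My paraphrase of equations 2-3 (made-up data structures, not G&al2019's code).
# A meaning is a set of independent predicates. Its probability given a syntactic
# type multiplies the per-predicate probabilities (eq. 3), and each predicate's
# probability sums over lexicon entries pairing that predicate with that
# syntactic type (eq. 2), normalized here by the type's total weight mass.

def predicate_prob(predicate, syn_type, lexicon):
    mass = sum(w for (p, s), w in lexicon.items() if s == syn_type)
    hits = sum(w for (p, s), w in lexicon.items() if p == predicate and s == syn_type)
    return hits / mass if mass > 0 else 0.0

def meaning_prob(predicates, syn_type, lexicon):
    prob = 1.0
    for predicate in predicates:
        prob *= predicate_prob(predicate, syn_type, lexicon)
    return prob

# Tiny illustrative lexicon: (predicate, syntactic type) -> weight.
toy_lexicon = {("cause", "S\\NP/NP"): 2.0,
               ("move", "S\\NP/NP"): 1.0,
               ("move", "S\\NP"): 3.0}
print(meaning_prob(["cause", "move"], "S\\NP/NP", toy_lexicon))   # (2/3) * (1/3)
```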

(5) The learner is constrained to only encode a limited number of entries per word at all times (i.e., only the l-highest weight lexical entries per wordform are retained): I love the ability to constrain the number of entries per word form. This seems exactly right from what I know of the kid word-learning literature, and I wonder how often a limit of two is the best…from Figure 7, it looks like 2 is pretty darned good (pretty much overlapping 7, and better than 3 or 5, if I’m reading those colors correctly).
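
The pruning itself is easy to picture -- something like keeping only the l highest-weight entries per word form after each update (my sketch of the general idea, not their code):

```python
# Sketch of retaining only the l highest-weight lexical entries per word form
# (my illustration of the general idea, not G&al2019's implementation).
def prune_lexicon(entries_by_word, l=2):
    return {word: sorted(entries, key=lambda e: e["weight"], reverse=True)[:l]
            for word, entries in entries_by_word.items()}

toy = {"gorp": [{"meaning": "CAUSE-MOVE", "weight": 0.7},
                {"meaning": "MOVE", "weight": 0.2},
                {"meaning": "SEE", "weight": 0.1}]}
print(prune_lexicon(toy, l=2))   # only the top two entries for "gorp" survive
```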

Friday, October 18, 2019

Some thoughts on Lavi-Rotbain & Arnon 2019

I’m very sympathetic to the difficulties of creating experimental stimuli (like artificial languages) that don’t idealize away from important aspects of actual language data. So, LR&A2019’s main point about the importance of ecologically valid stimuli is certainly one I can get behind. That said, the trick is figuring out what we want to find out from the experiment -- if we’re interested in children’s ability to use, say, statistical cues alone for segmentation (in the absence of any other information) just to show children have this ability, then we specifically don’t want ecologically valid stimuli.

LR&A2019’s main point about the utility of higher entropy for language acquisition tasks like segmentation and object-label mapping is also one I’m sympathetic to. I’m just less clear on how this relates to what (I thought) we already knew about children’s language acquisition abilities. For instance, if children are sensitive to entropy, doesn’t this just mean that children can tell the difference between probability distributions of different types, like uniform vs. somewhat skewed vs. highly skewed? (So, I thought we already knew that.) For example, I’m thinking of some of the work on how children (vs adults) respond to input that’s inconsistent (work by Hudson Kam and by Newport), and the thing that varies is what the exact probability distribution is. It’s possible I’m missing something more subtle about entropy and information rates, which is touched on in the discussion near the end.

Some other thoughts:
(1) What we can conclude about early native language acquisition from studies with 10-year-olds: I’m always hesitant to conclude anything about early stages of acquisition (here, tasks that start happening before the child is a year old) from studies conducted on older participants. Often it’s a good way to start, in order to get a developmental trajectory of whatever it is we’re studying or provide a proof of experimental concept. But, for example, it’s tricky to conclude something about infant abilities from the performance of 10-year-olds. LR&A2019 do note that they intend to test younger children (7-year-olds, I believe, given their previous work). But even then, I don’t quite know how to extrapolate from 7-year-olds to infants.

(2) Something that comes to mind when considering the specific stimuli setup LR&A2019 went with: the work on how children of different ages vs. adults respond to input with a highly skewed vs. not highly skewed distribution seems really important to think about for comparison purposes. I’m thinking of work by Hudson Kam and Newport, where they see the difference in generalizations made when the input is something like 90-5-5 vs. 60-30-10 vs. other splits. So, the fact that LR&A2019 have a super-frequent option and the rest evenly infrequent (80-7-7-7) might yield different results than other sorts of skews would.
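
For reference, here’s how the Shannon entropies of these kinds of splits compare (my own quick calculation over the label distributions, not anything from LR&A2019):

```python
import math

# Shannon entropy (bits) for the kinds of splits mentioned above
# (my quick calculation, not LR&A2019's analysis).
def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

splits = {"uniform 25-25-25-25": normalize([25, 25, 25, 25]),
          "skewed 80-7-7-7 (approx.)": normalize([80, 7, 7, 7]),
          "90-5-5": normalize([90, 5, 5]),
          "60-30-10": normalize([60, 30, 10])}

for name, probs in splits.items():
    print(f"{name}: {entropy(probs):.2f} bits")
```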

Related materials question: Not that I have particular expectations about this mattering, but why not make it so the exposure in minutes was the same for the two different entropy conditions? One could reasonably argue that better performance happened for the one kids heard for longer (even if they heard certain word forms less frequently -- they still had more time on the task). And it doesn’t seem that difficult to create an 80-7-7-7 split for the low entropy condition that lasts the same amount of time as the high entropy condition.

(3) The general scaffolding story that LR&A2019 put forth in the discussion about why higher entropy is helpful makes good sense to me. There’s a bunch of infant segmentation work showing that anchor words (e.g., familiar words) facilitate segmentation of other words. So, if kids here in the high entropy condition can segment the frequent word, that allows them to have a familiar word they can use to segment the other words. Once segmentation is off to a good start, then they have a solid set of labels that they can use for object-label mapping. So, this study would be additional supportive evidence for scaffolding in these two particular tasks.

Tuesday, June 4, 2019

Some thoughts on Potts 2019 + Berent & Marcus 2019

I really appreciate Potts sketching out how vectors of numbers as the core meaning could impact semantics more broadly. This is the kind of broader speculation that’s helpful for people trying to see the effects of this key assumption on things they know and love. Moreover, Potts is aware of the current shortcomings of the “DL semantics” approach, but focuses on where it could be a useful tool for semantic theory. (This is my own inclination too, so I’m very sympathetic to this point of view.) Interestingly, I think Berent & Marcus also end up with sympathy for a hybrid approach, despite their concerns about the relationship between symbolic and non-symbolic approaches to language. A key difference seems to be where each commentary focuses — Potts zooms in on semantics, while Berent & Marcus mostly seem to think about phonology and syntax. And previously, non-symbolic approaches seem to have left a poor impression on Berent & Marcus.

Other thoughts:
(1) Potts: The idea that machine learning is equivalent to neural networks still momentarily confuses me. In my head, machine learning is the learning part (so it could be symbolic, like SVMs). Another important component is then feature selection, which would correspond to the embedding into that vector of numbers in Potts’s terminology. I guess this just goes to show how terminology changes over time.

(2) Potts: I totally get the analogy of how to do function application with an n-dimensional array. But how do we know that this concatenation and multiplication by a new matrix (W) yields the correct compositional meaning of two elements? Maybe the idea is that we have to find the right function application for our n-dimensional vectors? Potts basically acknowledges this when he says we have to learn the values for W from the data -- that is, we have to use supervised learning to get the right W so that compositional meaning results. Okay. But what guarantee do we have that there is in fact a W for all the compositional meaning we might want? Of course, maybe that’s a problem for current semantic theory’s function application as well.
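
Concretely, the operation Potts describes is something like the following -- a bare-bones sketch with random, untrained numbers, just to show the shape of the computation:

```python
import numpy as np

# Bare-bones sketch of "function application" over vectors, as I read Potts:
# concatenate the two meaning vectors and multiply by a matrix W that would
# have to be learned from data. Everything here is random and untrained.
d = 4                                    # toy embedding dimension
rng = np.random.default_rng(0)

u = rng.normal(size=d)                   # meaning vector for element 1
v = rng.normal(size=d)                   # meaning vector for element 2
W = rng.normal(size=(d, 2 * d))          # the matrix we'd have to learn

composed = W @ np.concatenate([u, v])    # candidate "composed meaning"
print(composed.shape)                    # (4,) -- back in the meaning space
```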

(3) Potts, on how the dataset used to optimize the system will be a collection of utterances, rather than I-language abstractions: So, because of this, it’d be including aspects of both representation and use (like frequency info) together, rather than just the representation part. This isn’t a bad thing necessarily, as long as we don’t explicitly care about the representation part separately. It seems like linguists often do care about this while the NLP community doesn’t. I think Potts’s example with the A but B construction highlights this difference nicely. Potts notes that this would make “use phenomena” more natural to study than they currently are under an intensional semantics approach, and I can see this. I just worry about how we derive explanations from a DL approach (i.e., what do we do with the weight matrix W, once we learn it via supervised machine learning approaches?).

(4) Potts, on how the goal in machine learning is generalization, however that’s accomplished (with compositionality just one way to do this): Maybe compositionality is what humans ended up with due to bottleneck issues during processing and learning over time? This is the kind of stuff Kirby (e.g., Kirby 2017) has modeled with his language evolution simulations.

Kirby, S. (2017). Culture and biology in the origins of linguistic structure. Psychonomic Bulletin & Review, 24(1), 118-137.

(5) Potts, on how having any representation for lexical meaning is better than not: I totally agree with this. A hard-to-interpret vector of numbers encoding helpful aspects about the representation and use of “kitty” is still better than [[kitty]]. It just doesn’t help us explain things in the symbolic terms we use when we verbalize them.

(6) Berent & Marcus, on how the algebraic hypothesis assumes an innate capacity to operate on abstract categories: Sure! Hello, Bayesian inference, for example. Yet another reason why I’m always confused when generative folks don’t like Bayesian inference.

(7) Berent & Marcus, “mental operations are structure-sensitive -- they operate only on the form of representations and ignore their meaning”: It seems like this is a syntax-specific view -- surely semantic operations would operate over meaning? Or is this the difference between lexical semantics and higher-order semantics?

(8) Berent & Marcus, on how we could tell if neural networks (NNs) generated algebraic approaches: I’m not sure I quite follow the train of logic presented. If an NN does manage to capture human behavior correctly, why would we assume that it had spontaneously created algebraic representations? Wouldn’t associationists naturally assume that it didn’t have to (unless explicitly proven otherwise)?

(9) Berent & Marcus, on previous connectionist studies: I definitely understand Berent & Marcus’s frustration with previous connectionist networks and their performance, but it seems like there have been vast improvements since 2001. I’d be surprised if you couldn’t make an LSTM of some kind that could capture some of the generalizations Marcus investigated before, provided enough data was supplied. Granted, part of the cool thing about small humans is that they don’t get all of Wikipedia to learn from, and yet can still make broad generalizations.

(10) Berent & Marcus: Kudos to Berent & Marcus for being clear that they don’t actually know for sure the scope of human generalizations in online language processing -- they’ve been assuming humans behave a particular way that current NNs can’t seem to capture, but this is yet to be empirically validated. If humans don’t actually behave that way, then maybe the algebraic commitment needs some adjustment.

(11) Berent & Marcus: It’s a fascinating observation that a resistance to the idea of innate ideas itself might be an innate bias (the Berent et al. 2019 reference). This is the first I’ve heard of this. I always thought the resistance was an Occam’s Razor sort of thing, where building in innate stuff is more complex than not building in innate stuff.

Tuesday, May 21, 2019

Some thoughts on Linzen 2019 + Rawski & Heinz 2019

I’m totally with Linzen on linguistic theory providing better evaluation items for RNNs. (Hurrah for linguistic theory contributions!) In contrast, I’m just not sold yet on the utility of RNNs for modeling human language development or processing. The interpretability issue just kills it for me (as it does for Rawski & Heinz) -- how can we know if the RNN is or isn’t representing something? And if we have a concrete idea about what it should be representing vs. not, why not use a symbolic model? (More on this below in the “Other thoughts” section.)

I find it heartening to hear that other folks like Rawski & Heinz are also talking about the ML revolution with deep learning techniques as “alchemy”, longing for the “rigor police” to return. I sympathize with the rigor police.

Rawski & Heinz offer their take on the rigor police, highlighting the contributions that computational learnability (CL) investigations can make, with respect to the problems that RNNs are currently being pitched at. In particular, Rawski & Heinz note how CL approaches can answer the question of “Is it possible to learn this thing at all, given this characterization of the learning problem?” The major selling point is that CL results are easily interpretable (“analytically transparent”). This is a key difference that matters a lot for understanding what’s going on. That said, I tend to have concerns with different CL implementations (basically, if they don’t characterize the learning problem in a way that maps well to children’s language acquisition, I don’t know why I should care as a developmental linguist). But this is a different, solvable problem (i.e., investigate characterizations that do map well) — in contrast, interpretability of RNNs isn’t as immediately solvable.

Other thoughts:

(1) Linzen, on RNNs for testing what constraints are needed for learning different things: So far, I haven’t been convinced that it’s helpful to use neural networks to test what innate knowledge is required. All we know when we stumble upon a neural network that can learn something is that it hasn’t explicitly encoded knowledge beforehand in a way that’s easy to interpret; who knows what the implicit knowledge is that’s encoded in the architecture and initialization values? (As Rawski & Heinz note, ignorance of bias doesn’t mean absence of bias.)

(2) Linzen, “language model” = “estimating how likely a particular word is to occur given the words that have preceded it”. I was surprised by this definition. What about other language tasks? I honestly thought “language model” referred to the representation of language knowledge, rather than the evaluation task. So, the language model is the thing that allows you to predict the next word, given the previous words, not the prediction itself. Richard Futrell says this definition of “language model” is right for current ML use, though. (Thanks, Richard!)
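
So in current ML usage, the “language model” just is the estimate of P(next word | preceding words). A maximally simple bigram version, just to fix the definition (toy data, obviously nothing like a modern neural language model):

```python
from collections import Counter, defaultdict

# A maximally simple "language model" in the current ML sense: an estimate of
# P(next word | preceding words), truncated here to a bigram model over toy data.
corpus = "the cat sat on the mat the cat slept".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_prob(prev, nxt):
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(next_word_prob("the", "cat"))   # 2/3 in this toy corpus
```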

(3) Linzen, on using psycholinguistic materials designed to identify linguistic knowledge in humans in order to identify implicit linguistic knowledge in RNNs: This approach makes a lot of sense to me. The human mind is a black box, just like the RNN, and we have decades of materials designed to identify the nature of the knowledge inside that black box. So, I think the key is to start with the most basic tests, since the more complex tests build in assumptions about human knowledge due to the results from the basic ones.

(4) Linzen, noting the importance of having baseline models that are known not to be able to represent the linguistic properties of interest: But how do we know they can’t? Aren’t RNNs universal function approximators, so they can (theoretically) capture any behavior, given enough data? Maybe the point is to use one where we know it’s failed on the linguistic knowledge in question somehow…

(5) Linzen, on the Gulordava et al. RNNs that did better at capturing long-distance agreement when semantic information was helpful: “This suggests that the models did learn some of the syntactic principles underlying subject-verb agreement.” Does it? Maybe if we think “syntactic principles” = something based on the sequence of words, rather than word meaning (i.e., a very broad definition of “the syntactic principles”). But I have no idea how we could tell that the RNN used anything like the syntactic principles we think humans use.

(6) Linzen, on using RNNs for learnability tests: “First, is it indeed the case that the linguistic phenomenon in question cannot be learned from child-directed speech without the proposed constraint?” -- I’m sympathetic to this, but how do we know the RNN isn’t implicitly encoding that constraint in its distributed vectors?

“Second, and equally important, does the proposed constraint in fact aid acquisition?” -- Again, I’m very sympathetic, but why not use a symbolic model for this? Then you can easily tell the model has vs. doesn’t have the proposed constraint. (To be fair, Linzen notes this explicitly: “...the inductive biases of most neural network architectures are not well characterized.”)

(7) Linzen, on building in structural knowledge by giving that structural knowledge as part of the RNN’s input (e.g., “the man” together, then “eats pizza” together = structural knowledge that those two chunks are meaningful chunks): If this is an example of building in a proposed constraint, how do we know the RNN is using those chunks the way we think? Why couldn’t it be doing something wild and wacky with those chunks, instead of treating them as “structured units”? I guess by having chunks at all, it counts as doing something structural? But then how do we make the equivalent of an overhypothesis, where the model likes structured units, but we let the model pick out which structured units it wants?

(8) Linzen, “...neural networks replicate a behavioral result from psycholinguistics without the theoretical machinery...suggest that the human behavior...might arise from statistical patterns in the input.”  Plus whatever implicit biases the RNN has, right? It’s not just statistical patterns working over a blank slate. For example, in the agreement attraction case Linzen discusses, how do we know the RNN didn’t encode some kind of markedness thing for plurals in its distributed representation?

Related to that same study, if the RNNs then show they’re not behaving like humans in other respects, how can we be sure that the behavior which looks human-like actually has the same underlying cause/representation as it does in humans? And if it doesn’t, what have we learned from the RNNs about how humans represent it?

(9) Rawski & Heinz, taking a grammar as target of acquisition, because it’s something of finite size with a symbolic, generative structure: Learning is then a problem of “grammatical inference”. This clearly differs from Linzen’s characterization, where the target of acquisition is something (a function) that can generate accurate predictions, and who cares what it looks like? Note that grammars can make predictions too — and we know what they look like and how they work to make those predictions. (Rigor police, check!)

(10) Rawski & Heinz, on typological arguments for learnability: I have a slight concern with their typological argument. In particular, just because we don’t see certain patterns across existing human languages doesn’t mean they’re impossible. It seems like we should couple typological observations with experimental studies of what generalizations are possible for humans to make when the data are available to support those generalizations.

A related thought regarding typological predictions, though: this seems like a useful evaluation metric for RNNs. In particular, any RNN that’s successful on one language can be applied to other languages’ input to see if it makes the right cross-linguistic generalizations.

(11) Rawski & Heinz, on Weiss et al 2018, which extracted a (symbolic) deterministic FSA representation from an RNN: This seems like exactly what we want for interpretability, though it’s more about identifying a symbolic representation that makes the same predictions as the RNN, rather than reading off the symbolic representation from the RNN. But I guess it doesn’t really matter, as long as you’re sure the symbolic representation really is doing exactly what the RNN is?

Tuesday, May 7, 2019

Some thoughts on Pearl 2019 + Dunbar 2019

I think these two commentaries (mine and Dunbar’s) pair together pretty nicely -- my key thought can be summed up as “if we can interpret neural networks, maybe they can build things we didn’t think to build with the same pieces and that would be cool”; Dunbar’s key thought is something like “we really need to think carefully about how interpretable those networks are…” So, we both seem to agree that it’s great to advance linguistic theory with neural networks, but only if you can in fact interpret them.

More specific thoughts on Dunbar 2019:
(1) Dunbar highlights what he calls the “implementational mapping problem”, which is basically the interpretability problem. How do we draw “a correspondence between an abstract linguistic representational system and an opaque parameter vector”? (Of course, neurolinguists the world over are nodding their heads vigorously in agreement because exactly the same interpretability problem arises with human neural data.)

To draw this correspondence, Dunbar suggests that we need to know what representations are meant to be there. What’s the set of things we should be looking for in those hard-to-interpret network innards? How do we know if a new something is a reasonable something (where reasonable may be “useful for understanding human representations”)?

(2) For learnability:  Dunbar notes that to the extent we believe networks have approximated a theory well enough, we can test learnability claims (such as whether the network can learn from the evidence children learn from or instead requires additional information). I get this, but I still don’t see why it’s better to use this over a symbolic modeling approach (i.e., an approach where the theory is transparent).

Maybe if we don’t have an explicit theory, we generate a network that seems to be human-like in its behavior. Then, we can use the network as a good-enough theory approximation to test learnability claims, even if we can’t exactly say what theory it’s implementing? So, this would focus on the “in principle” learnability claims (i.e., can whatever knowledge be learned from the data children learn from, period).

Tuesday, April 16, 2019

Some thoughts on Pater 2019

As you might imagine, a lot of my thoughts are covered by my commentary that we’re reading as one of the selections next time. But here’s the briefer version: I love seeing the fusion of linguistic representations with statistical methods. The real struggle for me as a cognitive modeler is when using RNNs is better than symbolic models that are more easily interpretable (e.g., hierarchical Bayesian models that allow overhypotheses to define a wider space of latent hypotheses).

At the very end of Pater’s article, I see a potentially exciting path forward with the advent of RNNs (or other models with distributed representations) that are interpretable. I’m definitely a fan of techniques that allow the learning of hidden structure without it being explicitly encoded — this is the same thing I see in hierarchical Bayesian overhypotheses. More on this below (and in my commentary for next time).

Specific thoughts:

(1) I couldn’t agree more with the importance of incorporating statistical approaches more thoroughly into learning/acquisition theories, but I remain to be sold on the neural networks side. It really depends on what kind of network: are they matching neurobiology (e.g., see Avery and Krichmar 2017, Beyeler, Rounds, Carlson, Dutt, & Krichmar 2017, Krichmar, Conrad, & Asada 2015; Neftci, Augustine, Paul, & Detorakis 2017, Neftci, Binas, Rutishauser, Chicca, Indiveri, & Douglas 2013) or are they a computational-level distributed representations approach (I think this is what most RNNs are), which seems hard to decipher, and so less useful for exploring symbolic theories more completely? Maybe the point is to explore non-symbolic theories.

Pater notes the following about non-symbolic approaches: “...it is hard to escape the conclusion that a successful theory of learning from realistic data will have a neural component.” If by neural, Pater means an implementational-level description, sure. But I’m not sold on distributed representations as being necessary for a successful theory of learning -- a theory can operate at the computational or algorithmic levels.

(2) I completely agree that structure-independent representations (statistical sequences that don’t involve phrases, etc.) can only get you so far. The interesting thing from an NLP standpoint, of course, is exactly how far they can get you — which often turns out to be surprisingly far. In fact, it’s often much further than I would have expected — e.g., n-grams over words (not even syntactic categories!!) work remarkably well as features for opinion spam detection, with near 90% classification accuracy (Ott et al. 2011, 2013). Though I guess n-grams do heuristically encode some local structure.
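
For flavor, word n-gram features of the kind used in that work are about this simple to extract -- a generic scikit-learn sketch with made-up sentences, not Ott et al.’s actual pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Generic word n-gram feature extraction (made-up sentences, not Ott et al.'s
# actual pipeline): unigrams + bigrams over raw words, no syntactic categories.
docs = ["the room was clean and the staff was friendly",
        "my husband and I loved the luxurious spa"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(docs)

print(features.shape)                          # (2 documents, n-gram vocabulary size)
print(vectorizer.get_feature_names_out()[:5])  # a peek at the n-gram features
```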

(3) RNNs seem to need to incorporate hierarchical representations to work well (e.g., the Recurrent Neural Network Grammars of Dyer et al. 2016, and other work incorporating hierarchical structure into current neural network approaches in AI/NLP). But sequence-to-sequence models do pretty well without explicit structure encoded in. So, if sequence-to-sequence models can handle aux-inversion (e.g., as in McCoy, Frank, & Linzen 2018...well, at least sort of -- it’s not clear they handle it the way humans do), what do we make of that from the linguistic cognition perspective?

This comes back to the question of model interpretation. With symbolic models, it’s usually clear what theory of representation is being evaluated. For RNNs, do we know what the distributed representations/continuous hypotheses are encoding? (This of course is less a problem from the engineering perspective -- we’re happy if we can get the machines to do it as well or better than humans.) As Pater noted, some read-out can be done with clever model comparisons, and some distributed representations (e.g., Palangi et al’s (2017) Tensor Product Recurrent Networks) may in fact encode syntactic structures we recognize. So then, the question is what we’re getting from the distributed representation.

Pater: “...it is given the building blocks of symbols and their roles, but must learn their configurations”. This starts to sound like the latent vs. explicit hypothesis space construction of Perfors (2012), which can be implemented in a variety of ways (e.g., variational learning as in Yang 2002). That is, RNNs allow the modeler to specify the building blocks but let the model construct the explicit hypotheses that get evaluated, based on its prior biases (RNN architecture, Bayesian overhypothesis hyperparameters, etc.). Something that could be interesting: the RNN version allows construction of explicit hypotheses from the building blocks that are outside what the modeler would have built in to the overhypothesis parameters; that is, they may be perfectly reasonable hypotheses from the given building blocks, but go against the natural overhypothesis-style parametric biases and so would get a low probability of being generated (and subsequently evaluated).

Since the RNN generates hypotheses with whatever architectural biases mold the explicit hypothesis construction, it may give higher probability to hypotheses that were lower-probability for a hierarchical Bayesian model.  That is, the Bayesian overhypotheses may be quite general (especially if we back off to over-over-hypotheses, and so on), but still require an explicit bias at some level for how hypotheses are generated from overhypotheses. That has to be specified by the modeler. This may cause Bayesian modelers to miss ways that certain building blocks can generate the kinds of linguistic hypotheses we want to generate.
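
To make the point that the modeler still has to specify something concrete, here’s the shape of a minimal overhypothesis-style generator -- a toy with no particular linguistic content, where I had to hand-pick both the hyperprior and the way explicit hypotheses get generated from it:

```python
import random

# Toy overhypothesis-style generator (no particular linguistic content).
# An overhypothesis setting (how "head-final-ish" the grammar is overall) is
# drawn from a hand-picked hyperprior, and explicit hypotheses (per-construction
# orders) are generated from it. The point: the modeler chose both the hyperprior
# and the generator, and that choice shapes which hypotheses ever get proposed.

def sample_explicit_hypothesis(constructions):
    head_final_bias = random.betavariate(1, 1)   # hand-picked hyperprior form
    return {c: ("head-final" if random.random() < head_final_bias else "head-initial")
            for c in constructions}

print(sample_explicit_hypothesis(["VP", "PP", "NP"]))
```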

An analogy: Genetic algorithms can be used to identify solutions that humans didn’t think of because they employ a much wider search of the latent hypothesis space; humans are fettered by their biases for what an optimal solution is going to look like.  Here: symbolic modelers may be fettered by ideas about how building blocks can be used to generate explicit hypotheses; RNNs may allow a wider search of the latent hypothesis space because they’re bound by different (implicit) ideas, via the RNN architecture. So, the solution an RNN comes up with (assuming you can interpret it) may provide a novel representational option, based on the building blocks given to it.

Bigger point: RNNs and distributed representations may provide a novel way of exploratory theorizing (especially for syntactic learning), to the extent that their innards are interpretable. For theory evaluation, on the other hand, it’s better to go with a symbolic model that’s already easy to understand… unless your theory is about the building blocks, leaving the explicit hypotheses they build and evaluate unspecified.