Wednesday, April 20, 2016

Some thoughts on Yang 2016 in press

As always, it’s a real pleasure for me to read things by Yang because of how clearly his viewpoints are laid out. For this paper in particular, it’s plain that Yang is underwhelmed by the Bayesian approach to cognitive science (and language acquisition in particular). I definitely understand some of the criticisms (and I should note that I personally love the Tolerance Principle that Yang advocates as a viable alternative). However, I did feel obliged to put on my Bayesian devil’s advocate hat here at several points. 

Specific comments:

(1) The Evaluation Metric (EvalM) is about choosing among alternative hypotheses (presumably balancing fit with simplicity, which is one of the attractive features of the Bayesian approach). If I’m interpreting things correctly, the EvalM was meant to be specifically linguistic (embedded in a linguistic hypothesis space) while the Bayesian approach isn’t. So, simplicity needs to be defined in linguistically meaningful ways. Speaking as a Bayesian devil’s advocate, I don’t see this as incompatible with having a general preference for simplicity that gets cashed out within a linguistic hypothesis space.

(2) Idealized learners

General: Yang’s main beef is with idealized approaches to language learning, but presumably, very particular ones, because of course every model is idealizing away some aspects of the learning process.

(a) Section 2: Yang’s opinion is that a good model cares about “what can be plausibly assumed” about “the actual language acquisition process”. Totally agreed. This includes what the hypothesis space is, which is crucially important for any acquisition model. It’s one of the things that an ideal learner model can check for: assuming the inference can be carried out to yield the best result, will this hypothesis space yield a “good” answer (however that’s determined)? If not, don’t bother doing an algorithmic-level process where non-optimal inferences might result, since the modeled child is already doomed to fail. That is, ideal learner models of the kind that I often see (e.g., Goldwater et al. 2009, Perfors et al. 2011, Feldman et al. 2013, Dillon et al. 2013) are useful for determining if the acquisition task conceptualization, as defined by the hypothesis space and realistic input data, is reasonable. This seems like an important sanity check before you get into more cognitively plausible implementations of the inference procedure that’s going to operate over this hypothesis space, given these realistic input data. In this vein, I think these kinds of ideal learner models do in fact include relevant “representational structure”, even if it’s true that they leave out the algorithmic process of inference and the neurobiological implementation of the whole thing (representation, hypothesis space, inference procedure, etc.).

(b) This relates to the comment in Section 2 about how “surely an idealized learner can read off the care-taker’s intentional states” — well, sure, you could create an idealized learner that does that. But that’s not a reasonable estimate of the input representation a child would have, and so a reasonable ideal learner model wouldn’t do it. Again, I think it’s possible to have an ideal learner model that doesn’t idealize plausible input representation.

Moreover, I think this kind of ideal learner model fits in with the point made about Marr’s view on the interaction of the different levels of explanation, i.e., “a computational level theory should inform the study of the algorithmic and implementational levels”. So, you want to make sure you’ve got the right conceptualization of the acquisition task first (computational-level). Then it makes sense to explore the algorithmic and implementational levels more thoroughly, with that computational-level guidance.

(3) Bayesian models & optimality

(a) Section 3: While it’s true that Bayesian acquisition models build in priors such as preferring “grammars with fewer symbols or lexicons with shorter words”, I always thought that was a specific hypothesis these researchers were making concrete. That is, these are learning assumptions which might be true. If they are (i.e., if this is the conceptualization of the task and the learner biases), then we can see what the answers are. Do these answers then match what we know about children’s behavior (yes or no)? So I don’t see that as a failing of these Bayesian approaches. Rather, it’s a bonus — it’s very clear what’s built in (preferences for these properties) and how exactly it’s built in (the prior over hypotheses). And if it works, great, we have a proposal for the assumptions that children might be bringing to these problems. If not, then maybe these aren’t such great assumptions, which makes it less likely children have them.

(b) Section 3: In terms of model parameters being tuned to fit behavioral data, I’m not sure I see that as quite the problem Yang does. If you have free parameters, that means those are things that could matter (and presumably have psychological import). So, knowing which values the model needs in order to fit the behavioral data tells you what those values should be for humans.

(c) Section 3:  For likelihoods, I’m also not sure I’m as bothered about them as Yang is. If you have a hypothesis and you have data, then you should have an idea of the likelihood of the data given that hypothesis. In some sense, doesn’t likelihood just fall out from hypothesis + data? In Yang’s example of the probability of a particular sentence given a specific grammar, you should be able to calculate the probability of that sentence if you have a specific PCFG. It could be that Yang’s criticism is more about how little we know about human likelihood calculation. But I think that’s one of the learner assumptions — if you have this hypothesis space and this data and you calculate likelihoods this way (because it follows from the hypothesis you have and the data you have), then these are the learning results you get.
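
To make the “likelihood falls out from hypothesis + data” point concrete, here is a minimal Python sketch of computing a sentence’s probability under a specific PCFG with the inside algorithm. The toy grammar, its rule probabilities, and the function name sentence_probability are all my own invented assumptions for illustration, not anything from Yang’s paper.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form. All rules and probabilities are
# invented for illustration; probabilities for each parent sum to 1.
BINARY = {  # (parent, left child, right child) -> rule probability
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("NP", "NP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {  # (parent, word) -> rule probability
    ("NP", "kids"): 0.4,
    ("NP", "dogs"): 0.3,
    ("NP", "sticks"): 0.1,
    ("V", "see"): 1.0,
    ("P", "with"): 1.0,
}


def sentence_probability(words, start="S"):
    """Return P(words | grammar), summed over all parses (inside algorithm)."""
    n = len(words)
    # inside[(i, j)][X] = probability that nonterminal X derives words[i:j]
    inside = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for (parent, w), p in LEXICAL.items():
            if w == word:
                inside[(i, i + 1)][parent] += p
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for (parent, left, right), p in BINARY.items():
                    inside[(i, j)][parent] += (
                        p * inside[(i, k)][left] * inside[(k, j)][right]
                    )
    return inside[(0, n)][start]


if __name__ == "__main__":
    print(sentence_probability("kids see dogs".split()))
    print(sentence_probability("kids see dogs with sticks".split()))
```

With this toy grammar, “kids see dogs” has a single parse (probability about 0.084), while “kids see dogs with sticks” has two parses whose probabilities are summed (about 0.0042). Whether children compute anything like this is of course a separate question, which is exactly the learner assumption being made explicit.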

(d) 3.1, variation: I think a Bayesian learner is perfectly capable of dealing with variation. It would presumably infer a distribution over the various options. In fact, as far as I know, that’s generally what the Bayesian acquisition models do. The output at any given moment may be either the maximum a posteriori probability choice or a probabilistic sample from that distribution, so you just get one output, but that doesn’t mean the learner doesn’t have the distribution underneath. This seems like exactly what Yang would want when accounting for variation for a particular linguistic representation within an individual. That said, his criticism of a Bayesian model that has to select the maximum a posteriori option as its underlying representation is perfectly valid. It’s just that this is only one kind of Bayesian model, not all of them.
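
Here is a rough Python sketch of that distinction, under assumptions I’m making up for illustration: two competing variants, invented input counts, and a symmetric Dirichlet prior (alpha = 1). The same underlying distribution yields a single invariant output if the learner always picks the most probable option, and variable output if the learner samples.

```python
import random


def posterior_predictive(counts, alpha=1.0):
    """Probability of each variant given observed counts and a
    symmetric Dirichlet prior with pseudo-count alpha (an assumption)."""
    total = sum(counts.values()) + alpha * len(counts)
    return {variant: (c + alpha) / total for variant, c in counts.items()}


def map_output(dist):
    """Always produce the single most probable variant."""
    return max(dist, key=dist.get)


def sampled_output(dist, rng):
    """Produce one variant, drawn in proportion to its probability."""
    variants = list(dist)
    weights = [dist[v] for v in variants]
    return rng.choices(variants, weights=weights, k=1)[0]


# Hypothetical input: the learner has heard variant_A 70 times, variant_B 30 times.
counts = {"variant_A": 70, "variant_B": 30}
dist = posterior_predictive(counts)

rng = random.Random(0)  # fixed seed so the run is reproducible
productions = [sampled_output(dist, rng) for _ in range(1000)]

if __name__ == "__main__":
    print("distribution:", dist)
    print("MAP output:", map_output(dist))
    print("share of variant_A when sampling:",
          productions.count("variant_A") / len(productions))
```

The MAP learner only ever produces variant_A, while the sampling learner produces both variants at roughly the rates in its distribution. Both learners have the same distribution underneath; they differ only in how an output is read off of it.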

(e) 3.2: For the discussion about exploding hypothesis spaces, I think there’s a distinction between explicit vs. latent hypothesis space for every ideal learner model I’ve ever seen. Perfors (2012) talks about this some, and the basic idea is that the child doesn’t have to consider an infinite (or really large) number of hypotheses explicitly in order to search the hypothesis space. Instead, the child just has to have the ability to construct explicit hypotheses from that latent space. (Which always reminds me of using linguistic parameter values to construct grammars like Yang’s variational learner does, for instance.)

Perfors, A. (2012). Bayesian Models of Cognition: What's Built in After All? Philosophy Compass, 7(2), 127-138.

(f) 3.2: I admit, I find the line of argumentation about output comparison much more convincing. If one model (e.g., a reinforcement learning one) yields better learning results than another (e.g., a Bayesian one), then I’m interested.

(g) 3.2: “Without a feasible means of computing the expectations of hypotheses…indirect negative evidence is unusable.” — Agreed that this is a problem (for everyone). That’s why the hypothesis space definition seems so crucial. I wonder if there’s some way to do a “good enough” calculation, though. That is, given the child’s current understanding of the (grammar) options, can the approximate size of one grammar be calculated? This could be good enough, even if it’s not exactly right.

(h) 3.2: “…use a concrete example to show that indirect negative evidence, regardless of how it is formulated, is ineffective when situated in a realistic setting of language acquisition”. — This is a pretty big claim. Again, I’m happy to walk through a particular example and see that it doesn’t work. But I think it’s a significant step to go from that to the idea that it could never work in any situation.

(i) 3.3, overhypothesis for the a-adjective example: To me, the natural overhypothesis for the a-adjectives is with the locative particles and prepositional phrases. So, the overhypothesis is about elements that behave certain ways (predicative = yes, attributive = no, right-adverb modification = yes), and the specific hypotheses are about the a-adjectives vs. the locative particles vs. the prepositional phrases, which have some additional differences that distinguish them. That is, overhypotheses are all about leveraging indirect positive evidence like the kind Yang discusses for a-adjectives. Overhypotheses (not unlike linguistic parameters) are the reason you get equivalence classes even though the specific items may seem pretty different on the surface. Yang seems to touch on this in footnote 11, but then uses it as a dismissal of the Bayesian framework. I admit, I found that puzzling. To me, it seems to be a case of translating an idea into a more formal mathematical version, which seems great when you can do it.

(4) Tolerance Principle

(a) 4.1: Is the Elsewhere Condition only a principle “for the organization of linguistic information”? I can understand that it’s easily applied to linguistic information, but I always assumed it’s meant to be a (domain-)general principle of organization.

(b) 4.2: I like seeing the Principle of Sufficiency (PrinSuff) explicitly laid out since it tells us when to expect generalization vs. not. That said, I was a little puzzled by this condemnation of indirect negative evidence that was based on the PrinSuff: “That is, in contrast to the use of indirect negative evidence, the Principle of Sufficiency does not conclude that unattested forms are ungrammatical….”. Maybe the condemnation is about how the eventual conclusion of most inference models relying on indirect negative evidence is that the item in question would be ungrammatical? But this seems all about interpretation: these inference models could just as easily set up the final conclusion of “not(grammatical)” as “I don’t know that it’s grammatical” (the way the PrinSuff does here) rather than “it’s ungrammatical”.
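
For concreteness, here is a small Python sketch of how I understand the PrinSuff decision. It assumes the usual threshold from Yang’s work, N / ln N, and the condition that a generalization over N items extends to all of them when the number of unattested items falls below that threshold. The function names, return labels, and example numbers are my own.

```python
import math


def tolerance_threshold(n_items):
    """Threshold N / ln N (assumed here to be the one the PrinSuff uses)."""
    if n_items < 2:
        raise ValueError("the threshold is only defined for N >= 2")
    return n_items / math.log(n_items)


def sufficiency_verdict(n_items, n_attested):
    """Decide whether a generalization over n_items extends to all of them,
    given that n_attested of them have been observed following it.

    Note the second outcome is 'undecided', not 'ungrammatical': the
    unattested forms are simply not licensed by the generalization.
    """
    unattested = n_items - n_attested
    if unattested < tolerance_threshold(n_items):
        return "generalize"
    return "undecided"


if __name__ == "__main__":
    # Hypothetical numbers: 100 relevant items, threshold is about 21.7.
    print(sufficiency_verdict(100, 80))  # 20 unattested items
    print(sufficiency_verdict(100, 70))  # 30 unattested items
```

The point about interpretation shows up in the return labels: nothing in the arithmetic forces the second outcome to be read as “ungrammatical”, and an inference model using indirect negative evidence could label its own non-generalization outcome the same way.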