Tuesday, October 18, 2016

Some thoughts on Kao et al. 2016

I really like the approach this paper takes, where insights from humor research are operationalized using existing formal metrics from language understanding. It’s the kind of approach I think is very fruitful for computational research because it demonstrates the utility of bothering to formalize things — in this case, the outcome is surprisingly good prediction about degrees of funniness and more explanatory power about why exactly something is funny. 

As an organizational note, I love the tactic the authors take here of basically saying “we’ll explain the details of this bit in the next section”, which is an implicit way of saying “here’s the intuition and feel free to skip the details in the next section if you’re not as interested in the implementation.” For me, one big challenge of writing up modeling results of this kind is the level of detail to include when you’re explaining how everything works. It’s tricky because of how wide an audience you’re attempting to interest. Put too little computational detail in and everyone’s irritated at your hand-waving; put too much in and readers get derailed from your main point.  So, this presentation style may be a new format to try.

A few more specific thoughts:

(1) I mostly followed the details of the ambiguity and distinctiveness calculations (ambiguity is about entropy, distinctiveness is about KL divergence of the indicator variables f for each meaning). However, I think it’s worth pondering more carefully how the part described at the end of section 2.2, which feeds into the (R)elatedness calculation, actually works. If we’re getting relatedness scores between pairs of words (R(w_i, h), where h is the homophone and w_i is another word from the utterance), then how do we compile those together to get the R(w_i, m) that shows up in equation 8? For example, where does the free parameter r (which was empirically fit to people’s funniness judgments and captures a word’s own relatedness) show up?
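(Just to keep those two metrics straight for myself, here’s a tiny sketch of the computations as I understand them: entropy over the meaning distribution for ambiguity, and a symmetrized KL divergence over the f indicator distributions for distinctiveness. The toy numbers, the per-word treatment of f, and the smoothing are all mine, not Kao et al.’s actual implementation.)

```python
import numpy as np

def ambiguity(p_meanings):
    """Entropy over the distribution of meanings given the sentence.
    Higher entropy = the sentence supports both meanings more equally."""
    p = np.asarray(p_meanings, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log2(p + 1e-12))

def distinctiveness(f_given_m1, f_given_m2):
    """Symmetrized KL divergence between the distributions over the
    indicator variables f (which words support which meaning) under
    the two meanings. Higher divergence = different words support
    the two meanings."""
    p = np.asarray(f_given_m1, dtype=float); p = p / p.sum()
    q = np.asarray(f_given_m2, dtype=float); q = q / q.sum()
    kl_pq = np.sum(p * np.log2((p + 1e-12) / (q + 1e-12)))
    kl_qp = np.sum(q * np.log2((q + 1e-12) / (p + 1e-12)))
    return 0.5 * (kl_pq + kl_qp)

# Pun-like case: both meanings plausible, supported by different words.
print(ambiguity([0.55, 0.45]), distinctiveness([0.8, 0.1, 0.1], [0.1, 0.1, 0.8]))
# Non-pun case: one meaning dominates, and word support looks similar.
print(ambiguity([0.95, 0.05]), distinctiveness([0.5, 0.3, 0.2], [0.4, 0.3, 0.3]))
```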

(2) I really like that this model is able to pull out exactly which words correspond to which meanings. This feature reminds me of topic models, where each word is generated by a specific topic. Here, a word is generated by a specific meaning (or at least, I think that’s what Figure 1 shows, with the idea that m could be a variety of meanings).

(3) I always find it funny in computational linguistics research that the general language statistics portion (here, in the generative model) can be captured effectively by a trigram model. The linguist in me revolts, but the engineer in me shrugs and thinks that if it works, then it’s clearly good enough. The pun classification results here are simply another example of how computational linguists often use trigram models to approximate language structure and find them adequate for most of what they want to do. Maybe for puns (and other joke types) that involve structural ambiguity rather than phonological ambiguity, we’d need something more than trigrams, though.
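(For concreteness, here’s the kind of trigram approximation I mean — a toy sketch with a made-up two-sentence corpus and simple add-alpha smoothing, nothing like the n-gram model Kao et al. actually train:)

```python
from collections import defaultdict

# Toy corpus; a real model would use large counts plus better smoothing
# (e.g., Kneser-Ney) rather than these raw estimates.
corpus = ["<s> <s> the magician got so mad he pulled his hare out </s>".split(),
          "<s> <s> the magician pulled a rabbit out of the hat </s>".split()]

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sent in corpus:
    for i in range(2, len(sent)):
        trigram_counts[(sent[i-2], sent[i-1], sent[i])] += 1
        bigram_counts[(sent[i-2], sent[i-1])] += 1

def trigram_prob(w, w1, w2, alpha=0.1, vocab_size=1000):
    """P(w | w1 w2) with simple add-alpha smoothing."""
    return ((trigram_counts[(w1, w2, w)] + alpha) /
            (bigram_counts[(w1, w2)] + alpha * vocab_size))

def sentence_prob(words):
    """Probability of a sentence under the trigram approximation:
    a product of P(word | two previous words)."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob(padded[i], padded[i-2], padded[i-1])
    return p

print(sentence_prob("the magician pulled his hare out".split()))
```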

Wednesday, June 1, 2016

Some thoughts on Wellwood et al. 2016

In general, I love seeing this combination of behavioral and modeling work, and I’m also a big fan of the cue reliability vs. cue accessibility approach that Gagliardi’s work tends to have (ex: Gagliardi et al. 2012, 2014). That said, I had some difficulty following the details of the model that worked best (Model 4), and I’d really like to understand it better (more about this below).

Specific thoughts:

(1) The partitive frame: The partitive frame (“Gleeb of the cows are by the barn”) is an excellent syntactic signal for adults that the word is a quantity word (exact number, quantifier like “most”, etc.). So, that would signal the sense of numerosity, either exact or approximate. Based on the Wynn (1992) work, it seems like two-and-a-half-year-olds recognize this numerosity-ness of exact number words. Yet I wonder how prevalent the partitive frame is — I’m sure someone must have done a corpus analysis of child-directed speech (I’m thinking of some of the former students of Barbara Sarnecka here at UCI).

My intuition is that the partitive frame itself isn’t all that common. (Note: W&al2016 mention a corpus analysis by Syrett et al. 2012 of the partitive frame itself that shows this frame isn’t unambiguous for numerosity words, but don’t mention how often the partitive frame itself occurs.) My intuition might be wrong, but if it’s true, I wonder what other cues are available syntactically in order for numerosity to be associated with exact number words so early.  Maybe a more general syntactic distribution sort of thing? This may be important, given that the partitive frame isn’t an unambiguous cue to four-year-old children that the meaning is numerosity-focused. 

On the other hand, for the behavioral results, W&al2016 are working with four-year-olds who may have more experience with the partitive frame in their input. Certainly, the partitive frame appears to be  a very reliable cue to numerosity meanings (at least when the novel word is in the determiner position: [Det position] “gleebest of the…” vs. [Adj position] “The gleebest of the…”). 

It was also useful to learn (in the next section) that determiners only refer to quantities cross-linguistically, so it seems to be a mapping that languages use. A lot. (Interesting question: Where would this bias come from? Built-in (i.e., UG) or (always) derivable somehow?)

(2) A role for informativity?: The example in (5) about why we can’t say “heaviest of the animals” (but we can say “Most of the animals”) reminds me of Greg Scontras’s work on informativity, where, for example, adjective ordering preferences depend on how much uncertainty there is on the part of the listener (ex: big red boxes vs. *red big boxes, as found in Scontras et al. 2015). I wonder if there’s a useful link there from the developmental perspective about which words get mapped to the determiner position. (Or perhaps, why the link between determiner position and permutation invariant words like most would be established.)

(3) The most of the cows were…:  My dialect of English utterly fails to allow 6b (“The most of the cows were by the barn.”) I can say something like “The majority of…”, but I just can’t handle “The most of the…”. Hopefully this cue (appearance in the adjectival position of the partitive frame) isn’t too critical a property of superlative acquisition in general. I guess in my case, it’s a cue for ruling out the numerosity meaning and really zeroing in on the quality meaning. So maybe the reliability of the syntactic cues is cleaner than for the dialect that allows “The most of the cows”? (That is, in the experimental stimuli in Table 1, “the gleebest of the cows” isn’t a confounded cue for me. It’s strictly a quality-meaning indicator when the word has the -est morphology.)  So, following up on this, I’m not surprised in Figure 3 that the Adjective [“the gleebest cows”] and Confounded [“the gleebest of the cows”] results look alike (i.e., children infer a quality meaning for “gleebest” in these contexts).

(4) Understanding the model variants: 

(a) Model 3 (Lexical + Conceptual bias): I couldn’t quite tell from the text, but does “joint prior” mean there’s equal weight for the lexical and the conceptual prior?

(b) Model 4 (+Perceptual Bias): W&al2016 describe it as “…combining the lexical prior with the intuition that salience impacts how the likelihood, P(d|h), could be encoded with differing reliability for each hypothesis.” 

While I deeply appreciate Tables 5 and 7, I really wish we could see the full equation where the alpha, beta, and gamma terms appear. If I’m interpreting the text correctly, these parameters are meant to alter the likelihood calculation. In Models 1-3, likelihood = 1 and all the work is done in the prior. In Model 4, likelihood is 1 with some probability depending on alpha and beta (and not computed = 0 otherwise?). And then Model 4 is a mixture model of all four of the encoding options (A, B, C, and D). So are these weighted somehow when they’re combined into one probability?

I think they may be weighted based on the alpha and beta parameters (“…combined in a mixture model (the sum of all four terms)” = the alpha, beta, gamma, etc. parameters are the weights). My natural inclination is to think “with probability alpha, A happens; with probability beta, B happens; with probability alpha * beta, C happens; with probability 1 - p(A or B or C), D happens”, which would then lead to a simple summation like they describe.
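Just to make that reading concrete, here’s a sketch of how such a mixture could be assembled. To be clear, the encoding functions, the likelihood-is-1-or-0 convention, and the exact weighting are my guesses following the intuition above — this is not W&al2016’s actual equation:

```python
def likelihood(d, h, alpha, beta, encodings):
    """P(d | h) as a weighted sum ("mixture") over four encoding options.

    encodings maps "A".."D" to functions returning 1.0 if the data d is
    consistent with hypothesis h under that encoding, else 0.0
    (mirroring the likelihood-is-1-or-0 setup of Models 1-3)."""
    weights = {
        "A": alpha,                              # with probability alpha, A happens
        "B": beta,                               # with probability beta, B happens
        "C": alpha * beta,                       # with probability alpha * beta, C happens
        "D": 1 - (alpha + beta + alpha * beta),  # otherwise, D happens
    }
    # For this to be a proper mixture, the weights must be non-negative;
    # they sum to 1 by construction here.
    return sum(weights[k] * encodings[k](d, h) for k in weights)

# Toy usage with dummy consistency checks (everything consistent), using
# the Table 6 values alpha = 0.2, beta = 0.025:
enc = {k: (lambda d, h: 1.0) for k in "ABCD"}
print(likelihood(d=None, h=None, alpha=0.2, beta=0.025, encodings=enc))  # -> 1.0
```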

Small thing: based on Figure 5, where did the values in Table 6 come from (alpha = quantity confusion = 0.2, beta = quality confusion = 0.025)?


Gagliardi, A., Feldman, N. H., & Lidz, J. (2012). When suboptimal behavior is optimal and why: Modeling the acquisition of noun classes in Tsez. In Proceedings of the 34th annual conference of the Cognitive Science Society (pp. 360-365).

Gagliardi, A., & Lidz, J. (2014). Statistical insensitivity in the acquisition of Tsez noun classes. Language, 90(1), 58-89.

Scontras, Gregory, Judith Degen & Noah D. Goodman. 2015. Subjectivity predicts adjective ordering preferences. Manuscript from http://web.stanford.edu/~scontras/Gregory_Scontras.html.

Wednesday, May 18, 2016

Some thoughts on Moscati & Crain 2014

This paper really highlights to me the impact of pragmatic factors on children’s interpretations, something that I think we have a variety of theories about but maybe not as many formal implementations of (hello, RSA framework potential!). Also, I’m a fan of the idea of the Semantic Subset, though not as a linguistic principle, per se.  I think it could just as easily be the consequence of Bayesian reasoning applied over a linguistically-derived hypothesis space. But the idea that information strength matters is one that seems right to me, given what we know about children’s sensitivities to how we use language to communicate. 

That being said, I’m not quite sure how to interpret the specific results here (more details on this below). Something that becomes immediately clear from the learnability discussions in M&C2014 is the need for corpus analysis to get an accurate assessment of what children’s input looks like for all of these semantic elements and their various combinations.

Specific thoughts:

(1) Epistemic modals and scope
John might not come. vs. John cannot come: I get that might not is less certain than cannot, and so entailment relations hold. But the scope assignment ambiguity within a single utterance seems super subtle.

(a) Ex: John might not come. 

surface: might >> not: It might be the case that John doesn’t come. (Translation: There’s some non-zero probability of John not coming.)
inverse: not >> might: It’s not the case that John might come. (Translation: There’s 0% probability that John might come.  = John’s definitely not coming.)

Even though the inverse scope option is technically available, do we actually ever entertain that interpretation in English? It feels more to me like “not all” utterances (ex: “Not all horses jumped over the fence”) — technically the inverse scope reading is there (all >> not = “none”), but in practice it’s effectively unambiguous in use (always interpreted as “not all”).

(b) Ex: John cannot come.
surface: can >> not: It can be the case that John doesn’t come. (Translation: There’s some non-zero probability of John not coming.)
inverse: not >> can: It’s not the case that John can come. (Translation: There’s 0% probability that John can come. = John’s definitely not coming.)

Here, we get the opposite feeling about how can is used. It seems like the inverse scope is the only interpretation entertained. (And I think M&C2014 effectively say this in the “Modality and Negation in Child Language” section, when they’re discussing how can’t and might not are used in English.)

I guess the point for M&C2014 is that this is the salient difference between might not and cannot. It’s not surface word order, since that’s the same. Instead, the strongly preferred interpretation differs depending on the modal, and it’s not always the surface scope reading. This is what they discuss as a polarity restriction in the introduction, I think. (Though they talk about might allowing both readings, and I just can’t get the inverse scope one.)

(2) Epistemic modals, negation, and input: Just from an input perspective, I wonder how often English children hear can’t vs. cannot (and then we can compare that to mightn’t vs. might not). My sense is that can’t is much more frequent than cannot, while might not is much more frequent than mightn’t. One possible learning story component: The reason we have a different favored interpretation for cannot is that we first encounter it as a single lexical item can’t, and so treat it differently than an item like might where we overtly recognize two distinct lexical elements, might and not. Beyond this, assuming children are sensitive to meaning (especially by five years old), I wonder how often they hear can’t (or cannot) used to effectively mean “definitely not” (favored/only interpretation for cannot) vs. might not used to mean “possibly not” (favored/only interpretation for might not).

(3) Conversational usage:

(a) Byrnes & Duff 1989: Five-year-olds don’t seem to distinguish between “The peanut can’t be under the cup” and “The peanut might not be under the box” when determining the location of the peanut. I wonder how adults did on this task. Basically, it’s a bit odd information-wise to get both statements in a single conversation. As an adult, I had to do a bit of meta-linguistic reasoning to interpret this: “Well, if it might not be under the box, that’s better than ‘can’t’ be under the cup, so it’s more likely to be under the box than the cup. But maybe it’s not under the box at all, because the speaker is expressing doubt that it’s under there.” In a way, it reminds me of some of the findings of Lewis et al. (2012) on children’s interpretations of false belief task utterances as literal statements of belief vs. parenthetical endorsements. (Ex: “Hoggle thinks Sarah is the thief”: literal statement of belief = this is about whether Hoggle is thinking something; parenthetical endorsement: there’s some probability (according to Hoggle) that Sarah is the thief.) Kids hear these kinds of statements as parenthetical endorsements way more than they hear them as literal statements of belief in day-to-day conversation, and so interpret them as parenthetical endorsements in false belief tasks. That is, kids are assuming this is a normal conversation and interpreting the statements as they would be used in normal conversation.

Lewis, S., Lidz, J., & Hacquard, V. (2012, September). The semantics and pragmatics of belief reports in preschoolers. In Semantics and Linguistic Theory (Vol. 22, pp. 247-267).

(b) Similarly, in Experiment 1, I wonder again about conversational usage. In the discussion of children’s responses to the Negative Weak true items like “There might not be a cow in the box” (might >> not: It’s possible there isn’t a cow), many children apparently responded False because “A cow might be in the box.” Conversationally, this seems like a perfectly legitimate response. The tricky part is whether the original assertion is false, per se, rather than simply not the best utterance to have selected for this scenario.

(4) The hidden “only” hypothesis: 

In Experiment 1, M&C2014 found on the Positive True statements (“There is a cow in the box” with the child peeking to see if it’s true) that children were only at ~51.5% accuracy. This is weirdly low, as M&C2014 note. They discuss this as having to do with the particle “also”, suggesting a link to the “only” interpretation, i.e., children were interpreting this as “There is only a cow in the box.” (Side note: M&C2014 talk about this as “There might only be a cow in the box.”, which is odd. I thought the Positive and Negative sentences were just the bare “There is/isn’t an X in the box.”) Anyway, they designed Experiment 2 to address this specific weirdness, which is nice.

In Experiment 2 though, there seems to me to be a potential weirdness with statements like “There might not be only a cow in the box”. Only has its own scopal impacts, doesn’t it? Even if might takes scope over the rest, we still have might >> not >> only (= “It’s possible that it’s not the case there’s only a cow.” = There may be a cow and something else (as discussed later on in examples 44 and 45) = infelicitous in this setup where you can only have one animal = unpredictable behavior from kids). Another interpretation option is might >> only >> not (= “It’s possible that it’s only the case that it’s not a cow.” = may be not-a-cow (and instead be something else) = must be a horse in this setup = desired behavior from kids).

We then find that children in Experiment 2 decrease acceptance of Negative Weak True statements like “There might not be a cow in the box” to 33.3%. So, going with the hidden only story, they’re interpreting this as “It’s not the case that there might be (only) a cow in the box.” Again, we get infelicity if not >> only since there can only be one animal in the box at a time. But this could either be because of the interpretation above (not >> might >> only) or because of the interpretation might >> not >> only (which is the interpretation that follows surface scope, i.e., not reconstructed). So it’s not clear to me what this rejection by children means.

(5) Discussion clarification: What’s the difference between example 46 = “It is not possible that a cow is in the box” and example 48 = “It is not possible that there is a cow [is] in the box”? Do these not mean the same thing? And I’m afraid I didn’t follow the paragraph after these examples at all, in terms of its discussion of how many situations one vs. the other is true in.

(6) Semantic Subset Principle (SSP) selectivity: It’s interesting to note that M&C2014 say the SSP is only invoked when there are polarity restrictions due to a lexical parameter. So, this is why M&C2014 say it doesn’t apply when the quantifier every is involved (in response to Musolino 2006). This then presupposes that children need to know which words have a lexical parameter related to polarity restrictions and which don’t. How would they know this? Is the idea that they just know that some meanings (like quantifier every) don’t get them while others (like quantifier some) do? Is this triggered/inferrable from the input in some way?

Wednesday, May 4, 2016

Some thoughts on Snedeker & Huang 2016 in press

One of the things I really enjoyed about this book chapter was all the connections I can see for language acquisition modeling. An example of this for me was the discussion about kids’ (lack of) ability to incorporate pragmatic information of various kinds (more detailed comments on this below). Given that some of us in the lab are currently thinking about using the Rational Speech Act model to investigate quantifier scope interpretations in children, the fact that four- and five-year-olds have certain pragmatic deficits is very relevant. 

More generally, the idea that children’s representation of the input — which depends on their processing abilities — matters is exactly right (e.g., see my favorite Lidz & Gagliardi 2015 ref). As acquisition modelers, this is why we need to care about processing. Passives may be present in the input (for example) but that doesn’t mean children recognize them (and the associated morphology). That is, access to the information of the input has an impact, beyond the reliability of the information in the input, and access to the information is what children’s processing deals with.

More specific thoughts:

18.1: I thought it was interesting that there are some theories of adult sentence processing that actively invoke an approximation of the ideal observer as a reasonable model (ex: the McRae & Matsuki 2013 work that SH2016 cite). I suppose this is the foundation of the Rational Speech Act model as well, even though it doesn’t explicitly consider processing as an active process per se.

18.3: Something that generally comes out of this chapter is children’s poorer cognitive control (which is why they perseverate on their first choices). This seems like it could matter a lot in pragmatic contexts where children’s expectations might be violated in some way. They may show non-adult behavior not because they can’t compute the correct answer, but rather because they can’t get to it once they’ve built up a strong enough expectation for a different answer.

18.4: Here we see evidence that five-year-olds aren’t sensitive to the referential context when it comes to disambiguating an ambiguous PP attachment (as in “Put the frog on the napkin in the box”). (And this contrasts with their sensitivity to prosody.) So, not only do they perseverate on their first mistaken interpretation, but they apparently don’t utilize the pragmatic context information that would enable them to get the correct interpretation to begin with (i.e. there are two frogs so saying “the frog” is weird until you know which frog —  therefore “the frog on the napkin” as a unit makes sense in this communicative context). This insensitivity to the pragmatics of “the” makes me wonder how sensitive children are in general to pragmatic inferences that hinge on specific lexical items — we see in section 18.5 that they’re generally not good at scalar implicatures till later, but I think they can get ad-hoc implicatures that aren’t lexically based (Stiller et al. 2015). 

So, if we’re trying to incorporate this kind of pragmatic processing limitation into a model of children’s language understanding (e.g., cripple an adult RSA model appropriately), we may want to pay attention to what the pragmatic inference hinges on. That is, is it word-based or not? And which word is it? Apparently, children are okay if you use “the big glass” when there are two glasses present (Huang & Snedeker 2013). So it’s not just about “the” and referential uniqueness. It’s about “the” with specific linguistic ways of determining referential uniqueness, e.g., with PP attachment. SH2016 mention cue reliability in children’s input as one mitigating factor, with the idea that more reliable cues are what children pick first — and then they presumably perseverate on the results of what those reliable cues tell them.
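(For concreteness, here’s what I mean by an adult RSA model that could then be “crippled” — a toy two-frog lexicon of my own invention, not SH2016’s model or any published implementation. Lowering the speaker rationality parameter lambda is just one possible way to weaken the pragmatic inference:)

```python
import numpy as np

# Vanilla RSA sketch: literal listener L0, speaker S1, pragmatic listener L1.
utterances = ["the frog", "the frog on the napkin"]
referents = ["frog_on_napkin", "frog_on_towel"]

# Truth-conditional lexicon: rows = utterances, cols = referents.
lexicon = np.array([
    [1.0, 1.0],   # "the frog" is literally true of both frogs
    [1.0, 0.0],   # "the frog on the napkin" picks out only one
])

def L0(lexicon):
    """Literal listener: normalize truth conditions over referents."""
    return lexicon / lexicon.sum(axis=1, keepdims=True)

def S1(lexicon, lam=4.0, cost=None):
    """Speaker: soft-max of informativity (log L0) minus utterance cost."""
    cost = np.zeros(len(utterances)) if cost is None else cost
    util = np.log(L0(lexicon) + 1e-12) - cost[:, None]
    sp = np.exp(lam * util)
    return sp / sp.sum(axis=0, keepdims=True)   # normalize over utterances

def L1(lexicon, lam=4.0):
    """Pragmatic listener: Bayes over referents given the speaker model."""
    sp = S1(lexicon, lam)
    post = sp * (1.0 / len(referents))           # uniform prior over referents
    return post / post.sum(axis=1, keepdims=True)

print(L1(lexicon, lam=4.0))   # adult-like pragmatic inference
print(L1(lexicon, lam=0.1))   # weakened ("crippled") speaker model
```

With a rational speaker (high lambda), the pragmatic listener hearing “the frog” shifts toward the frog that isn’t on the napkin (since otherwise the speaker would have said “the frog on the napkin”); with lambda near 0, the interpretation stays near the literal 50/50, which looks more like the five-year-old pattern.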

18.6: It was very cool to see evidence of the abstract category of Verb coming from children’s syntactic priming studies. At least by three (according to the Thothathiri & Snedeker 2008 study), the abstract priming effects are just as strong as the within-verb priming effects, which suggests category knowledge that’s transferring from one individual verb to another. To be fair, I’m not entirely sure when the verb-island hypothesis folks expect the category Verb to emerge (they just don’t expect it to be there initially). But by three is already relatively early.

18.7: Again, something that comes to mind for me as an acquisition modeler is how to use the information here to build better models. In particular, if we’re thinking about causes of non-adult behavior in older children, we should look at the top-down information sources children might need to integrate into their interpretations. Children’s access to this information may be less than adults’ (or simply their ability to utilize it may be less, which may effectively work out to the same thing in a model).


Lidz, J., & Gagliardi, A. (2015). How nature meets nurture: Universal grammar and statistical learning. Annu. Rev. Linguist., 1(1), 333-353.

McRae, K., & Matsuki, K. (2013). Constraint-based models of sentence processing. In R. Van Gompel (Ed.), Sentence Processing (pp. 51-77). New York, NY: Psychology Press. 

Stiller, A. J., Goodman, N. D., & Frank, M. C. (2015). Ad-hoc implicature in preschool children. Language Learning and Development, 11(2), 176-190.

Wednesday, April 20, 2016

Some thoughts on Yang 2016 in press

As always, it’s a real pleasure for me to read things by Yang because of how clearly his viewpoints are laid out. For this paper in particular, it’s plain that Yang is underwhelmed by the Bayesian approach to cognitive science (and language acquisition in particular). I definitely understand some of the criticisms (and I should note that I personally love the Tolerance Principle that Yang advocates as a viable alternative). However, I did feel obliged to put on my Bayesian devil’s advocate hat here at several points. 

Specific comments:

(1) The Evaluation Metric (EvalM) is about choosing among alternative hypotheses (presumably balancing fit with simplicity, which is one of the attractive features of the Bayesian approach). If I’m interpreting things correctly, the EvalM was meant to be specifically linguistic (embedded in a linguistic hypothesis space) while the Bayesian approach isn’t. So, simplicity needs to be defined in linguistically meaningful ways. As a Bayesian devil’s advocate, this doesn’t seem incompatible with having a general preference for simplicity that gets cashed out within a linguistic hypothesis space.

(2) Idealized learners

General: Yang’s main beef is with idealized approaches to language learning, but presumably, very particular ones, because of course every model is idealizing away some aspects of the learning process.

(a) Section 2: Yang’s opinion is that a good model cares about “what can be plausibly assumed” about “the actual language acquisition process”. Totally agreed. This includes what the hypothesis space is — which is crucially important for any acquisition model. It’s one of the things that an ideal learner model can check for — assuming the inference can be carried out to yield the best result, will this hypothesis space yield a “good” answer (however that’s determined)? If not, don’t bother doing an algorithmic-level process where non-optimal inferences might result — the modeled child is already doomed to fail. That is, ideal learner models of the kind that I often see (e.g., Goldwater et al. 2009, Perfors et al. 2011, Feldman et al. 2013, Dillon et al. 2013) are useful for determining if the acquisition task conceptualization, as defined by the hypothesis space and realistic input data, is reasonable. This seems like an important sanity check before you get into more cognitively plausible implementations of the inference procedure that’s going to operate over this hypothesis space, given these realistic input data. In this vein, I think these kinds of ideal learner models do in fact include relevant “representational structure”, even if it’s true that they leave out the algorithmic process of inference and the neurobiological implementation of the whole thing (representation, hypothesis space, inference procedure, etc.).

(b) This relates to the comment in Section 2 about how “surely an idealized learner can read off the care-taker’s intentional states” — well, sure, you could create an idealized learner that does that. But that’s not a reasonable estimate of the input representation a child would have, and so a reasonable ideal learner model wouldn’t do it. Again, I think it’s possible to have an ideal learner model that doesn’t idealize plausible input representation.

Moreover, I think this kind of ideal learner model fits in with the point made about Marr’s view on the interaction of the different levels of explanation, i.e., “a computational level theory should inform the study of the algorithmic and implementational levels”. So, you want to make sure you’ve got the right conceptualization of the acquisition task first (computation-level). Then it makes sense to explore the algorithmic and implementational levels more thoroughly, with that computational-level guidance.

(3) Bayesian models & optimality

(a) Section 3: While it’s true that Bayesian acquisition models build in priors such as preferring “grammars with fewer symbols or lexicons with shorter words”, I always thought that was a specific hypothesis these researchers were making concrete. That is, these are learning assumptions which might be true. If they are (i.e., if this is the conceptualization of the task and the learner biases), then we can see what the answers are. Do these answers then match what we know about children’s behavior (yes or no)? So I don’t see that as a failing of these Bayesian approaches. Rather, it’s a bonus — it’s very clear what’s built in (preferences for these properties) and how exactly it’s built in (the prior over hypotheses). And if it works, great, we have a proposal for the assumptions that children might be bringing to these problems. If not, then maybe these aren’t such great assumptions, which makes it less likely children have them.

(b) Section 3: In terms of model parameters being tuned to fit behavioral data, I’m not sure I see that as quite the problem Yang does. If you have free parameters, that means those are things that could matter (and presumably have psychological import). So, knowing what values they need then tells you what those values should be for humans. 

(c) Section 3:  For likelihoods, I’m also not sure I’m as bothered about them as Yang is. If you have a hypothesis and you have data, then you should have an idea of the likelihood of the data given that hypothesis. In some sense, doesn’t likelihood just fall out from hypothesis + data? In Yang’s example of the probability of a particular sentence given a specific grammar, you should be able to calculate the probability of that sentence if you have a specific PCFG. It could be that Yang’s criticism is more about how little we know about human likelihood calculation. But I think that’s one of the learner assumptions — if you have this hypothesis space and this data and you calculate likelihoods this way (because it follows from the hypothesis you have and the data you have), then these are the learning results you get.
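(A toy version of what I mean: given a specific, made-up PCFG, the probability of a sentence just falls out as the sum over its parses of the product of rule probabilities — here there’s a single parse, so a single product:)

```python
# Toy PCFG: rule probabilities for each left-hand side sum to 1.
pcfg = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("the", "N")): 0.6,
    ("NP", ("John",)):    0.4,
    ("VP", ("V", "NP")):  0.5,
    ("VP", ("V",)):       0.5,
    ("N",  ("cat",)):     1.0,
    ("V",  ("sleeps",)):  1.0,
}

# The one parse of "John sleeps": S -> NP VP, NP -> John, VP -> V, V -> sleeps
parse = [("S", ("NP", "VP")), ("NP", ("John",)), ("VP", ("V",)), ("V", ("sleeps",))]

prob = 1.0
for rule in parse:
    prob *= pcfg[rule]
print(prob)   # 1.0 * 0.4 * 0.5 * 1.0 = 0.2 -> P("John sleeps" | this PCFG)
```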

(d) 3.1, variation: I think a Bayesian learner is perfectly capable of dealing with variation. It would presumably infer a distribution over the various options. In fact, as far as I know, that’s generally what the Bayesian acquisition models do. The output at any given moment may either be the maximum a posteriori probability choice or a probabilistic sample of that distribution, so you just get one output — but that doesn’t mean the learner doesn’t have the distribution underneath. This seems like exactly what Yang would want when accounting for variation for a particular linguistic representation within an individual. That said, his criticism of a Bayesian model that has to select the maximum a posteriori option as its underlying representation is perfectly valid — it’s just that this is only one kind of Bayesian model, not all of them.

(e) 3.2: For the discussion about exploding hypothesis spaces, I think there’s a distinction between explicit vs. latent hypothesis spaces for every ideal learner model I’ve ever seen. Perfors (2012) talks about this some, and the basic idea is that the child doesn’t have to consider an infinite (or really large) number of hypotheses explicitly in order to search the hypothesis space. Instead, the child just has to have the ability to construct explicit hypotheses from that latent space. (Which always reminds me of using linguistic parameter values to construct grammars like Yang's variational learner does, for instance.)

Perfors, A. (2012). Bayesian Models of Cognition: What's Built in After All?. Philosophy Compass, 7(2), 127-138.

(f) 3.2: I admit, I find the line of argumentation about output comparison much more convincing. If one model (e.g., a reinforcement learning one) yields better learning results than another (e.g., a Bayesian one), then I’m interested.

(g) 3.2: “Without a feasible means of computing the expectations of hypotheses…indirect negative evidence is unusable.” — Agreed that this is a problem (for everyone). That’s why the hypothesis space definition seems so crucial. I wonder if there’s some way to do a “good enough” calculation, though. That is, given the child’s current understanding of the (grammar) options, can the approximate size of one grammar be calculated? This could be good enough, even if it’s not exactly right.

(h) 3.2: “…use a concrete example to show that indirect negative evidence, regardless of how it is formulated, is ineffective when situated in a realistic setting of language acquisition”. — This is a pretty big claim. Again, I’m happy to walk through a particular example and see that it doesn’t work. But I think it’s a significant step to go from that to the idea that it could never work in any situation.

(i) 3.3, overhypothesis for the a-adjective example: To me, the natural overhypothesis for the a-adjectives is with the locative particles and prepositional phrases. So, the overhypothesis is about elements that behave certain ways (predicative = yes, attributive = no, right-adverb modification = yes), and the specific hypotheses are about the a-adjectives vs. the locative particles vs. the prepositional phrases, which have some additional differences that distinguish them. That is, overhypotheses are all about leveraging indirect positive evidence like the kind Yang discusses for a-adjectives. Overhypotheses (not unlike linguistic parameters) are the reason you get equivalence classes even though the specific items may seem pretty different on the surface. Yang seems to touch on this in footnote 11, but then uses it as a dismissal of the Bayesian framework. I admit, I found that puzzling. To me, it seems to be a case of translating an idea into a more formal mathematical version, which seems great when you can do it.

4. Tolerance Principle

(a) 4.1: Is the Elsewhere Condition only a principle “for the organization of linguistic information”? I can understand that it’s easily applied to linguistic information, but I always assumed it’s meant to be a (domain-)general principle of organization.

(b) 4.2: I like seeing the Principle of Sufficiency (PrinSuff) explicitly laid out since it tells us when to expect generalization vs. not. That said, I was a little puzzled by this condemnation of indirect negative evidence that was based on the PrinSuff: “That is, in contrast to the use of indirect negative evidence, the Principle of Sufficiency does not conclude that unattested forms are ungrammatical….”. Maybe the condemnation is about how the eventual conclusion of most inference models relying on indirect negative evidence is that the item in question would be ungrammatical? But this seems all about interpretation - these inference models could just as easily set up the final conclusion of “not(grammatical)” as “I don’t know that it’s grammatical” (the way the PrinSuff does here) rather than “it’s ungrammatical”.

Monday, February 29, 2016

Some thoughts on Goldberg & Boyd 2015

I definitely appreciated G&B2015’s clarification of how precisely statistical preemption and categorization are meant to work for learning about a-adjectives (or at least, one concrete implementation of it). In particular, statistical preemption is likened to blocking, which means the learner needs to have an explicit set of alternatives over which to form expectations. For a-adjectives, the relevant alternatives could be something like “the sleeping boy” vs. “the asleep boy”. If both are possible, then “the asleep boy” should appear sometimes (i.e., with some probability). When it doesn’t, this is because it’s blocked. Clearly, we could easily implement this with Bayesian inference (or as G&B2015 point out themselves, with simple error-driven learning), provided we have the right hypothesis space.

For example, H1 = only “the sleeping boy” is allowed, while H2 = “the sleeping boy” and “the asleep boy” are both allowed. H1 will win over H2 in a very short amount of time as long as children hear lots of non-a-adjective equivalents (like "sleeping") in this syntactic construction. The real trick is making sure these are the hypotheses under consideration. For example, there seems to be another reasonable way to think about the hypothesis space, based on the relative clause vs. attributive syntactic usage. H1 = “the boy who is asleep”; H2 = “the asleep boy” and “the boy who is asleep”. Here, we really need instances of relative-clause usage to drive us towards H1.
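(Here’s a quick sketch of that inference with made-up numbers, just to show how fast H1 wins. The 0.5 probability of producing the non-a-adjective alternative under H2 is my placeholder assumption — it’s where the Size Principle does its work:)

```python
# H1: only "the sleeping boy"-type (attributive non-a-adjective) is allowed.
# H2: both "the sleeping boy" and "the asleep boy" are allowed.
# All numbers are made up for illustration.

def posterior_h1(n_alternative_uses, prior_h1=0.5):
    """Posterior on H1 after hearing n attributive uses of the alternative
    ("the sleeping boy") and zero attributive uses of the a-adjective.

    Under H1, every relevant attributive use must be the alternative
    (likelihood 1 per datum). Under H2, each use could have been either
    form, so the alternative is assumed to have probability 0.5 each time."""
    like_h1 = 1.0 ** n_alternative_uses
    like_h2 = 0.5 ** n_alternative_uses
    prior_h2 = 1 - prior_h1
    return (prior_h1 * like_h1) / (prior_h1 * like_h1 + prior_h2 * like_h2)

for n in [0, 1, 5, 10, 20]:
    print(n, round(posterior_h1(n), 4))
# With ~10-20 attributive uses of "sleeping"-type alternatives and no
# attributive a-adjective uses, H1 already dominates.
```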

It makes me think about the more general issue of determining the hypothesis space that statistical preemption (or Bayesian inference, etc.) is supposed to operate over. G&B2015 explicitly note this themselves in the beginning of section 5, and talk more about hypothesis space construction in 5.2. For the a-adjective learning story G&B2015 promote, I would think some sort of recognition of the semantic similarity of words and the syntactic environments is the basis of the hypothesis space generation.

Some other thoughts:
(1) Section 1: I thought it was an interesting point about “afraid” being sucked into the a-adjective class even though it lacks the morphological property (aspectual “a-“ prefix + free morpheme, the way we see with “asleep”, “ablaze”, “alone”, etc.). This is presumably because of the relevant distributional properties categorizing it with the other a-adjectives? (That is, it’s “close enough”, given the other properties it has.)

(2) Section 2: Just as a note about the description of the experimental tasks, I wonder why they didn’t use novel a-adjectives that matched the morphological segmentation properties that the real a-adjectives and alternatives have, i.e., asleep and sleepy, so ablim and blimmy (instead of chammy).

(3) Section 3: G&B2015 note that Yang’s child-directed survey didn’t find a-adjectives being used in relative clauses (i.e., the relevant syntactic distribution cue). So, this is a problem if you think you need to see relative clause usage to learn something about a-adjectives. But, as mentioned above (and also in Yang 2015), I think that’s only one way to learn about them. There are other options, based on semantic equivalents (“sleeping”, “sleepy”, etc. vs. “asleep”) or similarity to other linguistic categories (e.g., the Yang 2015 approach with locative particles).

(4) Section 4: I really appreciate the explicit discussion of how the distributional similarity-based classification would need to work for the locative particles-strategy to pan out (i.e., Table 1). It’s the next logical step once we have Yang’s proposal about using locative particles in the first place.

(5) Section 4: I admit a bit of trepidation about the conclusion that the available distributional evidence for locative particles is insufficient to lump them together with a-adjectives. It’s the sort of thing where we have to remember that children are learning a system of knowledge, and so while the right-type adverb modification may not be a slam dunk for distinguishing a-adjectives from non-a-adjectives, I do wonder if the collection of syntactic distribution properties (e.g., probability of coordination with PPs, etc.) would cause children to lump a-adjectives together with locative particles and prepositional phrases and, importantly, not with non-a-adjectives. Or perhaps, more generally, the distributional information might cause children to just separate out a-adjectives, and note that they have some overlap with locative particles/PPs and also with regular non-a-adjectives. 

Side note: This is the sort of thing ideal learner models are fantastic at telling us: is the information sufficient to draw conclusion x? In this case, the conclusion would be that non-a-adjectives go together, given the various syntactic distribution cues available. G&B2015 touch on this kind of model at the beginning of section 5.2, mentioning the Perfors et al. 2010 work.

(6) Section 5: I was delighted to see the Hao (2015) study, which gets us the developmental trajectory for a-adjective categorization (or at least, how a-adjectives project onto syntactic distribution). Ten years old is really old for most acquisition stuff. So, this accords with the evidence being pretty scanty (or at least, children taking a while until they can recognize that the evidence is there, and then make use of it).

Monday, February 15, 2016

Some thoughts on Yang 2015

Just from a purely organizational standpoint, I really appreciate how explicitly the goals of this paper are laid out (basically, (i) here’s why the other strategy won’t work, and (ii) why this new one does). Also, because of the clarity of the presentation, I’ll be interested to read Goldberg & Boyd's response for next time. Additionally, I greatly enjoyed reading about the application of what I’ve been calling “indirect positive evidence” (Pearl & Mis in press) — that is, things that are present in the input that can be leveraged indirectly to tell you about something else you’re trying to currently learn about (here: leverage distributional cues for locative particles and PPs to learn about a-adjectives). I really do think this is the way to deal with a variety of acquisition problems (and as I’ve mentioned before, it’s the same intuition that underlies both linguistic parameters and Bayesian overhypotheses: Pearl & Lidz 2013). In my opinion, the more we see explicit examples of how indirect positive evidence can work for various language acquisition problems, the better.

Some more specific thoughts:
(1) I found it quite helpful to have the different cues to a-adjectives listed out, in particular that the phonological cue of beginning with the schwa isn’t 100%, while the morphological cue of being split into aspectual “a” (= something like presently occurring?) + root is nearly 100%. It reminds me of the Gagliardi et al. (2012) work on children’s differing sensitivity to available cues when categorizing nouns in Tsez. In particular, Gagliardi et al. found that the model had to be more sensitive to phonological cues than semantic cues in order to match children’s behavior. This possibly has to do with the ability to reliably observe phonological cues as compared to semantic cues. I suspect the fairly diagnostic morphological cue might also be more observable, since it involves recognition of a free morpheme within the a-adjective (e.g., wake in awake).

(2) Related point: the actual trajectory of children’s development with a-adjectives. This is something that seems really relevant for determining which learning strategies children are using (as Yang himself points out, when he notes that all the experiments from Boyd & Goldberg are with adults). Do children make errors and use infrequent non-a-adjectives only predicatively (i.e., they don’t think they can use them attributively)? And on the flip side, do they use some a-adjectives attributively? Knowing about the errors children make (or lack thereof) can help us decide if they’re really learning on a lexical item by lexical item basis, or instead recognizing certain classes of adjectives and therefore able to make generalizations from one class instance to another (or perhaps more likely, at what age they recognize the classes of adjectives). 

Yang quite admirably does a corpus search of naturalistic child productions, which is consistent with children knowing not to use a-adjectives attributively, but it’s not quite the same as behavioral evidence where children definitively show they disallow (or strongly disprefer) the attributive usage.

(3) Indirect negative evidence: One of Yang’s concerns is that this kind of evidence “requires comparing the extensions of the competing hypotheses”. I get the general gist of this, but I think we run into the same problem with all the language hypothesis spaces we set up, where one language’s extension is a subset of another’s. That is, classical approaches like the Subset Principle run into the exact same problem. This is something we always have to deal with, and I think it depends very much on the hypothesis spaces children entertain.

Moreover, on the flip side, how much of a problem is it really? For the concrete example we’re given about the language that includes “the asleep cat” vs. the language that doesn’t, the extensional difference is one utterance (or one category of utterances, if we group them all together under a-adjectives). How computationally hard is this to calculate? Importantly, we really just need to know that the difference is one construction — the rest of the language’s extension doesn’t matter. So it seems like there should be a way to form a hypothesis space exactly like the one described above (P = “the asleep cat” is allowed vs. not-P = “the asleep cat” is not allowed)?

Also, related to the point about how Boyd & Goldberg’s strategy works — does it even matter what other constructions do appear with those adjectives (i.e., the cat is asleep)?  Isn’t it enough that “the asleep cat” doesn’t? I guess the point is that you want to have appropriate abstract classes like the ones described in section 3.1, i.e., predicative usage = “the cat is asleep”, “the cat is nice”; attributive = *“the asleep cat”, “the nice cat”. This makes the P hypothesis more like “asleep can be used both predicatively and attributively” and the not-P class is “asleep can be used only predicatively”. But okay, let’s assume children have enough syntactic knowledge to manage this. Then we go back to the point about how hard it is in practice to deal with hypothesis space extensions. Especially once we add this kind of abstraction in, it doesn’t seem too hard at all, unless I’m missing something (which is always possible).

(4) I personally have a great love for the Tolerance Principle, and I enjoyed seeing its usage here. But, as always, it gets me thinking about the relationship between the Tolerance Principle and Bayesian inference, especially when we have nice hypothesis spaces laid out like we do here. So, here’s my thinking at the moment:

For the Tolerance Principle, we have a setup like this:

H1 = the generalization applies to all N items, even though e exceptions exist. 
H2 = there is no generalization, and all N items do their own thing.

O = items the pattern/rule is observed to apply to
e = exceptional items the pattern/rule should apply to but doesn’t
N - O - e = unobserved items (if any). We can simplify this and just assume all items have been observed to either follow the pattern (and be in O) or not (and be in e), so N - O - e = 0. 

Turning over to Bayesian thinking, let’s assume the priors for H1 and H2 are equal. So, all the work is really done in the likelihood, i.e., P(Hx | data) is proportional to P(Hx) [prior] * P(data | Hx) [likelihood].

Okay, so how do we calculate P(data | H1) vs. P(data | H2)? The data here is O pattern-following items and e exceptions, where N = O + e.

To calculate both likelihoods, we need to know the probability of generating those O pattern-following items and the probability of generating those e exceptions under both H1 and H2. I think this kind of question is where we get into the derivation of the Tolerance Principle, as described by Yang (2005). In particular, there’s an idea that if you have a rule (as in H1), it’s cheaper to store and access the right forms when there are enough items that follow the rule. 

More specifically, it’s some kind of constant cost for those O items (rule application), though the constant cost involves some work because you actually have to do the computation of the rule/pattern over the item. For the e exceptions, there’s some cost of accessing the stored form individually, based on the frequency of the stored items. Importantly, if you have H1 with a rule + exceptions, every time you use the rule, you have to look through the exceptions first and then apply the rule. For H2 where everything is individually stored, you just wander down the list by frequency until you get to the individual item you care about. 

The Tolerance Principle seems to be the result of doing this likelihood calculation, and giving a categorical decision. Instead of spelling out P(data | H1) and P(data | H2) explicitly, Yang (2005) worked out the decision point: if e <= N / ln N, then P(data | H1) is higher (i.e., having the rule is worth it). So, if we wanted to generate the actual likelihood probabilities for H1 and H2, we’d want to plumb the depths of the Tolerance Principle derivation to determine these. And maybe that would be useful for tracking the trajectory of generalization over time, because it’s very possible these probabilities wouldn’t be close to 0 or 1 immediately. (Quick thoughts: P(data | H1) = something like (p_individualaccess)^e * (p_followsrule)^O; P(data | H2) = something like (p_individualaccess)^N.)
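(To make that concrete, here’s a toy sketch. The e <= N / ln N threshold is Yang’s; the likelihood expressions are just the “quick thoughts” above with placeholder probabilities plugged in, not anything derived from Yang’s actual storage/access cost model:)

```python
import math

def tolerance_threshold(N):
    """Yang's Tolerance Principle threshold: a rule over N items survives
    if the number of exceptions e satisfies e <= N / ln N."""
    return N / math.log(N)

def rule_is_productive(N, e):
    return e <= tolerance_threshold(N)

def likelihood_h1(O, e, p_rule=0.9, p_access=0.3):
    """H1: rule applies to O items, plus e individually stored exceptions.
    The probabilities here are placeholders, not derived costs."""
    return (p_access ** e) * (p_rule ** O)

def likelihood_h2(N, p_access=0.3):
    """H2: all N items individually stored."""
    return p_access ** N

N, e = 120, 20
O = N - e
print(rule_is_productive(N, e))                  # True: 20 <= 120 / ln(120) ~= 25.1
print(likelihood_h1(O, e) > likelihood_h2(N))    # True with these toy numbers
```

Of course, with fixed placeholder probabilities like these, H1 wins whenever any items follow the rule at all; getting the tipping point to land exactly at N / ln N is precisely the part that requires the storage/access cost derivation in Yang (2005).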

Gagliardi, A., Feldman, N. H., & Lidz, J. (2012). When suboptimal behavior is optimal and why: Modeling the acquisition of noun classes in Tsez. In Proceedings of the 34th annual conference of the Cognitive Science Society (pp. 360-365).

Pearl, L., & Lidz, J. (2013). Parameters in language acquisition. The Cambridge Handbook of Biolinguistics, 129-159.

Pearl, L., & Mis, B. (in press - updated 2/2/15). The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Language.

Yang, C. (2005). On productivity. Linguistic Variation Yearbook, 5(1), 265-302.

Monday, February 1, 2016

Some thoughts on van Schijndel & Elsner 2014

I really like the idea of seeing how far you can get with understanding filler-gap interpretation, given very naive ideas about language structure (i.e., linear w.r.t. verb position, as vS&E2014 do). Even if it’s not this particular shallow representation (and instead maybe a syntactic skeleton like the kind Gutman et al. 2014 talked about), the idea of what a “good enough” representation can do for scaffolding other acquisition processes is something near and dear to my heart.  

One niggling thing — given that vS&E2014 say that this model represents a learner between 15 and 25-30 months, it’s likely the syntactic knowledge is vastly more sophisticated at the end of the learning period (i.e., ~25-30 months). So the assumptions of simplified syntactic input may not be as necessary (or appropriate) later on in development. More generally, this kind of extended modeling timeline makes me want more integration with the kind of acquisition framework of Lidz & Gagliardi (2015), which incorporates developing knowledge into the model’s input & inference.

One other thing I really appreciated in this paper was how much they strove to connect the modeling assumptions and evaluation with developmental trajectory data. We can argue about the implementation of the information those empirical data provide, sure, but at least vS&E2014 are trying to seriously incorporate the known facts so that we can get an informative model.

Other specific thoughts:

(1) At the end of section 3, vS&E2014 say the model “assumes that semantic roles have a one-to-one correspondence with nouns in a sentence”. So…is it surprising that “A and B gorped” is interpreted as “A gorped B” since it’s built into the model to begin with? That is, this misinterpretation is exactly what a one-to-one mapping would predict - A and B don’t get the same role (subject/agent) because only one of them can get the role. Unless I misunderstood what the one-to-one correspondence is doing.

(2) I wasn’t quite sure about this assumption mentioned in section 3: “To handle recursion, this work assumes children treat the final verb in each sentence as the main verb…”. So in the example in Table 1, “Susan said John gave (the) girl (a) book”, “gave” is the “main” verb because…why? Why not just break the sentence up by verbs anyway? (That is, “said” would get positions relative to it and “gave” would get positions relative to it, and they might overlap, but…okay?) Is this assumption maybe doing some other kind of work, like with respect to where gaps tend to be?

(3) If I’m understanding the evaluation in section 5 correctly, it seems that semantic roles commonly associated with subject and object (i.e., agent, patient, etc. depending on the specific verb) are automatically assigned by the model. I think this works for standard transitive and intransitive verbs really well, but I wonder about unaccusatives (fall, melt, freeze, etc.) where the subject is actually the “done-to” thing (i.e., Theme or Patient, so the event is actually affecting that thing). This is something that would be available if you had observable conceptual information (i.e., you could observe the event the utterance refers to and determine the role that participant plays in the event).

Practically speaking, it means the model assigning “theme/patient” to the subject position (preverbal) would be correct for unaccusatives. But I don’t think the current model does this - in fact, if it just uses “subject” and “object” to stand in for thematic/conceptual roles, the “correct” assignment would be to label the subject NP of unaccusatives as an “object” (Theme/Patient)… which would be counted as incorrect for this model. (Unless the BabySRL corpus that vS&E2014 used labels thematic roles and not just grammatical roles? It was a bit unclear.) I guess the broader issue is the complexity of different predicate types, and the fact that there isn’t a single mapping that works for all of them.

This came up again for me in section 6 when vS&E2014 compare their results to the competing BabySRL model and they note that when given a NV frame (like with intransitives or unaccusatives), BabySRL labels the lone NP as an “object” 30 or 40% of the time. If the verb is an unaccusative, this would actually be correct (again, assuming “object” maps to “patient” or “theme”).

(4) Section 6: “…these observations suggest that any linear classifier which relies on positioning features will have difficulties modeling filler-gap acquisition” — including the model here? It seemed like the one vS&E2014 used captured the filler-gap interpretations effects they were after, and yet relied on positioning features (relative to the main verb). 

Gutman, A., Dautriche, I., Crabbe, B., & Christophe, A. (2015). Bootstrapping the syntactic bootstrapper: Probabilistic labeling of prosodic phrases. Language Acquisition, 22(3), 285-309.

Lidz, J., & Gagliardi, A. (2015). How Nature Meets Nurture: Universal Grammar and Statistical Learning. Annu. Rev. Linguist., 1(1), 333-353.

Monday, January 18, 2016

Some thoughts on Gutman et al. 2014

I’m a big fan of G&al2014’s goal of learning the initial knowledge that gets other acquisition processes started. In this case, it’s about learning the basic elements that allow syntactic bootstrapping to start, which itself allows children to learn more abstract word meanings. In CoLaLab, we’ve been looking at this same idea of useful initial knowledge with respect to speech segmentation and early syntactic categorization.

For G&al2014’s work, I find it interesting that they rely on comparison to adult prosodic categories (specifically VN and NP) — I wonder if there’s a way to determine if the inferred prosodic categories are “good enough” in some sense, beyond matching VN and NP. For example, maybe the inferred categories can be used directly for syntactic bootstrapping, or maybe they can be used to ease language processing in some measurable way. (As a side note, it also took me a moment to realize “syntactic categorization” for G&al2014 referred to prosodic phrase types rather than the typical syntactic categories like “noun” and “verb”. Just goes to show the importance of defining your terms to avoid confusion.)

I’m also a big fan of models that recognize children use a variety of cues very early on, i.e., here, prosody and semantics of a few familiar words, as well as edge sensitivity. Of course, it’s also important to understand the contribution of individual sources of information. But it’s really nice to see a more integrated model like this because it’s likely to be a more accurate simulation of what children are actually doing.

Other thoughts:

(1) I really like how this model shows which property of function words (the fact that they occur at prosodic phrase edges) allows children to learn that function words are really great cues — even before they have an official “function word” category like “determiner”.

(2) It’s interesting that the syntactic skeleton (formed via function words and prosodic boundaries) matches adult structure (NP = an apple) in some cases and not so much in others (he’s eating = VN, which isn’t a VP or an NP - it’s actually a non-constituent). I wonder how the recovery/update process works if you end up with a bunch of VN units - that is, what causes you to switch to VP = V NP and treat “he’s eating” as not-really-a-syntactic-unit in “he’s eating an apple”? 

(3) Section 2, Experiment 1: If units are constructed by looking at the initial word, it’s important that there not be too much variety in that first word (unless we want toddlers to end up with a zillion phrasal units). From the details in 2.2.1, it looks like they use the k most frequent words to define k classes of units, with k ranging from 5 to 70. Presumably, this would be something implicit to the learner, based on the learner's cognitive capacity limitations or some such. I also like that this is relying on the most frequent words, since that seems quite plausible as a way to figure out which phrasal types to notice. Related thought: Is it possible to design a model where k itself is inferred? I’m thinking generative non-parametric Bayesian models, for example.
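
As a rough illustration of that last thought, here’s a minimal sketch (mine, not G&al2014’s model) of a Chinese Restaurant Process, the kind of non-parametric Bayesian mechanism where the number of classes k isn’t fixed in advance but grows with the data and a concentration parameter alpha. A real model would also condition each assignment on the phrase’s features (like its initial word); this only shows the part where k gets inferred rather than stipulated.

import random
from collections import Counter

def crp_assign(n_items, alpha=1.0, seed=0):
    """Chinese Restaurant Process: each item joins an existing class with
    probability proportional to that class's size, or opens a new class
    with probability proportional to alpha."""
    random.seed(seed)
    class_of = {}
    sizes = Counter()
    next_id = 0
    for i in range(n_items):
        r = random.uniform(0, sum(sizes.values()) + alpha)
        cumulative, chosen = 0.0, None
        for cid, count in sizes.items():
            cumulative += count
            if r < cumulative:
                chosen = cid
                break
        if chosen is None:                    # r fell in the alpha slice
            chosen, next_id = next_id, next_id + 1
        class_of[i] = chosen
        sizes[chosen] += 1
    return class_of, len(sizes)

# 100 toy "prosodic phrases" standing in for the real input
assignments, k = crp_assign(n_items=100, alpha=1.0)
print("inferred number of classes:", k)       # k emerges from the data and alpha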

(4) I also found it interesting that they used purity as the evaluation measure for a phrasal category, rather than pairwise precision (PWP). I wonder what benefit purity has over PWP, since footnote 7 explicitly notes they’re related. Is purity easier to interpret for some reason? G&al2014 do calculate recall and precision for the best instances of VN and NP, though (and find that the categories are very precise, even with as few as 10 categories).
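
Out of curiosity, here’s a toy comparison of the two measures (my own sketch, not G&al2014’s evaluation code): purity only asks whether each cluster has a clear majority gold label, while PWP gets penalized by every mismatched pair inside a cluster.

from collections import Counter
from itertools import combinations

def purity(clusters, gold):
    """Fraction of items whose cluster's majority gold label is their own label."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(gold[i] for i in c).most_common(1)[0][1] for c in clusters) / total

def pairwise_precision(clusters, gold):
    """Of all same-cluster pairs, the fraction sharing a gold label."""
    pairs = matches = 0
    for c in clusters:
        for i, j in combinations(c, 2):
            pairs += 1
            matches += (gold[i] == gold[j])
    return matches / pairs if pairs else 1.0

# six toy phrases with invented gold phrase types, grouped into two induced clusters
gold = {0: "NP", 1: "NP", 2: "NP", 3: "VN", 4: "VN", 5: "NP"}
clusters = [[0, 1, 2, 3], [4, 5]]
print(purity(clusters, gold))              # 0.67: majorities are mostly right
print(pairwise_precision(clusters, gold))  # 0.43: mismatched pairs hurt more

Purity does feel easier to interpret (“how often is the cluster’s majority label the right one?”), which might be the practical reason for preferring it.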

Wednesday, November 25, 2015

Some thoughts on Morley 2015

I definitely appreciate the detailed thought that went into this paper — Morley uses this deceptively simple case study to highlight how to take complexity in representation and acquisition seriously, and also how to take arguments about Universal Grammar seriously. (Both of these are, of course, near and dear to my heart.) I also loved the appeal to use computational modeling to make linguistic theories explicit. (I am all about that.)

I also liked how she notes the distinction between learning mechanism and hypothesis space constraints in her discussion of how UG might be instantiated — again, something near and dear to my heart. My understanding is that we’ve typically thought about UG as constraints on the hypothesis space (and the particular UG instantiation Morley investigated is this kind of UG constraint). To be fair, I tend to lean this way myself, preferring domain-general mechanisms for navigating the hypothesis space and UG for defining the hypothesis space in some useful way. 

Turning to the particular UG instantiation Morley looks at, I do find it interesting that she contrasts the “UG-delimited H Principle” with the “cycle of language change and language acquisition” (Intro). To me, the latter could definitely have a UG component in either the hypothesis space definition or the learning mechanism. So I guess it goes to show the importance of being particular about the UG claim you’re investigating. If the UG-delimited H Principle isn’t necessary, that just rules out the logical necessity of that type of UG component rather than all UG components. (I feel like this is the same point made to some extent in the Ambridge et al. 2014 and Pearl 2014 discussion about identifying/needing UG.)

Some other thoughts:
(1) Case Study: 

(a)  I love seeing the previous argument for “poverty of the typology implies UG” laid out. Once you see the pieces that lead to the conclusion, it becomes much easier to evaluate each component in its own right.

(b) The hypothetical lexical items in Table 1 provide a beautiful example of overlapping hypothesis extensions, some of which are in a subset-superset relationship depending on the actual lexical items observed (I’m thinking of the Penultimate grammar vs. the other two, given items 1, 3, and 4 or items 1, 2, and 5). Bayesian Size Principle to the rescue (potentially)!
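
For reference, the size principle I have in mind is the standard one (my notation, not Morley’s): with n observed lexical items drawn from a hypothesis h whose extension contains |h| items,

P(d \mid h) = \left(\frac{1}{|h|}\right)^n

so a subset hypothesis that remains consistent with the data beats a superset hypothesis by a factor that grows exponentially in n.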

(c) For stress grammars, I definitely agree that some sort of threshold for determining whether a rule should be posited is necessary. I’m fond of Legate & Yang (2013)/Yang (2005)’s Tolerance Principle myself (see Pearl, Ho, & Detrano 2014, 2015 for how we implement it for English stress; the basic idea is that this principle provides a concrete threshold for which patterns count as productive, and the learner can then use those to pick the productive grammar from the available hypotheses). I was delighted to see the Tolerance Principle proposal explicitly discussed in section 5.
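
For anyone who wants the threshold spelled out (my summary of Yang’s standard formulation): a rule that could apply to N relevant items stays productive only as long as its exceptions e satisfy

e \leq \theta_N = \frac{N}{\ln N}

so with N = 100 items, for example, the rule tolerates roughly 21 exceptions (100 / ln 100 ≈ 21.7).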

(2) The Learner

(a) It’s interesting that a distribution over lexical item stress patterns is allowed, which would then imply that a distribution over grammars is allowed (this seems right to me intuitively when you have both productive and non-productive patterns that are predictable). Then, the “core” grammar is simply the one with the highest probability. One sticky thing: would this predict variability within a single lexical item? (That is, sometimes an item gets the stress contour from grammar 1 and sometimes it gets the one from grammar 2.) If so, that’s a bit weird, except perhaps in cases of code-switching between dialects (a possible example: American vs. British pronunciations of the same word). But is this what Stochastic OT predicts? It sounds like the other frameworks mentioned could be interpreted this way too. I’m most familiar with Yang’s Variational Learning (VL), but I’m not sure the VL framework has been applied to stress patterns on individual lexical items, and perhaps the sticky issue mentioned above is why?

Following this up with the general learners described, I think that’s sort of what the Variability/Mixture learners would predict, since grammars can just randomly fail to apply to a given lexical item with some probability. This is then a bit funny because these are the only two general learners pursued further. The discarded learners predict different-sized subclasses of lexical items within which a given grammar applies absolutely, and that seems much more plausible to me, given my knowledge of English stress. Except the description of the hypotheses given later on in example (5) makes me think this is effectively how the Mixture model is being applied? But then the text beneath (7) clarifies that, no, this hypothesis really does allow the same lexical item to show up with different stress patterns.
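
Just to make that sticky prediction concrete for myself, here’s a toy sketch (my own construction, nothing like Morley’s actual learners) of a mixture-style producer where each token of a lexical item samples its grammar independently. The grammar names and probabilities are made up:

import random
from collections import Counter

random.seed(1)
grammars = {"Penultimate": 0.8, "Initial": 0.2}   # hypothetical grammar probabilities

def stress(word, grammar):
    """Stand-in for whatever contour the grammar would assign (details irrelevant here)."""
    return f"{grammar}-contour({word})"

def produce(word):
    """Each token independently samples a grammar, then gets that grammar's contour."""
    g = random.choices(list(grammars), weights=list(grammars.values()))[0]
    return stress(word, g)

tokens = Counter(produce("banana") for _ in range(1000))
print(tokens)   # the *same* lexical item surfaces with both contours,
                # which is exactly the within-item variability that seems odd;
                # a subclass-style learner would instead fix one grammar per item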

(b) It’s really interesting to see the connection between descriptive and explanatory adequacy and Bayesian likelihood and prior. I immediately got the descriptive-likelihood link, but goggled for a moment at the explanatory-prior link. Isn’t explanatory adequacy about generalization? Ah, but a prior can be thought of in terms of a hypothesis’s extension of items, and the items included in that extension are the ones the hypothesis would generalize to. Nice!
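
Spelling out how I read that link (my gloss, not Morley’s exact formulation), in Bayes’ rule

P(h \mid d) \propto \underbrace{P(d \mid h)}_{\text{descriptive adequacy}} \times \underbrace{P(h)}_{\text{explanatory adequacy}}

the likelihood rewards hypotheses that fit the observed lexical items, while the prior rewards hypotheses whose extensions would generalize in the preferred way before any data come in.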

(3)  Likely Input and a Reasonable Learner: The take-home point seems to be that lexicons that support Gujarati* are rare, but not impossible. I wonder how well these match up to the distributions we see in child-directed speech (CDS)? Is CDS more like Degree 4, which seems closest to the Zipfian distribution we tend to see in language at different levels?

(4) Interpretation of Results: I think Morley makes a really striking point about how much we actually (don’t) know about typological diversity, given the sample available to us (basically, we have 0.02% of all the languages). It really makes you (me) rethink making claims based on typology.


Ambridge, B., Pine, J. M., & Lieven, E. V. (2014). Child language acquisition: Why universal grammar doesn't help. Language, 90(3), e53-e90.

Pearl, L. (2014). Evaluating learning-strategy components: Being fair (Commentary on Ambridge, Pine, and Lieven). Language, 90(3), e107-e114.

Pearl, L., Ho, T., & Detrano, Z. (2014). More learnable than thou? Testing metrical phonology representations with child-directed speech. Proceedings of the Berkeley Linguistics Society, 398-422.