Monday, April 17, 2017

Thoughts on Lasnik & Lidz 2016

I really enjoyed L&L2016’s take on poverty of the stimulus and how it relates to the argument for Universal Grammar (UG) — so much so that I’m definitely using this chapter as a reference when we talk about poverty of the stimulus in my upper-division language acquisition course. 

One thing that surprised me, though, is that there seems to be some legitimate confusion in the research community about how to define UG (more on this below), which leads to one of two situations: (i) everyone who believes there are induction problems in language by definition believes in Universal Grammar, or (ii) everyone who believes there are induction problems in language that are only solvable by language-specific components believes in Universal Grammar. I feel like the linguistics, cognitive science, and psychology communities need to have a heart-to-heart about what we’re all talking about when we argue for or against Universal Grammar. (To be fair, I’ve felt this way when responding to some folks in the developmental psych community — see Pearl 2014 for an example.)

Pearl, L. (2014). Evaluating learning-strategy components: Being fair (Commentary on Ambridge, Pine, and Lieven). Language, 90(3), e107-e114.

Specific thoughts:

(1) Universal Grammar:  
Chomsky (1971)’s quote in section 10.6 about structure-dependence concludes with “This is a very simple example of an invariant principle of language, what might be called a formal linguistic universal or a principle of universal grammar.” — Here, the term Universal Grammar seems to apply to anything that occurs in all human languages (not unreasonable, given the adjective “universal”). But it doesn’t specify whether that thing is innate vs. derived, or language-specific vs. domain-general. I begin to see where the confusion in the research community may have come from.

Right now, some people seem to get really upset at the term Universal Grammar, taking it to mean things that are both innate and language-specific (and this is certainly the working definition Jon Sprouse and I use). But Chomsky’s use of Universal Grammar above can clearly be interpreted quite differently. And for that interpretation of Universal Grammar, it’s really just a question of whether the thing is in fact something that occurs in all human languages, period. It doesn’t matter what kind of thing it is.

Related: In the conclusion section 10.9, L&L2016 zero in on the innate part of UG: “…there must be something inside the learner which leads to that particular way of organizing experience…organizing structure is what we typically refer to as Universal Grammar…”. This notably leaves open whether the innate components are language-specific or domain-general. But the part that immediately follows zeros in on the language-specific part by saying Universal Grammar is “the innate knowledge of language” that underlies human language structure and makes acquisition possible. On the other hand…maybe “innate knowledge of language” could mean innate knowledge that follows from domain-general components and which happens to apply to language too? If so, that would back us off to innate stuff period, and then, by that definition, everyone believes in Universal Grammar as long as they believe in innate stuff applying to language.

(2) Induction problems: I really appreciate how the intro in 10.1 highlights that the existence of induction problems doesn’t require language-specific innate components (just innate components). The additional step of asserting that the innate components are also language-specific (for other reasons) is just that — an additional step. Sometimes, I think these steps get conflated when induction problems and poverty of the stimulus are discussed, and it’s really great to see it so explicitly laid out here. I think the general line of argument in this opening section also makes it clear why the pure empiricist view just doesn’t fly anymore in cognitive development — everyone’s some kind of nativist. But where people really split is whether they believe at least some innate component is also language-specific (or not). This is highlighted by a Chomsky (1971) quote in section 10.3, which notes that the language-specific part is an “empirical hypothes[i]s”, and the components might in fact be “special cases of more general principles of mind”.

(3) The data issue for induction problems: 
Where I think a lot of interest has been focused is the issue of how much data are actually available for different aspects of language acquisition. Chomsky’s quote in 10.1 about the A-over-A example closes with “…there is little data available to the language learner to show that they apply”. Two points: 

(a) "Little" is different from "none", and how much data is actually available is a very important question. (Obviously, it’s far more impressive if the input contains none or effectively none of the data that a person is able to judge as grammatical or ungrammatical.) This is picked up in the Chomsky (1971) quote in section 10.6, which claims that someone “might go through much or all of his life without ever having been exposed to relevant evidence”. This is something we can actually check out in child-directed speech corpora — once we decide what the “relevant evidence” is (no small feat, and often a core contribution of an acquisition theory). This also comes back in the discussion of English anaphoric one in section 10.8, where the question of what counts as informative data is discussed in some detail (unambiguous vs. ambiguous data of different kinds).

(b) How much data is "enough" to support successful acquisition is also a really excellent question. Basically, an induction problem happens when the data are too scarce to support correct generalization as fast as kids do it. So, it really matters what “too scarce” means. (Legate & Yang (2002) and Hsu & Chater (2010) have two interesting ideas for how to assess this quantitatively.) L&L2016 bring this up explicitly in the closing bit of 10.5 on Principle C acquisition, which is really great.
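For concreteness, here’s the back-of-the-envelope flavor of calculation these approaches invite, in a short Python sketch. Every number below is a hypothetical placeholder of mine (not an estimate from Legate & Yang or Hsu & Chater); the point is just that once you commit to a rate of relevant evidence and an amount of input, "too scarce" becomes something you can actually compute with.

```python
# Back-of-the-envelope estimate of how much "relevant evidence" a child
# encounters before some age of acquisition. All numbers are hypothetical.

utterances_per_year = 1_000_000       # hypothetical amount of child-directed speech
years_before_acquisition = 3          # hypothetical age at which kids show the knowledge
rate_of_relevant_evidence = 1e-5      # hypothetical proportion of utterances that are informative

expected_relevant_tokens = (utterances_per_year
                            * years_before_acquisition
                            * rate_of_relevant_evidence)

print(f"Expected relevant data points: {expected_relevant_tokens:.1f}")
# Whether ~30 tokens counts as "too scarce" is exactly the question that
# Legate & Yang (2002) and Hsu & Chater (2010) try to answer in different ways.
```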

Legate, J. A., & Yang, C. D. (2002). Empirical re-assessment of stimulus poverty arguments. Linguistic Review, 19(1/2), 151-162.

Hsu, A. S., & Chater, N. (2010). The logical problem of language acquisition: A probabilistic perspective. Cognitive science, 34(6), 972-1016.

(4) Section 10.4 has a nice explanation of Principle C empirical data in children, but we didn’t quite get to the indirect negative evidence part for it (which I was quite interested to see!). My guess: Something about structure-dependent representations, and then tracking what positions certain pronouns allow reference to (a la Orita et al. 2013), though section 10.5 also talks about a more idealized account that’s based on the simple consideration of data likelihood.

Orita, N., McKeown, R., Feldman, N., Lidz, J., & Boyd-Graber, J. L. (2013). Discovering Pronoun Categories using Discourse Information. In CogSci.

(5) A very minor quibble in section 10.5, about the explanation given for likelihood. I think the intuition is more about how the learner views data compatibility with the hypothesis. P(D | H) = something like “how probable the observed data are under this hypothesis”, which is exactly why the preference falls out for a smaller hypothesis space that generates fewer data points. (How the learner’s input affects the learner’s beliefs is the whole calculation of likelihood * prior, which, once normalized, gives the posterior.)
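Here’s a minimal Python sketch of that intuition, the "size principle" version of likelihood, where each hypothesis generates its compatible data points uniformly, so P(D | H) = (1/|H|)^n for data consistent with H. The hypothesis sizes and data points are toy values of my own, not anything from L&L2016.

```python
# Toy illustration of why likelihood favors the smaller hypothesis space:
# each hypothesis assigns uniform probability over the data points it can generate.

def likelihood(data, hypothesis_size):
    """P(D | H) = (1/|H|)^n when every observed point is compatible with H."""
    return (1.0 / hypothesis_size) ** len(data)

data = ["d1", "d2", "d3"]          # three observed data points, compatible with both hypotheses
small_H = 10                       # smaller hypothesis generates 10 possible data points
large_H = 100                      # larger (superset) hypothesis generates 100

print(likelihood(data, small_H))   # 0.001
print(likelihood(data, large_H))   # 0.000001
# The smaller hypothesis makes the observed data 1000x more probable,
# which is the Bayesian route to something like the Subset Principle.
```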

Related: I love the direct connection of Bayesian reasoning to the Subset Principle. It seems to be exactly what Chomsky was talking about as something that’s a special case of a more general principle of mind.

(6) Structure-dependence, from section 10.6: “Unfortunately, other scholars were frequently misled by this into taking one particular aspect of the aux-fronting paradigm as the principle structure dependence claim, or, worse still, as the principle poverty of the stimulus claim.” — Too darned true, alas! Hopefully, this paper and others like it will help rectify that misunderstanding. I think it also highlights that our job, as people who believe there are all these complex induction problems out there, should be to accessibly demonstrate what these induction problems are. A lot.

(7) Artificial language learning experiments in 10.7: I’ve always thought the artificial language learning work of Takahashi and Lidz was a really beautiful demonstration of statistical learning abilities applied to learning structure-dependent rules that operate over constituents (= what I’ll call a “constituent bias”). But, as with all artificial language learning experiments, I’m less clear about how to relate this to native language acquisition, where the learners don’t already have a set of language biases about using constituents from their prior experience. It could indeed be that such biases are innate, but it could also be that such biases (however learned) are already present in the adult and 18-month-old learners, and these biases are deployed for learning the novel artificial language. So, it’s not clear what this tells us about the origin of the constituent bias. (Note: I think it’s impressive as heck to do this with 18-month-olds. But 18-month-olds do already have quite a lot of experience with their native language.)


(8) Section 10.8 & anaphoric one (minor clarification): This example is of course near and dear to my heart, since I worked on it with Jeff Lidz.  And indeed, based on our corpus analyses, unambiguous data for one’s syntactic category (and referent) in context is pretty darned rare. The thing that’s glossed over somewhat is that the experiment with 18-month-olds involves not just identifying one’s antecedent as an N’, but specifically as the N’ “red bottle” (because “bottle” is also an N’ on its own, as example 20 shows). This is an important distinction, because it means the acquisition task is actually a bit more complicated. The syntactic category of N’ is linked to 18-month-olds preferring the antecedent “red bottle” — if they behaved as if they thought it was “bottle”, we wouldn’t know if they thought it was N’ “bottle” or plain old N0 “bottle”.

Tuesday, March 7, 2017

Thoughts on Ranganath et al. 2013

I really appreciate seeing the principled reasoning for using certain types of classifiers, and doing feature analysis both before and after classification. On this basis alone, this paper seems like a good guide to classifier best practices for the social sciences. Moreover, the discussion section takes care to relate the specific findings to larger theoretical ideas in affective states, like collaborative conversation style, and the relationship between specific features and affective state (e.g.,  negation use during flirtation may be related to teasing or self-deprecation; the potential distinction between extraversion and assertiveness; the connection between hedging and psychological distancing; what laughter signals at different points in the conversational turn). Thanks, R&al2013!

Other thoughts:

(1) Data cleanliness: R&al2013 want a really clean data set to learn from, which is why they start with the highest 10% and lowest 10% of judged stance ratings. We can certainly see the impact of having messier data, based on the quartile experiments. In short, if you use less obvious examples to train, you end up with worse performance. I wonder what would happen if you used the cleaner data to train (say, the top and bottom 10%), but tested on classifying the messier data (top and bottom 25%). Would you still do as poorly, or would you have learned some good general features from the clean dataset that can be applied to the messy dataset? A sketch of what I mean is below. (I’m thinking about this in terms of child-directed speech (CDS) for language acquisition, where CDS is “cleaner” in various respects than messy adult-directed data.)
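The sketch, using scikit-learn with a generic logistic-regression stand-in (not R&al2013’s actual classifier): the feature matrix, ratings, and the 10%/25% cutoffs here are all random placeholders for their real features and judged stance ratings.

```python
# Sketch: train on the "cleanest" examples (extreme 10% of stance ratings),
# test on messier ones (extreme 25%, minus the training items).
# X, ratings, and the classifier are placeholders, not R&al2013's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
ratings = rng.normal(size=1000)                  # stand-in for judged stance ratings
X = rng.normal(size=(1000, 20))                  # stand-in for linguistic features
y = (ratings > np.median(ratings)).astype(int)   # label: high vs. low stance

def extreme_mask(ratings, frac):
    lo, hi = np.quantile(ratings, [frac, 1 - frac])
    return (ratings <= lo) | (ratings >= hi)

train_mask = extreme_mask(ratings, 0.10)                 # cleanest items (top/bottom deciles)
test_mask = extreme_mask(ratings, 0.25) & ~train_mask    # messier, non-overlapping items

clf = LogisticRegression().fit(X[train_mask], y[train_mask])
print(clf.score(X[test_mask], y[test_mask]))             # does "clean" training transfer?
```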

(2) This relates to the point in the main section about how R&al2013 really care about integrating insights from the psychology of the things they’re trying to classify. In the lit review, I appreciated the discussion of the psychological literature related to interpersonal stance (e.g., specifying the different categories of affective states). This demonstrates the authors are aware of the cognitive states underpinning the linguistic expression.

(3) Lexical categories, using 10 LIWC-like categories: I appreciated seeing the reasoning in footnote 1 about how they came up with these, and more importantly, why they modified them the way they did. While I might not agree with leaving the “love” and “hate” categories so basic (why not use WordNet synsets to expand this?), it’s at least a reasonable start. Same comment for the hedge category (which I love seeing in the first place).

(4) Dialog and discourse features: Some of these seem much more complex to extract (ex: sympathetic negative assessments). The authors went for a simple heuristic regular expression to extract these, but this is presumably only a (reasonable) first-pass attempt. On the other hand, given that they had less than 1000 speed-dates, they probably could have done some human annotation of these just to give the feature the best shot of being useful. Then, if it’s useful, they can worry about how to automatically extract it later.
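Just to make the heuristic idea concrete, a first-pass pattern might look something like the sketch below. This is my own hypothetical regular expression, not the one R&al2013 actually used.

```python
import re

# Hypothetical heuristic for "sympathetic negative assessments"
# (e.g., "oh no", "that's terrible", "I'm so sorry") -- NOT R&al2013's actual pattern.
SYMPATHETIC_NEG = re.compile(
    r"\b(oh no|that'?s (terrible|awful|too bad)|i'?m so sorry)\b", re.IGNORECASE)

def has_sympathetic_negative(turn: str) -> bool:
    return bool(SYMPATHETIC_NEG.search(turn))

print(has_sympathetic_negative("Oh no, that's terrible!"))   # True
print(has_sympathetic_negative("That's great news."))        # False
```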

(5)  It’s so interesting to see the accommodation of function words signifying flirtation. Function words were the original authorship stylistic marker, under the assumption that your use of function words isn’t under your conscious control. I guess the idea would be that function word accommodation also isn’t really under your conscious control, and imitation is the sincerest form of flattery (=~ flirtation)…


Tuesday, February 21, 2017

Thoughts on Rubin et al. 2015

As with much of the deception detection literature, it’s always such a surprise to me how relatively modest the performance gains are. (Here, the predictive model doesn’t actually get above chance performance, for example — of course, neither do humans.) This underscores how difficult a problem deception detection from linguistic cues generally is (or at least, is currently). 

For this paper, I appreciated seeing the incorporation of more sophisticated linguistic cues, especially those with more intuitive links to the psychological processes underlying deception (e.g., rhetorical elements representing both what the deceiver chooses to focus on and the chain of argument from one point to the next). I wonder if there’s a way to incorporate theory of mind considerations more concretely, perhaps via pragmatic inference linked to discourse properties (I have visions of a Rational-Speech-Act-style framework being useful somehow).

Other thoughts:

(1) I wonder if it’s useful to compare and contrast the deception process that underlies fake product reviews with the process underlying fake news. In some sense, they’re both “imaginative writing”, and they’re both  about a specific topic that could involve verifiable facts. (This comes to mind especially because of the detection rate of around 90% for the fake product reviews in the data set of Ott et al. 2011, 2013, using just n-grams + some LIWC features).

Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 309-319). Association for Computational Linguistics.

Ott, M., Cardie, C., & Hancock, J. T. 2013. Negative Deceptive Opinion Spam. In HLT-NAACL (pp. 497-501).


(2) I really appreciated the discussion of the issues surrounding “citizen journalism”. I wonder if an easier (or alternative) route for news verification is considering a set of reports about the same topic in aggregate — i.e., a wisdom-of-the-crowds approach over the content of the reports. The result is aggregated content (note: perhaps cleverly aggregated to be weighted by various linguistic/rhetorical/topic features) that reflects the ground truth better than any individual report, and thus would potentially mitigate the impact of any single fake news report. You might even be able to use the “Bluff the Listener” NPR news data R&al2015 used, though there you only have three stories at a time on the same topic (and two are in fact fake, so your “crowd” of stories is deception-biased).
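A minimal sketch of the aggregation idea, assuming we already have per-report credibility weights from some upstream feature-based model; the reports, claims, and weights below are entirely hypothetical.

```python
# Wisdom-of-the-crowds aggregation over reports about the same topic.
# Each report asserts a claim (True/False here for simplicity) and carries a
# credibility weight from some upstream feature-based model -- all hypothetical.
from collections import defaultdict

reports = [
    {"claim": "A", "asserts": True,  "weight": 0.9},
    {"claim": "A", "asserts": True,  "weight": 0.7},
    {"claim": "A", "asserts": False, "weight": 0.3},   # the lone (possibly fake) dissenter
]

def aggregate(reports):
    score = defaultdict(float)
    for r in reports:
        score[r["claim"]] += r["weight"] if r["asserts"] else -r["weight"]
    return {claim: s > 0 for claim, s in score.items()}

print(aggregate(reports))   # {'A': True} -- the aggregate overrides the outlier
```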


(3) Something I’m interested in, given the sophistication of the RST discourse features — what are some news examples that a simplistic n-grams approach would miss (either false positives or false negatives)? Once we have those key examples, we can look at the discourse feature profiles of those examples to see if anything pops out. This then tells us what value would be added to a standard baseline n-gram model that also incorporated these discourse features, especially since they have to be manually annotated currently. 

Tuesday, January 31, 2017

Thoughts on Iyyer et al. 2014

I really appreciate that I&al2014’s goal is to go beyond bag-of-words approaches and leverage the syntactic information available (something that warms my linguistic heart).  To this end, we see a nice example in Figure 1 of the impact of lexical choice and structure on the overall bias of a sentence, with “big lie” + its complement (a proposition) = opposite bias of the proposition. Seeing this seemingly sophisticated compositional process, I was surprised to see later on that negation causes such trouble. Maybe this has to do with the sentiment associated with “lie” (which is implicitly negative), while “not” has no obvious valence on its own?

Some other thoughts:

(1) Going over some of the math specifics: In the supervised objective loss function in (5), I’m on board with l(pred_i), but what’s gamma? (A bias parameter of some kind? And is it over two just so the derivative works out in equation 6?) Theta is apparently the set of vectors corresponding to the components (W_L, W_R), the weights on the components (W_cat), the biases (b_1, b_2), and some other vector W_e (which later on is described as a word embedding matrix from word2vec)…and that gets squared in the objective function because…?
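My best guess at what’s going on, written out as the standard L2-regularization reading (I haven’t verified this against I&al2014’s exact equations, so treat it as my interpretation):

```latex
% My reading of the objective in (5), as standard L2 regularization (unverified):
J(\theta) = \sum_i \ell(\mathrm{pred}_i) \;+\; \frac{\gamma}{2}\,\lVert \theta \rVert^2,
\qquad
\frac{\partial}{\partial \theta}\left(\frac{\gamma}{2}\lVert \theta \rVert^2\right) = \gamma\,\theta .
% So gamma would be a regularization strength, the /2 just makes the gradient in (6) clean,
% and theta gets squared (via the norm) because the term penalizes parameter magnitude
% (over W_L, W_R, W_cat, b_1, b_2, and the word-embedding matrix W_e) to limit overfitting.
```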

(2) I like seeing the impact of initialization settings (random vs. prior knowledge in the form of 300-dimensional word2vec vectors). The upshot is that word2vec prior knowledge about words is helpful — though only by 1% in performance, much to my surprise. I expected this semantic knowledge to be more helpful (again, my linguistic knowledge bias is showing).

(3) Dataset stuff:

(a) I found it a bit odd that the authors first note that partisanship (i.e., whether someone is Republican or Democrat) doesn’t always correlate with their ideological stance on a particular issue (i.e., conservative or liberal), and then say how they’re going to avoid conflating these things by creating a new annotated data set. But then, when creating their sentence labels, they propagate the party label (Republican/Democrat) down from the speaker to individual sentences, making exactly the mappings (Republican—>conservative, Democrat—>liberal) they just said they didn’t want to conflate. Did I miss something? (Also, why not use CrowdFlower to verify the propagated annotations?)

(b) Relatedly, when winnowing down the sentences that are likely to be biased for the annotated dataset, I&al2014 rely on exactly the hand-crafted methods that they shied away from before (e.g., a dictionary of “sticky bigrams” strongly associated with one party or the other). So maybe there’s a place for these methods at some point in the classifier development pipeline (in terms of identifying useful data to train on).

(c) The final dataset size is 7816 sentences — wow! That’s tiny in NLP dataset size terms. Even when you add the 11,555 hand-tagged ones from the IBC, that’s still less than 20K sentences to learn from. Maybe this is an instance of quality over quantity when it comes to learning (and hopefully not overfitting)?


(4) It’s really nice to see specific examples where I&al2014’s approach did better than the different baselines. This helps with the explanation of what might be going on (basically, structurally-cued shifts in ideology get captured). Also, here’s where negation strikes! It’s always surprising to me that more explicit mechanisms for handling negation structurally aren’t implemented, given how much power negation has when it comes to interpretation. I&al2014 say this can be solved by more training data (probably true)…so maybe the vectorized representation of “not” would get encoded to be something like its linguistic structural equivalent?

Tuesday, January 10, 2017

Some thoughts on Mikolov et al. 2013

I definitely find it as interesting as M&al2013 do that some morphosyntactic relationships (e.g., past tense vs. present tense) are captured by these distributed vector representations of words, in addition to the semantic relationships. That said, this paper left me desperately wanting to know why these vector representations worked that way. Was there anything interpretable in the encodings themselves? (This is one reason why current research into explaining neural network results is so attractive — it’s nice to see cool results, but we want to know what the explanation is for those results.) Put simply, I can see that forcing a neural network to learn from big data in an unsupervised way yields these implicit relationships in the word encodings. (Yay! Very cool.) But tell me more about why the encodings look the way they do so we better understand this representation of meaning.
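The vector-offset analogy method the paper relies on is easy to state concretely; here’s a minimal numpy sketch with a toy, hand-made embedding table (the actual vectors in M&al2013 come from an RNN language model, not from anything this small).

```python
# Minimal sketch of the vector-offset analogy method ("a is to b as c is to ?"):
# answer = argmax over the remaining vocabulary of cosine(v_b - v_a + v_c, v_word).
# The 3-d vectors below are toy values, not real RNN-trained embeddings.
import numpy as np

emb = {
    "walk":   np.array([1.0, 0.0, 0.2]),
    "walked": np.array([1.0, 1.0, 0.2]),
    "jump":   np.array([0.0, 0.0, 0.9]),
    "jumped": np.array([0.0, 1.0, 0.9]),
    "run":    np.array([0.5, 0.0, 0.5]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("walk", "walked", "jump"))   # "jumped" (the present -> past tense offset)
```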

Other thoughts:

(1) Everything clearly rides on how the word vectors are created (“…where similar words are likely to have similar vectors”). And that’s accomplished via an RNN language model very briefly sketched in Figure 1. I think it would be useful to better understand what we can of this, since this is the force that’s compressing the big data into helpful word vectors. 

One example:  the model is “…trained with back-propagation to maximize the data log-likelihood under the model…training such a purely lexical model to maximize likelihood will induce word representations…”  — What exactly are the data? Utterances?  Is there some sense of trying to predict the next word the way previous models did? Otherwise, if everything’s just treated as a bag of words presumably, how would that help regularize word representations?

(2) Table 2: Since the RNN-1600 does the best, it would be handy to know what the “several systems” were that comprised it. That said, there seems to be an interesting difference in performance between adjectives and nouns on one hand (at best, 23-29% correct) and verbs on the other (at best, 62%), especially for the RNN versions. Why might that be? The only verb relation was the past vs present tense…were there subsets of noun or adjective relations with differing performance, or were all the noun and all the adjective relations equal? (That is, is this effectively a sampling error, and if we tested more verb relations, we’d find more varied performance?) Also, it’d be interesting to dig into the individual results and see if there were particular word types the RNN representations were especially good or bad at. 

(3) Table 3: Since the RNN-1600 was by far the best of the RNNs in Table 2 (and in fact RNN-80 was the worst), why pick the RNN-80 to compare against the other models (CW, HLBL)?


(4) Table 4, semantic relation results: When .275 is the best Spearman’s rho you can get, it shows this is a pretty hard task…I wonder what human performance would be. I assume close to 1.00 if these are the simple analogy-style questions? (Side note: MaxDiff is apparently this, and is another way of dealing with scoring relational data.)

Monday, November 28, 2016

Some thoughts on McCauley & Christiansen 2014

I really appreciate this kind of overview, especially for an acquisition modeling literature I’m not as familiar with. It’s heartening to see similar broad concerns (consensus about what models should be doing), even if I might not always agree with the particulars. What caught my initial attention here is the focus on moving beyond “purely distributional features of the input” — though it turns out this might mean something different to me than to the authors.

For me, “purely distributional” means using only distributional information (rather than being additionally biased to skew the distributions in some way, e.g., by upweighting certain data and downweighting others). Importantly, "purely distributional" can still be information about the distribution of fairly abstract things, like thematic role positions. For M&C2014, based on the intro, it seems like they want it to mean distributions of words, since they specifically point out the “relative lack of semantic information” in current distributional usage-based models. They also contrast a purely distributional version of Perfors et al.’s dative alternation learning model with one that includes “a single semantic feature”. So while I’m happy to see the inclusion of more abstract linguistic features, I would still class the use of the distributions of those features as a purely distributional strategy. (This is part of the general idea that it's not that you're counting, but rather what you're counting.)

Some additional thoughts:

(1) I like the suggestion to create models that can produce behavioral output that we can compare against children’s behavioral output.  (This is under the general heading of “Models should aim to capture aspects of language use”.) That way, we don’t have to spend so much time arguing over the theoretical representation we choose for the model’s internal knowledge — the ultimate checkpoint is whether it’s a way to generate the observed behavior (i.e., an existence proof). This is exactly the sort of thing we read about last time in the reading group. Of course, as we also saw last time, this is much easier said than done.

(2) One criticism M&C2014 bring up as they discuss the models of semantic role labeling is that there’s a fixed set of predefined semantic roles. Is this really a problem, though? I think there’s evidence for early conceptual roles in infants (something like proto-agent and proto-patient). 

Also, later on in the discussion of verb argument structure, M&C2014 describe Chang’s Embodied Construction Grammar model as involving a set of “predefined schemas” that correspond to “actions, objects, and agents”. This doesn’t seem to cause M&C2014 as much consternation — why is it any more usage-based to have predefined conceptual schemas instead of predefined conceptual roles?


(3) I admit, I was somewhat surprised in the future extensions discussion to see “subject-auxiliary inversion” as an example of complex grammatical phenomena. In my head, that’s far more basic than many other things I see in the syntactic development literature, such as raising vs. control verb interpretation, quantifier scope ambiguity, syntactic island constraints, binding relations, negative polarity items, and so on. Related to this, it’s unclear to me how much “social feedback” incorporation that “reflect[s] the semi-supervised nature of the learning task” is going to matter for syntactic knowledge like this. How much feedback do children get (and actually absorb, even if they get it) for these more sophisticated knowledge elements?

Monday, November 14, 2016

Some thoughts on Lloyd-Kelly et al. 2016

I really appreciate this paper as a first attempt to provide a linking story between model representations and infant behavior (in this case, turning probabilities associated with chunked representations into actual infant listening times, using things like time-sensitive trace decay). This highlights how the details of the experimental procedure matter, such as how often syllables are uttered, how long between habituation and test phases, and how long between the individual test stimuli during the test phase. In theory, this would also include all the non-linguistic processes that go into generating observable behavior, like motor control, attention, and memory, though LK&al2016 focused on memory for this first-pass attempt. (I should note that I think including some mechanism for attention would really help them out in future modeling attempts — more on this below.)

Some additional thoughts:

(1) It might be useful to go over some of the details of the CHREST model discussed in the “Participant Modelling” section, and embodied in Figure 1. While the basic division into long-term memory, short-term memory, and a phonological loop makes good sense, I want to make sure I’m clear on the distinction between discriminating, familiarizing, and a node being finished. For instance, why does a “finished” node cause something new to be created?

Relatedly, based on Figure 1, it seems like there’s a built-in primacy effect with respect to inserting a new node. For example, when pa-go is encountered in “pa-go-ti”, but only pa-do exists, the first thing that happens is “go” is added on its own as a primitive. My interpretation: If you get something new, you only manage to grab a piece of it. Primacy biases make you grab the first piece you don’t recognize. (An alternative might be a recency bias, where you grab the last thing, due to phonological loop decay. So, in pa-go-ti, you grab “ti” first.)

(2) I think it’s very handy how the learner ignores incoming requests during the search, retrieval, and updating process.  The upshot is that the learner can’t learn new things while it’s still updating old things, which intuitively feels right. Also, it’s nice from a model fit perspective to have three distinct timing variables to tweak in order to match human behavior (though this also gets into issues of maybe being able to overfit with that many degrees of freedom).

(3) I really appreciated the empirical grounding based on children’s sensory auditory memory strength for the phonological store (=600ms). However, then I got a bit confused as to why they were testing out other values for this (800ms and 1000ms) in their simulations. Perhaps because 600ms was only a guess?

This then relates to the interpretation of Figure 2. It looks like the least variable performance comes from a short phonological store trace decay (600ms), though the r^2 is also low (but then, so is the RMSE, which is a good thing). If we take this as “this is the best”, then we might interpret this as quick forgetting mattering more than the other memory retrieval aspects encoded by familiarization and discrimination time.

On the other hand, if we focus on the highest r^2 and lowest RMSE values, then we get these combinations as being best: 

800ms phon decay + 10000ms discrimination + 1000-1500ms familiarization
1000ms phon decay + 9000ms discrimination + 2000ms familiarization

Importantly, the 600ms phon decay isn’t even in there. If we take these at face value, then the question is how to interpret it. Perhaps it narrows down the set of possible values for these different memory components in infants. In that case, maybe an 8-month-old phonological store trace decay is closer to a 1 or 2-year-old's, which is 1000-2000ms, rather than 600ms…

…except LK&al2016’s conclusion section seems to take the opposite tack: “…the data obtained in this paper would lend credence to the proposal that the trace decay time of the phonological store is around 600ms for very young infants.” I think I missed how they get there from their results, especially the connection to the digit span findings cited from Gathercole & Adams (1993). It seems super important, given how LK&al2016 think it’s the biggest finding of their paper.


(4) LK&al2016 find a qualitative match to infant looking times (Figure 3), but they note that they’re getting longer times for everything.  As LK&al2016 themselves note: “infants appear to become bored much more quickly than the model”. It seems like this indicates a natural role for attention in future models. Interestingly, this is something LK&al2016 didn’t explicitly mention in describing future adaptations of the model in the conclusion. On the plus side, it doesn’t seem like it would be hard to build attention into the listening time calculation (e.g., just subtract some amount from the total looking time, based on some parameter connected to how much time has passed).
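A sketch of the kind of attention adjustment I mean, here as a multiplicative exponential decay rather than a flat subtraction; the decay form and the rate parameter are my own invention, not anything from LK&al2016 or CHREST.

```python
# Hypothetical "boredom" adjustment to model-predicted listening times:
# later trials get down-weighted by an exponential attention decay.
# The decay rate is a free parameter to fit, not a value from LK&al2016.
import math

def adjusted_listening_time(predicted_ms, elapsed_trial_time_s, decay_rate=0.02):
    attention = math.exp(-decay_rate * elapsed_trial_time_s)   # 1.0 at start, shrinks over time
    return predicted_ms * attention

for t in (0, 30, 60, 120):   # seconds into the test phase
    print(t, round(adjusted_listening_time(10_000, t)))
# 0 -> 10000, 30 -> ~5488, 60 -> ~3012, 120 -> ~907
```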

Monday, October 31, 2016

Some thoughts on Degen & Goodman 2014

One of the things I really enjoyed about this paper was seeing the precise assumptions that (we think) underlie dependent measures. It’s important to understand them — and understand the linking story more generally — if you’re going to connect model output (which typically is about some knowledge state that’s achieved/learned) to behavioral results (which involve using that knowledge to generate the observed behavior).

Meanwhile, I was just as surprised as the authors that the most natural of the three behavioral tasks they used (the sentence interpretation, i.e. what did the speaker mean by this?) was the one that seemed to wash away the pragmatic effects. I would have thought that pragmatic reasoning is what we use to understand how utterances are used in conversation (i.e., to figure out what the speaker meant in context). So, they ought to be more in effect for this kind of task than the more metalinguistic truth-value-ish (Expt 1) or what’s-the-speaker-going-to-say (Expt 2) tasks. But, clearly they weren’t. 

D&G2014 offer up a potential explanation involving an RSA model that views the interpretation task as involving a pragmatic listener (who reasons about a speaker informing a naive listener). In contrast, the truth-value and speaker-production tasks involve imagining a speaker’s productions. The  reason the pragmatic effects disappear for the interpretation task is because they get washed away by the pragmatic listener’s reasoning, according to D&G. I think I’d like to understand this a bit better (i.e., why exactly is this true, using the equation they provide). Is it because the pragmatic effects are only in play for certain utterances, and the world-state priors are really low for those utterances, so this yields no effect at the pragmatic listener level? (More specifically using equation 1 notation: Is it that P_speaker(w | b, QUD) has the pragmatic effect for certain box world-states b, and these are the ones with low prior P(b)?)
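To try to pin down my own question, here’s a bare-bones RSA sketch in the spirit of their equation 1 (QUD manipulation omitted for simplicity); the states, utterances, priors, and rationality parameter are toy values of mine, not D&G2014’s actual model.

```python
# Bare-bones RSA sketch: pragmatic listener L1(b | w) proportional to P_speaker(w | b) * P(b).
# States are "how many of the objects have the property"; toy numbers throughout; QUD omitted.
import numpy as np

states = [0, 1, 2, 3]                       # e.g., number of marbles found (out of 3)
utterances = ["none", "some", "all"]
prior = np.array([0.25, 0.25, 0.25, 0.25])  # toy P(b); skew this to see effects strengthen or wash out

def literal(u, b):                          # truth-conditional semantics
    return {"none": b == 0, "some": b >= 1, "all": b == 3}[u]

def speaker(b, alpha=4.0):                  # P_speaker(w | b): soft-max over informativity
    l0 = np.array([literal(u, b) / sum(literal(u, s) for s in states) for u in utterances])
    probs = np.exp(alpha * np.log(l0 + 1e-10))
    return probs / probs.sum()

def pragmatic_listener(u):                  # L1(b | w) ~ speaker(b)[u] * prior(b)
    scores = np.array([speaker(b)[utterances.index(u)] for b in states]) * prior
    return scores / scores.sum()

print(dict(zip(states, pragmatic_listener("some").round(3))))
# Mass piles up on the intermediate states (1, 2): the "some but not all" inference.
```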

Some additional thoughts:

(1) Expt 2, predicting the probability of a speaker’s word choice, Figure 2: It seems funny that speakers give any probability to answers besides “all” and the exact number when shown the complete set of marbles for the utterance “I found X of the marbles.” Even when the QUD is “Did she find all of them?”, we see some probability on “some” (for the “4” set, it’s not that much different from 0, but for the 16 set, it’s up there at 20%). Maybe this is really D&G’s note about people not wanting to be bothered with counting if they can’t subitize? (That is, having probability on “some” is a hedge because the participant is too lazy to count if there are sixteen marbles present.)


(2) Expt 3 and what it means for truth-value judgment (TVJT) tasks that we often use with kids to assess interpretations: Maybe we should back off from truth-judgments and try to go for more naturalistic “which of these did the speaker mean” judgments? For example, we give them an utterance and do some sort of eyetracking thing where they look at one of two pictures that correspond to possible utterance interpretations. This would seem to factor out some of the pragmatic interference, based on the adult results. I guess the main response from the TVJT people is they want to know when children allow a certain interpretation, even if it’s a very minority one — the setup of the TVJT is typically that children will only answer “no” if they really can’t get the interpretation in question period. But maybe you can also get around this with more indirect measures like eye gaze, too. That is, even if children consciously would say “no” for a TVJT, their eye gaze between two pictures would indicate they considered the relevant interpretation at some point during processing.

Tuesday, October 18, 2016

Some thoughts on Kao et al. 2016

I really like the approach this paper takes, where insights from humor research are operationalized using existing formal metrics from language understanding. It’s the kind of approach I think is very fruitful for computational research because it demonstrates the utility of bothering to formalize things — in this case, the outcome is surprisingly good prediction about degrees of funniness and more explanatory power about why exactly something is funny. 

As an organizational note, I love the tactic the authors take here of basically saying “we’ll explain the details of this bit in the next section”, which is an implicit way of saying “here’s the intuition and feel free to skip the details in the next section if you’re not as interested in the implementation.” For me, one big challenge of writing up modeling results of this kind is the level of detail to include when you’re explaining how everything works. It’s tricky because of how wide an audience you’re attempting to interest. Put too little computational detail in and everyone’s irritated at your hand-waving; put too much in and readers get derailed from your main point.  So, this presentation style may be a new format to try.

A few more specific thoughts:

(1) I mostly followed the details of the ambiguity and distinctiveness calculations (ambiguity is about entropy, distinctiveness is about KL divergence of the indicator variables f for each meaning). However, I think it’s worth pondering more carefully how the part described at the end of section 2.2, which feeds into the (R)elatedness calculation, actually works. If we’re getting relatedness scores between pairs of words (R(w_i, h), where h = homophone and w_i is another word from the utterance), then how do we compile that together to get the R(w_i, m) that shows up in equation 8? For example, where does free parameter r (which was empirically fit to people’s funniness judgments and captures a word’s own relatedness) show up?
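As a sanity check on my own understanding of the ambiguity and distinctiveness pieces (not the relatedness part I’m asking about), here’s a tiny numpy sketch; the meaning posterior and the per-meaning indicator distributions are invented numbers.

```python
# Sketch of the two pun measures as I understand them (toy numbers throughout):
# ambiguity       = entropy of the posterior over meanings given the sentence,
# distinctiveness = symmetrized KL divergence between the per-meaning
#                   distributions over which words (indicator variables f) support each meaning.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p + 1e-12))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2((p + 1e-12) / (q + 1e-12)))

p_meanings = [0.55, 0.45]          # posterior over the two meanings: near-uniform = ambiguous
f_m1 = [0.6, 0.1, 0.1, 0.2]        # which words support meaning 1 (toy)
f_m2 = [0.1, 0.6, 0.2, 0.1]        # which words support meaning 2 (toy)

ambiguity = entropy(p_meanings)
distinctiveness = 0.5 * (kl(f_m1, f_m2) + kl(f_m2, f_m1))
print(round(ambiguity, 3), round(distinctiveness, 3))
# High ambiguity plus high distinctiveness is the signature of a (good) pun.
```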

(2) I really like that this model is able to pull out exactly which words correspond to which meanings. This feature reminds me of topic models, where each word is generated by a specific topic. Here, a word is generated by a specific meaning (or at least, I think that’s what Figure 1 shows, with the idea that m could be a variety of meanings).


(3) I always find it funny in computational linguistics research that the general language statistics portion (here, in the generative model) can be captured effectively by a trigram model. The linguist in me revolts, but the engineer in me shrugs and thinks that if it works, then it’s clearly good enough. The pun classification results here are simply another example of why computational linguistics often uses trigram models to approximate language structure and pretty much finds them adequate for most of what it wants to do. Maybe for puns (and other joke types) that involve structural ambiguity rather than phonological ambiguity, we’d need something more than trigrams, though.
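For reference, the entire machinery of a basic trigram model is about this much (the maximum-likelihood version, no smoothing; the two-sentence corpus is obviously a stand-in for real training data).

```python
# Maximum-likelihood trigram model: P(w3 | w1, w2) = count(w1,w2,w3) / count(w1,w2).
# No smoothing; the tiny corpus is just a stand-in.
from collections import Counter

corpus = ["the cat sat on the mat".split(), "the cat lay on the rug".split()]
trigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    for i in range(len(toks) - 2):
        trigrams[tuple(toks[i:i+3])] += 1
        bigrams[tuple(toks[i:i+2])] += 1   # counted as trigram contexts

def p(w3, w1, w2):
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

print(p("sat", "the", "cat"))   # 0.5 -- "the cat" continues as "sat" half the time
```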

Wednesday, June 1, 2016

Some thoughts on Wellwood et al. 2016

In general, I love seeing this combination of behavioral and modeling work, and I’m also a big fan of the cue reliability vs. cue accessibility approach that Gagliardi’s work tends to have (ex: Gagliardi et al. 2012, 2014). That said, I had some difficulty following the details of the model that worked best (Model 4), and I’d really like to understand it better (more about this below).

Specific thoughts:

(1) The partitive frame: The partitive frame (“Gleeb of the cows are by the barn”) is an excellent syntactic signal for adults that the word is a quantity word (exact number, quantifier like “most”, etc.). So, that would signal the sense of numerosity, either exact or approximate. Based on the Wynn (1992) work, it seems like two-and-a-half-year-olds recognize this numerosity-ness of exact number words. Yet I wonder how prevalent the partitive frame is — I’m sure someone must have done a corpus analysis of child-directed speech (I’m thinking of some of the former students of Barbara Sarnecka here at UCI).

My intuition is that the partitive frame itself isn’t all that common. (Note: W&al2016 mention a corpus analysis of the partitive frame by Syrett et al. 2012 showing that the frame isn’t unambiguous for numerosity words, but they don’t mention how often the frame itself occurs.) My intuition might be wrong, but if it’s true, I wonder what other cues are available syntactically in order for numerosity to be associated with exact number words so early.  Maybe a more general syntactic distribution sort of thing? This may be important, given that the partitive frame isn’t an unambiguous cue to four-year-old children that the meaning is numerosity-focused.

On the other hand, for the behavioral results, W&al2016 are working with four-year-olds who may have more experience with the partitive frame in their input. Certainly, the partitive frame appears to be  a very reliable cue to numerosity meanings (at least when the novel word is in the determiner position: [Det position] “gleebest of the…” vs. [Adj position] “The gleebest of the…”). 

It was also useful to learn (in the next section) that determiners only refer to quantities cross-linguistically, so it seems to be a mapping that languages use. A lot. (Interesting question: Where would this bias come from? Built-in (i.e., UG) or (always) derivable somehow?)


(2) A role for informativity?: The example in (5) about why we can’t say “heaviest of the animals” (but we can say “Most of the animals”) reminds me of Greg Scontras’s work on informativity, where, for example, adjective ordering preferences depend on how much uncertainty there is on the part of the listener (ex: big red boxes vs. *red big boxes, as found in Scontras et al. 2015). I wonder if there’s a useful link there from the developmental perspective about which words get mapped to the determiner position. (Or perhaps, why the link between determiner position and permutation invariant words like most would be established.)



(3) The most of the cows were…:  My dialect of English utterly fails to allow 6b (“The most of the cows were by the barn.”) I can say something like “The majority of…”, but I just can’t handle “The most of the…”. Hopefully this cue (appearance in the adjectival position of the partitive frame) isn’t too critical a property of superlative acquisition in general. I guess in my case, it’s a cue for ruling out the numerosity meaning and really zeroing in on the quality meaning. So maybe the reliability of the syntactic cues is cleaner than for the dialect that allows “The most of the cows”? (That is, in the experimental stimuli in Table 1, “the gleebest of the cows” isn’t a confounded cue for me. It’s strictly a quality-meaning indicator when the word has the -est morphology.)  So, following up on this, I’m not surprised in Figure 3 that the Adjective [“the gleebest cows”] and Confounded [“the gleebest of the cows”] results look alike (i.e., children infer a quality meaning for “gleebest” in these contexts).


(4) Understanding the model variants: 

(a) Model 3 (Lexical + Conceptual bias): I couldn’t quite tell from the text, but does joint prior mean there’s equal weight for the lexical and the conceptual prior? 

(b) Model 4 (+Perceptual Bias): W&al2016 describe it as “…combining the lexical prior with the intuition that salience impacts how the likelihood, P(d|h), could be encoded with differing reliability for each hypothesis.” 

While I deeply appreciate Tables 5 and 7, I really wish we could see the full equation where the alpha, beta, and gamma terms are put into equation form. If I’m interpreting the text correctly, these parameters are meant to alter the likelihood calculation. In Models 1-3, likelihood = 1 and all the work is done in the prior. In Model 4, likelihood is 1 with some probability depending on alpha and beta (and not computed = 0 otherwise?). And then Model 4 is a mixture model of all four of the encoding options (A, B, C, and D). So are these weighted somehow when they’re combined into one probability? 

I think they may be weighted based on the alpha and beta parameters (“…combined in a mixture model (the sum of all four terms)” = the alpha, beta, gamma, etc parameters are the weights).  My natural inclination is to think “with probability alpha, A happens; with probability beta, B happens, with probability alpha * beta, C happens; with probability 1-p(A or B or C), D happens”, which would then lead to a simple summation like they describe. 
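Just to make my own reading concrete, here’s what that mixture would look like in code. This is my guess, not W&al2016’s stated formulation, and I’ve tweaked the weighting slightly so the four weights sum to one (treating the two confusions as independent); the alpha and beta values are the Table 6 ones mentioned below.

```python
# My guess at the Model 4 mixture (NOT necessarily W&al2016's actual formulation):
# each encoding option A-D contributes its likelihood, weighted by the probability
# that the learner ends up in that encoding state; the weights sum to 1.
def model4_likelihood(lik_A, lik_B, lik_C, lik_D, alpha=0.2, beta=0.025):
    w_A = alpha * (1 - beta)          # quantity confusion only
    w_B = beta * (1 - alpha)          # quality confusion only
    w_C = alpha * beta                # both confusions
    w_D = (1 - alpha) * (1 - beta)    # neither (faithful encoding)
    assert abs(w_A + w_B + w_C + w_D - 1.0) < 1e-9
    return w_A * lik_A + w_B * lik_B + w_C * lik_C + w_D * lik_D

# e.g., if only encoding D makes the data likely under a given hypothesis:
print(model4_likelihood(0.0, 0.0, 0.0, 1.0))   # 0.78 = (1 - 0.2) * (1 - 0.025)
```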


Small thing: Also, based on Figure 5, where did the values in Table 6 come from (alpha = quantity confusion = 0.2, beta = quality confusion = 0.025)?


References

Gagliardi, A., Feldman, N. H., & Lidz, J. (2012). When suboptimal behavior is optimal and why: Modeling the acquisition of noun classes in Tsez. In Proceedings of the 34th annual conference of the Cognitive Science Society (pp. 360-365).

Gagliardi, A., & Lidz, J. (2014). Statistical insensitivity in the acquisition of Tsez noun classes. Language, 90(1), 58-89.

Scontras, G., Degen, J., & Goodman, N. D. (2015). Subjectivity predicts adjective ordering preferences. Manuscript. http://web.stanford.edu/~scontras/Gregory_Scontras.html



Wednesday, May 18, 2016

Some thoughts on Moscati & Crain 2014

This paper really highlights to me the impact of pragmatic factors on children’s interpretations, something that I think we have a variety of theories about but maybe not as many formal implementations of (hello, RSA framework potential!). Also, I’m a fan of the idea of the Semantic Subset, though not as a linguistic principle, per se.  I think it could just as easily be the consequence of Bayesian reasoning applied over a linguistically-derived hypothesis space. But the idea that information strength matters is one that seems right to me, given what we know about children’s sensitivities to how we use language to communicate. 

That being said, I’m not quite sure how to interpret the specific results here (more details on this below). Something that becomes immediately clear from the learnability discussions in M&C2014 is the need for corpus analysis to get an accurate assessment of what children’s input looks like for all of these semantic elements and their various combinations.

Specific thoughts:

(1) Epistemic modals and scope
John might not come. vs. John cannot come: I get that might not is less certain than cannot, and so entailment relations hold between them. But the scope assignment ambiguity within a single utterance seems super subtle.

(a) Ex: John might not come. 

surface: might >> not: It might be the case that John doesn’t come. (Translation: There’s some non-zero probability of John not coming.)
inverse: not >> might: It’s not the case that John might come. (Translation: There’s 0% probability that John might come.  = John’s definitely not coming.)

Even though the inverse scope option is technically available, do we actually ever entertain that interpretation in English? It feels more to me like “not all” utterances (ex: “Not all horses jumped over the fence”) — technically the inverse scope reading is there (all >> not = “none”) , but in practice it’s effectively unambiguous in use (always interpreted as “not all”).

(b) Ex: John cannot come.
surface: can >> not: It can be the case that John doesn’t come. (Translation: There’s some non-zero probability of John not coming.)
inverse: not >> can: It’s not the case that John can come. (Translation: There’s 0% probability that John can come. = John’s definitely not coming.)

Here, we get the opposite feeling about how can is used. It seems like the inverse scope is the only interpretation entertained. (And I think M&C2014 effectively say this in the “Modality and Negation in Child Language” section, when they’re discussing how can’t and might not are used in English.)

I guess the point for M&C2014 is that this is the salient difference between might not and cannot. It’s not surface word order, since that’s the same. Instead, the strongly preferred interpretation differs depending on the modal, and it’s not always the surface scope reading. This is what they discuss as a polarity restriction in the introduction, I think. (Though they talk about might allowing both readings, and I just can’t get the inverse scope one.)
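Writing the two pairs of readings above in bare modal-logic notation (this is just my own restatement, nothing from M&C2014) makes the information-strength asymmetry easier to see:

```latex
% My restatement of the readings above, with $\Diamond$ = epistemic possibility
% and $p$ = "John comes":
%   might not:  surface $\Diamond\neg p$   vs.  inverse $\neg\Diamond p$
%   cannot:     surface $\Diamond\neg p$   vs.  inverse $\neg\Diamond p$
% The relevant entailment (given a serial/epistemic modal logic):
\neg\Diamond p \;\Rightarrow\; \Diamond\neg p
% So the preferred "cannot" reading ($\neg\Diamond p$) is strictly stronger than the
% preferred "might not" reading ($\Diamond\neg p$), which is the information-strength
% asymmetry that the Semantic Subset story is built on.
```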

(2) Epistemic modals, negation, and input:  Just from an input perspective, I wonder how often English children hear can’t vs. cannot (and then we can compare that to mightn’t vs. might not). My sense is that within each pair, can’t is much more frequent than cannot, and might not is much more frequent than mightn’t. One possible learning story component: The reason we have a different favored interpretation for cannot is that we first encounter it as the single lexical item can’t, and so treat it differently than a combination like might not, where we overtly recognize two distinct lexical elements, might and not. Beyond this, assuming children are sensitive to meaning (especially by five years old), I wonder how often they hear can’t (or cannot) used to effectively mean “definitely not” (the favored/only interpretation for cannot) vs. might not used to mean “possibly not” (the favored/only interpretation for might not).

(3) Conversational usage:

(a) Byrnes & Duff 1989: Five-year-olds don’t seem to distinguish between “The peanut can’t be under the cup” and “The peanut might not be under the box” when determining the location of the peanut. I wonder how adults did on this task. Basically, it’s a bit odd information-wise to get both statements in a single conversation. As an adult, I had to do a bit of meta-linguistic reasoning to interpret this: “Well, if it might not be under the box, that’s better than ‘can’t’ be under the cup, so it’s more likely to be under the box than the cup. But maybe it’s not under the box at all, because the speaker is expressing doubt that it’s under there.” In a way, it reminds me of some of the findings of Lewis et al. (2012) on children’s interpretations of false belief task utterances as literal statements of belief vs. parenthetical endorsements. (Ex: “Hoggle thinks Sarah is the thief”: literal statement of belief = this is about whether Hoggle is thinking something; parenthetical endorsement: there’s some probability (according to Hoggle) that Sarah is the thief.) Kids hear these kind of statements as parenthetical endorsements way more than they hear them as literal statements of belief in day-to-day conversation, and so interpret them as parenthetical endorsements in false belief tasks. That is, kids are assuming this is a normal conversation and interpreting the statements as they would be used in normal conversation.

Lewis, S., Lidz, J., & Hacquard, V. (2012, September). The semantics and pragmatics of belief reports in preschoolers. In Semantics and Linguistic Theory (Vol. 22, pp. 247-267).

(b) Similarly, in Experiment 1, I wonder again about conversational usage. In the discussion of children’s responses to the Negative Weak true items like “There might not be a cow in the box” (might >> not: It’s possible there isn’t a cow), many children apparently responded False because “A cow might be in the box.” Conversationally, this seems like a perfectly legitimate response. The tricky part is whether the original assertion is false, per se, rather than simply not the best utterance to have selected for this scenario.

(4) The hidden “only” hypothesis: 

In Experiment 1, M&C2014 found on the Positive True statements (“There is a cow in the box” with the child peeking to see if it’s true) that children were only at ~51.5% accuracy. This is weirdly low, as M&C2014 note. They discuss this as having to do with the particle “also”, suggesting a link to the “only” interpretation, i.e., children were interpreting this as “There is only a cow in the box.” (Side note: M&C2014 talk about this as “There might only be a cow in the box.”, which is odd. I thought the Positive and Negative sentences were just the bare “There is/isn’t an X in the box.”)  Anyway, they designed Experiment 2 to address this specific weirdness, which is nice.

In Experiment 2, though, there seems to me to be a potential weirdness with statements like “There might not be only a cow in the box”. Only has its own scopal impact, doesn’t it? Even if might takes scope over the rest, we still have might >> not >> only (= “It’s possible that it’s not the case there’s only a cow.” = There may be a cow and something else (as discussed later on in examples 44 and 45) = infelicitous in this setup where you can only have one animal = unpredictable behavior from kids). Another interpretation option is might >> only >> not (= “It’s possible that it’s only the case that it’s not a cow.” = may be not-a-cow (and instead be something else) = must be a horse in this setup = desired behavior from kids).

We then find that children in Experiment 2 decrease acceptance of Negative Weak True statements like “There might not be a cow in the box” to 33.3%. So, going with the hidden only story, they’re interpreting this as “It’s not the case that there might be (only) a cow in the box.” Again, we get infelicity if not >> only, since there can only be one animal in the box at a time. But this could either be because of the interpretation above (not >> might >> only) or because of the interpretation might >> not >> only (which is the interpretation that follows surface scope, i.e., not reconstructed). So it’s not clear to me what this rejection by children means.

(5) Discussion clarification: What’s the difference between example 46 = “It is not possible that a cow is in the box” and example 48 = “It is not possible that there is a cow [is] in the box”? Do these not mean the same thing? And I’m afraid I didn’t follow the paragraph after these examples at all, in terms of its discussion of how many situations one vs. the other is true in.


(6) Semantic Subset Principle (SSP) selectivity: It’s interesting to note that M&C2014 say the SSP is only invoked when there are polarity restrictions due to a lexical parameter. So, this is why M&C2014 say it doesn’t apply when the quantifier every is involved (in response to Musolino 2006). This then presupposes that children need to know which words have a lexical parameter related to polarity restrictions and which don’t. How would they know this? Is the idea that they just know that some meanings (like quantifier every) don’t carry these restrictions while others (like quantifier some) do? Is this triggered/inferrable from the input in some way?

Wednesday, May 4, 2016

Some thoughts on Snedeker & Huang 2016 in press

One of the things I really enjoyed about this book chapter was all the connections I can see for language acquisition modeling. An example of this for me was the discussion about kids’ (lack of) ability to incorporate pragmatic information of various kinds (more detailed comments on this below). Given that some of us in the lab are currently thinking about using the Rational Speech Act model to investigate quantifier scope interpretations in children, the fact that four- and five-year-olds have certain pragmatic deficits is very relevant. 

More generally, the idea that children’s representation of the input — which depends on their processing abilities — matters is exactly right (e.g., see my favorite Lidz & Gagliardi 2015 ref). As acquisition modelers, this is why we need to care about processing. Passives may be present in the input (for example) but that doesn’t mean children recognize them (and the associated morphology). That is, access to the information of the input has an impact, beyond the reliability of the information in the input, and access to the information is what children’s processing deals with.

More specific thoughts:

18.1:  I thought it was interesting that there are some theories of adult sentence processing that actively invoke an approximation of the ideal observer as a reasonable model (ex: the McRae & Matsuki 2013 reference that SH2016 cite). I suppose this is the foundation of the Rational Speech Act model as well, even though it doesn’t explicitly consider processing as an active process per se.

18.3: Something that generally comes out of this chapter is children’s poorer cognitive control (which is why they perseverate on their first choices). This seems like it could matter a lot in pragmatic contexts where children’s expectations might be violated in some way. They may show non-adult behavior not because they can’t get the correct answer, but rather because they can’t get to the correct answer once they’ve built up a strong enough expectation for a different answer.

18.4: Here we see evidence that five-year-olds aren’t sensitive to the referential context when it comes to disambiguating an ambiguous PP attachment (as in “Put the frog on the napkin in the box”). (And this contrasts with their sensitivity to prosody.) So, not only do they perseverate on their first mistaken interpretation, but they apparently don’t utilize the pragmatic context information that would enable them to get the correct interpretation to begin with (i.e., there are two frogs, so saying “the frog” is odd until you know which frog — therefore “the frog on the napkin” as a unit makes sense in this communicative context). This insensitivity to the pragmatics of “the” makes me wonder how sensitive children are in general to pragmatic inferences that hinge on specific lexical items — we see in section 18.5 that they’re generally not good at scalar implicatures until later, but I think they can get ad-hoc implicatures that aren’t lexically based (Stiller et al. 2015).

So, if we’re trying to incorporate this kind of pragmatic processing limitation into a model of children’s language understanding (e.g., cripple an adult RSA model appropriately), we may want to pay attention to what the pragmatic inference hinges on. That is, is it word-based or not? And which word is it? Apparently, children are okay if you use “the big glass” when there are two glasses present (Huang & Snedeker 2013). So it’s not just about “the” and referential uniqueness. It’s about “the” with specific linguistic ways of determining referential uniqueness, e.g., with PP attachment. SH2016 mention cue reliability in children’s input as one mitigating factor, with the idea that more reliable cues are what children pick first — and then they presumably perseverate on the results of what those reliable cues tell them.
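For what it’s worth, here’s the kind of skeleton I have in mind when I talk about crippling an adult RSA model: a minimal textbook-style RSA sketch, not our lab’s actual model, where the utterances, meanings, and the alpha rationality knob are all illustrative assumptions, and lowering alpha is one crude way of weakening the pragmatic inference.

```python
# Minimal Rational Speech Act (RSA) sketch -- a toy illustration, not the
# lab's actual model. Utterances, meanings, and the alpha "pragmatic
# strength" knob are illustrative assumptions.
import numpy as np

utterances = ["some", "all"]          # candidate messages
meanings = ["some-not-all", "all"]    # candidate world states

# Literal semantics: is utterance u literally true of meaning m?
literal = np.array([
    [1.0, 1.0],   # "some" is literally true of both states
    [0.0, 1.0],   # "all" is only true of the all-state
])

def L0(literal):
    """Literal listener: normalize truth values into P(m | u)."""
    return literal / literal.sum(axis=1, keepdims=True)

def S1(literal, alpha=4.0):
    """Pragmatic speaker: softmax of literal-listener informativity.
    Lower alpha ~ weaker pragmatic reasoning (one way to 'cripple' the model)."""
    util = np.log(L0(literal).T + 1e-10)      # rows: meanings, cols: utterances
    speaker = np.exp(alpha * util)
    return speaker / speaker.sum(axis=1, keepdims=True)

def L1(literal, alpha=4.0):
    """Pragmatic listener: Bayes over meanings given the speaker model (uniform prior)."""
    posterior = S1(literal, alpha).T          # rows: utterances, cols: meanings
    return posterior / posterior.sum(axis=1, keepdims=True)

print("adult-like (alpha=4):", L1(literal, alpha=4.0)[0])   # P(m | "some"): strong implicature
print("child-like (alpha=1):", L1(literal, alpha=1.0)[0])   # weaker implicature
```

Lowering alpha is of course only one way to weaken the model; the point is just that the knobs we turn should track what the child’s pragmatic inference actually hinges on.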

18.6: It was very cool to see evidence of the abstract category of Verb coming from children’s syntactic priming studies. At least by age three (according to the Thothathiri & Snedeker 2008 study), the abstract priming effects are just as strong as the within-verb priming effects, which suggests category knowledge that transfers from one individual verb to another. To be fair, I’m not entirely sure when the verb-island hypothesis folks expect the category Verb to emerge (they just don’t expect it to be there initially). But by age three is already relatively early.

18.7: Again, something that comes to mind for me as an acquisition modeler is how to use the information here to build better models. In particular, if we’re thinking about causes of non-adult behavior in older children, we should look at the top-down information sources children might need to integrate into their interpretations. Children may have less access to this information than adults do (or less ability to utilize it, which may effectively work out to the same thing in a model).


References

Lidz, J., & Gagliardi, A. (2015). How nature meets nurture: Universal grammar and statistical learning. Annual Review of Linguistics, 1(1), 333-353.

McRae, K., & Matsuki, K. (2013). Constraint-based models of sentence processing. In R. Van Gompel (Ed.), Sentence Processing (pp. 51-77). New York, NY: Psychology Press. 

Stiller, A. J., Goodman, N. D., & Frank, M. C. (2015). Ad-hoc implicature in preschool children. Language Learning and Development, 11(2), 176-190.

Wednesday, April 20, 2016

Some thoughts on Yang 2016 in press

As always, it’s a real pleasure for me to read things by Yang because of how clearly his viewpoints are laid out. For this paper in particular, it’s plain that Yang is underwhelmed by the Bayesian approach to cognitive science (and language acquisition in particular). I definitely understand some of the criticisms (and I should note that I personally love the Tolerance Principle that Yang advocates as a viable alternative). However, I did feel obliged to put on my Bayesian devil’s advocate hat here at several points. 

Specific comments:

(1) The Evaluation Metric (EvalM) is about choosing among alternative hypotheses (presumably balancing fit with simplicity, which is one of the attractive features of the Bayesian approach). If I’m interpreting things correctly, the EvalM was meant to be specifically linguistic (embedded in a linguistic hypothesis space) while the Bayesian approach isn’t, so simplicity needs to be defined in linguistically meaningful ways. Speaking as a Bayesian devil’s advocate, though, this doesn’t seem incompatible with having a general preference for simplicity that gets cashed out within a linguistic hypothesis space.
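To make that compatibility concrete (my own gloss, not a formulation from Yang or anyone else), the general simplicity preference can live in the prior while the hypotheses themselves remain linguistic objects, with the size of a grammar measured in linguistically meaningful units like rules, parameters, or features:

```latex
% My gloss: Bayesian fit-plus-simplicity with a description-length prior over grammars G
P(G \mid D) \;\propto\; \underbrace{P(D \mid G)}_{\text{fit to data } D} \;\times\; \underbrace{P(G)}_{\text{simplicity}},
\qquad \text{e.g. } P(G) \propto 2^{-|G|}
```

Here |G| is counted over linguistic primitives, so the domain-general preference ends up being defined within the linguistic hypothesis space.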

(2) Idealized learners

General: Yang’s main beef is with idealized approaches to language learning, but presumably with very particular ones, because of course every model idealizes away some aspects of the learning process.

(a) Section 2: Yang’s opinion is that a good model cares about “what can be plausibly assumed” about “the actual language acquisition process”. Totally agreed. This includes what the hypothesis space is — which is crucially important for any acquisition model. It’s one of the things that an ideal learner model can check for — assuming the inference can be carried out to yield the best result, will this hypothesis space yield a “good” answer (however that’s determined)? If not, don’t bother doing an algorithmic-level process where non-optimal inferences might result — the modeled child is already doomed to fail. That is, ideal learner models of the kind that I often see (e.g., Goldwater et al. 2009, Perfors et al. 2011, Feldman et al. 2013, Dillon et al. 2013) are useful for determining whether the acquisition task conceptualization, as defined by the hypothesis space and realistic input data, is reasonable. This seems like an important sanity check before you get into more cognitively plausible implementations of the inference procedure that’s going to operate over this hypothesis space, given these realistic input data. In this vein, I think these kinds of ideal learner models do in fact include relevant “representational structure”, even if it’s true that they leave out the algorithmic process of inference and the neurobiological implementation of the whole thing (representation, hypothesis space, inference procedure, etc.).
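To make concrete what I mean by this sanity check, here’s a toy sketch (entirely my own construction, not any of the cited models, with invented hypotheses and input counts): enumerate the hypothesis space, score each hypothesis against realistic input, and check whether the winner is the grammar we think children actually converge on.

```python
# Toy ideal-learner sanity check (my construction, not Goldwater et al.,
# Perfors et al., Feldman et al., or Dillon et al.). The hypotheses,
# input corpus, and scoring are illustrative assumptions.
from math import log

# Hypothesis space: each hypothesis assigns probabilities to observable forms.
hypotheses = {
    "H1_overt_subjects": {"she sleeps": 0.7, "sleeps": 0.3},
    "H2_null_subjects":  {"she sleeps": 0.3, "sleeps": 0.7},
}
priors = {"H1_overt_subjects": 0.5, "H2_null_subjects": 0.5}

# "Realistic" input: what the child actually encounters (toy counts).
input_corpus = {"she sleeps": 80, "sleeps": 20}

def log_posterior(h):
    """Unnormalized log posterior = log likelihood of the input + log prior."""
    ll = sum(count * log(hypotheses[h].get(form, 1e-10))
             for form, count in input_corpus.items())
    return ll + log(priors[h])

scores = {h: log_posterior(h) for h in hypotheses}
best = max(scores, key=scores.get)
print(best)   # if this isn't the "right" grammar, the task conceptualization
              # (hypothesis space + input) needs rethinking before building
              # algorithmic-level models
```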

(b) This relates to the comment in Section 2 about how “surely an idealized learner can read off the care-taker’s intentional states” — well, sure, you could create an idealized learner that does that. But that’s not a reasonable estimate of the input representation a child would have, and so a reasonable ideal learner model wouldn’t do it. Again, I think it’s possible to have an ideal learner model that doesn’t idealize plausible input representation.

Moreover, I think this kind of ideal learner model fits in with the point made about Marr’s view on the interaction of the different levels of explanation, i.e., “a computational level theory should inform the study of the algorithmic and implementational levels”. So, you want to make sure you’ve got the right conceptualization of the acquisition task first (computational level). Then it makes sense to explore the algorithmic and implementational levels more thoroughly, with that computational-level guidance.

(3) Bayesian models & optimality

(a) Section 3: While it’s true that Bayesian acquisition models build in priors such as preferring “grammars with fewer symbols or lexicons with shorter words”, I always thought that was a specific hypothesis these researchers were making concrete. That is, these are learning assumptions which might be true. If they are (i.e., if this is the conceptualization of the task and the learner biases), then we can see what the answers are. Do these answers then match what we know about children’s behavior (yes or no)? So I don’t see that as a failing of these Bayesian approaches. Rather, it’s a bonus — it’s very clear what’s built in (preferences for these properties) and how exactly it’s built in (the prior over hypotheses). And if it works, great, we have a proposal for the assumptions that children might be bringing to these problems. If not, then maybe these aren’t such great assumptions, which makes it less likely children have them.

(b) Section 3: In terms of model parameters being tuned to fit behavioral data, I’m not sure I see that as quite the problem Yang does. If you have free parameters, that means those are things that could matter (and presumably have psychological import). So, knowing what parameter values are needed to fit the behavioral data then tells you something about what those values might be for humans.

(c) Section 3:  For likelihoods, I’m also not sure I’m as bothered about them as Yang is. If you have a hypothesis and you have data, then you should have an idea of the likelihood of the data given that hypothesis. In some sense, doesn’t likelihood just fall out from hypothesis + data? In Yang’s example of the probability of a particular sentence given a specific grammar, you should be able to calculate the probability of that sentence if you have a specific PCFG. It could be that Yang’s criticism is more about how little we know about human likelihood calculation. But I think that’s one of the learner assumptions — if you have this hypothesis space and this data and you calculate likelihoods this way (because it follows from the hypothesis you have and the data you have), then these are the learning results you get.
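To put a toy number on that point (toy PCFG of my own, not the grammar from Yang’s example): with a specific PCFG in hand, the probability of a sentence is just the product of the rule probabilities in its derivation, summed over derivations if the sentence is ambiguous.

```python
# Probability of a sentence under a toy PCFG (grammar invented for illustration).
# With one parse per sentence, P(sentence | G) is the product of the rule
# probabilities used; with ambiguity you'd sum over parses.
rules = {
    # (LHS, RHS): probability
    ("S",  ("NP", "VP")):   1.0,
    ("NP", ("the", "dog")): 0.6,
    ("NP", ("the", "cat")): 0.4,
    ("VP", ("barks",)):     0.5,
    ("VP", ("sleeps",)):    0.5,
}

def parse_prob(derivation):
    """Multiply the probabilities of the rules used in a derivation."""
    p = 1.0
    for rule in derivation:
        p *= rules[rule]
    return p

# Derivation for "the dog barks": S -> NP VP, NP -> the dog, VP -> barks
deriv = [("S", ("NP", "VP")), ("NP", ("the", "dog")), ("VP", ("barks",))]
print(parse_prob(deriv))   # 1.0 * 0.6 * 0.5 = 0.3
```

So the open question is less whether the likelihood is computable and more whether this is how humans compute (or approximate) it, which is exactly the learner-assumption territory I mention above.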

(d) 3.1, variation: I think a Bayesian learner is perfectly capable of dealing with variation. It would presumably infer a distribution over the various options. In fact, as far as I know, that’s generally what the Bayesian acquisition models do. The output at any given moment may either be the maximum a posteriori choice or a probabilistic sample from that distribution, so you just get one output — but that doesn’t mean the learner doesn’t have the distribution underneath. This seems like exactly what Yang would want when accounting for variation in a particular linguistic representation within an individual. That said, his criticism of a Bayesian model that has to select the maximum a posteriori option as its underlying representation is perfectly valid — it’s just that this is only one kind of Bayesian model, not all of them.
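Here’s the distinction in miniature (toy posterior and variant names invented for illustration): the learner keeps the whole distribution over options and still emits a single output per occasion, either the MAP choice or a sample.

```python
# A learner can maintain a posterior over variants and still emit one output
# at a time. Toy numbers; the variant names are illustrative only.
import random

posterior = {"variant_A": 0.7, "variant_B": 0.3}   # underlying distribution

def output_map(post):
    """Always produce the maximum a posteriori variant."""
    return max(post, key=post.get)

def output_sample(post):
    """Probability matching: sample a variant in proportion to its posterior."""
    variants, weights = zip(*post.items())
    return random.choices(variants, weights=weights)[0]

print(output_map(posterior))                         # always "variant_A"
print([output_sample(posterior) for _ in range(5)])  # mixes A and B (~70/30)
```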

(e) 3.2: For the discussion about exploding hypothesis spaces, I think there’s a distinction between the explicit vs. latent hypothesis space for every ideal learner model I’ve ever seen. Perfors (2012) talks about this some, and the basic idea is that the child doesn’t have to consider an infinite (or really large) number of hypotheses explicitly in order to search the hypothesis space. Instead, the child just has to have the ability to construct explicit hypotheses from that latent space. (Which always reminds me of using linguistic parameter values to construct grammars, like Yang’s variational learner does, for instance.) See the sketch just after the Perfors reference below.

Perfors, A. (2012). Bayesian Models of Cognition: What's Built in After All? Philosophy Compass, 7(2), 127-138.
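A quick sketch of what I mean by explicit vs. latent (parameter names invented for illustration): the latent space is just a small set of parameters, and explicit grammars get constructed on demand rather than enumerated up front, much like building grammars from parameter values.

```python
# Latent vs. explicit hypothesis space: the learner stores a few binary
# parameters (the latent space) and constructs explicit grammars only as
# needed. Parameter names are invented for illustration.
from itertools import product

parameters = ["wh_movement", "null_subject", "verb_raising"]

def grammars():
    """Lazily construct explicit grammars from the latent parameter space."""
    for values in product([0, 1], repeat=len(parameters)):
        yield dict(zip(parameters, values))

gen = grammars()
print(next(gen))   # one explicit grammar, built on demand
print(next(gen))   # another; the 2^3 grammars are never all held in memory
```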

(f) 3.2: I admit, I find the line of argumentation about output comparison much more convincing. If one model (e.g., a reinforcement learning one) yields better learning results than another (e.g., a Bayesian one), then I’m interested.

(g) 3.2: “Without a feasible means of computing the expectations of hypotheses…indirect negative evidence is unusable.” — Agreed that this is a problem (for everyone). That’s why the hypothesis space definition seems so crucial. I wonder if there’s some way to do a “good enough” calculation, though. That is, given the child’s current understanding of the (grammar) options, can the approximate size of one grammar be calculated? This could be good enough, even if it’s not exactly right.
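Purely speculatively, the kind of “good enough” calculation I’m imagining is Monte Carlo style (toy grammar and weights invented for illustration): sample from the grammar as the child currently understands it, and use the sample proportion as an approximate expectation.

```python
# Speculative sketch: approximate a grammar's expectations by sampling rather
# than computing them exactly. Here we estimate how much of a toy grammar's
# probability mass falls on a construction of interest, without enumerating
# the grammar's (huge) language.
import random
random.seed(0)

toy_grammar = {"SVO": 0.6, "SV": 0.3, "OSV": 0.1}   # invented sentence-type weights

def sample_sentence_type(grammar):
    """Stand-in for sampling a sentence type from the grammar."""
    types, weights = zip(*grammar.items())
    return random.choices(types, weights=weights)[0]

n = 10_000
hits = sum(sample_sentence_type(toy_grammar) == "OSV" for _ in range(n))
print(hits / n)   # approximate expectation of OSV under the grammar --
                  # "good enough" for indirect negative evidence even if not exact
```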

(h) 3.2: “…use a concrete example to show that indirect negative evidence, regardless of how it is formulated, is ineffective when situated in a realistic setting of language acquisition”. — This is a pretty big claim. Again, I’m happy to walk through a particular example and see that it doesn’t work. But I think it’s a significant step to go from that to the idea that it could never work in any situation.

(i) 3.3, overhypothesis for the a-adjective example: To me, the natural overhypothesis for the a-adjectives is with the locative particles and prepositional phrases. So, the overhypothesis is about elements that behave certain ways (predicative = yes, attributive = no, right-adverb modification = yes), and the specific hypotheses are about the a-adjectives vs. the locative particles vs. the prepositional phrases, which have some additional differences that distinguish them. That is, overhypotheses are all about leveraging indirect positive evidence like the kind Yang discusses for a-adjectives. Overhypotheses (not unlike linguistic parameters) are the reason you get equivalence classes even though the specific items may seem pretty different on the surface. Yang seems to touch on this in footnote 11, but then uses it as a dismissal of the Bayesian framework. I admit, I found that puzzling. To me, it seems to be a case of translating an idea into a more formal mathematical version, which seems great when you can do it.
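Schematically, and this is just my own toy rendering of the overhypothesis idea rather than anything in Yang’s footnote, the overhypothesis is the shared behavioral signature that a-adjectives, locative particles, and PPs all inherit, with the specific hypotheses adding their item-level differences.

```python
# Toy rendering of an overhypothesis as a shared feature signature that
# several classes inherit, with class-specific overrides. Feature values
# are simplified assumptions, not a worked-out analysis.
overhypothesis = {                 # the equivalence class's shared signature
    "predicative": True,
    "attributive": False,
    "right_adverb_modification": True,
}

class_specific = {                 # specific hypotheses can override/extend
    "a-adjectives":       {},                              # inherit everything
    "locative_particles": {"takes_DP_complement": False},
    "PPs":                {"takes_DP_complement": True},
}

def hypothesis_for(category):
    """Specific hypothesis = overhypothesis + class-specific deviations."""
    return {**overhypothesis, **class_specific[category]}

print(hypothesis_for("a-adjectives"))
print(hypothesis_for("PPs"))
```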


(4) Tolerance Principle

(a) 4.1: Is the Elsewhere Condition only a principle “for the organization of linguistic information”? I can understand that it’s easily applied to linguistic information, but I always assumed it’s meant to be a (domain-)general principle of organization.


(b) 4.2: I like seeing the Principle of Sufficiency (PrinSuff) explicitly laid out, since it tells us when to expect generalization vs. not. That said, I was a little puzzled by the condemnation of indirect negative evidence that was based on the PrinSuff: “That is, in contrast to the use of indirect negative evidence, the Principle of Sufficiency does not conclude that unattested forms are ungrammatical….”. Maybe the condemnation is about how the eventual conclusion of most inference models relying on indirect negative evidence is that the item in question would be ungrammatical? But this seems to be all about interpretation: these inference models could just as easily interpret the final conclusion of “not(grammatical)” as “I don’t know that it’s grammatical” (the way the PrinSuff does here) rather than as “it’s ungrammatical”.
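For reference, and with the caveat that this is my reconstruction of Yang’s formulations from memory rather than a quote from the paper, the thresholds at issue look like this:

```latex
% Tolerance Principle (as I understand it): a rule R over N items tolerates its
% e exceptions, and so is productive, iff
e \le \theta_N = \frac{N}{\ln N}
% Principle of Sufficiency (as I understand it): with M of the N candidate items
% attested to follow R, generalize R to all N iff the unattested remainder fits
% under the same threshold
N - M \le \theta_N = \frac{N}{\ln N}
```

On that reading, the PrinSuff’s conclusion about the remaining N − M items is “not known to be grammatical under R yet”, which is exactly the softer interpretation I’m suggesting indirect-negative-evidence models could adopt too.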