Monday, December 4, 2017

Some thoughts on Perkins et al. 2017

I really enjoy seeing Bayesian models like this because it’s so clear exactly what’s built in and how. In this particular model, a couple of things struck me: 

(1) This learner needs to have prior (innate? definitely linguistic) knowledge that there are three classes of verbs with different properties. That actually goes a bit beyond just saying a verb has some probability of taking a direct object, which I think is pretty uncontroversial.

(2) The learner only has to know that its parsing is fallible, which causes errors — but notably the learner doesn’t need to know the error rate(s) beforehand. So, as P&al2017 note in their discussion, this means less specific knowledge about the filter has to be built in a priori.

Other thoughts:
(1) Thinking some about the initial stage of learning P&al2017 describe in section 2: So, this learner isn’t supposed to yet know that a wh-word can connect to the object of the verb. It’s true that acquiring that specific knowledge is hard without already knowing which verbs are transitive (as P&al2017 point out). But does the learner know anything about wh-words looking for connections to things later in the utterance? For example, I’m thinking that maybe the learner encounters other wh-words that are clearly connected to the subject or object of a preposition: “Who ate a sandwich?” “Who did Amy throw a frisbee to?” In those cases, it’s not a question of verb subcategorization: the wh-word is connecting to/standing in for something later on in the utterance.

If the learner does know wh-words are searching for something to connect to later in the utterance, due to experience with non-object wh-words, then maybe a wh-word that connects to the object of a verb isn’t so mysterious (e.g., “What did John eat?”). That is, because the child knows wh-words connect to something else and there’s already a subject present, that leaves the object. Then, non-basic wh-questions actually can be parsed correctly and don’t have to be filtered out. They in fact are signals of a verb’s transitivity.

Maybe P&al2017’s idea is that this wh-awareness is a later stage of development. But I do wonder how early this more basic wh-words-indicate-a-connection knowledge is available.

(2) Thinking about the second part of the filter, involving delta (which is the chance of getting a spurious direct object due to a parsing error): I would have thought that this depended on which verb it was. Maybe it would help to think of a specific parsing error that would yield a spurious direct object. From section 5.1, we get this concrete example: “wait a minute”, with “a minute” parsed as a direct object. It does seem like it should depend on whether the verb is likely to have a direct object there to begin with, rather than a general direct object hallucination parsing error. I could imagine that spurious direct objects are more likely to occur for intransitive verbs, for instance.

I get that parsing error propensity (epsilon) doesn’t depend on verb, though.
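To make the epsilon/delta discussion concrete, here is a minimal sketch of how I read the filter. Everything here is my own assumption rather than P&al2017's implementation: the mixture form (with probability epsilon an observation is a parsing error, and an erroneous observation shows a direct object with probability delta), the function names, the uniform prior over the three classes, the grid over the alternating verb's direct-object rate, and above all the fixed values of epsilon and delta (the actual learner infers them instead of being handed them).

```python
from math import comb

def p_direct_object(verb_class, theta_alt, epsilon, delta):
    """P(parse shows a direct object) under the assumed mixture form."""
    # Assumed true direct-object rates: 1 for transitive, 0 for intransitive,
    # and a free rate theta_alt for alternating verbs.
    theta = {"transitive": 1.0, "intransitive": 0.0, "alternating": theta_alt}[verb_class]
    # Correct parses show the true rate; erroneous parses show delta.
    return (1.0 - epsilon) * theta + epsilon * delta

def binomial(k, n, p):
    """Probability of k direct objects in n uses of the verb."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def class_posterior(k, n, epsilon, delta, grid=99):
    """Posterior over the three verb classes, assuming a uniform class prior."""
    # Average over a grid of theta_alt values (a stand-in for a uniform prior).
    thetas = [(i + 1) / (grid + 1) for i in range(grid)]
    likelihood = {
        "transitive": binomial(k, n, p_direct_object("transitive", 0.0, epsilon, delta)),
        "intransitive": binomial(k, n, p_direct_object("intransitive", 0.0, epsilon, delta)),
        "alternating": sum(
            binomial(k, n, p_direct_object("alternating", t, epsilon, delta)) for t in thetas
        ) / grid,
    }
    total = sum(likelihood.values())
    return {c: v / total for c, v in likelihood.items()}

# Hypothetical verb seen with a direct object in 18 of 20 uses.
print(class_posterior(18, 20, epsilon=0.1, delta=0.2))
```

Note that delta is a single number shared by all verbs in this sketch, which is exactly the assumption I was questioning above. A verb-specific version would swap the delta argument for a per-verb lookup.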

(3) Thinking about the model’s target state: P&al2017 base this on adult classes from Levin (1993), but I wonder if it might be fairer to adjust that based on the actual child-directed speech usage (e.g., what’s in Table 2). For example, if “jump” was only ever used intransitively in this input sample, is it a fair target state to say it should be alternating? 

I guess this comes down to the general problem of defining the target state for models of early language learning. Here, what you’d ideally like is an output set of verb classes that corresponds to those of a very young child (say, a year old). That, of course, is hard to get. Alternatively, maybe what you want to have is some sort of downstream evaluation where you see if a model using that inferred knowledge representation can perform the way young children are attested to in some other task.

For example, one of the behaviors of this model, as noted in section 5.1, is that it assigns lots of alternating verbs to be either transitive or intransitive. It would be great to test this behaviorally with kids of the appropriate age to see if they also have these same mis-assignments.

(4) Related to the above about the overregularization tendencies: I love the idea that P&al2017 suggest in the discussion about this style of assumption (i.e., “the parser will make errors but I don’t know how often”). They note that it could be useful for modeling cases of child overregularization. We certainly have a ton of data where children seem more deterministic than adults in the presence of noisy data. It’d be great to try to capture some of those known behavioral differences with a model like this.

Monday, November 20, 2017

Some thoughts on Stevens et al. 2017

It’s really nice to see an RSA model engaging with pretty technical aspects of linguistic theory, as S&al2017 do here. In these kinds of problems, there tend to be a lot of links to follow in the chain of reasoning, and it’s definitely not easy to adequately communicate them in such a limited space. (Side note: I forget how disorienting it can be to not know specific linguistics terms until I try to read them all at once in an abstract without a concrete example. This is a good reminder to those of us who work in more technical areas: Make sure to have concrete examples handy. The same thing is true for walking through the empirical details with the prosodic realizations as S&al2017 have here —  I found the concrete examples super-helpful.)

Specific thoughts:

(1) For S&al2017, “information structure” = inferring the QUD probabilistically from prosodic cues?

(2) I think the technical linguistic material is worth going over, as it connects to the RSA model. For instance, I’m struggling a bit to understand the QUD implications for having incomplete answers vs. having complete answers, especially as it relates to a QUD’s compatibility with a given melody.

For example, when we hear “Masha didn’t run QUICKLY”, the QUD is something like “How did Masha run?”. That’s an example of an incomplete answer. What’s a complete answer version of this scenario, and how does this impact the QUD? Once I get this, then I think it makes complete sense to use the utility function defined in equation (10). 

(3) I was struck by S&al2017’s notational trick, where they get out of the recursive social reasoning loop of literal listener to speaker to pragmatic listener. Here, it’s utility function to speaker to hearer because they’re presumably trying to deemphasize the social reasoning aspect? Or they just thought it made more sense described this way?

(4) About those results:
Figure 2: It’s nice to see modelers investigating the effect of the rationality (softmax) parameter in the speaker function. From the look of Figure 2, speakers need to be pretty darned rational indeed (really exaggerate endpoint behavior) in order to get any separation in commitment certainty predictions. 

Thinking about this intuitively, we should expect the LH Name condition (MASHA didn’t run quickly) to continue to be ambivalent about commitment to Masha running at all. That definitely shows up. I think. (Actually, I wonder if it might have been more helpful to ask participants to rate things on a scale from 1 (No, certainly not) to 7 (Yes, certainly so). That seems like it would make a 4 score easier to interpret (4 = maybe yes, maybe no). Here, I’m a little unsure how participants were interpreting the middle of the scale. I would have thought “No, not certain” would be the “maybe yes, maybe no” option, and so we would expect scores of 1. This is something of an issue when we come to the quantitative fit of the model results to the experimental results. Is the behavioral difference shallow just because of the way humans were asked to give their answers? The way the model probability is calculated in (16) suggests that the model is operating more under the 1 = “no, certainly not” version, if I’m interpreting it correctly: you have the “certainly yes” option contrasted with the “certainly not” option.)

Clearly, however, we see a shift up in human responses in Figure 3 for the LH Adverb condition (Masha didn’t run QUICKLY), which does accord with my intuitions. And we get them from the model in Figure 2, as long as that rationality parameter is turned way up. (Side note: I’m a little unclear about how to interpret the rationality parameter, though. We always hedge about it in our simulation results. It seems to be treated as a noise parameter, i.e., humans are noisy, so let’s use this to capture some messy bits of their behavior. In that case, maybe it doesn’t mean much of anything that it has to be turned up so high here.)
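For intuition about what “turning the rationality parameter way up” does, here is a small sketch of a generic softmax choice rule. The utilities are made-up numbers and the function name is hypothetical; this is not S&al2017's speaker function, just the standard softmax form that a rationality parameter usually plugs into.

```python
import math

def softmax_choice(utilities, alpha):
    """Choice probabilities proportional to exp(alpha * utility)."""
    scaled = [alpha * u for u in utilities]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Two options whose (hypothetical) utilities differ only slightly.
utilities = [1.0, 0.8]
for alpha in (1, 5, 20):
    print(alpha, softmax_choice(utilities, alpha))
```

With a small alpha the two options are chosen almost equally often; with a large alpha nearly all the probability goes to the slightly better one. That is the “exaggerate endpoint behavior” effect: small utility differences only separate the predictions once alpha is high.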

Monday, November 6, 2017

Thoughts on Orita et al. 2015

I really appreciated how O&al2015 used the RSA modeling framework to make a theory (in this case, about discourse salience) concrete enough to implement and then evaluate against observable behavior. As always, this is the kind of thing I think modeling is particularly good at, so the more that we as modelers emphasize that, the better.

Some more targeted thoughts:

(1) The Uniform Information Density (UID) Hypothesis assumes receiving information in chunks of approximately the same size is better for communication. I was trying to get the intuition of that down -- is it that new information is easier to integrate if the amount of hypothesis adjustment needed based on that new information is always the same? (And if so, why should that be exactly? Some kind of processing thing?)

Related: If I’m understanding correctly, the discourse salience version of the UID hypothesis means more predictable forms become pronouns. This gets cashed out initially as the surprisal component of the speaker function in (3) (I(words; intended referent, available referent)), which is just about vocabulary specificity (that is, inversely related to how ambiguous the literal meaning of the word is). Then 3.2 talks about how to incorporate discourse salience. In particular, (4) incorporates the literal listener interpretation given the word, and (5) is just straight Bayesian inference where the priors over referents are what discourse salience affects. Question: Would we need these discourse-salience-based priors to reappear in the pragmatic listener level if we were using that level? (It seems like they belong there too, right?)

Speaking of levels, since O&al2015 are modeling speaker productions, is the S1 level the right level? Or should they be using an S2 level, where the speaker assumes a pragmatic listener is the conversational partner? Maybe not because we usually save the S2 level for metalinguistic judgments like endorsements in a truth-value judgment task?
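Here is how I picture the pieces fitting together, as a toy sketch and not O&al2015's actual implementation. The referents, the two-word lexicon, the cost values, and the function names are all hypothetical; the only parts taken from the paper's setup as I understand it are that the listener does Bayesian inference with a discourse-salience prior over referents, and that the speaker trades off listener surprisal against expression cost.

```python
import math

# Hypothetical lexicon: which referents each expression can literally pick out.
LEXICON = {"she": {"Masha", "Amy"}, "Masha": {"Masha"}}
# Hypothetical production costs (a pronoun is cheaper than a proper name).
COST = {"she": 1.0, "Masha": 2.0}

def listener(word, salience):
    """P(referent | word): literal meaning times discourse-salience prior."""
    scores = {r: (p if r in LEXICON[word] else 0.0) for r, p in salience.items()}
    total = sum(scores.values())
    return {r: s / total for r, s in scores.items()}

def speaker(referent, salience):
    """P(word | referent) proportional to exp(log P_listener - cost)."""
    weights = {
        w: listener(w, salience)[referent] * math.exp(-COST[w]) for w in LEXICON
    }
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

# Masha is highly salient in the first context and not in the second.
print(speaker("Masha", {"Masha": 0.9, "Amy": 0.1}))
print(speaker("Masha", {"Masha": 0.5, "Amy": 0.5}))
```

The more salient the intended referent, the less surprising the pronoun is for the listener, so the cheaper pronoun wins more often. A pragmatic listener level would apply the same salience prior again on top of this speaker, which is why I suspect the priors belong there too.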

(2) Table 1: Just looking at the log likelihood scores, it seems like frequency-based discourse salience is the way to go (and this effect is much more pronounced in child-directed speech). However, the text in the discussion by the authors notes how the recency-based discourse salience version has better accuracy scores, though most of that is due to the proper name accuracy since every model is pretty terrible at pronoun accuracy. I’m not entirely sure I follow the authors’ point about why the accuracy and log likelihood scores don’t agree on the winner. If the recency-based models return higher probabilities for a proper name, shouldn’t that make the recency-based log likelihood score better than the frequency-based log likelihood score? Is the idea that some proper names get all the probability (for whatever reason) for the recency-based version, and this so drastically lowers the probabilities of the other proper names that a worse log likelihood results?
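To check that my guess is at least arithmetically possible, here is a toy example with made-up numbers (nothing here comes from Table 1). Each list holds the probability a model assigns to the form that was actually produced, and I assume accuracy means the true form got more than half the probability, which only makes sense if there are two candidate forms.

```python
import math

def accuracy(probs_of_true_form):
    """Share of items where the true form gets more than half the probability."""
    return sum(p > 0.5 for p in probs_of_true_form) / len(probs_of_true_form)

def log_likelihood(probs_of_true_form):
    """Sum of log probabilities assigned to the forms actually produced."""
    return sum(math.log(p) for p in probs_of_true_form)

# Hypothetical "confident" model: nearly certain on 9 items, badly wrong on 1.
confident = [0.99] * 9 + [1e-6]
# Hypothetical "cautious" model: mildly right on 7 items, mildly wrong on 3.
cautious = [0.6] * 7 + [0.4] * 3

for name, probs in (("confident", confident), ("cautious", cautious)):
    print(name, accuracy(probs), round(log_likelihood(probs), 2))
```

The confident model wins on accuracy (0.9 vs. 0.7) but loses on log likelihood, because a single near-zero probability costs more than all the small hedges combined. So the two scores can disagree if the recency-based version is very sure of itself and occasionally very wrong.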

But still, no matter what, discourse saliency looks like it’s having the most impact (though there’s some impact of expression cost). In the adult-directed dataset, you can actually get pretty close to the best log likelihood with the -cost frequency-based version (-1017) vs. the complete  frequency-based version (-958). But if you remove discourse salience, things get much, much worse (-6904). Similarly, in the child-directed dataset, the -cost versions aren’t too much worse than the complete versions, but the -discourse version is horrible.

All that said, what on earth happened with pronoun accuracy? There’s clearly a dichotomy between the proper name results and the pronoun results, no matter what model version you look at (except maybe the adult-directed -unseen frequency-based version).

(3) In terms of next steps, incorporating visual salience seems like a natural step when calculating discourse saliency. Probably the best way to do this is as a joint distribution in the listener function for the prior? (I also liked the proposed extension that involves speaker identity as part of the relevant context.) Similarly, incorporating grammatical and semantic constraints seems like a natural extension that could be implemented the same way. Probably a hard part is getting plausible estimates for these priors?

Monday, October 16, 2017

Thoughts on Yoon et al. 2017

I really enjoyed seeing another example of a quantitative framework that builds a pipeline between behavioral data and modeling work. The new(er?) twist in Y&al2017 for me is using Bayesian data analysis to do model-fitting after the behavioral data were collected (originally collected to evaluate the unfitted model predictions). It definitely seems like the right thing to do for model validation. More generally, this pipeline approach seems like the way forward for a lot of different language science questions where we can’t easily manipulate the factors we want experimentally. (In fact, you can see some of the trouble here about how to interpret the targeted behavioral manipulations even still.)

More targeted thoughts:
(1) I liked seeing this specific implementation of hedging, which is a catch-all term for a variety of behaviors that soften the content (= skew towards face-saving). It’s notable that the intuition seems sensible (use more negation when you want to face-save), but the point of the model and subsequent behavioral verification is to concretely test that intuition. Just because something’s sensible in theory doesn’t mean it’s true in practice. 

A nice example of this for me was the prediction in Figure 2 that more negations occur when the goal is both social and informative (both-goal), rather than just social. Basically, the social-only speaker tells direct white lies, while the informative-only speaker just tells the truth, so neither uses negation as much as the both-goal speaker for negative states.

(2) I think I need to unpack that first equation in the Polite RSA model. I’m not familiar with the semicolon notation. Is this the joint utility of the utterance (w), given the state (s) and given the goal weights (epistemic and social)? (This is what shows up in P_S2.) The rest I think I follow: the first term is the epistemic weight * the negative surprisal of L0; the second term is the social weight * the value for those states that are true for L0; the third term is the cost of the utterance (presumably in length, as measured by words).
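Writing the three terms out as code helped me, so here is a sketch of my reading of that utility. The soft truth values in MEANING, the linear value function (a state's value is just its number of hearts), the cost per word, and all the names are my own assumptions for illustration; only the three-term shape comes from the equation as I described it above.

```python
import math

STATES = [1, 2, 3, 4, 5]  # true quality, out of 5

# Hypothetical literal semantics: how well each utterance fits each state.
MEANING = {
    "terrible": [0.9, 0.6, 0.2, 0.05, 0.01],
    "amazing": [0.01, 0.05, 0.2, 0.6, 0.9],
    "not terrible": [0.1, 0.4, 0.8, 0.95, 0.99],
}

def literal_listener(utterance):
    """P_L0(state | utterance), normalizing the literal fit over states."""
    fits = MEANING[utterance]
    total = sum(fits)
    return [f / total for f in fits]

def utility(utterance, state, w_epistemic, w_social, cost_per_word=0.1):
    """Epistemic term + social term - cost, as in my reading of the equation."""
    l0 = literal_listener(utterance)
    # Negative surprisal of the true state for the literal listener.
    epistemic = math.log(l0[STATES.index(state)])
    # Expected value of the state the literal listener ends up believing in.
    social = sum(p * s for p, s in zip(l0, STATES))
    cost = cost_per_word * len(utterance.split())
    return w_epistemic * epistemic + w_social * social - cost

# The poem is really a 1 out of 5: compare the candidate utterances.
for w in MEANING:
    print(w, round(utility(w, 1, w_epistemic=0.5, w_social=0.5), 2))
```

With only the social weight switched on, “amazing” beats “terrible” even for a 1-heart poem (the white lie); with only the epistemic weight, “terrible” wins (the blunt truth).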

(3) Figure 1: How funny that “it wasn’t terrible” is accepted at near ceiling when the true value is “good” (4 out of 5) or “amazing” (5 out of 5).  Is this some kind of sarcasm/curmudgeonly speaker component at work?

(4) For the production experiment, I keep thinking this is the kind of thing where a person’s own social acuity might matter. (That is, if the poem was 2 out of 5, and a tactful vs. direct person is asked what they’d say to make someone else feel good, you might get different responses.) I wonder if they got self-report on how tactful vs. direct their participants thought they were (and whether this actually does matter).

I also have vague thoughts that this is the kind of task you could use to indirectly gauge tactfulness vs. directness in neurotypical people (say, for HR purposes) as well as in populations that struggle with non-literal language. This might explain some of the significant deviations in the center panel of Figure 2 for the low states (1 and 2): the participants for social-only used negation (rather than white lies, presumably) much more than predicted. (Though maybe not once the weights and free parameters are inferred, since the fit is pretty great.)

Maybe this social acuity effect comes out in the Bayesian data analysis, which inferred the participant weights. There wasn’t much participant difference between the social-only weight and the both weight (0.57 vs. 0.51). Again, I’d be really curious to see if this separated out by participant social acuity.

(5) I found the potential (non?)-difference between “it wasn’t amazing” and “it wasn’t terrible” really interesting. I keep trying to decide if I differ in how I deploy them myself. I think I do — if I’m talking directly to the person, I’ll say “it wasn’t terrible”; if I’m talking to someone else about the first person’s poem, I’ll say “it wasn’t amazing”. I’m trying to ferret out why I have those intuitions, but it probably has something to do with what Y&al2017 discuss at the very end about the speaker’s own face-saving tactics.

Monday, May 29, 2017

Thoughts on Meylan et al. 2017 in press

I really like how M&al2017 have mathematically cashed out the competing hypotheses (and given the intuitions of how Bayesian data analysis works — nice!). But something that I don’t quite understand is this: The way the model is implemented, isn’t it a test for Noun-hood rather than Determiner-hood (more on this below)? There’s nothing wrong with testing Noun-hood, of course, but all the previous debates involving analysis of Determiners and Nouns have been arguing over Determiner-hood, as far as I understand.

Something that also really struck me when trying to connect these results to the generativist vs. constructivist debate: How early is earlier than expected for abstract category knowledge to develop? This seems like the important thing to know if you’re trying to interpret the results for/against either perspective (more on this below too).

Specific thoughts:

(1) How early is earlier than expected?
(a) In the abstract, we already see the generativist perspective pitched as “the presence of syntactic abstractions” in children’s “early language”. So how early is early?  Are the generativists really all that unhappy with the results we end up seeing here where a rapid increase in noun-class homogeneity starts happening around 24 months? Interestingly, this timing correlates nicely with what Alandi, Sue, and I have been finding using overlap-based metrics for assessing categories in VPs like negation and auxiliary (presence of these in adult-ish form between 20 and 24 months for our sample child).

(b) Just looking at Figure 3 with the split-half data treatment, it doesn’t look like there’s a lot of increase in noun-class-ness (productivity) in this age range. Interestingly, it seems like several go down (though I know this isn’t true if we’re using the 99.9% cutoff criterion). Which team is happy with these results, if we squint and just use the visualizations as proxies for the qualitative trends? While the generativists would be happy with no change, they’d also be surprised by negative changes for some of these kids. The constructivists wouldn’t be (they can chalk it up to “still learning”), but then they’d expect more non-zero change, I think.

(c) The overregularization hypothesis is how M&al2017 explain the positive changes in younger kids and the negative changes for older kids. In particular, they say older kids have really nailed the NP → Det N rule, and so use more determiner-noun combinations that are rare for adults. So, in the terms of the model, what would be happening is that more nouns get their determiner preferences skewed towards 0.5 than really ought to be, I think. If that happens, then shouldn’t the distribution be more peaked around 0.5 in Figure 1? If so, that would lead to higher values of v. So wouldn’t we expect even higher v values (i.e., a really big increase) if this is what’s going on, rather than a decrease to lower v values? Maybe the idea is that the peak in v is happening because of overregularization, and then the decrease is when kids settle back down. That is, adult-like knowledge of a noun category existing is when we get the peak v value (which may in fact be higher than actual adult values). Judging from Figure 4, it looks like this is happening between 2 and 3. Which is pretty young. Which would make generativists happy, I think? So the conclusion that “these results are broadly consistent with constructivist hypotheses” is somewhat surprising to me. I guess it all comes back to how early is earlier than expected.

(d) Sliding-window results: If we continue with this idea that a peak in v value is the indication kids have hit adult-like awareness of a category (and may be overregularizing), what we should be looking for is when that first big peak happens (or maybe big drop after a peak). Judging from the beautifully rainbow-colored Figure 5, it looks like this happens pretty early on for a bunch of these kids (Speechome, Eve, Lily, Nina, Aran, Thomas, Alex, Adam, Sarah).  So the real question again: how early is earlier than expected? (I feel like this is exactly the question that pops up for standard induction problem/Poverty of the Stimulus arguments.)

(2) Modeling:  
(a) I like how their model involves both categories, rather than just Determiner. This is exactly what Alandi and I realized when we started digging into the assumptions behind the different quantitative metrics we examined.

(b) I also like that M&al2017 explicitly do a side-by-side model comparison (direct experience = memorized amalgam from the input vs. productive inference from abstracted knowledge). Bayesian data analysis is definitely suited for this, and then you can compare which representational hypothesis best fits over different developmental stages for a given child. Bonus for this modeling approach: the ability to estimate confidence intervals.

We can see this in the model parameters, too: n = impact of input (described as ability to “match the variability in her input” = memorized amalgams), v = application of knowledge across all nouns (novel productions = “produce determiner-noun pairs for which she has not received sufficient evidence from caregiver input” = abstract category). It’s really nice to see the two endpoint hypotheses cashed out mathematically.

(c) Testing noun-hood: If I understand this correctly, each noun has a determiner preference (0 = all “a”, 1 = all “the”, 0.5 = half each). Cross-noun variability is then drawn from an underlying common noun distribution if all nouns are from same class (testing whether “nouns behave in a more class-like fashion”). So, this seems like testing for Noun-hood based on Determiner usage, which I quite like.  But it’s interesting that M&al2017 describe this as testing “generalization of determiner use across nouns”, which makes it seem like they’re testing for Determiner instead. I would think that if they want to test for Determiner, they’d swap which one they’re testing classhood for (i.e., have determiners with a noun preference, and look for individual determiner values to all be drawn from the same underlying Determiner distribution).
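A quick sketch of the picture I have in my head for the noun-class part. I'm assuming each noun's determiner preference is drawn from a Beta distribution with mean mu and concentration v (shape parameters mu * v and (1 - mu) * v). That parameterization, the function name, and the particular numbers are my assumptions for illustration, not something I've checked against M&al2017's code.

```python
import random
import statistics

def sample_noun_preferences(mu, v, n_nouns, seed=0):
    """Draw one determiner preference per noun from an assumed Beta(mu*v, (1-mu)*v)."""
    rng = random.Random(seed)
    return [rng.betavariate(mu * v, (1.0 - mu) * v) for _ in range(n_nouns)]

# Low v: nouns mostly stick to one determiner (preferences pile up near 0 or 1).
item_based = sample_noun_preferences(mu=0.5, v=0.5, n_nouns=2000)
# High v: nouns behave alike (preferences cluster around the shared mean).
class_like = sample_noun_preferences(mu=0.5, v=50.0, n_nouns=2000)

print(round(statistics.pvariance(item_based), 3))
print(round(statistics.pvariance(class_like), 3))
```

Under this reading, a higher v means less cross-noun variability, i.e., nouns look more like members of one class. Swapping the roles (determiners with a noun preference) would be the Determiner-hood version of the same test.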

(3) Critiquing previous approaches involving the overlap score: 
M&al2017 say that overlap might increase simply because children heard more determiner+noun pairs in the input (i.e., it’s due to memorization, and not abstraction). I’m not sure I follow this critique, though — I’m more familiar with Yang’s metrics, of course, and those do indeed take a snapshot of whether the current data are compatible with fully productive categories vs. memorized amalgams from the input. The memorized amalgam assessment seems like it would indeed capture whether children’s output is compatible with memorized amalgams (i.e., more determiner+noun pairs in their input). 
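For reference, here is a minimal sketch of one common way to compute an overlap score: the share of noun types that show up with both “a” and “the”, out of the noun types that show up with either. This definition and the toy data are my assumptions for illustration; the specific metrics in M&al2017 and in Yang's work add more machinery (e.g., expected overlap under a productive grammar).

```python
from collections import defaultdict

def overlap_score(det_noun_pairs, determiners=("a", "the")):
    """Fraction of noun types attested with every determiner in `determiners`."""
    seen = defaultdict(set)
    for det, noun in det_noun_pairs:
        if det in determiners:
            seen[noun].add(det)
    if not seen:
        return 0.0
    both = sum(1 for dets in seen.values() if len(dets) == len(determiners))
    return both / len(seen)

# Toy sample of child productions (hypothetical).
sample = [("a", "dog"), ("the", "dog"), ("the", "cat"), ("a", "ball"), ("a", "ball")]
print(overlap_score(sample))
```

Only “dog” appears with both determiners here, so the score is 1 out of 3 noun types. The critique, as I understand it, is that this number can go up just because the sample gets bigger, whether or not anything changed in the child's representation.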

(4) Data extraction (from appendix): 
(a) I like that they checked whether the nouns should be collapsed together or if instead morphological variants should be treated separately (e.g., dog/dogs as one or two nouns). In most analyses I’ve seen, these would be treated as two separate nouns. 

(b) Also, the supplementary material really highlights the interesting splits in PNAS articles, where all the stuff you’d want to know to actually replicate the work isn’t in the main text. 

(c) Also, yay for github code! (Thanks, M&al2017 — this is excellent research practice.)

(5) M&al2017 highlight the need for dense naturalistic corpora in their discussion - I feel like this is an awesome advertisement for the Braunwald corpus. Seriously. It may not have as much child-directed speech as Speechome, but it has a wealth of longitudinal child-produced data. (Our sample from 20 to 24 months has 2154 child-produced VPs, for example, which doesn’t sound too bad when compared to Speechome’s 4300 NPs.)

Monday, May 15, 2017

Thoughts on Yang et al. 2017

I feel like Universal Grammar (UG) was better defined by the end of this exposition (thanks, Y&al2017!), but now I want to have a heart-to-heart about the difference between “hierarchy” and “combination”. Still, I appreciated this convenient synthesis of evidence from the generative grammar tradition, especially as it relates to the kind of considerations I have as an acquisition modeler. 

Specific thoughts:

(1) Hierarchy vs. combination:
Part 2.2: While I’m a fan of hierarchical structures being everywhere in language, I wasn’t sure how connected the newborn n-syllable tasks were to the point about hierarchy. Why does being sensitive to the number of vowels (“vowel centrality”) indicate there must be hierarchical structure? For example, what if newborns hadn’t inferred hierarchy yet, but were simply sensitive to the more acoustically salient cue of vowels — wouldn’t we see the same results, even if all they really perceived was something like V V for “baku” and “alprim”?

Similarly with the babbling examples: How do we know these are hierarchical (vs. say, linear) structures? 

Similarly with the prosodic contour distinctions for the 6- to 12-week-olds: We know they perceive the prosodic contours, but not that they recognize the words and phrases in these languages. (In fact, we assume they don’t — they haven’t really managed reliable speech segmentation yet.) So how does recognizing prosodic contour distinctions over acoustic units relate to the hierarchical structure Merge gives?

My main issue is coming down to “combinatorial” vs. “hierarchical”. I think you can make combinations of things without those things being combined hierarchically. So these two terms don’t mean the same thing to me, which is why the evidence in section 2.2 doesn’t seem as compelling about hierarchy (though it is for combinations). Contrast this with the 2.3 examples of syntactic development, where c-command definitely is about hierarchy.

(2) UG: Initially, UG is described as domain-specific principles of language knowledge, without specifying whether these are innate principles or not (and also seeming to focus on the knowledge about language, rather than, say, knowledge about how to learn language (= learning mechanism)). But then, we see UG described as “internal constraints that hold across all linguistic structures”  — though this highlights the innate component, it now doesn’t seem to indicate these constraints have to be just about language. That is, they could be constraints that apply to language as well as other things, e.g., hierarchy, which they talk about as Merge. I’m thinking visual scene parsing is similar, where you have hierarchical chunks. So this would be a vision system version of Merge. 

A little later on, we see “Universal Grammar” as the “initial state of language development” that's “determined by our genetic endowment”, which reinforces the innate component, but hedges on whether this is innate knowledge of the structure of language, or innate knowledge about how to learn language. This latter interpretation becomes more salient when they describe UG as infants interpreting parts of the environment as linguistic experience. This seems to be about the perceptual intake, and is less about knowledge of language than knowledge about what could count as language (= learning mechanism). Maybe that’s a broader definition of what it means to be a “principle of language”?

Later on in part 3.2, we get to more canonical UG examples, which are the linguistic parameters. These feel much more obviously language-specific. If they’re meant to be innate (which is how they’re typically talked about), then there we go. 

Side note: I would dearly love to figure out if specific linguistic parameters like these are derivable from other more basic linguistic building blocks. I think this is where the Minimalist Program (MP) and the Principles & Parameters (P&P) representations can meet, with MP providing the core building blocks that generate the P&P variables. I just haven’t seen it explicitly done yet. But it feels very similar to the implicit vs. explicit hypothesis space distinction that Perfors (2012) discusses, where the linguistic parameters are the explicit hypotheses generated from the MP building blocks that are capable of generating all the hypotheses in the implicit hypothesis space.

Perfors, A. (2012). Bayesian models of cognition: what's built in after all? Philosophy Compass, 7(2), 127-138.

(3) Efficient computation: I really like seeing this term here as a core factor, though I’m tempted to make it “efficient enough computation”, especially if we’re going to eventually tie this kind of thing back to evolution.

(4) Rhetorical device danger: Section 3.1 has this statement that I think can get us into hot water later on: “[I]t follows that language learners never witness the whole conjugation table…fully fleshed out, for even a single verb.” Now we’ve just thrown down the gauntlet for some corpus analyst to hunt through a large enough sample and find just one verb that does. It doesn’t affect the main point at all, but it’s the kind of thing that can be easily misunderstood (cf. aux inversion input for arguing against Poverty of the Stimulus).

(5) Section 3.3: “…linguistic principles such as Structure Dependence and the constraint on co-reference [c-command]…are most likely accessible to children innately” — Yes! In the sense that these principles are allowed into the hypothesis space. Accessible is definitely the right (hedgy) word, rather than saying these are the only options period.

(6) Section 3.3, on Bayesian models of indirect negative evidence: “…for this reason, most recent models of indirect negative evidence explicitly disavow claims of psychological realism”. I find this a bit tricksy. Reading it, you might think: “Oh! The issue is that indirect negative evidence isn’t psychologically plausible to use.” But in actuality, the “disavowal” is about a computational-level inference algorithm being psychologically real. As far as I know, there are no claims that the computation it’s doing with that algorithm isn’t psychologically real; rather, they assume humans approximate that computation (which uses indirect negative evidence).

Related is the stated computational "intractability" of using indirect negative evidence: I admit, I find this weird. If we’re happy to posit alternative hypotheses in a subset-superset relationship, why is it so hard to posit predictions from those two hypotheses? The hard part seems to be about defining the hypotheses so explicitly in the first place, and that doesn’t seem to be the part that’s targeted as “psychologically intractable”. If anything, it seems to be the psychologically necessary part. (The description that follows this bit in section 3.3 seems to highlight this, where Y&al2017 talk about the superset grammar existing, even if the default is the subset grammar.)
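To show how little machinery the prediction step needs once the two hypotheses are spelled out, here is a sketch of the textbook Bayesian “size principle” for a subset-superset pair. It is not any particular model that Y&al2017 discuss. I'm assuming each grammar generates its forms uniformly, that every observed form is compatible with both grammars, and that the sizes and prior are made-up numbers.

```python
def subset_posterior(n_obs, n_subset_forms, n_superset_forms, prior_subset=0.5):
    """P(subset grammar | n_obs observations that both grammars can generate)."""
    # Each grammar spreads its probability evenly over the forms it allows,
    # so the smaller grammar gives every observed form a higher probability.
    like_subset = (1.0 / n_subset_forms) ** n_obs
    like_superset = (1.0 / n_superset_forms) ** n_obs
    numerator = prior_subset * like_subset
    return numerator / (numerator + (1.0 - prior_subset) * like_superset)

# Hypothetical sizes: the subset grammar allows 2 forms, the superset allows 4.
for n in (0, 1, 5, 10):
    print(n, round(subset_posterior(n, 2, 4), 4))
```

The forms that the superset grammar predicts but that never show up are the indirect negative evidence: every additional compatible observation shifts belief toward the subset grammar. The costly part really does seem to be writing down the two hypotheses, not comparing their predictions.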

(7) Section 4.1, on the importance of empirical details: I really appreciate the pitch to make proposals account for specific empirical details. This is something near and dear to my heart. Don’t just tell me your $beautiful_theory will solve all my language acquisition problems; show me exactly how it solves them, one by one. (Minimalism, I’m looking at you. And to be fair, that’s exactly what the next-to-last sentence of section 4.1 says.)

Monday, May 1, 2017

Thoughts on Han et al. 2016 + Piantadosi & Kidd 2016 + Lidz et al. 2016

As with our previous reading, I really appreciate the clarity with which the arguments are laid out by H&al2016, P&K2016’s reply, and L&al2016’s reply-to-the-reply. I can also see where some confusion is arising in the debates surrounding this — there seems to be genuine ambiguity in the way terminology is used to describe the different perspectives about the source of linguistic knowledge (e.g., what “endogenous” actually refers to — more on this below). I also really like seeing a clear, concrete example of solving an induction problem that involves fairly abstract knowledge, and using knowledge internal to the learner to do so.

Specific thoughts:

(1) Endogenous: 
It’s interesting that the basic distinction drawn in the opening paragraph of H&al2016 is between domain-general vs. language-specific innate mechanisms, which is different than simply endogenous vs. not (that is, it’s a question of which endogenous it is): “…did the data…allow for construction of knowledge through general cognitive mechanisms…or did that experience play more of a triggering role, facilitating the expression of abstract core knowledge…”

I think the reply by P&K2016 hits on an interesting terminology issue. For H&al2016, endogenous means “internal to the child”; in contrast, P&K2016 seem to go with the more narrow definition of “genetically specified with no external influence”. This then makes P&K2016 question what to make of parents having different grammars than their kids. For H&al2016, I think the point is simply that something internal to the child  — and not solely genetic — is responsible. It’s possible that the internal something developed from a combination of genetics & other data experience, but it’s clearly something that can differ between parents and children. (General point: Just because something’s genetic doesn’t mean it doesn’t interact with the environment to produce the observed result. Concrete example: Height depends on genetics and nutrition.) 

This issue about what kind of endogenous knowledge (rather than simply is it or isn’t it endogenous) is also something P&K2016 pick up on in their reply. They specifically bring up domain-general endogenous factors as possibilities (“differences in memory, motivation, or attention”) and note that the “root cause of the variation may not even be linguistic”. This, as far as I can tell, doesn’t go against H&al2016’s original point. So, it seems like P&K2016 are targeting a more specific position than H&al2016 argued in their paper, though H&al2016’s initial introductory wording suggested that more specific position.

I think L&al2016’s reply-to-the-reply reflects the ambiguity in this position — they note that their paper provides evidence for “endogenous linguistic content”. While the basic reading of this is simply “knowledge about language that’s internal” (and so silent about whether the origin of this knowledge is domain-specific or domain-general), I think it’s easy to interpret this as arguing for the origin of that knowledge to also be language-specific. The final paragraph of L&al2016’s reply underscores this interpretation, as they argue against domain-general mechanisms like memory, attention, and executive function being the source of the endogenous linguistic knowledge. And that, of course, is what P&K2016 (and many others) aren’t fond of. 

(2) Empiricism, P&K2016’s closing: What’s a “reasonable version” of empiricism? My (perhaps naive) understanding was that empiricism believes everything is learned and nothing is innate, which I didn’t think anyone believed anymore. I thought that as soon as you believe even one thing is innate (no matter what flavor of innate it is), you’re by definition a nativist. Maybe this is another example of terminology being used differently by the different perspectives.

(3) One of the interesting things about the experiments in H&al2016 is that the experimental stimuli could be the driving force of grammatical choice. That is, there’s a possibility that people did have multiple grammars before the experiment, but selected one during the course of the experiment and then learned it. This is one way that could happen:

(a) When finally presented with data that require a choice in the verb-raising parameter, participants make that choice. 
(b) Primed by the previous choice (which may have involved some internal computation that was effortful and which they don’t want to repeat), participants stick with it throughout the first test session, thereby reinforcing that choice. 
(c) This prior experience is then reactivated in the second test session a month later, and used as a prior in favor of whichever option was previously chosen. 

If this is what happened, then by the act of testing people, we enable the convergence on a single option where there were previously multiple ones - how quantum mechanics of us…

Monday, April 17, 2017

Thoughts on Lasnik & Lidz 2016

I really enjoyed L&L2016’s take on poverty of the stimulus and how it relates to the argument for Universal Grammar (UG) — so much so that I’m definitely using this chapter as a reference when we talk about poverty of the stimulus in my upper-division language acquisition course. 

One thing that surprised me, though, is that there seems to be some legitimate confusion in the research community about how to define UG (more on this below), which leads to one of two situations: (i) everyone who believes there are induction problems in language by definition believes in Universal Grammar, or (ii) everyone who believes there are induction problems in language that are only solvable by language-specific components believes in Universal Grammar. I feel like the linguistics, cognitive science, and psychology communities need to have a heart-to-heart about what we’re all talking about when we argue for or against Universal Grammar. (To be fair, I’ve felt this way when responding to some folks in the developmental psych community — see Pearl 2014 for an example.)

Pearl, L. (2014). Evaluating learning-strategy components: Being fair (Commentary on Ambridge, Pine, and Lieven). Language, 90(3), e107-e114.

Specific thoughts:

(1) Universal Grammar:  
Chomsky (1971)’s quote in section 10.6 about structure-dependence concludes with “This is a very simple example of an invariant principle of language, what might be called a formal linguistic universal or a principle of universal grammar.” Here, the term Universal Grammar seems to apply to anything that occurs in all human languages (not unreasonable, given the adjective “universal”). But it doesn’t specify whether that thing is innate vs. derived, or language-specific vs. domain-general. I begin to see where the confusion in the research community may have come from.

Right now, some people seem to get really upset at the term Universal Grammar, taking it to mean things that are both innate and language-specific (and this is certainly the working definition Jon Sprouse and I use). But Chomsky’s use of Universal Grammar above can clearly be interpreted quite differently. And for that interpretation of Universal Grammar, it’s really just a question of whether the thing is in fact something that occurs in all human languages, period. It doesn’t matter what kind of thing it is.

Related: In the conclusion section 10.9, L&L2016 zero in on the innate part of UG: “…there must be something inside the learner which leads to that particular way of organizing experience…organizing structure is what we typically refer to as Universal Grammar…”. This notably leaves open whether the innate components are language-specific or domain-general. But the part that immediately follows zeroes in on the language-specific part by saying Universal Grammar is “the innate knowledge of language” that underlies human language structure and makes acquisition possible. On the other hand…maybe “innate knowledge of language” could mean innate knowledge that follows from domain-general components and which happens to apply to language too? If so, that would back us off to innate stuff period, and then, by that definition, everyone believes in Universal Grammar as long as they believe in innate stuff applying to language.

(2) Induction problems: I really appreciate how the intro in 10.1 highlights that the existence of induction problems doesn’t require language-specific innate components (just innate components). The additional step of asserting that the innate components are also language-specific (for other reasons) is just that — an additional step. Sometimes, I think these steps get conflated when induction problems and poverty of the stimulus are discussed, and it’s really great to see it so explicitly laid out here. I think the general line of argument in this opening section also makes it clear why the pure empiricist view just doesn’t fly anymore in cognitive development — everyone’s some kind of nativist. But where people really split is whether they believe at least some innate component is also language-specific (or not). This is highlighted by a Chomsky (1971) quote in section 10.3, which notes that the language-specific part is an “empirical hypothes[i]s”, and the components might in fact be “special cases of more general principles of mind”.

(3) The data issue for induction problems: 
Where I think a lot of interest has been focused is the issue of how much data are actually available for different aspects of language acquisition. Chomsky’s quote in 10.1 about the A-over-A example closes with “…there is little data available to the language learner to show that they apply”. Two points: 

(a) "Little" is different than "none", and how much data is actually available is a very important question. (Obviously, it’s far more impressive if the input contains none or effectively none of the data that a person is able to judge as grammatical or ungrammatical.) This is picked up in the Chomsky (1971) quote in section 10.6, which claims that someone “might go through much or all of his life without ever having been exposed to relevant evidence”. This is something we can actually check out in child-directed speech corpora, once we decide what the “relevant evidence” is (no small feat, and often a core contribution of an acquisition theory). This also comes back in the discussion of English anaphoric one in section 10.8, where the idea of what counts as informative data is talked about in some detail (unambiguous vs. ambiguous data of different kinds).

(b) How much data is "enough" to support successful acquisition is also a really excellent question. Basically, an induction problem happens when the data are too scarce to support correct generalization as fast as kids do it. So, it really matters what “too scarce” means. (Legate & Yang (2002) and Hsu & Chater (2010) have two interesting ideas for how to assess this quantitatively.) L&L2016 bring this up explicitly in the closing bit of 10.5 on Principle C acquisition, which is really great.

Legate, J. A., & Yang, C. D. (2002). Empirical re-assessment of stimulus poverty arguments. Linguistic Review, 19(1/2), 151-162.

Hsu, A. S., & Chater, N. (2010). The logical problem of language acquisition: A probabilistic perspective. Cognitive science, 34(6), 972-1016.

(4) Section 10.4 has a nice explanation of Principle C empirical data in children, but we didn’t quite get to the indirect negative evidence part for it (which I was quite interested to see!). My guess: Something about structure-dependent representations, and then tracking what positions certain pronouns allow reference to (a la Orita et al. 2013), though section 10.5 also talks about a more idealized account that’s based on the simple consideration of data likelihood.

Orita, N., McKeown, R., Feldman, N., Lidz, J., & Boyd-Graber, J. L. (2013). Discovering Pronoun Categories using Discourse Information. In CogSci.

(5) A very minor quibble in section 10.5, about the explanation given for likelihood. I think the intuition is more about how the learner views data compatibility with the hypothesis. P(D | H) = something like “how probable the observed data are under this hypothesis”, which is exactly why the preference falls out for a smaller hypothesis space that generates fewer data points: it spreads its probability over fewer possibilities, so each observed data point is more probable under it. (How the learner’s input affects the beliefs is the whole calculation of likelihood * prior, which is transformed into the posterior.)

Related: I love the direct connection of Bayesian reasoning to the Subset Principle. It seems to be exactly what Chomsky was talking about as something that’s a special case of a more general principle of mind.
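
To make the likelihood intuition concrete, here’s a minimal sketch of how a subset preference could fall out of plain Bayesian updating. Everything in it is my own toy setup rather than anything from L&L2016: the hypothesis names, the sizes (2 vs. 4 possible data points), the equal priors, and the assumption that each hypothesis generates its data points uniformly at random.

```python
from fractions import Fraction


def posterior(prior, hypothesis_sizes, n_observations):
    """Posterior over hypotheses after n observations compatible with all of them.

    Toy assumption: a hypothesis that can generate `size` distinct data points
    generates each one with probability 1/size, so P(D | H) = (1/size) ** n.
    """
    unnormalized = {
        name: prior[name] * Fraction(1, hypothesis_sizes[name]) ** n_observations
        for name in prior
    }
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


# Hypothetical numbers: the superset hypothesis generates everything the
# subset one does, plus two more data points. Priors are equal.
prior = {"subset": Fraction(1, 2), "superset": Fraction(1, 2)}
sizes = {"subset": 2, "superset": 4}

for n in (0, 1, 3, 5):
    belief = posterior(prior, sizes, n)
    print(n, belief["subset"], belief["superset"])
```

With these toy numbers, the belief in the subset hypothesis starts at 1/2 and grows with every observation that both hypotheses could have generated, even though no single data point rules the superset out. That growth is the indirect negative evidence at work.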

(6) Structure-dependence, from section 10.6: “Unfortunately, other scholars were frequently misled by this into taking one particular aspect of the aux-fronting paradigm as the principle structure dependence claim, or, worse still, as the principle poverty of the stimulus claim.” — Too darned true, alas! Hopefully, this paper and others like it will help rectify that misunderstanding. I think it also highlights that our job, as people who believe there are all these complex induction problems out there, should be to accessibly demonstrate what these induction problems are. A lot.

(7) Artificial language learning experiments in 10.7: I’ve always thought the artificial language learning work of Takahashi and Lidz was a really beautiful demonstration of statistical learning abilities applied to learning structure-dependent rules that operate over constituents (= what I’ll call a “constituent bias”). But, as with all artificial language learning experiments, I’m less clear about how to relate this to native language acquisition, where the learners don’t already have a set of language biases about using constituents from their prior experience. It could indeed be that such biases are innate, but it could also be that such biases (however learned) are already present in the adult and 18-month-old learners, and these biases are deployed for learning the novel artificial language. So, it’s not clear what this tells us about the origin of the constituent bias. (Note: I think it’s impressive as heck to do this with 18-month-olds. But 18-month-olds do already have quite a lot of experience with their native language.)

(8) Section 10.8 & anaphoric one (minor clarification): This example is of course near and dear to my heart, since I worked on it with Jeff Lidz.  And indeed, based on our corpus analyses, unambiguous data for one’s syntactic category (and referent) in context is pretty darned rare. The thing that’s glossed over somewhat is that the experiment with 18-month-olds involves not just identifying one’s antecedent as an N’, but specifically as the N’ “red bottle” (because “bottle” is also an N’ on its own, as example 20 shows). This is an important distinction, because it means the acquisition task is actually a bit more complicated. The syntactic category of N’ is linked to 18-month-olds preferring the antecedent “red bottle” — if they behaved as if they thought it was “bottle”, we wouldn’t know if they thought it was N’ “bottle” or plain old N0 “bottle”.

Tuesday, March 7, 2017

Thoughts on Ranganath et al. 2013

I really appreciate seeing the principled reasoning for using certain types of classifiers, and doing feature analysis both before and after classification. On this basis alone, this paper seems like a good guide to classifier best practices for the social sciences. Moreover, the discussion section takes care to relate the specific findings to larger theoretical ideas in affective states, like collaborative conversation style, and the relationship between specific features and affective state (e.g.,  negation use during flirtation may be related to teasing or self-deprecation; the potential distinction between extraversion and assertiveness; the connection between hedging and psychological distancing; what laughter signals at different points in the conversational turn). Thanks, R&al2013!

Other thoughts:

(1) Data cleanliness: R&al2013 want a really clean data set to learn from, which is why they start with the highest 10% and lowest 10% of judged stance ratings.  We can certainly see the impact of having messier data, based on the quartile experiments. In short, if you use less obvious examples to train, you end up with worse performance. I wonder what would happen if you use the cleaner data to train (say, the top and bottom 10%), but tested on classifying the messier data (top and bottom 25%). Do you think you would still do as poorly, or would you have learned some good general features from the clean dataset that can be applied to the messy dataset? (I’m thinking about this in terms of child-directed speech (CDS) for language acquisition, where CDS is “cleaner” in various respects than messy adult-directed data.)
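
Here’s a rough sketch of how that clean-train/messy-test split could be set up. This is my own illustration, not R&al2013’s procedure: the function names, the toy ratings, and the choice to drop the training items from the messier test set are all assumptions.

```python
def extreme_slices(rated_items, percent):
    """Split off the lowest- and highest-rated `percent` of the items.

    rated_items: list of (item_id, rating) pairs.
    Returns (lowest, highest); each holds len(rated_items) * percent // 100 pairs.
    """
    ranked = sorted(rated_items, key=lambda pair: pair[1])
    k = len(ranked) * percent // 100
    if k == 0:
        return [], []
    return ranked[:k], ranked[-k:]


def clean_train_messy_test(rated_items, train_percent=10, test_percent=25):
    """Train on the most extreme items, test on the less extreme ones.

    Labels: 0 for the low-rated end, 1 for the high-rated end.
    Assumption: items used for training are excluded from the test set.
    """
    train_low, train_high = extreme_slices(rated_items, train_percent)
    test_low, test_high = extreme_slices(rated_items, test_percent)
    seen = {item_id for item_id, _ in train_low + train_high}
    train = [(i, 0) for i, _ in train_low] + [(i, 1) for i, _ in train_high]
    test = [(i, 0) for i, _ in test_low if i not in seen]
    test += [(i, 1) for i, _ in test_high if i not in seen]
    return train, test


# Hypothetical data: 20 speed-dates with made-up stance ratings 1..20.
toy_dates = [("date%02d" % n, float(n)) for n in range(1, 21)]
train, test = clean_train_messy_test(toy_dates)
print(train)
print(test)
```

Any classifier could then be fit on `train` and scored on `test`; the interesting comparison is against a classifier that was also trained on the messier quartile data.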

(2) This relates to the point in the main section about how R&al2013 really care about integrating insights from the psychology of the things they’re trying to classify. In the lit review, I appreciated the discussion of the psychological literature related to interpersonal stance (e.g., specifying the different categories of affective states). This demonstrates the authors are aware of the cognitive states underpinning the linguistic expression.

(3) Lexical categories, using 10 LIWC-like categories: I appreciated seeing the reasoning in footnote 1 about how they came up with these, and more importantly, why they modified them the way they did. While I might not agree with leaving the “love” and “hate” categories so basic (why not use WordNet synsets to expand this?), it’s at least a reasonable start. Same comment for the hedge category (which I love seeing in the first place).

(4) Dialog and discourse features: Some of these seem much more complex to extract (ex: sympathetic negative assessments). The authors went for a simple heuristic regular expression to extract these, but this is presumably only a (reasonable) first-pass attempt. On the other hand, given that they had less than 1000 speed-dates, they probably could have done some human annotation of these just to give the feature the best shot of being useful. Then, if it’s useful, they can worry about how to automatically extract it later.

(5)  It’s so interesting to see the accommodation of function words signifying flirtation. Function words were the original authorship stylistic marker, under the assumption that your use of function words isn’t under your conscious control. I guess the idea would be that function word accommodation also isn’t really under your conscious control, and imitation is the sincerest form of flattery (=~ flirtation)…

Tuesday, February 21, 2017

Thoughts on Rubin et al. 2015

As with much of the deception detection literature, it’s always such a surprise to me how relatively modest the performance gains are. (Here, the predictive model doesn’t actually get above chance performance, for example — of course, neither do humans.) This underscores how difficult a problem deception detection from linguistic cues generally is (or at least, is currently). 

For this paper, I appreciated seeing the incorporation of more sophisticated linguistic cues, especially those with more intuitive links to the psychological processes underlying deception (e.g., rhetorical elements representing both what the deceiver chooses to focus on and the chain of argument from one point to the next). I wonder if there’s a way to incorporate theory of mind considerations more concretely, perhaps via pragmatic inference linked to discourse properties (I have visions of a Rational-Speech-Act-style framework being useful somehow).

Other thoughts:

(1) I wonder if it’s useful to compare and contrast the deception process that underlies fake product reviews with the process underlying fake news. In some sense, they’re both “imaginative writing”, and they’re both  about a specific topic that could involve verifiable facts. (This comes to mind especially because of the detection rate of around 90% for the fake product reviews in the data set of Ott et al. 2011, 2013, using just n-grams + some LIWC features).

Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 309-319). Association for Computational Linguistics.

Ott, M., Cardie, C., & Hancock, J. T. 2013. Negative Deceptive Opinion Spam. In HLT-NAACL (pp. 497-501).

(2) I really appreciated the discussion of the issues surrounding “citizen journalism”. I wonder if an easier (or alternative) route for news verification is considering a set of reports about the same topic in aggregate, i.e., a wisdom of the crowds approach over the content of the reports. The result is aggregated content (note: perhaps cleverly aggregated to be weighted by various linguistic/rhetorical/topic features) that reflects the ground truth better than any individual report, and thus would potentially mitigate the impact of any single fake news report. You might even be able to use the “Bluff the Listener” NPR news data R&al2015 used, though there you only have three stories at a time on the same topic (and two are in fact fake, so your “crowd” of stories is deception-biased).

(3) Something I’m interested in, given the sophistication of the RST discourse features — what are some news examples that a simplistic n-grams approach would miss (either false positives or false negatives)? Once we have those key examples, we can look at the discourse feature profiles of those examples to see if anything pops out. This then tells us what value would be added to a standard baseline n-gram model that also incorporated these discourse features, especially since they have to be manually annotated currently. 

Tuesday, January 31, 2017

Thoughts on Iyyer et al. 2014

I really appreciate that I&al2014’s goal is to go beyond bag-of-words approaches and leverage the syntactic information available (something that warms my linguistic heart).  To this end, we see a nice example in Figure 1 of the impact of lexical choice and structure on the overall bias of a sentence, with “big lie” + its complement (a proposition) = opposite bias of the proposition. Seeing this seemingly sophisticated compositional process, I was surprised to see later on that negation causes such trouble. Maybe this has to do with the sentiment associated with “lie” (which is implicitly negative), while “not” has no obvious valence on its own?

Some other thoughts:

(1) Going over some of the math specifics: In the supervised objective loss function in (5), I’m on board with l(pred_i), but what’s gamma? (A bias parameter of some kind? And is it over two just so the derivative works out in equation 6?) Theta is apparently the set of vectors corresponding to the components (W_L, W_R), the weights on the components (W_cat), the biases (b_1, b_2), and some other vector W_e (which later on is described as a word embedding matrix from word2vec)…and that gets squared in the objective function because…?
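
One reading that would answer both questions, offered as a guess: gamma is an L2 regularization weight, and the squared theta term is a penalty on large parameter values. I’m assuming the standard regularized form here, not reproducing I&al2014’s equation (5) symbol for symbol; N (the number of training items) and C (the name of the objective) are my own labels.

```latex
% Assumed standard form: average loss plus an L2 penalty on the parameters.
C(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\mathrm{pred}_i)
            + \frac{\gamma}{2} \lVert \theta \rVert^{2}

% Differentiating the penalty: the 2 from the square cancels the 1/2.
\frac{\partial C}{\partial \theta}
  = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell(\mathrm{pred}_i)}{\partial \theta}
    + \gamma \, \theta
```

If that’s the right reading, the “over two” really is just there to keep the gradient tidy, and squaring theta is what makes the penalty grow smoothly as the weights get large.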

(2) I like seeing the impact of initialization settings (random vs. prior knowledge = 300-dimensional word2vec). The upshot is that word2vec prior knowledge about words is helpful, though only by 1% in performance, much to my surprise. I expected this semantic knowledge to be more helpful (again, my linguistic knowledge bias is showing).

(3) Dataset stuff:

(a) I found it a bit odd that the authors first note that partisanship (i.e., whether someone is Republican or Democrat) doesn’t always correlate with their ideological stance on a particular issue (i.e., conservative or liberal), and then say how they’re going to avoid conflating these things by creating a new annotated data set. But then, when creating their sentence labels, they propagate the party label (Republican/Democrat) down from the speaker to individual sentences, making exactly these mappings (Republican → conservative, Democrat → liberal) they just said they didn’t want to conflate. Did I miss something? (Also, why not use CrowdFlower to verify the propagated annotations?)

(b) Relatedly, when winnowing down the sentences that are likely to be biased for the annotated dataset, I&al2014 rely on exactly the hand-crafted methods that they shied away from before (e.g., a dictionary of “sticky bigrams” strongly associated with one party or the other). So maybe there’s a place for these methods at some point in the classifier development pipeline (in terms of identifying useful data to train on).

(c) The final dataset size is 7816 sentences — wow! That’s tiny in NLP dataset size terms. Even when you add the 11,555 hand-tagged ones from the IBC, that’s still less than 20K sentences to learn from. Maybe this is an instance of quality over quantity when it comes to learning (and hopefully not overfitting)?

(4) It’s really nice to see specific examples where I&al2014’s approach did better than the different  baselines. This helps with the explanation of what might be going on (basically, structurally-cued shifts in ideology get captured). Also, here’s where negation strikes! It’s always surprising to me that more explicit things to handle negation structurally aren’t implemented, given how much power negation has when it comes to interpretation. I&al2014 say this can be solved by more training data (probably true)…so maybe the vectorized representation of “not” would get encoded to be something like its linguistic structural equivalent? 

Tuesday, January 10, 2017

Some thoughts on Mikolov et al. 2013

I definitely find it as interesting as M&al2013 do that some morphosyntactic relationships (e.g., past tense vs. present tense) are captured by these distributed vector representations of words, in addition to the semantic relationships. That said, this paper left me desperately wanting to know why these vector representations worked that way. Was there anything interpretable in the encodings themselves? (This is one reason why current research into explaining neural network results is so attractive — it’s nice to see cool results, but we want to know what the explanation is for those results.) Put simply, I can see that forcing a neural network to learn from big data in an unsupervised way yields these implicit relationships in the word encodings. (Yay! Very cool.) But tell me more about why the encodings look the way they do so we better understand this representation of meaning.
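
For concreteness, here’s a small sketch of the kind of vector arithmetic that analogy tests over word vectors typically rely on: take b - a + c and look for the nearest remaining word by cosine similarity. The three-dimensional vectors below are made up by me so the past-tense offset is easy to see; they are not real learned vectors, and the function names are my own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Hypothetical toy vectors: the second dimension plays the role of "past tense".
toy_vectors = {
    "walk":   [1.0, 0.0, 0.1],
    "walked": [1.0, 1.0, 0.1],
    "jump":   [0.0, 0.0, 1.0],
    "jumped": [0.0, 1.0, 1.0],
    "table":  [0.5, -1.0, 0.5],
}

print(solve_analogy(toy_vectors, "walk", "walked", "jump"))
```

In the toy case the “past tense” direction is interpretable because I built it in by hand. The open question above is whether anything that clean can be read off the learned encodings.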

Other thoughts:

(1) Everything clearly rides on how the word vectors are created (“…where similar words are likely to have similar vectors”). And that’s accomplished via an RNN language model very briefly sketched in Figure 1. I think it would be useful to better understand what we can of this, since this is the force that’s compressing the big data into helpful word vectors. 

One example:  the model is “…trained with back-propagation to maximize the data log-likelihood under the model…training such a purely lexical model to maximize likelihood will induce word representations…”  — What exactly are the data? Utterances?  Is there some sense of trying to predict the next word the way previous models did? Otherwise, if everything’s just treated as a bag of words presumably, how would that help regularize word representations?

(2) Table 2: Since the RNN-1600 does the best, it would be handy to know what the “several systems” were that comprised it. That said, there seems to be an interesting difference in performance between adjectives and nouns on one hand (at best, 23-29% correct) and verbs on the other (at best, 62%), especially for the RNN versions. Why might that be? The only verb relation was the past vs present tense…were there subsets of noun or adjective relations with differing performance, or were all the noun and all the adjective relations equal? (That is, is this effectively a sampling error, and if we tested more verb relations, we’d find more varied performance?) Also, it’d be interesting to dig into the individual results and see if there were particular word types the RNN representations were especially good or bad at. 

(3) Table 3: Since the RNN-1600 was by far the best of the RNNs in Table 2 (and in fact RNN-80 was the worst), why pick the RNN-80 to compare against the other models (CW, HLBL)?

(4) Table 4, semantic relation results: When .275 is the best Spearman’s rho you can get, it shows this is a pretty hard task…I wonder what human performance would be. I assume close to 1.00 if these are the simple analogy-style questions? (Side note: MaxDiff is apparently this, and is another way of dealing with scoring relational data.)
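
As a reminder of what that number measures, here’s a minimal sketch of Spearman’s rho: rank both lists, then take the Pearson correlation of the ranks. The scores below are invented for illustration and have nothing to do with the actual Table 4 data; the helper names are mine.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed over the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores: gold relational ratings vs. a system's ratings.
gold_scores = [1, 2, 3, 4, 5]
system_scores = [2, 1, 4, 3, 5]
print(round(spearman_rho(gold_scores, system_scores), 3))
```

Since rho only cares about rank order, a system can get the absolute scores badly wrong and still do well, which makes .275 as the best result look even more like a sign of a hard task.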