Monday, April 17, 2017

Thoughts on Lasnik & Lidz 2016

I really enjoyed L&L2016’s take on poverty of the stimulus and how it relates to the argument for Universal Grammar (UG) — so much so that I’m definitely using this chapter as a reference when we talk about poverty of the stimulus in my upper-division language acquisition course. 

One thing that surprised me, though, is that there seems to be some legitimate confusion in the research community about how to define UG (more on this below), which leads to one of two situations: (i) everyone who believes there are induction problems in language by definition believes in Universal Grammar, or (ii) everyone who believes there are induction problems in language that are only solvable by language-specific components believes in Universal Grammar. I feel like the linguistics, cognitive science, and psychology communities need to have a heart-to-heart about what we’re all talking about when we argue for or against Universal Grammar. (To be fair, I’ve felt this way when responding to some folks in the developmental psych community — see Pearl 2014 for an example.)

Pearl, L. (2014). Evaluating learning-strategy components: Being fair (Commentary on Ambridge, Pine, and Lieven). Language, 90(3), e107-e114.

Specific thoughts:

(1) Universal Grammar:  
The Chomsky (1971) quote in section 10.6 about structure-dependence concludes with “This is a very simple example of an invariant principle of language, what might be called a formal linguistic universal or a principle of universal grammar.” — Here, the term Universal Grammar seems to apply to anything that occurs in all human languages (not unreasonable, given the adjective “universal”). But it doesn’t specify whether that thing is innate vs. derived, or language-specific vs. domain-general. I begin to see where the confusion in the research community may have come from.

Right now, some people seem to get really upset at the term Universal Grammar, taking it to mean things that are both innate and language-specific (and this is certainly the working definition Jon Sprouse and I use). But Chomsky’s use of Universal Grammar above can clearly be interpreted quite differently. And for that interpretation of Universal Grammar, it’s really just a question of whether the thing is in fact something that occurs in all human languages, period. It doesn’t matter what kind of thing it is.

Related: In the conclusion section 10.9, L&L2016 zero in on the innate part of UG: “…there must be something inside the learner which leads to that particular way of organizing experience…organizing structure is what we typically refer to as Universal Grammar…”. This notably leaves open whether the innate components are language-specific or domain-general. But the part that immediately follows zeros in on the language-specific part by saying Universal Grammar is “the innate knowledge of language” that underlies human language structure and makes acquisition possible. On the other hand…maybe “innate knowledge of language” could mean innate knowledge that follows from domain-general components and which happens to apply to language too? If so, that would back us off to innate stuff period, and then, by that definition, everyone believes in Universal Grammar as long as they believe in innate stuff applying to language.

(2) Induction problems: I really appreciate how the intro in 10.1 highlights that the existence of induction problems doesn’t require language-specific innate components (just innate components). The additional step of asserting that the innate components are also language-specific (for other reasons) is just that — an additional step. Sometimes, I think these steps get conflated when induction problems and poverty of the stimulus are discussed, and it’s really great to see it so explicitly laid out here. I think the general line of argument in this opening section also makes it clear why the pure empiricist view just doesn’t fly anymore in cognitive development — everyone’s some kind of nativist. But where people really split is whether they believe at least some innate component is also language-specific (or not). This is highlighted by a Chomsky (1971) quote in section 10.3, which notes that the language-specific part is an “empirical hypothes[i]s”, and the components might in fact be “special cases of more general principles of mind”.

(3) The data issue for induction problems: 
Where I think a lot of interest has been focused is the issue of how much data are actually available for different aspects of language acquisition. Chomsky’s quote in 10.1 about the A-over-A example closes with “…there is little data available to the language learner to show that they apply”. Two points: 

(a) "Little" is different from "none", and how much data is actually available is a very important question. (Obviously, it’s far more impressive if the input contains none or effectively none of the data that a person is able to judge as grammatical or ungrammatical.) This is picked up in the Chomsky (1971) quote in section 10.6, which claims that someone “might go through much or all of his life without ever having been exposed to relevant evidence”. This is something we can actually check out in child-directed speech corpora — once we decide what the “relevant evidence” is (no small feat, and often a core contribution of an acquisition theory). This also comes back in the discussion of English anaphoric one in section 10.8, where the idea of what counts as informative data is talked about in some detail (unambiguous vs. ambiguous data of different kinds).

(b) How much data is "enough" to support successful acquisition is also a really excellent question. Basically, an induction problem happens when the data are too scarce to support correct generalization as fast as kids do it. So, it really matters what “too scarce” means. (Legate & Yang (2002) and Hsu & Chater (2010) have two interesting ideas for how to assess this quantitatively.) L&L2016 bring this up explicitly in the closing bit of 10.5 on Principle C acquisition, which is really great.

Legate, J. A., & Yang, C. D. (2002). Empirical re-assessment of stimulus poverty arguments. Linguistic Review, 19(1/2), 151-162.

Hsu, A. S., & Chater, N. (2010). The logical problem of language acquisition: A probabilistic perspective. Cognitive science, 34(6), 972-1016.
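To make “too scarce” a little more concrete, here’s a back-of-the-envelope sketch (my own toy illustration, not the actual Legate & Yang or Hsu & Chater metrics): if we decide what the relevant evidence is and estimate how often it occurs per utterance, we can ask how likely a child is to encounter it at all in a given amount of input. All the numbers below are made up for illustration.

```python
# Toy illustration: if the relevant evidence appears with probability p
# per utterance, how likely is a child to see it at least once in
# n_utterances of input?

def prob_seen_at_least_once(p, n_utterances):
    """P(at least one occurrence) = 1 - P(zero occurrences)."""
    return 1 - (1 - p) ** n_utterances

# Suppose the unambiguous data appear once per 100,000 utterances
# (assumed rate), and we vary how much input the child has heard.
p = 1 / 100_000
for n in (10_000, 100_000, 1_000_000):
    print(n, round(prob_seen_at_least_once(p, n), 3))
```

Even this crude calculation shows why the rate of relevant evidence matters so much: at the assumed rate, a child with 100,000 utterances of input has only about a 63% chance of ever seeing a single unambiguous example.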

(4) Section 10.4 has a nice explanation of Principle C empirical data in children, but we didn’t quite get to the indirect negative evidence part for it (which I was quite interested to see!). My guess: Something about structure-dependent representations, and then tracking what positions certain pronouns allow reference to (a la Orita et al. 2013), though section 10.5 also talks about a more idealized account that’s based on the simple consideration of data likelihood.

Orita, N., McKeown, R., Feldman, N., Lidz, J., & Boyd-Graber, J. L. (2013). Discovering Pronoun Categories using Discourse Information. In CogSci.

(5) A very minor quibble in section 10.5, about the explanation given for likelihood. I think the intuition is more about how the learner views data compatibility with the hypothesis. P(D | H) is something like “how probable the observed data are under this hypothesis”, which is exactly why the preference falls out for a smaller hypothesis space that generates fewer data points. (How the learner’s input affects the learner’s beliefs is the whole likelihood * prior calculation, which is normalized into the posterior.)
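Here’s a minimal sketch of that intuition (the “size principle”), with toy numbers and the usual uniform-sampling assumption: if each data point is drawn uniformly from the hypothesis, then P(D | H) = (1/|H|)^n, so a smaller hypothesis that still generates the data gets a higher likelihood.

```python
# Size-principle sketch: under uniform likelihood, P(d | H) = 1/|H| for
# each data point d in hypothesis H, so a smaller hypothesis compatible
# with the data is preferred.

def likelihood(data, hypothesis):
    """P(data | H), assuming each point is drawn uniformly from H."""
    if not all(d in hypothesis for d in data):
        return 0.0  # H can't generate the data at all
    return (1 / len(hypothesis)) ** len(data)

subset_H = {1, 2, 3, 4}          # smaller hypothesis space
superset_H = set(range(1, 101))  # larger (superset) hypothesis space
data = [2, 4, 3]

print(likelihood(data, subset_H))    # (1/4)^3
print(likelihood(data, superset_H))  # (1/100)^3 -- much smaller
```

This is also why the Bayesian story lines up so neatly with the Subset Principle: the subset hypothesis wins on likelihood whenever the observed data are compatible with it.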

Related: I love the direct connection of Bayesian reasoning to the Subset Principle. It seems to be exactly what Chomsky was talking about as something that’s a special case of a more general principle of mind.

(6) Structure-dependence, from section 10.6: “Unfortunately, other scholars were frequently misled by this into taking one particular aspect of the aux-fronting paradigm as the principle structure dependence claim, or, worse still, as the principle poverty of the stimulus claim.” — Too darned true, alas! Hopefully, this paper and others like it will help rectify that misunderstanding. I think it also highlights that our job, as people who believe there are all these complex induction problems out there, should be to accessibly demonstrate what these induction problems are. A lot.

(7) Artificial language learning experiments in 10.7: I’ve always thought the artificial language learning work of Takahashi and Lidz was a really beautiful demonstration of statistical learning abilities applied to learning structure-dependent rules that operate over constituents (= what I’ll call a “constituent bias”). But, as with all artificial language learning experiments, I’m less clear about how to relate this to native language acquisition, where the learners don’t already have a set of language biases about using constituents from their prior experience. It could indeed be that such biases are innate, but it could also be that such biases (however learned) are already present in the adult and 18-month-old learners, and these biases are deployed for learning the novel artificial language. So, it’s not clear what this tells us about the origin of the constituent bias. (Note: I think it’s impressive as heck to do this with 18-month-olds. But 18-month-olds do already have quite a lot of experience with their native language.)

(8) Section 10.8 & anaphoric one (minor clarification): This example is of course near and dear to my heart, since I worked on it with Jeff Lidz.  And indeed, based on our corpus analyses, unambiguous data for one’s syntactic category (and referent) in context is pretty darned rare. The thing that’s glossed over somewhat is that the experiment with 18-month-olds involves not just identifying one’s antecedent as an N’, but specifically as the N’ “red bottle” (because “bottle” is also an N’ on its own, as example 20 shows). This is an important distinction, because it means the acquisition task is actually a bit more complicated. The syntactic category of N’ is linked to 18-month-olds preferring the antecedent “red bottle” — if they behaved as if they thought it was “bottle”, we wouldn’t know if they thought it was N’ “bottle” or plain old N0 “bottle”.

Tuesday, March 7, 2017

Thoughts on Ranganath et al. 2013

I really appreciate seeing the principled reasoning for using certain types of classifiers, and doing feature analysis both before and after classification. On this basis alone, this paper seems like a good guide to classifier best practices for the social sciences. Moreover, the discussion section takes care to relate the specific findings to larger theoretical ideas in affective states, like collaborative conversation style, and the relationship between specific features and affective state (e.g.,  negation use during flirtation may be related to teasing or self-deprecation; the potential distinction between extraversion and assertiveness; the connection between hedging and psychological distancing; what laughter signals at different points in the conversational turn). Thanks, R&al2013!

Other thoughts:

(1) Data cleanliness: R&al2013 want a really clean data set to learn from, which is why they start with the highest 10% and lowest 10% of judged stance ratings. We can certainly see the impact of having messier data, based on the quartile experiments. In short, if you use less obvious examples to train, you end up with worse performance. I wonder what would happen if you trained on the cleaner data (say, the top and bottom 10%), but tested on classifying the messier data (top and bottom 25%). Would the classifier still do as poorly, or would it have learned some good general features from the clean dataset that transfer to the messy dataset? (I’m thinking about this in terms of child-directed speech (CDS) for language acquisition, where CDS is “cleaner” in various respects than messy adult-directed data.)
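Here’s a toy simulation of the train-clean/test-messy idea (entirely synthetic data and a trivial classifier of my own invention, just to show the experimental design — nothing here is R&al2013’s actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each date has a true "stance score"; the judged
# rating is the true score plus noise. Label = positive if score > 0.
n = 2000
true_score = rng.normal(size=n)
rating = true_score + rng.normal(scale=0.5, size=n)
label = (true_score > 0).astype(int)

# "Clean" training set: the top and bottom 10% of ratings.
lo, hi = np.quantile(rating, [0.10, 0.90])
clean = (rating <= lo) | (rating >= hi)

# "Messy" test set: the middle 50% (cases the clean model never saw).
q25, q75 = np.quantile(rating, [0.25, 0.75])
messy = (rating > q25) & (rating < q75)

# Trivial nearest-centroid classifier trained on the clean extremes.
c0 = rating[clean & (label == 0)].mean()
c1 = rating[clean & (label == 1)].mean()
pred = (np.abs(rating - c1) < np.abs(rating - c0)).astype(int)

print("accuracy on clean:", (pred[clean] == label[clean]).mean())
print("accuracy on messy:", (pred[messy] == label[messy]).mean())
```

In this toy version, accuracy on the messy middle drops well below accuracy on the clean extremes but stays above chance — the interesting empirical question is where real speed-date data would land.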

(2) This relates to the point in the main section about how R&al2013 really care about integrating insights from the psychology of the things they’re trying to classify. In the lit review, I appreciated the discussion of the psychological literature related to interpersonal stance (e.g., specifying the different categories of affective states). This demonstrates the authors are aware of the cognitive states underpinning the linguistic expression.

(3) Lexical categories, using 10 LIWC-like categories: I appreciated seeing the reasoning in footnote 1 about how they came up with these, and more importantly, why they modified them the way they did. While I might not agree with leaving the “love” and “hate” categories so basic (why not use WordNet synsets to expand this?), it’s at least a reasonable start. Same comment for the hedge category (which I love seeing in the first place).

(4) Dialog and discourse features: Some of these seem much more complex to extract (ex: sympathetic negative assessments). The authors went for a simple heuristic regular expression to extract these, but this is presumably only a (reasonable) first-pass attempt. On the other hand, given that they had less than 1000 speed-dates, they probably could have done some human annotation of these just to give the feature the best shot of being useful. Then, if it’s useful, they can worry about how to automatically extract it later.
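For concreteness, here’s what a first-pass regex heuristic of this kind might look like (the pattern below is my own illustrative guess, not the expression R&al2013 actually used):

```python
import re

# Hypothetical first-pass pattern for "sympathetic negative assessments":
# phrases like "oh no", "that's terrible", "that is a shame".
SYMPATHY_RE = re.compile(
    r"\b(oh no|(that's|that is) (terrible|awful|a shame|too bad))\b",
    re.IGNORECASE,
)

turns = [
    "Oh no, that's terrible! What did you do?",
    "I work in finance.",
    "That is too bad, I'm sorry to hear it.",
]
for t in turns:
    print(bool(SYMPATHY_RE.search(t)), t)
```

The obvious weakness is recall — sympathy gets expressed in endlessly varied ways — which is exactly why human annotation on a sub-1000-item dataset seems like a worthwhile first step before automating.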

(5)  It’s so interesting to see the accommodation of function words signifying flirtation. Function words were the original authorship stylistic marker, under the assumption that your use of function words isn’t under your conscious control. I guess the idea would be that function word accommodation also isn’t really under your conscious control, and imitation is the sincerest form of flattery (=~ flirtation)…

Tuesday, February 21, 2017

Thoughts on Rubin et al. 2015

As with much of the deception detection literature, it’s always such a surprise to me how relatively modest the performance gains are. (Here, the predictive model doesn’t actually get above chance performance, for example — of course, neither do humans.) This underscores how difficult a problem deception detection from linguistic cues generally is (or at least, is currently). 

For this paper, I appreciated seeing the incorporation of more sophisticated linguistic cues, especially those with more intuitive links to the psychological processes underlying deception (e.g., rhetorical elements representing both what the deceiver chooses to focus on and the chain of argument from one point to the next). I wonder if there’s a way to incorporate theory of mind considerations more concretely, perhaps via pragmatic inference linked to discourse properties (I have visions of a Rational-Speech-Act-style framework being useful somehow).

Other thoughts:

(1) I wonder if it’s useful to compare and contrast the deception process that underlies fake product reviews with the process underlying fake news. In some sense, they’re both “imaginative writing”, and they’re both  about a specific topic that could involve verifiable facts. (This comes to mind especially because of the detection rate of around 90% for the fake product reviews in the data set of Ott et al. 2011, 2013, using just n-grams + some LIWC features).

Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 309-319). Association for Computational Linguistics.

Ott, M., Cardie, C., & Hancock, J. T. 2013. Negative Deceptive Opinion Spam. In HLT-NAACL (pp. 497-501).

(2) I really appreciated the discussion of the issues surrounding “citizen journalism”. I wonder if an easier (or alternative) route for news verification is considering a set of reports about the same topic in aggregate — i.e., a wisdom-of-the-crowds approach over the content of the reports. The result is aggregated content (note: perhaps cleverly aggregated to be weighted by various linguistic/rhetorical/topic features) that reflects the ground truth better than any individual report, and thus would potentially mitigate the impact of any single fake news report. You might even be able to use the “Bluff the Listener” NPR news data R&al2015 used, though there you only have three stories at a time on the same topic (and two are in fact fake, so your “crowd” of stories is deception-biased).
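A toy version of the aggregation idea (my own sketch, with made-up claims and credibility weights — the weights are where the linguistic/rhetorical features would plug in):

```python
from collections import defaultdict

def aggregate(reports):
    """reports: list of (credibility_weight, set_of_claims).
    Keep claims supported by a weighted majority of reports."""
    scores = defaultdict(float)
    total = sum(w for w, _ in reports)
    for weight, claims in reports:
        for claim in claims:
            scores[claim] += weight / total
    return {c for c, s in scores.items() if s > 0.5}

reports = [
    (1.0, {"fire downtown", "two injured"}),
    (0.8, {"fire downtown", "arson suspected"}),
    (0.3, {"alien landing"}),  # low-credibility outlier
]
print(aggregate(reports))  # consensus: {"fire downtown"}
```

The low-credibility outlier gets washed out by the consensus — though, as with the “Bluff the Listener” data, this only works if the crowd of reports isn’t itself deception-biased.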

(3) Something I’m interested in, given the sophistication of the RST discourse features — what are some news examples that a simplistic n-grams approach would miss (either false positives or false negatives)? Once we have those key examples, we can look at the discourse feature profiles of those examples to see if anything pops out. This then tells us what value these discourse features would add to a standard baseline n-gram model, especially since they currently have to be manually annotated.

Tuesday, January 31, 2017

Thoughts on Iyyer et al. 2014

I really appreciate that I&al2014’s goal is to go beyond bag-of-words approaches and leverage the syntactic information available (something that warms my linguistic heart).  To this end, we see a nice example in Figure 1 of the impact of lexical choice and structure on the overall bias of a sentence, with “big lie” + its complement (a proposition) = opposite bias of the proposition. Seeing this seemingly sophisticated compositional process, I was surprised to see later on that negation causes such trouble. Maybe this has to do with the sentiment associated with “lie” (which is implicitly negative), while “not” has no obvious valence on its own?

Some other thoughts:

(1) Going over some of the math specifics: In the supervised objective loss function in (5), I’m on board with l(pred_i), but what’s gamma? (A regularization weight of some kind? And is it over two just so the derivative works out in equation 6?) Theta is apparently the set of vectors corresponding to the components (W_L, W_R), the weights on the components (W_cat), the biases (b_1, b_2), and some other vector W_e (which later on is described as a word embedding matrix from word2vec)…and that gets squared in the objective function because…?
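For my own bookkeeping, here’s a toy sketch of the composition step and of what I take the gamma term to be (dimensions, initialization, and the gamma value are all made up; W_cat and W_e are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # toy embedding dimension

# Recursive composition: parent vector from two children,
# p = tanh(W_L @ left + W_R @ right + b) -- the standard RvNN step.
W_L, W_R = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b = np.zeros(d)

def compose(left, right):
    return np.tanh(W_L @ left + W_R @ right + b)

# My reading of the gamma/2 * ||theta||^2 term: a standard L2 regularizer.
# Gamma sets its strength, and the 1/2 cancels the exponent's 2 when
# differentiating, since d/dtheta of (gamma/2) * theta^2 = gamma * theta.
def l2_penalty(theta_parts, gamma=1e-4):
    return (gamma / 2) * sum(np.sum(p ** 2) for p in theta_parts)

parent = compose(rng.normal(size=d), rng.normal(size=d))
print(parent.shape, float(l2_penalty([W_L, W_R, b])))
```

If that reading is right, the squaring isn’t mysterious: it’s just the L2 norm of all the parameters, penalizing large weights to prevent overfitting.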

(2) I like seeing the impact of initialization settings (random vs. prior knowledge = 300-dimensional word2vec vectors). The upshot is that word2vec prior knowledge about words is helpful — though only by 1% in performance, much to my surprise. I expected this semantic knowledge to be more helpful (again, my linguistic knowledge bias is showing).

(3) Dataset stuff:

(a) I found it a bit odd that the authors first note that partisanship (i.e., whether someone is Republican or Democrat) doesn’t always correlate with their ideological stance on a particular issue (i.e., conservative or liberal), and then say how they’re going to avoid conflating these things by creating a new annotated data set. But then, when creating their sentence labels, they propagate the party label (Republican/Democrat) down from the speaker to individual sentences, making exactly these mappings (Republican—>conservative, Democrat—>liberal) they just said they didn’t want to conflate. Did I miss something? (Also, why not use CrowdFlower to verify the propagated annotations?)

(b) Relatedly, when winnowing down the sentences that are likely to be biased for the annotated dataset, I&al2014 rely on exactly the hand-crafted methods that they shied away from before (e.g., a dictionary of “sticky bigrams” strongly associated with one party or the other). So maybe there’s a place for these methods at some point in the classifier development pipeline (in terms of identifying useful data to train on).

(c) The final dataset size is 7816 sentences — wow! That’s tiny in NLP dataset size terms. Even when you add the 11,555 hand-tagged ones from the IBC, that’s still less than 20K sentences to learn from. Maybe this is an instance of quality over quantity when it comes to learning (and hopefully not overfitting)?

(4) It’s really nice to see specific examples where I&al2014’s approach did better than the different baselines. This helps with the explanation of what might be going on (basically, structurally-cued shifts in ideology get captured). Also, here’s where negation strikes! It’s always surprising to me that more explicit mechanisms for handling negation structurally aren’t implemented, given how much power negation has when it comes to interpretation. I&al2014 say this can be solved by more training data (probably true)…so maybe the vectorized representation of “not” would get encoded to be something like its linguistic structural equivalent?

Tuesday, January 10, 2017

Some thoughts on Mikolov et al. 2013

I definitely find it as interesting as M&al2013 do that some morphosyntactic relationships (e.g., past tense vs. present tense) are captured by these distributed vector representations of words, in addition to the semantic relationships. That said, this paper left me desperately wanting to know why these vector representations worked that way. Was there anything interpretable in the encodings themselves? (This is one reason why current research into explaining neural network results is so attractive — it’s nice to see cool results, but we want to know what the explanation is for those results.) Put simply, I can see that forcing a neural network to learn from big data in an unsupervised way yields these implicit relationships in the word encodings. (Yay! Very cool.) But tell me more about why the encodings look the way they do so we better understand this representation of meaning.
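As a reminder to myself of how the analogy test works mechanically, here’s a toy version of the vector offset method with hand-built vectors (not real word2vec embeddings — the vectors and vocabulary are invented purely for illustration):

```python
import numpy as np

# Toy "embeddings": vec(a) - vec(b) + vec(c) should land nearest the
# answer to "a is to b as c is to ?" under cosine similarity.
vecs = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.8, 0.9, 0.2]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via the vector offset method."""
    target = vecs[a] - vecs[b] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("king", "man", "woman"))  # expected: "queen"
```

The mechanics are simple; what’s crying out for explanation is why unsupervised training produces embeddings where these offsets are consistent in the first place.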

Other thoughts:

(1) Everything clearly rides on how the word vectors are created (“…where similar words are likely to have similar vectors”). And that’s accomplished via an RNN language model very briefly sketched in Figure 1. I think it would be useful to better understand what we can of this, since this is the force that’s compressing the big data into helpful word vectors. 

One example:  the model is “…trained with back-propagation to maximize the data log-likelihood under the model…training such a purely lexical model to maximize likelihood will induce word representations…”  — What exactly are the data? Utterances?  Is there some sense of trying to predict the next word the way previous models did? Otherwise, if everything’s just treated as a bag of words presumably, how would that help regularize word representations?

(2) Table 2: Since the RNN-1600 does the best, it would be handy to know what the “several systems” were that comprised it. That said, there seems to be an interesting difference in performance between adjectives and nouns on one hand (at best, 23-29% correct) and verbs on the other (at best, 62%), especially for the RNN versions. Why might that be? The only verb relation was the past vs present tense…were there subsets of noun or adjective relations with differing performance, or were all the noun and all the adjective relations equal? (That is, is this effectively a sampling error, and if we tested more verb relations, we’d find more varied performance?) Also, it’d be interesting to dig into the individual results and see if there were particular word types the RNN representations were especially good or bad at. 

(3) Table 3: Since the RNN-1600 was by far the best of the RNNs in Table 2 (and in fact RNN-80 was the worst), why pick the RNN-80 to compare against the other models (CW, HLBL)?

(4) Table 4, semantic relation results: When .275 is the best Spearman’s rho you can get, it shows this is a pretty hard task…I wonder what human performance would be. I assume close to 1.00 if these are the simple analogy-style questions? (Side note: MaxDiff is apparently this, and is another way of dealing with scoring relational data.)