Tuesday, January 31, 2017

Thoughts on Iyyer et al. 2014

I really appreciate that I&al2014’s goal is to go beyond bag-of-words approaches and leverage the syntactic information available (something that warms my linguistic heart).  To this end, we see a nice example in Figure 1 of the impact of lexical choice and structure on the overall bias of a sentence, with “big lie” + its complement (a proposition) = opposite bias of the proposition. Seeing this seemingly sophisticated compositional process, I was surprised to see later on that negation causes such trouble. Maybe this has to do with the sentiment associated with “lie” (which is implicitly negative), while “not” has no obvious valence on its own?

Some other thoughts:

(1) Going over some of the math specifics: In the supervised objective loss function in (5), I’m on board with l(pred_i), but what’s gamma? (A bias parameter of some kind? And is it over two just so the derivative works out in equation (6)?) Theta is apparently the set of composition matrices (W_L, W_R), the classification weights (W_cat), the biases (b_1, b_2), and some other matrix W_e (which later on is described as a word embedding matrix from word2vec)…and its norm gets squared in the objective function because…?

(2) I like seeing the impact of initialization settings (random vs. prior knowledge = 300-dimensional word2vec vectors). The upshot is that word2vec prior knowledge about words is helpful — though only by 1% in performance, much to my surprise. I expected this semantic knowledge to be more helpful (again, my linguistic knowledge bias is showing).

(3) Dataset stuff:

(a) I found it a bit odd that the authors first note that partisanship (i.e., whether someone is Republican or Democrat) doesn’t always correlate with their ideological stance on a particular issue (i.e., conservative or liberal), and then say how they’re going to avoid conflating these things by creating a new annotated data set. But then, when creating their sentence labels, they propagate the party label (Republican/Democrat) down from the speaker to individual sentences, making exactly these mappings (Republican—>conservative, Democrat—>liberal) they just said they didn’t want to conflate. Did I miss something? (Also, why not use CrowdFlower to verify the propagated annotations?)

(b) Relatedly, when winnowing down the sentences that are likely to be biased for the annotated dataset, I&al2014 rely on exactly the hand-crafted methods that they shied away from before (e.g., a dictionary of “sticky bigrams” strongly associated with one party or the other). So maybe there’s a place for these methods at some point in the classifier development pipeline (in terms of identifying useful data to train on).

(c) The final dataset size is 7,816 sentences — wow! That’s tiny in NLP dataset-size terms. Even when you add the 11,555 hand-tagged ones from the IBC, that’s still fewer than 20K sentences to learn from. Maybe this is an instance of quality over quantity when it comes to learning (and hopefully not overfitting)?

(4) It’s really nice to see specific examples where I&al2014’s approach did better than the different baselines. This helps with the explanation of what might be going on (basically, structurally-cued shifts in ideology get captured). Also, here’s where negation strikes! It’s always surprising to me that more explicit structural handling of negation isn’t implemented, given how much power negation has when it comes to interpretation. I&al2014 say this can be solved by more training data (probably true)…so maybe the vectorized representation of “not” would get encoded to be something like its linguistic structural equivalent?


  1. Very interesting paper! It did take me a moment, though, to realize I was looking at a RECURSIVE neural network instead of the more popular recurrent neural network. From a theoretical perspective, it's great work. What it really needs is a comparison to the recurrent framework in order to demonstrate some kind of superior performance. My guess is that a flat recurrent NN that ignores syntax would do just as well on a large dataset. That's really the problem with the recursive net: you have to parse, and parsing takes time. From an applications perspective, if I have N billion documents, do I really gain anything from parsing?

    Some clarifications to your questions, Lisa:

    1) For equation (5), gamma is actually the regularization strength rather than a bias, but to understand its purpose you have to understand what it's weighting. The double vertical bars indicate a norm; in this case, because it's squared, you're looking at the squared L2 norm: square every parameter and sum them up. This penalizes a model which sets its parameters very far from 0 and is a common way to regularize a model (if you want sparsity specifically, that's what the L1 norm is for, not the L2).
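    To make that concrete, here's a minimal numpy sketch of the squared-L2 penalty from equation (5). The parameter shapes and the gamma value are made up for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical parameter set theta; shapes are illustrative only.
rng = np.random.default_rng(0)
theta = {
    "W_L": rng.normal(size=(300, 300)),   # left-child composition matrix
    "W_R": rng.normal(size=(300, 300)),   # right-child composition matrix
    "W_cat": rng.normal(size=(2, 300)),   # classification weights
    "b_1": rng.normal(size=300),          # composition bias
    "b_2": rng.normal(size=2),            # classification bias
}
gamma = 1e-4  # regularization strength (a tunable hyperparameter)

# Squared L2 norm of theta: square every parameter and sum them up.
l2_sq = sum(np.sum(p ** 2) for p in theta.values())

# The gamma/2 factor is a convenience: differentiating
# (gamma/2) * ||theta||^2 gives a gradient of gamma * theta,
# with no stray factor of 2.
penalty = (gamma / 2.0) * l2_sq
```

    That also answers the "over two" question above: the 1/2 exists purely so the derivative comes out clean.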

    2) The initialization results look to be very standard. They are working with very small data here, so you would generally expect pre-trained w2v to be of some benefit. But the important thing to keep in mind is that the w2v embeddings were trained on completely different data. To the degree that other data differs from their political data, pretrained embeddings can actually HURT the model, because it has to learn to ignore the irrelevant bits. Embeddings can be very finicky: pretrained w2v is a very general-purpose embedding, but any particular task is going to succeed best with embeddings that are fine-tuned towards its end goal. For this reason you typically see pre-trained embeddings on tasks with small amounts of data, since with large data you can adequately learn the embeddings on your own and pretraining offers no benefit (typically).

    3) Yeah, this is real small data stuff. Very hard to say how well it might generalize to other data. I don't imagine they have done much to extend this work. Although there are huge piles of political speech laying around, the annotation work looks terrible.

    4) The negation issue is actually really interesting to me here. This is a very old paper in terms of deep learning, and so I had to re-read their description of the recursive NN to appreciate that their method is actually quite naive. Now, a recurrent NN would solve the negation problem because for every word it has some type of hidden memory unit which passes information along. The model learns when it should remember something and when it should forget. That allows these models to do things like recognize that "the movie looked great on paper" is really a negative sentiment. It goes word-by-word, and things are looking very positive by the time it reaches "great"; it passes that info on, and by the time it reaches "paper", it realizes: whoa, what I'm seeing now combines with my memory of what I've seen before in a way that tells me this is super negative. You just need enough training data to let the model learn those types of relationships.
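    A minimal sketch of that word-by-word information passing, assuming a vanilla (ungated) recurrent step with toy dimensions and random vectors; this illustrates the general idea, not the paper's model or a full LSTM:

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One vanilla recurrent step: fold the current word vector x into
    the running hidden state h_prev. An LSTM/GRU would add learned
    remember/forget gates here, which is what preserves long-range
    information like an early negation."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

# Toy dimensions and random stand-ins for learned parameters/embeddings.
d = 8
rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)

h = np.zeros(d)  # initial hidden state
for word_vec in [rng.normal(size=d) for _ in ("the", "movie", "looked", "great", "on", "paper")]:
    h = rnn_step(h, word_vec, W_h, W_x, b)
# h now summarizes the whole sequence, built up word by word.
```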

    Now for their recursive approach, they don't have any type of memory mechanism built in. It reads phrase by phrase and passes information along through the summation in Eq (1). This is bad news bears territory. If your two relevant pieces of information are right next to one another then you're fine "not" + "great" = BAD. But if the two relevant pieces are far apart, you get vanishing gradients. "not" + word1, then combines with word2, then word3, etc... And by the time you get to the actual important bit, all that information about not has been lost because nobody told you to keep it around. This was a big problem for the first varieties of recurrent NNs and is a big reason why more recent work tends to focus on LSTM and GRU layers which hold information over long dependencies better.
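    For reference, here's a minimal numpy sketch of that recursive composition (the form of their Eq (1): parent = f(W_L * left + W_R * right + b), with one weight pair shared at every node). Dimensions and word vectors are toy stand-ins, not the paper's:

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
W_L, W_R, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)

def compose(left, right):
    """Eq (1)-style composition: parent = f(W_L @ left + W_R @ right + b),
    with the same W_L, W_R reused at every node of the parse tree."""
    return np.tanh(W_L @ left + W_R @ right + b)

# Toy word vectors (random stand-ins for learned embeddings).
not_v, great_v = rng.normal(size=d), rng.normal(size=d)

# Adjacent negation composes in one step: ("not" "great") -> parent.
phrase = compose(not_v, great_v)

# A distant dependency has to survive several compositions, and with
# no memory mechanism the "not" signal can wash out along the way.
deep = phrase
for _ in range(4):
    deep = compose(deep, rng.normal(size=d))
```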

  2. I actually like the recursive models better than the recurrent models because they seem more tractable for analysis (although they might be less psychologically plausible).

    Lawrence raises the issue of vanishing gradients when the negation is separated from the thing that is negated. I think the problem is actually worse.

    The additive vector logic might not work at all for "not", even when the words are close. For example, I can see how you could produce a vector for "not" such that "not" + "great" = "BAD". However, I do not immediately see how that same vector for "not" also produces "not" + "bad" = "GREAT".

    It seems that you want the constituent vectors to have multiplicative influences, as in a mixture-of-experts model. For example, the influence of "not" could be to flip the vector following it. Perhaps there is a modification of the recursive model that would be more expressive in terms of how the left and right constituent vectors are combined.
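    A tiny worked example of this point, using a made-up 1-D sentiment axis rather than real embeddings: a single additive "not" vector cannot send "great" to "bad" and also "bad" to "great", while a multiplicative flip handles both directions:

```python
import numpy as np

# Toy 1-D "sentiment axis": +1.0 is positive, -1.0 is negative.
# These values are illustrative assumptions, not learned embeddings.
great = np.array([1.0])
bad = np.array([-1.0])

# Purely additive "not": solve for not_vec so that not + great = bad.
not_vec = bad - great                      # [-2.0]
assert np.allclose(not_vec + great, bad)   # "not great" = "bad": works

# But the SAME additive vector fails on "not bad":
not_bad = not_vec + bad                    # [-3.0], even more negative,
# which is nowhere near "great".

# Multiplicative "not" (flip the vector) handles both directions:
flip = -1.0
assert np.allclose(flip * great, bad)      # "not great" = "bad"
assert np.allclose(flip * bad, great)      # "not bad" = "great"
```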

  3. Recurrent NNs are really just recursive NNs where you feed everything in in a linear order. Obviously this makes the recursive version more powerful, but it imposes additional computational cost, which I believe is the primary reason recursive nets have been investigated much less thoroughly. I think recursive nets also haven't been well supported by libraries such as Keras, which makes them more difficult to implement. Perhaps with the addition of more flexible computational machinery such as Edward (http://edwardlib.org/) we might see an increase in the number of papers related to less standard neural network architectures.

    For the vector logic, I would have to think a bit more about their implementation. Certainly from my own experience I can say that the negation issue is not so bad, but that is probably because of some added non-linearity in the recurrent layer. I'd agree that the way they integrate left and right constituents seems less than ideal, since you have to share the weights everywhere. If you were already parsing, you could have W matrices learned for particular syntactic constituents, but learning those would probably require more data than they actually have to play around with.
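    That constituent-specific-weights idea could look something like this hypothetical sketch (the label set, dimensions, and composition function are all assumptions for illustration; the paper itself shares one W_L/W_R pair everywhere):

```python
import numpy as np

d = 8
rng = np.random.default_rng(2)

# One (W_L, W_R) pair per syntactic label: a hypothetical extension,
# not something the paper implements.
labels = ["NP", "VP", "Neg"]
weights = {lab: (rng.normal(size=(d, d)), rng.normal(size=(d, d))) for lab in labels}
b = np.zeros(d)

def compose(label, left, right):
    """Pick the composition weights by constituent type, then apply
    the usual parent = f(W_L @ left + W_R @ right + b) step."""
    W_L, W_R = weights[label]
    return np.tanh(W_L @ left + W_R @ right + b)

# e.g. a negation node gets its own learned transformation:
not_v, great_v = rng.normal(size=d), rng.normal(size=d)
p = compose("Neg", not_v, great_v)
```

    The trade-off is exactly the one noted above: each extra label multiplies the parameter count, which their ~20K sentences probably couldn't support.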