I really appreciate that I&al2014's goal is to go beyond bag-of-words approaches and leverage the available syntactic information (something that warms my linguistic heart). To this end, Figure 1 gives a nice example of how lexical choice and structure shape the overall bias of a sentence: "big lie" + its complement (a proposition) = the opposite bias of the proposition. Given this seemingly sophisticated compositional process, I was surprised to see later on that negation causes so much trouble. Maybe this has to do with the sentiment associated with "lie" (which is implicitly negative), while "not" has no obvious valence on its own?
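For my own understanding, here's roughly what I take that node-level composition to be doing: a minimal numpy sketch assuming the standard RvNN setup with the matrices the paper names (W_L, W_R, b_1 for composition; W_cat, b_2 for the per-node softmax). The dimensionality, the tanh nonlinearity, and all the toy values are my guesses, not the trained model.

```python
import numpy as np

d = 300  # node/word vector dimensionality (word2vec-sized; my assumption)
rng = np.random.default_rng(0)

# Toy parameters standing in for the learned W_L, W_R, b_1, W_cat, b_2.
W_L = rng.normal(0, 0.01, (d, d))
W_R = rng.normal(0, 0.01, (d, d))
b_1 = np.zeros(d)
W_cat = rng.normal(0, 0.01, (2, d))  # 2 classes: liberal / conservative
b_2 = np.zeros(2)

def compose(left_vec, right_vec):
    """Parent vector for a parse-tree node, built from its two children."""
    return np.tanh(W_L @ left_vec + W_R @ right_vec + b_1)

def predict(node_vec):
    """Ideology distribution at a node (softmax over W_cat)."""
    scores = W_cat @ node_vec + b_2
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# "big lie" + its propositional complement: the parent vector is a *function*
# of both children, so trained W_L/W_R can map the pair onto the opposite
# label from the complement alone (with random weights this just checks shapes).
big_lie = rng.normal(0, 0.1, d)
proposition = rng.normal(0, 0.1, d)
parent = compose(big_lie, proposition)
print(predict(proposition), predict(parent))
```

With trained weights, the point is that the parent's label comes out of the whole (W_L, W_R) mapping rather than a sum of word-level biases, which is how "big lie" can flip the proposition's polarity.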
Some other thoughts:
(1) Going over some of the math specifics: In the supervised objective loss function in (5), I'm on board with l(pred_i), but what's gamma? (A bias parameter of some kind? And is it divided by two just so the derivative works out in equation 6?) Theta is apparently the set of parameters: the composition matrices (W_L, W_R), the weights on the composed nodes (W_cat), the biases (b_1, b_2), and some other matrix W_e (which later on is described as a word embedding matrix initialized from word2vec)…and that gets squared in the objective function because…?
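For my own bookkeeping, here's what I think equation (5) amounts to if gamma is just an L2 regularization weight rather than a bias (my assumption; I&al2014 don't spell it out in these terms as far as I can tell):

```latex
% My reconstruction of the supervised objective, assuming gamma is an L2 penalty weight:
C(\theta) \;=\; \sum_i \ell(\mathrm{pred}_i) \;+\; \frac{\gamma}{2}\,\lVert \theta \rVert_2^2,
\qquad \theta = (W_L, W_R, W_{cat}, W_e, b_1, b_2)

% Differentiating the penalty term is what makes the "over two" make sense:
\frac{\partial}{\partial \theta}\;\frac{\gamma}{2}\lVert \theta \rVert_2^2 \;=\; \gamma\,\theta
```

If that reading is right, theta only gets "squared" in the sense that its L2 norm does (the usual weight-decay penalty keeping parameters small), and the 1/2 is there so the 2 from the derivative cancels cleanly in equation (6).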
(2) I like seeing the impact of initialization settings (random vs. prior knowledge, i.e., 300-dimensional word2vec vectors). The upshot is that word2vec's prior knowledge about words is helpful — though only by 1% in performance, much to my surprise. I expected this semantic knowledge to be more helpful (again, my linguistic knowledge bias is showing).
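Just to make the comparison concrete to myself: the two regimes presumably differ only in how the rows of W_e start out (a minimal sketch; the `w2v` lookup here is a stand-in for real pretrained word2vec vectors, and the random-init scale is my guess):

```python
import numpy as np

d = 300
vocab = ["big", "lie", "not", "healthcare"]  # toy vocabulary
rng = np.random.default_rng(1)

# (i) random initialization: every word starts out as noise
W_e_random = rng.uniform(-0.05, 0.05, (len(vocab), d))

# (ii) prior-knowledge initialization: copy in pretrained word2vec vectors
w2v = {w: rng.normal(0, 0.1, d) for w in vocab}  # stand-in for a real pretrained lookup
W_e_pretrained = np.stack([w2v[w] for w in vocab])

# Either way, W_e is part of theta and keeps getting updated during training,
# so the pretrained vectors only set the starting point.
```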
(3) Dataset stuff:
(a) I found it a bit odd that the authors first note that partisanship (i.e., whether someone is a Republican or a Democrat) doesn't always correlate with their ideological stance on a particular issue (i.e., conservative or liberal), and then say they're going to avoid conflating these things by creating a new annotated dataset. But then, when creating their sentence labels, they propagate the party label (Republican/Democrat) down from the speaker to individual sentences, making exactly the mappings (Republican -> conservative, Democrat -> liberal) they just said they didn't want to conflate. Did I miss something? (Also, why not use CrowdFlower to verify the propagated annotations?)
(b) Relatedly, when winnowing down the sentences that are likely to be biased for the annotated dataset, I&al2014 rely on exactly the kind of hand-crafted methods they shied away from before (e.g., a dictionary of "sticky bigrams" strongly associated with one party or the other). So maybe there's a place for these methods somewhere in the classifier development pipeline, in terms of identifying useful data to train on (I sketch what that filtering step might look like just after point (c) below).
(c) The final dataset size is 7,816 sentences — wow! That's tiny by NLP dataset standards. Even when you add the 11,555 hand-tagged sentences from the IBC, that's still fewer than 20K sentences to learn from. Maybe this is an instance of quality over quantity when it comes to learning (and hopefully not overfitting)?
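Re (b): a sticky-bigram filter is easy enough to sketch. The bigram list and the sentences below are made up for illustration (not the paper's actual dictionary); the point is just that a hand-crafted resource picks the annotation candidates:

```python
# Hypothetical partisan "sticky bigrams" (made-up examples, not from the paper)
STICKY_BIGRAMS = {("death", "tax"), ("tax", "relief"), ("working", "families")}

def likely_biased(sentence: str) -> bool:
    """Keep a sentence as an annotation candidate if it contains any sticky bigram."""
    tokens = sentence.lower().split()
    bigrams = set(zip(tokens, tokens[1:]))
    return bool(bigrams & STICKY_BIGRAMS)

corpus = [
    "We must repeal the death tax once and for all.",   # kept
    "The committee will meet again on Tuesday.",         # dropped
]
candidates = [s for s in corpus if likely_biased(s)]
print(candidates)
```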
(4) It's really nice to see specific examples where I&al2014's approach did better than the various baselines. This helps explain what might be going on (basically, structurally-cued shifts in ideology get captured). Also, here's where negation strikes! It always surprises me that more explicit mechanisms for handling negation structurally aren't implemented, given how much power negation has when it comes to interpretation. I&al2014 say this can be solved by more training data (probably true)…so maybe the vectorized representation of "not" would get encoded to be something like its linguistic structural equivalent?
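To spell out what I mean by "linguistic structural equivalent" in this setup: under the composition sketched earlier, a negated phrase is (I assume) something like

```latex
x_{\text{not } p} \;=\; f\!\left(W_L\, x_{\text{not}} + W_R\, x_p + b_1\right)
```

so for "not" to act like a genuine operator, the learned contribution of W_L x_not would have to push the composed vector across W_cat's decision boundary for more or less any phrase vector x_p. That seems like exactly the kind of across-the-board regularity that's hard to pick up from the handful of negated sentences in a sub-20K training set, which would be consistent with the authors' more-data explanation.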