Tuesday, January 31, 2017

Thoughts on Iyyer et al. 2014

I really appreciate that I&al2014’s goal is to go beyond bag-of-words approaches and leverage the syntactic information available (something that warms my linguistic heart).  To this end, we see a nice example in Figure 1 of the impact of lexical choice and structure on the overall bias of a sentence, with “big lie” + its complement (a proposition) = opposite bias of the proposition. Seeing this seemingly sophisticated compositional process, I was surprised to see later on that negation causes such trouble. Maybe this has to do with the sentiment associated with “lie” (which is implicitly negative), while “not” has no obvious valence on its own?

Some other thoughts:

(1) Going over some of the math specifics: In the supervised objective loss function in (5), I’m on board with l(pred_i), but what’s gamma? (A bias parameter of some kind? And is over two just so the derivative works on in equation 6?)) Theta is apparently the set of vectors corresponding to the components (W_L, W_R), the weights on the components (W_cat), the biases (b_1, b_2), and some other vector W_e (which later on is described as a word embedding matrix from word2vec)…and that gets squared in the objective function because…?

(2) I like seeing the impact of initialization settings (random vs prior knowledge= 300 dimensional word2vec). The upshot is that word2vec prior knowledge about words is helpful — though only by 1% in performance, much to my surprise. I expected this semantic knowledge to be more helpful (again, my linguistic knowledge bias is showing).

(3) Dataset stuff:

(a) I found it a bit odd that the authors first note that partisanship (i.e., whether someone is Republican or Democrat) doesn’t always correlate with their ideological stance on a particular issue (i.e., conservative or democrat), and then say how they’re going to avoid conflating these things by creating a new annotated data set. But then, when creating their sentence labels, they propagate the party label (Republican/Demoncrat) down from the speaker to individual sentences, making exactly these mappings (Republican—>conservative, Democrat—>liberal) they just said they didn’t want to conflate. Did I miss something? (Also, why not use crowdflower to verify the propagated annotations?)

(b) Relatedly, when winnowing down the sentences that are likely to be biased for the annotated dataset, I&al2014 rely on exactly the hand-crafted methods that they shied away from before (e.g., a dictionary of “sticky bigrams” strongly associated with one party or the other). So maybe there’s a place for these methods at some point in the classifier development pipeline (in terms of identifying useful data to train on).

(c) The final dataset size is 7816 sentences — wow! That’s tiny in NLP dataset size terms. Even when you add the 11,555 hand-tagged ones from the IBC, that’s still less than 20K sentences to learn from. Maybe this is an instance of quality over quantity when it comes to learning (and hopefully not overfitting)?

(4) It’s really nice to see specific examples where I&al2014’s approach did better than the different  baselines. This helps with the explanation of what might be going on (basically, structurally-cued shifts in ideology get captured). Also, here’s where negation strikes! It’s always surprising to me that more explicit things to handle negation structurally aren’t implemented, given how much power negation has when it comes to interpretation. I&al2014 say this can be solved by more training data (probably true)…so maybe the vectorized representation of “not” would get encoded to be something like its linguistic structural equivalent? 

Tuesday, January 10, 2017

Some thoughts on Mikolov et al. 2013

I definitely find it as interesting as M&al2013 do that some morphosyntactic relationships (e.g., past tense vs. present tense) are captured by these distributed vector representations of words, in addition to the semantic relationships. That said, this paper left me desperately wanting to know why these vector representations worked that way. Was there anything interpretable in the encodings themselves? (This is one reason why current research into explaining neural network results is so attractive — it’s nice to see cool results, but we want to know what the explanation is for those results.) Put simply, I can see that forcing a neural network to learn from big data in an unsupervised way yields these implicit relationships in the word encodings. (Yay! Very cool.) But tell me more about why the encodings look the way they do so we better understand this representation of meaning.

Other thoughts:

(1) Everything clearly rides on how the word vectors are created (“…where similar words are likely to have similar vectors”). And that’s accomplished via an RNN language model very briefly sketched in Figure 1. I think it would be useful to better understand what we can of this, since this is the force that’s compressing the big data into helpful word vectors. 

One example:  the model is “…trained with back-propagation to maximize the data log-likelihood under the model…training such a purely lexical model to maximize likelihood will induce word representations…”  — What exactly are the data? Utterances?  Is there some sense of trying to predict the next word the way previous models did? Otherwise, if everything’s just treated as a bag of words presumably, how would that help regularize word representations?

(2) Table 2: Since the RNN-1600 does the best, it would be handy to know what the “several systems” were that comprised it. That said, there seems to be an interesting difference in performance between adjectives and nouns on one hand (at best, 23-29% correct) and verbs on the other (at best, 62%), especially for the RNN versions. Why might that be? The only verb relation was the past vs present tense…were there subsets of noun or adjective relations with differing performance, or were all the noun and all the adjective relations equal? (That is, is this effectively a sampling error, and if we tested more verb relations, we’d find more varied performance?) Also, it’d be interesting to dig into the individual results and see if there were particular word types the RNN representations were especially good or bad at. 

(3) Table 3: Since the RNN-1600 was by far the best of the RNNs in Table 2 (and in fact RNN-80 was the worst), why pick the RNN-80 to compare against the other models (CW, HLBL)?

(4) Table 4, semantic relation results: When .275 is the best Spearman’s rho can you can get, it shows this is a pretty hard task…I wonder what human performance would be. I assume close to 1.00 if these are the simple analogy-style questions? (Side note: MaxDiff is apparently this, and is another way of dealing with scoring relational data.)