Wednesday, December 2, 2020

Some thoughts on Caplan et al. 2020

I appreciate seeing existence proofs like the one C&al2020 provide here -- more specifically, the previous article by PTG seemed to invite an existence proof that certain properties of a lexicon (ambiguous words being short, frequent, and easy to articulate) could arise from something besides communicative efficiency. C&al2020 then obliged them by providing an existence proof grounded in empirical data. I admit that I had some confusion about the specifics of the communicative efficiency debate (more on this below) as well as PTG’s original findings (more on this below too), but this may be due to actual vagueness in how “communicative efficiency” is talked about in general. 


Specific thoughts:

(1) Communicative efficiency: It hit me right in the introduction that I was confused about what communicative efficiency was meant to be. There, “communicative efficiency” seems to be defined with respect to ambiguity: ambiguity is viewed as not communicatively efficient. But isn’t ambiguity efficient for the speaker? It’s just not so helpful for the listener. So, this suggests communicative efficiency is about comprehension, rather than production.


Okay. Then, efficiency is about something like information transfer (or entropy reduction, etc.). This then makes sense with the Labov quote at the beginning that talks about the “maximization of information” as a signal of communicative efficiency. That is, if you’re communicatively efficient, you maximize information transfer to the listener.


Then, we have the callout to Darwin, with the idea that “better, shorter, and easier forms are constantly gaining the upper hand”. Here, “better” and “easier” need to be defined. (Shorter, at least, we can objectively measure.) That is, better for whom? Easier for whom? If we continue with the idea from before, that we’re maximizing information transfer, it’s better and easier for the listener. But of course, we could also define “better” and “easier” for the speaker. In general, it seems like there’d be competing pressures between forms that are better and easier for the speaker vs. forms that are better and easier for the listener. This also reminds me of some of the components of the Rational Speech Act framework, where there’s a speaker cost function that captures how good (or not) a form is for the speaker vs. a surprisal-based term that captures how informative the form is for the listener. Certainly, surprisal comes back in the measures used by PTG as well as by C&al2020.
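To make the speaker-vs.-listener tension concrete, here is a minimal RSA-style sketch. This is my own toy illustration, not C&al2020’s or PTG’s formulation: the utterances, listener probabilities, and cost values are all invented, and the utility function is just the standard "informativity minus weighted cost" form.

```python
import math

# Toy lexicon: a literal listener's P(meaning | utterance), plus a
# per-utterance production cost (e.g., length in syllables).
# All utterances, meanings, and numbers are invented for illustration.
literal_listener = {
    "bat": {"animal": 0.5, "club": 0.5},  # ambiguous, but short
    "cricket-bat": {"club": 1.0},         # unambiguous, but long
}
cost = {"bat": 1.0, "cricket-bat": 3.0}

def speaker_utility(utterance, meaning, cost_weight=0.5):
    """RSA-style speaker utility: informativity for the listener
    (log P(meaning | utterance)) minus a weighted speaker cost."""
    informativity = math.log(literal_listener[utterance].get(meaning, 1e-10))
    return informativity - cost_weight * cost[utterance]

# For the meaning "club": "bat" is cheap to produce but only half-informative;
# "cricket-bat" is fully informative but costlier to produce.
print(speaker_utility("bat", "club"))          # log 0.5 - 0.5*1.0 ≈ -1.19
print(speaker_utility("cricket-bat", "club"))  # log 1.0 - 0.5*3.0 = -1.5
```

With this (invented) cost weight, the speaker-friendly ambiguous form wins; crank the cost weight down and the listener-friendly form wins instead, which is exactly the competing-pressures picture above.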


Later on, both the description of Zipf’s Principle of Least Effort and the PTG 2012 overview make it sound like communicative efficiency is about how effortful it is for the speaker, rather than focusing on the information transfer to the listener. Which is it? Or are both meant to be considered for communicative efficiency? It seems like both ought to be, which gets us back to the idea of competing pressures…I guess one upshot of C&al2020’s findings is that we don’t have to care about this thorny issue because we can generate lexicons that look like human language lexicons without relying on communicative efficiency considerations.

(2) 2.2, language models: I was surprised by the amount of attention given to phonotactic surprisal, because I think the main issue is that a statistical model of language is needed and that requires us to make commitments about what we think the language model looks like. This should be the very same issue we see for word-based surprisal. That is, surprisal is the negative log probability of $thing (word or phonological unit), given some language model that predicts how that $thing arises based on the previous context. But it seemed like C&al2020 were less worried about this for word-based surprisal than for phonotactic surprisal, and I’m not sure why.
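To spell out why the two cases seem symmetric: word surprisal and phonotactic surprisal are computed the same way, just over different units and different language models. Here is a minimal sketch using an add-alpha-smoothed bigram model; the corpus, smoothing, and vocabulary sizes are invented for illustration (C&al2020 use more sophisticated models, and nothing here is their code).

```python
import math
from collections import Counter

def bigram_surprisal(sequence, corpus_sequences, vocab_size, alpha=1.0):
    """Surprisal of each unit in `sequence`: -log2 P(unit | previous unit),
    under an add-alpha-smoothed bigram model trained on `corpus_sequences`.
    Units can be words (word surprisal) or phones (phonotactic surprisal)."""
    bigrams, unigrams = Counter(), Counter()
    for seq in corpus_sequences:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    out = []
    padded = ["<s>"] + list(sequence)
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        out.append(-math.log2(p))
    return out

# The very same function at two granularities (toy data):
word_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
phone_corpus = [["k", "ae", "t"], ["k", "ae", "b"]]
print(bigram_surprisal(["the", "cat"], word_corpus, vocab_size=5))
print(bigram_surprisal(["k", "ae", "t"], phone_corpus, vocab_size=5))
```

The point of the sketch: in both calls, the commitments are identical in kind (what counts as context, what model assigns the probabilities), so any worry about the phonotactic language model should apply to the word-level one too.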


(3) The summary of PTG’s findings: I would have appreciated a slightly more leisurely walkthrough of PTG’s main findings -- I wasn’t quite sure I got the interpretations right as it was. Here’s what I think I understood: 


(a) homophony: negatively correlated with word length and frequency (so more homophony = shorter words and ...lower frequency words???). It’s also negatively correlated with phonotactic surprisal in 2 of 3 languages (so more homophony = lower surprisal = more frequent phonotactic sequences).


(b) polysemy: negatively correlated with word length, frequency, and phonotactic surprisal (so more polysemous words = shorter, less frequent??, and less surprising = more frequent phonotactic sequences).


(c) syllable informativity: negatively correlated with length in phones, frequency, and phonotactic surprisal (so, the more informative (= the less frequent) a syllable, the shorter it is in phones, the lower in frequency (yes, by definition), and the lower in surprisal (so the higher the syllable frequency?)).


I think C&al2020’s takeaway message from all 3 of these results was this: “Words that are shorter, more frequent, and easier to produce are more ambiguous than words that are longer, less frequent, and harder to produce”. The only thing is that I struggled a bit to get this from the specific correlations noted. But okay, if we take this at face value, then ambiguity goes hand-in-hand with being shorter, more frequent, and less phonologically surprising = all about easing things for the speaker. (So, it doesn’t seem like ambiguity and communicative efficiency are at odds with each other, if communicative efficiency is defined from the speaker’s perspective.)


(4) Implementing the semantic constraint on the phonotactic monkey model: The current implementation of meaning similarity uses an idealized version (100 x 100 two-dimensional space of real numbers), where points close to each other have more similar meanings. It seems like a natural extension of this would be to try it with actual distributed semantic representations like GloVe or RoBERTa. I guess maybe it’s unclear what additional value this adds to the general argument here -- that is, the current paper is written as “you asked for an existence proof of how lexicons like this could arise without communicative considerations; we made you one”. Yet, at the end, it does sound like C&al2020 would like the PSM model to be truly considered as a cognitively plausible model of lexicon generation (especially when tied to social networks). If so, then an updated semantic implementation might help convince people that this specific non-communicative-efficiency approach is viable, rather than merely showing that some non-communicative-efficiency approach out there will work.
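For concreteness, here is a minimal sketch of the kind of idealized meaning space described above: points in a 100 x 100 two-dimensional plane, with closeness standing in for meaning similarity. The sampling scheme and the inverse-distance similarity rule are my own guesses at one possible implementation, not C&al2020’s actual code.

```python
import math
import random

random.seed(0)  # reproducible toy example

def sample_meaning():
    """A meaning is a random point in the 100 x 100 plane."""
    return (random.uniform(0, 100), random.uniform(0, 100))

def similarity(m1, m2):
    """Closer points = more similar meanings (inverse-distance rule;
    this particular function is an invented illustration)."""
    return 1.0 / (1.0 + math.dist(m1, m2))

m_a, m_b = sample_meaning(), sample_meaning()
print(similarity(m_a, m_b))
print(similarity(m_a, m_a))  # identical meanings: maximal similarity 1.0
```

The GloVe/RoBERTa extension would amount to swapping `sample_meaning` for real word vectors and `math.dist` for a distance in that embedding space (e.g., cosine distance), leaving the rest of the model untouched.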


(5) In 5.3, C&al2020 highlight what the communicative efficiency hypothesis would predict for lexicon change. In particular:


(a) Reused forms should be more efficient than stale forms (i.e., shorter, more frequent, less surprising syllables)


(b) New forms should use more efficient phonotactics (i.e., more frequent, less surprising)


But aren’t these exactly the properties C&al2020 just showed could result from the PM and PSM models, meaning a non-communicative-efficiency approach could also produce them? Or is that the point? I initially thought C&al2020 aimed to show that these predictions aren’t unique to the communicative efficiency hypothesis. (Indeed, this is what they show in the next figure, as they note that PSM English better exploits inefficiencies in the English lexicon by leveraging phonotactically possible, but unused, short words.) I guess this is just a rhetorical strategy that I got temporarily confused by.