Friday, May 11, 2018

Some thoughts on Johnson 2017 + Perfors 2017

I love seeing connections to Marr’s levels of description, because this framework is one that I’ve found so helpful for thinking about a variety of problems I work on in language development. Related to this, it was interesting to see Johnson suggest that grammars are computational-level while comprehension and production are algorithmic-level, because comprehension and production are processes operating over these grammar structures. But couldn’t we also apply levels of description just to the grammar knowledge itself? So, for instance, computational-level descriptions provide a grammar structure (or a way to generate that structure using things like Merge), say for some utterance. Then, the algorithmic-level description describes how humans generate that utterance structure in real time with their cognitive limitations (irrespective of whether they’re comprehending, producing, or learning). Then, the implementational-level description is the neural matter that implements the language structure in real time with cognitive and wetware limitations (again, irrespective of whether someone is comprehending, producing, or learning).

Other thoughts:
(1) One major point Johnson makes: a small change at the computational level can have a big impact at the implementational level. This is basically saying that a small change in building blocks can have a big impact on what you can build, which is the idea behind parameters, especially linguistic parameters. It’s also the idea behind how the brain constructs the mind, with small neurological changes having big cognitive effects (for example, brain lesions).

But, importantly for Johnson and Perfors, implementational-level complexity may matter more for evolutionary plausibility. In particular, the systems needed to support the implementation may be quite different, and that connects to evolutionary plausibility. Because of this, arguing for or against something on the basis of its computational-level simplicity may not be useful because we don’t really know how the computational-level description gets implemented (in the neural matter, let alone the genome that constructs that neural matter). If it turns out the genes encode some kind of computational-level description, then we have a link we can exploit for discussing evolutionary plausibility. Otherwise, it’s not obvious how much evolutionary-plausibility-mileage we get out of something being simple at the computational level of description. So, the level at which simplicity is relevant for evolutionary arguments is the genetic level, since that’s the part that connects most directly to evolutionary arguments. (Though perhaps there’s also a place for “simple” to be about how easy it is to derive from cultural evolution?)

(2) From Johnson 2017: “...perhaps computational descriptions are best understood as scientific theories about cognitive systems?” While I understand where Johnson is coming from (given his focus on evolutionary explanations), I don’t think I agree with this idea of connecting “computational description” with “scientific theories”. A computational description is a description at the level of “the goals of this computation”. We can have scientific theories about that, but we can also have scientific theories about “how this computation is implemented in the wetware” (i.e., the implementational level of description).  So, to me, “level of description” is a separate thing from “scientific theory” (and usefully so).

Friday, April 27, 2018

Some thoughts on Kirby 2017 + Adger 2017 + Bowling 2017

I love seeing an articulated approach to studying language evolution from a computational perspective, and appreciate that Kirby addressed some of my concerns with this approach head on (whether or not I found the answers satisfactory). Interestingly, I’m not so bothered by the issues that bother Adger. I also quite appreciate Bowling’s point about gene-culture interactions when it comes to explaining the origins of complex “phenotypes” like language.

Other thoughts:

(1) Kirby 2017

(a) Given his focus on cultural heredity, does Kirby fall on the “language evolved to enable communication, not complex thought” side of the spectrum? He seems to want to distinguish his viewpoint from the language-for-communication side, though. “...cultural transmission once certain biological prerequisites are in place”. I guess it depends on what he thinks the biological prerequisites are? His final claim is that it’s linked to self-domestication (which yields tendencies towards signal copying and sharing), so that again seems more on the language-for-communication side.

(b) According to Kirby, language design features all seem to lead to systematicity, i.e., the property of being more compactly representable, a la language grammars. This is a pretty key component in language acquisition, where children seem biased to look for systematicity (i.e., generalizations) in the data they encounter, such as language data. This seems like it comes into play when Kirby talks about systematicity arising from pressures of language learning and language use.

Kirby also indicates that only human languages have systematicity, which makes the studies about meaningful combinations in other animal systems (e.g., some primate calls involving social hierarchies, birdsong involving the rearrangement of components) interesting as a comparison. Presumably, Kirby would say the non-human systematicity is very poor compared to human language systematicity?

(c) Iterated learning: Kirby notes that compositionality emerges over time in simulated agents when all individuals are initialized with random form-referent pairs (e.g. Brighton 2002). But what else is going on in these simulations? What external/internal pressures are there to cause any change at all? That seems important.

(d) I thought it was interesting to see the connection to poverty of the stimulus (i.e., the data being underspecified with respect to the hypothesis speakers used to generate those data). In particular, because the data are compatible with multiple hypotheses, learners can land on a hypothesis that’s different from what the speaker used to generate those data and which probably aligns better with the learner’s already-existing internal biases. Then, this learner grows up and becomes a speaker generating data for the next generation, now using the hypothesis that’s better aligned with learner biases to actually generate the data new learners observe. So, ambiguity in the input signal allows the emergence of language structure that’s easy to pick up precisely because it aligns with learner biases. And that’s why little kids end up being so darned good at learning the same language structure from ambiguous data.

(e) I laughed just a little at the reference to Gold (1967) as having anything to say about human language acquisition. If there’s anything I learned from Johnson (2004) -- a phenomenal paper -- it’s that whenever you cite the computational learnability results of Gold (1967) as having anything to do with human language acquisition, you’re almost invariably wrong.

Johnson, K. (2004). What does Gold's Theorem show about language acquisition? In Proceedings from the Annual Meeting of the Chicago Linguistic Society (Vol. 40, No. 2, pp. 261-277).

(f) While I appreciate the walk-through of Bayesian reasoning with respect to the likelihood and prior, I cringed just a little at equating the prior with the learner’s biology. It all depends on how you set the model up (and what kind of learning you’re modeling). All “prior” means is prior -- it was in place before the current learning started. That may be because it was there innately (courtesy of biology) or because the learner derived it from prior experience. That said, hats off to Kirby for promoting the Bayesian modeling approach as an example of modeling that you can easily interpret theoretically. I couldn’t agree more with that part.
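The “all prior means is prior” point can be made concrete with a toy conjugate update (my own illustrative sketch, not anything from Kirby’s models): a prior stipulated directly and a prior derived from earlier experience can be formally identical, and learning from the current data proceeds the same way from either.

```python
def beta_update(alpha, beta, successes, failures):
    """Conjugate Beta-Bernoulli update: a Beta(a, b) prior plus observed
    counts yields a Beta(a + successes, b + failures) posterior."""
    return alpha + successes, beta + failures

# An 'innate' prior and a prior derived from earlier experience can be
# formally identical -- the math doesn't care where Beta(3, 1) came from.
innate = (3, 1)                    # stipulated directly (courtesy of biology)
derived = beta_update(1, 1, 2, 0)  # flat start + two earlier observations
assert innate == derived

# Either way, learning from the current data proceeds identically:
print(beta_update(*innate, 5, 3))  # (8, 4)
```

In other words, the model itself is agnostic about whether the prior is biological or experiential, which is exactly why equating “prior” with “biology” is an extra interpretive step.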

(g) In terms of interpreting Figure 3, I definitely understand the main point that the size of the prior bias towards regularity (aaaaa languages) doesn’t seem to affect the results at all. But it looks like between 15 and 22% of all languages at the end of learning are this type of language, with near 0% distributed across the other 4 options (aaab, aabb, etc.). Where did the other 78-85% go? Maybe the x-axis language instances are samples from the entire population of languages ((a-e)^5), and so the remaining 78-85% is distributed in teeny tiny amounts across all these possibilities? So, therefore, the aaaaa language is the one with a relative majority?

(h) I really like the point that human languages may take on certain properties not because they’re hard-coded into humans, but because humans have a tendency (no matter how slight/weak) towards them. Any little push can make a difference when you have a bunch of ambiguous data to learn from. (This hooks back into the Poverty of the Stimulus ideas from before, and how that contributes to the emergence of languages that can be learned by children from the data they encounter.) That said, I got a little lost with this statement: “the language faculty may contain domain-specific constraints only if they are weak, and strong constraints only if they are domain general”. I understand about constraints being weak, whether they’re about language (domain-specific) or cognition generally. But where does the distinction between domain-specific vs. domain-general come from, and where do strong constraints come from at all, based on these results? Maybe this has to do with the domain-general simplicity bias Kirby comes back to at the end, which he makes a case for as a strong innate bias that does a lot of work for us?
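The “any little push” idea can itself be sketched in a few lines (a toy model of my own, not Kirby’s actual simulations): each learner estimates the rate of a “regular” variant from a bottlenecked sample of the previous speaker’s output, using a Beta prior that is only very slightly asymmetric, and then becomes the speaker for the next generation.

```python
import random

random.seed(1)

def transmit(p, bottleneck=10, alpha=1.2, beta=1.0):
    """One generation: sample a bottlenecked set of utterances from the
    previous speaker, then re-estimate the regular variant's rate with a
    Beta prior that is only very slightly asymmetric (alpha > beta)."""
    regular = sum(random.random() < p for _ in range(bottleneck))
    return (regular + alpha) / (bottleneck + alpha + beta)

p = 0.5  # start with regular and irregular variants equally common
for generation in range(200):
    p = transmit(p)
print(round(p, 2))
```

Averaged over many chains, the population-level outcome ends up tilted toward the regular variant even though the per-learner bias (alpha = 1.2 vs. beta = 1.0) is tiny, which is the amplification-through-transmission intuition.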

(i) In terms of iterated learning in the laboratory, I’m always a little skeptical about what to conclude from these studies. Certainly, we can see the amplification of slight biases in the case of transmission/learning bottlenecks. But in terms of where those biases come from, if they’re representative of pre-verbal primate biases...I’m not sure how we tell. Especially when we consider artificial language learning, we have to consider what language-specific and domain-general biases adult humans have developed by already knowing a human language. For example, does compositionality emerge in the lab because it has to with those conditions or because the humans involved already had compositionality in their hypothesis space because of their experience with their native languages? To be fair, Kirby explicitly acknowledges this issue, and then sets up a simulation with a simplicity bias built in that’s capable of generating the behavioral results from humans. But of course, the simplicity bias is expressed with respect to certain structural preferences (concise context-free transducers). How different from compositionality is this structural preference? This comes up again for me when Kirby notes all the different ways a “simplicity” bias can be cashed out linguistically. So, simple with respect to what seems to matter -- that is, how the learner knows to define simplicity a particular way.

What I find more convincing is the comparison with recently-created signed languages like NSL, in terms of the systematicity that emerges. It seems that whatever cognitive bias is at play for the cohorts of NSLers might also be at play in the experimental participants learning the mini-signed languages. Therefore, you can at least get the same outcome from both people who don’t already know a language (NSLers) and people who do (experimental participants).

(2) Adger 2017

(a) I do think it’s very fair for Adger to note the other sociocultural factors that affect transmission of language through a population, such as invasion, intermarriage, etc. This comes back to how all models idealize, and the real question is whether they’ve idealized something important away.

(b) Also, Adger makes a fair point about how “cultural” in Kirby’s terms is really more about transmission pressures in a population that can be formally modeled, rather than what a sociologist might naturally refer to as culture and/or cultural transmission.

(c) I’m not sure I agree that the NSLers showing the emergence of regularity so quickly is an argument against iterated learning. It seems to me that this is indeed a valid case of the transmission & learning bottleneck at the heart of Kirby’s iterated learning approach. The fact that it occurs over years in the NSLers instead of over generations doesn’t really matter, I don’t think.

(d) Adger notes that rapid emergence of certain structures involves “specific cognitive structures” that must also be present. I don’t see this as being incompatible with Kirby’s suggestion of a domain-general simplicity bias. That’s a specific cognitive structure, after all, assuming you count biases as structures.

(e) Adger also brings up certain examples of language not being very simple in an obvious way in order to argue against Kirby’s simplicity bias. But to me, the question is always simple with respect to what? Maybe the seemingly convoluted language structure we see is in fact a simple solution given the other constraints at work (available building blocks like a preference for hierarchical structure, frequent data in the input, etc.). It’s also not obvious to me that seeing hierarchical structure appear over and over again is incompatible with Kirby’s proposal for weak biases leading to highly prevalent patterns. That is, why couldn’t a slight bias for hierarchical structure make the hierarchical structure version the simplest answer to a variety of language structure problems?

(3) Bowling 2017

(a) I really like Bowling’s point that there are bidirectional effects between DNA and the environment (e.g., culture). For me, this makes a nice link with the environmental/cultural factors of transmission and learning by individuals that Kirby’s approach highlights and the biological underpinnings of those abilities. For example, could the evolution of language have reinforced a biologically-based bias for simplicity? That is, could the iterated learning process have made individuals with a predisposition for simplicity more prevalent in the human population? That doesn’t seem far-fetched to me.

(b) “...even though Kirby’s Bayesian models falsely separate genes from learning” - This doesn’t seem like a fair characterization to me. All that Bayesian models do is separate out what was there before you started learning from what you’re currently learning. They don’t specify where the previous stuff came from (i.e., genes vs. environment vs. environment+genes, etc.).

Tuesday, March 13, 2018

Some thoughts on Freudenthal et al. 2016

I think it’s always nice to see someone translate a computational-level approach to an algorithmic-level approach. The other main attempt I’ve seen for syntactic categorization is Wang & Mintz (2008) for frequent frames.

Wang, H., & Mintz, T. (2008). A dynamic learning model for categorizing words using frames. BUCLD 32 Proceedings, 525-536.

Here, F&al2016 are embedding a categorization strategy in an online item-based approach to learning word order patterns, and evaluating it against qualitative patterns of observed child knowledge (early noun-ish category knowledge and later verb-ish category knowledge).

An important takeaway seems to be making a qualitative distinction between preceding vs. following context. Interestingly, this is the essence of a frame as well.

Specific comments:

(1) Types vs tokens: It’s interesting to see F&al2016 get mileage by ignoring token frequency. This is a tendency that seems to show up in a variety of learning strategies (e.g., Tolerance Principle decisions about whether to generalize are based on consideration of types rather than tokens, which itself is tied to considerations of memory storage and retrieval: Yang 2005).

Yang, C. (2005). On productivity. Linguistic variation yearbook, 5(1), 265-302.

In the intro, F&al2016 note that their motivation is one of computational cost — they say it’s less work to collect just the word, rather than keep track of both the word and its frequency. I wonder how much of an additional burden that is though. It doesn’t seem like all that much work, and don’t we already track frequencies of so many things anyway?

Also, in the simulation section, F&al2016 say “MOSAIC does not represent duplicate utterances” -- so does this mean MOSAIC already has a type bias built into it? (In this case, at the utterance level.)

(2) The MOSAIC model: I love all the considerations of developmental plausibility this model encodes, which is why it’s so striking that they use orthographically transcribed speech as input. Usually this is verboten for models of early language acquisition (e.g., speech segmentation), because orthographic and phonetic words aren’t the same thing. But here, this comes back to an underlying assumption about the initial knowledge state of the learner they model. In particular, this learner has already learned how to segment speech in an adult-like way. This isn’t a crazy assumption for 12-month-olds, but it’s also a little idealized, given what we know about the persistence of segmentation errors. Still, this assumption is no different from what previous syntactic categorization studies have assumed. What makes it stand out here is the (laudable) focus on developmental plausibility. Future work might explore how robust this learning strategy is to segmentation errors in the input.

(3) Distributed representations: The Redington et al. categorization approach that uses context vectors reminds me strongly of current distributed representations of word meaning (i.e., word embeddings: word2vec, GloVe). Of course, the word embedding approaches aren’t a transparent translation of words into their counts, but the underlying intuition feels similar.

(4) Developmental linking: The explanation F&al2016 offer for why nouns emerge early as a category relies on the structure of English utterances, coupled with the utterance-final bias of MOSAIC. Does this mean languages with verbs in final position should have children developing knowledge of the verb category earlier (e.g., Japanese)? If so, I wonder if we see any evidence of this from behavioral or computational work.

(5) Evaluation metrics: I want to make sure I understand the categorization evaluation metric. The model’s classification of a cluster was compared against “the (most common) grammatical class assigned to each word”, but there was also a pairwise metric used that doesn’t actually need to take into account what the cluster’s class is for precision. That is, if you’re using pairwise precision (accuracy) and recall (completeness), you just get all the pairs of words from your cluster and figure out how many are actually truly in the same category -- whatever that category is -- and that’s the number in the numerator. The number in the denominator depends on whether you’re comparing against all the pairs in that cluster (precision) or all the pairs in the true adult category (recall).  So, there’s only a need to decide what an individual cluster’s category is (noun or verb or something else entirely) when you're doing the recall part.
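Here’s my understanding of the pairwise metric as a toy sketch (the variable names and the majority-vote step for picking a cluster’s category are my own assumptions, not necessarily F&al2016’s exact procedure): precision needs no category decision, but recall does.

```python
from itertools import combinations

def pairwise_eval(cluster, gold):
    """Pairwise precision (accuracy) and recall (completeness) for one
    cluster. `gold` maps each word to its true category. Precision needs
    no decision about the cluster's category; recall does (here via
    majority vote, an assumption on my part)."""
    pairs = list(combinations(sorted(set(cluster)), 2))
    same = [(a, b) for a, b in pairs if gold[a] == gold[b]]
    precision = len(same) / len(pairs)

    # Decide the cluster's category, then count all pairs in that
    # gold category for the recall denominator.
    majority = max({gold[w] for w in cluster},
                   key=lambda c: sum(gold[w] == c for w in cluster))
    in_category = [w for w in gold if gold[w] == majority]
    total = len(list(combinations(in_category, 2)))
    found = [(a, b) for a, b in same if gold[a] == majority]
    return precision, len(found) / total

gold = {'dog': 'N', 'cat': 'N', 'ball': 'N', 'run': 'V', 'eat': 'V'}
print(pairwise_eval(['dog', 'cat', 'run'], gold))
```

For this hypothetical cluster, 1 of its 3 pairs is truly same-category (precision 1/3), and it recovers 1 of the 3 noun-noun pairs in the gold category (recall 1/3).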

(6) Model interpretation: In order to understand F&al2016’s concern with the number of links over time (in particular, the problem with there being more links earlier on than later on), it probably would have helped to know more about what those links refer to. I think they’re related to how utterances are generated, with progressively longer versions of an utterance linked word by word. But then, how does that relate to syntactic categorization? A little later, F&al2016 mention these links as something that link nouns together vs links verbs together, which would then make sense from a syntactic categorization perspective. But this is then different from the original MOSAIC links. Maybe links are what happens when the Redington et al. analysis is done over the progressively longer utterances provided by MOSAIC? So it’s just another way of saying “these words are clustered together based on the clustering threshold defined”.

(7) Free parameters: It’s interesting that they had to change the thresholds for Table 1 vs Table 2. The footnote explains this by saying this allows “a meaningful overall comparison in terms of accuracy and completeness”. But why wouldn’t the original thresholds suffice for that? Maybe this has something to do with the qualitative properties you’re looking for from a threshold? (For instance, the original “frequency” threshold for frequent frames was motivated partly by frames that were salient “enough” to the child. I’m not sure what you’d be looking for in a threshold for this Redington et al. analysis, though. Some sort of similarity saliency?)

Relatedly, where did the Jaccard distance threshold of 0.2 used in Table 3 come from? (Or perhaps, why is a Jaccard threshold of 0.2 equivalent to a rank order threshold of 0.45?)
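For concreteness, here’s Jaccard distance over context sets (the context sets are made up for illustration; presumably the 0.2-vs-0.45 equivalence itself has to be established empirically rather than analytically):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two context sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

# Hypothetical preceding-context sets for two words:
dog_contexts = {'the', 'a', 'my', 'that'}
cat_contexts = {'the', 'a', 'your'}
print(jaccard_distance(dog_contexts, cat_contexts))  # 1 - 2/5 = 0.6
```

So a threshold of 0.2 is quite strict: two words cluster together only if their context sets overlap on at least 80% of their union.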

(8) Noun richness analysis: This kind of incremental approach to what words are in a noun category vs. a verb category seems like an interesting hypothesis for what the non-adult noun and verb categories ought to look like. I’d love to test them against child production data from these same corpora using a Yang-style productivity analysis (e.g., Yang 2011).

Yang, C. (2011). A statistical test for grammar. In Proceedings of the 2nd workshop on Cognitive Modeling and Computational Linguistics (pp. 30-38). Association for Computational Linguistics.

Friday, March 2, 2018

Some thoughts on Hochstein et al. 2017

As a cognitive modeler, I love having this kind of theoretically-motivated empirical data to think about. Here, I wonder if we can unpack different possible causes of the ASD children’s behavior using something like the RSA model. We have distinct patterns of behavior to account for with details on the exact experimental context, and a really interesting separation of two steps involved in appropriately using scalar implicatures (where it seems like the ASD kids fail to cancel the implicature when they should).

Other thoughts:

(1) After reading the introduction and the difference between the ignorance implicature and the epistemic step, I now have a renewed appreciation for symbolic representation. In particular, the text descriptions of each of these made my head spin for awhile, while the symbolic representation was immediately comprehensible (and then I later worked out my own text description). My take: ignorance implicature: not(believe(p)) = “I don’t know if p is true”; epistemic step: believe(not(p)) = “I know p specifically is not true (as opposed to other things I might believe about p or whatever else)”.
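The symbolic version can even be run: treating a belief state as the set of worlds the speaker still considers possible (a standard possible-worlds toy, my own illustration rather than H&al2017’s formalism), not(believe(p)) and believe(not(p)) come apart cleanly.

```python
# A belief state is the set of worlds the speaker still considers
# possible; believing a proposition means it holds in all of them.
worlds = {'w1', 'w2', 'w3'}
p = {'w1'}            # worlds where the proposition p is true
not_p = worlds - p

def believes(state, prop):
    return state <= prop  # prop holds in every world still considered

uncertain = {'w1', 'w2'}    # the speaker hasn't ruled p in or out
knows_not_p = {'w2', 'w3'}  # every live world is a not-p world

# Ignorance implicature, not(believe(p)): the speaker also doesn't
# believe not-p -- they just don't know.
assert not believes(uncertain, p) and not believes(uncertain, not_p)

# Epistemic step, believe(not(p)): strictly stronger.
assert believes(knows_not_p, not_p) and not believes(knows_not_p, p)
```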

(2) The basic issue with prior experimental work that H&al2017 highlight is that the Truth-Value Judgment Task (TVJT) is not the normal language comprehension process. This is because normal language comprehension involves you inferring the world from the utterance expressed. In the TVJT, in contrast, you’re given the world and asked if you would say a particular utterance - which is why RSA models capturing the TVJT cast it as an utterance endorsement process instead. But this highlights how important naturalistic conversational usage may be for getting at knowledge in populations where accessing that knowledge may be more fragile (like kids). The Partial Knowledge Task of H&al2017 is an example of this, where we see something like a naturalistic task in which participants have to use their implicit calculation (or not) of the implicature to make a judgment about the state of the world.

(3) Interestingly, something like the partial knowledge task setup has already been implemented in the RSA framework by Goodman & Stuhlmueller 2013, and addresses neurotypical adult behavior about when implicatures are and (importantly) aren’t computed, depending on speaker knowledge. Notably, this is where we see an ASD difference in the H&al2017 studies — ASD kids don’t seem to use their ignorance implicature computation abilities here, and instead go ahead with the scalar implicature calculation.

I wonder how the H&al2017 behavior patterns play out in an RSA model. Would it have something to do with the recursive reasoning component if ASD kids don’t care about speaker knowledge? Or is there a way to keep the recursive social reasoning, but somehow skew probabilities to get this response behavior? (Especially since ASD Theory of Mind ability didn’t correlate with this response behavior.)
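To make that concrete, here’s a minimal vanilla RSA model for the some/all scale (a generic textbook-style sketch, not H&al2017’s or Goodman & Stuhlmueller’s actual model; the state space and rationality parameter are my own choices). One way to probe the ASD pattern would be to see which modification — removing the recursive speaker step, or skewing its probabilities — best reproduces a listener who keeps the implicature when they shouldn’t.

```python
import math

# World states: how many of 3 objects have the property.
states = [0, 1, 2, 3]
utterances = {'none': {0}, 'some': {1, 2, 3}, 'all': {3}}

def literal_listener(u):
    true_states = utterances[u]
    return {s: (1 / len(true_states) if s in true_states else 0.0)
            for s in states}

def speaker(s, rationality=4.0):
    # The recursive step: the speaker soft-maximizes how accurately the
    # literal listener would recover the true state.
    scores = {u: (math.exp(rationality * literal_listener(u)[s])
                  if s in utterances[u] else 0.0)
              for u in utterances}
    z = sum(scores.values())
    return {u: v / z for u, v in scores.items()}

def pragmatic_listener(u):
    scores = {s: speaker(s)[u] for s in states}  # uniform prior over states
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

L1 = pragmatic_listener('some')
# The implicature emerges: hearing 'some', the 'all' state (3) gets much
# lower probability than states 1 and 2.
print(round(L1[3], 3))
assert L1[3] < L1[1] and L1[3] < L1[2]
```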

Friday, February 9, 2018

Some thoughts on Tanaka-Ishii 2017

It’s really interesting to see someone coming at language development from a very different perspective (here: statistical physics). Different terminology means different ways of talking about the same ideas — and this highlighted for me how comfortable I’ve become with my own terminology, and how foreign it can seem when someone uses different terminology (see comments on long-range dependency below).

Specific thoughts:
(1) Implications for language development
(a) I don’t find it all that surprising that early child productions have these long-range correlation properties. This may be because of my naive understanding of power-law relationships, but basically, power-law relationships aren’t a language-specific thing, so why shouldn’t they appear in early child productions too? It made me smile, though, to see this author then use the existence of long-range correlation as an argument for an “innate mechanism of the human language faculty”. I didn’t really see that thought cashed out later though, and maybe that’s for the best.

(b) In the discussion section, the author says “This would require more exhaustive knowledge of long-range memory in natural language, and the model would have to integrate more complex schemes that possibly introduce n-grams or grammar models.” — This made me smile, too. You mean we might need syntactic structure to explain language development? Couldn’t be.

(2) Equation 1, which is correlation at a distance s: I think it’s worth thinking about the intuition of this. It captures the similarity of two subsequences that are s apart, with respect to their deviation from the mean value. Interpretation for word frequency: a word with the same frequency (which differs from the mean by some amount) appears s words away. So long-range correlation means a power-law relationship with respect to this correlation. That is, it’s a power law in time for word usage by frequency, not just in overall frequency irrespective of time.
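Here’s my reading of Equation 1 as code (a generic autocorrelation sketch of my own: Pearson correlation between the sequence and its shift by s; long-range correlation then means c(s) falls off as a power law in s rather than exponentially):

```python
def correlation_at_distance(x, s):
    """Pearson correlation between the sequence and itself shifted by s:
    how similarly the two subsequences deviate from their means."""
    n = len(x) - s
    a, b = x[:n], x[s:]
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n
    var_a = sum((u - ma) ** 2 for u in a) / n
    var_b = sum((v - mb) ** 2 for v in b) / n
    return cov / (var_a * var_b) ** 0.5

# A perfectly periodic 'word frequency' sequence is maximally correlated
# at its period:
freqs = [1, 5, 2, 1, 5, 2, 1, 5, 2, 1, 5, 2]
print(round(correlation_at_distance(freqs, 3), 3))  # 1.0
```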

(3) Working with kid data
(a) The author talks about the analysis of one child’s utterances and how things are still under development, but the analysis is effectively over word use in sequences, so it’s not clear how complex the syntactic and semantic knowledge needs to be for this to occur. That is, it’s not surprising that a swath of data from ages two through five shows this relationship. More interesting would have been this analysis at two vs. three vs. four. Later in the paper, the author says “In early childhood speech, utterances are still lacking in full vocabulary, ungrammatical, and full of mistakes. Therefore, the long-range correlation of such speech must be based on a simple mechanism other than linguistic features such as grammar that we generally consider.” - This comes back to assumptions about what knowledge develops at what age. “Ungrammatical” isn’t very accurate, especially when we’re talking about four- and five-year-olds.

(b) I love seeing the author leverage cross-linguistic data, but how old were these kids? Age matters a bunch. And how many words were in these datasets?

(4) Understanding the different generative models
(a) The Simon model is described as “the rich get richer”, which seems like the intuition for the Chinese Restaurant Process (CRP). I definitely understand that this is uniform sampling from previous elements (in time, this means sampling from the past), plus a little probability for a new element. Except then the Pitman-Yor process can reduce to a CRP when its discount parameter a is 0, and Pitman-Yor is meant to be different from Simon. Based on Figure 10, there’s clearly a major difference (the autocorrelation isn’t there for Pitman-Yor), but the intuition of what’s different is hard to grasp.

(b) I’m not sure I understand the issue described here for the Simon model: “the vocabulary growth (proven to have exponent 1.0) is too fast”. Isn’t the Simon model meant to be about sequences in time? Or is the author referring to people who have tried to match child vocabulary development to a Simon model? Or maybe this refers to the left panel of the figures where sometimes we see divergence from a strict Zipf’s law?
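One way to see the intuition behind both (a) and (b), sketched as toy simulations of my own: both models are rich-get-richer copying, but the Simon model introduces new words at a constant rate, while the CRP’s innovation probability decays as the sequence grows. So Simon’s vocabulary grows linearly in sequence length (which is, I take it, the “exponent 1.0” that’s “too fast” for real vocabularies), while the CRP’s grows only logarithmically.

```python
import random

random.seed(0)

def simon_sequence(n, new_prob=0.1):
    """Simon model: a new word with constant probability new_prob;
    otherwise copy a uniform sample from the past sequence, so frequent
    words get copied more often ('the rich get richer')."""
    seq, next_word = [0], 1
    for _ in range(n - 1):
        if random.random() < new_prob:
            seq.append(next_word)
            next_word += 1
        else:
            seq.append(random.choice(seq))
    return seq

def crp_sequence(n, theta=1.0):
    """Chinese Restaurant Process: also rich-get-richer copying, but the
    innovation probability theta/(i + theta) decays as i grows."""
    seq, next_word = [0], 1
    for i in range(1, n):
        if random.random() < theta / (i + theta):
            seq.append(next_word)
            next_word += 1
        else:
            seq.append(random.choice(seq))
    return seq

# Vocabulary growth: roughly linear in n for Simon (the 'exponent 1.0'),
# roughly logarithmic in n for the CRP.
print(len(set(simon_sequence(5000))), len(set(crp_sequence(5000))))
```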

Friday, January 26, 2018

Some thoughts on Dye et al. 2017

I really like the clean layout of this approach and its mathematical predictions, even if I sometimes had to re-read some of the pieces to make sense of them. (I suspect this may be due to the length limitations.) In particular, this strikes me as an example of good corpus work motivated by interesting theoretical questions, and which makes sure to connect the results to the bigger picture of language change, language acquisition (both first and second), and language use.

More specific thoughts:

(1) The abstract mentions smoothing information over discourse to make nouns “more equally predictable in context”— I had some trouble figuring out what this meant. It shows up again in the section discussing grammatical gender, in the context of uncertainty over an utterance. My best guess is that this means at any point along the utterance, we can predict what noun is coming with probabilities that are more uniform?

Possibly related: “While the average uncertainty following the determiners was similar across languages, German determiners supported much greater entropy reduction than their English equivalent.” — So does this mean the range of entropy reduction was greater for German, even if on average it all washed out compared to English? And if it does wash out on average, then is the German gender system on determiners helping communicative efficiency in general, compared to English? This seems like it’s related to this comment that occurs a bit later: “However, whereas German provided a substantial entropy offset, English provided none at all.” What does an entropy offset refer to?

Also related: Trying to understand what’s going on in Figure 1. Are the two blue lines English vs. German? If so, where are the 10.17 and 10.55 coming from? They seem like they refer to different noun frequencies (based on the y axis). Is the idea that the y axis shows how many more nouns could be used with a certain entropy? If so, then the way to read this is that the three dotted lines come from another calculation, but we see the nouniness of their effect on the y axis. And then, the way to interpret that is that more entropy yields more nouns….so decreasing entropy means more predictable which means fewer nouns…..which means lower lexical diversity? Or does having fewer nouns possible mean you get to be more precise, and so when you sum over all contexts, you get more lexical diversity? I think that’s what this comment indicates: “German speakers appear to use the entropy reduction provided by noun class to choose nouns that are more specific, resulting in greater nominal diversity.”
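Here’s a toy version of what I think “entropy reduction” and “entropy offset” mean (invented nouns and uniform probabilities, purely for illustration, not Dye et al.’s actual measures): a gendered determiner rules out the nouns of the other classes, so uncertainty about the upcoming noun drops by more after German-like “der” than after English “the”.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Six equiprobable nouns. An English-like determiner ('the') is
# compatible with all six; German-like gendered determiners each pick
# out a two-noun class.
nouns = {n: 1/6 for n in ['Hund', 'Tisch', 'Katze', 'Tür', 'Auto', 'Haus']}
gender_class = {'der': ['Hund', 'Tisch'], 'die': ['Katze', 'Tür'],
                'das': ['Auto', 'Haus']}

h_before = entropy(nouns)                            # log2(6) ≈ 2.585 bits
h_after_the = entropy(nouns)                         # 'the' rules nothing out
h_after_der = entropy({n: 1/2 for n in gender_class['der']})  # 1 bit

print(round(h_before - h_after_der, 3))  # the determiner's entropy offset
```

On this toy reading, English provides an offset of 0 bits while the gendered determiner provides log2(3) ≈ 1.585 bits, which would also fit the idea that German speakers can afford more specific (lower-frequency) nouns after the determiner has already narrowed the field.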

(2) I’m a little surprised by the claim that it’s mostly adult speakers who innovate — all the first language acquisition work I’ve seen would suggest the bottleneck of L1 acquisition is a non-trivial cause of change (which I’m equating to how “innovation” is used here.) This may be a bias on my part because of my own work on Old English to Middle English word order change, with the idea it was caused in no small part by selective filters on first language learning: Pearl & Weinberg 2007.

Pearl, L., & Weinberg, A. (2007). Input filtering in syntactic acquisition: Answers from language change modeling. Language Learning and Development, 3(1), 43-72.

(3) Following up on the idea that adjectives in English reduce noun entropy: can we then get adjective ordering out of this (and link it to perceived subjectivity, à la Scontras et al. 2017)? In particular, is the more subjective adjective, which is farther from the noun, “more discriminative” or “less definite”? (“Less definite” seems to be in the same vein.) Is perceived subjectivity somehow tied to frequency?

Scontras, G., Degen, J., & Goodman, N. D. (2017). Subjectivity predicts adjective ordering preferences. Open Mind, 1, 53-65.

Monday, December 4, 2017

Some thoughts on Perkins et al. 2017

I really enjoy seeing Bayesian models like this because it’s so clear exactly what’s built in and how. In this particular model, a couple of things struck me: 

(1) This learner needs to have prior (innate? definitely linguistic) knowledge that there are three classes of verbs with different properties. That actually goes a bit beyond just saying a verb has some probability of taking a direct object, which I think is pretty uncontroversial.

(2) The learner only has to know that its parsing is fallible, which causes errors — but notably the learner doesn’t need to know the error rate(s) beforehand. So, as P&al2017 note in their discussion, this means less specific knowledge about the filter has to be built in a priori.

Other thoughts:
(1) Thinking some about the initial stage of learning P&al2017 describe in section 2: So, this learner isn’t supposed to yet know that a wh-word can connect to the object of the verb. It’s true that knowing that specific knowledge is hard without already knowing which verbs are transitive (as P&al2017 point out). But does the learner know anything about wh-words looking for connections to things later in the utterance? For example, I’m thinking that maybe the learner encounters other wh-words that are clearly connected to the subject or object of a preposition: “Who ate a sandwich?” “Who did Amy throw a frisbee to?”. In those cases, it’s not a question of verb subcategorization - the wh-word is connecting to/standing in for something later on in the utterance. 

If the learner does know wh-words are searching for something to connect to later in the utterance, due to experience with non-object wh-words, then maybe a wh-word that connects to the object of a verb isn’t so mysterious (e.g., “What did John eat?”). That is, because the child knows wh-words connect to something else and there’s already a subject present, that leaves the object. Then, non-basic wh-questions actually can be parsed correctly and don’t have to be filtered out. They in fact are signals of a verb’s transitivity.

Maybe P&al2017’s idea is that this wh-awareness is a later stage of development. But I do wonder how early this more basic wh-words-indicate-a-connection knowledge is available.

(2) Thinking about the second part of the filter, involving delta (which is the chance of getting a spurious direct object due to a parsing error): I would have thought that this depended on which verb it was. Maybe it would help to think of a specific parsing error that would yield a spurious direct object. From section 5.1, we get this concrete example: “wait a minute”, with “a minute” parsed as a direct object. It does seem like it should depend on whether the verb is likely to have a direct object there to begin with, rather than a general direct object hallucination parsing error. I could imagine that spurious direct objects are more likely to occur for intransitive verbs, for instance.

I get that parsing error propensity (epsilon) doesn’t depend on verb, though.
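To check my own understanding of the epsilon/delta setup, here’s a toy generative sketch (my reconstruction, with made-up parameter values and class probabilities, not P&al2017’s actual model) of how a verb-independent error rate plus a spurious-object rate would shape the observed direct-object rates per verb class:

```python
import random

random.seed(0)

# Toy reconstruction of the filter idea: each verb class has a true
# probability of taking a direct object; the parser errs with rate epsilon
# (verb-independent), and an erroneous parse yields a spurious direct
# object with probability delta. All numbers here are invented.
TRUE_P_OBJECT = {"transitive": 1.0, "alternating": 0.6, "intransitive": 0.0}
EPSILON = 0.1   # chance a given parse is an error
DELTA = 0.5     # chance an erroneous parse shows a spurious direct object

def observed_object_rate(verb_class, n=100_000):
    """Simulate the proportion of parses showing a direct object."""
    hits = 0
    for _ in range(n):
        if random.random() < EPSILON:          # parse error
            hits += random.random() < DELTA    # spurious object (or not)
        else:                                  # veridical parse
            hits += random.random() < TRUE_P_OBJECT[verb_class]
    return hits / n

for cls in TRUE_P_OBJECT:
    print(cls, round(observed_object_rate(cls), 3))
```

Note that even the intransitive class shows a nonzero direct-object rate here, purely from epsilon times delta — which is exactly the kind of noise the learner’s filter has to explain away. My worry above amounts to saying delta itself might vary by verb, rather than being the single constant assumed in this sketch.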

(3) Thinking about the model’s target state: P&al2017 base this on adult classes from Levin (1993), but I wonder if it might be fairer to adjust that based on the actual child-directed speech usage (e.g., what’s in Table 2). For example, if “jump” was only ever used intransitively in this input sample, is it a fair target state to say it should be alternating? 

I guess this comes down to the general problem of defining the target state for models of early language learning. Here, what you’d ideally like is an output set of verb classes that corresponds to those of a very young child (say, a year old). That, of course, is hard to get. Alternatively, maybe what you want is some sort of downstream evaluation where you see if a model using the inferred knowledge representation can perform the way young children are attested to perform in some other task.

For example, one of the behaviors of this model, as noted in section 5.1, is that it assigns lots of alternating verbs to be either transitive or intransitive. It would be great to test this behaviorally with kids of the appropriate age to see if they also have these same mis-assignments.

(4) Related to the above about the overregularization tendencies: I love the idea that P&al2017 suggest in the discussion about this style of assumption (i.e.,“the parser will make errors but I don’t know how often”). They note that it could be useful for modeling cases of child overregularization. We certainly have a ton of data where children seem more deterministic than adults in the presence of noisy data. It’d be great to try to capture some of those known behavioral differences with a model like this.

Monday, November 20, 2017

Some thoughts on Stevens et al. 2017

It’s really nice to see an RSA model engaging with pretty technical aspects of linguistic theory, as S&al2017 do here. In these kinds of problems, there tend to be a lot of links to follow in the chain of reasoning, and it’s definitely not easy to adequately communicate them in such a limited space. (Side note: I forget how disorienting it can be to not know specific linguistics terms until I try to read them all at once in an abstract without a concrete example. This is a good reminder to those of us who work in more technical areas: Make sure to have concrete examples handy. The same thing is true for walking through the empirical details with the prosodic realizations as S&al2017 have here —  I found the concrete examples super-helpful.)

Specific thoughts:

(1) For S&al2017, “information structure” = inferring the QUD probabilistically from prosodic cues?

(2) I think the technical linguistic material is worth going over, as it connects to the RSA model. For instance, I’m struggling a bit to understand the QUD implications of having incomplete vs. complete answers, especially as it relates to a QUD’s compatibility with a given melody.

For example, when we hear “Masha didn’t run QUICKLY”, the QUD is something like “How did Masha run?”. That’s an example of an incomplete answer. What’s a complete answer version of this scenario, and how does this impact the QUD? Once I get this, then I think it makes complete sense to use the utility function defined in equation (10). 

(3) I was struck by S&al2017’s notational trick, where they get out of the recursive social reasoning loop of literal listener to speaker to pragmatic listener. Here, it’s utility function to speaker to hearer because they’re presumably trying to deemphasize the social reasoning aspect? Or they just thought it made more sense described this way?

(4) About those results:
Figure 2: It’s nice to see modelers investigating the effect of the rationality (softmax) parameter in the speaker function. From the look of Figure 2, speakers need to be pretty darned rational indeed (really exaggerate endpoint behavior) in order to get any separation in commitment certainty predictions. 

Thinking about this intuitively, we should expect the LH Name condition (MASHA didn’t run quickly) to continue to be ambivalent about commitment to Masha running at all. That definitely shows up. I think. (Actually, I wonder if it might have been more helpful to ask participants to rate things on a scale from 1 (No, certainly not) to 7 (Yes, certainly so). That seems like it would make a 4 score easier to interpret (4 = maybe yes, maybe no).) Here, I’m a little unsure how participants were interpreting the middle of the scale. I would have thought “No, not certain” would be the “maybe yes, maybe no” option, and so we would expect scores of 1. This is something of an issue when we come to the quantitative fit of the model results to the experimental results. Is the behavioral difference shallow just because of the way humans were asked to give their answers? The way the model probability is calculated in (16) suggests that the model is operating more under the 1 = “no, certainly not” version (if I’m interpreting it correctly — you have the “certainly yes” option contrasted with the “certainly not” option).

Clearly, however, we see a shift up in human responses in Figure 3 for the LH Adverb condition (Masha didn’t run QUICKLY), which does accord with my intuitions. And we get that shift from the model in Figure 2, as long as the rationality parameter is turned way up. (Side note: I’m a little unclear about how to interpret the rationality parameter, though. We always hedge about it in our simulation results. It seems to be treated as a noise parameter, i.e., humans are noisy, so let’s use this to capture some messy bits of their behavior. In that case, maybe it doesn’t mean much of anything that it has to be turned up so high here.)
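For concreteness, here’s a minimal RSA-style speaker sketch (my own toy, not S&al2017’s actual model — the utterances are theirs but the literal-listener values are made up) showing how cranking up the rationality parameter exaggerates the separation between utterances:

```python
# Toy RSA speaker: soft-maximizes the probability that a literal listener
# recovers the intended meaning; alpha is the rationality (softmax)
# parameter. L0 values below are invented for illustration.
L0 = {  # L0[utterance][meaning]: toy literal-listener probabilities
    "MASHA didn't run quickly": {"ran_slowly": 0.5, "didnt_run": 0.5},
    "Masha didn't run QUICKLY": {"ran_slowly": 0.9, "didnt_run": 0.1},
}

def speaker(meaning, alpha):
    """P_S1(utterance | meaning): softmax of alpha * log L0(meaning | utterance)."""
    scores = {u: L0[u][meaning] ** alpha for u in L0}  # p**alpha == exp(alpha*log p)
    z = sum(scores.values())
    return {u: s / z for u, s in scores.items()}

for alpha in (1, 4, 20):
    p = speaker("ran_slowly", alpha)["Masha didn't run QUICKLY"]
    print(f"alpha={alpha}: P(adverb-accented | ran_slowly) = {p:.3f}")
```

With alpha = 1 the two utterances are only mildly separated; at alpha = 20 the speaker essentially always picks the adverb-accented form. That matches my read of Figure 2: the model needs near-endpoint behavior to pull the predictions apart.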

Monday, November 6, 2017

Thoughts on Orita et al. 2015

I really appreciated how O&al2015 used the RSA modeling framework to make a theory (in this case, about discourse salience) concrete enough to implement and then evaluate against observable behavior. As always, this is the kind of thing I think modeling is particularly good at, so the more that we as modelers emphasize that, the better.

Some more targeted thoughts:

(1) The Uniform Information Density (UID) Hypothesis assumes receiving information in chunks of approximately the same size is better for communication. I was trying to get the intuition of that down -- is it that new information is easier to integrate if the amount of hypothesis adjustment needed based on that new information is always the same? (And if so, why should that be exactly? Some kind of processing thing?)

Related: If I’m understanding correctly, the discourse salience version of the UID hypothesis means more predictable forms become pronouns. This gets cashed out initially as the surprisal component of the speaker function in (3) (I(words; intended referent, available referent)), which is just about vocabulary specificity (that is, inversely related to how ambiguous the literal meaning of the word is). Then 3.2 talks about how to incorporate discourse salience. In particular, (4) incorporates the literal listener interpretation given the word, and (5) is just straight Bayesian inference where the priors over referents are what discourse salience affects. Question: Would we need these discourse-salience-based priors to reappear at the pragmatic listener level if we were using that level? (It seems like they belong there too, right?)
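Here’s how I picture the salience-as-prior move in (5), as a toy sketch (my own made-up counts and lexicon, not O&al2015’s actual model or data):

```python
# Toy literal listener in the spirit of O&al2015's equation (5), as I read
# it: the prior over referents comes from discourse salience (here,
# invented frequency counts) and the likelihood is the word's literal
# compatibility with each referent. Bayes does the rest.
salience_counts = {"Mary": 8, "Sue": 2}      # toy discourse frequencies
literal = {                                  # literal[word][referent]
    "she":  {"Mary": 1.0, "Sue": 1.0},       # pronoun: literally ambiguous
    "Mary": {"Mary": 1.0, "Sue": 0.0},       # proper name: fully specific
}

def l0(word):
    """P_L0(referent | word) ∝ literal(word, referent) * salience prior."""
    total = sum(salience_counts.values())
    prior = {r: c / total for r, c in salience_counts.items()}
    scores = {r: literal[word][r] * prior[r] for r in prior}
    z = sum(scores.values())
    return {r: s / z for r, s in scores.items()}

print(l0("she"))   # salience does all the work: Mary favored 0.8 to 0.2
print(l0("Mary"))  # the literal meaning pins the referent regardless
```

This is why the pronoun is “cheap but risky”: its interpretation rides entirely on the salience prior, which is exactly the component whose different operationalizations (frequency vs. recency) the paper is comparing.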

Speaking of levels, since O&al2015 are modeling speaker productions, is the S1 level the right level? Or should they be using an S2 level, where the speaker assumes a pragmatic listener is the conversational partner? Maybe not because we usually save the S2 level for metalinguistic judgments like endorsements in a truth-value judgment task?

(2) Table 1: Just looking at the log likelihood scores, it seems like frequency-based discourse salience is the way to go (and this effect is much more pronounced in child-directed speech). However, the authors note in the discussion that the recency-based discourse salience version has better accuracy scores, though most of that is due to proper name accuracy, since every model is pretty terrible at pronoun accuracy. I’m not entirely sure I follow the authors’ point about why the accuracy and log likelihood scores don’t agree on the winner. If the recency-based models return higher probabilities for a proper name, shouldn’t that make the recency-based log likelihood score better than the frequency-based one? Is the idea that some proper names get all the probability (for whatever reason) under the recency-based version, and this so drastically lowers the probabilities of the other proper names that a worse log likelihood results?
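One toy way to see how accuracy and log likelihood can disagree (my own made-up numbers, nothing to do with the actual Table 1 values): a model that is usually right at argmax but occasionally wildly overconfident can beat a better-calibrated model on accuracy while losing badly on log likelihood.

```python
from math import log

# Each entry is the probability a model assigns to the TRUE referent
# on one item. "confident" is right more often at argmax, but its one
# disastrous item (p = 0.001) sinks its log likelihood.
confident = [0.9, 0.9, 0.9, 0.001]   # 3/4 argmax-correct, one disaster
cautious  = [0.6, 0.6, 0.4, 0.4]     # 2/4 argmax-correct, never terrible

def accuracy(ps):
    """Proportion of items where the true referent gets majority probability."""
    return sum(p > 0.5 for p in ps) / len(ps)

def log_likelihood(ps):
    return sum(log(p) for p in ps)

print(accuracy(confident), log_likelihood(confident))
print(accuracy(cautious), log_likelihood(cautious))
```

So if the recency-based version puts nearly all its probability mass on the most recent referent, the items where that guess is wrong could be exactly the low-probability disasters that drag its log likelihood below the frequency-based version’s, even while its accuracy stays higher.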

But still, no matter what, discourse saliency looks like it’s having the most impact (though there’s some impact of expression cost). In the adult-directed dataset, you can actually get pretty close to the best log likelihood with the -cost frequency-based version (-1017) vs. the complete frequency-based version (-958). But if you remove discourse salience, things get much, much worse (-6904). Similarly, in the child-directed dataset, the -cost versions aren’t too much worse than the complete versions, but the -discourse version is horrible.

All that said, what on earth happened with pronoun accuracy? There’s clearly a dichotomy between the proper name results and the pronoun results, no matter what model version you look at (except maybe the adult-directed -unseen frequency-based version).

(3) In terms of next steps, incorporating visual salience seems like a natural step when calculating discourse saliency. Probably the best way to do this is as a joint distribution in the listener function for the prior? (I also liked the proposed extension that involves speaker identity as part of the relevant context.) Similarly, incorporating grammatical and semantic constraints seems like a natural extension that could be implemented the same way. Probably a hard part is getting plausible estimates for these priors?