Tuesday, March 13, 2018

Some thoughts on Freudenthal et al. 2016

I think it’s always nice to see someone translate a computational-level approach into an algorithmic-level one. The other main attempt I’ve seen for syntactic categorization is Wang & Mintz (2008) for frequent frames.

Wang, H., & Mintz, T. (2008). A dynamic learning model for categorizing words using frames. BUCLD 32 Proceedings, 525-536.

Here, F&al2016 are embedding a categorization strategy in an online item-based approach to learning word order patterns, and evaluating it against qualitative patterns of observed child knowledge (early noun-ish category knowledge and later verb-ish category knowledge).

An important takeaway seems to be the qualitative distinction between preceding vs. following context. Interestingly, this distinction is also the essence of a frame.

Specific comments:

(1) Types vs tokens: It’s interesting to see F&al2016 get mileage by ignoring token frequency. This is a tendency that seems to show up in a variety of learning strategies (e.g., Tolerance Principle decisions about whether to generalize are based on consideration of types rather than tokens, which itself is tied to considerations of memory storage and retrieval: Yang 2005).

Yang, C. (2005). On productivity. Linguistic Variation Yearbook, 5(1), 265-302.

In the intro, F&al2016 note that their motivation is one of computational cost — they say it’s less work to collect just the word, rather than keep track of both the word and its frequency. I wonder how much of an additional burden that is though. It doesn’t seem like all that much work, and don’t we already track frequencies of so many things anyway?

Also, in the simulation section, F&al2016 say “MOSAIC does not represent duplicate utterances” -- so does this mean MOSAIC already has a type bias built into it? (In this case, at the utterance level.)

(2) The MOSAIC model: I love all the considerations of developmental plausibility this model encodes, which is why it’s so striking that they use orthographically transcribed speech as input. Usually this is verboten for models of early language acquisition (e.g., speech segmentation), because orthographic and phonetic words aren’t the same thing. But here, this comes back to an underlying assumption about the initial knowledge state of the learner they model. In particular, this learner has already learned how to segment speech in an adult-like way. This isn’t a crazy assumption for 12-month-olds, but it’s also a little idealized, given what we know about the persistence of segmentation errors. Still, this assumption is no different from what previous syntactic categorization studies have assumed. What makes it stand out here is the (laudable) focus on developmental plausibility. Future work might look at how robust this learning strategy is to segmentation errors in the input.

(3) Distributed representations: The Redington et al. categorization approach that uses context vectors reminds me strongly of current distributed representations of word meaning (i.e., word embeddings: word2vec, GloVe). Of course, the word embedding approaches aren’t a transparent translation of words into their context counts the way the Redington et al. vectors are, but the underlying intuition feels similar.
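
To make the parallel concrete, here’s a minimal sketch of the context-vector idea as I understand it (this is my own toy illustration, not the actual Redington et al. or F&al2016 implementation): represent each target word by counts of its immediately preceding and following words, then compare targets by rank-order correlation of those vectors. The toy corpus, target words, and context words are all invented.

```python
# Toy sketch of Redington-style context vectors (my own illustration).
from collections import Counter
from scipy.stats import spearmanr

def context_vectors(corpus, targets, context_words):
    """corpus: list of tokenized utterances; returns {target: count vector}."""
    vecs = {t: Counter() for t in targets}
    for utt in corpus:
        for i, w in enumerate(utt):
            if w in vecs:
                if i > 0:
                    vecs[w][("prev", utt[i - 1])] += 1   # preceding context
                if i < len(utt) - 1:
                    vecs[w][("next", utt[i + 1])] += 1   # following context
    dims = [(pos, c) for pos in ("prev", "next") for c in context_words]
    return {t: [vecs[t][d] for d in dims] for t in targets}

corpus = [["the", "dog", "runs"], ["the", "cat", "runs"], ["the", "dog", "sleeps"]]
vecs = context_vectors(corpus, ["dog", "cat"], ["the", "runs", "sleeps"])
rho, _ = spearmanr(vecs["dog"], vecs["cat"])
print(rho)  # high rank-order correlation -> candidates for clustering together
```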

(4) Developmental linking: The explanation F&al2016 offer for why nouns emerge early as a category comes down to the structure of English utterances, coupled with MOSAIC’s utterance-final bias. Does this mean children learning verb-final languages (e.g., Japanese) should develop knowledge of the verb category earlier? If so, I wonder if we see any evidence of this from behavioral or computational work.

(5) Evaluation metrics: I want to make sure I understand the categorization evaluation metric. The model’s classification of a cluster was compared against “the (most common) grammatical class assigned to each word”, but there was also a pairwise metric used that doesn’t actually need to take the cluster’s class into account for precision. That is, if you’re using pairwise precision (accuracy) and recall (completeness), you just take all the pairs of words in your cluster and figure out how many are truly in the same category -- whatever that category is -- and that’s the number in the numerator. The number in the denominator depends on whether you’re comparing against all the pairs in that cluster (precision) or all the pairs in the true adult category (recall). So, there’s only a need to decide what an individual cluster’s category is (noun or verb or something else entirely) when you’re doing the recall part.
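
Here’s a small sketch of how I understand the pairwise computation (this is one standard way of computing pairwise accuracy/completeness; the clusters and gold labels are invented, and F&al2016’s exact metric may differ in its details):

```python
# Pairwise precision (accuracy) and recall (completeness) over clusters.
from itertools import combinations

def pairwise_scores(clusters, gold):
    """clusters: list of lists of words; gold: {word: true category}."""
    same_in_cluster = 0      # pairs that the model clusters together
    correct_in_cluster = 0   # ... that also share a gold category
    for cluster in clusters:
        for w1, w2 in combinations(cluster, 2):
            same_in_cluster += 1
            if gold[w1] == gold[w2]:
                correct_in_cluster += 1
    # recall denominator: all same-category pairs in the gold standard
    words = [w for c in clusters for w in c]
    same_in_gold = sum(1 for w1, w2 in combinations(words, 2) if gold[w1] == gold[w2])
    precision = correct_in_cluster / same_in_cluster
    recall = correct_in_cluster / same_in_gold
    return precision, recall

gold = {"dog": "N", "cat": "N", "ball": "N", "run": "V", "eat": "V"}
clusters = [["dog", "cat", "run"], ["ball", "eat"]]
print(pairwise_scores(clusters, gold))  # (0.25, 0.25)
```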

(6) Model interpretation: In order to understand F&al2016’s concern with the number of links over time (in particular, the problem of there being more links earlier on than later on), it probably would have helped to know more about what those links refer to. I think they’re related to how utterances are generated, with progressively longer versions of an utterance linked word by word. But then, how does that relate to syntactic categorization? A little later, F&al2016 describe these links as something that links nouns together vs. linking verbs together, which would then make sense from a syntactic categorization perspective. But that’s different from the original MOSAIC links. Maybe links are what happens when the Redington et al. analysis is done over the progressively longer utterances provided by MOSAIC? If so, a link is just another way of saying “these words are clustered together based on the defined clustering threshold”.

(7) Free parameters: It’s interesting that they had to change the thresholds for Table 1 vs Table 2. The footnote explains this by saying this allows “a meaningful overall comparison in terms of accuracy and completeness”. But why wouldn’t the original thresholds suffice for that? Maybe this has something to do with the qualitative properties you’re looking for from a threshold? (For instance, the original “frequency” threshold for frequent frames was motivated partly by frames that were salient “enough” to the child. I’m not sure what you’d be looking for in a threshold for this Redington et al. analysis, though. Some sort of similarity saliency?)

Relatedly, where did the Jaccard distance threshold of 0.2 used in Table 3 come from? (Or perhaps, why is a Jaccard threshold of 0.2 equivalent to a rank-order correlation threshold of 0.45?)
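
For concreteness, here’s a hedged sketch of the two similarity measures presumably being thresholded, computed over the same made-up context counts: Spearman rank-order correlation over context-count vectors vs. Jaccard over the sets of attested context words (that set-based version is my assumption about how the Jaccard variant works). Since the two live on different scales, the thresholds aren’t directly interchangeable.

```python
# Rank-order correlation vs. Jaccard distance over invented context counts.
from scipy.stats import spearmanr

dog = {"the": 5, "a": 3, "runs": 2, "sleeps": 1}   # hypothetical context counts
cat = {"the": 4, "a": 1, "runs": 2, "eats": 1}

dims = sorted(set(dog) | set(cat))
v1 = [dog.get(d, 0) for d in dims]
v2 = [cat.get(d, 0) for d in dims]
rank_corr, _ = spearmanr(v1, v2)          # a threshold like 0.45 applies here

shared = set(dog) & set(cat)              # contexts attested for both words
union = set(dog) | set(cat)
jaccard_sim = len(shared) / len(union)    # 3/5 = 0.6
jaccard_dist = 1 - jaccard_sim            # 0.4; a threshold like 0.2 applies here

print(rank_corr, jaccard_dist)
```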

(8) Noun richness analysis: This kind of incremental approach to which words are in the noun category vs. the verb category seems like an interesting hypothesis about what the non-adult noun and verb categories ought to look like. I’d love to test these predicted categories against child production data from these same corpora using a Yang-style productivity analysis (e.g., Yang 2011) -- a rough sketch of that kind of check follows the reference below.

Yang, C. (2011). A statistical test for grammar. In Proceedings of the 2nd workshop on Cognitive Modeling and Computational Linguistics (pp. 30-38). Association for Computational Linguistics.
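
As a sketch of what such a test could look like, here’s a simplified Tolerance-Principle-style check in the spirit of Yang (2005). Yang (2011)’s actual statistical test compares observed vs. expected determiner-noun overlap and is more involved, and the counts below are invented rather than drawn from these corpora.

```python
# Simplified Tolerance-Principle-style productivity check (illustrative only).
import math

def is_productive(n_types, n_exceptions):
    """Tolerance Principle: productive iff exceptions <= N / ln(N)."""
    theta = n_types / math.log(n_types)
    return n_exceptions <= theta, theta

# e.g., 50 noun types attested, 9 never appear with a determiner where expected
productive, theta = is_productive(n_types=50, n_exceptions=9)
print(productive, round(theta, 1))  # True 12.8
```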

Friday, March 2, 2018

Some thoughts on Hochstein et al. 2017

As a cognitive modeler, I love having this kind of theoretically motivated empirical data to think about. Here, I wonder if we can unpack different possible causes of the ASD children’s behavior using something like the RSA model. We have distinct patterns of behavior to account for, details on the exact experimental context, and a really interesting separation of the two steps involved in appropriately using scalar implicatures (where it seems like the ASD kids fail to cancel the implicature when they should).

Other thoughts:

(1) After reading the introduction and the difference between the ignorance implicature and the epistemic step, I now have a renewed appreciation for symbolic representation. In particular, the text descriptions of each of these made my head spin for a while, while the symbolic representation was immediately comprehensible (and then I later worked out my own text description). My take: ignorance implicature not(believe(p)) = “I don’t know whether p is true”; epistemic step believe(not(p)) = “I know p specifically is not true (as opposed to other things I might believe about p or whatever else)”.

(2) The basic issue with prior experimental work that H&al2017 highlight is that the Truth-Value Judgment Task (TVJT) is not the normal language comprehension process. Normal language comprehension involves inferring the world from the utterance expressed; in the TVJT, in contrast, you’re given the world and asked whether you would say a particular utterance -- which is why RSA models capturing the TVJT cast it as an utterance endorsement process instead. But this highlights how important naturalistic conversational usage may be for getting at knowledge in populations where accessing that knowledge may be more fragile (like kids). The Partial Knowledge Task of H&al2017 is an example of this: something like a naturalistic task, in which participants have to use their implicit calculation (or not) of the implicature to make a judgment about the state of the world.

(3) Interestingly, something like the Partial Knowledge Task setup has already been implemented in the RSA framework by Goodman & Stuhlmueller (2013), which addresses when neurotypical adults do and (importantly) don’t compute implicatures, depending on speaker knowledge. Notably, this is where we see an ASD difference in the H&al2017 studies — ASD kids don’t seem to use their ignorance implicature computation abilities here, and instead go ahead with the scalar implicature calculation.

I wonder how the H&al2017 behavior patterns would play out in an RSA model. Would it have something to do with the recursive reasoning component if ASD kids don’t care about speaker knowledge? Or is there a way to keep the recursive social reasoning but somehow skew the probabilities to get this pattern? (Especially since ASD Theory of Mind ability didn’t correlate with this response behavior.)
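
As a starting point for that kind of exploration, here’s a toy vanilla-RSA sketch (in the style of Frank & Goodman’s basic model, not H&al2017’s data or the exact Goodman & Stuhlmueller 2013 knowledge model). The worlds, utterances, and rationality parameter are all invented; the point is just to mark where a speaker-knowledge term or skewed priors could be inserted when trying to reproduce the ASD kids’ pattern.

```python
# Toy vanilla-RSA sketch: "some"/"all" over worlds where 0-3 items have the property.
import numpy as np

worlds = [0, 1, 2, 3]                  # e.g., how many of 3 apples were eaten
utterances = ["none", "some", "all"]
alpha = 4.0                            # speaker rationality (free parameter)

def literal(utt, world):
    if utt == "none":
        return float(world == 0)
    if utt == "some":
        return float(world >= 1)
    if utt == "all":
        return float(world == 3)

# L0: literal listener -- uniform prior over worlds, conditioned on literal truth
L0 = np.array([[literal(u, w) for w in worlds] for u in utterances])
L0 = L0 / L0.sum(axis=1, keepdims=True)

# S1: pragmatic speaker -- chooses utterances in proportion to L0^alpha
S1 = np.exp(alpha * np.log(L0.T + 1e-10))   # rows = worlds, cols = utterances
S1 = S1 / S1.sum(axis=1, keepdims=True)

# L1: pragmatic listener -- inverts the speaker (uniform world prior again)
L1 = S1.T / S1.T.sum(axis=1, keepdims=True)

print(dict(zip(worlds, L1[utterances.index("some")].round(2))))
# "some" pulls probability away from the all-true world (world 3): the implicature.
# Speaker knowledge (knowledgeable vs. ignorant) or skewed priors would enter at
# the S1 step, by changing what the speaker is assumed to condition on.
```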