Tuesday, March 13, 2018

Some thoughts on Freudenthal et al. 2016

I think it's always nice to see someone translate a computational-level approach into an algorithmic-level one. The other main attempt I've seen at this for syntactic categorization is Wang & Mintz (2008) for frequent frames.

Wang, H., & Mintz, T. H. (2008). A dynamic learning model for categorizing words using frames. In Proceedings of the 32nd Annual Boston University Conference on Language Development (BUCLD 32) (pp. 525-536).

Here, F&al2016 embed a categorization strategy in an online, item-based approach to learning word order patterns, and evaluate it against qualitative patterns of observed child knowledge (early noun-ish category knowledge and later verb-ish category knowledge).

An important takeaway seems to be the qualitative distinction between preceding and following context. Interestingly, this distinction is also the essence of a frame.
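
To make that concrete (a toy sketch of my own, not from the paper): a frame jointly conditions on the preceding and following word, while tracking each side separately decouples them.

    from collections import Counter

    utt = "you want the ball now".split()

    frames = Counter()     # joint (preceding, following) pairs around a target word
    preceding = Counter()  # preceding-word contexts tracked on their own
    following = Counter()  # following-word contexts tracked on their own

    for i in range(1, len(utt) - 1):
        a, x, b = utt[i - 1], utt[i], utt[i + 1]
        frames[(a, b)] += 1     # e.g., the frame want_X_ball around 'the'
        preceding[(x, a)] += 1  # 'the' preceded by 'want'
        following[(x, b)] += 1  # 'the' followed by 'ball'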

Specific comments:

(1) Types vs. tokens: It's interesting to see F&al2016 get mileage out of ignoring token frequency. This tendency seems to show up in a variety of learning strategies (e.g., Tolerance Principle decisions about whether to generalize are based on types rather than tokens, a choice that's itself tied to considerations of memory storage and retrieval: Yang 2005).

Yang, C. (2005). On productivity. Linguistic Variation Yearbook, 5(1), 265-302.
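
Concretely, the Tolerance Principle's decision is over types: a rule covering N item types is productive just in case its exception types number at most N/ln(N), with token frequencies playing no direct role. A minimal sketch (mine, not anyone's actual implementation):

    import math

    def is_productive(n_types, n_exceptions):
        # Yang's threshold: a rule over n_types items tolerates at most
        # n_types / ln(n_types) exception types.
        return n_exceptions <= n_types / math.log(n_types)

    print(is_productive(100, 20))  # True: threshold is 100/ln(100), about 21.7
    print(is_productive(100, 25))  # False: too many exception types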

In the intro, F&al2016 note that their motivation is computational cost: they say it's less work to store just the word than to keep track of both the word and its frequency. I wonder how much of an additional burden that really is, though. It doesn't seem like all that much work, and don't we already track the frequencies of so many other things anyway?

Also, in the simulation section, F&al2016 say “MOSAIC does not represent duplicate utterances” -- so does this mean MOSAIC already has a type bias built into it? (In this case, at the utterance level.)

(2) The MOSAIC model: I love all the considerations of developmental plausibility this model encodes, which is why it's so striking that they use orthographically transcribed speech as input. Usually this is verboten for models of early language acquisition (e.g., speech segmentation), because orthographic and phonetic words aren't the same thing. But here, this comes back to an underlying assumption about the initial knowledge state of the learner they model. In particular, this learner has already learned how to segment speech in an adult-like way. This isn't a crazy assumption for 12-month-olds, but it's also a little idealized, given what we know about the persistence of segmentation errors. Still, this assumption is no different from what previous syntactic categorization studies have assumed. What makes it stand out here is the (laudable) focus on developmental plausibility. Future work might explore how robust this learning strategy is to segmentation errors in the input.

(3) Distributed representations: The Redington et al. categorization approach that uses context vectors reminds me strongly of current distributed representations of word meaning (i.e., word embeddings: word2vec, GloVe). Of course, the word embedding approaches aren't a transparent translation of words into their context counts, but the underlying intuition feels similar.
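
To make the analogy concrete, here's a minimal sketch of Redington et al.-style context-count vectors (my own toy version, using a one-word window on each side purely for illustration):

    from collections import Counter, defaultdict

    corpus = ["you want the ball".split(), "the dog ate it".split()]
    contexts = defaultdict(Counter)  # word -> counts of its neighboring context words

    for utt in corpus:
        for i, word in enumerate(utt):
            if i > 0:
                contexts[word]["PREV_" + utt[i - 1]] += 1
            if i + 1 < len(utt):
                contexts[word]["NEXT_" + utt[i + 1]] += 1

    # Words with similar count vectors get clustered together; word2vec/GloVe
    # compress the same kind of distributional signal into dense vectors
    # instead of raw counts.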

(4) Developmental linking: In F&al2016's model, nouns emerge early as a category because of the structure of English utterances (nouns frequently appear in utterance-final position), coupled with the utterance-final bias of MOSAIC. Does this mean children learning languages with verbs in final position (e.g., Japanese) should develop knowledge of the verb category earlier? If so, I wonder if we see any evidence of this from behavioral or computational work.

(5) Evaluation metrics: I want to make sure I understand the categorization evaluation metric. The model's classification of a cluster was compared against "the (most common) grammatical class assigned to each word", but there was also a pairwise metric, and pairwise precision doesn't actually need a decision about the cluster's class. That is, if you're using pairwise precision (accuracy) and recall (completeness), you take all the pairs of words in a cluster and count how many are truly in the same category -- whatever that category is -- and that count is the numerator. The denominator is either all the pairs in that cluster (precision) or all the pairs in the true adult category (recall). So you only need to decide what an individual cluster's category is (noun, verb, or something else entirely) for the recall part.
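
If I have that right, the pairwise computation would look something like this (my reconstruction from the description above, not the paper's code; taking the cluster's most common gold class for the recall denominator is my assumption):

    from collections import Counter
    from itertools import combinations

    def pairwise_scores(cluster, gold):
        # cluster: list of word types; gold: dict mapping each word to its adult category
        pairs = list(combinations(cluster, 2))
        hits = sum(1 for w1, w2 in pairs if gold[w1] == gold[w2])  # shared numerator
        accuracy = hits / len(pairs) if pairs else 0.0             # denominator: cluster pairs
        # Completeness needs a category decision for the cluster: here, the
        # most common gold class among its members.
        label = Counter(gold[w] for w in cluster).most_common(1)[0][0]
        n = sum(1 for w in gold if gold[w] == label)               # true category size
        completeness = hits / (n * (n - 1) / 2) if n > 1 else 0.0  # denominator: category pairs
        return accuracy, completeness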

(6) Model interpretation: To understand F&al2016's concern with the number of links over time (in particular, the problem of there being more links earlier on than later on), it would have helped to know more about what those links refer to. I think they're related to how utterances are generated, with progressively longer versions of an utterance linked word by word. But then how does that relate to syntactic categorization? A little later, F&al2016 mention these links as something that connects nouns to nouns vs. verbs to verbs, which would make sense from a syntactic categorization perspective. But that's different from the original MOSAIC links. Maybe links are what happens when the Redington et al. analysis is done over the progressively longer utterances MOSAIC provides? In that case, a link is just another way of saying "these words are clustered together, given the defined clustering threshold".

(7) Free parameters: It's interesting that they had to change the thresholds for Table 1 vs. Table 2. The footnote explains this by saying it allows "a meaningful overall comparison in terms of accuracy and completeness". But why wouldn't the original thresholds suffice for that? Maybe this has something to do with the qualitative properties you're looking for from a threshold. (For instance, the original "frequency" threshold for frequent frames was motivated partly by frames being salient "enough" to the child. I'm not sure what you'd be looking for in a threshold for this Redington et al. analysis, though. Some sort of similarity salience?)

Relatedly, where did the Jaccard distance threshold of 0.2 used in Table 3 come from? (Or perhaps: why would a Jaccard threshold of 0.2 be equivalent to a rank-order threshold of 0.45?)
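
One reason the mapping isn't obvious (a toy sketch of my own, not from the paper): rank-order correlation sees the full count vectors, while Jaccard distance only sees which contexts are shared at all, so any correspondence between the two thresholds would presumably have to be established empirically.

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical context-count vectors for two words over five context types.
    v1 = np.array([4, 0, 2, 1, 0])
    v2 = np.array([3, 1, 2, 0, 0])

    rank_sim = spearmanr(v1, v2).correlation  # rank-order correlation over counts

    s1, s2 = set(np.nonzero(v1)[0]), set(np.nonzero(v2)[0])
    jaccard_dist = 1 - len(s1 & s2) / len(s1 | s2)  # Jaccard distance over context sets

    print(rank_sim, jaccard_dist)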

(8) Noun richness analysis: This kind of incremental approach to which words are in the noun category vs. the verb category seems like an interesting hypothesis about what the non-adult noun and verb categories ought to look like. I'd love to test those predicted categories against child production data from these same corpora using a Yang-style productivity analysis (e.g., Yang 2011).

Yang, C. (2011). A statistical test for grammar. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics (pp. 30-38). Association for Computational Linguistics.
