Tuesday, February 23, 2021

Some thoughts on Tenenbaum et al. 2020

I think it’s a really interesting and intuitive idea to add semantic constraints to the task of morphology identification. That said, I do wonder how many of the morphology prefixes and suffixes might already come for free from the initial speech segmentation process. (I’m reminded of work on Bayesian segmentation strategies, where we definitely get some morphology like -ing sliced off for free with some implementations.) If those morphology pieces are already available, perhaps it becomes easier to implement semantically-constrained generalization over morphology transforms. Here, it seems like a lot of the struggle is in the plausibility of the particular algorithm chosen for identifying suffix morphology. Perhaps that could all be sidestepped.

Relatedly, a major issue for me was understanding how the algorithm underlying the developmental model works (more on this below). I’m unclear on what seem to be important implementational details, and those details matter if we want to make claims about cognitive plausibility. But I love the goal of increasing developmental plausibility!


Other specific thoughts:


(1) The goal of identifying transforms: In some sense, this is the foundation of morphology learning systems (e.g., Yang 2002, 2005, 2016) that assume the child already recognizes a derived form as an instance of a root form (e.g., kissed-kiss, drank-drink, sung-sing, went-go). For those approaches, the child knows “kissed” is the past tense of “kiss” and “drank” is the past tense of “drink” (typically because the child has an awareness of the meaning similarity). Then, the child tries to figure out if the -ed transformation or the -in- → -an- transformation is productive morphology. Here, it’s about recognizing valid morphology transforms to begin with (is -in- → -an- really a thing that relates drink-drank and sing-sang?), so it’s a precursor step.
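To make concrete what recognizing a transform might involve, here’s a minimal sketch (my own illustration, not the procedure from T&al2020 or from Yang’s work): strip the shared initial material from a base/derived pair and treat whatever is left over as the candidate transform.

def candidate_transform(base, derived):
    # Strip the longest shared prefix; what remains on each side is the
    # candidate (base ending -> derived ending) transform. This is my own
    # toy illustration, not the algorithm from any of the cited papers.
    i = 0
    while i < min(len(base), len(derived)) and base[i] == derived[i]:
        i += 1
    return (base[i:], derived[i:])

print(candidate_transform("kiss", "kissed"))   # ('', 'ed')
print(candidate_transform("drink", "drank"))   # ('ink', 'ank'), i.e., the -in- -> -an- case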


(2) On computational modeling as a goal: For me, it’s funny to state outright that a goal is to build a computational model of some process. Left implicit is why someone would want to do this. (Of course, it’s because a computational model allows us to make concrete the cognitive process we think is going on -- here, a learning theory for morphology -- and then evaluate the predictions that implemented theory makes. But experience has taught me that it’s always a good idea to say this kind of thing explicitly.)


(3) Training GloVe representations on child-directed speech: I love this. It could well be that the nature of children’s input structures the meaning space in a different way than adult linguistic input does, and this could matter for capturing non-adult-like behavior in children.


(4) Morphology algorithm stuff: In general, some of the model implementation details are unclear to me, and it seems important to understand what they are if we want to claim that a certain algorithm is capturing the cognitive computations that humans are doing.


(a) Parameter P determines which sets (unmodeled, base, derived) the proposed base and derived elements can come from. So this means they don’t just come from the unmodeled set? I think I don’t understand what P is. Does this mean both the “base” and “derived” elements of a pair could come from, say, the “base” set? Later on, they discuss the actual P settings they consider, with respect to “Static” vs. “Nonstatic”. I don’t quite know what’s going on there, though -- why do the additional three settings for the “Nonstatic” value intuitively connect to a “Nonstatic” rather than “Static” approach? It’s clearly something to do with allowing things to move in and out of the derived bin, in addition to in and out of the base bin...


(b) One step is to discard transforms that don’t meet a “threshold of overlap ratio”. What is this? Is this different from T? It seems like it, but what does it refer to?


(c) Another step is to rank remaining transforms according to the number of wordpairs they explain, with ties broken by token counts. So, token frequency does come back into play, even though the basic algorithm operates over types? I guess the frequencies come from the CHILDES data aggregates.
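As a sanity check on my reading of this ranking step, here’s a tiny sketch of what I have in mind (the transforms, pair lists, and token counts are all made up):

# Hypothetical candidates: each transform with the wordpairs (types) it explains
# and an aggregate token count (e.g., summed CHILDES frequencies).
transforms = {
    ("", "ed"):  {"pairs": [("kiss", "kissed"), ("walk", "walked")], "tokens": 91},
    ("", "ing"): {"pairs": [("walk", "walking"), ("eat", "eating")], "tokens": 212},
    ("", "s"):   {"pairs": [("cat", "cats"), ("dog", "dogs"), ("hug", "hugs")], "tokens": 150},
}

# Rank by number of wordpairs explained (types), breaking ties by token count;
# here ("", "s") wins on types, and ("", "ing") beats ("", "ed") on the token tiebreak.
ranked = sorted(
    transforms.items(),
    key=lambda kv: (len(kv[1]["pairs"]), kv[1]["tokens"]),
    reverse=True,
)
top_transform, top_info = ranked[0]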


(d) If the top candidate transform explains >= W wordpairs, it’s kept. So, does this mean the algorithm is only evaluating the top transform each time? That is, it’s discarding the information from all the other potential transforms? That doesn’t seem very efficient...but maybe this has to do with explicit hypothesis testing, with the idea that the child can only entertain one hypothesis at a time…


(e) Each base/derived word pair explained by the new transform is moved to the Base/Derived bin. The exception is if the base form was in the derived bin before; in this case, it doesn’t move. So, if an approved transform seems to actually explain a derived1/derived2 pair, the derived1 element doesn’t go into the base bin? Is the transform still kept? I guess so?
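Putting (d) and (e) together, here’s how I currently understand the accept-and-update step, continuing the toy ranked list from the sketch under (c). The threshold value and all the names are mine, and the exception case is just my best guess at the paper’s intent:

W = 2  # illustrative wordpair threshold; the paper's W is a free parameter

base_bin, derived_bin = set(), set()

def accept_top_transform(ranked, base_bin, derived_bin, W):
    # Only the single top-ranked transform is evaluated; it's kept if it
    # explains at least W wordpairs (my reading of step (d)).
    top_transform, info = ranked[0]
    if len(info["pairs"]) < W:
        return None  # nothing accepted this round
    # Move each explained pair into the bins (step (e)), except that a "base"
    # word already sitting in the derived bin stays where it is.
    for base, derived in info["pairs"]:
        if base not in derived_bin:
            base_bin.add(base)
        derived_bin.add(derived)
    return top_transform

accepted = accept_top_transform(ranked, base_bin, derived_bin, W)  # ranked comes from the sketch under (c)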



(5) Performance is assessed via hits vs. false alarms, so I think this is an ROC curve. I like the signal detection theory approach, but then shouldn’t we be able to capture performance holistically for each combination by looking at the area under the curve?
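For concreteness, the summary I mean is just the area under the (false alarm rate, hit rate) points that each combination traces out; a minimal sketch with made-up numbers:

import numpy as np

# Made-up (false alarm rate, hit rate) points for one learner, with the
# (0,0) and (1,1) ROC endpoints included.
fa   = np.array([0.0, 0.10, 0.25, 0.50, 1.0])
hits = np.array([0.0, 0.40, 0.65, 0.85, 1.0])

auc = np.trapz(hits, fa)   # area under the ROC curve
print(f"AUC = {auc:.3f}")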


Relatedly, transforms are counted as valid if they’re connected to at least three correct base/derived wordpairs, even if they’re also connected to any number of other spurious ones. So, a transform is “correct” if it has at least three hits, regardless of precision. Okay...this seems a bit arbitrary, though. Why focus on recall rather than precision for correctness? This seems particularly salient given the discussion a bit further on in the paper that “reliability” (i.e., precision) would better model children’s learning.
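Just to spell out the contrast I’m worried about, here’s a toy example (the proposed wordpairs and the gold set are invented) where a transform clears the three-correct-pairs bar even though its precision is mediocre:

proposed = {("kiss", "kissed"), ("walk", "walked"), ("want", "wanted"),
            ("be", "bed"), ("sle", "sled")}          # last two are spurious
gold     = {("kiss", "kissed"), ("walk", "walked"), ("want", "wanted")}

hits = len(proposed & gold)          # 3
precision = hits / len(proposed)     # 0.6
is_valid = hits >= 3                 # True, regardless of the 0.6 precision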


Note: I agree that high precision for early learning (<1 year) is more important than high recall. But I wonder what age this algorithm is meant to apply to, and whether that age would still be better modeled by high precision at the expense of high recall.


Note 2 from the results later on: I do like seeing qualitative comparison to developmental data, discussing how a particular low-resource setting can capture 8 of the most common valid transforms children have.


(6) T&al2020 talk about a high-resource vs. a low-resource learner. But why not call the high-resource learner an idealized/computational-level learner? Unless Lignos & colleagues meant this to be a process/algorithmic-level learner? (It doesn’t seem like it, but then maybe they were less concerned about some of the cognitive plausibility aspects.)


(7) Fig 3 & 4, and comparisons: 


(a) Fig 3 & 4: I’d love to see the Lignos et al. version with no semantic information for all the parameter values manipulated here. That seems like an easy thing to do (just remove the semantic filtering, but still allow variation for the top number of suffixes N, wordpair threshold W, and permitted wordpairs P for the high-resource learners; for the low-resource learners, just vary W and P). Then, you could also easily compare the area under the curve for this baseline (no semantics) model vs. the semantics models for all the learners (not just the high-resource ones). And that then would make the conclusion that the learners who use semantics do better more robust. (Side note: I totally believe that semantics would help. But it would be great to see that explicitly in the analysis, and to understand exactly how much it helps the different types of learners, both high-resource and low-resource.)
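Schematically, the comparison I’m imagining looks something like this, where run_learner is a hypothetical stand-in for whatever maps one parameter setting to a (false alarm rate, hit rate) point; nothing here is from the paper’s actual code:

import itertools
import numpy as np

def roc_auc(points):
    # Area under the curve traced out by (false alarm rate, hit rate) points,
    # with the (0,0) and (1,1) endpoints added.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    fa, hits = zip(*pts)
    return np.trapz(hits, fa)

def compare_to_no_semantics_baseline(run_learner, Ns, Ws, Ps):
    curves = {True: [], False: []}   # with vs. without semantic filtering
    for N, W, P in itertools.product(Ns, Ws, Ps):
        for use_semantics in (True, False):
            curves[use_semantics].append(
                run_learner(semantics=use_semantics, N=N, W=W, P=P))
    return roc_auc(curves[True]), roc_auc(curves[False])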


(b) Fig 4: I do appreciate the individual parameter exploration, but I’d also like to see a full low-resource learner combination [VC=Full, EC=CHILDES, N=3] -- at least, if we want to claim that the more developmentally-plausible learners can still benefit from semantic info like this. The discussion touches on this some (i.e., VC=Full, EC=CHILDES, N=15 does as well as the original Lignos settings), but it’d be nice to see it plotted in a Figure-4-style plot for easy comparison.


(8) Which morphological transforms we’re after: In the discussion, T&al2020 note that they only focus on suffixes, and certainly the algorithm is only tuned to suffixes. It definitely seems like a more developmentally-plausible algorithm would be able to use meaning to connect more disparate derived forms to their base forms (e.g., drink-drank, think-thought). I’d love to see an algorithm that uses semantic similarity (and syntactic context) as the primary considerations, and then how close the base is to the derived form as a secondary consideration. This would allow the irregulars (like drink-drank, think-thought) to emerge as connected wordpairs. (T&al2020 do sketch some ideas in this direction in the next section, when they talk about model generalizability of morphology, and morphology clustering.)
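Here’s a rough sketch of what I mean by semantics-first, form-second (entirely my own toy; it assumes we already have word embeddings, e.g., GloVe vectors trained on CHILDES, and the similarity floor is arbitrary):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def edit_distance(a, b):
    # Plain Levenshtein distance as the secondary, form-based consideration.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rank_base_candidates(derived, vocab, vectors, sim_floor=0.4):
    # Rank possible base forms for a derived form: semantic similarity first,
    # orthographic closeness second. This lets drink-drank or think-thought
    # surface as candidate pairs even though the form overlap is small.
    candidates = []
    for base in vocab:
        if base == derived:
            continue
        sim = cosine(vectors[base], vectors[derived])
        if sim >= sim_floor:
            candidates.append((base, sim, edit_distance(base, derived)))
    # High similarity first; among near-ties, prefer orthographically closer bases.
    return sorted(candidates, key=lambda t: (-round(t[1], 2), t[2]))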


(9) In the model extension part, T&al2020 say they want to get a “token level understanding of segmentation”. I’m not sure what this means -- is this the clustering together of different morphological transforms that apply to specific words? (I’d call this types, rather than tokens if so.)


(10) T&al2020’s proposed semantic constraint is that valid morphological transforms should connect pairs of base and derived forms that are offset in a consistent direction in semantic space. Hmmm...I guess the idea is that the semantic information encoded by a transform (e.g., past tense, plural, ongoing action) is consistent, so that should be detectable. That doesn’t seem crazy, certainly as a starting hypothesis. My concern with the practical implementation T&al2020 try is the GloVe semantic space, which may or may not actually have this property. The semantic space of embedding models is strange, and not usually very interpretable (currently) in the ways we might hope it to be. But I guess the brief practical demonstration T&al2020 do for their H3 morpheme transforms shows a proof of concept: it is in fact possible to cluster just so over the GloVe-defined difference vectors, even if it’s a mystery how a child would arrive at exactly that agglomerative clustering.
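To make the consistent-offset idea concrete, here’s a minimal sketch of the kind of check I have in mind (my own toy, not T&al2020’s setup; it assumes a dict of GloVe vectors and a list of accepted base/derived pairs, and uses scikit-learn’s agglomerative clustering):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_transform_offsets(wordpairs, vectors, n_clusters=3):
    # One derived-minus-base offset vector per pair; if a transform really
    # does shift meaning in a consistent direction, its pairs should land
    # together in one cluster.
    offsets = np.array([vectors[d] - vectors[b] for b, d in wordpairs])
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, metric="cosine", linkage="average"
    ).fit_predict(offsets)   # older scikit-learn versions call this parameter affinity
    return dict(zip(wordpairs, labels))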


Thursday, February 4, 2021

Some thoughts on Fox & Katzir 2020

I think one of the main things that struck me is the type of iterated rationality models (IRMs) that F&K2020 discuss -- those IRMs don’t seem like any of the ones I’ve seen in the cognitively-oriented literature that connects with human behavioral or developmental data. That is, in footnote 3, F&K2020 note that there’s an IRM approach that assumes grammatical derivation of alternatives, and then uses probabilistic reasoning to disambiguate those alternatives in context. They don’t have a problem with this IRM approach, and think it’s compatible with the grammatical approach they favor. So, if we’re using this IRM approach, then the worries that F&K2020 highlight don’t apply? In my own collaborative work, for instance, I’m pretty sure we always talk about our IRM (i.e., RSA) models as ambiguity resolution among grammatical options that were already derived, though we can assign priors to them and so include how expensive it is to access those options.
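For reference, here’s the flavor of RSA model I have in mind, as a toy numpy sketch (my own toy scalar example, not F&K2020’s or anyone’s published model): the grammar fixes the candidate meanings and their literal truth conditions, and the model only does probabilistic disambiguation over them, with a prior that could encode how expensive each option is to access.

import numpy as np

# Toy scalar case: utterances "some"/"all", meanings "some-but-not-all"/"all".
truth = np.array([[1.0, 1.0],    # "some" is literally true of both meanings
                  [0.0, 1.0]])   # "all" is true only of the "all" meaning
prior = np.array([0.5, 0.5])     # prior over meanings (access cost could go here)
alpha = 4.0                      # speaker rationality

L0 = truth * prior
L0 /= L0.sum(axis=1, keepdims=True)    # literal listener P(meaning | utterance)
S1 = (L0 ** alpha).T
S1 /= S1.sum(axis=1, keepdims=True)    # pragmatic speaker P(utterance | meaning)
L1 = S1.T * prior
L1 /= L1.sum(axis=1, keepdims=True)    # pragmatic listener P(meaning | utterance)

print(L1[0])  # hearing "some", most of the mass shifts to "some-but-not-all"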


Other thoughts:

(1)  My take on footnote 4 and related text: there’s a conceptual separation between the creation of alternatives (syntactic/semantic computation) and how we choose between those alternatives (which typically involves probabilities). I know there’s a big debate about whether this conceptual separation is cognitively real, and I think that’s what’s being alluded to here.


(2) The comparison “grammatical approach”: I’m curious about the evaluation metrics being used for theory comparison here -- in terms of acquisition, the grammatical approach requires language-specific knowledge (presumably innate?) in the form of the Exh operator, the “innocent inclusion”, and “innocent exclusion” operations. From this perspective, it’s putting a lot of explanatory work onto the development of this language-specific knowledge, compared with domain-general probabilistic reasoning mechanisms. I guess F&K2020 are focused more on issues of empirical coverage, with the multiplier conjunctive reading example not being handled by Franke’s approach.


(3) In section 6 on probabilities and modularity, F&K2020 discuss how probabilities could be part of the initial computations of SIs. I think I’m now starting to blur between this and the version of IRMs that F&K2020 were okay with, which is when IRMs have possibilities that are calculated from the grammar (e.g., with the semantics) and then the IRM uses recursive social reasoning to choose among those possibilities in context. It seems like the “SI calculation” part is about navigating the possibilities (here: the options on the scale that come from the semantics). So, RSA models that purport to capture SIs (even if relying on scale options that come from the grammar) would be exactly the IRMs that F&K2020 would be unhappy with.


(4) In 6.3, F&K2020 mention that priors could be “formal constructs defined internally to the system.” This is clearly an option that F&K2020 think is viable (even if they don’t favor it), so it seems important to understand what this means. But I’m unclear myself on how to interpret that phrase. Would this mean that there are probabilities available beforehand (therefore making them priors), but they’re not tied to anything external (like beliefs about the world, or how often a particular interpretation has occurred)? They’re just...probabilities that get generated somehow for possible interpretations?