Friday, February 11, 2022

Some thoughts on Wilcox et al. 2021

This paper made me really happy because it involved careful thought about what was being investigated, accessible intuitions about how each model works, what the selected models can and can’t tell us, how the models should be evaluated, sensible ways to interpret the model results, and why we should care. Of course, I did have (a lot of) thoughts occur to me as I was reading (more on this below), but this is probably one of the few recent papers using neural net models that I care about, as a developmental linguist who does cognitive modeling. Thanks, authors!


Specific thoughts:

(1) Poverty of the stimulus vs. the Argument from poverty of the stimulus (i.e., viable solutions to poverty of the stimulus): I think it’s useful to really separate these two ideas. Poverty of the stimulus is about whether the data are actually compatible with multiple generalizations. I think this seems to be true about learning constraints on filler-gap dependencies (though this assertion depends on the data considered relevant in the input signal, which is why it’s important to be clear about what the input is). But the argument from poverty of the stimulus is about viable solutions, i.e., the biases that are built in to navigate the possibilities and converge on the right generalization.


The abstract wording focuses on poverty of the stimulus itself for syntactic islands, while the general discussion in 6.2. is clearly focusing on the (potential) viable solutions uncovered via the models explored in the paper. That is, the focus isn’t about whether there’s poverty of the stimulus for learning about islands, but rather what built-in stuff it would take to solve it. And that’s where the linguistic nativist vs. non-linguistic nativist/empiricist discussion comes in. I think this distinction between poverty of the stimulus itself and the argument from poverty of the stimulus gets mushed together a bit sometimes, so it can be helpful to note it explicitly. Still, the authors are very careful in 6.2. to talk about what they’re interested in as the argument from poverty of the stimulus, and not poverty of the stimulus itself.


(2) Introduction, Mapping out a “lower bound for learnability”: I’m not quite sure I follow what this means: a lower bound in the sense of what’s learnable from this kind of setup, I guess? Which is why anything unlearnable might still require a language-specific constraint? 


Also, I’m not sure I quite follow the distinction between top-down vs bottom-up being made about constraints. Is it that top-down is explicitly defined and implemented, as opposed to bottom-up being an emerging thing from whatever was explicitly defined and implemented? But if so, isn’t that more of an implementational-level distinction, rather than a core aspect of the definition (=computational-level) of the constraint? That is, the bottom-up thing could be explicitly defined, if only we understood better how the explicitly defined things caused it to emerge?


(3) The “psycholinguistics paradigm” for model assessment: I really like this approach, precisely because it doesn’t commit you to an internal theory-specific representation. In general, this is a huge plus for evaluating models against observable behavior. Even if a model does use an internal representation (and someone doesn’t happen to like it), you can still say that whatever’s going on inside can yield human behavior, so it must have something human-like about it. The same is true for distributed/connectionist language models, where it’s hard to tell what the internal representations are, aside from being vectors of numbers.


(4) The expected superadditive pattern when both the filler and gap are present: Why should this be superadditive, instead of just additive? What extra thing is happening to make the presence of both yield a superadditive pattern? I have the same question once we get to island stimuli, too, where the factors are filler presence, gap presence, and island structure presence. 
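
To make sure I have the arithmetic straight, here’s the 2x2 logic as I understand it, with toy surprisal numbers I made up (not values from the paper):

    def interaction(s_fill_gap, s_fill_nogap, s_nofill_gap, s_nofill_nogap):
        # two-way interaction term over surprisals in the [+/- filler] x [+/- gap] design
        return (s_fill_gap - s_fill_nogap) - (s_nofill_gap - s_nofill_nogap)

    # additive: the gap changes surprisal by the same amount with or without a filler,
    # so the interaction term is 0
    print(interaction(4.0, 6.0, 8.0, 10.0))   # 0.0

    # superadditive: the gap helps extra when a filler is present, beyond what the two
    # main effects predict, so the interaction term is nonzero
    print(interaction(2.0, 6.0, 8.0, 10.0))   # -2.0

So my question is really about what mechanism gives you that extra nonzero interaction term, rather than two main effects that just sum.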


(5) The domain-general property of the neural models: The neural models don’t have any bias for language-specific representations built in, but language-specific representations are in the hypothesis space. So, is it possible that the best-fitting internal representations are language-specific? This would be similar to Bayesian approaches (e.g., Perfors et al. 2011) that allow the hypothesis space to include domain-general options, but where inference leads the learner to select language-specific options.


(6) The input: Just a quick note that the neural models here were trained on non-childlike input both in terms of content (e.g., newswire text, Wikipedia) and quantity (though I do appreciate the legwork of estimating input quantity). This isn’t a really big deal for the proof-of-concept goal here, but it starts to matter more for more targeted arguments about how children could learn various filler-gap knowledge so reliably from their experience. Of course, the authors are aware of this and explicitly discuss it right after they introduce the different models (thanks, authors!).


One thing that could be done: cross-check the input quantity with known ages of acquisition (e.g., Complex NP islands in English by age four, De Villiers et al. 2008). Since the authors say input quantity doesn’t really affect their reported results anyway, this should be both easy to do and unlikely to change any major findings.


The second thing is to train these models on child-directed speech samples and see if the results hold. The CHILDES database should have enough input samples from high-resource languages, and whatever limitations there might be in terms of sampling from multiple children at multiple ages from multiple backgrounds (and other variables), it seems like a step in the right direction that isn’t too hard to do (though I guess that does depend on how hard it is to train these models).


(7) Proof-of-concept argument with these neural models: The fact that these models do struggle with issues of length and word frequency in non-human-like ways does suggest that they might do other things (like learn about filler-gap dependencies) in non-human-like ways too. So we have to be careful about what kind of argument this proof-of-concept is — that is, it’s a computational-level “is it possible at all” argument, rather than a computational-level “is it possible for humans who have these known biases/limitations, etc” argument.


(8) N-grams always fail: Is this just because the 5-token window isn’t big enough, so there’s no hope of capturing dependencies that are longer? I expect so, but don’t remember the authors saying something explicitly like that.
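
My guess, sketched out with a made-up sentence just to illustrate the window issue:

    # a 5-gram model conditions on at most the 4 preceding tokens, so once the filler
    # is more than 4 tokens back, it literally cannot influence predictions at the gap site
    sentence = "I know what the boy next door devoured __ yesterday".split()
    gap_index = sentence.index("__")
    context = sentence[max(0, gap_index - 4):gap_index]
    print(context)             # ['boy', 'next', 'door', 'devoured']
    print("what" in context)   # False -- the filler has fallen out of the window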


(9) Figure 5: I want to better understand why inversion is an ok behavior (I’m looking at you, GRNN).  Does that mean that now a gap in matrix position with a licensing filler in the subject is more surprising than no gap in matrix position with no licensing filler in the subject? I guess that’s not too weird. Basically, GRNN doesn’t want gaps in places they shouldn’t be (which seems reminiscent of island restrictions, as islands are places where gaps shouldn’t be).


(10) One takeaway from the neural modeling results: Non-transformer models do better at generalizing. Do we think this is just due to overfitting (training input size, number of parameters), or something else?


(11) Coordination islands: I know the text says all four neural models showed significant reduction in wh-effects, so I guess the reductions must be significant between the control conditions and the 1st conjunct gaps. But, there seems to be a qualitative difference in attenuation we see for a gap in the first conjunct vs. the second conjunct (and it’s true for all four neural models). I wonder why that should be. 


(12) Figure 10, checking my understanding: So, seeing no gap inside a control structure is less surprising sometimes than seeing no gap inside a left-branching structure…I think this may have to do with the weirdness of the control structures, if I’m following 14 correctly? In particular, the -gap control is “I know that you bought an expensive a car last week” and the -gap island is “I know how expensive you bought a car last week”. This may come back to being more precise about surprisal expectations for control vs. island structures. Usually, control structures are fine (grammatical), but here they’re not, and so that could interfere with the potential surprisal pattern we’re looking for.


(13) Subject islands: It was helpful to get a quick explanation about why the GRNN didn’t do as well as the other neural models here (basically, not having a robust wh-effect for the control structures). A quick explanation of this type would be helpful for other cases where we see some neural models (seem to) fail, like the first conjunct for Coordination islands, and then Left Branch and Sentential Subject islands.


(14) Table 14: (just a shout out) Thank you so much, authors, for providing this. Unbelievably helpful summary.


(15) One takeaway the authors point out: If learning is about maximizing input data probability, then these neural approaches are similar to previous approaches that do this. In particular, maximizing input data probability corresponds to the likelihood component of any Bayesian learning approach, which seems sensible. Then, the difference is just about the prior part, which corresponds to the inductive biases built in.
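
Spelling out the decomposition I have in mind (this is just standard Bayes in log form, not anything specific to the paper):

    \log P(h \mid d) = \log P(d \mid h) + \log P(h) - \log P(d)

Maximizing input data probability targets the likelihood term (the first term on the right), while the built-in inductive biases live in the prior term (the second).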


(16) General discussion: I’m not quite sure I follow why linguistic nativist biases would contrast with empiricist biases by a priori downweighting certain possibilities — maybe this is another way of saying that one type of language-specific bias skews/limits the hypothesis space a certain way only if it’s a language-based hypothesis space? In contrast, a domain-general bias skews/limits the hypothesis space no matter what kind of hypothesis space it is. The particular domain-general bias of maximizing input probability of course doesn’t occur a priori — the learner needs to see the input data. But other kinds of domain-general biases seem like they could skew the hypothesis space a priori (e.g., the simplicity preference from Perfors et al. 2006).


(17) Another takeaway from the general discussion is that the learner doesn’t obviously need built-in language-specific biases to learn these island constraints. But I would love to know what abstract representations get built up in the best-performing neural models from this set, like the JRNN. These are likely linguistic in some sense, since they’re built from word forms passed through a convolutional neural network (and therefore compressed somehow), and it would be great to know if they look like syntactic categories we recognize or something else.


So, I’m totally on board with being able to navigate to the right knowledge in this case without needing language-specific (in contrast with domain-general) help. I just would love to know more about the intermediate representations, and what it takes to plausibly construct them (especially for small humans).


Tuesday, January 25, 2022

Some thoughts on van der Slik et al. 2021

I really appreciate the thoughtfulness that went into the reanalysis of the original Hartshorne et al. 2018 data on second language acquisition and a potential critical/sensitive period. What struck me (more on this below) was the subtlety of the distinction that van der Slik et al. 2021 were really looking at: I think it’s not really a “critical period” vs. not, but rather a sensitive period where some language ability is equal before a certain point vs. not. In particular, both the discontinuous (=sensitive period) and continuous (=no sensitive period) approaches assume a dropoff at some point, and that dropoff is steeper at some points than others (hence, the S-shaped curve). So the fact that there is in fact a dropoff isn’t really in dispute. Instead, the question is whether, before that dropoff point, abilities are equal (and in fact, equal to native = sensitive period) or not. To me, this is certainly interesting, but the big picture remains that there’s a steeper dropoff after some point that’s predictable, and it’s useful to know when that point is.



Specific thoughts:

(1) A bit more on the discontinuous vs. continuous models, and sensitive periods vs. not: I totally sympathize with the idea that a continuous sigmoidal function is the more parsimonious explanation for the available data, especially given the plausibility of external factors (i.e., non-biological factors like schooling) for the non-immersion learners. So, turning back to the idea of a critical/sensitive period, we still get a big dropoff in rate of learning, and if the slope is steep enough at the initial onset of the S-curve, it probably looks pretty stark. Is the big difference between that and a canonical sensitive period simply that the time before the dropoff isn’t all the same? That is, for a canonical sensitive period, all ages before the cutoff are the same. In contrast, for the continuous sigmoidal curve, all ages before the point of accelerated dropoff are mostly the same, but there may in fact be small differences the older you are. If that’s the takeaway, then great — we just have to be more nuanced in how we define what happens before the “cutoff” point. But the fact that a younger brain is better (broadly speaking) is true in either case.


(2) L1 vs. L2 sensitive periods:  It’s a good point that these may in fact be different (missing the L1 cutoff seems more catastrophic). This difference seems to call into question how much we can infer about a critical/sensitive period for L1 acquisition on the basis of L2 acquisition. Later results from this paper suggest qualitative similarities in early immersion (<10 years old), bilinguals, and monolinguals (L1) vs. later immersion, in terms of whether a continuous model with sigmoidal dropoff (early immersion) vs. a discontinuous model with constant rate followed by sigmoidal dropoff (later immersion) is the best fit. So maybe we can extrapolate from L2 to L1, provided we look at the right set of L2 learners (i.e., early immersion learners). And certainly we can learn useful things about L2 critical/sensitive periods.


(3) AIC score interpretation: I think I need more of a primer on this, as I was pretty confused about how to interpret these scores. I had thought that a negative score closer to 0 is better because the measure is based on log likelihood, and closer to 0 means a “smaller” negative, which is a higher probability. Various googling suggests the lowest score overall is better, but I don’t understand how you get a negative number in the first place if you’re subtracting the log likelihood. That is, you’re subtracting a negative number (because likelihoods are small probabilities often much less than 1), which is equivalent to adding a positive number. So, I would have expected these scores to be positive numbers.
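
For my own reference, the formula plus a couple of made-up numbers (my understanding is that the negative values come from the likelihood being a density for continuous data, which can exceed 1 and so have a positive log):

    def aic(k, log_likelihood):
        # standard AIC = 2k - 2*ln(L-hat); lower is better, including more negative
        return 2 * k - 2 * log_likelihood

    # discrete data: the likelihood is a probability <= 1, so ln(L) <= 0 and AIC > 0
    print(aic(k=3, log_likelihood=-120.5))   # 247.0

    # continuous data: the likelihood is a density that can exceed 1,
    # so ln(L) can be positive and AIC can come out negative
    print(aic(k=3, log_likelihood=85.2))     # -164.4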


Thursday, January 13, 2022

Some thoughts on Hu et al. 2021

It’s a nice change of pace for me to take a look at pragmatic modeling work more from the engineering/NLP side of the world (rather than the purely cognitive side), as I think this paper does. That said, I wonder if some of the specific techniques used here, such as the training of the initial context-free lexicon, might be useful for thinking about how humans represent meaning (especially meaning that feeds into pragmatic reasoning).


I admit, I also would have benefited from the authors having more space to explain their approach in different places (more on this below). For instance, the intuition of self-supervised vs. regular supervised learning is something I get, but the specific implementation of the self-supervised approach (in particular, why it counts as self-supervised) was a little hard for me to follow.


Specific thoughts:

(1) H&al2021 describe a two-step learning process, where the first step is learning a lexicon without “contextual supervision”. It sounds like this is a “context-free” lexicon, like the L0 level of RSA, which typically involves the semantic representation only. Though I do wonder how “context-free” the basic semantic representations actually are (e.g., they may incorporate the linguistic contexts words appear in), to be honest. But I suppose the main distinction is that no intentions or social information are involved.


The second step is to learn “pragmatic policies” by optimizing an appropriate objective function without “human supervision”. I initially took this to mean unsupervised learning, but then H&al2021 clarified (e.g., in section 3) that instead they meant that certain types of information provided by humans aren’t included during training, and this is useful from an engineering perspective because that kind of data can be costly to get. And so the learning gets the label “self-supervised”, from the standpoint of that withheld information.


 (2) Section 4.3, on the self-supervised learning (SSL) pragmatic agents.


For the AM model that the RSA implementations use, H&al2021 say that they train the base level agents with the full contextual supervision and then “enrich” it with subsequent AM steps. I think I need this unpacked more. I think I follow what it means to train agents with the full contextual supervision: in particular, include the contexts provided by the color triples. But I don’t understand what enriching the agents with AM steps afterwards means. How is that separate/different from the initial training process? Is the initial training not done via AM optimization? For the GD model, we see a similar process, with pragmatic enrichment done via GD steps, rather than AM steps. It seems like this is important to understand, as this distinction gets this approach classified as self-supervised rather than fully supervised. 


(3) For the GD approach, the listener model can train an utterance encoder and color context encoder. But why wouldn’t a listener be using decoders, since listeners can be intuitively thought of as decoding? I guess decoding is just the inverse of encoding, so maybe it’s translatable?


(4) I think I’m unclear on what “ground truth” is in Figure 2a, and why we’re interested in that if humans don’t match it either sometimes. I would have thought the ground truth would be what humans do for this kind of pragmatic language use.

Tuesday, November 23, 2021

Some thoughts on Bohn et al. 2021

I think it’s really nice to see a developmental RSA model, along with explicit model comparisons. To me, this approach highlights how you can capture specific theories/hypotheses about what exactly is developing via these computational cognitive modeling “snapshots” that capture observable behavior at different ages. Also, we get to see the model-evaluation pipeline often used in RSA adult modeling now used with kids (i.e., the model makes testable predictions that are in fact tested on kids). I also appreciate how careful B&al2021 are with respect to how model parameters link to psychological processes in the discussion (they emphasize in the general discussion that their model necessarily made idealizations to be able to get anywhere).


Some other thoughts:

(1) It’s interesting to me that B&al2021 talk about children integrating all available information, in contrast to alternative models that ignore some information (and don’t do as well). I’m assuming “all” is relative, because a major part of language development is learning which part of the input signal is relevant. For instance, speaker voice pitch is presumably available information, but I don’t think B&al2021 would consider it relevant for the inference process they’re interested in. But I do get that they’re contrasting the winning model with one that ignores some available relevant information.


(2) I feel like the way B&al2021 talk about informativity differs at points. In one sense, they talk about an informative and cooperative speaker, which seems to link with the general RSA framework of speaker utility as maximizing correct listener inference. In another sense, they connect informativity to alpha specifically, which seems like a narrower sense of “informativity”, maybe tied to how much above 1 alpha is (and therefore how deterministic the probabilities are that the speaker uses).
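
For reference, the narrower sense I have in mind is the alpha in the standard RSA speaker rule (my sketch of the usual setup with utterance costs dropped, not necessarily B&al2021’s exact parameterization):

    import numpy as np

    def speaker_probs(log_listener_probs, alpha):
        # P_S1(u | referent) is proportional to exp(alpha * log P_L0(referent | u))
        utilities = alpha * np.array(log_listener_probs)
        exp_u = np.exp(utilities - utilities.max())
        return exp_u / exp_u.sum()

    log_probs = np.log([0.6, 0.3, 0.1])          # made-up L0 values for one referent
    print(speaker_probs(log_probs, alpha=1))     # [0.6, 0.3, 0.1]: matches L0
    print(speaker_probs(log_probs, alpha=5))     # ~[0.97, 0.03, 0.00]: near-deterministic

The farther alpha gets above 1, the more the speaker probabilities pile onto the single most informative utterance, which is the “how deterministic” reading I meant.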


(3) Methodology, no-word-knowledge variant: I was still a little fuzzy even after reading the methods section about how general vocabulary size is estimated and used in place of specific word familiarity, except that of course it’s the same value for all objects (rather than in fact differing by word familiarity).


Tuesday, November 9, 2021

Some thoughts on Perfors et al. 2010

I’m reminded how much I enjoy this style of modeling work. There’s a lot going on, but the intuitions and motivations for it made sense to me throughout, and I really appreciated how careful P&al2010 were in both interpreting their modeling results and connecting them to the existing developmental literature.


Some thoughts:

(1)  I generally am really a fan of building less in, but building it in more abstractly. This approach makes the problem of explaining where that built-in stuff comes from easier --  if you have to explain where fewer things came from, you have less explaining to do.


(2) I really appreciate how careful P&al2010 are with their conclusions about the value of having verb classes. It does seem like the model with classes (K-L3) captures the age-related effect of less overgeneralization much more strongly while the one with a single verb class (L3) doesn’t. But, P&al2010 still note that both technically capture the key effects. Qualitative developmental pattern as the official evaluation measure, check! (Something we see a lot in modeling work, because then you don’t have to explain every nuance of the observed behavior;  instead you can say the model can predict something that matters a lot for producing that observed behavior, even if it’s not the only thing that matters.)


(3) Study 3: It might seem strange to try to add more to the model in Study 2 that already seems to capture the known empirical developmental data with just syntactic distribution information. But, the thing we always have to remember is that learning any particular thing doesn’t occur in a vacuum -- if information is in the input that’s useful, and children don’t filter it out for some reason, then they probably do in fact use it and it’s helpful to see what impact this has on an explanatory model like this. Basically, does the additional information intensify the model-generated patterns or muck them up, especially if it’s noisy? This can tell us about whether kids could be using this additional information (or when they’re using it) or maybe should ignore it, for instance. This comes back at the end of the results presentation, when P&al2010 mention that having 13 features with only 6 being helpful ruins the model -- the model can’t ignore the other 7, tries to incorporate them, and gets mucked up.  Also, as P&al2010 demonstrate here, this approach could differentiate between different model types (i.e., representational theories here: with verb classes vs. without).


(4) Small implementation thing: In Study 3, when noise is added to the semantic feature correlations, so that the appropriate semantic feature only appears 60% of the time: Presumably this would be implemented across verb instances, rather than only 60% of the verbs in that class having the feature? Otherwise, if some verbs always had the feature and some didn’t, I would think the model would probably end up inferring different classes for each syntactic type instead of just one per syntactic type, e.g., a PD-only class with the P feature and a PD-only class with no feature.
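
Just to make the two readings concrete (my sketch, not P&al2010’s actual implementation):

    import random

    def feature_noise_by_token(verb_tokens, p=0.6):
        # reading 1: each verb *instance* shows the semantic feature with probability 0.6
        return [(v, random.random() < p) for v in verb_tokens]

    def feature_noise_by_type(verb_tokens, p=0.6):
        # reading 2: 60% of verb *types* always show the feature, the other 40% never do
        has_feature = {v: (random.random() < p) for v in set(verb_tokens)}
        return [(v, has_feature[v]) for v in verb_tokens]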


Wednesday, October 27, 2021

Some thoughts on Tal et al. 2021

This seemed to me like a straightforward application of a measure of redundancy (measuring whatever level of representation you like) to quantify redundancy in child-directed speech over developmental time. As T&al2021 note, the idea of repetition and redundancy in child-directed speech isn’t new, but this way of measuring it is, and the results certainly accord with current wisdom that (i) repetition in speech is helpful for young children, and (ii) repetition gets less as children get older (and the speech directed at them gets more adult-like). The contributions therefore also seem pretty straightforward: a new, more holistic measure of repetition/redundancy at the lexical level, and the finding that multi-word utterances seem to be the thing that gets repeated less as children get older.


Some other thoughts:

(1) Corpus analysis: For the Providence corpus, with such large samples, I wonder why T&al2021 chose to make only two age bins (12-24 months, and 24-36 months). It seems like there would be enough data there to go finer-grained (like maybe every two months: 12-14, 14-16, etc), and especially zoom in on the gaps in the NewmanRatner corpus between 12 and 24 months.


(2) I had some confusion over the discussion of the NewmanRatner results, regarding the entropy decrease they found with the shuffled word order of Study 2. In particular, I think the explanation for the entropy decrease was that lexical diversity didn’t increase in this sample as children got older. But, I didn’t quite follow why this explained the entropy decrease. More specifically, if lexical diversity stays the same, the shuffled word order keeps the same frequencies of individual words over time, so no change in entropy at the lexical level. With shuffled word order, the multi-word sequences are destroyed, so that should increase entropy. How does no change + entropy increase lead to an overall entropy decrease? 


Relatedly, T&al2021 say about Study 2 that “the opposite tendencies of lexical- and multi-word repetitiveness in this corpus seem to cancel each other out at 11 months”. This relates to my confusion above. Basically, we have constant lexical diversity, so there’s no change to entropy over time coming from the lexical level. Decreasing multi-word repetitions leads to higher entropy over time. What are the opposite tendencies here? It seems like there’s only one tendency (increasing entropy from the loss of the multi-word repetitions).
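
To convince myself of the individual pieces here, a toy check (this is not T&al2021’s actual measure, which works over multi-word chunks):

    import math, random
    from collections import Counter

    def unigram_entropy(words):
        counts = Counter(words)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    words = "look at the doggie look at the doggie what a nice doggie".split()
    shuffled = words[:]
    random.shuffle(shuffled)

    # shuffling leaves unigram frequencies untouched, so lexical-level entropy is identical...
    print(unigram_entropy(words) == unigram_entropy(shuffled))   # True
    # ...while anything sensitive to multi-word sequences should generally go up,
    # since repeated chunks like "look at the doggie" get destroyed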


Thursday, October 14, 2021

Some thoughts on Harmon et al. 2021

 I think it’s a testament to the model description that the simulations seemed almost unnecessary to me -- they turned out exactly as (I) expected, given what the model is described as trying to do, based on the frequency of novel types. I also really love seeing modeling work of this kind used to investigate developmental language disorders -- I feel like there’s just not as much of this kind of work out there, and the atypical development community really benefits from it. That said, I do think the paper suffers a bit from length limitations. I definitely had points of confusion about what conceptually was going on (more on this below).


(1) Production probability: The inference problem is described as trying to identify the “production probability”, but it took me a while to figure out what this might be referring to. For instance, does “production probability” refer to the probability that this item will take some kind of morphology (i.e., be “productive”) vs. not in some moment? If an item has a production probability of, say, .5, does that mean that the item is actually “fully” productive, but that productivity is only accessed 50% of the time (so it would be a deployment issue that we see 50% in the output)? Or does it mean that only 50% of the inflections that should be used with that item are actually used (e.g., -ed but not -ing)? (That seems more like a representation issue.) Or does “production probability” mean something else?


I guess here, if H&al2021 are focusing on just one morpheme, it would be the deployment option, since that morpheme is either used or not. Later on, H&al2021 talk about this probability as “the probability for the inflection”, which does make me think it’s how often one inflection applies, which also aligns with the deployment option. Even later, when talking about the Pitman-Yor process, it seems like H&al2021 are talking about the probability assigned to the fragment that incorporates the inflection directly. So, this corresponds to how often that fragment gets deployed, I think.
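
For my own reference, the standard Pitman-Yor predictive probabilities, as I understand the usual setup (made-up parameter values, and not necessarily H&al2021’s exact parameterization):

    def py_reuse_prob(count_k, n, a=1.0, d=0.5):
        # probability of re-deploying a stored fragment k that has been used count_k times out of n total
        return (count_k - d) / (n + a)

    def py_new_prob(n, num_fragments, a=1.0, d=0.5):
        # probability of building something new from the base grammar instead
        return (a + d * num_fragments) / (n + a)

    # a fragment like [STEM + -ed] gets a deployment probability that grows with how often
    # it has been used before -- which is how I'm reading "production probability" here
    print(py_reuse_prob(count_k=40, n=100))        # ~0.391
    print(py_new_prob(n=100, num_fragments=10))    # ~0.059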


(2) Competition, H&al2021 start a train of thought with “if competition is too difficult to resolve on the fly”: I don’t think I understand what “competition” means in this case. That is, what does it mean not to resolve the competition? I thought what was going on was that if the production probability is too low, the competition is lost (resolved) in favor of the non-inflected form. But this description makes it sound like the competition is a separate process (maybe among all the possible inflected forms?), and if that “doesn’t resolve”, then the inflected form loses to another option (which is compensation).


(3) In the description of the Procedural Deficit Hypothesis, DLD kids are said to “produce an unproductive rule”: I don’t think I follow what this means -- is it that these kids produce a form that should be unproductive, like “thank” for think-past tense? This doesn’t seem to align with “memorization using the declarative memory system”, unless these kids are hearing “thank” as think-past tense in their input (which seems unlikely). Maybe this was a typo for “produce an uninflected form”?


(4) The proposed account of H&al2021 is that children are trying to access appropriate semantics, and not just the appropriate form (i.e., they prioritize meaning); so, this is why bare forms win out.  This makes intuitive sense to me from a bottleneck standpoint. If you want to get your message across, you prioritize content over form. This is what little typically-developing kids do, too, during telegraphic speech.


(5) Potentially related work on productivity: I’m honestly surprised there’s no mention of Yang’s work on productivity here -- he has a whole book of work on it (Yang 2016), and his approach focuses on specifying how many types are necessary for a rule to be productive, which seems relevant here.

 

Yang, C. (2016). The price of linguistic productivity: How children learn to break the rules of language. MIT Press.


(6) During inference, the modeled learner is given parsed input and has to infer fragments: So the assumption is that the DLD child perceived the form and the inflection correctly in the input, but the issue is retrieving that form and inflection during production. I guess this is because DLD kids comprehend morphology just fine, but struggle with production?


(7) Results: “the results of t tests showed that in all models, the probability of producing wug was higher than wugged...due to the high frequency of the base form”: Was this true even for the TD (typically developing child) model? If so, isn’t that not what we want to see, because TD children pass the wug test? 


Also, were these the only two alternatives available, or were other inflectional options on the table too? 


Also, is it that the modeled child just picked the one with the highest probability? 


Are the only options available the chunked inflections (including the null of the bare form), or are fragments that just have STEM + INFLECTION (without specifying the inflection) also possible? If so, how can we tell that option from the STEM + null of the bare form in practice? Both would result in the bare form, I would think.


(8) In the discussion, processing difficulties are said to skew the intake to have fewer novel types, which is crucial for inferring productivity. So, this means that kids don’t infer a high enough probability for the productive fragment, as it were; I guess this doesn’t affect their comprehension, because they can still use the less efficient fragments to parse the input (but maybe not parse it as fast). So maybe this is a more specific hypothesis about the “processing difficulties” that cause them not to parse novel types in the input that well?


(9) Discussion, “past tense rule in the DLD models was not entirely unproductive”: Is this because the fragment probability wasn’t 0? Or, how low does it have to be to be considered unproductive? This brings me back to Yang’s work, where there’s a specific threshold. Below that threshold, it’s unproductive. And that threshold can actually be pretty high  (like, definitely above 50%).
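
For reference, the threshold I’m thinking of is Yang’s Tolerance Principle; a quick sketch:

    import math

    def max_exceptions(n_types):
        # Tolerance Principle (Yang 2016): a rule over N types stays productive
        # only if its exceptions e satisfy e <= N / ln(N)
        return n_types / math.log(n_types)

    for n in (10, 50, 200):
        print(n, round(max_exceptions(n), 1))   # 10 -> 4.3, 50 -> 12.8, 200 -> 37.7

Equivalently, the proportion of types that have to follow the rule is 1 - 1/ln(N), which is above 50% once N is bigger than about 7 or 8 types.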


(10) Discussion, the qualitative pattern match with TD kids is higher than with DLD kids: I get that qualitative pattern matching is important and useful when talking about child behavior, but 90-95% production vs. 30-60% production look pretty different in Figure 3. I guess Figure 3’s in log space, and who knows what other linking components are involved. But still, I feel like it would have been rhetorically more effective to talk about higher vs. lower usage than give the actual percentages here.


(11) Discussion, “possible that experience with fewer verb types in the past tense, especially with higher frequency, biases children with DLD to store a large number of inflected verbs as a single unit (stem plus inflection) compared to TD children, further undermining productivity": This description makes it sound like storing STEM + inflection directly isn’t productive. But, I thought that was the productive fragment we wanted. Or was this meant as a particular stem + inflection, like hug + ed?

Tuesday, February 23, 2021

Some thoughts on Tenenbaum et al. 2020

I think it’s a really interesting and intuitive idea to add semantic constraints to the task of morphology identification. That said, I do wonder how much of the morphology (prefixes and suffixes) might already come for free from the initial speech segmentation process. (I’m reminded of work in Bayesian segmentation strategies, where we definitely get some morphology like -ing sliced off for free with some implementations.) If those morphology pieces are already available, perhaps it becomes easier to implement semantically-constrained generalization over morphology transforms. Here, it seems like a lot of the struggle is in the plausibility of the particular algorithm chosen for identifying suffix morphology. Perhaps that could all be sidestepped.

Relatedly, a major issue for me was understanding how the algorithm underlying the developmental model works (more on this below). I’m unclear on what seem to be important implementational details if we want to make claims about cognitive plausibility. But I love the goal of increasing developmental plausibility!


Other specific thoughts:


(1) The goal of identifying transforms: In some sense, this is the foundation of morphology learning systems (e.g., Yang 2002, 2005, 2016) that assume the child already recognizes a derived form as an instance of a root form (e.g., kissed-kiss, drank-drink, sung-sing, went-go). For those approaches, the child knows “kissed” is the past tense of “kiss” and “drank” is the past tense of “drink” (typically because the child has an awareness of the meaning similarity). Then, the child tries to figure out if the -ed transformation or the -in- → -an- transformation is productive morphology. Here, it’s about recognizing valid morphology transforms to begin with (is -in- → -an- really a thing that relates drink-drank and sing-sang?), so it’s a precursor step.


(2) On computational modeling as a goal: For me, it’s funny to state outright that a goal is to build a computational model of some process. Left implicit is why someone would want to do this. (Of course, it’s because a computational model allows us to make concrete the cognitive process we think is going on -- here, a learning theory for morphology -- and then evaluate the predictions that implemented theory makes. But experience has taught me that it’s always a good idea to say this kind of thing explicitly.)


(3) Training GloVe representations on child-directed speech: I love this. It could well be that the nature of children’s input structures the meaning space in a different way than adult linguistic input does, and this could matter for capturing non-adult-like behavior in children.


(4) Morphology algorithm stuff: In general, some of the model implementation details are unclear for me, and it seems important to understand what they are if we want to make claims that a certain algorithm is capturing the cognitive computations that humans are doing.


(a) Parameter P determines which sets (unmodeled, base, derived) the proposed base and derived elements can come from. So this means they don’t just come from the unmodeled set? I think I don’t understand what P is. Does this mean both the “base” and “derived” elements of a pair could come from, say, the “base” set? Later on, they discuss the actual P settings they consider, with respect to “static” vs “non-static”. I don’t quite know what’s going on there, though -- why do the additional three settings for the “Nonstatic” value intuitively connect to a “Nonstatic” rather than “Static” approach? It’s clearly something to do with allowing things to move in and out of the derived bin, in addition to in and out of the base bin...


(b) One step is to discard transforms that don’t meet a “threshold of overlap ratio”. What is this? Is this different from T? It seems like it, but what does it refer to?


(c) Another step is to rank remaining transforms according to the number of wordpairs they explain, with ties broken by token counts. So, token frequency does come back into play, even though the basic algorithm operates over types? I guess the frequencies come from the CHILDES data aggregates.


(d) If the top candidate transform explains >= W wordpairs, it’s kept. So, does this mean the algorithm is only evaluating the top transform each time? That is, it’s discarding the information from all the other potential transforms? That doesn’t seem very efficient...but maybe this has to do with explicit hypothesis testing, with the idea that the child can only entertain one hypothesis at a time…


(e) Each base/derived word pair explained by the new transform is moved to the Base/Derived bins. The exception is if the base form was in the derived bin before; in this case, it doesn’t move. So, if an approved transform seems to actually explain a derived1/derived2 pair, the derived1 element doesn’t go into the base bin? Is the transform still kept? I guess so?



(5) Performance is assessed via hits vs. false alarms, so I think this is an ROC curve. I like the signal detection theory approach, but then shouldn’t we be able to capture performance holistically for each combination by looking at the area under the curve?


Relatedly, transforms are counted as valid if they’re connected to at least three correct base/derived wordpairs, even if they’re also connected to any number of other spurious ones. So, a transform is “correct” if it has at least three hits (a recall-style criterion), regardless of precision. Okay...this seems a bit arbitrary, though. Why focus on recall, rather than precision, for correctness? This seems particularly salient given the discussion a bit further on in the paper that “reliability” (i.e., precision) would better model children’s learning.


Note: I agree that high precision for early learning (<1 year) is more important than high recall. But I wonder what age this algorithm is meant to be applying to, and if that age would still be better modeled by high precision at the expense of high recall. 


Note 2 from the results later on: I do like seeing qualitative comparison to developmental data, discussing how a particular low-resource setting can capture 8 of the most common valid transforms children have.


(6) T&al2020 talk about a high-resource vs. a low-resource learner. But why not call the high-resource learner an idealized/computational-level learner? Unless Lignos & colleagues meant this to be a process/algorithmic-level learner? (It doesn’t seem like it, but then maybe they were less concerned about some of the cognitive plausibility aspects.)


(7) Fig 3 & 4, and comparisons: 


(a) Fig 3 & 4: I’d love to see the Lignos et al. version with no semantic information for all the parameter values manipulated here. That seems like an easy thing to do (just remove the semantic filtering, but still allow variation for the top number of suffixes N, wordpair threshold W, and permitted wordpairs P for the high-resource learners; for the low-resource learners, just vary W and P). Then, you could also easily compare the area under the curve for this baseline (no semantics) model vs. the semantics models for all the learners (not just the high-resource ones). And that then would make the conclusion that the learners who use semantics do better more robust. (Side note: I totally believe that semantics would help. But it would be great to see that explicitly in the analysis, and to understand exactly how much it helps the different types of learners, both high-resource and low-resource).


(b) Fig 4: I do appreciate the individual parameter exploration, but I’d also like to see a full low-resource learner combination [VC=Full, EC=CHILDES, N=3], too -- at least, if we want to claim that the more developmentally-plausible learners can still benefit from semantic info like this. This is talked about in the discussion some (i.e., VC=Full, EC=CHILDES, N=15 does as well as the original Lignos settings), but it’d be nice to see this plotted in a Figure-4-style plot for easy comparison.


(8) Which morphological transforms we’re after: In the discussion, T&al2020 note that they only focus on suffixes, and certainly the algorithm is only tuned to suffixes. It definitely seems like a more developmentally-plausible algorithm would be able to use meaning to connect more disparate derived forms to their base forms (e.g., drink-drank, think-thought). I’d love to see an algorithm that uses semantic similarity (and syntactic context) as the primary considerations, and then how close the base is to the derived form as a secondary consideration. This would allow the irregulars (like drink-drank, think-thought) to emerge as connected wordpairs. (T&al2020 do sketch some ideas in this direction in the next section, when they talk about model generalizability of morphology, and morphology clustering.)


(9) In the model extension part, T&al2020 say they want to get a “token level understanding of segmentation”. I’m not sure what this means -- is this the clustering together of different morphological transforms that apply to specific words? (I’d call this types, rather than tokens if so.)


(10) T&al2020’s proposed semantic constraint is that valid morphological transforms should connect pairs of base and derived forms that are offset in a consistent direction in semantic space. Hmmm...I guess the idea is that the semantic information encoded by a transform (e.g., past tense, plural, ongoing action) is consistent, so that should be detectable. That doesn’t seem crazy, certainly as a starting hypothesis. My concern in the practical implementation T&al2020 try is the GloVe semantic space, which may or may not actually have this property. The semantic space of embedding models is strange, and not usually very interpretable (currently) in the ways we might hope it to be. But I guess the brief practical demonstration T&al2020 do for their H3 morpheme transforms shows a proof of concept, even if it’s a mystery how a child would agglomeratively cluster things just so. That proof of concept does show it’s in fact possible to cluster just so over the GloVe-defined difference vectors.
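
Here’s my own gloss of the constraint (not T&al2020’s implementation), just to make “consistent direction” concrete:

    import numpy as np

    def offset_consistency(base_vecs, derived_vecs):
        # if a transform is valid, the derived-minus-base difference vectors should point
        # roughly the same way, so their average pairwise cosine similarity should be high
        diffs = np.array(derived_vecs, dtype=float) - np.array(base_vecs, dtype=float)
        diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
        sims = diffs @ diffs.T
        n = len(diffs)
        return (sims.sum() - n) / (n * (n - 1))   # mean off-diagonal cosine

    # made-up 3-d "embeddings" where each derived form is its base plus roughly the same offset
    base =    [[1.0, 0.2, 0.0], [0.4, 1.0, 0.1], [0.0, 0.3, 0.9]]
    derived = [[1.5, 0.7, 0.1], [0.9, 1.5, 0.3], [0.4, 0.8, 1.0]]
    print(offset_consistency(base, derived))   # high (close to 1) for a consistent transform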


Thursday, February 4, 2021

Some thoughts on Fox & Katzir 2020

I think one of the main things that struck me is the type of iterated rationality models (IRMs) that F&K2020 discuss -- those IRMs don’t seem like any of the ones I’ve seen in the cognitively-oriented literature that connects with human behavioral or developmental data. That is, in footnote 3, F&K2020 note that there’s an IRM approach that assumes grammatical derivation of alternatives, and then uses probabilistic reasoning to disambiguate those alternatives in context. They don’t have a problem with this IRM approach, and think it’s compatible with the grammatical approach they favor. So, if we’re using this IRM approach, then the worries that F&K2020 highlight don’t apply? In my own collaborative work, for instance, I’m pretty sure we always talk about our IRM (i.e., RSA) models as ambiguity resolution among grammatical options that were already derived, though we can assign priors to them and so include how expensive it is to access those options.


Other thoughts:

(1)  My take on footnote 4 and related text: there’s a conceptual separation between the creation of alternatives (syntactic/semantic computation) and how we choose between those alternatives (which typically involves probabilities). I know there’s a big debate about whether this conceptual separation is cognitively real, and I think that’s what’s being alluded to here.


(2) The comparison “grammatical approach”: I’m curious about the evaluation metrics being used for theory comparison here -- in terms of acquisition, the grammatical approach requires language-specific knowledge (presumably innate?) in the form of the Exh operator, the “innocent inclusion”, and “innocent exclusion” operations. From this perspective, it’s putting a lot of explanatory work onto the development of this language-specific knowledge, compared with domain-general probabilistic reasoning mechanisms. I guess F&K2020 are focused more on issues of empirical coverage, with the multiplier conjunctive reading example not being handled by Franke’s approach.


(3) In section 6 on probabilities and modularity, F&K2020 discuss how probabilities could be part of the initial computations of SIs. I think I’m now starting to blur between this and the version of IRMs that F&K2020 were okay with, which is when IRMs have possibilities that are calculated from the grammar (e.g., with the semantics) and then the IRM uses recursive social reasoning to choose among those possibilities in context. It seems like the “SI calculation” part is about navigating the possibilities (here: the options on the scale that come from the semantics). So, RSA models that purport to capture SIs (even if relying on scale options that come from the grammar) would be exactly the IRMs that F&K2020 would be unhappy with.


(4) In 6.3, F&K2020 mention that priors could be “formal constructs defined internally to the system.” This is clearly an option that F&K2020 think is viable (even if they don’t favor it), so it seems important to understand what this means. But I’m unclear myself on how to interpret that phrase. Would this mean that there are probabilities available beforehand (therefore making them priors), but they’re not tied to anything external (like beliefs about the world, or how often a particular interpretation has occurred)? They’re just...probabilities that get generated somehow for possible interpretations?


Wednesday, December 2, 2020

Some thoughts on Caplan et al. 2020

I appreciate seeing existence proofs like the one C&al2020 provide here -- more specifically, the previous article by PTG seemed to invite an existence proof that certain properties of a lexicon (ambiguous words being short, frequent, and easy to articulate) could arise from something besides communicative efficiency. C&al2020 then obliged them by providing an existence proof grounded in empirical data. I admit that I had some confusion about the specifics of the communicative efficiency debate (more on this below) as well as PTG’s original findings (more on this below too), but this may be due to actual vagueness in how “communicative efficiency” is talked about in general. 


Specific thoughts:

(1) Communicative efficiency: It hit me right in the introduction that I was confused about what communicative efficiency was meant to be. In the introduction, it sounds like the way “communicative efficiency” is defined is with respect to ambiguity. That is, ambiguity is viewed as not communicatively efficient. But isn’t it efficient for the speaker? It’s just not so helpful for the listener. So, this means communicative efficiency is about comprehension, rather than production.  


Okay. Then, efficiency is about something like information transfer (or entropy reduction, etc.). This then makes sense with the Labov quote at the beginning that talks about the “maximization of information’’ as a signal of communicative efficiency. That is, if you’re communicatively efficient, you maximize information transfer to the listener.


Then, we have the callout to Darwin, with the idea that “better, shorter, and easier forms are constantly gaining the upper hand”.  Here, “better” and “easier” need to be defined. (Shorter, at least, we can objectively measure.) That is, better for who? Easier for who? If we continue with the idea from before, that we’re maximizing information transfer, it’s better and easier for the listener. But of course, we could also define "better" and "easier" for the speaker. In general, it seems like there’d be competing pressure between forms that are better and easier for the speaker vs. forms that are better and easier for the listener. This also reminds me of some of the components of the Rational Speech Act framework, where there’s a speaker cost function to capture how good (or not) a form is for the speaker vs. the surprisal function that captures how good (or not) a form is inferred to be for the listener. Certainly, surprisal comes back in the measures used by PTG  as well as by C&al2020.


Later on, both the description of Zipf’s Principle of Least Effort and the PTG 2012 overview make it sound like communicative efficiency is about how effortful it is for the speaker, rather than focusing on the information transfer to the listener. Which is it? Or are both meant to be considered for communicative efficiency? It seems like both ought to be, which gets us back to the idea of competing pressures…I guess one upshot of C&al2020’s findings is that we don’t have to care about this thorny issue because we can generate lexicons that look like human language lexicons without relying on communicative efficiency considerations.

(2) 2.2, language models: I was surprised by the amount of attention given to phonotactic surprisal, because I think the main issue is that a statistical model of language is needed and that requires us to make commitments about what we think the language model looks like. This should be the very same issue we see for word-based surprisal. That is, surprisal is the negative log probability of $thing (word or phonological unit), given some language model that predicts how that $thing arises based on the previous context. But it seemed like C&al2020 were less worried about this for word-based surprisal than for phonotactic surprisal, and I’m not sure why.
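
Just to spell out the quantity I mean in both cases (made-up probabilities):

    import math

    def surprisal(prob):
        # -log2 P($thing | context), in bits, for whatever language model supplied the probability
        return -math.log2(prob)

    # the very same word or phone sequence gets a different surprisal depending on the
    # language model doing the predicting, which is the commitment issue in both cases
    print(surprisal(0.01))   # ~6.64 bits under, say, a context-blind unigram estimate
    print(surprisal(0.20))   # ~2.32 bits under a model that actually uses the preceding context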


(3) The summary of PTG’s findings: I would have appreciated a slightly more leisurely walkthrough of PTG’s main findings -- I wasn’t quite sure I got the interpretations right as it was. Here’s what I think I understood: 


(a) homophony: negatively correlated with word length and frequency (so more homophony = shorter words and ...lower frequency words???). It’s also negatively correlated with phonotactic surprisal in 2 of 3 languages (so more homophony = lower surprisal = more frequent phonotactic sequences).


(b) polysemy: negatively correlated with word length, frequency, and phonotactic surprisal (so more polysemous words = shorter, less frequent??, and less surprising = more frequent phonotactic sequences).


(c) syllable informativity: negatively correlated with length in phones, frequency, and phonotactic surprisal (so, the more informative (=the less frequent), the shorter in phones, the lower in frequency (yes, by definition), and the lower the surprisal (so the higher the syllable frequency?))


I think C&al2020’s takeaway message from all 3 of these results was this: “Words that are shorter, more frequent, and easier to produce are more ambiguous than words that are longer, less frequent, and harder to produce”. The only thing is that I struggled a bit to get this from the specific correlations noted. But okay, if we take this at face value, then ambiguity goes hand-in-hand with being shorter, more frequent, and less phonologically surprising = all about easing things for the speaker. (So, it doesn’t seem like ambiguity and communicative efficiency are at odds with each other, if communicative efficiency is defined from the speaker’s perspective.)


(4) Implementing the semantic constraint on the phonotactic monkey model: The current implementation of meaning similarity uses an idealized version (100 x 100 two-dimensional space of real numbers), where points close to each other have more similar meanings. It seems like a natural extension of this would be to try it with actual distributed semantic representations like GloVe or RoBERTa. I guess maybe it’s unclear what additional value this adds to the general argument here -- that is, the current paper is written as “you asked for an existence proof of how lexicons like this could arise without communicative considerations; we made you one”. Yet, at the end, it does sound like C&al2020 would like to have the PSM model be truly considered as a cognitively plausible model of lexicon generation (especially when tied to social networks). If so, then an updated semantic implementation might help convince people that this specific non-communicative-efficiency approach is viable, rather than there simply being a non-communicative-efficiency approach out there that will work.


(5) In 5.3, C&al2020 highlight what the communicative efficiency hypothesis would predict for lexicon change. In particular:


(a) Reused forms should be more efficient than stale forms (i.e., shorter, more frequent, less surprising syllables)


(b) New forms should use more efficient phonotactics (i.e., more frequent, less surprising)


But aren’t these what C&al2020 just showed as something that could result from the PM and PSM models, and so a non-communicative-efficiency approach could also have them? Or is this the point again? I thought at this point that C&al2020 aimed to already show that these predictions aren’t unique to the communicative efficiency hypothesis. (Indeed, this is what they show in the next figure, as they note that PSM English better exploits inefficiencies in the English lexicon by leveraging phonotactically possible, but unused, short words). I guess this is just a rhetorical strategy that I got temporarily confused by.


Tuesday, November 17, 2020

Some thoughts on Matusevych et al. 2020

I really like seeing this kind of model comparison work, as computational models like this encode specific theories of a developmental process (here, how language-specific sound contrasts get learned). I think we see a lot of good practices demonstrated in this paper when it comes to this approach, especially when borrowing models from the NLP world: using naturalistic data, explicitly highlighting the model distinctions and what they mean in terms of representation and learning mechanism, comparing model output to observable behavioral data (more on this below), and generating testable behavioral predictions that will distinguish currently-winning models. 


Specific thoughts:

(1) Comparing model output to observable behavior: I love that M&al2020 do this with their models, especially since most previous models tried to learn unobservable theoretically-motivated representations. This is so useful. If you want the model’s target to be an unobserved knowledge state (like phonetic categories), you’re going to have a fight with the people who care about that knowledge representation level -- namely, is your target knowledge the right form? If instead you make the model’s target some observable behavior, then no one can argue with you. The behavior is an empirical fact, and your model either can generate it or not. It saves much angst on the modeling, and makes for far more convincing results. Bonus: You can then peek inside the model to see what representation it used to generate the observed behavior, and potentially inform the debates about what representation is the right one.


(2) Simulating the ABX task results: So, this seemed a little subtle to me, which is why I want to spell out what I understood (which may well be not quite right). Model performance is calculated by how many individual stimuli the model gets right -- for instance, none = 0% discrimination, 50% = chance performance; 100% = perfect discrimination. I guess maybe this deals with the discrimination threshold issue (i.e., how you know if a given stimulus pair is actually different enough to be discriminated) by just treating each stimulus as a probabilistic draw from a distribution? That is, highly overlapping distributions means A-X is often the same as B-X, and so this works out to no discrimination...I think I need to think this through with the collective a little. It feels like the model’s representation is the draw from a distribution over possible representations, and then that’s what gets translated into the error rate. So, if you get enough stimuli, you get enough draws, and that means the aggregate error rate captures the true degree of separation for these representations. I think?
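
Here’s how I understand the machine ABX score is usually computed in this line of work (my sketch; it may not match M&al2020’s exact implementation):

    import numpy as np

    def abx_correct(rep_a, rep_b, rep_x, dist=lambda u, v: np.linalg.norm(u - v)):
        # X comes from A's category, so the model is "correct" on this triple
        # if its representation puts X closer to A than to B
        return dist(rep_a, rep_x) < dist(rep_b, rep_x)

    # with heavily overlapping category distributions, accuracy over many triples hovers
    # near chance (0.5); well-separated categories push it toward 1.0
    rng = np.random.default_rng(0)
    accuracy = np.mean([abx_correct(rng.normal(0.0, 1.0, 10),
                                    rng.normal(0.2, 1.0, 10),
                                    rng.normal(0.0, 1.0, 10)) for _ in range(2000)])
    print(accuracy)   # only a bit above 0.5 for these mostly-overlapping categories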


(3) On the weak word-level supervision: This turns out to be recognizing that tokens of a word are in fact the same word form. That’s not crazy from an acquisition perspective -- meaning could help determine that the same lexical item was used in context (e.g., “kitty” one time and “kitty” another time when pointing at the family pet).


(4) Cognitive plausibility of the models: So what strikes me about the RNN models is that they’re clearly coming from the engineering side of the world -- I don’t know if we have evidence that humans do this forced encoding-decoding process. It doesn’t seem impossible (after all, we have memory and attention bottlenecks galore, especially as children), but I just don’t know if anyone’s mapped these autoencoder-style implementations to the cognitive computations we think kids are doing. So, even though the word-level supervision part of the correspondence RNNs seems reasonable, I have no idea about the other parts of the RNNs. Contrast this with the Dirichlet process Gaussian mixture model -- this kind of generative model is easy to map to a cognitive process of categorization, and the computation carried out by the MCMC sampling can be approximated by humans (or so it seems).


(5) Model input representations: MFCCs from 25ms long frames are used. M&al2020 say this is grounded in human auditory processing. This is news to me! I had thought MFCCs were something that NLP had found worked, but we didn’t really know about links to human auditory perception. Wikipedia says the mel (M) part is what’s connected to human auditory processing, in that the spacing of the bands by “mel” is what approximates the human auditory response. But the rest of the process of getting MFCCs from the acoustic input, who knows? This contrasts with using something like phonetic features, which certainly seems to be more like our conscious perception of what’s in the acoustic signal. 
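
For reference, the mel part at least is easy to write down (the standard formula as I know it; the rest of the MFCC pipeline is the part that feels like pure engineering to me):

    import math

    def hz_to_mel(f_hz):
        # the common mel-scale mapping -- the "M" in MFCC, and the piece usually credited
        # with approximating human auditory frequency resolution
        return 2595 * math.log10(1 + f_hz / 700)

    print(hz_to_mel(500), hz_to_mel(1000), hz_to_mel(4000))
    # ~607, ~1000, ~2146: spacing compresses as frequency goes up, roughly like perception does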


Still, M&al2020 then use speech alignments that map chunks of speech to corresponding phones. So, I think that the alignment process on the MFCCs yields something more like what linguistic theory bases things on, namely phones that would be aggregated together into phonetic categories.


Related thought, from the conclusion: “models learning representations directly from unsegmented natural speech can correctly predict some of the infant phone discrimination data”. Notably, there’s the transformation into MFCCs and speech alignment into phones, so the unit of representation is something more like phones, right? (Or whole words of MFCCs for the (C)AE-RNN models?) So should we take away something about what the infant unit of speech perception is from there, or not? I guess I can’t tell if the MFCC transformation and phone alignment is meant as an algorithmic-level description of how infants would get their phone-like/word-like representations, or if instead it’s a computational-level implementation where we think infants get phone-like/word-like representations out, but infants need to approximate the computation performed here.


(6) Data sparseness: Blaming data sparseness for no model getting the Catalan contrast doesn’t seem crazy to me. Around 8 minutes of Catalan training data (if I’m reading Table 3 correctly) isn’t a lot. If I’m reading Table 3 incorrectly, and it’s actually under 8 hours of Catalan training data, that still isn’t a lot. I mean, we’re talking less than a day’s worth of input for a child, even if this is in hours.


(7) Predictions for novel sound contrasts: I really appreciate seeing these predictions, and brief discussion of what the differences are (i.e., the CAE-RNN is better for differences in length, while the DPGMM is better for ones that observably differ in short time slices). What I don’t know is what to make of that -- and presumably M&al2020 didn’t either. They did their best to hook these findings into what’s known about human speech perception (i.e., certain contrasts like /θ/ are harder for human listeners and are harder for the CAE-RNN too), but the general distinction of length vs. observable short time chunks is unexplained. The only infant data to hook back into is whether certain contrasts are realized earlier than others, but the Catalan one was the earlier one at 8 months, and no model got that.