Tuesday, November 23, 2021

Some thoughts on Bohn et al. 2021

I think it’s really nice to see a developmental RSA model, along with explicit model comparisons. To me, this approach highlights how you can capture specific theories/hypotheses about what exactly is developing via these computational cognitive modeling “snapshots” that capture observable behavior at different ages. Also, we get to see the model-evaluation pipeline often used in adult RSA modeling now used with kids (i.e., the model makes testable predictions that are in fact tested on kids). I also appreciate how careful B&al2021 are in the general discussion about how model parameters link to psychological processes (they emphasize that their model necessarily made idealizations to be able to get anywhere).


Some other thoughts:

(1) It’s interesting to me that B&al2021 talk about children integrating all available information, in contrast to alternative models that ignore some information (and don’t do as well). I’m assuming “all” is relative, because a major part of language development is learning which parts of the input signal are relevant. For instance, speaker voice pitch is presumably available information, but I don’t think B&al2021 would consider it relevant for the inference process they’re interested in. But I do get that they’re contrasting the winning model with one that ignores some available relevant information.


(2) I feel like the way B&al2021 talk about informativity shifts at points. In one sense, they talk about an informative and cooperative speaker, which links to the general RSA framing of speaker utility as maximizing correct listener inference. In another sense, they tie informativity to alpha specifically, which seems like a narrower sense of “informativity”, maybe tied to how far above 1 alpha is (and therefore how deterministic the probabilities are that the speaker uses).
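For concreteness, here’s a minimal sketch of the standard RSA speaker rule and the role alpha plays in it (my own toy version, ignoring utterance costs and priors -- not B&al2021’s actual implementation):

import numpy as np

def pragmatic_speaker(literal_listener, alpha=1.0):
    # literal_listener: utterances x referents matrix of P_L0(referent | utterance)
    # alpha: the speaker "rationality"/informativity parameter; alpha = 1 just
    # matches the literal listener's probabilities, while larger alpha makes the
    # speaker's choices increasingly deterministic
    with np.errstate(divide="ignore"):
        utility = np.log(literal_listener)          # utility = log P_L0(referent | utterance)
    scores = np.exp(alpha * utility)                # equivalently, P_L0 ** alpha
    return scores / scores.sum(axis=0, keepdims=True)  # normalize over utterances per referent

# Toy case: "dax" applies to both referents, "wug" only to referent 2
L0 = np.array([[0.5, 0.5],
               [0.0, 1.0]])
print(pragmatic_speaker(L0, alpha=1.0))   # softer preferences
print(pragmatic_speaker(L0, alpha=5.0))   # near-deterministic preferences

With alpha = 1 the speaker just matches the literal listener; cranking alpha up sharpens those same preferences, which is the narrower “informativity” reading I mean above.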


(3) Methodology, no-word-knowledge variant: Even after reading the methods section, I was still a little fuzzy on how general vocabulary size is estimated and then used in place of specific word familiarity, beyond the fact that it’s of course the same value for all objects (rather than actually differing by word familiarity).


Tuesday, November 9, 2021

Some thoughts on Perfors et al. 2010

I’m reminded how much I enjoy this style of modeling work. There’s a lot going on, but the intuitions and motivations for it made sense to me throughout, and I really appreciated how careful P&al2010 were in both interpreting their modeling results and connecting them to the existing developmental literature.


Some thoughts:

(1) I’m generally a big fan of building less in, but building it in more abstractly. This approach makes the problem of explaining where that built-in stuff comes from easier -- if you have to explain where fewer things came from, you have less explaining to do.


(2) I really appreciate how careful P&al2010 are with their conclusions about the value of having verb classes. It does seem like the model with verb classes (K-L3) captures the age-related decrease in overgeneralization much more strongly than the one with a single verb class (L3) does. But P&al2010 still note that both technically capture the key effects. Qualitative developmental pattern as the official evaluation measure, check! (Something we see a lot in modeling work, because then you don’t have to explain every nuance of the observed behavior; instead you can say the model captures something that matters a lot for producing that observed behavior, even if it’s not the only thing that matters.)


(3) Study 3: It might seem strange to try to add more to the Study 2 model, which already seems to capture the known empirical developmental data with just syntactic distribution information. But the thing we always have to remember is that learning any particular thing doesn’t occur in a vacuum -- if useful information is in the input, and children don’t filter it out for some reason, then they probably do in fact use it, and it’s helpful to see what impact this has on an explanatory model like this. Basically, does the additional information intensify the model-generated patterns or muck them up, especially if it’s noisy? This can tell us whether kids could be using this additional information (or when they’re using it), or maybe should ignore it, for instance. This comes back at the end of the results presentation, when P&al2010 mention that having 13 features with only 6 being helpful ruins the model -- the model can’t ignore the other 7, tries to incorporate them, and gets mucked up. Also, as P&al2010 demonstrate here, this approach can differentiate between different model types (i.e., representational theories here: with verb classes vs. without).


(4) Small implementation thing: In Study 3, when noise is added to the semantic feature correlations so that the appropriate semantic feature only appears 60% of the time: Presumably this is implemented across verb instances (each instance of a verb carries the feature 60% of the time), rather than only 60% of the verb types in that class having the feature? Otherwise, if some verbs always had the feature and some never did, I would think the model would probably end up inferring two classes per syntactic type instead of just one, e.g., a PD-only class with the P feature and a PD-only class with no feature.
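Just to pin down the two readings, a quick sketch with made-up verbs (my illustration, not P&al2010’s actual setup):

import random

verbs = ["give", "send", "throw", "toss", "hand"]   # hypothetical verbs in one class
instances_per_verb = 20
p_feature = 0.6

# Reading 1 (per instance): every usage of every verb independently carries the
# semantic feature with probability 0.6, so each verb shows the feature sometimes.
per_instance = {v: [random.random() < p_feature for _ in range(instances_per_verb)]
                for v in verbs}

# Reading 2 (per type): each verb is decided once -- roughly 60% of the verbs always
# carry the feature and the rest never do, which could split the class in two.
per_type = {v: [random.random() < p_feature] * instances_per_verb for v in verbs}

The per-type version is the one I’d expect to fragment the class; the per-instance version just makes the feature a noisy cue for every verb in it.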


Wednesday, October 27, 2021

Some thoughts on Tal et al. 2021

This seemed to me like a straightforward application of a measure of redundancy (at whatever level of representation you like) to quantify redundancy in child-directed speech over developmental time. As T&al2021 note, the idea of repetition and redundancy in child-directed speech isn’t new, but this way of measuring it is, and the results certainly accord with current wisdom that (i) repetition in speech is helpful for young children, and (ii) repetition decreases as children get older (and the speech directed at them gets more adult-like). The contributions therefore also seem pretty straightforward: a new, more holistic measure of repetition/redundancy at the lexical level, and the finding that multi-word utterances seem to be the thing that gets repeated less as children get older.


Some other thoughts:

(1) Corpus analysis: For the Providence corpus, with such large samples, I wonder why T&al2021 chose to make only two age bins (12-24 months, and 24-36 months). It seems like there would be enough data there to go finer-grained (like maybe every two months: 12-14, 14-16, etc), and especially zoom in on the gaps in the NewmanRatner corpus between 12 and 24 months.


(2) I had some confusion over the discussion of the NewmanRatner results, regarding the entropy decrease they found with the shuffled word order of Study 2. In particular, I think the explanation for the entropy decrease was that lexical diversity didn’t increase in this sample as children got older. But, I didn’t quite follow why this explained the entropy decrease. More specifically, if lexical diversity stays the same, the shuffled word order keeps the same frequencies of individual words over time, so no change in entropy at the lexical level. With shuffled word order, the multi-word sequences are destroyed, so that should increase entropy. How does no change + entropy increase lead to an overall entropy decrease? 
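To make my reasoning concrete, here’s a toy check of the two pieces (my sketch, not T&al2021’s actual measure, which I believe involves more than these unigram/bigram estimates): shuffling word order leaves unigram frequencies, and so lexical-level entropy, untouched, while it destroys repeated multi-word chunks and so raises sequence-level entropy.

import math
import random
from collections import Counter

def unigram_entropy(words):
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bigram_entropy(words):
    counts = Counter(zip(words, words[1:]))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

corpus = ("do you want the ball " * 50 + "where is the ball " * 50).split()
shuffled = corpus[:]
random.shuffle(shuffled)

print(unigram_entropy(corpus), unigram_entropy(shuffled))  # identical: same word counts
print(bigram_entropy(corpus), bigram_entropy(shuffled))    # shuffled is higher: chunks destroyed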


Relatedly, T&al2021 say about Study 2 that “the opposite tendencies of lexical- and multi-word repetitiveness in this corpus seem to cancel each other out at 11 months”. This relates to my confusion above. Basically, we have constant lexical diversity, so there’s no change to entropy over time coming from the lexical level. Decreasing multi-word repetition leads to higher entropy over time. What are the opposite tendencies here? It seems like there’s only one tendency (increasing entropy from the loss of the multi-word repetitions).


Thursday, October 14, 2021

Some thoughts on Harmon et al. 2021

I think it’s a testament to the model description that the simulations seemed almost unnecessary to me -- they turned out exactly as (I) expected, given what the model is described as trying to do, based on the frequency of novel types. I also really love seeing modeling work of this kind used to investigate developmental language disorders -- there’s just not much work like this out there, and the atypical development community really benefits from it. That said, I do think the paper suffers a bit from length limitations. I definitely had points of confusion about what conceptually was going on (more on this below).


(1) Production probability: The inference problem is described as trying to identify the “production probability”, but it took me a while to figure out what this might be referring to. For instance, does “production probability” refer to the probability that this item will take some kind of morphology (i.e., be “productive”) vs. not in some moment? If an item has a production probability of, say, .5, does that mean that the item is actually “fully” productive, but that productivity is only accessed 50% of the time (so it would be a deployment issue that we see 50% in the output)? Or does it mean that only 50% of the inflections that should be used with that item are actually used (e.g., -ed but not -ing)? (That seems more like a representation issue.) Or does “production probability” mean something else?


I guess here, if H&al2021 are focusing on just one morpheme, it would be the deployment option, since that morpheme is either used or not. Later on, H&al2021 talk about this probability as “the probability for the inflection”, which does make me think it’s how often one inflection applies, which also aligns with the deployment option. Even later, when talking about the Pitman-Yor process, it seems like H&al2021 are talking about the probability assigned to the fragment that incorporates the inflection directly. So, this corresponds to how often that fragment gets deployed, I think.
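For my own bookkeeping, here’s what I assume the relevant Pitman-Yor machinery looks like for a stored fragment (the standard predictive probabilities; H&al2021’s actual equations may differ in the details):

def py_fragment_probs(count_k, total_count, num_fragments, discount, concentration):
    # Standard Pitman-Yor predictive probabilities: reuse stored fragment k vs.
    # build something new. count_k = uses of fragment k, total_count = uses of all
    # stored fragments, num_fragments = number of distinct stored fragments.
    p_reuse = (count_k - discount) / (total_count + concentration)
    p_new = (concentration + num_fragments * discount) / (total_count + concentration)
    return p_reuse, p_new

print(py_fragment_probs(count_k=2, total_count=100, num_fragments=10,
                        discount=0.5, concentration=1.0))
# -> (~0.015, ~0.059): a rarely used fragment gets a low reuse probability

On this reading, a STEM + -ed fragment that has rarely been deployed gets a low reuse probability, which fits the deployment interpretation above.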


(2) Competition, H&al2021 start a train of thought with “if competition is too difficult to resolve on the fly”: I don’t think I understand what “competition” means in this case. That is, what does it mean not to resolve the competition? I thought what was going on was that if the production probability is too low, the competition is lost (resolved) in favor of the non-inflected form. But this description makes it sound like the competition is a separate process (maybe among all the possible inflected forms?), and if that “doesn’t resolve”, then the inflected form loses to another option (which is compensation).


(3) In the description of the Procedural Deficit Hypothesis, DLD kids are said to “produce an unproductive rule”: I don’t think I follow what this means -- is it that these kids produce a form that should be unproductive, like “thank” for think-past tense? This doesn’t seem to align with “memorization using the declarative memory system”, unless these kids are hearing “thank” as think-past tense in their input (which seems unlikely). Maybe this was a typo for “produce an uninflected form”?


(4) The proposed account of H&al2021 is that children are trying to access appropriate semantics, and not just the appropriate form (i.e., they prioritize meaning); so, this is why bare forms win out.  This makes intuitive sense to me from a bottleneck standpoint. If you want to get your message across, you prioritize content over form. This is what little typically-developing kids do, too, during telegraphic speech.


(5) Potentially related work on productivity: I’m honestly surprised there’s no mention of Yang’s work on productivity here -- he has a whole book on it (Yang 2016), and his approach focuses on specifying how many types are necessary for a rule to be productive, which seems directly relevant.
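For reference, the core idea there (the Tolerance Principle) is simple to state -- a quick sketch, assuming I’m remembering the formulation correctly:

import math

def tolerance_threshold(num_types):
    # Yang's (2016) Tolerance Principle: a rule over N item types stays
    # productive only if its exceptions number at most N / ln(N).
    return num_types / math.log(num_types)

print(tolerance_threshold(20))   # ~6.7: a rule over 20 verb types tolerates at most 6 exceptions

So productivity is thresholded by type counts, which seems like exactly the quantity (novel types in the intake) that H&al2021’s account cares about.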

 

Yang, C. (2016). The price of linguistic productivity: How children learn to break the rules of language. MIT Press.


(6) During inference, the modeled learner is given parsed input and has to infer fragments: So the assumption is that the DLD child perceived the form and the inflection correctly in the input, but the issue is retrieving that form and inflection during production. I guess this is because DLD kids comprehend morphology just fine, but struggle with production?


(7) Results: “the results of t tests showed that in all models, the probability of producing wug was higher than wugged...due to the high frequency of the base form”: Was this true even for the TD (typically developing child) model? If so, isn’t that not what we want to see, because TD children pass the wug test? 


Also, were these the only two alternatives available, or were other inflectional options on the table too? 


Also, is it that the modeled child just picked the one with the highest probability? 


Are the only options available the chunked inflections (including the null of the bare form), or are fragments that just have STEM + INFLECTION (without specifying the inflection) also possible? If so, how can we tell that option from the STEM + null of the bare form in practice? Both would result in the bare form, I would think.


(8) In the discussion, processing difficulties are said to skew the intake to have fewer novel types, which is crucial for inferring productivity. So, this means that kids don’t infer a high enough probability for the productive fragment, as it were; I guess this doesn’t affect their comprehension, because they can still use the less efficient fragments to parse the input (but maybe not parse it as fast). So maybe this is a more specific hypothesis about the “processing difficulties” that cause them not to parse novel types in the input that well?


(9) Discussion, “past tense rule in the DLD models was not entirely unproductive”: Is this because the fragment probability wasn’t 0? Or, how low does it have to be to be considered unproductive? This brings me back to Yang’s work, where there’s a specific threshold. Below that threshold, it’s unproductive. And that threshold can actually be pretty high  (like, definitely above 50%).


(10) Discussion, the qualitative pattern match with TD kids is higher than with DLD kids: I get that qualitative pattern matching is important and useful when talking about child behavior, but 90-95% production vs. 30-60% production look pretty different in Figure 3. I guess Figure 3 is in log space, and who knows what other linking components are involved. But still, I feel like it would have been rhetorically more effective to talk about higher vs. lower usage than to give the actual percentages here.


(11) Discussion, “possible that experience with fewer verb types in the past tense, especially with higher frequency, biases children with DLD to store a large number of inflected verbs as a single unit (stem plus inflection) compared to TD children, further undermining productivity": This description makes it sound like storing STEM + inflection directly isn’t productive. But, I thought that was the productive fragment we wanted. Or was this meant as a particular stem + inflection, like hug + ed?

Tuesday, February 23, 2021

Some thoughts on Tenenbaum et al. 2020

I think it’s a really interesting and intuitive idea to add semantic constraints to the task of morphology identification. That said, I do wonder how much of the prefix and suffix morphology might already come for free from the initial speech segmentation process. (I’m reminded of work on Bayesian segmentation strategies, where we definitely get some morphology like -ing sliced off for free with some implementations.) If those morphology pieces are already available, perhaps it becomes easier to implement semantically-constrained generalization over morphology transforms. Here, it seems like a lot of the struggle is in the plausibility of the particular algorithm chosen for identifying suffix morphology. Perhaps that could all be sidestepped.

Relatedly, a major issue for me was understanding how the algorithm underlying the developmental model works (more on this below). I’m unclear on what seem to be important implementational details if we want to make claims about cognitive plausibility. But I love the goal of increasing developmental plausibility!


Other specific thoughts:


(1) The goal of identifying transforms: In some sense, this is the foundation of morphology learning systems (e.g., Yang 2002, 2005, 2016) that assume the child already recognizes a derived form as an instance of a root form (e.g., kissed-kiss, drank-drink, sung-sing, went-go). For those approaches, the child knows “kissed” is the past tense of “kiss” and “drank” is the past tense of “drink” (typically because the child has an awareness of the meaning similarity). Then, the child tries to figure out if the -ed transformation or the -in- → -an- transformation is productive morphology. Here, it’s about recognizing valid morphology transforms to begin with (is -in- → -an- really a thing that relates drink-drank and sing-sang?), so it’s a precursor step.


(2) On computational modeling as a goal: For me, it’s funny to state outright that a goal is to build a computational model of some process. Left implicit is why someone would want to do this. (Of course, it’s because a computational model allows us to make concrete the cognitive process we think is going on -- here, a learning theory for morphology -- and then evaluate the predictions that implemented theory makes. But experience has taught me that it’s always a good idea to say this kind of thing explicitly.)


(3) Training GloVe representations on child-directed speech: I love this. It could well be that the nature of children’s input structures the meaning space in a different way than adult linguistic input does, and this could matter for capturing non-adult-like behavior in children.


(4) Morphology algorithm stuff: In general, some of the model implementation details are unclear for me, and it seems important to understand what they are if we want to make claims that a certain algorithm is capturing the cognitive computations that humans are doing.


(a) Parameter P determines which sets (unmodeled, base, derived) the proposed base and derived elements can come from. So this means they don’t just come from the unmodeled set? I think I don’t understand what P is. Does this mean both the “base” and “derived” elements of a pair could come from, say, the “base” set? Later on, they discuss the actual P settings they consider, with respect to “static” vs “non-static”. I don’t quite know what’s going on there, though -- why do the additional three settings for the “Nonstatic” value intuitively connect to a “Nonstatic” rather than “Static” approach? It’s clearly something to do with allowing things to move in and out of the derived bin, in addition to in and out of the base bin...


(b) One step is to discard transforms that don’t meet a “threshold of overlap ratio”. What is this? Is this different from T? It seems like it, but what does it refer to?


(c) Another step is to rank remaining transforms according to the number of wordpairs they explain, with ties broken by token counts. So, token frequency does come back into play, even though the basic algorithm operates over types? I guess the frequencies come from the CHILDES data aggregates.


(d) If the top candidate transform explains >= W wordpairs, it’s kept. So, does this mean the algorithm is only evaluating the top transform each time? That is, it’s discarding the information from all the other potential transforms? That doesn’t seem very efficient...but maybe this has to do with explicit hypothesis testing, with the idea that the child can only entertain one hypothesis at a time...


(e) Each base/derived word pair explained by the new transform is moved to the Base/Derived bins. The exception is if the base form was in the derived bin before; in this case, it doesn’t move. So, if an approved transform actually seems to explain a derived1/derived2 pair, the derived1 element doesn’t go into the base bin? Is the transform still kept? I guess so?
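Putting (c)-(e) together, here’s the greedy loop I think is being described -- my reconstruction from the text, with hypothetical data structures, not Lignos et al.’s or T&al2020’s actual code:

def select_transforms(candidates, W, max_rounds=100):
    # candidates: list of dicts, each with a "pairs" list of (base, derived) word
    # types it explains and a "token_count" used only to break ranking ties (step c)
    kept = []
    base_bin, derived_bin = set(), set()
    for _ in range(max_rounds):
        if not candidates:
            break
        # rank by number of wordpair types explained, then by token count
        candidates.sort(key=lambda t: (len(t["pairs"]), t["token_count"]), reverse=True)
        top = candidates[0]
        if len(top["pairs"]) < W:          # step (d): only the top candidate is checked
            break
        kept.append(candidates.pop(0))
        derived_before = set(derived_bin)  # snapshot for the step (e) exception
        for base, derived in top["pairs"]:
            derived_bin.add(derived)
            if base not in derived_before:  # a base already filed as derived stays put
                base_bin.add(base)
    return kept, base_bin, derived_bin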



(5) Performance is assessed via hits vs. false alarms, so I think this is an ROC curve. I like the signal detection theory approach, but then shouldn’t we be able to capture performance holistically for each combination by looking at the area under the curve?
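Concretely, if each parameter combination yields a set of (false alarm, hit) points, something like this would give one holistic number per combination (a sketch, assuming the rates are already computed):

import numpy as np

def roc_auc(false_alarm_rates, hit_rates):
    # Area under the ROC curve via the trapezoid rule, anchoring the curve
    # at (0, 0) and (1, 1) and sorting points by false-alarm rate.
    order = np.argsort(false_alarm_rates)
    fa = np.concatenate(([0.0], np.asarray(false_alarm_rates, dtype=float)[order], [1.0]))
    hits = np.concatenate(([0.0], np.asarray(hit_rates, dtype=float)[order], [1.0]))
    return float(np.sum((fa[1:] - fa[:-1]) * (hits[1:] + hits[:-1]) / 2.0))

print(roc_auc([0.1, 0.3, 0.6], [0.4, 0.7, 0.9]))   # one number per parameter combination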


Relatedly, transforms are counted as valid if they’re connected to at least three correct base/derived wordpairs, even if they’re also connected to any number of other spurious ones. So, a transform is “correct” if it has at least three true positives, regardless of precision. Okay...this seems a bit arbitrary, though. Why focus on a recall-style criterion, rather than precision, for correctness? This seems particularly salient given the discussion a bit further on in the paper that “reliability” (i.e., precision) would better model children’s learning.


Note: I agree that high precision for early learning (<1 year) is more important than high recall. But I wonder what age this algorithm is meant to be applying to, and if that age would still be better modeled by high precision at the expense of high recall. 


Note 2 from the results later on: I do like seeing qualitative comparison to developmental data, discussing how a particular low-resource setting can capture 8 of the most common valid transforms children have.


(6) T&al2020 talk about a high-resource vs. a low-resource learner. But why not call the high-resource learner an idealized/computational-level learner? Unless Lignos & colleagues meant this to be a process/algorithmic-level learner? (It doesn’t seem like it, but then maybe they were less concerned about some of the cognitive plausibility aspects.)


(7) Fig 3 & 4, and comparisons: 


(a) Fig 3 & 4: I’d love to see the Lignos et al. version with no semantic information for all the parameter values manipulated here. That seems like an easy thing to do (just remove the semantic filtering, but still allow variation for the top number of suffixes N, wordpair threshold W, and permitted wordpairs P for the high-resource learners; for the low-resource learners, just vary W and P). Then, you could also easily compare the area under the curve for this baseline (no semantics) model vs. the semantics models for all the learners (not just the high-resource ones). And that would make the conclusion that learners who use semantics do better more robust. (Side note: I totally believe that semantics would help. But it would be great to see that explicitly in the analysis, and to understand exactly how much it helps the different types of learners, both high-resource and low-resource.)


(b) Fig 4: I do appreciate the individual parameter exploration, but I’d also like to see a full low-resource learner combination [VC=Full, EC=CHILDES, N=3], too -- at least, if we want to claim that the more developmentally-plausible learners can still benefit from semantic info like this. This is talked about in the discussion some (i.e., VC=Full, EC=CHILDES, N=15 does as well as the original Lignos settings), but it’d be nice to see this plotted in a Figure-4-style plot for easy comparison.


(8) Which morphological transforms we’re after: In the discussion, T&al2020 note that they only focus on suffixes, and certainly the algorithm is only tuned to suffixes. It definitely seems like a more developmentally-plausible algorithm would be able to use meaning to connect more disparate derived forms to their base forms (e.g., drink-drank, think-thought). I’d love to see an algorithm that uses semantic similarity (and syntactic context) as the primary considerations, and then how close the base is to the derived form as a secondary consideration. This would allow the irregulars (like drink-drank, think-thought) to emerge as connected wordpairs. (T&al2020 do sketch some ideas in this direction in the next section, when they talk about model generalizability of morphology, and morphology clustering.)
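Just to make that suggestion concrete, here’s the kind of thing I have in mind (a sketch with placeholder similarity functions and thresholds, definitely not T&al2020’s algorithm):

import numpy as np
from difflib import SequenceMatcher

def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def form_overlap(a, b):
    return SequenceMatcher(None, a, b).ratio()

def propose_pairs(words, vectors, sem_threshold=0.6, top_k=3):
    # Semantics-first pairing: semantic closeness is the filter (primary criterion),
    # and surface-form overlap only ranks the survivors (secondary criterion), so
    # irregular pairs like drink-drank aren't ruled out by their form difference.
    pairs = []
    for w1 in words:
        close = [w2 for w2 in words
                 if w2 != w1 and cosine(vectors[w1], vectors[w2]) >= sem_threshold]
        close.sort(key=lambda w2: form_overlap(w1, w2), reverse=True)
        pairs.extend((w1, w2) for w2 in close[:top_k])
    return pairs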


(9) In the model extension part, T&al2020 say they want to get a “token level understanding of segmentation”. I’m not sure what this means -- is this the clustering together of different morphological transforms that apply to specific words? (I’d call this types, rather than tokens if so.)


(10) T&al2020’s proposed semantic constraint is that valid morphological transforms should connect pairs of base and derived forms that are offset in a consistent direction in semantic space. Hmmm...I guess the idea is that the semantic information encoded by a transform (e.g., past tense, plural, ongoing action) is consistent, so that should be detectable. That doesn’t seem crazy, certainly as a starting hypothesis. My concern in the practical implementation T&al2020 try is the GloVe semantic space, which may or may not actually have this property. The semantic space of embedding models is strange, and not usually very interpretable (currently) in the ways we might hope it to be. But I guess the brief practical demonstration T&al2020 do for their H3 morpheme transforms shows a proof of concept, even if it’s a mystery how a child would agglomeratively cluster things just so. That proof of concept does show it’s in fact possible to cluster just so over the GloVe-defined difference vectors.
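The consistency property itself is easy to check directly, though -- something like this over the GloVe vectors (my toy check, not their clustering procedure):

import numpy as np

def offset_consistency(pairs, vectors):
    # For a proposed transform, compute the base-to-derived difference vectors and
    # return the mean pairwise cosine similarity: values near 1 mean the transform
    # shifts words in a consistent direction in the semantic space.
    diffs = [np.asarray(vectors[d], dtype=float) - np.asarray(vectors[b], dtype=float)
             for b, d in pairs]
    diffs = [v / np.linalg.norm(v) for v in diffs]
    sims = [float(u @ v) for i, u in enumerate(diffs) for v in diffs[i + 1:]]
    return sum(sims) / len(sims)   # needs at least two pairs

# e.g., offset_consistency([("walk", "walked"), ("jump", "jumped"), ("play", "played")],
#                          glove_vectors)   # glove_vectors: a word -> vector lookup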


Thursday, February 4, 2021

Some thoughts on Fox & Katzir 2020

I think one of the main things that struck me is the type of iterated rationality models (IRMs) that F&K2020 discuss -- those IRMs don’t seem like any of the ones I’ve seen in the cognitively-oriented literature that connects with human behavioral or developmental data. That is, in footnote 3, F&K2020 note that there’s an IRM approach that assumes grammatical derivation of alternatives, and then uses probabilistic reasoning to disambiguate those alternatives in context. They don’t have a problem with this IRM approach, and think it’s compatible with the grammatical approach they favor. So, if we’re using this IRM approach, then the worries that F&K2020 highlight don’t apply? In my own collaborative work, for instance, I’m pretty sure we always talk about our IRM (i.e., RSA) models as ambiguity resolution among grammatical options that were already derived, though we can assign priors to them and so include how expensive it is to access those options.


Other thoughts:

(1)  My take on footnote 4 and related text: there’s a conceptual separation between the creation of alternatives (syntactic/semantic computation) and how we choose between those alternatives (which typically involves probabilities). I know there’s a big debate about whether this conceptual separation is cognitively real, and I think that’s what’s being alluded to here.


(2) The comparison “grammatical approach”: I’m curious about the evaluation metrics being used for theory comparison here -- in terms of acquisition, the grammatical approach requires language-specific knowledge (presumably innate?) in the form of the Exh operator and the “innocent inclusion” and “innocent exclusion” operations. From this perspective, it puts a lot of explanatory work onto the development of this language-specific knowledge, compared with domain-general probabilistic reasoning mechanisms. I guess F&K2020 are focused more on issues of empirical coverage, with the multiplier conjunctive reading example not being handled by Franke’s approach.


(3) In section 6 on probabilities and modularity, F&K2020 discuss how probabilities could be part of the initial computations of SIs. I think I’m now starting to blur between this and the version of IRMs that F&K2020 were okay with, which is when IRMs have possibilities that are calculated from the grammar (e.g., with the semantics) and then the IRM uses recursive social reasoning to choose among those possibilities in context. It seems like the “SI calculation” part is about navigating the possibilities (here: the options on the scale that come from the semantics). So, RSA models that purport to capture SIs (even if relying on scale options that come from the grammar) would be exactly the IRMs that F&K2020 would be unhappy with.


(4) In 6.3, F&K2020 mention that priors could be “formal constructs defined internally to the system.” This is clearly an option that F&K2020 think is viable (even if they don’t favor it), so it seems important to understand what it means. But I’m unclear myself on how to interpret that phrase. Would this mean that there are probabilities available beforehand (therefore making them priors), but they’re not tied to anything external (like beliefs about the world, or how often a particular interpretation has occurred)? They’re just...probabilities that get generated somehow for possible interpretations?