Tuesday, February 23, 2021

Some thoughts on Tenenbaum et al. 2020

I think it’s a really interesting and intuitive idea to add semantic constraints to the task of morphology identification. That said, I do wonder how many of the morphology prefixes and suffixes might already come for free from the initial speech segmentation process. (I’m reminded of work on Bayesian segmentation strategies, where we definitely get some morphology like -ing sliced off for free with some implementations.) If those morphology pieces are already available, perhaps it becomes easier to implement semantically-constrained generalization over morphology transforms. Here, it seems like a lot of the struggle is in the plausibility of the particular algorithm chosen for identifying suffix morphology. Perhaps that could all be sidestepped.

Relatedly, a major issue for me was understanding how the algorithm underlying the developmental model works (more on this below). I’m unclear on what seem to be important implementational details if we want to make claims about cognitive plausibility. But I love the goal of increasing developmental plausibility!


Other specific thoughts:


(1) The goal of identifying transforms: In some sense, this is the foundation of morphology learning systems (e.g., Yang 2002, 2005, 2016) that assume the child already recognizes a derived form as an instance of a root form (e.g., kissed-kiss, drank-drink, sung-sing, went-go). For those approaches, the child knows “kissed” is the past tense of “kiss” and “drank” is the past tense of “drink” (typically because the child has an awareness of the meaning similarity). Then, the child tries to figure out if the -ed transformation or the -in- → -an- transformation is productive morphology. Here, it’s about recognizing valid morphology transforms to begin with (is -in- → -an- really a thing that relates drink-drank and sing-sang?), so it’s a precursor step.
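
To make that precursor step concrete, here’s a toy sketch (my own, not the paper’s actual algorithm) of extracting candidate suffix transforms from orthographic word pairs. Note how an infix change like -in- → -an- surfaces as two different “suffix” transforms (-ink → -ank, -ing → -ang), which is part of why suffix-only algorithms struggle to group irregulars:

```python
from collections import Counter

def suffix_transform(base, derived):
    """Strip the longest common prefix and return the (base_suffix,
    derived_suffix) pair relating two orthographic forms."""
    i = 0
    while i < min(len(base), len(derived)) and base[i] == derived[i]:
        i += 1
    return base[i:], derived[i:]

def candidate_transforms(pairs):
    """Count how many word pairs each candidate transform would explain."""
    return Counter(suffix_transform(b, d) for b, d in pairs)

pairs = [("kiss", "kissed"), ("walk", "walked"),
         ("drink", "drank"), ("sing", "sang")]
print(candidate_transforms(pairs))
# ("", "ed") explains two pairs, but drink-drank and sing-sang come out as
# separate transforms ("ink", "ank") and ("ing", "ang")
```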


(2) On computational modeling as a goal: For me, it’s funny to state outright that a goal is to build a computational model of some process. Left implicit is why someone would want to do this. (Of course, it’s because a computational model allows us to make concrete the cognitive process we think is going on -- here, a learning theory for morphology -- and then evaluate the predictions that implemented theory makes. But experience has taught me that it’s always a good idea to say this kind of thing explicitly.)


(3) Training GloVe representations on child-directed speech: I love this. It could well be that the nature of children’s input structures the meaning space in a different way than adult linguistic input does, and this could matter for capturing non-adult-like behavior in children.


(4) Morphology algorithm stuff: In general, some of the model implementation details are unclear for me, and it seems important to understand what they are if we want to make claims that a certain algorithm is capturing the cognitive computations that humans are doing.


(a) Parameter P determines which sets (unmodeled, base, derived) the proposed base and derived elements can come from. So this means they don’t just come from the unmodeled set? I think I don’t understand what P is. Does this mean both the “base” and “derived” elements of a pair could come from, say, the “base” set? Later on, they discuss the actual P settings they consider, with respect to “static” vs “non-static”. I don’t quite know what’s going on there, though -- why do the additional three settings for the “Nonstatic” value intuitively connect to a “Nonstatic” rather than “Static” approach? It’s clearly something to do with allowing things to move in and out of the derived bin, in addition to in and out of the base bin...


(b) One step is to discard transforms that don’t meet a “threshold of overlap ratio”. What is this? Is this different from T? It seems like it, but what does it refer to?


(c) Another step is to rank remaining transforms according to the number of wordpairs they explain, with ties broken by token counts. So, token frequency does come back into play, even though the basic algorithm operates over types? I guess the frequencies come from the CHILDES data aggregates.
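
As I read it, the ranking step works something like this sketch (the counts here are invented for illustration): type counts determine the ranking, and token counts only break ties:

```python
# Hypothetical transform statistics: type counts rank, token counts break ties.
transforms = {
    ("", "ed"):  {"types": 40, "tokens": 900},
    ("", "ing"): {"types": 40, "tokens": 1500},
    ("", "s"):   {"types": 35, "tokens": 2000},
}

ranked = sorted(transforms,
                key=lambda t: (transforms[t]["types"], transforms[t]["tokens"]),
                reverse=True)
print(ranked)
# ("", "ing") outranks ("", "ed") only via the token tiebreak; ("", "s")
# trails despite the highest token count, because types come first
```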


(d) If the top candidate transform explains >= W wordpairs, it’s kept. So, does this mean the algorithm is only evaluating the top transform each time? That is, it’s discarding the information from all the other potential transforms? That doesn’t seem very efficient...but maybe this has to do with explicit hypothesis testing, with the idea that the child can only entertain one hypothesis at a time…


(e) Each base/derived word pair explained by the new transform is moved to the Base/Derived bin. The exception is if the base form was in the derived bin before; in this case, it doesn’t move. So, if an approved transform seems to actually explain a derived1/derived2 pair, the derived1 element doesn’t go into the base bin? Is the transform still kept? I guess so?



(5) Performance is assessed via hits vs. false alarms, so I think this is an ROC curve. I like the signal detection theory approach, but then shouldn’t we be able to capture performance holistically for each combination by looking at the area under the curve?
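
Concretely: if each parameter setting yields a (false-alarm rate, hit rate) point, the area under the resulting curve summarizes performance in one number. A minimal sketch, with made-up points:

```python
def auc(points):
    """Trapezoid-rule area under an ROC curve given (false_alarm, hit) points,
    anchored at (0, 0) and (1, 1)."""
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

points = [(0.1, 0.6), (0.3, 0.8), (0.5, 0.9)]
print(auc(points))          # 0.815 for these made-up points
print(auc([(0.5, 0.5)]))    # chance performance sits on the diagonal: 0.5
```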


Relatedly, transforms are counted as valid if they’re connected to at least three correct base/derived wordpairs, even if they’re also connected to any number of other spurious ones. So, a transform is “correct” if it has at least three hits (a recall-style criterion), regardless of precision. Okay...this seems a bit arbitrary, though. Why focus on recall, rather than precision, for correctness? This seems particularly salient given the discussion a bit further on in the paper that “reliability” (i.e., precision) would better model children’s learning.
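
A sketch of the contrast as I understand it (the counts are hypothetical): the validity criterion is a raw hit count, blind to spurious pairs, whereas precision would penalize them:

```python
def is_valid(correct_pairs, spurious_pairs, min_hits=3):
    """The criterion as I read it: a raw hit count, blind to spurious pairs."""
    return correct_pairs >= min_hits

def precision(correct_pairs, spurious_pairs):
    total = correct_pairs + spurious_pairs
    return correct_pairs / total if total else 0.0

# A transform with 3 correct pairs and 97 spurious ones still counts as
# valid, despite terrible precision.
print(is_valid(3, 97))    # True
print(precision(3, 97))   # 0.03
```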


Note: I agree that high precision for early learning (<1 year) is more important than high recall. But I wonder what age this algorithm is meant to apply to, and whether that age would still be better modeled by high precision at the expense of high recall.


Note 2 from the results later on: I do like seeing qualitative comparison to developmental data, discussing how a particular low-resource setting can capture 8 of the most common valid transforms children have.


(6) T&al2020 talk about a high-resource vs. a low-resource learner. But why not call the high-resource learner an idealized/computational-level learner? Unless Lignos & colleagues meant this to be a process/algorithmic-level learner? (It doesn’t seem like it, but then maybe they were less concerned about some of the cognitive plausibility aspects.)


(7) Fig 3 & 4, and comparisons: 


(a) Fig 3 & 4: I’d love to see the Lignos et al. version with no semantic information for all the parameter values manipulated here. That seems like an easy thing to do (just remove the semantic filtering, but still allow variation for the top number of suffixes N, wordpair threshold W, and permitted wordpairs P for the high-resource learners; for the low-resource learners, just vary W and P). Then, you could also easily compare the area under the curve for this baseline (no semantics) model vs. the semantics models for all the learners (not just the high-resource ones). And that then would make the conclusion that the learners who use semantics do better more robust. (Side note: I totally believe that semantics would help. But it would be great to see that explicitly in the analysis, and to understand exactly how much it helps the different types of learners, both high-resource and low-resource).


(b) Fig 4: I do appreciate the individual parameter exploration, but I’d also like to see a full low-resource learner combination [VC=Full, EC=CHILDES, N=3], too -- at least, if we want to claim that the more developmentally-plausible learners can still benefit from semantic info like this. This is talked about in the discussion some (i.e., VC=Full, EC=CHILDES, N=15 does as well as the original Lignos settings), but it’d be nice to see this plotted in a Figure-4-style plot for easy comparison.


(8) Which morphological transforms we’re after: In the discussion, T&al2020 note that they only focus on suffixes, and certainly the algorithm is only tuned to suffixes. It definitely seems like a more developmentally-plausible algorithm would be able to use meaning to connect more disparate derived forms to their base forms (e.g., drink-drank, think-thought). I’d love to see an algorithm that uses semantic similarity (and syntactic context) as the primary considerations, and then how close the base is to the derived form as a secondary consideration. This would allow the irregulars (like drink-drank, think-thought) to emerge as connected wordpairs. (T&al2020 do sketch some ideas in this direction in the next section, when they talk about model generalizability of morphology, and morphology clustering.)
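
A sketch of what I mean, with toy 2-D “embeddings” I made up: semantic similarity is the primary sort key, and form closeness (here, edit distance) only comes in secondarily, so irregulars like drink-drank can still find their base:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def edit_distance(a, b):
    """Standard Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Made-up embeddings in which drink/drank and think/thought pair up semantically.
vecs = {"drink": (1.0, 0.2), "drank": (0.95, 0.25),
        "think": (0.2, 1.0), "thought": (0.25, 0.95)}

def best_base(derived, candidates):
    return max(candidates,
               key=lambda b: (cosine(vecs[b], vecs[derived]),
                              -edit_distance(b, derived)))

print(best_base("drank", ["drink", "think"]))     # "drink"
print(best_base("thought", ["drink", "think"]))   # "think"
```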


(9) In the model extension part, T&al2020 say they want to get a “token level understanding of segmentation”. I’m not sure what this means -- is this the clustering together of different morphological transforms that apply to specific words? (I’d call this types, rather than tokens if so.)


(10) T&al2020’s proposed semantic constraint is that valid morphological transforms should connect pairs of base and derived forms that are offset in a consistent direction in semantic space. Hmmm...I guess the idea is that the semantic information encoded by a transform (e.g., past tense, plural, ongoing action) is consistent, so that should be detectable. That doesn’t seem crazy, certainly as a starting hypothesis. My concern with the practical implementation T&al2020 try is the GloVe semantic space, which may or may not actually have this property. The semantic space of embedding models is strange, and not usually very interpretable (currently) in the ways we might hope it to be. But I guess the brief practical demonstration T&al2020 do for their H3 morpheme transforms shows a proof of concept: it is in fact possible to cluster just so over the GloVe-defined difference vectors, even if it’s a mystery how a child would agglomeratively cluster things just so.
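
A minimal sketch of that consistency check, using made-up 2-D vectors in which past tense really is a constant offset (real GloVe space may or may not behave this well):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def diff(base, derived, vecs):
    """Difference vector from base to derived form."""
    return tuple(d - b for b, d in zip(vecs[base], vecs[derived]))

# Made-up embeddings where -ed shifts every base by roughly (0.2, -0.5).
vecs = {"walk": (1.0, 1.0), "walked": (1.2, 0.5),
        "kiss": (2.0, 2.0), "kissed": (2.2, 1.5)}

offsets = [diff("walk", "walked", vecs), diff("kiss", "kissed", vecs)]
print(cosine(offsets[0], offsets[1]))
# ~1.0: both offsets are (0.2, -0.5) up to floating point, i.e., a
# maximally consistent direction
```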


Thursday, February 4, 2021

Some thoughts on Fox & Katzir 2020

I think one of the main things that struck me is the type of iterated rationality models (IRMs) that F&K2020 discuss -- those IRMs don’t seem like any of the ones I’ve seen in the cognitively-oriented literature that connects with human behavioral or developmental data. That is, in footnote 3, F&K2020 note that there’s an IRM approach that assumes grammatical derivation of alternatives, and then uses probabilistic reasoning to disambiguate those alternatives in context. They don’t have a problem with this IRM approach, and think it’s compatible with the grammatical approach they favor. So, if we’re using this IRM approach, then the worries that F&K2020 highlight don’t apply? In my own collaborative work for instance, I’m pretty sure we always talk about our IRM (i.e., RSA) models as ambiguity resolution among grammatical options that were already derived, though we can assign priors to them and so include how expensive it is to access those options.
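
For concreteness, here’s a toy RSA-style model of that “grammar proposes, probability disposes” picture (my own minimal sketch for the some/all scale, not any published model): the candidate meanings and the literal semantics come from the grammar, and the recursive reasoning only reallocates probability among those pre-derived options:

```python
import math

meanings = ["some_not_all", "all"]
utterances = ["some", "all"]
# Literal semantics from the grammar: "some" is true of both meanings.
semantics = {("some", "some_not_all"): 1, ("some", "all"): 1,
             ("all", "some_not_all"): 0, ("all", "all"): 1}
prior = {"some_not_all": 0.5, "all": 0.5}

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def L0(u):
    # literal listener: truth-conditional semantics times the prior
    return normalize({m: semantics[(u, m)] * prior[m] for m in meanings})

def S1(m, alpha=1.0):
    # speaker: soft-max over the utterances that are true of m
    return normalize({u: math.exp(alpha * math.log(L0(u)[m]))
                      for u in utterances if semantics[(u, m)]})

def L1(u):
    # pragmatic listener: reasons about the speaker
    return normalize({m: S1(m).get(u, 0.0) * prior[m] for m in meanings})

print(L1("some"))
# "some" gets strengthened toward some-but-not-all (0.75 here), purely by
# disambiguating among grammar-derived options
```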


Other thoughts:

(1)  My take on footnote 4 and related text: there’s a conceptual separation between the creation of alternatives (syntactic/semantic computation) and how we choose between those alternatives (which typically involves probabilities). I know there’s a big debate about whether this conceptual separation is cognitively real, and I think that’s what’s being alluded to here.


(2) The comparison “grammatical approach”: I’m curious about the evaluation metrics being used for theory comparison here -- in terms of acquisition, the grammatical approach requires language-specific knowledge (presumably innate?) in the form of the Exh operator, the “innocent inclusion”, and “innocent exclusion” operations. From this perspective, it’s putting a lot of explanatory work onto the development of this language-specific knowledge, compared with domain-general probabilistic reasoning mechanisms. I guess F&K2020 are focused more on issues of empirical coverage, with the multiplier conjunctive reading example not being handled by Franke’s approach.


(3) In section 6 on probabilities and modularity, F&K2020 discuss how probabilities could be part of the initial computations of SIs. I think I’m now starting to blur between this and the version of IRMs that F&K2020 were okay with, which is when IRMs have possibilities that are calculated from the grammar (e.g., with the semantics) and then the IRM uses recursive social reasoning to choose among those possibilities in context. It seems like the “SI calculation” part is about navigating the possibilities (here: the options on the scale that come from the semantics). So, RSA models that purport to capture SIs (even if relying on scale options that come from the grammar) would be exactly the IRMs that F&K2020 would be unhappy with.


(4) In 6.3, F&K2020 mention that priors could be “formal constructs defined internally to the system.” This is clearly an option that F&K2020 think is viable (even if they don’t favor it), so it seems important to understand what this means. But I’m unclear myself on how to interpret that phrase. Would this mean that there are probabilities available beforehand (therefore making them priors), but they’re not tied to anything external (like beliefs about the world, or how often a particular interpretation has occurred)? They’re just...probabilities that get generated somehow for possible interpretations?


Wednesday, December 2, 2020

Some thoughts on Caplan et al. 2020

I appreciate seeing existence proofs like the one C&al2020 provide here -- more specifically, the previous article by PTG seemed to invite an existence proof that certain properties of a lexicon (ambiguous words being short, frequent, and easy to articulate) could arise from something besides communicative efficiency. C&al2020 then obliged them by providing an existence proof grounded in empirical data. I admit that I had some confusion about the specifics of the communicative efficiency debate (more on this below) as well as PTG’s original findings (more on this below too), but this may be due to actual vagueness in how “communicative efficiency” is talked about in general. 


Specific thoughts:

(1) Communicative efficiency: It hit me right in the introduction that I was confused about what communicative efficiency was meant to be. In the introduction, it sounds like the way “communicative efficiency” is defined is with respect to ambiguity. That is, ambiguity is viewed as not communicatively efficient. But isn’t it efficient for the speaker? It’s just not so helpful for the listener. So, this means communicative efficiency is about comprehension, rather than production.  


Okay. Then, efficiency is about something like information transfer (or entropy reduction, etc.). This then makes sense with the Labov quote at the beginning that talks about the “maximization of information” as a signal of communicative efficiency. That is, if you’re communicatively efficient, you maximize information transfer to the listener.


Then, we have the callout to Darwin, with the idea that “better, shorter, and easier forms are constantly gaining the upper hand”. Here, “better” and “easier” need to be defined. (Shorter, at least, we can objectively measure.) That is, better for whom? Easier for whom? If we continue with the idea from before, that we’re maximizing information transfer, it’s better and easier for the listener. But of course, we could also define “better” and “easier” for the speaker. In general, it seems like there’d be competing pressure between forms that are better and easier for the speaker vs. forms that are better and easier for the listener. This also reminds me of some of the components of the Rational Speech Act framework, where there’s a speaker cost function to capture how good (or not) a form is for the speaker vs. the surprisal function that captures how good (or not) a form is inferred to be for the listener. Certainly, surprisal comes back in the measures used by PTG as well as by C&al2020.


Later on, both the description of Zipf’s Principle of Least Effort and the PTG 2012 overview make it sound like communicative efficiency is about how effortful it is for the speaker, rather than focusing on the information transfer to the listener. Which is it? Or are both meant to be considered for communicative efficiency? It seems like both ought to be, which gets us back to the idea of competing pressures…I guess one upshot of C&al2020’s findings is that we don’t have to care about this thorny issue because we can generate lexicons that look like human language lexicons without relying on communicative efficiency considerations.

(2) 2.2, language models: I was surprised by the amount of attention given to phonotactic surprisal, because I think the main issue is that a statistical model of language is needed and that requires us to make commitments about what we think the language model looks like. This should be the very same issue we see for word-based surprisal. That is, surprisal is the negative log probability of $thing (word or phonological unit), given some language model that predicts how that $thing arises based on the previous context. But it seemed like C&al2020 were less worried about this for word-based surprisal than for phonotactic surprisal, and I’m not sure why.
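
Either way, the definition itself is simple; all the commitments live in the language model that supplies the conditional probability. A toy bigram version (the corpus is obviously made up):

```python
import math
from collections import Counter

# Surprisal of a unit is -log2 P(unit | context), under whatever language
# model we've committed to -- here, a bare bigram model over a tiny corpus.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def surprisal(prev, word):
    return -math.log2(bigrams[(prev, word)] / contexts[prev])

print(surprisal("the", "cat"))
# ~0.585 bits: "the" is followed by "cat" on 2 of its 3 uses as a context
```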


(3) The summary of PTG’s findings: I would have appreciated a slightly more leisurely walkthrough of PTG’s main findings -- I wasn’t quite sure I got the interpretations right as it was. Here’s what I think I understood: 


(a) homophony: negatively correlated with word length and frequency (so more homophony = shorter words and ...lower frequency words???). It’s also negatively correlated with phonotactic surprisal in 2 of 3 languages (so more homophony = lower surprisal = more frequent phonotactic sequences).


(b) polysemy: negatively correlated with word length, frequency, and phonotactic surprisal (so more polysemous words = shorter, less frequent??, and less surprising = more frequent phonotactic sequences).


(c) syllable informativity: negatively correlated with length in phones, frequency, and phonotactic surprisal (so, the more informative (= the less frequent), the shorter in phones, the lower in frequency (yes, by definition), and the lower the surprisal (so the higher the syllable frequency?)).


I think C&al2020’s takeaway message from all 3 of these results was this: “Words that are shorter, more frequent, and easier to produce are more ambiguous than words that are longer, less frequent, and harder to produce”. The only thing is that I struggled a bit to get this from the specific correlations noted. But okay, if we take this at face value, then ambiguity goes hand-in-hand with being shorter, more frequent, and less phonologically surprising = all about easing things for the speaker. (So, it doesn’t seem like ambiguity and communicative efficiency are at odds with each other, if communicative efficiency is defined from the speaker’s perspective.)


(4) Implementing the semantic constraint on the phonotactic monkey model: The current implementation of meaning similarity uses an idealized version (100 x 100 two-dimensional space of real numbers), where points close to each other have more similar meanings. It seems like a natural extension of this would be to try it with actual distributed semantic representations like GloVe or RoBERTa. I guess maybe it’s unclear what additional value this adds to the general argument here -- that is, the current paper is written as “you asked for an existence proof of how lexicons like this could arise without communicative considerations; we made you one”. Yet, at the end, it does sound like C&al2020 would like to have the PSM model be truly considered as a cognitively plausible model of lexicon generation (especially when tied to social networks). If so, then an updated semantic implementation might help convince people that this specific non-communicative-efficiency approach is viable, rather than simply that there is some non-communicative-efficiency approach out there that will work.
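
As I understand the current setup, it’s essentially this (a minimal sketch; the exact similarity function is my guess, but the point is just that similarity is inverse distance in the grid):

```python
import math
import random

random.seed(0)

def sample_meaning():
    """A meaning is a point in a 100 x 100 space of reals."""
    return (random.uniform(0, 100), random.uniform(0, 100))

def similarity(m1, m2):
    # closer points = more similar meanings; one simple monotone choice
    return 1.0 / (1.0 + math.dist(m1, m2))

a, b = sample_meaning(), sample_meaning()
print(similarity(a, b))
```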


(5) In 5.3, C&al2020 highlight what the communicative efficiency hypothesis would predict for lexicon change. In particular:


(a) Reused forms should be more efficient than stale forms (i.e., shorter, more frequent, less surprising syllables)


(b) New forms should use more efficient phonotactics (i.e., more frequent, less surprising)


But aren’t these what C&al2020 just showed as something that could result from the PM and PSM models, and so a non-communicative-efficiency approach could also have them? Or is this the point again? I thought at this point that C&al2020 aimed to already show that these predictions aren’t unique to the communicative efficiency hypothesis. (Indeed, this is what they show in the next figure, as they note that PSM English better exploits inefficiencies in the English lexicon by leveraging phonotactically possible, but unused, short words). I guess this is just a rhetorical strategy that I got temporarily confused by.


Tuesday, November 17, 2020

Some thoughts on Matusevych et al. 2020

I really like seeing this kind of model comparison work, as computational models like this encode specific theories of a developmental process (here, how language-specific sound contrasts get learned). I think we see a lot of good practices demonstrated in this paper when it comes to this approach, especially when borrowing models from the NLP world: using naturalistic data, explicitly highlighting the model distinctions and what they mean in terms of representation and learning mechanism, comparing model output to observable behavioral data (more on this below), and generating testable behavioral predictions that will distinguish currently-winning models. 


Specific thoughts:

(1) Comparing model output to observable behavior: I love that M&al2020 do this with their models, especially since most previous models tried to learn unobservable theoretically-motivated representations. This is so useful. If you want the model’s target to be an unobserved knowledge state (like phonetic categories), you’re going to have a fight with the people who care about that knowledge representation level -- namely, is your target knowledge the right form? If instead you make the model’s target some observable behavior, then no one can argue with you. The behavior is an empirical fact, and your model either can generate it or not. It saves much angst on the modeling, and makes for far more convincing results. Bonus: You can then peek inside the model to see what representation it used to generate the observed behavior, and potentially inform the debates about what representation is the right one.


(2) Simulating the ABX task results: So, this seemed a little subtle to me, which is why I want to spell out what I understood (which may well be not quite right). Model performance is calculated by how many individual stimuli the model gets right -- for instance, none = 0% discrimination; 50% = chance performance; 100% = perfect discrimination. I guess maybe this deals with the discrimination threshold issue (i.e., how you know if a given stimulus pair is actually different enough to be discriminated) by just treating each stimulus as a probabilistic draw from a distribution? That is, highly overlapping distributions means A-X is often the same as B-X, and so this works out to no discrimination...I think I need to think this through with the collective a little. It feels like the model’s representation is the draw from a distribution over possible representations, and then that’s what gets translated into the error rate. So, if you get enough stimuli, you get enough draws, and that means the aggregate error rate captures the true degree of separation for these representations. I think?
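
Here’s a Monte Carlo sketch of that intuition (my own toy version, with 1-D Gaussian “representations”): A and B are draws from two category distributions, X is another draw from A’s category, and the model counts as discriminating when X lands closer to A than to B. Overlapping distributions push accuracy toward chance:

```python
import random

random.seed(0)

def abx_accuracy(mu_a, mu_b, sd, trials=20000):
    """Fraction of ABX trials where X (drawn from A's category) is closer
    to the A draw than to the B draw."""
    correct = 0
    for _ in range(trials):
        a = random.gauss(mu_a, sd)
        b = random.gauss(mu_b, sd)
        x = random.gauss(mu_a, sd)
        correct += abs(x - a) < abs(x - b)
    return correct / trials

print(abx_accuracy(0.0, 3.0, 1.0))  # well separated: well above chance
print(abx_accuracy(0.0, 0.1, 1.0))  # heavily overlapping: near 50%
```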


(3) On the weak word-level supervision: This turns out to be recognizing that tokens of a word are in fact the same word form. That’s not crazy from an acquisition perspective -- meaning could help determine that the same lexical item was used in context (e.g., “kitty” one time and “kitty” another time when pointing at the family pet).


(4) Cognitive plausibility of the models: So what strikes me about the RNN models is that they’re clearly coming from the engineering side of the world -- I don’t know if we have evidence that humans do this forced encoding-decoding process. It doesn’t seem impossible (after all, we have memory and attention bottlenecks galore, especially as children), but I just don’t know if anyone’s mapped these autoencoder-style implementations to the cognitive computations we think kids are doing. So, even though the word-level supervision part of the correspondence RNNs seems reasonable, I have no idea about the other parts of the RNNs. Contrast this with the Dirichlet process Gaussian mixture model -- this kind of generative model is easy to map to a cognitive process of categorization, and the computation carried out by the MCMC sampling can be approximated by humans (or so it seems).


(5) Model input representations: MFCCs from 25ms long frames are used. M&al2020 say this is grounded in human auditory processing. This is news to me! I had thought MFCCs were something that NLP had found worked, but we didn’t really know about links to human auditory perception. Wikipedia says the mel (M) part is what’s connected to human auditory processing, in that the spacing of the bands by “mel” is what approximates the human auditory response. But the rest of the process of getting MFCCs from the acoustic input, who knows? This contrasts with using something like phonetic features, which certainly seems to be more like our conscious perception of what’s in the acoustic signal. 
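
For what it’s worth, the mel mapping itself is a simple compressive formula (this is one common variant of the Hz-to-mel conversion; the rest of the MFCC pipeline is a separate story):

```python
import math

def hz_to_mel(f_hz):
    """One standard Hz-to-mel formula: compresses high frequencies the way
    human pitch perception does."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000))   # ~1000 mels, by construction of the scale
print(hz_to_mel(8000))   # high frequencies get squeezed together
```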


Still, M&al2020 then use speech alignments that map chunks of speech to corresponding phones. So, I think that the alignment process on the MFCCs yields something more like what linguistic theory bases things on, namely phones that would be aggregated together into phonetic categories.


Related thought, from the conclusion: “models learning representations directly from unsegmented natural speech can correctly predict some of the infant phone discrimination data”. Notably, there’s the transformation into MFCCs and speech alignment into phones, so the unit of representation is something more like phones, right? (Or whole words of MFCCs for the (C)AE-RNN models?) So should we take away something about what the infant unit of speech perception is from there, or not? I guess I can’t tell if the MFCC transformation and phone alignment is meant as an algorithmic-level description of how infants would get their phone-like/word-like representations, or if instead it’s a computational-level implementation where we think infants get phone-like/word-like representations out, but infants need to approximate the computation performed here.


(6) Data sparseness: Blaming data sparseness for no model getting the Catalan contrast doesn’t seem crazy to me. Around 8 minutes of Catalan training data (if I’m reading Table 3 correctly) isn’t a lot. If I’m reading Table 3 incorrectly, and it’s actually under 8 hours of Catalan training data, that still isn’t a lot. I mean, we’re talking less than a day’s worth of input for a child, even if this is in hours.


(7) Predictions for novel sound contrasts: I really appreciate seeing these predictions, and brief discussion of what the differences are (i.e., the CAE-RNN is better for differences in length, while the DPGMM is better for ones that observably differ in short time slices). What I don’t know is what to make of that -- and presumably M&al2020 didn’t either. They did their best to hook these findings into what’s known about human speech perception (i.e., certain contrasts like /θ/ are harder for human listeners and are harder for the CAE-RNN too), but the general distinction of length vs. observable short time chunks is unexplained. The only infant data to hook back into is whether certain contrasts are realized earlier than others, but the Catalan one was the earlier one at 8 months, and no model got that.


Tuesday, November 3, 2020

Some thoughts on Fourtassi et al. 2020

It’s really nice to see a computational cognitive model both (i) capture previously-observed human behavior (here, very young children in a specific word-learning experimental task), and (ii) make new, testable predictions that the authors then test in order to validate the developmental theory implemented in the model. What’s particularly nice (in my opinion) about the specific new prediction made here is that it seems so intuitive in hindsight -- of *course* noisiness in the representation of the referent (here: how distinct the objects are from each other) could impact the downstream behavior being measured, since it matters for generating that behavior. But it sure wasn’t obvious to me before seeing the model, and I was fairly familiar with this particular debate and set of studies. That’s the thing about good insights, though -- they’re often obvious in hindsight, but you don’t notice them until someone explicitly points them out. So, this computational cognitive model, by concretely implementing the different factors that lead to the behavior being measured, highlighted that there’s a new factor that should be considered to explain children’s non-adult-like behavior. (Yay, modeling!)


Other thoughts:

(1) Qualitative vs. quantitative developmental change: It certainly seems difficult (currently) to capture qualitative change in computational cognitive models. One of the biggest issues is how to capture qualitative “conceptual” change in, say, a Bayesian model of development. At the moment, the best I’m aware of is implementing models that themselves individually have qualitative differences and then doing model comparison to see which best captures child behavior. But that’s about snapshots of the child’s state, not about how qualitative change happens. Ideally, what we’d like is a way to define building blocks that allow us to construct “novel” hypotheses from their combination...but then qualitative change is about adding a completely new building block. And where does that come from?


Relatedly, having continuous change (“quantitative development”) is certainly in line with the Continuity Hypothesis in developmental linguistics. Under that hypothesis, kids are just navigating through pre-defined options (that adult languages happen to use), rather than positing completely new options (which would be a discontinuous, qualitative change). 



(2) Model implementation:  F&al2020 assume an unambiguous 1-1 mapping between concepts and labels, meaning that the child has learned these mappings completely correctly in the experimental setup. Given the age of the original children (14 months, and actually 8 months too), this seems a simplification. But it’s not an unreasonable one -- importantly, if the behavioral effects can be captured without making this model more complicated, then that’s good to know. That means the main things that matter don’t include this assumption about how well children learn the labels and mappings in the experimental setup.


(3) Model validation with kids and adults: Of course we can quibble with the developmental difference between a 4-year-old and a 14-month-old when it comes to their perceptions of the sounds that make up words and referent distinctiveness. But as a starting proof of concept to show that visual salience matters, I think this is a reasonable first step. A great followup is to actually run the experiment with 14-month-olds, and vary the visual salience just the same way, as alluded to in the general discussion.


(4) Figure 6: Model 2 (sound fuzziness = visual referent fuzziness) is pretty good at matching kids and adults, but Model 3 (sound fuzziness isn’t the same amount as visual referent fuzziness) is a little better. I wonder, though: is Model 3 enough better to justify the additional model complexity? Model 2 accounting for 0.96 of the variance seems pretty darned good.
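
One standard way to cash out “enough better, given the extra complexity” would be an information criterion like AIC, which trades off log-likelihood against parameter count. A hedged sketch (the log-likelihoods below are invented; the real comparison would need the actual fits):

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: Model 3 fits slightly better but has one more parameter.
aic_model2 = aic(log_likelihood=-100.0, n_params=2)
aic_model3 = aic(log_likelihood=-99.5, n_params=3)
print(aic_model2, aic_model3)
# 204.0 vs 205.0 -> Model 2 preferred here: the fit gain doesn't pay for
# the extra parameter
```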


So, suppose we say that Model 2 is actually the best, once we take model complexity into account. The implication is interesting -- perceptual fuzziness, broadly construed, is what’s going on, whether that fuzziness is over auditory stimuli or visual stimuli (or over categorizations based on those auditory and visual stimuli, like phonetic categories and object categories). This contrasts with domain-specific fuzziness, where auditory stimuli have their fuzziness and visual stimuli have a different fuzziness (i.e., Model 3). So, if this is what’s happening, would this be more in line with some common underlying factor that feeds into perception, like memory or attention?


F&al2020 are very careful to note that their model doesn’t say why the fuzziness goes away, just that it goes away as kids get older. But I wonder...


(5) On minimal pairs for learning: I think another takeaway of this paper is that minimal pairs in visual stimuli -- just like minimal pairs in auditory stimuli -- are unlikely to be helpful for young learners. This is because young kids may miss that there are two things (i.e., word forms or visual referents) that need to be discriminated (i.e., by having different meanings for the word forms, or different labels for the visual referents). Potential practical advice with babies: Don’t try to point out tiny contrasts (auditory or visual) to make your point that two things are different. That’ll work better for adults (and older children).


(6) A subtle point that I really appreciated being walked through: F&al2020 note that just because their model predicts that kids have higher sound uncertainty than adults doesn’t mean their model goes against previous accounts showing that children are good at encoding fine phonetic detail. Instead, the issue may be about what kids think is a categorical distinction (i.e., how kids choose to view that fine phonetic detail) -- so, the sound uncertainty could be due to downstream processing of phonetic detail that’s been encoded just fine.


Monday, October 19, 2020

Some thoughts on Ovans et al. 2020

I really enjoy seeing this kind of precise quantitative investigation of children’s input, and how it can explain their non-adult-like behavior. This particular case involves language processing, and recovering from an incorrect parse, and the upshot is that kids may be doing perfectly sensible things on these test stimuli, given children’s input.  This underscores for me how ridiculously hard it is to consider everything when you’re designing behavioral experiments with kids, and the value of quantitative work for teasing apart the viability of possible explanations (here: immature executive function vs. mature inference over the differently-skewed dataset of child-directed speech). 


Other specific thoughts:

(1) The importance of the model assumptions: Here, surprisal is the main metric, and its precise value of course depends on the language model you’re using. In this case, O&al2020 thought the specific verbs (like “put”) were important to separate out in the language model, because of the known lexical restrictions on verb arguments (and therefore possible parses). If they hadn’t done this, they might have gotten very different surprisal values, as the probabilities for “put” parses would have been aggregated with the probabilities for other verbs like “eat” and “hug”.
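Just to make the lexicalization point concrete, here’s a toy surprisal calculation -- all the counts are hypothetical, and this is not O&al2020’s actual language model:

```python
import math

def surprisal(count, total):
    """Surprisal in bits: -log2 P(event)."""
    return -math.log2(count / total)

# Hypothetical continuation counts after a verb (e.g., a goal PP like "on/into..."):
# "put" almost always takes a goal PP; other verbs rarely do.
counts = {
    "put": {"pp_goal": 90, "np_only": 10},
    "eat": {"pp_goal": 5,  "np_only": 95},
    "hug": {"pp_goal": 5,  "np_only": 95},
}

# Lexicalized: condition on the specific verb.
lex = surprisal(counts["put"]["pp_goal"], sum(counts["put"].values()))

# Aggregated: collapse all verbs into one generic "verb" category.
agg_goal = sum(v["pp_goal"] for v in counts.values())
agg_total = sum(sum(v.values()) for v in counts.values())
agg = surprisal(agg_goal, agg_total)

print(f"P(goal PP | put):      surprisal = {lex:.2f} bits")
print(f"P(goal PP | any verb): surprisal = {agg:.2f} bits")
```

With these toy numbers, the aggregated model makes a goal PP after “put” look far more surprising than it really is, which is exactly the distortion separating out the verbs avoids.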


It’s because of this importance that I had something of a mental hiccup at the beginning of section 3, before I realized that more detail about the exact language model would come later in section 3.2. ;) 


I also want to note that I don’t think it’s crazy to have grammar rules separated out by the verb lexical item, precisely because of how the argument distributions can depend on the verb. But, this does mean that you get a lot of duplication in PCFG rules (e.g., VP_eat and VP_drink look pretty similar, but are treated as completely separate). And when there’s duplication, we may miss generalizations.


(2)  Related thought, from section 4.2: “...our calculation of surprisal included a measure of lexical frequency, and for children, each noun token was relatively unexpected” -- I thought only the verbs were lexicalized (and that seems to be what Figures 2 and 3 would suggest on the x axis labels: put the.1 noun.1 prep.1 the.2 noun.2…). So, where does noun lexical frequency come into this? Why wouldn’t all nouns simply be “noun”? I think I may have misunderstood something in the language model.


(3) O&al2020 find low surprisal at the disambiguating P (e.g., “into”), and interpret that to mean children don’t detect that a reparse is needed. Just to check my understanding: The issue that children have is detecting that they misparsed, given the probability of the word coming next. The explanation O&al2020 give is that children are getting surprised by other things in the sentence (like open-class words like nouns), so the relative strength of the error signal from the disambiguating P slips under their detection radar. That is, lots of things are surprising to kids because they don’t have as much experience with language, so the “you parsed it wrong” surprise is relatively less than it is for adults. That seems reasonable. 


Of course, then O&al2020 themselves note that this is slightly weird, because surprisal then isn’t about parsing disambiguation, even though it’s actually implemented here by summing over possible parses. Except, is this that weird? For the nouns, the parse is simply whether that lexical item can go in that position (caveat: assuming we have the lexical items for nouns and not just the Noun category). That’s a general integration cost, though it’s being classified as a “parse”. If we just think about surprisal as integration, is this explanation really so strange? Integrating open-class words like nouns is harder than integrating closed-class words like determiners and prepositions. So, any integration difficulty that a preposition signals can be overshadowed by the difficulty a noun causes.



Tuesday, May 26, 2020

Some thoughts on Liu et al 2019

I really appreciate this paper’s goal of concretely testing different accounts of island constraints, and the authors' intuition that the frequency of the lexical items involved may well have something to do with the (un)acceptability of the island structures they look at. This is something near and dear to my heart, since Jon Sprouse and I worked on a different set of island constraints a few years back (Pearl & Sprouse 2013) and found that the lexical items used as complementizers really mattered. 

Pearl, L., & Sprouse, J. (2013). Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20(1), 23-68.

I do think the L&al2019 paper was a little crunched for space, though -- there were several points where I felt like the reasoning flew by too fast for me to follow (more on this below).


Specific thoughts:
(1) Frequency accounts believe that acceptability is based on exposure. This makes total sense to me for lexical-item-based islands. I wonder if I’d saturate on whether and adjunct islands for this reason.

(grammatical that complementizer) “What did J say that M. bought __?”
vs.
(ungrammatical *whether) “What did J wonder whether M. bought __?”
and
(ungrammatical *adjunct if) “What did J worry if M. bought __?”

I feel like saturation studies like this have been done at least for some islands, and they didn’t find saturation. Maybe those were islands that weren’t based on lexical items, like subject islands or complex NP islands?

Relatedly, in the verb-frame frequency account, acceptability depends on verb lexical frequency. I definitely get the idea of this prediction (which is nicely intuitive), but Figure 1c seems like a specific version of this -- namely, one where manner-of-speaking verbs are always less frequent than factive and bridge verbs. I guess this is anticipating the frequency results that will be found?

(2) Explaining why “know” is an outlier (it’s less acceptable than frequency would predict): L&al2019 argue this is due to a pragmatic factor where using “know” implies the speaker already has knowledge, so it’s weird to ask. I’m not sure if I followed the reasoning for the pragmatic explanation given for “know”. 

Just to spell it out, the empirical fact is that “What did J know that M didn’t like __?” is less acceptable than the (relatively high) frequency of “know CP” predicts it should be. So, the pragmatic explanation is that it’s weird for the speaker of the question to ask this because the speaker already knows the answer (I think). But what does that have to do with J knowing something? 

And this issue of the speaker knowing something is supposed to be mitigated in cleft constructions like “It was the cake that J knew that M didn’t like.” I don’t follow why this is, I’m afraid. This point gets reiterated in the discussion of the Experiment 3 cleft results and I still don’t quite follow it: “a question is a request for knowledge but a question with ‘know’ implies that the speaker already has the knowledge”. Again, I have the same problem: “What did J know that M didn’t like __?” has nothing to do with the speaker knowing something.

(3) Methodology: This is probably me not understanding how to do experiments, but why is it that a Likert scale doesn’t seem right? Is it just that the participants weren’t using the full scale in Experiment 1? And is that so bad if the test items were never really horribly ungrammatical? Or were there “word salad” controls in Experiment 1, where the participants should have given a 1 or 2 rating, but still didn’t?

Aside from this, why does a binary choice fix the problem?

(4) Thinking about island (non-)effects: Here, the lack of an interaction between sentence type and frequency was meant to indicate no island effect. I’m more used to thinking about island effects as the interaction of dependency-length (matrix vs embedded) and presence vs absence of an island structure, so an island shows up as a superadditive interaction of dependency length & island structure (i.e., an island-crossing dependency is an embedded dependency that crosses an island structure, and it’s extra bad). 

Here, the two factors are wh-questions (so, a dependency period) + which verb lexical item is used. Therefore, an island “structure” should be some extra badness that occurs when a wh-dependency is embedded in a CP for an “island” lexical item (because that lexical item should have an island structure associated with it). Okay. 

But we don’t see that, so there’s no additional structure there. Instead, it’s just that it’s hard to process wh-dependencies with these verbs because they don’t occur that often. Though when I put it like that, this reminds me of the Pearl & Sprouse 2013 island learning story -- islands are bad because there are pieces of structure that are hard to process (because they never occur in the input = lowest frequency possible). 

So, thinking about it like this, these accounts (that is, the L&al2019 account and the Pearl & Sprouse 2013 [P&S2013] account) don’t seem too different after all. It’s just frequency of what -- here, it’s the verb lexical item in these embedded verb frames; for P&S2013, it was small chunks of the phrasal structure that made up the dependency, some of which were subcategorized by the lexical items in them (like the complementizer).

(5) Expt 2 discussion: I think the point L&al2019 were trying to make about the spurious island effects with Figures 4a vs 4b flew by a little fast for me. Why is log odds (log[p(acceptable)/p(unacceptable)]) better than just p(acceptable) on the y-axis? Because doing p(acceptable) on the y-axis is apparently what yields the interaction that’s meant to signal an island effect.
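My best guess at the argument: if sentence type and frequency have purely additive effects in log-odds space, mapping back to raw p(acceptable) can manufacture a superadditive-looking interaction near ceiling or floor. A toy sketch with hypothetical coefficients (this is my reconstruction, not L&al2019’s actual analysis):

```python
import math

def logistic(x):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Purely additive effects in log-odds space: sentence type + frequency,
# with NO interaction term (coefficients hypothetical).
base, type_eff, freq_eff = 2.0, -1.5, -1.5

for stype in (0, 1):       # e.g., non-island-like vs island-like structure
    for freq in (0, 1):    # high- vs low-frequency verb
        logodds = base + type_eff * stype + freq_eff * freq
        print(stype, freq, round(logistic(logodds), 3))

# In log odds, the frequency effect is -1.5 at both sentence types (additive).
# In probability, the frequency effect differs across sentence types, because
# the logistic curve compresses differences near ceiling.
```

So plotting p(acceptable) directly could show a spurious “island effect” interaction that disappears on the log-odds scale -- which I take to be the Figure 4a vs 4b point.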

(6)  I’m sympathetic to the space limitations of conference papers like this, but the learning story at the end was a little scanty for my taste. More specifically, I’m sympathetic to indirect negative evidence for learning, but it only makes sense when you have a hypothesis space set up, and can compare expectations for different hypotheses. What does that hypothesis space look like here? I think there was a little space to spell it out with a concrete example. 

And eeep, just be very careful about saying absence of evidence is evidence of ungrammaticality, unless you’re very careful about what you’re counting.

Tuesday, May 12, 2020

Some thoughts on Futrell et al 2020

I really liked seeing the technique imports from the NLP world (using embeddings, using classifiers), in the service of psychologically-motivated theories of adjective ordering. Yes! Good tools are wonderful. 

I also love seeing this kind of direct, head-to-head competition between well-defined theories, grounding in a well-defined empirical dataset (complete with separate evaluation set), careful qualitative analysis, and discussion of why certain theories might work out better than others. Hurrah for good science!

Other thoughts:
(1) Integration cost vs information gain (subtle differences): Information gain seems really similar to the integration cost idea, where the size of the set of nouns an adjective could modify is the main thing (as the text notes). Both approaches care about making that entropy gain smaller the further the adjective is away from the noun (since that’s less cognitively-taxing to deal with). The difference (if I’m reading this correctly) is that information gain cares about the set size of the nouns the adjective can’t modify too, and uses that in its entropy calculation.
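Here’s a tiny sketch of the entropy bookkeeping as I understand it -- toy numbers, uniform distributions assumed, and this is my gloss rather than the paper’s exact formulation:

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical uniform distribution over 8 candidate nouns.
prior = [1 / 8] * 8

# Suppose the adjective is compatible with only 2 of the 8 nouns.
posterior = [1 / 2] * 2

# An integration-cost-style quantity: uncertainty over just the set of
# nouns the adjective CAN modify (set size 2 -> 1 bit).
integration = entropy(posterior)

# Information gain: how much the adjective shrinks uncertainty over ALL
# the nouns -- so the 6 ruled-out nouns enter the calculation too.
gain = entropy(prior) - entropy(posterior)
print(f"information gain = {gain:.1f} bits")  # 3.0 - 1.0 = 2.0 bits
```

The difference shows up when you change the size of the full noun set while holding the compatible set fixed: the integration-cost-style quantity stays put, but the information gain changes.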

(2) I really appreciate the two-pronged explanation of (a) the more generally semantic factors (because of improved performance when using the semantic clusters for subjectivity and information gain), and (b) the collocation factor over specific lexical items (because of the improved performance on individual wordforms for PMI). But it’s not clear to me how much information gain is adding above and beyond subjectivity on the semantic factor side. I appreciate the item-based zoom in Table 3, which shows the items that information gain does better on...but it seems like these are wordform-based, not based on general semantic properties. So, the argument that information gain is an important semantic factor is a little tricky for me to follow.

Monday, April 27, 2020

Some thoughts on Schneider et al. 2020

It’s nice to see this type of computational cognitive model: a proof of concept for an intuitive (though potentially vague) idea about how children regularize their input to yield more deterministic/categorical grammar knowledge than the input would seem to suggest on the surface. In particular, it’s intuitive to talk about children perceiving some of the input as signal and some as noise, but much more persuasive to see it work in a concrete implementation.

Specific thoughts:
(1) Intake vs. input filtering: Not sure I followed the distinction about filtering the child’s intake vs. filtering the child’s input. The basic pipeline is that external input signal gets encoded using the child’s current knowledge and processing abilities (perceptual intake) and then a subset of that is actually relevant for learning (acquisition intake). So, for filtering the (acquisition?) intake, this would mean children look at the subset of the input perceived as relevant and assume some of that is noise. For filtering the input, is the idea that children would assume some of the input itself is noise and so some of it is thrown out before it becomes perceptual intake? Or is it that the child assumes some of the perceptual intake is noise, and tosses that before it gets to the acquisition intake? And how would that differ for the end result of the acquisition intake? 

Being a bit more concrete helps me think about this:
Filtering the input --
Let’s let the input be a set of 10 signal pieces and 2 noise pieces (10S, 2N).
Let’s say filtering occurs on this set, so the perceptual intake is now 10S.
Then maybe the acquisitional intake is a subset of those, so it’s 8S.

Filtering the intake --
Our input is again 10S, 2N.
(Accurate) perceptual intake takes in 10S, 2N.
Then acquisitional intake could be the subset 7S, 1N.

So okay, I think I get it -- filtering the input gets you a cleaner signal while filtering the intake gets you some subset (cleaner or not, but certainly more focused).
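The same toy numbers as a (trivially small) pipeline, just to restate the distinction -- which particular pieces survive each subset step is arbitrary here:

```python
# Toy pipeline contrasting where the noise filter applies (hypothetical counts).
input_data = ["S"] * 10 + ["N"] * 2   # 10 signal pieces, 2 noise pieces

# Filtering the input: noise is removed before it becomes perceptual intake.
perceptual_a = [x for x in input_data if x == "S"]    # 10S
acquisition_a = perceptual_a[:8]                      # relevant subset: 8S

# Filtering the intake: perceptual intake is accurate, so the relevant
# subset can still contain noise.
perceptual_b = list(input_data)                       # 10S, 2N
acquisition_b = perceptual_b[:7] + perceptual_b[-1:]  # e.g., 7S, 1N

print(acquisition_a.count("N"), acquisition_b.count("N"))  # 0 vs 1
```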

(2) Using English L1 and L2 data in place of ASL: Clever stand-in! I was wondering what they would do for an ASL corpus. But this highlights how to focus on the relevant aspects for modeling. Here, it’s more important to get the same kind of unpredictable variation in use than it is to get ASL data.

(3) Model explanations: I really appreciate the effort here to give the intuitions behind the model pieces. I wonder if it might have been more effective to have a plate diagram, and walk through the high-level explanation for each piece, and then the specifics with the model variables. As it was, I think I was able to follow what was going on in this high-level description because I’m familiar with this type of model already, but I don’t know if that would be true for people who aren’t as familiar. (For example, the bit about considering every partition is a high-level way of talking about Gibbs sampling, as they describe in section 4.2.)

(4) Model priors: If the prior over determiner class is 1/7, then it sounds like the model already knows there are 7 classes of determiner. Similar to a comment raised about the reading last time, why not infer the number of determiner classes, rather than knowing there are 7 already? 

(5) Corpus preprocessing: Interesting step of “downsampling” the counts from the corpora by taking the log. This effectively squishes probability differences down, I think. I wonder why they did this, instead of just using the normalized frequencies? They say this was to compensate for the skewed distribution of frequent determiners like the...but I don’t think I understand why that’s a problem. What does it matter if you have a lot of the, as long as you have enough of the other determiners too? They have the minimum cutoff of 500 instances after all.
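To see what the log does to the skew, a quick sketch with counts I made up (the real corpus numbers would differ, but the shape of the effect is the same):

```python
import math

# Hypothetical raw determiner counts with the typical Zipfian skew.
raw = {"the": 50000, "a": 20000, "some": 2000, "each": 600}

# "Downsampling" by taking the log of each count.
logged = {d: math.log(c) for d, c in raw.items()}

# Raw counts: "the" outnumbers "each" by ~83 to 1.
# Logged counts: that ratio shrinks to under 2 to 1.
for d in raw:
    print(d, raw[d], round(logged[d], 1))
```

So the would-be-dominant the gets squashed from roughly 83x the rarest determiner down to under 2x, which presumably keeps it from swamping the class inference -- though, as I said, I’m not sure why the raw skew is a problem in the first place.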

(6) Figure 1: It looks like the results from the non-native corpus with the noise filter recover the rates of sg, pl, and mass noun combination pretty well (compared against the gold standard). But the noise filter over the native corpus skews a bit towards allowing more noun types with more classes than the gold standard (e.g., more determiners allowing 3 noun types). Side note: I like this evaluation metric a little better than inferring fixed determiner classes, because individual determiner behavior (how many noun classes it allows) can be counted more directly. We don’t need to worry about whether we have the right determiner classes or not.

(7) Evaluation metrics: Related to the previous thought, maybe a more direct evaluation metric is to just compare allowed vs. disallowed noun vectors for each individual determiner? Then the class assignment becomes a means to that end, rather than being the evaluation metric itself. This may help deal with the issue of capturing the variability in the native input that shows up in simulation 2.

(8) L1 vs. L2 input results:  The model learns there’s less noise in the native input case, and filters less; this leads to capturing more variability in the determiners. S&al2020 don’t seem happy about this, but is this so bad? If there’s true variability in native speaker grammars, then there’s variability. 

In the discussion, S&al2020 say that the behavior they wanted was the same for both native and non-native input, since Simon learned the same as native ASL speakers. So that’s why they’re not okay with the native input results. But I’m trying to imagine how the noisy channel input model they designed could possibly give the same results when the input has different amounts of variability -- by nature, it would filter out less input when there seems to be more regularity in the input to begin with (i.e., the native input). I guess it was possible that just the right amount of the input would be filtered out in each case to lead to exactly the same classification results? And then that didn’t happen.