Wednesday, December 2, 2020

Some thoughts on Caplan et al. 2020

I appreciate seeing existence proofs like the one C&al2020 provide here -- more specifically, the previous article by PTG seemed to invite an existence proof that certain properties of a lexicon (ambiguous words being short, frequent, and easy to articulate) could arise from something besides communicative efficiency. C&al2020 then obliged them by providing an existence proof grounded in empirical data. I admit that I had some confusion about the specifics of the communicative efficiency debate (more on this below) as well as PTG’s original findings (more on this below too), but this may be due to actual vagueness in how “communicative efficiency” is talked about in general. 

Specific thoughts:

(1) Communicative efficiency: It hit me right in the introduction that I was confused about what communicative efficiency was meant to be. In the introduction, it sounds like the way “communicative efficiency” is defined is with respect to ambiguity. That is, ambiguity is viewed as not communicatively efficient. But isn’t it efficient for the speaker? It’s just not so helpful for the listener. So, this means communicative efficiency is about comprehension, rather than production.  

Okay. Then, efficiency is about something like information transfer (or entropy reduction, etc.). This then makes sense with the Labov quote at the beginning that talks about the “maximization of information” as a signal of communicative efficiency. That is, if you’re communicatively efficient, you maximize information transfer to the listener.

Then, we have the callout to Darwin, with the idea that “better, shorter, and easier forms are constantly gaining the upper hand”.  Here, “better” and “easier” need to be defined. (Shorter, at least, we can objectively measure.) That is, better for who? Easier for who? If we continue with the idea from before, that we’re maximizing information transfer, it’s better and easier for the listener. But of course, we could also define "better" and "easier" for the speaker. In general, it seems like there’d be competing pressure between forms that are better and easier for the speaker vs. forms that are better and easier for the listener. This also reminds me of some of the components of the Rational Speech Act framework, where there’s a speaker cost function to capture how good (or not) a form is for the speaker vs. the surprisal function that captures how good (or not) a form is inferred to be for the listener. Certainly, surprisal comes back in the measures used by PTG  as well as by C&al2020.

Later on, both the description of Zipf’s Principle of Least Effort and the PTG 2012 overview make it sound like communicative efficiency is about how effortful it is for the speaker, rather than focusing on the information transfer to the listener. Which is it? Or are both meant to be considered for communicative efficiency? It seems like both ought to be, which gets us back to the idea of competing pressures… I guess one upshot of C&al2020’s findings is that we don’t have to care about this thorny issue because we can generate lexicons that look like human language lexicons without relying on communicative efficiency considerations.

(2) 2.2, language models: I was surprised by the amount of attention given to phonotactic surprisal, because I think the main issue is that a statistical model of language is needed and that requires us to make commitments about what we think the language model looks like. This should be the very same issue we see for word-based surprisal. That is, surprisal is the negative log probability of $thing (word or phonological unit), given some language model that predicts how that $thing arises based on the previous context. But it seemed like C&al2020 were less worried about this for word-based surprisal than for phonotactic surprisal, and I’m not sure why.
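
To make the "same issue" point concrete for myself, here's a minimal sketch of surprisal under a toy bigram model. The add-one smoothing, the toy data, and the function names are all my own assumptions for illustration; none of this is the language model that PTG or C&al2020 actually used.

```python
import math
from collections import Counter

def train_bigram(sequences):
    """Count bigrams over sequences of units, padding each with a start symbol."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for seq in sequences:
        units = list(seq)
        vocab.update(units)
        padded = ["<s>"] + units
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab

def surprisal(prev, cur, model):
    """Surprisal in bits: -log2 P(cur | prev), with add-one smoothing (an assumption)."""
    context_counts, bigram_counts, vocab = model
    p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))
    return -math.log2(p)

# The same function applies whether the units are phones or words;
# only the modeling commitments (unit, context size, smoothing) change.
phone_model = train_bigram(["kat", "kab", "tak", "bat"])
word_model = train_bigram([["the", "cat", "sat"], ["the", "cat", "ran"]])

print(surprisal("k", "a", phone_model))      # attested phone transition
print(surprisal("k", "t", phone_model))      # unattested phone transition
print(surprisal("the", "cat", word_model))   # attested word transition
```

Either way, the number you get out depends entirely on the commitments baked into the model, which is why I'd expect the same level of worry for both kinds of surprisal.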

(3) The summary of PTG’s findings: I would have appreciated a slightly more leisurely walkthrough of PTG’s main findings -- I wasn’t quite sure I got the interpretations right as it was. Here’s what I think I understood: 

(a) homophony: negatively correlated with word length and frequency (so more homophony = shorter words and ...lower frequency words???). It’s also negatively correlated with phonotactic surprisal in 2 of 3 languages (so more homophony = lower surprisal = more frequent phonotactic sequences).

(b) polysemy: negatively correlated with word length, frequency, and phonotactic surprisal (so more polysemous words = shorter, less frequent??, and less surprising = more frequent phonotactic sequences).

(c) syllable informativity: negatively correlated with length in phones, frequency, and phonotactic surprisal (so, the more informative (=the less frequent), the shorter in phones, the lower in frequency (yes, by definition), and the lower the surprisal (so the higher the syllable frequency?)).

I think C&al2020’s takeaway message from all 3 of these results was this: “Words that are shorter, more frequent, and easier to produce are more ambiguous than words that are longer, less frequent, and harder to produce”. The only thing is that I struggled a bit to get this from the specific correlations noted. But okay, if we take this at face value, then ambiguity goes hand-in-hand with being shorter, more frequent, and less phonologically surprising = all about easing things for the speaker at face value. (So, it doesn’t seem like ambiguity and communicative efficiency are at odds with each other, if communicative efficiency is defined from the speaker’s perspective.)

(4) Implementing the semantic constraint on the phonotactic monkey model: The current implementation of meaning similarity uses an idealized version (100 x 100 two-dimensional space of real numbers), where points close to each other have more similar meanings. It seems like a natural extension of this would be to try it with actual distributed semantic representations like GloVe or RoBERTa. I guess maybe it’s unclear what additional value this adds to the general argument here -- that is, the current paper is written as “you asked for an existence proof of how lexicons like this could arise without communicative considerations; we made you one”. Yet, at the end, it does sound like C&al2020 would like to have the PSM model be truly considered as a cognitively plausible model of lexicon generation (especially when tied to social networks). If so, then an updated semantic implementation might help convince people that this specific non-communicative-efficiency approach is viable, rather than merely showing that some non-communicative-efficiency approach out there will work.

(5) In 5.3, C&al2020 highlight what the communicative efficiency hypothesis would predict for lexicon change. In particular:

(a) Reused forms should be more efficient than stale forms (i.e., shorter, more frequent, less surprising syllables)

(b) New forms should use more efficient phonotactics (i.e., more frequent, less surprising)

But aren’t these what C&al2020 just showed as something that could result from the PM and PSM models, and so a non-communicative-efficiency approach could also have them? Or is this the point again? I thought at this point that C&al2020 aimed to already show that these predictions aren’t unique to the communicative efficiency hypothesis. (Indeed, this is what they show in the next figure, as they note that PSM English better exploits inefficiencies in the English lexicon by leveraging phonotactically possible, but unused, short words). I guess this is just a rhetorical strategy that I got temporarily confused by.

Tuesday, November 17, 2020

Some thoughts on Matusevych et al. 2020

I really like seeing this kind of model comparison work, as computational models like this encode specific theories of a developmental process (here, how language-specific sound contrasts get learned). I think we see a lot of good practices demonstrated in this paper when it comes to this approach, especially when borrowing models from the NLP world: using naturalistic data, explicitly highlighting the model distinctions and what they mean in terms of representation and learning mechanism, comparing model output to observable behavioral data (more on this below), and generating testable behavioral predictions that will distinguish currently-winning models. 

Specific thoughts:

(1) Comparing model output to observable behavior: I love that M&al2020 do this with their models, especially since most previous models tried to learn unobservable theoretically-motivated representations. This is so useful. If you want the model’s target to be an unobserved knowledge state (like phonetic categories), you’re going to have a fight with the people who care about that knowledge representation level -- namely, is your target knowledge the right form? If instead you make the model’s target some observable behavior, then no one can argue with you. The behavior is an empirical fact, and your model either can generate it or not. It saves much angst on the modeling, and makes for far more convincing results. Bonus: You can then peek inside the model to see what representation it used to generate the observed behavior, and potentially inform the debates about what representation is the right one.

(2) Simulating the ABX task results: So, this seemed a little subtle to me, which is why I want to spell out what I understood (which may well be not quite right). Model performance is calculated by how many individual stimuli the model gets right -- for instance, none = 0% discrimination, 50% = chance performance; 100% = perfect discrimination. I guess maybe this deals with the discrimination threshold issue (i.e., how you know if a given stimulus pair is actually different enough to be discriminated) by just treating each stimulus as a probabilistic draw from a distribution? That is, highly overlapping distributions means A-X is often the same as B-X, and so this works out to no discrimination...I think I need to think this through with the collective a little. It feels like the model’s representation is the draw from a distribution over possible representations, and then that’s what gets translated into the error rate. So, if you get enough stimuli, you get enough draws, and that means the aggregate error rate captures the true degree of separation for these representations. I think?
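
Here's a minimal sketch of how I'm picturing it, to check the "draws from a distribution" intuition. Everything in it is my own simplifying assumption: each stimulus is a single draw from a 1-D Gaussian, distance is absolute difference, and X is always a fresh token of A's category. I don't know that this matches M&al2020's exact procedure.

```python
import random

def abx_accuracy(mean_a, mean_b, sd=1.0, n_trials=20000, seed=0):
    """Proportion of ABX trials where X (a new token of A's category) is closer to A than to B.

    Each stimulus representation is one draw from a 1-D Gaussian (a simplifying assumption).
    """
    rng = random.Random(seed)
    correct = 0.0
    for _ in range(n_trials):
        a = rng.gauss(mean_a, sd)
        b = rng.gauss(mean_b, sd)
        x = rng.gauss(mean_a, sd)  # X comes from the same category as A
        d_ax, d_bx = abs(a - x), abs(b - x)
        if d_ax < d_bx:
            correct += 1.0
        elif d_ax == d_bx:
            correct += 0.5  # a tie counts as a coin flip
    return correct / n_trials

overlapping = abx_accuracy(0.0, 0.0)  # the two categories are the same distribution
separated = abx_accuracy(0.0, 6.0)    # the two categories are far apart

print(overlapping, separated)
```

Under these assumptions, fully overlapping categories come out near chance and well-separated ones come out near ceiling, so the aggregate score does track how separated the representations are, with no per-pair discrimination threshold needed.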

(3) On the weak word-level supervision: This turns out to be recognizing that tokens of a word are in fact the same word form. That’s not crazy from an acquisition perspective -- meaning could help determine that the same lexical item was used in context (e.g., “kitty” one time and “kitty” another time when pointing at the family pet).

(4) Cognitive plausibility of the models: So what strikes me about the RNN models is that they’re clearly coming from the engineering side of the world -- I don’t know if we have evidence that humans do this forced encoding-decoding process. It doesn’t seem impossible (after all, we have memory and attention bottlenecks galore, especially as children), but I just don’t know if anyone’s mapped these autoencoder-style implementations to the cognitive computations we think kids are doing. So, even though the word-level supervision part of the correspondence RNNs seems reasonable, I have no idea about the other parts of the RNNs. Contrast this with the Dirichlet process Gaussian mixture model -- this kind of generative model is easy to map to a cognitive process of categorization, and the computation carried out by the MCMC sampling can be approximated by humans (or so it seems).

(5) Model input representations: MFCCs from 25ms long frames are used. M&al2020 say this is grounded in human auditory processing. This is news to me! I had thought MFCCs were something that NLP had found worked, but we didn’t really know about links to human auditory perception. Wikipedia says the mel (M) part is what’s connected to human auditory processing, in that the spacing of the bands by “mel” is what approximates the human auditory response. But the rest of the process of getting MFCCs from the acoustic input, who knows? This contrasts with using something like phonetic features, which certainly seems to be more like our conscious perception of what’s in the acoustic signal. 

Still, M&al2020 then use speech alignments that map chunks of speech to corresponding phones. So, I think that the alignment process on the MFCCs yields something more like what linguistic theory bases things on, namely phones that would be aggregated together into phonetic categories.

Related thought, from the conclusion: “models learning representations directly from unsegmented natural speech can correctly predict some of the infant phone discrimination data”. Notably, there’s the transformation into MFCCs and speech alignment into phones, so the unit of representation is something more like phones, right? (Or whole words of MFCCs for the (C)AE-RNN models?) So should we take away something about what the infant unit of speech perception is from there, or not? I guess I can’t tell if the MFCC transformation and phone alignment is meant as an algorithmic-level description of how infants would get their phone-like/word-like representations, or if instead it’s a computational-level implementation where we think infants get phone-like/word-like representations out, but infants need to approximate the computation performed here.

(6) Data sparseness: Blaming data sparseness for no model getting the Catalan contrast doesn’t seem crazy to me. Around 8 minutes of Catalan training data (if I’m reading Table 3 correctly) isn’t a lot. If I’m reading Table 3 incorrectly, and it’s actually under 8 hours of Catalan training data, that still isn’t a lot. I mean, we’re talking less than a day’s worth of input for a child, even if this is in hours.

(7) Predictions for novel sound contrasts: I really appreciate seeing these predictions, and brief discussion of what the differences are (i.e., the CAE-RNN is better for differences in length, while the DPGMM is better for ones that observably differ in short time slices). What I don’t know is what to make of that -- and presumably M&al2020 didn’t either. They did their best to hook these findings into what’s known about human speech perception (i.e., certain contrasts like /θ/ are harder for human listeners and are harder for the CAE-RNN too), but the general distinction of length vs. observable short time chunks is unexplained. The only infant data to hook back into is whether certain contrasts are realized earlier than others, but the Catalan one was the earlier one at 8 months, and no model got that.

Tuesday, November 3, 2020

Some thoughts on Fourtassi et al. 2020

It’s really nice to see a computational cognitive model both (i) capture previously-observed human behavior (here, very young children in a specific word-learning experimental task), and (ii) make new, testable predictions that the authors then test in order to validate the developmental theory implemented in the model. What’s particularly nice (in my opinion) about the specific new prediction made here is that it seems so intuitive in hindsight -- of *course* noisiness in the representation of the referent (here: how distinct the objects are from each other) could impact the downstream behavior being measured, since it matters for generating that behavior. But it sure wasn’t obvious to me before seeing the model, and I was fairly familiar with this particular debate and set of studies. That’s the thing about good insights, though -- they’re often obvious in hindsight, but you don’t notice them until someone explicitly points them out. So, this computational cognitive model, by concretely implementing the different factors that lead to the behavior being measured, highlighted that there’s a new factor that should be considered to explain children’s non-adult-like behavior. (Yay, modeling!)

Other thoughts:

(1) Qualitative vs. quantitative developmental change: It certainly seems difficult (currently) to capture qualitative change in computational cognitive models. One of the biggest issues is how to capture qualitative “conceptual” change in, say, a Bayesian model of development. At the moment, the best I’m aware of is implementing models that themselves individually have qualitative differences and then doing model comparison to see which best captures child behavior. But that’s about snapshots of the child’s state, not about how qualitative change happens. Ideally, what we’d like is a way to define building blocks that allow us to construct “novel” hypotheses from their combination...but then qualitative change is about adding a completely new building block. And where does that come from?

Relatedly, having continuous change (“quantitative development”) is certainly in line with the Continuity Hypothesis in developmental linguistics. Under that hypothesis, kids are just navigating through pre-defined options (that adult languages happen to use), rather than positing completely new options (which would be a discontinuous, qualitative change). 

(2) Model implementation:  F&al2020 assume an unambiguous 1-1 mapping between concepts and labels, meaning that the child has learned these mappings completely correctly in the experimental setup. Given the age of the original children (14 months, and actually 8 months too), this seems a simplification. But it’s not an unreasonable one -- importantly, if the behavioral effects can be captured without making this model more complicated, then that’s good to know. That means the main things that matter don’t include this assumption about how well children learn the labels and mappings in the experimental setup.

(3) Model validation with kids and adults: Of course we can quibble with the developmental difference between a 4-year-old and a 14-month-old when it comes to their perceptions of the sounds that make up words and referent distinctiveness. But as a starting proof of concept to show that visual salience matters, I think this is a reasonable first step. A great followup is to actually run the experiment with 14-month-olds, and vary the visual salience just the same way, as alluded to in the general discussion.

(4) Figure 6: Model 2 (sound fuzziness = visual referent fuzziness) is pretty good at matching kids and adults, but Model 3 (sound fuzziness isn’t the same amount as visual referent fuzziness) is a little better. I wonder, though, is Model 3 enough better to account for additional model complexity? Model 2 accounting for 0.96 of the variance seems pretty darned good. 

So, suppose we say that Model 2 is actually the best, once we take model complexity into account. The implication is interesting -- perceptual fuzziness, broadly construed, is what’s going on, whether that fuzziness is over auditory stimuli or visual stimuli (or over categorizations based on those auditory and visual stimuli, like phonetic categories and object categories). This contrasts with domain-specific fuzziness, where auditory stimuli have their fuzziness and visual stimuli have a different fuzziness (i.e., Model 3). So, if this is what’s happening, would this be more in line with some common underlying factor that feeds into perception, like memory or attention?

F&al2020 are very careful to note that their model doesn’t say why the fuzziness goes away, just that it goes away as kids get older. But I wonder...

(5) On minimal pairs for learning: I think another takeaway of this paper is that minimal pairs in visual stimuli -- just like minimal pairs in auditory stimuli -- are unlikely to be helpful for young learners. This is because young kids may miss that there are two things (i.e., word forms or visual referents) that need to be discriminated (i.e., by having different meanings for the word forms, or different labels for the visual referents). Potential practical advice with babies: Don’t try to point out tiny contrasts (auditory or visual) to make your point that two things are different. That’ll work better for adults (and older children).

(6) A subtle point that I really appreciated being walked through: F&al2020 note that just because their model predicts that kids have higher sound uncertainty than adults doesn’t mean their model goes against previous accounts showing that children are good at encoding fine phonetic detail. Instead, the issue may be about what kids think is a categorical distinction (i.e., how kids choose to view that fine phonetic detail) -- so, the sound uncertainty could be due to downstream processing of phonetic detail that’s been encoded just fine.

Monday, October 19, 2020

Some thoughts on Ovans et al. 2020

I really enjoy seeing this kind of precise quantitative investigation of children’s input, and how it can explain their non-adult-like behavior. This particular case involves language processing, and recovering from an incorrect parse, and the upshot is that kids may be doing perfectly sensible things on these test stimuli, given children’s input.  This underscores for me how ridiculously hard it is to consider everything when you’re designing behavioral experiments with kids, and the value of quantitative work for teasing apart the viability of possible explanations (here: immature executive function vs. mature inference over the differently-skewed dataset of child-directed speech). 

Other specific thoughts:

(1) The importance of the model assumptions: Here, surprisal is the main metric, and its precise value of course depends on the language model you’re using. Here, O&al2020 thought the specific verbs (like “put”) were important to separate out in the language model, because of the known lexical restrictions on verb arguments (and therefore possible parses). If they hadn’t done this, they might have gotten very different surprisal values, as the probabilities for “put” parses would have been aggregated with the probabilities for other verbs like “eat” and “hug”.
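
A toy illustration of why the lexicalization choice matters. The counts below are invented by me purely for illustration (they're not from O&al2020 or any corpus), and the rule labels are hypothetical stand-ins for verb-specific PCFG rules.

```python
import math

# Invented counts of VP expansions per verb (hypothetical, for illustration only).
counts = {
    "put": {"V NP PP": 90, "V NP": 10},
    "eat": {"V NP PP": 5, "V NP": 95},
    "hug": {"V NP PP": 2, "V NP": 98},
}

def surprisal_bits(p):
    """Surprisal in bits for an event with probability p."""
    return -math.log2(p)

def lexicalized_prob(verb, frame):
    """P(frame | verb): each verb gets its own VP rules."""
    total = sum(counts[verb].values())
    return counts[verb][frame] / total

def pooled_prob(frame):
    """P(frame): all verbs share one set of VP rules."""
    frame_total = sum(c[frame] for c in counts.values())
    grand_total = sum(sum(c.values()) for c in counts.values())
    return frame_total / grand_total

print(surprisal_bits(lexicalized_prob("put", "V NP PP")))  # PP argument is expected after "put"
print(surprisal_bits(pooled_prob("V NP PP")))              # PP argument looks much less expected when pooled
```

With numbers like these, the PP frame is nearly free for “put” under the lexicalized rules but fairly costly under the pooled ones, which is the kind of difference that would change the surprisal profile across the sentence.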

It’s because of this importance that I had something of a mental hiccup at the beginning of section 3, before I realized that more detail about the exact language model would come later in section 3.2. ;) 

I also want to note that I don’t think it’s crazy to have grammar rules separated out by the verb lexical item, precisely because of how the argument distributions can depend on the verb. But, this does mean that you get a lot of duplication in PCFG rules (e.g., VP_eat, VP_drink look pretty similar, but are treated completely separate). And when there’s duplication, we may miss generalizations.

(2)  Related thought, from section 4.2: “...our calculation of surprisal included a measure of lexical frequency, and for children, each noun token was relatively unexpected” -- I thought only the verbs were lexicalized (and that seems to be what Figures 2 and 3 would suggest on the x axis labels: put the.1 noun.1 prep.1 the.2 noun.2…). So, where does noun lexical frequency come into this? Why wouldn’t all nouns simply be “noun”? I think I may have misunderstood something in the language model.

(3) O&al2020 find low surprisal at the disambiguating P (e.g., “into”), and interpret that to mean children don’t detect that a reparse is needed. Just to check my understanding: The issue that children have is detecting that they misparsed, given the probability of the word coming next. The explanation O&al2020 give is that children are getting surprised by other things in the sentence (like open-class words like nouns), so the relative strength of the error signal from the disambiguating P slips under their detection radar. That is, lots of things are surprising to kids because they don’t have as much experience with language, so the “you parsed it wrong” surprise is relatively less than it is for adults. That seems reasonable. 

Of course, then O&al2020 themselves note that this is slightly weird, because surprisal then isn’t about parsing disambiguation, even though it’s actually implemented here by summing over possible parses. Except, is this that weird? For the nouns, the parse is simply whether that lexical item can go in that position (caveat: assuming we have the lexical items for nouns and not just the Noun category). That’s a general integration cost, though it’s being classified as a “parse”. If we just think about surprisal as integration, is this explanation really so strange? Integrating open-class words like nouns is harder than integrating closed-class words like determiners and prepositions. So, any integration difficulty that a preposition signals can be overshadowed by the difficulty a noun causes.

Tuesday, May 26, 2020

Some thoughts on Liu et al 2019

I really appreciate this paper’s goal of concretely testing different accounts of island constraints, and the authors' intuition that the frequency of the lexical items involved may well have something to do with the (un)acceptability of the island structures they look at. This is something near and dear to my heart, since Jon Sprouse and I worked on a different set of island constraints a few years back (Pearl & Sprouse 2013) and found that the lexical items used as complementizers really mattered. 

Pearl, L., & Sprouse, J. (2013). Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20(1), 23-68.

I do think the L&al2019 paper was a little crunched for space, though -- there were several points where I felt like the reasoning flew by too fast for me to follow (more on this below).

Specific thoughts:
(1) Frequency accounts believe that acceptability is based on exposure. This makes total sense to me for lexical-item-based islands. I wonder if I’d saturate on whether-islands and adjunct islands for this reason.

(grammatical that complementizer) “What did J say that M. bought __?”
(ungrammatical *whether) “What did J wonder whether M. bought __?”
(ungrammatical *adjunct (if)) “What did J worry if M. bought __?”

I feel like saturation studies like this have been done at least for some islands, and they didn’t find saturation. Maybe those were islands that weren’t based on lexical items, like subject islands or complex NP islands?

Relatedly, in the verb-frame frequency account, acceptability depends on verb lexical frequency. I definitely get the idea of this prediction (which is nicely intuitive), but Figure 1c seems a specific version of this -- namely, where manner-of-speaking verbs are always less frequent than factive and bridge verbs. I guess this is anticipating the frequency results that will be found?

(2) Explaining why “know” is an outlier (it’s less acceptable than frequency would predict): L&al2019 argue this is due to a pragmatic factor where using “know” implies the speaker already has knowledge, so it’s weird to ask. I’m not sure if I followed the reasoning for the pragmatic explanation given for “know”. 

Just to spell it out, the empirical fact is that “What did J know that M didn’t like __?” is less acceptable than the (relatively high) frequency of “know CP” predicts it should be. So, the pragmatic explanation is that it’s weird for the speaker of the question to ask this because the speaker already knows the answer (I think). But what does that have to do with J knowing something? 

And this issue of the speaker knowing something is supposed to be mitigated in cleft constructions like “It was the cake that J knew that M didn’t like.” I don’t follow why this is, I’m afraid. This point gets reiterated in the discussion of the Experiment 3 cleft results and I still don’t quite follow it: “a question is a request for knowledge but a question with ‘know’ implies that the speaker already has the knowledge”. Again, I have the same problem: “What did J know that M didn’t like __?” has nothing to do with the speaker knowing something.

(3) Methodology: This is probably me not understanding how to do experiments, but why is it that a Likert scale doesn’t seem right? Is it just that the participants weren’t using the full scale in Experiment 1? And is that so bad if the test items were never really horribly ungrammatical? Or were there “word salad” controls in Experiment 1, where the participants should have given a 1 or 2 rating, but still didn’t? 

Aside from this, why does a binary choice fix the problem?

(4) Thinking about island (non-)effects: Here, the lack of an interaction between sentence type and frequency was meant to indicate no island effect. I’m more used to thinking about island effects as the interaction of dependency-length (matrix vs embedded) and presence vs absence of an island structure, so an island shows up as a superadditive interaction of dependency length & island structure (i.e., an island-crossing dependency is an embedded dependency that crosses an island structure, and it’s extra bad). 
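
For reference, the 2x2 way of thinking I'm used to can be sketched as a differences-in-differences score. The ratings below are numbers I invented to show the two patterns; they aren't from L&al2019 or from any experiment.

```python
def dd_score(ratings):
    """Differences-in-differences over a 2x2 design.

    ratings maps (dependency_length, structure) -> mean acceptability, where
    dependency_length is "matrix" or "embedded" and structure is "non-island" or "island".
    A score near 0 means the two costs just add up; a clearly positive score is the
    superadditive pattern that signals an island effect.
    """
    structure_cost_embedded = ratings[("embedded", "non-island")] - ratings[("embedded", "island")]
    structure_cost_matrix = ratings[("matrix", "non-island")] - ratings[("matrix", "island")]
    return structure_cost_embedded - structure_cost_matrix

# Invented ratings: the two costs simply add up (no island effect).
additive = {
    ("matrix", "non-island"): 1.0, ("matrix", "island"): 0.7,
    ("embedded", "non-island"): 0.5, ("embedded", "island"): 0.2,
}

# Invented ratings: the island-crossing dependency is extra bad (island effect).
superadditive = {
    ("matrix", "non-island"): 1.0, ("matrix", "island"): 0.7,
    ("embedded", "non-island"): 0.5, ("embedded", "island"): -0.8,
}

print(dd_score(additive), dd_score(superadditive))
```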

Here, the two factors are wh-questions (so, a dependency period) + which verb lexical item is used. Therefore, an island “structure” should be some extra badness that occurs when a wh-dependency is embedded in a CP for an “island” lexical item (because that lexical item should have an island structure associated with it). Okay. 

But we don’t see that, so there’s no additional structure there. Instead, it’s just that it’s hard to process wh-dependencies with these verbs because they don’t occur that often. Though when I put it like that, this reminds me of the Pearl & Sprouse 2013 island learning story -- islands are bad because there are pieces of structure that are hard to process (because they never occur in the input = lowest frequency possible). 

So, thinking about it like this, these accounts (that is, the L&al2019 account and the Pearl & Sprouse 2013 [P&S2013] account) don’t seem too different after all. It’s just frequency of what -- here, it’s the verb lexical item in these embedded verb frames; for P&S2013, it was small chunks of the phrasal structure that made up the dependency, some of which were subcategorized by the lexical items in them (like the complementizer).

(5) Expt 2 discussion: I think the point L&al2019 were trying to make about the spurious island effects with Figures 4a vs 4b flew by a little fast for me. Why is log odds [p(acceptable)/p(unacceptable)] better than just p(acceptable) on the y-axis? Because doing p(acceptable) on the y-axis is apparently what yields the interaction that’s meant to signal an island effect.
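
One possible reason, and this is my guess at the logic rather than something I can confirm L&al2019 intended: probabilities get squashed near 0 and 1, so two effects that simply add up in log-odds space can look like an interaction once they're plotted as p(acceptable). A minimal sketch with made-up effect sizes (the baseline and the two effects are assumptions I picked for illustration):

```python
import math

def inv_logit(log_odds):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def interaction(cells):
    """Difference of differences for a 2x2 table keyed by (a, b) with a, b in {0, 1}."""
    return (cells[(0, 0)] - cells[(0, 1)]) - (cells[(1, 0)] - cells[(1, 1)])

# Made-up values: a high baseline and two purely additive penalties in log-odds space.
base, effect_a, effect_b = 3.0, -2.0, -2.0

log_odds = {(a, b): base + a * effect_a + b * effect_b for a in (0, 1) for b in (0, 1)}
prob = {cell: inv_logit(value) for cell, value in log_odds.items()}

print(interaction(log_odds))  # no interaction by construction
print(interaction(prob))      # a sizable apparent interaction on the probability scale
```

If that's the issue, then an interaction on the p(acceptable) scale could come from the scale itself rather than from any extra island structure.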

(6)  I’m sympathetic to the space limitations of conference papers like this, but the learning story at the end was a little scanty for my taste. More specifically, I’m sympathetic to indirect negative evidence for learning, but it only makes sense when you have a hypothesis space set up, and can compare expectations for different hypotheses. What does that hypothesis space look like here? I think there was a little space to spell it out with a concrete example. 

And eeep, just be wary of saying absence of evidence is evidence of ungrammaticality, unless you’re very careful about what you’re counting.

Tuesday, May 12, 2020

Some thoughts on Futrell et al 2020

I really liked seeing the technique imports from the NLP world (using embeddings, using classifiers), in the service of psychologically-motivated theories of adjective ordering. Yes! Good tools are wonderful. 

I also love seeing this kind of direct, head-to-head competition between well-defined theories, grounding in a well-defined empirical dataset (complete with separate evaluation set), careful qualitative analysis, and discussion of why certain theories might work out better than others. Hurrah for good science!

Other thoughts:
(1) Integration cost vs information gain (subtle differences): Information gain seems really similar to the integration cost idea, where the size of the set of nouns an adjective could modify is the main thing (as the text notes). Both approaches care about making that entropy gain smaller the further the adjective is away from the noun (since that’s less cognitively-taxing to deal with). The difference (if I’m reading this correctly) is that information gain cares about the set size of the nouns the adjective can’t modify too, and uses that in its entropy calculation.

(2) I really appreciate the two-pronged explanation of (a) the more generally semantic factors (because of improved performance when using the semantic clusters for subjectivity and information gain), and (b) the collocation factor over specific lexical items (because of the improved performance on individual wordforms for PMI). But it’s not clear to me how much information gain is adding above and beyond subjectivity on the semantic factor side. I appreciate the item-based zoom in Table 3, which shows the items that information gain does better on...but it seems like these are wordform-based, not based on general semantic properties. So, the argument that information gain is an important semantic factor is a little tricky for me to follow.

Monday, April 27, 2020

Some thoughts on Schneider et al. 2020

It’s nice to see this type of computational cognitive model: a proof of concept for an intuitive (though potentially vague) idea about how children regularize their input to yield more deterministic/categorical grammar knowledge than the input would seem to suggest on the surface. In particular, it’s intuitive to talk about children perceiving some of the input as signal and some as noise, but much more persuasive to see it work in a concrete implementation.

Specific thoughts:
(1) Intake vs. input filtering: Not sure I followed the distinction about filtering the child’s intake vs. filtering the child’s input. The basic pipeline is that external input signal gets encoded using the child’s current knowledge and processing abilities (perceptual intake) and then a subset of that is actually relevant for learning (acquisition intake). So, for filtering the (acquisition?) intake, this would mean children look at the subset of the input perceived as relevant and assume some of that is noise. For filtering the input, is the idea that children would assume some of the input itself is noise and so some of it is thrown out before it becomes perceptual intake? Or is it that the child assumes some of the perceptual intake is noise, and tosses that before it gets to the acquisition intake? And how would that differ for the end result of the acquisition intake? 

Being a bit more concrete helps me think about this:
Filtering the input --
Let’s let the input be a set of 10 signal pieces and 2 noise pieces (10S, 2N).
Let’s say filtering occurs on this set, so the perceptual intake is now 10S.
Then maybe the acquisition intake is a subset of those, so it’s 8S.

Filtering the intake --
Our input is again 10S, 2N.
(Accurate) perceptual intake takes in 10S, 2N.
Then acquisition intake could be the subset 7S, 1N.

So okay, I think I get it -- filtering the input gets you a cleaner signal while filtering the intake gets you some subset (cleaner or not, but certainly more focused).
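
The same toy example as a small sketch, just to check the bookkeeping. The helper names are hypothetical, and the subset sizes (8S; 7S, 1N) are simply the ones I picked above, not anything from S&al2020.

```python
from collections import Counter

def drop_noise(items):
    """Idealized filter: keep only the signal pieces."""
    return [item for item in items if item == "S"]

def take_subset(items, n_signal, n_noise=0):
    """Pick a fixed number of signal and noise pieces as the relevant subset."""
    signal = [item for item in items if item == "S"][:n_signal]
    noise = [item for item in items if item == "N"][:n_noise]
    return signal + noise

def signal_proportion(items):
    """Fraction of the pieces that are signal."""
    return Counter(items)["S"] / len(items)

external_input = ["S"] * 10 + ["N"] * 2

# Filtering the input: noise is removed before perceptual intake.
perceptual_a = drop_noise(external_input)
acquisition_a = take_subset(perceptual_a, n_signal=8)

# Filtering the intake: everything is perceived, then a subset is selected.
perceptual_b = list(external_input)
acquisition_b = take_subset(perceptual_b, n_signal=7, n_noise=1)

print(Counter(acquisition_a), signal_proportion(acquisition_a))
print(Counter(acquisition_b), signal_proportion(acquisition_b))
```

In this version, filtering the input leaves an acquisition intake that's all signal, while filtering the intake leaves one that's 7/8 signal.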

(2) Using English L1 and L2 data in place of ASL: Clever stand-in! I was wondering what they would do for an ASL corpus. But this highlights how to focus on the relevant aspects for modeling. Here, it’s more important to get the same kind of unpredictable variation in use than it is to get ASL data. 

(3) Model explanations: I really appreciate the effort here to give the intuitions behind the model pieces. I wonder if it might have been more effective to have a plate diagram, and walk through the high-level explanation for each piece, and then the specifics with the model variables. As it was, I think I was able to follow what was going on in this high-level description because I’m familiar with this type of model already, but I don’t know if that would be true for people who aren’t as familiar. (For example, the bit about considering every partition is a high-level way of talking about Gibbs sampling, as they describe in section 4.2.)

(4) Model priors: If the prior over determiner class is 1/7, then it sounds like the model already knows there are 7 classes of determiner. Similar to a comment raised about the reading last time, why not infer the number of determiner classes, rather than knowing there are 7 already? 

(5) Corpus preprocessing: Interesting step of “downsampling” the counts from the corpora by taking the log. This effectively squishes probability differences down, I think. I wonder why they did this, instead of just using the normalized frequencies? They say this was to compensate for the skewed distribution of frequent determiners like “the”...but I don’t think I understand why that’s a problem. What does it matter if you have a lot of “the”, as long as you have enough of the other determiners too? They have the minimum cutoff of 500 instances after all.
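
A quick sketch of the squishing I have in mind, with hypothetical counts (not the actual corpus counts) and assuming a natural log, since I'm not sure which base was used:

```python
import math

# Hypothetical determiner counts; 500 is the minimum cutoff mentioned above.
raw_counts = {"the": 100000, "a": 40000, "every": 500}

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}

raw_freq = normalize(raw_counts)
log_freq = normalize({det: math.log(count) for det, count in raw_counts.items()})

# How much more weight does the most frequent determiner get than the rarest?
raw_ratio = raw_freq["the"] / raw_freq["every"]
log_ratio = log_freq["the"] / log_freq["every"]

print("raw frequency ratio the/every:", round(raw_ratio, 2))
print("log-count ratio the/every:", round(log_ratio, 2))
```

With these made-up counts, “the” goes from 200 times the weight of “every” to under 2 times, so the log really does flatten the distribution a lot.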

(6) Figure 1: It looks like the results from the non-native corpus with the noise filter recover the rates of sg, pl, and mass noun combination pretty well (compared against the gold standard). But the noise filter over the native corpus skews a bit towards allowing more noun types with more classes than the gold standard (e.g., more determiners allowing 3 noun types). Side note: I like this evaluation metric a little better than inferring fixed determiner classes, because individual determiner behavior (how many noun classes it allows) can be counted more directly. We don’t need to worry about whether we have the right determiner classes or not.

(7) Evaluation metrics: Related to the previous thought, maybe a more direct evaluation metric is to just compare allowed vs. disallowed noun vectors for each individual determiner? Then the class assignment becomes a means to that end, rather than being the evaluation metric itself. This may help deal with the issue of capturing the variability in the native input that shows up in simulation 2.

(8) L1 vs. L2 input results:  The model learns there’s less noise in the native input case, and filters less; this leads to capturing more variability in the determiners. S&al2020 don’t seem happy about this, but is this so bad? If there’s true variability in native speaker grammars, then there’s variability. 

In the discussion, S&al2020 say that the behavior they wanted was the same for both native and non-native input, since Simon learned the same as native ASL speakers. So that’s why they’re not okay with the native input results. But I’m trying to imagine how the noisy channel input model they designed could possibly give the same results when the input has different amounts of variability -- by nature, it would filter out less input when there seems to be more regularity in the input to begin with (i.e., the native input). I guess it was possible that just the right amount of the input would be filtered out in each case to lead to exactly the same classification results? And then that didn’t happen.

Tuesday, April 14, 2020

Some thoughts on Perkins et al. 2020

General thoughts: I love this model as an example of incremental learning in action, where developing representations and developing processing abilities are taken seriously -- here, we can see how these developing components can yield pretty good learning of transitivity relations and an input filter, and then eventually canonical word order.  I also deeply appreciate the careful caveats P&al2020 give in the general discussion for how to interpret their modeling results. This is so important, because it’s so easy to misinterpret modeling results (especially if you weren’t the one doing the modeling -- and sometimes, even if you *are* the one doing the modeling!)

Other thoughts (I had a lot!):

(1) A key point seems to be that the input representation matters -- definitely preaching to the choir, here! What’s true of cognitive modeling seems true for (language) learning, period: garbage in, garbage out. (Also, high quality stuff in = high quality stuff out.) Relatedly, I love the “quality over quantity” takeaway in the general discussion, when it comes to the data children use for learning. This seems exactly right to me, and is the heart of most “less is more” language learning proposals.

(2) A core aspect of the model is that the learner recognizes the possibility of misparsing some of the input. This doesn’t seem like unreasonable prior knowledge to have -- children are surely aware that they make mistakes in general, just by not being able to do/communicate the things they want. So, the “I-make-mistakes” overhypothesis could potentially transfer to this specific case of “I-make-mistakes-when-understanding-the-language-around-me”.

(3) It’s important to remember that this isn’t a model of simultaneously/jointly learning transitivity and word order (for the first part of the manuscript, I thought it was). Instead, it’s a joint learning model that will yield the rudimentary learning components (initial transitivity classes, some version of wh-dependencies that satisfy canonical word order) that a subsequent joint learning process could use. That is, it’s the precursor learning process that would allow children to derive useful learning components they’ll need in the future.  The things that are in fact jointly learned are rudimentary transitivity and how much of the input to trust (i.e., the basic word order filter).

(4) Finding that learning with a uniform prior works just as well:  This is really interesting to me because a uniform prior might explain how very young children can accomplish this inference. That is, they can get a pretty good result even with a uniform prior -- it’s wrong, but it doesn’t matter. Caveat: The model doesn’t differentiate transitive vs. intransitive if its prior is very biased towards the alternating class. But do we care, unless we think children would be highly biased a priori towards the alternating class?

Another simple (empirically-grounded) option is to seed the priors based on the current verbs the child knows, which is a (small) subset of the language’s transitive, intransitive, and alternating verbs. (P&al2020 mention this possibility as part of an incrementally-updating modeled learner.) As long as most of those in the subset aren’t alternating (which would produce that highly-skewed-towards-alternating prior), it looks like the English child will end up making good inferences about subsequent verbs.

(5) I feel for the authors in having the caveat about how ideal Bayesian inference is a proof of concept only. It’s true! But it’s a necessary first step (and highly recommended before trying more child-realistic inference processes -- which may in fact be “broken” forms of the idealized Bayesian computation that Gibbs sampling accomplishes here). Moreover, pretty much all our cognitive models are proofs of concept (i.e., existence proofs that something is possible). That is, we always have to idealize something to make any progress. So, the authors here do the responsible thing and remind us about where they’re idealizing so that we know how to interpret the results.

(6) The second error parameter (delta) about the rate of object drop -- I had some trouble interpreting it. I guess maybe it’s a version of “Did I miss $thing (which only affects that argument) or did I swap $thing with something else (which affects that argument and another argument)?” But then in the text explaining Figure 1, it seems like delta is the global rate of erroneously generating a direct object when it shouldn’t be there. Is this the same as “drop the direct object” vs. “confuse it for another argument”? It doesn’t quite seem like it. This is “I misparsed but accidentally made a direct object anyway when I shouldn’t have,” not necessarily “I confused the direct object with another argument”. Though maybe it could be “I just dropped the direct object completely”?

(7) As the authors note themselves, the model’s results look like a basic fuzzy thresholding decision (0 direct objects <= intransitive <= 15% <= alternating <= around 80% <= transitive <= 100%). Nothing wrong with this at all, but maybe the key is to have the child’s representation of the input take into account some of the nuances mentioned in the results discussion (like “wait” used with temporal adjuncts) that would cause these thresholds to be more accurate. Then, the trick to learning isn’t about fancy inference (though I do love me some Bayesian inference), but rather the input to that inference.
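
Spelled out as a sketch, the thresholding reading looks like this. Hard cutoffs are my simplification of the “fuzzy” version (the actual model makes a probabilistic decision), and the 0.15 and 0.80 defaults are just the approximate values mentioned above.

```python
def classify_verb(direct_object_rate, low=0.15, high=0.80):
    """Assign a verb class from the proportion of uses with a direct object.

    Simplified hard-threshold version; the cutoffs are approximate.
    """
    if not 0.0 <= direct_object_rate <= 1.0:
        raise ValueError("direct_object_rate must be between 0 and 1")
    if direct_object_rate < low:
        return "intransitive"
    if direct_object_rate < high:
        return "alternating"
    return "transitive"

# Hypothetical direct-object rates for three unnamed verbs.
for rate in (0.05, 0.50, 0.95):
    print(rate, classify_verb(rate))
```

On this view, improving the input representation amounts to changing the rate that gets fed in (say, by not counting a temporal adjunct as a direct object), rather than changing the decision rule.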

(8) My confusion about the “true” error parameter values (epsilon and delta): What do error parameters mean for the true corpus? That a non-canonical word order occurred? But weren’t all non-canonical instances removed in the curated input set?

(9) Figure 5:  If I’m interpreting the transitive graph correctly, it looks like super-high delta and epsilon values yield the best accuracy. In particular, if epsilon (i.e., how often to ignore the input) is near 1, we get high accuracy (near 1). What does that mean? The prior is really good for this class of verbs? This is the opposite of what we see with the alternating verbs, where low epsilon yields the best accuracy (so we shouldn’t ignore the input).

Relatedly though, it’s a good point that the three verb classes have different epsilon balances that yield high accuracy. And I appreciated the explanation that a high epsilon means lowering the threshold for membership into the class (e.g., transitive verbs).

(10) The no-filter baseline (with epsilon = 0): Note that this (dumb) strategy has better performance across all verbs (.70) simply because it gets all the alternating verbs right, and those comprise the bulk of the verbs. But this is definitely an instance of perfect recall (of alternating) at the cost of precision (transitive and intransitive).
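
A sketch of that tradeoff, with hypothetical class sizes chosen so that alternating verbs are 70% of the set, and assuming (my simplification) that the no-filter learner ends up labeling every verb as alternating:

```python
# Hypothetical gold-standard class sizes (alternating = 70% of the verbs).
gold = ["alternating"] * 35 + ["transitive"] * 8 + ["intransitive"] * 7

# Assumed baseline behavior: every verb gets labeled alternating.
predicted = ["alternating"] * len(gold)

def recall(gold_labels, predicted_labels, target):
    """Of the verbs truly in the target class, how many were labeled as such?"""
    relevant = [p for g, p in zip(gold_labels, predicted_labels) if g == target]
    return sum(p == target for p in relevant) / len(relevant)

def precision(gold_labels, predicted_labels, target):
    """Of the verbs labeled as the target class, how many truly belong there?"""
    labeled = [g for g, p in zip(gold_labels, predicted_labels) if p == target]
    return sum(g == target for g in labeled) / len(labeled)

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

print("accuracy:", accuracy)
print("alternating recall:", recall(gold, predicted, "alternating"))
print("alternating precision:", precision(gold, predicted, "alternating"))
print("transitive recall:", recall(gold, predicted, "transitive"))
```

Under these assumptions, overall accuracy just equals the share of alternating verbs, and the learner never identifies a single transitive or intransitive verb.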

(11) It’s a nice point that the model performs like children seem to in the presence of noisy input (where the noisy input doesn’t obviously have a predictable source of noise) --  i.e., children overregularize, and so does the model. And the way the model learns this is by having global parameters, so information from any individual verb informs those global parameters, which in turn affects the model’s decisions about other individual verbs. 

(12) I really like the idea of having different noise parameters depending on the sources of noise the learner thinks there are. This might require us to have a more articulated idea of the grammatical process that generates data, so that noise could come from different pieces of that process. Then, voila -- a noise parameter for each piece.

(13) It’s also a cool point about the importance of variation -- the variation provides anchor points (here: verbs the modeled child thinks are definitely transitive or intransitive). If there were no variation, the modeled child wouldn’t have these anchor points, and so would be hindered in deciding how much noise there might be. At a more general level, this idea about the importance of variation seems like an example where something “harder” about the learning problem (here: variation is present in the verbs) actually makes learning easier.

(14)  Main upshot: The modeled child can infer an appropriate filter (= “I mis-parse things sometimes” + “I add/delete a direct object sometimes”) at the same time as inferring classes of verbs with certain argument structure (transitive, intransitive, and alternating). Once these classes are established, then learners can use the classes to generalize properties of (new) verbs in those classes, such as transitive verbs having subjects and objects, which correspond to agents and patients in English. 

Relatedly, I’d really love to think more about this with respect to how children learn complex linking theories like UTAH and rUTAH, which involve a child knowing collections of links between verb arguments (like subject and object) and event participants (like agent and patient). That is, let’s assume the learning process described in this paper happens and children have some seed classes of transitive, intransitive, and alternating + the knowledge of the argument structure associated with each class (must have direct object [transitive], must not have direct object [intransitive], may have direct object [alternating]). I think children would still have to learn the links between arguments and event participants, right? That is, they’d still need to learn that the subject of a transitive verb is often an agent in the event. But they’d at least be able to recognize that certain verbs have these arguments, and so be able to handle input with movement, like wh-questions for transitive verbs.