Monday, November 28, 2016

Some thoughts on McCauley & Christiansen 2014

I really appreciate this kind of overview, especially for an acquisition modeling literature I’m not as familiar with. It’s heartening to see similar broad concerns (consensus about what models should be doing), even if I might not always agree with the particulars. What caught my initial attention here is the focus on moving beyond “purely distributional features of the input” — though it turns out this might mean something different to me than to the authors.

For me, “purely distributional” means using only distributional information (rather than being additionally biased to skew the distributions in some way, e.g., by upweighting certain data and downweighting others). Importantly, "purely distributional" can still be information about the distribution of fairly abstract things, like thematic role positions. For M&C2014, based on the intro, it seems like they want it to mean distributions of words, since they specifically point out the “relative lack of semantic information” in current distributional usage-based models. They also contrast a purely distributional version of Perfors et al.’s dative alternation learning model with one that includes “a single semantic feature”. So while I’m happy to see the inclusion of more abstract linguistic features, I would still class the use of the distributions of those features as a purely distributional strategy. (This is part of the general idea that it's not that you're counting, but rather what you're counting.)

Some additional thoughts:

(1) I like the suggestion to create models that can produce behavioral output that we can compare against children’s behavioral output.  (This is under the general heading of “Models should aim to capture aspects of language use”.) That way, we don’t have to spend so much time arguing over the theoretical representation we choose for the model’s internal knowledge — the ultimate checkpoint is that it’s a way to generate the observed behavior (i.e., an existence proof). This is exactly the sort of thing we read about last time in the reading group. Of course, as we also saw last time, this is much easier said than done.

(2) One criticism M&C2014 bring up as they discuss the models of semantic role labeling is that there’s a fixed set of predefined semantic roles. Is this really a problem, though? I think there’s evidence for early conceptual roles in infants (something like proto-agent and proto-patient). 

Also, later on in the discussion of verb argument structure, M&C2014 describe Chang’s Embodied Construction Grammar model as involving a set of “predefined schemas” that correspond to “actions, objects, and agents”. This doesn’t seem to cause M&C2014 as much consternation — why is it any more usage-based to have predefined conceptual schemas instead of predefined conceptual roles?

(3) I admit, I was somewhat surprised in the future extensions discussion to see “subject-auxiliary inversion” as an example of complex grammatical phenomena. In my head, that’s far more basic than many other things I see in the syntactic development literature, such as raising vs. control verb interpretation, quantifier scope ambiguity, syntactic island constraints, binding relations, negative polarity items, and so on. Related to this, it’s unclear to me how much “social feedback” incorporation that “reflect[s] the semi-supervised nature of the learning task” is going to matter for syntactic knowledge like this. How much feedback do children get (and actually absorb, even if they get it) for these more sophisticated knowledge elements?

Monday, November 14, 2016

Some thoughts on Lloyd-Kelly et al. 2016

I really appreciate this paper as a first attempt to provide a linking story between model representations and infant behavior (in this case, turning probabilities associated with chunked representations into actual infant listening times, using things like time-sensitive trace decay). This highlights how the details of the experimental procedure matter, such as how often syllables are uttered, how long between habituation and test phases, and how long between the individual test stimuli during the test phase. In theory, this would also include all the non-linguistic processes that go into generating observable behavior, like motor control, attention, and memory, though LK&al2016 focused on memory for this first-pass attempt. (I should note that I think including some mechanism for attention would really help them out in future modeling attempts — more on this below.)
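To make that linking story concrete, here’s a toy sketch of my own (the function name, parameter names, and numbers are all made-up placeholders, not LK&al2016’s actual equations) of how a chunk’s probability plus time-sensitive trace decay might be turned into a predicted listening time:

```python
import math

def listening_time(chunk_prob, delay_ms, decay_ms=600.0, base_ms=2000.0):
    """Hypothetical linking function: the stored trace for a chunked
    sequence decays exponentially with the delay since habituation,
    and a weaker surviving trace (i.e., a less familiar sequence)
    yields a longer predicted look (a novelty preference)."""
    trace = chunk_prob * math.exp(-delay_ms / decay_ms)
    return base_ms * (1.0 - trace)
```

The point of a sketch like this is just that the experimental details (delay between habituation and test, time between test stimuli) show up directly as arguments to the linking function, so they can change the predicted behavior even when the learned representation stays the same.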

Some additional thoughts:

(1) It might be useful to go over some of the details of the CHREST model discussed in the “Participant Modelling” section, and embodied in Figure 1. While the basic division into long-term memory, short-term memory, and a phonological loop makes good sense, I want to make sure I’m clear on the distinction between discriminating, familiarizing, and a node being finished. For instance, why does a “finished” node cause something new to be created?

Relatedly, based on Figure 1, it seems like there’s a built-in primacy effect with respect to inserting a new node. For example, when pa-go is encountered in “pa-go-ti” but only pa-do exists, the first thing that happens is “go” is added on its own as a primitive. My interpretation: if you get something new, you only manage to grab a piece of it, and a primacy bias makes you grab the first piece you don’t recognize. (An alternative might be a recency bias, where you grab the last thing, due to phonological loop decay. So, in pa-go-ti, you grab “ti” first.)
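To make the primacy-vs-recency contrast concrete, here’s a toy illustration of my own (not CHREST’s actual discrimination algorithm) of the two ways of picking which unrecognized syllable becomes the new primitive:

```python
def new_primitive(syllables, known, primacy=True):
    """Which unrecognized syllable gets added as a new primitive?
    Under a primacy bias, the first unknown piece; under a recency
    bias (e.g., driven by phonological loop decay), the last one."""
    unknown = [s for s in syllables if s not in known]
    if not unknown:
        return None  # everything was recognized; nothing to add
    return unknown[0] if primacy else unknown[-1]
```

So with “pa” and “do” known, the primacy version grabs “go” from pa-go-ti, while the recency version would grab “ti” — which is exactly the difference you’d want to look for in the model’s behavior.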

(2) I think it’s very handy how the learner ignores incoming requests during the search, retrieval, and updating process.  The upshot is that the learner can’t learn new things while it’s still updating old things, which intuitively feels right. Also, it’s nice from a model fit perspective to have three distinct timing variables to tweak in order to match human behavior (though this also gets into issues of maybe being able to overfit with that many degrees of freedom).

(3) I really appreciated the empirical grounding based on children’s sensory auditory memory strength for the phonological store (=600ms). However, then I got a bit confused as to why they were testing out other values for this (800ms and 1000ms) in their simulations. Perhaps because 600ms was only a guess?

This then relates to the interpretation of Figure 2. It looks like the least variable performance comes from a short phonological store trace decay (600ms), though the r^2 is also low (but then, so is the RMSE, which is a good thing). If we take this as the best fit, then we might interpret it as quick forgetting mattering more than the other memory retrieval aspects encoded by the familiarization and discrimination times.
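Part of what’s going on here may just be how the two fit metrics behave: if r^2 is (squared) correlation, it rewards matching the shape of the data, while RMSE rewards matching the actual values, so the two can easily disagree. A quick sketch of both (these are the standard definitions, nothing specific to LK&al2016):

```python
def rmse(pred, obs):
    """Root mean squared error: penalizes absolute distance from the data."""
    n = len(pred)
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / n) ** 0.5

def pearson_r2(pred, obs):
    """Squared Pearson correlation: rewards matching the data's shape,
    regardless of any constant offset or scaling."""
    n = len(pred)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov * cov / (var_p * var_o)
```

For instance, predictions that are perfectly correlated with the data but offset by a constant get r^2 = 1 while still having nonzero RMSE. Conversely, near-flat predictions hovering around the data’s mean can keep RMSE low while r^2 stays low — which may be what the 600ms condition is doing.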

On the other hand, if we focus on the highest r^2 and lowest RMSE values, then we get these combinations as being best: 

800ms phon decay + 10000ms discrimination + 1000-1500ms familiarization
1000ms phon decay + 9000ms discrimination + 2000ms familiarization

Importantly, the 600ms phon decay isn’t even in there. If we take these at face value, then the question is how to interpret them. Perhaps they narrow down the set of possible values for these different memory components in infants. In that case, maybe an 8-month-old’s phonological store trace decay is closer to a 1- or 2-year-old’s, which is 1000-2000ms, rather than 600ms…

…except LK&al2016’s conclusion section seems to take the opposite tack: “…the data obtained in this paper would lend credence to the proposal that the trace decay time of the phonological store is around 600ms for very young infants.” I think I missed how they get there from their results, especially the connection to the digit span findings cited from Gathercole & Adams (1993). It seems super important, given how LK&al2016 think it’s the biggest finding of their paper.

(4) LK&al2016 find a qualitative match to infant looking times (Figure 3), but they’re getting longer times for everything. As they themselves note: “infants appear to become bored much more quickly than the model”. It seems like this indicates a natural role for attention in future models. Interestingly, this is something LK&al2016 didn’t explicitly mention in describing future adaptations of the model in the conclusion. On the plus side, it doesn’t seem like it would be hard to build attention into the listening time calculation (e.g., just subtract some amount from the total looking time, based on some parameter connected to how much time has passed).
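For what it’s worth, the kind of crude attention correction I have in mind might look like this (a placeholder of my own devising, with made-up parameter names and values, not anything from LK&al2016):

```python
def attended_look(raw_look_ms, elapsed_ms, boredom_rate=0.1):
    """Hypothetical attention correction: shave time off the model's
    predicted look as boredom accumulates over the session, floored
    at zero. A crude stand-in for a real attention mechanism."""
    penalty = boredom_rate * elapsed_ms
    return max(0.0, raw_look_ms - penalty)
```

Even something this simple would let the model’s looking times shrink over the course of the experiment the way infants’ do, without touching the learned representations at all.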