Tuesday, May 26, 2020

Some thoughts on Liu et al 2019

I really appreciate this paper’s goal of concretely testing different accounts of island constraints, and the authors' intuition that the frequency of the lexical items involved may well have something to do with the (un)acceptability of the island structures they look at. This is something near and dear to my heart, since Jon Sprouse and I worked on a different set of island constraints a few years back (Pearl & Sprouse 2013) and found that the lexical items used as complementizers really mattered. 

Pearl, L., & Sprouse, J. (2013). Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20(1), 23-68.

I do think the L&al2019 paper was a little crunched for space, though -- there were several points where I felt like the reasoning flew by too fast for me to follow (more on this below).


Specific thoughts:
(1) Frequency accounts hold that acceptability is based on exposure. This makes total sense to me for lexical-item-based islands. I wonder if I’d saturate on whether islands and adjunct islands for this reason.

(grammatical that complementizer) “What did J say that M. bought __?”
 vs. 
(ungrammatical *whether) “What did J wonder whether M. bought __?”
and
(ungrammatical *adjunct (if)) “What did J worry if M. bought __?”

I feel like saturation studies like this have been done at least for some islands, and they didn’t find saturation. Maybe those were islands that weren’t based on lexical items, like subject islands or complex NP islands?

Relatedly, in the verb-frame frequency account, acceptability depends on verb lexical frequency. I definitely get the idea of this prediction (which is nicely intuitive), but Figure 1c seems like a specific version of this -- namely, one where manner-of-speaking verbs are always less frequent than factive and bridge verbs. I guess this is anticipating the frequency results that will be found?

(2) Explaining why “know” is an outlier (it’s less acceptable than frequency would predict): L&al2019 argue this is due to a pragmatic factor where using “know” implies the speaker already has knowledge, so it’s weird to ask. I’m not sure I followed the reasoning behind this pragmatic explanation.

Just to spell it out, the empirical fact is that “What did J know that M didn’t like __?” is less acceptable than the (relatively high) frequency of “know CP” predicts it should be. So, the pragmatic explanation is that it’s weird for the speaker of the question to ask this because the speaker already knows the answer (I think). But what does that have to do with J knowing something? 

And this issue of the speaker knowing something is supposed to be mitigated in cleft constructions like “It was the cake that J knew that M didn’t like.” I don’t follow why this is, I’m afraid. This point gets reiterated in the discussion of the Experiment 3 cleft results and I still don’t quite follow it: “a question is a request for knowledge but a question with ‘know’ implies that the speaker already has the knowledge”. Again, I have the same problem: “What did J know that M didn’t like __?” has nothing to do with the speaker knowing something.

(3) Methodology: This is probably me not understanding how to do experiments, but why is it that a Likert scale doesn’t seem right? Is it just that the participants weren’t using the full scale in Experiment 1? And is that so bad if the test items were never really horribly ungrammatical? Or were there “word salad” controls in Experiment 1, where the participants should have given a 1 or 2 rating, but still didn’t?

Aside from this, why does a binary choice fix the problem?

(4) Thinking about island (non-)effects: Here, the lack of an interaction between sentence type and frequency was meant to indicate no island effect. I’m more used to thinking about island effects as the interaction of dependency-length (matrix vs embedded) and presence vs absence of an island structure, so an island shows up as a superadditive interaction of dependency length & island structure (i.e., an island-crossing dependency is an embedded dependency that crosses an island structure, and it’s extra bad). 
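
Just to make that concrete, here’s a toy version of the differences-in-differences logic I usually use, as a little Python sketch -- the ratings are completely made up, just to show where the superadditive interaction would show up.

# Toy differences-in-differences (DD) calculation for the standard 2x2 island design.
# The ratings are invented (think z-scored acceptability, higher = better).
ratings = {
    ("matrix",   "non-island"):  0.8,   # short dependency, no island structure
    ("embedded", "non-island"):  0.4,   # long dependency, no island structure
    ("matrix",   "island"):      0.6,   # short dependency, island structure present
    ("embedded", "island"):     -0.9,   # long dependency crossing the island structure
}

# Cost of dependency length alone, and of the island structure alone:
length_cost    = ratings[("matrix", "non-island")] - ratings[("embedded", "non-island")]
structure_cost = ratings[("matrix", "non-island")] - ratings[("matrix", "island")]

# If the two costs were merely additive, the island-crossing condition would be predicted at:
additive_prediction = ratings[("matrix", "non-island")] - length_cost - structure_cost

# The DD score is the extra badness beyond additivity (the superadditive interaction):
dd_score = additive_prediction - ratings[("embedded", "island")]
print(f"length cost={length_cost:.2f}, structure cost={structure_cost:.2f}, DD={dd_score:.2f}")
# A DD score reliably above 0 is the usual signature of an island effect.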

Here, the two factors are wh-questions (so, a dependency, period) + which verb lexical item is used. Therefore, an island “structure” should be some extra badness that occurs when a wh-dependency is embedded in a CP for an “island” lexical item (because that lexical item should have an island structure associated with it). Okay. 

But we don’t see that, so there’s no additional structure there. Instead, it’s just that it’s hard to process wh-dependencies with these verbs because they don’t occur that often. Though when I put it like that, this reminds me of the Pearl & Sprouse 2013 island learning story -- islands are bad because there are pieces of structure that are hard to process (because they never occur in the input = lowest frequency possible). 

So, thinking about it like this, these accounts (that is, the L&al2019 account and the Pearl & Sprouse 2013 [P&S2013] account) don’t seem too different after all. It’s just frequency of what -- here, it’s the verb lexical item in these embedded verb frames; for P&S2013, it was small chunks of the phrasal structure that made up the dependency, some of which were subcategorized by the lexical items in them (like the complementizer).

(5) Expt 2 discussion: I think the point L&al2019 were trying to make about the spurious island effects with Figures 4a vs 4b flew by a little fast for me. Why is log odds [log(p(acceptable)/p(unacceptable))] better than just p(acceptable) on the y-axis? Because doing p(acceptable) on the y-axis is apparently what yields the interaction that’s meant to signal an island effect.
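
To convince myself about why the y-axis choice matters, here’s a toy numerical example (invented acceptance probabilities) showing how compression near the ceiling on the raw probability scale can manufacture an interaction that mostly disappears on the log-odds scale.

import math

def log_odds(p):
    # log odds = log(p / (1 - p))
    return math.log(p / (1 - p))

# Invented acceptance probabilities for a 2x2 design:
# sentence type (non-island-like vs. island-like) x verb frequency (high vs. low).
p = {
    ("non-island", "high"): 0.97, ("non-island", "low"): 0.90,
    ("island",     "high"): 0.80, ("island",     "low"): 0.55,
}

# On the raw probability scale, the frequency effect looks much bigger for the island-like
# sentences (0.25 vs. 0.07) -- an apparent interaction:
diff_non_island = p[("non-island", "high")] - p[("non-island", "low")]
diff_island     = p[("island", "high")] - p[("island", "low")]
print(f"probability scale: {diff_non_island:.2f} vs {diff_island:.2f}")

# On the log-odds scale, the two effects are nearly the same (about 1.28 vs. 1.19),
# so most of the apparent interaction was just compression near the ceiling:
lo_diff_non_island = log_odds(p[("non-island", "high")]) - log_odds(p[("non-island", "low")])
lo_diff_island     = log_odds(p[("island", "high")]) - log_odds(p[("island", "low")])
print(f"log-odds scale: {lo_diff_non_island:.2f} vs {lo_diff_island:.2f}")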

(6) I’m sympathetic to the space limitations of conference papers like this, but the learning story at the end was a little scanty for my taste. More specifically, I’m sympathetic to indirect negative evidence for learning, but it only makes sense when you have a hypothesis space set up and can compare expectations for different hypotheses. What does that hypothesis space look like here? I think there was space to spell it out with a concrete example.

And eeep, just be wary of saying absence of evidence is evidence of ungrammaticality, unless you’re very careful about what you’re counting.
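
For what it’s worth, here’s the kind of concrete example I was wishing for -- a toy sketch of my own (so, not L&al2019’s actual proposal, and the rates and counts are invented) of how indirect negative evidence only does work once you have competing hypotheses whose expectations you can compare.

# Toy indirect negative evidence sketch (my own invention, not L&al2019's proposal).
# Two hypotheses about wh-extraction from "wonder whether" complements:
#   H_ok:  extraction is grammatical, so it should occur at some small rate
#   H_bad: extraction is ungrammatical, so it should (essentially) never occur
expected_rate_if_ok = 0.001     # assumed rate per relevant utterance if it were grammatical
n_relevant_utterances = 5000    # assumed number of opportunities in the input
observed_count = 0              # the construction never shows up

# Likelihood of seeing zero occurrences under each hypothesis:
p_data_given_ok  = (1 - expected_rate_if_ok) ** n_relevant_utterances   # ~0.007
p_data_given_bad = 1.0   # zero occurrences is exactly what H_bad expects

# Starting from a 50/50 prior, the posterior shifts sharply toward "ungrammatical":
prior_ok = 0.5
posterior_ok = (p_data_given_ok * prior_ok) / (
    p_data_given_ok * prior_ok + p_data_given_bad * (1 - prior_ok))
print(f"P(grammatical | 0 occurrences in {n_relevant_utterances} chances) = {posterior_ok:.4f}")

# The crucial ingredient is expected_rate_if_ok: without a hypothesis that says how often
# the construction *should* occur, absence of evidence tells you nothing.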

Tuesday, May 12, 2020

Some thoughts on Futrell et al 2020

I really liked seeing the technique imports from the NLP world (using embeddings, using classifiers), in the service of psychologically-motivated theories of adjective ordering. Yes! Good tools are wonderful. 

I also love seeing this kind of direct, head-to-head competition between well-defined theories, grounding in a well-defined empirical dataset (complete with separate evaluation set), careful qualitative analysis, and discussion of why certain theories might work out better than others. Hurrah for good science!

Other thoughts:
(1) Integration cost vs information gain (subtle differences): Information gain seems really similar to the integration cost idea, where the size of the set of nouns an adjective could modify is the main thing (as the text notes). Both approaches care about making that entropy gain smaller the further the adjective is away from the noun (since that’s less cognitively-taxing to deal with). The difference (if I’m reading this correctly) is that information gain cares about the set size of the nouns the adjective can’t modify too, and uses that in its entropy calculation.
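
Here’s how I’m reading that difference, as a toy entropy calculation -- this is entirely my own construction with an invented noun distribution, so the details may not match F&al2020’s actual metric.

import math

def entropy(probs):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy noun distribution (invented): uniform over 8 nouns.
nouns = ["cat", "dog", "idea", "table", "storm", "law", "river", "song"]
prior = {n: 1 / len(nouns) for n in nouns}

# Suppose the adjective "furry" can modify only 2 of the 8 nouns.
modifiable = {"cat", "dog"}

# Integration-cost-style quantity (as I read the text): just the size of the modifiable set.
set_size = len(modifiable)

# Information-gain-style quantity: entropy over the *whole* noun distribution before vs.
# after seeing the adjective (non-modifiable nouns drop to zero probability and we
# renormalize) -- so the nouns the adjective can't modify enter the calculation too.
h_before = entropy(prior.values())
posterior = {n: (prior[n] if n in modifiable else 0.0) for n in nouns}
z = sum(posterior.values())
posterior = {n: pr / z for n, pr in posterior.items()}
h_after = entropy(posterior.values())

print(f"modifiable set size = {set_size}")
print(f"information gain = {h_before - h_after:.2f} bits")   # 3 bits -> 1 bit = 2 bits gained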

(2) I really appreciate the two-pronged explanation of (a) the more generally semantic factors (because of improved performance when using the semantic clusters for subjectivity and information gain), and (b) the collocation factor over specific lexical items (because of the improved performance on individual wordforms for PMI). But it’s not clear to me how much information gain is adding above and beyond subjectivity on the semantic factor side. I appreciate the item-based zoom in Table 3, which shows the items that information gain does better on...but it seems like these are wordform-based, not based on general semantic properties. So, the argument that information gain is an important semantic factor is a little tricky for me to follow.

Monday, April 27, 2020

Some thoughts on Schneider et al. 2020

It’s nice to see this type of computational cognitive model: a proof of concept for an intuitive (though potentially vague) idea about how children regularize their input to yield more deterministic/categorical grammar knowledge than the input would seem to suggest on the surface. In particular, it’s intuitive to talk about children perceiving some of the input as signal and some as noise, but much more persuasive to see it work in a concrete implementation.

Specific thoughts:
(1) Intake vs. input filtering: Not sure I followed the distinction about filtering the child’s intake vs. filtering the child’s input. The basic pipeline is that external input signal gets encoded using the child’s current knowledge and processing abilities (perceptual intake) and then a subset of that is actually relevant for learning (acquisition intake). So, for filtering the (acquisition?) intake, this would mean children look at the subset of the input perceived as relevant and assume some of that is noise. For filtering the input, is the idea that children would assume some of the input itself is noise and so some of it is thrown out before it becomes perceptual intake? Or is it that the child assumes some of the perceptual intake is noise, and tosses that before it gets to the acquisition intake? And how would that differ for the end result of the acquisition intake? 

Being a bit more concrete helps me think about this:
Filtering the input --
Let’s let the input be a set of 10 signal pieces and 2 noise pieces (10S, 2N).
Let’s say filtering occurs on this set, so the perceptual intake is now 10S.
Then maybe the acquisitional intake is a subset of those, so it’s 8S.

Filtering the intake --
Our input is again 10S, 2N.
(Accurate) perceptual intake takes in 10S, 2N.
Then acquisitional intake could be the subset 7S, 1N.

So okay, I think I get it -- filtering the input gets you a cleaner signal while filtering the intake gets you some subset (cleaner or not, but certainly more focused).

(2) Using English L1 and L2 data in place of ASL: Clever stand-in! I was wondering what they would do for an ASL corpus. But this highlights how to focus on the relevant aspects for modeling. Here, it’s more important to get the same kind of unpredictable variation in use than it is to get ASL data. 

(3) Model explanations: I really appreciate the effort here to give the intuitions behind the model pieces. I wonder if it might have been more effective to have a plate diagram, and walk through the high-level explanation for each piece, and then the specifics with the model variables. As it was, I think I was able to follow what was going on in this high-level description because I’m familiar with this type of model already, but I don’t know if that would be true for people who aren’t as familiar. (For example, the bit about considering every partition is a high-level way of talking about Gibbs sampling, as they describe in section 4.2.)

(4) Model priors: If the prior over determiner class is 1/7, then it sounds like the model already knows there are 7 classes of determiner. Similar to a comment raised about the reading last time, why not infer the number of determiner classes, rather than knowing there are 7 already? 
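
Just to gesture at the alternative I have in mind: a nonparametric prior like the Chinese Restaurant Process lets the number of determiner classes itself be inferred rather than fixed at 7. Here’s a minimal sketch of just the prior part (the concentration parameter alpha and the number of determiners are made-up values):

import random
random.seed(42)

def crp_assignments(n_items, alpha=1.0):
    # Chinese Restaurant Process prior: each new item joins an existing class with
    # probability proportional to that class's current size, or starts a brand-new
    # class with probability proportional to alpha.
    assignments = []
    class_sizes = []
    for _ in range(n_items):
        weights = class_sizes + [alpha]          # last slot = "open a new class"
        choice = random.choices(range(len(weights)), weights=weights)[0]
        if choice == len(class_sizes):
            class_sizes.append(1)                # new class
        else:
            class_sizes[choice] += 1             # existing class
        assignments.append(choice)
    return assignments

# 24 determiners (an arbitrary number), no pre-set number of classes:
print(crp_assignments(24, alpha=1.0))
# A real model would combine this prior with the noun-vector likelihood and infer
# the assignments (and so the number of classes) via something like Gibbs sampling.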

(5) Corpus preprocessing: Interesting step of “downsampling” the counts from the corpora by taking the log. This effectively squishes probability differences down, I think. I wonder why they did this, instead of just using the normalized frequencies? They say this was to compensate for the skewed distribution of frequent determiners like the...but I don’t think I understand why that’s a problem. What does it matter if you have a lot of the, as long as you have enough of the other determiners too? They have the minimum cutoff of 500 instances after all.
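
To see what the log step does to count differences, here’s a quick back-of-the-envelope calculation with made-up counts:

import math

# Made-up token counts, just to see what taking the log does:
counts = {"the": 50000, "some": 2000, "few": 500}

total = sum(counts.values())
raw_props = {w: c / total for w, c in counts.items()}

log_counts = {w: math.log(c) for w, c in counts.items()}
log_total = sum(log_counts.values())
log_props = {w: lc / log_total for w, lc in log_counts.items()}

for w in counts:
    print(f"{w:5s} raw proportion: {raw_props[w]:.3f}   log-based proportion: {log_props[w]:.3f}")
# "the" goes from dominating (~0.95 of the mass) to merely leading (~0.44),
# which is the squishing of probability differences I mean.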

(6) Figure 1: It looks like the results from the non-native corpus with the noise filter recover the rates of sg, pl, and mass noun combination pretty well (compared against the gold standard). But the noise filter over the native corpus skews a bit towards allowing more noun types with more classes than the gold standard (e.g., more determiners allowing 3 noun types). Side note: I like this evaluation metric a little better than inferring fixed determiner classes, because individual determiner behavior (how many noun classes it allows) can be counted more directly. We don’t need to worry about whether we have the right determiner classes or not.

(7) Evaluation metrics: Related to the previous thought, maybe a more direct evaluation metric is to just compare allowed vs. disallowed noun vectors for each individual determiner? Then the class assignment becomes a means to that end, rather than being the evaluation metric itself. This may help deal with the issue of capturing the variability in the native input that shows up in simulation 2.

(8) L1 vs. L2 input results:  The model learns there’s less noise in the native input case, and filters less; this leads to capturing more variability in the determiners. S&al2020 don’t seem happy about this, but is this so bad? If there’s true variability in native speaker grammars, then there’s variability. 

In the discussion, S&al2020 say that the behavior they wanted was the same for both native and non-native input, since Simon learned the same as native ASL speakers. So that’s why they’re not okay with the native input results. But I’m trying to imagine how the noisy channel input model they designed could possibly give the same results when the input has different amounts of variability -- by nature, it would filter out less input when there seems to be more regularity in the input to begin with (i.e., the native input). I guess it was possible that just the right amount of the input would be filtered out in each case to lead to exactly the same classification results? And then that didn’t happen.

Tuesday, April 14, 2020

Some thoughts on Perkins et al. 2020

General thoughts: I love this model as an example of incremental learning in action, where developing representations and developing processing abilities are taken seriously -- here, we can see how these developing components can yield pretty good learning of transitivity relations and an input filter, and then eventually canonical word order.  I also deeply appreciate the careful caveats P&al2020 give in the general discussion for how to interpret their modeling results. This is so important, because it’s so easy to misinterpret modeling results (especially if you weren’t the one doing the modeling -- and sometimes, even if you *are* the one doing the modeling!)

Other thoughts (I had a lot!):

(1) A key point seems to be that the input representation matters -- definitely preaching to the choir, here! What’s true of cognitive modeling seems true for (language) learning, period: garbage in, garbage out. (Also, high quality stuff in = high quality stuff out.) Relatedly, I love the “quality over quantity” takeaway in the general discussion, when it comes to the data children use for learning. This seems exactly right to me, and is the heart of most “less is more” language learning proposals.

(2) A core aspect of the model is that the learner recognizes the possibility of misparsing some of the input. This doesn’t seem like unreasonable prior knowledge to have -- children are surely aware that they make mistakes in general, just by not being able to do/communicate the things they want. So, the “I-make-mistakes” overhypothesis could potentially transfer to this specific case of “I-make-mistakes-when-understanding-the-language-around-me”.

(3) It’s important to remember that this isn’t a model of simultaneously/jointly learning transitivity and word order (for the first part of the manuscript, I thought it was). Instead, it’s a joint learning model that will yield the rudimentary learning components (initial transitivity classes, some version of wh-dependencies that satisfy canonical word order) that a subsequent joint learning process could use. That is, it’s the precursor learning process that would allow children to derive useful learning components they’ll need in the future.  The things that are in fact jointly learned are rudimentary transitivity and how much of the input to trust (i.e., the basic word order filter).

(4) Finding that learning with a uniform prior works just as well:  This is really interesting to me because a uniform prior might explain how very young children can accomplish this inference. That is, they can get a pretty good result even with a uniform prior -- it’s wrong, but it doesn’t matter. Caveat: The model doesn’t differentiate transitive vs. intransitive if its prior is very biased towards the alternating class. But do we care, unless we think children would be highly biased a priori towards the alternating class?

Another simple (empirically-grounded) option is to seed the priors based on the current verbs the child knows, which is a (small) subset of the language’s transitive, intransitive, and alternating verbs. (P&al2020 mention this possibility as part of an incrementally-updating modeled learner.) As long as most of those in the subset aren’t alternating (and so cause that highly-skewed-towards-alternating prior), it looks like the English child will end up making good inferences about subsequent verbs.

(5) I feel for the authors in having the caveat about how ideal Bayesian inference is a proof of concept only. It’s true! But it’s a necessary first step (and highly recommended before trying more child-realistic inference processes -- which may in fact be “broken” forms of the idealized Bayesian computation that Gibbs sampling accomplishes here). Moreover, pretty much all our cognitive models are proofs of concept (i.e., existence proofs that something is possible). That is, we always have to idealize something to make any progress. So, the authors here do the responsible thing and remind us about where they’re idealizing so that we know how to interpret the results.

(6) The second error parameter (delta) about the rate of object drop -- I had some trouble interpreting it. I guess maybe it’s a version of “Did I miss $thing (which only affects that argument) or did I swap $thing with something else (which affects that argument and another argument)?” But then in the text explaining Figure 1, it seems like delta is the global rate of erroneously generating a direct object when it shouldn’t be there. Is this the same as “drop the direct object” vs. “confuse it for another argument”? It doesn’t quite seem like it. This is “I misparsed but accidentally made a direct object anyway when I shouldn’t have,” not necessarily “I confused the direct object with another argument”. Though maybe it could be “I just dropped the direct object completely”?

(7) As the authors note themselves, the model’s results look like a basic fuzzy thresholding decision (roughly: 0-15% direct objects = intransitive, 15% to around 80% = alternating, around 80-100% = transitive). Nothing wrong with this at all, but maybe the key is to have the child’s representation of the input take into account some of the nuances mentioned in the results discussion (like wait used with temporal adjuncts) that would cause these thresholds to be more accurate. Then, the trick to learning isn’t about fancy inference (though I do love me some Bayesian inference), but rather the input to that inference.
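
Spelled out as the (very simple) decision rule I have in mind -- the helper function and the example verbs’ rates are mine, not P&al2020’s actual model; only the rough 15% and 80% cutoffs come from their results:

def classify_verb(direct_object_rate, low=0.15, high=0.80):
    # Toy thresholding rule over the proportion of a verb's uses that have a direct object.
    # The ~15% and ~80% cutoffs are the rough values from the results; the function itself
    # is just my restatement, not P&al2020's actual Bayesian model.
    if direct_object_rate <= low:
        return "intransitive"
    elif direct_object_rate <= high:
        return "alternating"
    else:
        return "transitive"

# Invented direct-object rates for three familiar verbs:
for verb, rate in [("arrive", 0.02), ("eat", 0.55), ("hit", 0.93)]:
    print(verb, "->", classify_verb(rate))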

(8) My confusion about the “true” error parameter values (epsilon and delta): What do error parameters mean for the true corpus? That a non-canonical word order occurred? But weren’t all non-canonical instances removed in the curated input set?

(9) Figure 5:  If I’m interpreting the transitive graph correctly, it looks like super-high delta and epsilon values yield the best accuracy. In particular, if epsilon (i.e., how often to ignore the input) is near 1, we get high accuracy (near 1). What does that mean? The prior is really good for this class of verbs? This is the opposite of what we see with the alternating verbs, where low epsilon yields the best accuracy (so we shouldn’t ignore the input).

Relatedly though, it’s a good point that the three verb classes have different epsilon balances that yield high accuracy. And I appreciated the explanation that a high epsilon means lowering the threshold for membership into the class (e.g., transitive verbs).

(10) The no-filter baseline (with epsilon = 0): Note that this (dumb) strategy has better performance across all verbs (.70) simply because it gets all the alternating verbs right, and those comprise the bulk of the verbs. But this is definitely an instance of perfect recall (of alternating) at the cost of precision (transitive and intransitive).
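
With some made-up verb counts (chosen so the baseline lands at .70, but not the paper’s actual numbers), the recall/precision tradeoff looks like this:

# Made-up verb counts (chosen so the baseline lands at .70; not the paper's actual numbers):
true_counts = {"alternating": 35, "transitive": 10, "intransitive": 5}
total = sum(true_counts.values())

# The no-filter baseline labels every verb as alternating:
accuracy = true_counts["alternating"] / total                 # 0.70 here, by construction
recall_alternating = 1.0                                      # every alternating verb is caught
precision_alternating = true_counts["alternating"] / total    # lots of false alarms, though

print(f"accuracy={accuracy:.2f}, recall(alternating)={recall_alternating:.2f}, "
      f"precision(alternating)={precision_alternating:.2f}")
# Meanwhile, recall for transitive and intransitive verbs is 0 -- perfect recall of the
# big class at the cost of everything else.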

(11) It’s a nice point that the model performs like children seem to in the presence of noisy input (where the noisy input doesn’t obviously have a predictable source of noise) --  i.e., children overregularize, and so does the model. And the way the model learns this is by having global parameters, so information from any individual verb informs those global parameters, which in turn affects the model’s decisions about other individual verbs. 

(12) I really like the idea of having different noise parameters depending on the sources of noise the learner thinks there are. This might require us to have a more articulated idea of the grammatical process that generates data, so that noise could come from different pieces of that process. Then, voila -- a noise parameter for each piece.

(13) It’s also a cool point about the importance of variation -- the variation provides anchor points (here: verbs the modeled child thinks are definitely transitive or intransitive). If there were no variation, the modeled child wouldn’t have these anchor points, and so would be hindered in deciding how much noise there might be. At a more general level, this idea about the importance of variation seems like an example where something “harder” about the learning problem (here: variation is present in the verbs) actually makes learning easier.

(14)  Main upshot: The modeled child can infer an appropriate filter (=”I mis-parse things sometimes” + “I add/delete a direct object sometimes”) at the same time as inferring classes of verbs with certain argument structure (transitive, intransitive, and alternating). Once these classes are established, then learners can use the classes to generalize properties of (new) verbs in those classes, such as transitive verbs having subjects and objects, which correspond to agents and patients in English. 

Relatedly, I’d really love to think more about this with respect to how children learn complex linking theories like UTAH and rUTAH, which involve a child knowing collections of links between verb arguments (like subject and object) and event participants (like agent and patient). That is, let’s assume the learning process described in this paper happens and children have some seed classes of transitive, intransitive, and alternating + the knowledge of the argument structure associated with each class (must have direct object [transitive], must not have direct object [intransitive], may have direct object [alternating]). I think children would still have to learn the links between arguments and event participants, right? That is, they’d still need to learn that the subject of a transitive verb is often an agent in the event. But they’d at least be able to recognize that certain verbs have these arguments, and so be able to handle input with movement, like wh-questions for transitive verbs.

Sunday, December 1, 2019

Some thoughts on Ud Deen & Timyam 2018

I really appreciate seeing a traditional Universal Grammar (UG) approach to learnability spelled out so clearly, especially coupled with clear behavioral data about the phenomenon in question (condition C). It makes it easier to see where I agree vs. where I’m concerned that we’re not being fair to alternative accounts (or even how we’re characterizing linguistic nativist vs. non-linguistic nativist accounts). What really struck me after all the learnability discussions at various points (more on this below) is that I really want a computational cognitive model that tries to learn condition C child behavior facts in English and Thai. When we have a concrete model, we can be explicit about what we’re building in that makes the model behave vs. not behave like children in these languages. And then it makes sense to have a discussion about the nature of the built-in stuff.

Specific thoughts:
(1) Intro, the traditional UG learnability approach: The traditional claim is if we test kids as young as we can (here, that’s Thai four-year-olds, though in English we’ve apparently tested 2.5-year-olds), and they show a certain type of knowledge, we assume that knowledge is innate. For me, I think this is a good placeholder — we have to explain how kids have this knowledge by that age. That then means a careful analysis of the input and reasonable investigation of how that knowledge could be derived from more fundamental building blocks (maybe language-specific building blocks, but maybe not). This goal of course depends on how the knowledge of condition C is represented, which is part of what often evolves in theoretical debates.

(2) Section 2.2, learnability again: “the negative properties of a language are downright impossible to acquire from the input because children never get evidence of what is impossible in the language.” -- Of course, it all depends on the representation we think children are learning, and what building blocks go into that representation. This is what Pearl & Sprouse (2013) did for syntactic islands (i.e., subjacency constraints) -- and it turns out you can get away with much more general-purpose knowledge, which may or may not be UG. 


UD&T2018 walk through how the learning process might work for condition C knowledge (and kudos to them for doing an input analysis! Much more convincing when you have actual counts in children’s input data of the phenomena you’re talking about for learnability). They highlight that children basically hear all the viable combinations of pronoun and name with both co-indexed and non-coindexed readings, but only hear the problematic structure with the non-coindexed reading. They then ask why a child wouldn’t just assume the co-indexed reading is fine for this one, too. (And that would lead to a condition C violation.) 

But the flip side is if children get fairly strong evidence for all the other options allowing coindexed readings but this structure doesn’t, why would they assume it can be coindexed? This really comes back to expectations and indirect evidence -- if you keep expecting something to occur, and it keeps not occurring, you start shifting probability towards the option that it can’t in fact occur. To make this indirect evidence account work, children would have to expect that coindexed reading to occur and keep seeing it not occur. This doesn’t seem implausible to me, but it helps to have an existence proof. (For instance, what causes them to have this expectation and what are the restrictions on the structures they expect to allow co-indexing?)

All that said, I’m completely with UD&T2018 that children aren’t unbiased learners. Children clearly have constraints on the hypotheses they entertain. What I’m not sure of is the origin of those constraints as being language-specific and innate vs. derived from prior experience. As I mentioned above, it really depends what building blocks underlie the explicit hypotheses the child is entertaining. It’s perfectly possible for the implicit hypothesis space to be really (infinitely) large, because of the way building blocks can combine (especially if there’s any kind of recursion). But domain-general biases (e.g., prefer explicit hypotheses that require fewer building blocks) can skew the prior over this implicit hypothesis space in a useful way.

I’m also not very convinced about the need for a language-specific Subset Principle -- this same preference for a narrower hypothesis, given ambiguous data, falls out from standard Bayesian inference. (Basically, a narrower hypothesis has a higher likelihood for generating an ambiguous data point than a wider hypothesis does, so boom -- we have a preference for a narrower hypothesis that comes from a domain-general learning mechanism.) But okay, if a bias for the narrower hypothesis helps children navigate the condition C acquisition problem, great.
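
Here’s the size-principle arithmetic, just to have it on the page -- the hypothesis sizes and data counts are toy numbers of my own:

# Toy size-principle calculation: a narrower hypothesis licenses fewer forms, so each
# licensed form gets more probability mass, and ambiguous data favor the narrow hypothesis.
narrow_size, wide_size = 10, 100     # invented numbers of licensed structures
prior_narrow = prior_wide = 0.5      # no built-in Subset Principle bias

n_ambiguous = 5                      # data points compatible with both hypotheses
lik_narrow = (1 / narrow_size) ** n_ambiguous
lik_wide = (1 / wide_size) ** n_ambiguous

posterior_narrow = (lik_narrow * prior_narrow) / (
    lik_narrow * prior_narrow + lik_wide * prior_wide)
print(f"P(narrow | {n_ambiguous} ambiguous data points) = {posterior_narrow:.5f}")
# Each ambiguous data point favors the narrow hypothesis by a factor of 10 here,
# so after 5 of them the posterior is ~0.99999 -- the subset preference for free.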

(3) Section 2.4, the nativist vs. non-nativist discussion: It’s illuminating to see the discussion of what would count as “nativist” vs. “non-nativist” here. (Side note: I have a whole discussion about this particular terminology issue in a recent manuscript about Poverty of the Stimulus: Pearl, L. (2019, under review). Poverty of the Stimulus Without Tears. https://ling.auf.net/lingbuzz/004646.)

Using the terms I prefer, UD&T2018 are thinking about a linguistic nativist approach (=their “nativist”) where explicit knowledge of condition C is innately available vs. a non-linguistic nativist approach (=their “non-nativist”) where this knowledge is instead derived from the input. Of course, it’s very possible that explicit knowledge of condition C is derived from the input over time by *still* using some innate, language-specific knowledge (maybe something more general-purpose). If so, then we still have a linguistic nativist position, but now it’s one that’s utilizing the input more than UD&T2018 think linguistic nativist approaches do. What would make an approach non-linguistic nativist is the following: the way explicit knowledge of condition C was derived from the input never involved language-specific innate knowledge, but rather only domain-general innate knowledge (or mechanisms, like Bayesian inference).

Relatedly, though, on children initially obeying condition C everywhere (even for bare nominals): This behavior would accord with (relatively) rapid acquisition of explicit condition C knowledge (which initially gets overapplied for Thai). But it’s not clear to me that we can definitively claim linguistic vs. non-linguistic nativist for the necessary knowledge. Again, it all depends on how condition C knowledge is represented and what building blocks kids use (and could track in the input) to construct that knowledge. 

Also, with respect to age of acquisition, without an explicit theory of how learning occurs, how do we know that derived approaches to learning condition C couldn’t yield child judgment behavior at four? That is, how do we know that acquisition wouldn’t be fast enough this way? (This is one of my main concerns with claims that young children knowing something means they were innately endowed with that knowledge.)

(4) Section 4, discussion of Thai children overapplying condition C: UD&T2018 talk about this as children allowing the most restrictive grammar. But honestly, thinking about this in terms of Yang’s Tolerance Principle, couldn’t it be about learning the rule (here: condition C) in the presence of exceptions? So it’s the same grammar, but we just have exceptions in Thai while languages like English don’t have exceptions. If so, then it’s not clear that considerations about more restrictive vs. less restrictive grammars apply.
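
For reference, here’s the Tolerance Principle threshold I’m gesturing at (N / ln N exceptions tolerated for a rule over N relevant items); the item counts are made up just to see how the threshold scales:

import math

def tolerance_threshold(n):
    # Yang's Tolerance Principle: a productive rule over N relevant items survives
    # as long as it has at most N / ln(N) exceptions.
    return n / math.log(n)

# Made-up item counts, just to see how the threshold scales:
for n in [20, 100, 500]:
    print(f"N={n:4d}  tolerated exceptions <= {tolerance_threshold(n):.1f}")
# e.g., with 100 relevant items, the rule survives up to ~21 exceptions -- so Thai-style
# exceptions could in principle coexist with the same condition C rule English children learn.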

(5) Section 4, on minimalist approaches to condition C: I like the idea of considering the building blocks of condition C a lot. But then, isn’t it an open question whether the explicit condition C knowledge constructed from these building blocks has to be innate (and perhaps more interestingly, as constrained as UD&T2018 initially posit it to be)? I’m very willing to believe some building blocks are innate (I’m a nativist, after all), but it seems yet-to-be-determined whether the necessary building blocks are language-specific. And again, even if they are, how constrained do they make the child’s implicit hypothesis space? (This is where it really hit me that I wanted a computational cognitive model of learning condition C facts — and really, a model that replicates adult and child judgment behavior.)

Friday, November 15, 2019

Some thoughts on Villata et al. 2019

I appreciated seeing up front the traditional argument about economy of representation because it’s often at the heart of theoretical debate. What’s interesting to me is the assumption that something is more economical just based on intuition, without having some formal way to evaluate how economical it is. So, good on V&al2019 for thinking about this issue explicitly. More generally, when I hear this kind of debate about categorical grammar + external gradience vs. gradience in the grammar, I often wonder how on earth you could tell the difference. V&al2019 are approaching this from an economy angle, rather than a behavioral fit angle, and showing a proof-of-concept with the SOSP model. That said, it’s interesting to note that the SOSP implementation of a gradient grammar clearly includes both syntactic and semantic features -- and that’s where its ability to handle some of the desiderata comes from.

Other comments:
(1) Model implementation involving the differential equations: If I understand this correctly, the model is using a computational-level way to accomplish the inference about which treelets to choose. (This is because it seems like it requires a full enumeration of the possible structures that can be formed, which is typically a massively parallel process and not something we think humans are doing on a word-by-word processing basis.) Computational-level inference for me is exactly like this: we think this is the inference computation humans are trying to do, but they approximate it somehow. So, we’re not committed to this being the algorithm humans use to accomplish that inference. 

That said, what V&al2019 describe here isn’t an optimal inference mechanism, the way Gibbs sampling is, since it seems to allow the equivalent of probabilistic sampling (where a sub-optimal option can win out in the long run). So, this differs from the way I often see computational-level inference in Bayesian modeling land, because there the goal is to identify the optimal result of the desired computation.

(2) Generating grammaticality vs. acceptability judgments from model output: I often hear “acceptability” used when grammaticality is one (major) aspect of human judgment, but there are other things in there too (like memory constraints or lexical choice, etc.). I originally thought the point of this model was that we’re trying to generate the judgment straight from the grammar, rather than from other factors (so it would be a grammaticality judgment). But maybe because a word’s feature vector also includes semantic features (whatever those look like), then this is why the judgment is getting termed acceptability rather than grammaticality?

(3) I appreciate the explanation in the Simulations section about the difference between whether islands and subject islands -- basically, for subject islands, there are no easy lexical alternatives that would allow sub-optimal treelets to persist that will eventually allow a link between the wh-phrase and the gap. But something I want to clear up is the issue of parallel vs. greedy parsing. I had thought that the SOSP approach does greedy parsing because it finds what it considers the (noisy) local maximum option for any given word, and proceeds on from there. So, for whether islands, it picks either the wonder option that’s an island or the wonder option that gets coerced to think, because both of those are somewhat close in how harmonic they are. (For subject islands, there’s only one option -- the island one -- so that’s the one that gets picked). Given this, how can we talk about the whether island as having two options at all? Is it that on any given parse, we pick one, and it’s just that sometimes it’s the one that allows a dependency to form? That would be fine. We’d just expect to see individual variation from instance to instance, and the aggregate effect would be that D-linked whether islands are better, basically because sometimes they’re okay-ish and sometimes they’re not.