Tuesday, February 5, 2019

Some thoughts on Fitz & Chang 2017 + Bonus thoughts on McCoy et al. 2018

(Just a quick note that I had a lot of thoughts about these papers, so this is a lengthy post.)

***F&C2017 general thoughts:

This paper tackles one of the cases commonly held up to argue for innate, language-specific knowledge: structure-dependent rules for syntax (and more specifically, complex yes/no questions that require such rules). The key: learn how to produce these question forms without ever seeing (m)any informative examples of them. There have been a variety of solutions to this problem, including recent Bayesian modeling work (Perfors et al. 2011) demonstrating how this knowledge can be inferred as long as the child has the ability to consider structure-dependent rules in her hypothesis space. Here, the approach is to broaden the relevant information from just the form of language (which is traditionally what syntactic learning has focused on) to also include meaning. This reminds me of CCG, which naturally links the form of something to its meaning during learning, and gets great bootstrapping power from that (see Abend et al. 2017 for an example and my forthcoming book chapter for a handy summary).

Perfors, A., Tenenbaum, J. B., & Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118(3), 306-338.

Abend, O., Kwiatkowski, T., Smith, N. J., Goldwater, S., & Steedman, M. (2017). Bootstrapping language acquisition. Cognition, 164, 116-143.

Pearl, L. (forthcoming). Modeling syntactic acquisition. In J. Sprouse (ed.), Oxford Handbook of Experimental Syntax.

Interestingly enough with respect to what’s built into the child, it’s not clear to me that F&C2017 aren’t still advocating for innate, language-specific knowledge (which is what Universal Grammar is typically thought of). This knowledge just doesn’t happen to be *syntactic*. Instead, the required knowledge is about how concepts are structured. This reminds me of my comments in Pearl (2014) about exactly this point. It seems that non-generativist folks aren’t opposed to the idea of innate, language-specific knowledge -- they just prefer it not be syntactic (and preferably not labeled as Universal Grammar). Here, it seems that innate, language-specific knowledge about structured concepts is one way to accomplish the learning goal. More on this below in the specific thoughts section.

Pearl, L. (2014). Evaluating learning-strategy components: Being fair (Commentary on Ambridge, Pine, and Lieven). Language, 90(3), e107-e114.

***Bonus general thoughts on McCoy et al. 2018:
In contrast to F&C2017, M&al2018 are using only syntactic info to learn from. However, it seems like they’re similar to prior work in using smaller building blocks (i.e., indirect positive evidence) to generate hierarchical structure (i.e., structure-dependent representations) as the favored hypothesis. This is also similar to Perfors et al. (2011) -- the main difference is that M&al2018 are using a non-symbolic model, while Perfors et al. (2011) are using a symbolic one. This then leads into the interpretation issue for M&al2018 -- when you find an RNN that works, why does it work? You have to do much more legwork to figure it out, compared to a symbolic model. However, F&C2017 had to do this too for their connectionist model, and I think they demonstrated how you can infer what may be going on quite well (in particular, which factors matter and how).

M&al2018 end up using machine learning classifiers to figure it out, and this seems like a great technique for trying to understand what’s going on in these distributed representations. It’s also something I’m seeing in the neuroscience realm when they try to interpret the distributed contents of, for instance, an fMRI scan.
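To make that technique concrete, here’s a minimal sketch of the probing-classifier idea (my generic illustration, not M&al2018’s actual setup; the hidden states and labels below are synthetic stand-ins): train a simple classifier to predict a linguistic property from a network’s hidden-state vectors, and treat decoding accuracy as evidence that the property is represented there.

```python
# A minimal probing-classifier sketch: can a linear classifier recover a
# property of interest from distributed hidden-state vectors?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 500 "hidden states" of dimension 50, each labeled
# with a binary property (e.g., "is the main auxiliary still upcoming?").
hidden_states = rng.normal(size=(500, 50))
labels = (hidden_states[:, :5].sum(axis=1) > 0).astype(int)  # property encoded in a few dimensions

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

If the probe does well (as it should here, since the property is linearly encoded by construction), that’s evidence the representation carries the property; if it’s at chance, the property either isn’t there or isn’t linearly decodable.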


**Specific thoughts on F&C2017:
(1) The key idea seems to be that nonlinguistic propositions are structured and this provides the crucial scaffolding that allows children to infer structure-dependent rules for the syntactic forms. Doesn’t this still rely on children having the ability to allow structure-dependence into their hypothesis space? Then, this propositional structure can push them towards the structure-dependent rules. But then, that’s no different than the Perfors et al. (2011) approach, where the syntactic forms from the language more broadly pointed towards structured representations that would naturally form the building blocks of structure-dependent rules.

The point that F&C2017 seem to want to make: The necessary information isn’t in the linguistic input at all, but rather in the non-linguistic input. So, this differs from linguistic nativists, who believe it’s not in the input (i.e., the necessary info is internal to the child) and from emergentists/constructionists, who believe it’s in the input (though I think they also allow it to not be the linguistic input specifically). But then, we come back to what prior knowledge/abilities the child needs to harness the information available if it’s in the input (of whatever kind) somewhere. How does the child know to view the input in the crucial way in order to be able to extract the relevant information? Isn’t that based on prior knowledge, which at some point has to be innate? (And where all the disagreement happens is how specific that innate knowledge is.)

Also related: In the discussion, F&C2017 say “Input is the oil that lubes the acquisition machinery, but it is not the machinery itself.” Exactly! And what everyone argues about is what the machinery consists of that uses that input. Here, F&C2017 say “the structure of meaning can constrain the way the language system interacts with experience and restrict the space of learnable grammar.” Great! So, now we just have to figure out where knowledge of that meaning structure originates.

(2) This description of the generativist take on structure dependence seemed odd to me: “consider only rules where auxiliaries do not move out of their S domains”. Well, sure, in this case we’re talking about (S)entences as the relevant structure. But the bias is more general than that (which is why it’s applicable to all kinds of structures and transformations, not just yes/no questions): only consider rules that use structures (like S) as building blocks/primitives. The reliance on linguistic structures, rather than other building blocks, is what makes this bias language-specific. (Though I could imagine an argument where the bias itself is actually a domain-general thing like “use the salient chunks in your system as building blocks for rules”, and that gets implemented in this domain with “salient chunks” = “linguistic structures like S”.)

(3) I quite liked Figure 1, with its visual representation of what a child’s hypothesis space looks like under each approach. I think it’s fair to say the linguistic nativist approach has traditionally ruled out structure-independent grammars from the hypothesis space, while the constructivist approach hasn’t. Of course, there are far more nuanced ways to implement the linguistic nativist idea (e.g., a low, but non-zero, prior on structure-independent grammars), but this certainly serves as the extreme endpoint.

(4) In 1.2, F&C2017 comment on the Perfors et al. 2011 Bayesian model, saying that it doesn’t “explain how grammars are acquired in the first place”. I think this must be referring to the fact that the hypothesis space of the Bayesian learner included possible grammars and the modeled learner was choosing among them. But how else is learning supposed to work? There’s a hypothesis space that’s defined implicitly, and the learner draws/constructs some explicit hypothesis from that implicit hypothesis space to evaluate (Perfors 2012 talks about this very helpfully). Maybe F&C2017 want a learner that constructs the building blocks of the implicit hypothesis space too? (In which case, sure, I’d love to have a model of conceptual change like that. But no one has that yet, as far as I’m aware.)

Perfors, A. (2012). Bayesian models of cognition: what's built in after all?. Philosophy Compass, 7(2), 127-138.

F&C2017 also note in that same part that it’s problematic that children don’t seem to be as optimal as the computational-level Bayesian model. Again, sure, in the same way that any computational-level model needs to be translated to an algorithmic-level version that approximates the inference with child limitations. But this doesn’t seem such a big problem -- or rather, if it is, it’s *everyone’s* problem who works at the computational level of modeling.

(5) I really like the point F&C2017 make about the need to integrate meaning with these kinds of learning problems. As they rightly note, what things mean is a very salient source of information. Traditionally, syntactic learning approaches in the generativist world have assumed the child only considers syntactic information when learning about syntactic knowledge. But precisely because syntax is a conduit through which meaning is expressed and meaning transfer is the heart of communication, it seems exactly right that the child could care about information coming from meaning even when learning something syntactic. This again is where the Abend et al. (2017) model gets some of its bootstrapping power. (Also, Pearl & Mis 2016 for anaphoric one -- another traditional example of poverty of the stimulus -- integrates meaning information when learning something ostensibly syntactic.)

Pearl, L. & Mis, B. (2016). The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Language, 92(1), 1-30.

(6) The Dual-path connectionist model, which uses thematic role & tense info: Importantly, the need for this information is motivated by production in F&C’s model -- you’re trying to express some particular meaning with the form you choose, and that’s part of what’s motivating the form. In theory, this should also be relevant for comprehension, of course. But what’s nice about this approach is that it gets at one of the key criticisms generativists (e.g., Berwick et al. 2011) had of prior modeling approaches -- namely, the disconnect between the form and the meaning.

Berwick, R. C., Pietroski, P., Yankama, B., & Chomsky, N. (2011). Poverty of the stimulus revisited. Cognitive Science, 35(7), 1207-1242.

(7) The dual-path architecture: It’s interesting to see the use of a compression layer here, which forces the model to abstract away from details -- i.e., to form internal categories like we believe humans do. (Here, this means abstracting away from individual words and forming syntactic categories of some kind). I think this forced abstraction is one of the key motivations for current autoencoder approaches in machine learning.
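Here’s a minimal sketch of that forced-abstraction pressure (mine, not the Dual-path architecture itself; the layer sizes are made up): a narrow bottleneck layer means the network can only keep coarse, category-like regularities about its inputs, which is the same pressure autoencoders rely on.

```python
# A tiny bottleneck ("compression") network: the input must be squeezed
# through a low-dimensional layer before being reconstructed, forcing the
# model to abstract away from item-specific detail.
import torch
import torch.nn as nn

vocab_size, bottleneck = 100, 8   # hypothetical sizes

model = nn.Sequential(
    nn.Linear(vocab_size, bottleneck),   # compression layer: 100 -> 8
    nn.ReLU(),
    nn.Linear(bottleneck, vocab_size),   # reconstruct the input from the squeeze
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

data = torch.eye(vocab_size)   # toy one-hot "word" inputs
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(data), data)
    loss.backward()
    optimizer.step()

print("final reconstruction loss:", loss.item())
```

With only 8 bottleneck units for 100 items, the network can’t memorize each word separately; whatever it keeps has to be shared structure, which is the category-formation intuition.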

(8) Encoding complex utterances: If I’m understanding this correctly, here’s where we see the structure explicitly -- we have one complete proposition connected to the agent concept of another proposition. So, the structured representation is available to the learner a priori via the conceptual structure. So, we might reasonably call this domain-specific knowledge, just not domain-specific syntactic knowledge. Then, experience with the language input tells the child how to translate that structured concept into a sequence of words, in this case, via the use of relative clauses. In particular, the child needs to see relative clauses used for embedded conceptual structures like this.

(9) Input distribution: I really appreciate F&C2017’s attention to realistic input distributions for training their model. This makes their model connect more to the actual problem children face, and so it makes their modeling results more informative.

(10) I think it’s really informative to see these results where the model can recreate specific observed differences in the developmental trajectory, and explain them by means of how the input is viewed. That is, the power of the learning approach is basically in viewing the input the right way, with the right scaffolding knowledge (here, about links between structured concepts and syntactic forms). Once that input lens is on, the input much more transparently reflects the observed behavior patterns in children. And this is what good computational modeling can do: make a learning theory specific enough to evaluate (here, about how to use that input), and then evaluate it by giving it realistic input and seeing if it can generate realistic output.

(11) It seems like F&C2017’s characterization of the hypothesis space aligns with other prior approaches like Perfors et al. 2011: the prior knowledge is a soft constraint on possible grammars, rather than absolutely ruling out structure-independent grammars. (In fact, Perfors et al. 2011 went further and used a simplicity prior, which is biased against the more complex structure-dependent grammars.) But the basic point is that there’s no need to categorically restrict the hypothesis space a priori. Instead, children can use their input and prior knowledge to restrict their hypotheses appropriately over time to structure-dependent rules.

**Bonus thoughts on M&al2018:
(B1) So, as cognitive scientists, should we spend more research time on the architecture that worked (i.e., the GRU with attention)? It does a very non-human thing, while also doing human things. And we don’t know why it’s doing either of those things, compared with other similar-seeming architectures that don’t. I should note that this is my existential issue with non-symbolic models, not a criticism specifically for M&al2018. I think they did a great job for a first pass at this question. Also, I really appreciate how careful they were about giving caveats when it comes to interpreting their results.

Tuesday, January 22, 2019

Some thoughts on Hahn et al. 2018

It’s really cool to see how adding processing considerations to an idealized (i.e., rational) model yields observable behavior. It reminds me of the importance of the different Marr explanation levels, where the algorithmic level is where processing considerations often get added (since these affect the algorithm humans use). A lot of work we’ve read about so far has been at the computational level (where, for example, the Rational Speech Act model typically lives). But in the back of my mind, I’m always thinking about what key differences might emerge once we have the bottleneck of human cognitive constraints.

Some other thoughts:
(1) Introduction, “As they occur in languages with widely different grammatical structures, we can expect that such an explanation will make reference to general principles of human communication and cognition” - I’m completely sympathetic to this approach, though it strikes me as funny that this is the same empirical fact that generativists use to appeal to innate, language-specific mechanisms (i.e., Universal Grammar). That is, the appearance of a pattern like this across the world’s languages is a signal to generativists that a universal language-specific principle is at work. Of course, as Hahn et al. (2018) note, it could well be that the universal principle has an effect on language (here, as adjective ordering constraints) but in fact the principle itself could be domain-general (e.g., something having to do with memory limitations, etc.).

(2) The Function of Subjective Adjectives:  I love seeing how to operationalize intuitions formally -- this is a great example. We have a somewhat squishy notion of subjectivity that gets formalized as judgments whose truth is relative to individuals, which subsequently gets implemented as the listener inferring the speaker’s judgment.

(3) A Model of Adjective Use, where the listener infers a full world state that includes multiple people: This seems equivalent to inferring that the adjective is in fact subjective. Developmentally, that’s definitely a step kids have to figure out, i.e., is adjective A likely to be something everyone agrees on or not?

(4) Communication: Rational Listeners and Speakers, an RSA model with just L0 and S. So this is just a model of why a speaker chooses to say something, rather than how a pragmatic listener (L1) chooses to interpret it? I wonder why we stop at this level rather than going another layer to a pragmatic speaker (S1) who chooses to say something, based on how a pragmatic listener will interpret it. That’s what we have to do when modeling Truth Value Judgment Tasks (TVJTs), for example. Maybe that’s because TVJTs aren’t normal speech events, but instead involve participants judging what they themselves would say?
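For reference, here’s the generic RSA setup at these two levels as I understand it (written in the standard form from the RSA literature, not necessarily Hahn et al.’s exact parameterization): a literal listener L0 that conditions on an utterance’s literal truth, and a speaker S that soft-maximizes informativity minus cost.

```latex
% Generic RSA literal listener and speaker (standard formulation; Hahn et al.
% may parameterize the cost and utility differently):
\begin{align*}
L_0(w \mid u) &\propto [\![u]\!](w)\, P(w) \\
U_S(u; w) &= \log L_0(w \mid u) - \mathrm{cost}(u) \\
S(u \mid w) &\propto \exp\bigl(\alpha\, U_S(u; w)\bigr)
\end{align*}
```

Adding a pragmatic listener L1 (and then a pragmatic speaker S1 reasoning about L1) would just mean iterating this reasoning one or two more levels.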

(5) Communication: Rational Listeners and Speakers, where a speaker’s utility function is adjusted (from just basic negative surprisal) because she realizes other people’s judgments may differ from her own: This part where the expected utility isn’t just negative surprisal may be a developmental step kids would have to complete. That is, if kids realize adjectives can be subjective and other people may disagree, then they’d behave like what’s modeled here. On the other hand, if kids don’t realize adjectives can be subjective, they may simply go with negative surprisal.

(6) Communication: Rational Listeners and Speakers, where the cost is the surprise of utterance u across the community’s language use: This is interesting too -- usually we see cost having to do with individual production costs such as longer utterances being more costly than shorter ones. But of course here, all the utterances are the same length. Instead, what could differ is the frequency of that combination. This seems like a useful aspect to incorporate into speaker models more generally, since frequency in the input can certainly affect ease of production.

(7) Adding Noise: I love seeing how this explanation works, with words further back being more likely to be deleted before the whole phrase can be interpreted. It’s nicely intuitive that more subjective words would be preferred further back, since they lead to more disagreement across listeners. But I wonder how this story would work for languages where the adjectives come after the noun. In that case, the more subjective adjective is still further away, but this time it’s the one the listener would have heard most recently. So, I think that means the one further in the past would be deleted more often -- in this case, the less subjective one -- and the more subjective one would be likely to survive. And then it becomes weird, because now we get the reverse situation, where the surviving adjective is the one that listeners don’t agree on as much. It seems like this account would predict languages with adjectives occurring after the noun to have the more subjective ones closer to the noun, since they’d be more likely to be forgotten. But that’s not what we see.

There’s a specific note about this in the discussion: “Our account seems to make the correct prediction. In such languages, the noun is more likely to be lost when the second (subjective, in this case) adjective is reached.” -- So is the idea that the listener is just left with the two adjectives and no noun? Why does that lead to the correct order of noun-less_subjective_adj-more_subjective_adj, from a communicative standpoint?

Tuesday, December 4, 2018

Some thoughts on Bentz et al. 2017

I really appreciate seeing a clear explanation at the outset about how to cognitively interpret word entropy. The first thing I wonder when I see a cognitive model is what the variables are meant to correspond to in human cognition, and we get that right up front when it comes to discussing entropy (and why we should care about it). Basically, it’s a reflection of a processing cost (where minimizing entropy means minimizing that cost), so we potentially get some explanatory power about why language use looks the way it does, through the lens of entropy.


The main contributions of B&al2017 seem to be about establishing the ground truth about cross-linguistic entropy variation and the methodology for assessing entropy -- so before we start worrying about what causes variation in word entropy, let’s first figure out how to assess it really well and then see if there are actually any differences that need explanation. The main finding is that, hey, word entropy doesn’t really differ. Therefore, whatever entropy indexes cognitively also doesn’t differ from language to language... which I think would make sense if this is about general human language processing abilities.
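To make “word entropy” concrete, here’s a minimal unigram-entropy sketch (B&al2017 use much more careful estimators, especially for the entropy rate, but this is the basic quantity being estimated in the unigram case).

```python
# Unigram word entropy: H = -sum_w p(w) * log2 p(w), with p(w) estimated
# from relative word frequencies in a text.
import math
from collections import Counter

def unigram_entropy(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "the cat saw the dog and the dog saw the cat".split()
print(f"unigram entropy: {unigram_entropy(text):.3f} bits per word")
```

The entropy rate is the harder quantity, since it has to account for how much preceding context reduces uncertainty about the next word; that’s where the estimation methodology B&al2017 spend time on comes in.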


The other main finding is summed up this way: Unigram entropies and entropy rates are related -- in fact, you can predict entropy rate from unigram entropy. Here’s where I start to quibble because the interpretation given here doesn’t help me much: “uncertainty-reduction by co-textual information is approximately linear across the languages of the world.” What does this mean exactly? I don’t know how to contextualize that with respect to language processing. To be fair, I think B&al2017 are clear (in section 5.3) that they don’t know how either: “The exact meaning and implications of these constants are topics for future research.”


Other thoughts:


(1) B&al2017 note that they’ll discuss how the word entropy facts (i.e., the consistency across human languages) result from a trade-off between word learnability and word expressivity. In 6.1, they give us a bit of a sketch, which is nice -- basically this:


unlimited entropy = unlimited expressivity = unpredictable = hard to learn
minimum entropy = no expressivity = hard to communicate


This is the basic language evolution bottleneck, and then languages find a balance, with Kirby and colleagues providing simulations to demonstrate it... or at least to show how compositionality results from these two pressures. But I’d like to think more about how that relates to word entropy. Compositionality = build larger things out of a finite number of combinable pieces. Word entropy = ...what happens when you have that kind of system? But the interesting thing is how little variation there is, so it’s about a very narrow range of entropy resulting from this kind of system. So does any compositional system end up producing this range? (My sense is no, but I don’t know for sure.) If not, then we may have some interesting constraints on what kind of compositional system human languages end up producing.


(2) It’s interesting that orthographic words have been “vindicated” as reasonable units of analysis for describing regularities in language. Certainly there’s a big to-do in the developmental literature about words as a target of early speech segmentation (where the general consensus is “not really”).


(3) B&al2017 note that morphological complexity impacts unigram entropy, which makes sense: more complex words = more word types. Does this mean that for morphologically complex languages (e.g., agglutinative and polysynthetic), it would make more sense to do morpheme entropy? Or maybe morpheme entropy would be a better baseline, period, for cross-linguistic comparison? (This reminds me of the frequent frames literature in development, where there’s a question about whether the frame units ought to be words or morphemes, and how the child would figure out which to use for her language.)

Tuesday, November 13, 2018

Some thoughts on White et al. 2018

I love seeing syntactic bootstrapping not just as an initial word-learning strategy, but in fact as a continuing source of information (and thus useful for very subtle meaning acquisition). Intuitively, this makes sense since we can learn new words by reading them in context, and as an adult, I think that’s the main way we learn new words. But you don’t see as much work on the acquisition side exploring this idea. Hopefully these behavioral experiments can inform both future cognitive models and future NLP applications.

Other thoughts:

(1) The fact that some verbs have both representational and preferential properties underscores that there’s likely to be a continuum, rather than categorical distinctions. This reminds me of the raising vs control distinction (subject raising: He seemed to laugh; subject control: He wanted to laugh), where there are verbs that seem to allow both syntactic options (e.g., begin: It began to fall (raising) vs. He began to laugh (control)). So, casting the acquisition task as “is this a raising or a control verb?” may actually be an unhelpful idealization — instead of a binary classification, it may be that children are identifying where on the raising-control continuum a verb falls, based on its syntactic usage.

(2) I think what comes out most from the review of semantic and syntactic properties is how everything is about correlations, rather than absolutes. So, we have these semantic and syntactic features, and we have verb classes that involve collections of features with particular values; moreover, there seem to be prototypical examples and less-prototypical examples (where a verb has a bunch of properties, but is exceptional by not having another that usually clumps together with the first bunch). This means we can very reasonably have a way to make generalizations, on the basis of property clusters that verb classes have, but we also allow exceptions (related verb classes of much smaller size, or connections between shared properties of verb classes -- like an overhypothesis when it comes to property distributions). I wonder if a Tolerance-Principle-style analysis would predict which property clusters people (adults or children) would view as productive, on the basis of their input frequency and specific proposals about the underlying verb class structure.
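For reference, the Tolerance Principle threshold (Yang’s formulation) that such an analysis would lean on: a rule over N relevant items can remain productive only if its exceptions e stay under N divided by the natural log of N.

```latex
% Yang's Tolerance Principle threshold:
\theta_N = \frac{N}{\ln N}, \qquad \text{rule is productive iff } e \le \theta_N
```

So the prediction would come from counting, for a given property cluster, how many verbs in the child’s input fall under it (N) and how many are exceptions (e).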

(3) Figure 2 is a great visualization for what these verb classes might look like, on the basis of their syntactic frame use. Now, if we could just interpret those first few principal components, we’d have an idea what the high-level properties (= syntactic feature clusters) were... it looks like this is the idea behind the analysis in 3.4.3, where W&al2018 harness the connection between syntactic frames and PCA components.
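Here’s a toy sketch of that kind of analysis (the verbs are ones discussed in the paper, but the frame labels and counts below are entirely made up by me): verbs as rows, syntactic frames as columns, and PCA components whose frame loadings we then try to interpret.

```python
# Toy verb-by-frame PCA: rows are verbs, columns are (hypothetical) syntactic
# frames; the principal components are what we'd then try to interpret as
# high-level syntactic/semantic properties.
import numpy as np
from sklearn.decomposition import PCA

frames = ["NP_V_that_S", "NP_V_to_VP", "NP_V_NP", "NP_was_Ved_that_S"]
verbs = ["think", "want", "know", "hope", "tell", "amaze", "bother"]

# Made-up counts, normalized per verb so rows are frame-usage distributions.
counts = np.array([
    [90,  1,  5,  4],   # think
    [ 2, 80, 15,  3],   # want
    [70,  5, 20,  5],   # know
    [55, 35,  5,  5],   # hope
    [30, 10, 55,  5],   # tell
    [10,  5, 55, 30],   # amaze
    [10,  5, 60, 25],   # bother
], dtype=float)
counts = counts / counts.sum(axis=1, keepdims=True)

pca = PCA(n_components=2)
scores = pca.fit_transform(counts)

for verb, (pc1, pc2) in zip(verbs, scores):
    print(f"{verb:>7}: PC1={pc1:+.2f}  PC2={pc2:+.2f}")
# pca.components_ holds each component's loadings over the frames -- inspecting
# those is the interpretation step described in 3.4.3.
```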

Side note: Very interesting that bother, amaze, and tell clump together. I wouldn’t have put these three together specifically, but that first component clearly predicts them to be doing the same (negative) thing with respect to that component. Of course, Fig 6 gives a more nuanced view of this.

Also, I love that W&al2018 are able to use their statistical wizardry to interpret their quantitative results and pull out new theoretical proposals for natural semantic classes and the syntactic reflections of these classes. Quantitative theorizing, check!

(4) Hurrah for learning model targets! If we look for features a verb might have as Table 1 does (rather than set classes, where something must be e.g., representational or preferential but not both, which is a problem for hope), then this becomes a nicely-specified acquisition task to model. That is, given children’s input, can verb classes be formed that have each verb connected with its appropriate property cluster? Moreover, with the similarity judgment data, we can even get a sense of what the adult verb classes look like by clustering the verbs on the basis of their similarity (like in Fig 6).

Another learning model check would be to put verbs into classes such that the odd-man-out behavioral results are matched or the similarity judgments are matched. Another would be to put verbs into classes that predict which verb frames they prefer/disprefer.

(5) In the general discussion, we see a concrete proposal for the syntactic and semantic features a learner could track, along with necessary links between the two feature types. I wonder if it’s possible to infer the links (e.g., representational-main clause), rather than build them in. This is a version of my standard wonder: “If you think the learner needs explicit knowledge X, is it possible to derive X from more foundational or general-purpose building blocks?”

(6) Typo sadness: That copyediting typo with “Neither 1 nor 1…” in the introduction was tough. It took me a bit to work through the intended meaning, given examples 3-5, but I figured the point was that think doesn’t entail its complement while know does, whether they’re positive uses or negated uses. Unfortunately, this typo issue seems to be an issue throughout the first chunk of the paper and in the methods section, where the in-text example numbering got 1-ed out. :(

Tuesday, October 23, 2018

Some thoughts on Gauthier et al. 2018

I love seeing examples of joint learning because not only do joint learning models tend to do better than sequential models, but joint learning also seems to be the best fit to how real children learn (language) things. [I remember a more senior colleague who works on a variety of acquisition processes that happen during infancy and toddler-hood saying something like the following: “I used to think babies first learned how to segment words, then learned their language-specific sound categories, and then figured out words. I don’t think those babies exist anymore.”] As G&al2018 find, this is because it can be more efficient to learn jointly than sequentially. Why? Because you harness information from “the other thing” when you’re learning jointly, while you just ignore that information if you’re learning sequentially. I think a real hurdle in the past has been how to mathematically define joint learning models so the math is solvable with current techniques. Happily (at least when it comes to making modeling progress), that seems like a hurdle that’s being surmounted.


It’s also great to see models being evaluated against observable child behavior, rather than a target linguistic knowledge state that we can’t observe. It’s much easier to defend why your model is matching behavior (answer: because it’s what we see children doing -- even if it’s only a qualitative match, like what we see here) than it is to defend why your model is aiming for a specific target theoretically-motivated knowledge state instead of some other equally plausible theoretically-motivated target knowledge state.


What’s exciting about the results is how much you don’t need to build in to get the performance jump. You have to build in the possibility of connections between certain pieces of information in the overhypothesis (e.g., syntactic type to attribute), but not the explicit content of those connections (what the probabilities are). So, stepping back, this supports prior knowledge that focuses your attention on certain building blocks (i.e., “look at these connections”), but doesn’t explicitly have to define the exact form built from those blocks. That’s what you as a child learn to do, based on your input. To me, this is the way forward for generative theorizing about what’s in Universal Grammar.


Other specific thoughts:
(1) It’s nice to see the mention of Abend et al. 2017 -- that’s a paper I recently ran across that did an impressive job of jointly learning word meaning and syntactic structure. It looks like G&al2018 use the CCG formalism too, which is very interesting as CCG has a couple of core building blocks that are used to generate a lot of possible language structure. This is similar in spirit to Minimalism (few building blocks, lots of generative capacity), but CCG now has these acquisition models associated with it that explain how learning could work while Minimalism doesn’t yet.


(2) Given the ages in the Smith et al. 1992 study (2;11-3;9), it’s interesting that G&al2018 are focusing on the meaning of the prenominal adjective position. While this seems perfectly reasonable to start with, I could also imagine that children of this age have something like syntactic categories, and so it’s not just the prenominal adjective position that has some meaning connection, but adjectives in general that have some meaning connection. It’d be handy to know the distribution of meanings for adjectives in general, and use that in addition to the more specific positional information of prenominal adjective meaning. (It seems like this might be closer to what 3-year-olds are using.) Maybe the idea is that this is a model of how those categories form in the first place, and then we see the results of it in the three-year-olds?


(3) In the reference game, I wonder if the 3D nature of the environment matters. Given the properties of interest (shape, color), it seems like the same investigation could be accomplished with a simple list of potential referents and their properties (color, shape, material, size). Maybe this is for ease of extension later on, where perceptual properties of the objects (e.g., distance, contrast) might impact learner inferences about an intended referent?


(4) Marr’s levels check: This seems to be a computational-level (=~rational) model when it comes to inference (using optimal algorithms of various kinds for lexicon induction), yet it also incorporates incremental learning -- which makes it feel more like an algorithmic-level (=~process) model. Typically, I think about rational vs. process models as answering different kinds of acquisition questions. Rational = “Is it possible for a learner to accomplish this acquisition task, given this input, these abilities, and this desired output?”; process = “Is it possible for a child to accomplish this acquisition task, given this input, known child abilities, known child limitations (both cognitive and learning-time wise), and this desired output?” This model starts to incorporate at least one known limitation of child learning -- they see and learn from data incrementally, rather than being able to hold all the data at once in mind for analysis.


(5) If I’m interpreting Figure 4 correctly, I think s|t (syntactic type given abstract type, e.g., adjective given color) would correspond to a sort of inverse syntactic bootstrapping (traditional syntactic bootstrapping: the linguistic context provides the cue to word meaning). Here, the attribute of color, for example, gives you the syntactic type of adjective. Then, w|v (word form given attribute value, e.g., “blue” given the blue color) corresponds to a more standard idea of a basic lexicon that consists just of word-form-to-referent mappings?
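To make those two links concrete, here’s a toy rendering of my reading of Figure 4 (the structure is my guess, and every name and number below is invented rather than taken from G&al2018).

```python
# p(syntactic type | abstract type): an "inverse syntactic bootstrapping" link,
# where the abstract type (e.g., COLOR) favors a syntactic type (e.g., ADJ).
p_syntype_given_type = {
    "COLOR": {"ADJ": 0.9, "N": 0.1},
    "SHAPE": {"ADJ": 0.2, "N": 0.8},
}

# p(word form | attribute value): the basic form-to-referent lexicon,
# e.g., the blue color value is usually expressed as "blue".
p_word_given_value = {
    "blue-value":   {"blue": 0.95, "azure": 0.05},
    "square-value": {"square": 0.9, "block": 0.1},
}

# Most likely syntactic type for a COLOR concept under this toy setup:
color_dist = p_syntype_given_type["COLOR"]
print(max(color_dist, key=color_dist.get))  # "ADJ"
```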


(6) As proof of concept, I definitely understand starting with a synthetic group of referring expressions. But maybe a next step is to use naturalistic distributions of color+shape combinations? The corpus data used in the initial corpus analysis seem like a good reference distribution.


(7) Figure 5a (and also shown in 5b): It seems like the biggest difference is that the overhypothesis model jumps up to higher performance more quickly (though the base model catches up after not too many more examples). It’s striking how much can be learned after only 50 examples or so -- this super-fast learning highlights why this is more a rational (computational-level) model than a process (algorithmic-level) one. It’s unlikely children can do the same thing after 50 examples.

Tuesday, October 9, 2018

Some thoughts on Linzen & Oseki 2018

I really appreciate L&O2018’s focus on the replicability of linguistic judgments in non-English languages (and especially their calm tone about it). I think the situation of potentially unreliable judgments emerging during review highlights the utility of something like registered reports, even for theoretical researchers. If someone finds out during the planning stage that the contrasts they thought were so robust actually aren’t, this may help avoid wasted time building theories to account for the data in question (or perhaps bring in considerations of language variation). [Side note: I have especial feeling for this issue, having struggled with an author’s judgments about allowed vs. unallowed interpretations in many a semantics seminar paper in graduate school.]

In theory, aspects of the peer review process are supposed to help cover this, but as L&O2018 note in section 4.1, this is harder for non-English languages. To help with this, L&O2018 suggest the open review system in section 4.2, with the crowdsourced database of published acceptability judgments, which sounds incredible. Someone should totally fund the construction of that. As L&O2018 note, this will be especially helpful for less-studied languages that have fewer native speakers.

I’m also completely with L&O2018 on focusing on judgments that aren’t self-evident -- but then, who makes the call about what’s self-evident and what’s not? Is it about the subjective confidence of the individual (what’s “obvious to any native speaker”, as noted in section 4)? And if so, what if an individual finds something self-evident, but it’s actually a legitimate point of variation that this individual isn’t aware of, and so another individual wouldn’t view it as self-evident? I guess this is part of what L&O2018 set out to prove, i.e., that a trained linguist has good subjective confidence about self-evidentiality? Section 2.2 covers this, with the three-way classification. But even still, I wonder about the facts that are theoretically presupposed because they’re self-evident vs. theoretically meaningful because they’re not. It’d be great if there were some objective, measurable signal that distinguished them, aside from the acceptability judgment replications, of course (since the whole point of having such a signal would be to focus replications on the ones that weren’t self-evident). Mahowald et al. (2016)’s approach of unanimous judgments from 7 people on 7 variants of the data point in question seems like one way to do this -- basically, it’s a mini-acceptability judgment replication. And it does seem more doable, especially with the crowd-sourced judgment platform L&O2018 advocate.

One more thought: L&O2018 make a striking point about the importance of relative acceptability and how acceptability isn’t the same as grammaticality, since raw acceptability values can differ so widely for “grammatical” and “ungrammatical” items. For example, if an ungrammatical item has a high acceptability score (e.g., H8’s starred version had a mean score of 6.06 out of 7), and no obvious dialectal variation, how do we interpret that? L&O2018 reasonably hypothesize that this means it’s not actually ungrammatical. But then, is ungrammaticality just about falling below some threshold of acceptability? That is, is low acceptability necessary for (or highly correlated with) ungrammaticality?

Friday, May 11, 2018

Some thoughts on Johnson 2017 + Perfors 2017

I love seeing connections to Marr’s levels of description, because this framework is one that I’ve found so helpful for thinking about a variety of problems I work on in language development. Related to this, it was interesting to see Johnson suggest that grammars are computational-level while comprehension and production are algorithmic-level, because comprehension and production are processes operating over these grammar structures. But couldn’t we also apply levels of description just to the grammar knowledge itself? So, for instance, computational-level descriptions provide a grammar structure (or a way to generate that structure using things like Merge), say for some utterance. Then, the algorithmic-level description describes how humans generate that utterance structure in real time with their cognitive limitations (irrespective of whether they’re comprehending, producing, or learning). Then, the implementational-level description is the neural matter that implements the language structure in real time with cognitive and wetware limitations (again, irrespective of whether someone is comprehending, producing, or learning).

Other thoughts:
(1) One major point Johnson makes: a small change at the computational level can have a big impact at the implementational level. This is basically saying that a small change in building blocks can have a big impact on what you can build, which is the idea behind parameters, especially linguistic parameters. It’s also the idea behind how the brain constructs the mind, with small neurological changes having big cognitive effects (for example, brain lesions).

But, importantly for Johnson and Perfors, implementational-level complexity may matter more for evolutionary plausibility. In particular, the systems needed to support the implementation may be quite different, and that connects to evolutionary plausibility. Because of this, arguing for or against something on the basis of its computational-level simplicity may not be useful because we don’t really know how the computational-level description gets implemented (in the neural matter, let alone the genome that constructs that neural matter). If it turns out the genes encode some kind of computational-level description, then we have a link we can exploit for discussing evolutionary plausibility. Otherwise, it’s not obvious how much evolutionary-plausibility-mileage we get out of something being simple at the computational level of description. So, the level at which simplicity is relevant for evolutionary arguments is the genetic level, since that’s the part that connects most directly to evolutionary arguments. (Though perhaps there’s also a place for “simple” to be about how easy it is to derive from cultural evolution?)

(2) From Johnson 2017: “...perhaps computational descriptions are best understood as scientific theories about cognitive systems?” While I understand where Johnson is coming from (given his focus on evolutionary explanations), I don’t think I agree with this idea of connecting “computational description” with “scientific theories”. A computational description is a description at the level of “the goals of this computation”. We can have scientific theories about that, but we can also have scientific theories about “how this computation is implemented in the wetware” (i.e., the implementational level of description).  So, to me, “level of description” is a separate thing from “scientific theory” (and usefully so).