Tuesday, January 22, 2019

Some thoughts on Hahn et al. 2018

It’s really cool to see how adding processing considerations to an idealized (i.e., rational) model yields observable behavior. It reminds me of the importance of the different Marr explanation levels, where the algorithmic level is where processing considerations often get added (since these affect the algorithm humans use). A lot of work we’ve read about so far has been at the computational level (where, for example, the Rational Speech Act model typically lives). But in the back of my mind, I’m always thinking about what key differences might emerge once we have the bottleneck of human cognitive constraints.

Some other thoughts:
(1) Introduction, “As they occur in languages with widely different grammatical structures, we can expect that such an explanation will make reference to general principles of human communication and cognition” - I’m completely sympathetic to this approach, though it strikes me as funny that this is the same empirical fact that generativists use to appeal to innate, language-specific mechanisms (i.e., Universal Grammar). That is, the appearance of a pattern like this across the world’s languages is a signal to generativists that a universal language-specific principle is at work. Of course, as Hahn et al. (2018) note, it could well be that the universal principle has an effect on language (here, as adjective ordering constraints) but in fact the principle itself could be domain-general (e.g., something having to do with memory limitations, etc.).

(2) The Function of Subjective Adjectives:  I love seeing how to operationalize intuitions formally -- this is a great example. We have a somewhat squishy notion of subjectivity that gets formalized as judgments whose truth is relative to individuals, which subsequently gets implemented as the listener inferring the speaker’s judgment.

(3) A Model of Adjective Use, where the listener infers a full world state that includes multiple people: This seems equivalent to inferring that the adjective is in fact subjective. Developmentally, that’s definitely a step kids have to figure out, i.e., is adjective A likely to be something everyone agrees on or not?

(4) Communication: Rational Listeners and Speakers, an RSA model with just L0 and S. So this is just a model of why a speaker chooses to say something, rather than how a pragmatic listener (L1) chooses to interpret it? I wonder why we stop at this level rather than going another layer to a pragmatic speaker (S1) who chooses to say something, based on how a pragmatic listener will interpret it. That’s what we have to do when modeling Truth Value Judgment Tasks (TVJTs), for example. Maybe that’s because TVJTs aren’t normal speech events, but instead involve participants judging what they themselves would say?

(5) Communication: Rational Listeners and Speakers, where a speaker’s utility function is adjusted (from just basic negative surprisal) because she realizes people’s judgments may differ from one another: This part where the expected utility isn’t just negative surprisal may be a developmental step kids would have to complete. That is, if kids realize adjectives can be subjective and other people may disagree, then they’d behave like what’s modeled here. On the other hand, if kids don’t realize adjectives can be subjective, they may simply go with negative surprisal.

(6) Communication: Rational Listeners and Speakers, where the cost is the surprise of utterance u across the community’s language use: This is interesting too -- usually we see cost having to do with individual production costs such as longer utterances being more costly than shorter ones. But of course here, all the utterances are the same length. Instead, what could differ is the frequency of that combination. This seems like a useful aspect to incorporate into speaker models more generally, since frequency in the input can certainly affect ease of production.

(7) Adding Noise: I love seeing how this explanation works, with words further back being more likely to be deleted before the whole phrase can be interpreted. It’s nicely intuitive that more subjective words would be preferred further back, since they lead to more disagreement across listeners. But I wonder how this story would work for languages where the adjectives come after the noun. In that case, the more subjective adjective is still further away, but this time it’s the one the listener would have heard most recently. So, I think that means the one further in the past would be deleted more often -- in this case, the less subjective one -- and the more subjective one would be likely to survive. And then it becomes weird, because now we get the reverse situation, where the surviving adjective is the one that listeners don’t agree on as much. It seems like this account would predict languages with adjectives occurring after the noun to have the more subjective ones closer to the noun, since they’d be more likely to be forgotten. But that’s not what we see.

There’s a specific note about this in the discussion: “Our account seems to make the correct prediction. In such languages, the noun is more likely to be lost when the second (subjective, in this case) adjective is reached.” -- So is the idea that the listener is just left with the two adjectives and no noun? Why does that lead to the correct order of noun-less_subjective_adj-more_subjective_adj, from a communicative standpoint?
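
To make thoughts (4) through (6) a bit more concrete, here is a minimal sketch of an RSA speaker who reasons only about a literal listener (L0), with cost defined as the surprisal of the utterance under community usage. Everything specific in it is an assumption for illustration: the two-object world, the toy lexicon, the usage probabilities, and the rationality parameter alpha are made up, and none of it is taken from Hahn et al. (2018). It also leaves out the parts of their model that handle subjectivity and memory noise, so the utility here is just the basic negative surprisal minus cost.

```python
import math

# Hypothetical toy world: two objects the speaker might want to refer to.
STATES = ["big_red_box", "small_red_box"]

# Assumed literal semantics: the set of states each utterance is true of.
LEXICON = {
    "red box": {"big_red_box", "small_red_box"},
    "big box": {"big_red_box"},
    "small box": {"small_red_box"},
}

# Assumed community usage probabilities (invented numbers).
USAGE_PROB = {"red box": 0.6, "big box": 0.3, "small box": 0.1}


def literal_listener(utterance):
    """L0: uniform over the states the utterance is literally true of."""
    true_states = LEXICON[utterance]
    return {s: (1.0 / len(true_states) if s in true_states else 0.0) for s in STATES}


def cost(utterance):
    """Cost as surprisal: negative log probability under the assumed usage."""
    return -math.log(USAGE_PROB[utterance])


def speaker(state, alpha=1.0):
    """S: softmax over utility = log L0(state | u) - cost(u)."""
    scores = {}
    for utterance in LEXICON:
        p = literal_listener(utterance)[state]
        if p > 0:
            scores[utterance] = math.exp(alpha * (math.log(p) - cost(utterance)))
        else:
            # An utterance that is false of the state is never chosen.
            scores[utterance] = 0.0
    total = sum(scores.values())
    return {u: v / total for u, v in scores.items()}


print(speaker("small_red_box"))
```

With these invented numbers, the speaker describing the small box prefers the frequent but less informative "red box" (probability 0.75) over the rarer but more informative "small box" (0.25), which is one way to see how a frequency-based cost can pull against informativity.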

Tuesday, December 4, 2018

Some thoughts on Bentz et al. 2017

I really appreciate seeing a clear explanation at the outset about how to cognitively interpret word entropy. The first thing I wonder when I see a cognitive model is what the variables are meant to correspond to in human cognition, and we get that right up front when it comes to discussing entropy (and why we should care about it). Basically, it’s a reflection of a processing cost (where minimizing entropy means minimizing that cost), so we potentially get some explanatory power about why language use looks the way it does, through the lens of entropy.

The main contributions of B&al2017 seem to be about establishing the ground truth about cross-linguistic entropy variation and methodology for assessing entropy -- so before we start worrying about what causes variation in word entropy, let’s first figure out how to assess it really well and then see if there are actually any differences that need explanation. The main finding is that, hey, word entropy doesn’t really differ. Therefore, whatever entropy indexes cognitively also doesn’t differ from language to language... which I think would make sense if this is about general human language processing abilities.

The other main finding is summed up this way: Unigram entropies and entropy rates are related -- in fact, you can predict entropy rate from unigram entropy. Here’s where I start to quibble because the interpretation given here doesn’t help me much: “uncertainty-reduction by co-textual information is approximately linear across the languages of the world.” What does this mean exactly? I don’t know how to contextualize that with respect to language processing. To be fair, I think B&al2017 are clear (in section 5.3) that they don’t know how either: “The exact meaning and implications of these constants are topics for future research.”
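
For reference, unigram word entropy is just the Shannon entropy of the word frequency distribution. Here is a minimal sketch of how one might estimate it from a tokenized text, assuming the simplest plug-in (maximum likelihood) estimator and whitespace tokenization. Both are my assumptions for illustration, the function name and toy text are made up, and the estimation method B&al2017 actually use may well differ. Entropy rate, which conditions on the preceding co-text, is not computed here.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Plug-in estimate of unigram word entropy, in bits per word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # H = -sum over word types of p(w) * log2 p(w), with p(w) = count / total.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy text, split on whitespace.
toy_text = "the dog saw the cat and the cat saw the dog".split()
print(round(unigram_entropy(toy_text), 3))
```

A text that repeats one word has entropy 0, and a text where four word types are equally frequent has entropy 2 bits, which gives a feel for the scale.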

Other thoughts:

(1) B&al2017 note that they’ll discuss how the word entropy facts (i.e., the consistency across human languages) result from a trade-off between word learnability and word expressivity. In 6.1, they give us a bit of a sketch, which is nice -- basically this:

unlimited entropy = unlimited expressivity = unpredictable = hard to learn
minimum entropy = no expressivity = hard to communicate

This is the basic language evolution bottleneck, and then languages find a balance, with Kirby and colleagues providing simulations to prove it...or at least how compositionality results from these two pressures. But I’d like to think more about how that relates to word entropy. Compositionality = build larger things out of a finite number of combinable pieces. Word entropy = ...what happens when you have that kind of system? But the interesting thing is how little variation there is, so it’s about a very narrow range of entropy resulting from this kind of system. So does any compositional system end up producing this range? (My sense is no, but I don’t know for sure.) If not, then we may have some interesting constraints on what kind of compositional system human languages end up producing.

(2) It’s interesting that orthographic words have been “vindicated” as reasonable units of analysis for describing regularities in language. Certainly there’s a big to-do in the developmental literature about words as a target of early speech segmentation (where the general consensus is “not really”).

(3) B&al2017 note that morphological complexity impacts unigram entropy, which makes sense: more complex words = more word types. Does this mean that for morphologically complex languages (e.g., agglutinative and polysynthetic), it would make more sense to do morpheme entropy? Or maybe morpheme entropy would be a better baseline, period, for cross-linguistic comparison? (This reminds me of the frequent frames literature in development, where there’s a question about whether the frame units ought to be words or morphemes, and how the child would figure out which to use for her language.)

Tuesday, November 13, 2018

Some thoughts on White et al. 2018

I love seeing syntactic bootstrapping not just as an initial word-learning strategy, but in fact as a continuing source of information (and thus useful for very subtle meaning acquisition). Intuitively, this makes sense since we can learn new words by reading them in context, and as an adult, I think that’s the main way we learn new words. But you don’t see as much work on the acquisition side exploring this idea. Hopefully these behavioral experiments can inform both future cognitive models and future NLP applications.

Other thoughts:

(1) The fact that some verbs have both representational and preferential properties underscores that there’s likely to be a continuum, rather than categorical distinctions. This reminds me of the raising vs control distinction (subject raising: He seemed to laugh; subject control: He wanted to laugh), where there are verbs that seem to allow both syntactic options (e.g., begin: It began to fall (raising) vs. He began to laugh (control)). So, casting the acquisition task as “is this a raising or a control verb?” may actually be an unhelpful idealization — instead of a binary classification, it may be that children are identifying where on the raising-control continuum a verb falls, based on its syntactic usage.

(2) I think what comes out most from the review of semantic and syntactic properties is how everything is about correlations, rather than absolutes. So, we have these semantic and syntactic features, and we have verb classes that involve collections of features with particular values; moreover, there seem to be prototypical examples and less-prototypical examples (where a verb has a bunch of properties, but is exceptional by not having another that usually clumps together with the first bunch). This means we can very reasonably have a way to make generalizations, on the basis of property clusters that verb classes have, but we also allow exceptions (related verb classes of much smaller size, or connections between shared properties of verb classes— like an overhypothesis when it comes to property distributions). I wonder if a Tolerance Principle style analysis would predict which property clusters people (adults or children) would view as productive, on the basis of their input frequency and specific proposals about the underlying verb class structure.

(3) Figure 2 is a great visualization for what these verb classes might look like, on the basis of their syntactic frame use. Now, if we could just interpret those first few principal components, we’d have an idea what the high-level properties (=syntactic feature clusters) were…it looks like this is the idea behind the analysis in 3.4.3, where W&al2018 harness the connection between syntactic frames and PCA components.

Side note: Very interesting that bother, amaze, and tell clump together. I wouldn’t have put these three together specifically, but that first component clearly predicts them to be doing the same (negative) thing with respect to that component. Of course, Fig 6 gives a more nuanced view of this.

Also, I love that W&al2018 are able to use their statistical wizardry to interpret their quantitative results and pull out new theoretical proposals for natural semantic classes and the syntactic reflections of these classes. Quantitative theorizing, check!

(4) Hurrah for learning model targets! If we look for features a verb might have as Table 1 does (rather than set classes, where something must be e.g., representational or preferential but not both, which is a problem for hope), then this becomes a nicely-specified acquisition task to model. That is, given children’s input, can verb classes be formed that have each verb connected with its appropriate property cluster? Moreover, with the similarity judgment data, we can even get a sense of what the adult verb classes look like by clustering the verbs on the basis of their similarity (like in Fig 6).

Another learning model check would be to put verbs into classes such that the odd man out behavioral results are matched or the similarity judgments are matched. Another would be to put verbs into classes that predict which verb frames they prefer/disprefer.

(5) In the general discussion, we see a concrete proposal for the syntactic and semantic features a learner could track, along with necessary links between the two feature types. I wonder if it’s possible to infer the links (e.g., representational-main clause), rather than build them in. This is a version of my standard wonder: “If you think the learner needs explicit knowledge X, is it possible to derive X from more foundational or general-purpose building blocks?”

(6) Typo sadness: That copyediting typo with “Neither 1 nor 1…” in the introduction was tough. It took me a bit to work through the intended meaning, given examples 3-5, but I figured the point was that think doesn’t entail its complement while know does, whether they’re positive uses or negated uses. Unfortunately, this typo issue seems to be an issue throughout the first chunk of the paper and in the methods section, where the in-text example numbering got 1-ed out. :(
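
Going back to thought (3), here is a rough sketch of what “verbs as points in syntactic-frame space, summarized by a principal component” could look like. It is only a generic illustration of PCA and not W&al2018’s actual analysis: the three frames, the five verbs, and all the ratings are invented, and the first component is approximated with plain power iteration so the example needs nothing beyond the standard library.

```python
import math

# Hypothetical frames and invented acceptability ratings (rows: verbs).
FRAMES = ["NP V that S", "NP V to VP", "NP V NP that S"]
RATINGS = {
    "think": [6.5, 2.0, 1.5],
    "know": [6.8, 3.0, 1.8],
    "want": [1.5, 6.7, 1.2],
    "hope": [5.5, 6.0, 1.4],
    "tell": [2.0, 2.5, 6.6],
}


def center_columns(rows):
    """Subtract each frame's mean rating so components capture variation across verbs."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_component(rows, iterations=500):
    """Approximate the first principal component by power iteration on the covariance matrix."""
    centered = center_columns(rows)
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0 / (i + 1) for i in range(d)]  # arbitrary nonzero starting vector
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return vec, centered


def component_scores(ratings):
    """Project each verb onto the first component (the sign of the component is arbitrary)."""
    verbs = list(ratings)
    component, centered = first_component([ratings[v] for v in verbs])
    return {v: sum(x * w for x, w in zip(row, component))
            for v, row in zip(verbs, centered)}


for verb, score in component_scores(RATINGS).items():
    print(verb, round(score, 2))
```

Interpreting a component then amounts to looking at which frames get large loadings, and verbs that land on the same side of a component (like bother, amaze, and tell in the paper’s figure) are being treated as similar with respect to those frames.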

Tuesday, October 23, 2018

Some thoughts on Gauthier et al. 2018

I love seeing examples of joint learning because not only do joint learning models tend to do better than sequential models, but joint learning also seems to be the best fit to how real children learn (language) things. [I remember a more senior colleague who works on a variety of acquisition processes that happen during infancy and toddler-hood saying something like the following: “I used to think babies first learned how to segment words, then learned their language-specific sound categories, and then figured out words. I don’t think those babies exist anymore.”] As G&al2018 find, this is because it can be more efficient to learn jointly than sequentially. Why? Because you harness information from “the other thing” when you’re learning jointly, while you just ignore that information if you’re learning sequentially. I think a real hurdle in the past has been how to mathematically define joint learning models so the math is solvable with current techniques. Happily (at least when it comes to making modeling progress), that seems like a hurdle that’s being surmounted.

It’s also great to see models being evaluated against observable child behavior, rather than a target linguistic knowledge state that we can’t observe. It’s much easier to defend why your model is matching behavior (answer: because it’s what we see children doing -- even if it’s only a qualitative match, like what we see here) than it is to defend why your model is aiming for a specific target theoretically-motivated knowledge state instead of some other equally plausible theoretically-motivated target knowledge state.

What’s exciting about the results is how much you don’t need to build in to get the performance jump. You have to build in the possibility of connections between certain pieces of information in the overhypothesis (e.g., syntactic type to attribute), but not the explicit content of those connections (what the probabilities are). So, stepping back, this supports prior knowledge that focuses your attention on certain building blocks (i.e., “look at these connections”), but doesn’t explicitly have to define the exact form built from those blocks. That’s what you as a child learn to do, based on your input. To me, this is the way forward for generative theorizing about what’s in Universal Grammar.

Other specific thoughts:
(1) It’s nice to see the mention of Abend et al. 2017 -- that’s a paper I recently ran across that did an impressive job of jointly learning word meaning and syntactic structure. It looks like G&al2018 use the CCG formalism too, which is very interesting as CCG has a couple of core building blocks that are used to generate a lot of possible language structure. This is similar in spirit to Minimalism (few building blocks, lots of generative capacity), but CCG now has these acquisition models associated with it that explain how learning could work while Minimalism doesn’t yet.

(2) Given the ages in the Smith et al. 1992 study (2;11-3;9), it’s interesting that G&al2018 are focusing on the meaning of the prenominal adjective position. While this seems perfectly reasonable to start with, I could also imagine that children of this age have something like syntactic categories, and so it’s not just the prenominal adjective position that has some meaning connection, but adjectives in general that have some meaning connection. It’d be handy to know the distribution of meanings for adjectives in general, and use that in addition to the more specific positional information of prenominal adjective meaning. (It seems like this might be closer to what 3-year-olds are using.) Maybe the idea is that this is a model of how those categories form in the first place, and then we see the results of it in the three-year-olds?

(3) In the reference game, I wonder if the 3D nature of the environment matters. Given the properties of interest (shape, color), it seems like the same investigation could be accomplished with a simple list of potential referents and their properties (color, shape, material, size). Maybe this is for ease of extension later on, where perceptual properties of the objects (e.g., distance, contrast) might impact learner inferences about an intended referent?

(4) Marr’s levels check: This seems to be a computational-level (=~rational) model when it comes to inference (using optimal algorithms of various kinds for lexicon induction), yet it also incorporates incremental learning -- which makes it feel more like an algorithmic-level (=~process) model. Typically, I think about rational vs. process models as answering different kinds of acquisition questions. Rational = “Is it possible for a learner to accomplish this acquisition task, given this input, these abilities, and this desired output?”; process = “Is it possible for a child to accomplish this acquisition task, given this input, known child abilities, known child limitations (both cognitive and learning-time wise), and this desired output?” This model starts to incorporate at least one known limitation of child learning -- they see and learn from data incrementally, rather than being able to hold all the data at once in mind for analysis.

(5) If I’m interpreting Figure 4 correctly, I think s|t (syntactic type given abstract type, e.g., adjective given color) would correspond to a sort of inverse syntactic bootstrapping (traditional syntactic bootstrapping: the linguistic context provides the cue to word meaning). Here, the attribute of color, for example, gives you the syntactic type of adjective. Then, w|v (word form, given attribute value, e.g., “blue” given the blue color) corresponds to a more standard idea of a basic lexicon that consists just of word-form-to-word referent mappings?

(6) As proof of concept, I definitely understand starting with a synthetic group of referring expressions. But maybe a next step is to use naturalistic distributions of color+shape combinations? The corpus data used in the initial corpus analysis seem like a good reference distribution.

(7) Figure 5a (and also shown in 5b): It seems like the biggest difference is that the overhypothesis model jumps up to higher performance more quickly (though the base model catches up after not too many more examples). It’s striking how much can be learned after only 50 examples or so -- this super-fast learning highlights why this is more a rational (computational-level) model than a process (algorithmic-level) one. It’s unlikely children can do the same thing after 50 examples.

Tuesday, October 9, 2018

Some thoughts on Linzen & Oseki 2018

I really appreciate L&O2018’s focus on the replicability of linguistic judgments in non-English languages (and especially their calm tone about it). I think the situation of potentially unreliable judgments emerging during review highlights the utility of something like registered reports, even for theoretical researchers. If someone finds out during the planning stage that the contrasts they thought were so robust actually aren’t, this may help avoid wasted time building theories to account for the data in question (or perhaps bring in considerations of language variation). [Side note: I have especial feeling for this issue, having struggled to share an author’s judgments about allowed vs. unallowed interpretations in many a semantics seminar paper in graduate school.]

In theory, aspects of the peer review process are supposed to help cover this, but as L&O2018 note in section 4.1, this is harder for non-English languages. To help with this, L&O2018 suggest the open review system in section 4.2, with the crowdsourced database of published acceptability judgments, which sounds incredible. Someone should totally fund the construction of that. As L&O2018 note, this will be especially helpful for less-studied languages that have fewer native speakers.

I’m also completely with L&O2018 on focusing on judgments that aren’t self-evident - but then, who makes the call about what’s self-evident and what’s not? Is it about the subjective confidence of the individual (what’s “obvious to any native speaker”, as noted in section 4)? And if so, what if an individual finds something self-evident, but it’s actually a legitimate point of variation that this individual isn’t aware of, and so another individual wouldn’t view it as self-evident? I guess this is part of what L&O2018 set out to prove, i.e., that a trained linguist has good subjective confidence about self-evidentiality? Section 2.2 covers this, with the three-way classification. But even still, I wonder about the facts that are theoretically presupposed because they’re self-evident vs. theoretically meaningful because they’re not. It’d be great if there was some objective, measurable signal that distinguished them, aside from the acceptability judgments replications of course (since the whole point of having such a signal would be to focus replications on the ones that weren’t self-evident). Mahowald et al. (2016)’s approach of unanimous judgments from 7 people on 7 variants of the data point in question seems like one way to do this -- basically, it’s a mini-acceptability judgment replication. And it does seem more doable, especially with the crowd-sourced judgment platform L&O2018 advocate.

One more thought: L&O2018 make a striking point about the importance of relative acceptability and how acceptability isn’t the same as grammaticality, since raw acceptability value can differ so widely for “grammatical” and “ungrammatical” items. For example, if an ungrammatical item has a high acceptability score (e.g., H8’s starred version had a mean score of 6.06 out of 7), and no obvious dialectal variation, how do we interpret that? L&O2018 reasonably hypothesize that this means it’s not actually ungrammatical. But then, is ungrammatical just about a threshold of acceptability at some point? That is, is low acceptability necessary for (or highly correlated with) ungrammaticality?

Friday, May 11, 2018

Some thoughts on Johnson 2017 + Perfors 2017

I love seeing connections to Marr’s levels of description, because this framework is one that I’ve found so helpful for thinking about a variety of problems I work on in language development. Related to this, it was interesting to see Johnson suggest that grammars are computational-level while comprehension and production are algorithmic-level, because comprehension and production are processes operating over these grammar structures. But couldn’t we also apply levels of description just to the grammar knowledge itself? So, for instance, computational-level descriptions provide a grammar structure (or a way to generate that structure using things like Merge), say for some utterance. Then, the algorithmic-level description describes how humans generate that utterance structure in real time with their cognitive limitations (irrespective of whether they’re comprehending, producing, or learning). Then, the implementational-level description is the neural matter that implements the language structure in real time with cognitive and wetware limitations (again, irrespective of whether someone is comprehending, producing, or learning).

Other thoughts:
(1) One major point Johnson makes: a small change at the computational level can have a big impact at the implementational level. This is basically saying that a small change in building blocks can have a big impact on what you can build, which is the idea behind parameters, especially linguistic parameters. It’s also the idea behind how the brain constructs the mind, with small neurological changes having big cognitive effects (for example, brain lesions).

But, importantly for Johnson and Perfors, implementational level complexity may matter more for evolutionary plausibility. In particular, the systems needed to support the implementation may be quite different, and that connects to evolutionary plausibility. Because of this, arguing for or against something on the basis of its computational level simplicity may not be useful because we don’t really know how the computational level description gets implemented (in the neural matter, let alone the genome that constructs that neural matter). If it turns out the genes encode some kind of computational level description, then we have a link we can exploit for discussing evolutionary probability. Otherwise, it’s not obvious how much evolutionary-plausibility-mileage we get out of something being simple at the computational level of description. So, the level at which simplicity is relevant for evolutionary arguments is the genetic level, since that’s the part that connects most directly to evolutionary arguments. (Though perhaps there’s also a place for “simple” to be about how easy it is to derive from cultural evolution?)

(2) From Johnson 2017: “...perhaps computational descriptions are best understood as scientific theories about cognitive systems?” While I understand where Johnson is coming from (given his focus on evolutionary explanations), I don’t think I agree with this idea of connecting “computational description” with “scientific theories”. A computational description is a description at the level of “the goals of this computation”. We can have scientific theories about that, but we can also have scientific theories about “how this computation is implemented in the wetware” (i.e., the implementational level of description).  So, to me, “level of description” is a separate thing from “scientific theory” (and usefully so).

Friday, April 27, 2018

Some thoughts on Kirby 2017 + Adger 2017 + Bowling 2017

I love seeing an articulated approach to studying language evolution from a computational perspective, and appreciate that Kirby addressed some of my concerns with this approach head on (whether or not I found the answers satisfactory). Interestingly, I’m not so bothered by the issues that bother Adger. I also quite appreciate Bowling’s point about gene-culture interactions when it comes to explaining the origins of complex “phenotypes” like language.

Other thoughts:

(1) Kirby 2017

(a) Given his focus on cultural heredity, does Kirby fall on the “language evolved to enable communication, not complex thought” side of the spectrum? He seems to want to distinguish his viewpoint from the language-for-communication side, though. “...cultural transmission once certain biological prerequisites are in place”. I guess it depends on what he thinks the biological prerequisites are? His final claim is that it’s linked to self-domestication (which yields tendencies towards signal copying and sharing), so that again seems more on the language-for-communication side.

(b) According to Kirby, language design features all seem to lead to systematicity, which is the ability to be more compactly represented, a la language grammars. This is a pretty key component in language acquisition, where children seem biased to look for systematicity (i.e., generalizations) in the data they encounter, such as language data. This seems like it comes into play when Kirby talks about systematicity arising from pressures of language learning and language use.

Kirby also indicates that only human languages have systematicity, which makes the studies about meaningful combinations in other animal systems (e.g., some primate calls involving social hierarchies, birdsong involving the rearrangement of components) interesting as a comparison. Presumably, Kirby would say the non-human systematicity is very poor compared to human language systematicity?

(c) Iterated learning: Kirby notes that compositionality emerges over time in simulated agents when all individuals are initialized with random form-referent pairs (e.g. Brighton 2002). But what else is going on in these simulations? What external/internal pressures are there to cause any change at all? That seems important.

(d) I thought it was interesting to see the connection to poverty of the stimulus (i.e., the data being underspecified with respect to the hypothesis speakers used to generate those data). In particular, because the data are compatible with multiple hypotheses, learners can land on a hypothesis that’s different than what the speaker used to generate those data and which probably aligns better with the learner’s already-existing internal biases. Then, this learner grows up and becomes a speaker generating data for the next generation, now using the hypothesis that’s better aligned with learner biases to actually generate the data new learners observe. So, ambiguity in the input signal allows the emergence of language structure that’s easy to pick up when the data are ambiguous, precisely because it aligns with learner biases. So, that’s why little kids ended up being so darned good at learning the same language structure from ambiguous data.

(e) I laughed just a little at the reference to Gold (1967) as having anything to say about humans learning language. If there’s anything I learned from Johnson (2004) -- a phenomenal paper -- it’s that whenever you cite the computational learnability results of Gold (1967) as having anything to do with human language acquisition, you’re almost invariably wrong.

Johnson, K. (2004). What does Gold's Theorem show about language acquisition? In Proceedings from the Annual Meeting of the Chicago Linguistic Society (Vol. 40, No. 2, pp. 261-277).

(f) While I appreciate the walk-through of Bayesian reasoning with respect to the likelihood and prior, I cringed just a little at equating the prior with the learner’s biology. It all depends on how you set the model up (and what kind of learning you’re modeling). All “prior” means is prior -- it was in place before the current learning started. That may be because it was there innately (courtesy of biology) or because the learner derived it from prior experience. That said, hats off to Kirby for promoting the Bayesian modeling approach as an example of modeling that you can easily interpret theoretically. I couldn’t agree more with that part.
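
To make the prior/likelihood split concrete, here is a minimal sketch of the posterior calculation. The two hypotheses and all the numbers are made up for illustration (they are not from Kirby's models); the point is only that the prior is whatever was in place before the current data arrived, wherever it came from.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite hypothesis space: posterior is proportional to prior * likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}

# Hypothetical learner with a slight bias toward a regular language.
# Nothing here says whether that bias is innate or learned earlier.
prior = {"regular": 0.6, "irregular": 0.4}

# Fully ambiguous data: both hypotheses explain the data equally well.
ambiguous_likelihood = {"regular": 0.5, "irregular": 0.5}

post = posterior(prior, ambiguous_likelihood)
print(post)
```

When the data are completely ambiguous like this, the posterior just equals the prior, which is the sense in which ambiguous input lets the learner's existing biases decide.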

(g) In terms of interpreting Figure 3, I definitely understand the main point that the size of the prior bias towards regularity (aaaaa languages) doesn’t seem to affect the results at all. But it looks like between 15 and 22% of all languages at the end of learning are this type of language, with near 0% distributed across the other 4 options (aaab, aabb, etc.) Where did the other 78-85% go? Maybe the x axis language instances are samples from the entire population of languages ((a-e)^5), and so the remaining 78-85% is distributed in teeny tiny amounts across all these possibilities? So, therefore, the aaaaa language is the one with a relative majority?
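
Here is a quick count to go with that guess. It assumes (as in my guess above, and not necessarily in Kirby's actual setup) that a language is any length-5 string over the letters a-e; the `pattern` helper is just my own illustrative way of grouping languages by shape.

```python
from collections import Counter
from itertools import product

ALPHABET = "abcde"   # assumed: five possible forms
LENGTH = 5           # assumed: five meanings per language

def pattern(language):
    """Shape of a language as sorted letter counts, e.g. 'aabab' -> (3, 2)."""
    return tuple(sorted(Counter(language).values(), reverse=True))

# Enumerate every possible language under these assumptions.
languages = ["".join(letters) for letters in product(ALPHABET, repeat=LENGTH)]
total = len(languages)

# Count how many languages fall into each shape.
shape_counts = Counter(pattern(language) for language in languages)
for shape, count in sorted(shape_counts.items(), key=lambda item: item[1]):
    print(shape, count, f"{count / total:.2%}")
```

Under those assumptions, fully regular languages (shape `(5,)`) are only 5 of the 3125 possibilities, so ending up at 15-22% would be far above a uniform baseline even if it isn't an absolute majority.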

(h) I really like the point that human languages may take on certain properties not because they’re hard-coded into humans, but because humans have a tendency (no matter how slight/weak) towards them. Any little push can make a difference when you have a bunch of ambiguous data to learn from. (This hooks back into the Poverty of the Stimulus ideas from before, and how that contributes to the emergence of languages that can be learned by children from the data they encounter.) That said, I got a little lost with this statement: “the language faculty may contain domain-specific constraints only if they are weak, and strong constraints only if they are domain general.” I understand about constraints being weak, whether they’re about language (domain-specific) or cognition generally. But where does the distinction between domain-specific vs. domain-general come from, and where do strong constraints come from at all, based on these results? Maybe this has to do with the domain-general simplicity bias Kirby comes back to at the end, which he makes a case for as a strong innate bias that does a lot of work for us?

(i) In terms of iterated learning in the laboratory, I’m always a little skeptical about what to conclude from these studies. Certainly, we can see the amplification of slight biases in the case of transmission/learning bottlenecks. But in terms of where those biases come from, and whether they’re representative of pre-verbal primate biases...I’m not sure how we tell. Especially when we consider artificial language learning, we have to consider what language-specific and domain-general biases adult humans have developed by already knowing a human language. For example, does compositionality emerge in the lab because it has to under those conditions, or because the humans involved already had compositionality in their hypothesis space because of their experience with their native languages? To be fair, Kirby explicitly acknowledges this issue, and then sets up a simulation with a simplicity bias built in that’s capable of generating the behavioral results from humans. But of course, the simplicity bias is expressed with respect to certain structural preferences (concise context-free transducers). How different from compositionality is this structural preference? This comes up again for me when Kirby notes all the different ways a “simplicity” bias can be cashed out linguistically. So, the question of “simple with respect to what?” seems to matter -- that is, how the learner knows to define simplicity a particular way.

What I find more convincing is the comparison with recently-created signed languages like NSL, in terms of the systematicity that emerges. It seems that whatever cognitive bias is at play for the cohorts of NSLers might also be at play in the experimental participants learning the mini-signed languages. Therefore, you can at least get the same outcome from both people who don’t already know a language (NSLers) and people who do (experimental participants).

(2) Adger 2017

(a) I do think it’s very fair for Adger to note the other sociocultural factors that affect transmission of language through a population, such as invasion, intermarriage, etc. This comes back to how all models idealize, and the real question is if they’ve idealized something important away.

(b) Also, Adger makes a fair point about how “cultural” in Kirby’s terms is really more about transmission pressures in a population that can be formally modeled, rather than what a sociologist might naturally refer to as culture and/or cultural transmission.

(c) I’m not sure I agree that the NSLers showing the emergence of regularity so quickly is an argument against iterated learning. It seems to me that this is indeed a valid case of the transmission & learning bottleneck at the heart of Kirby’s iterated learning approach. The fact that it occurs over years in the NSLers instead of over generations doesn’t really matter, I don’t think.

(d) Adger notes that rapid emergence of certain structures involves “specific cognitive structures” that must also be present. I don’t see this as being incompatible with Kirby’s suggestion of a domain-general simplicity bias. That’s a specific cognitive structure, after all, assuming you count biases as structures.

(e) Adger also brings up certain examples of language not being very simple in an obvious way in order to argue against Kirby’s simplicity bias. But to me, the question is always simple with respect to what? Maybe the seemingly convoluted language structure we see is in fact a simple solution given the other constraints at work (available building blocks like a preference for hierarchical structure, frequent data in the input, etc.). It’s also not obvious to me that seeing hierarchical structure appear over and over again is incompatible with Kirby’s proposal for weak biases leading to highly prevalent patterns. That is, why couldn’t a slight bias for hierarchical structure make the hierarchical structure version the simplest answer to a variety of language structure problems?

(3) Bowling 2017

(a) I really like Bowling’s point that there are bidirectional effects between DNA and the environment (e.g., culture). For me, this makes a nice link with the environmental/cultural factors of transmission and learning by individuals that Kirby’s approach highlights and the biological underpinnings of those abilities. For example, could the evolution of language have reinforced a biologically-based bias for simplicity? That is, could the iterated learning process have made individuals with a predisposition for simplicity more prevalent in the human population? That doesn’t seem far-fetched to me.

(b) “...even though Kirby’s Bayesian models falsely separate genes from learning” - This doesn’t seem like a fair characterization to me. All that Bayesian models do is separate out what was there before you started learning from what you’re currently learning. They don’t specify where the previous stuff came from (i.e., genes vs. environment vs. environment+genes, etc.).

Tuesday, March 13, 2018

Some thoughts on Freudenthal et al. 2016

I think it’s always nice to see someone translate a computational-level approach to an algorithmic-level approach. The other main attempt I’ve seen for syntactic categorization is Wang & Mintz (2008) for frequent frames.

Wang, H., & Mintz, T. (2008). A dynamic learning model for categorizing words using frames. BUCLD 32 Proceedings, 525-536.

Here, F&al2016 are embedding a categorization strategy in an online item-based approach to learning word order patterns, and evaluating it against qualitative patterns of observed child knowledge (early noun-ish category knowledge and later verb-ish category knowledge).

An important takeaway seems to be making a qualitative distinction between preceding vs. following context. Interestingly, this is the essence of a frame as well.

Specific comments:

(1) Types vs tokens: It’s interesting to see F&al2016 get mileage by ignoring token frequency. This is a tendency that seems to show up in a variety of learning strategies (e.g., Tolerance Principle decisions about whether to generalize are based on consideration of types rather than tokens, which itself is tied to considerations of memory storage and retrieval: Yang 2005).

Yang, C. (2005). On productivity. Linguistic variation yearbook, 5(1), 265-302.
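
As a concrete illustration of that type-based reasoning, here is a sketch of the Tolerance Principle threshold as I understand it from Yang's work: a rule over N types is productive if its exceptions number at most N / ln N. The function names and the toy verb list are hypothetical, just for illustration.

```python
import math

def tolerance_threshold(n_types):
    """Maximum number of exceptions a rule over n_types items can tolerate: N / ln N."""
    if n_types < 2:
        raise ValueError("need at least 2 types for N / ln N to be defined")
    return n_types / math.log(n_types)

def is_productive(n_types, n_exceptions):
    """A rule generalizes if its exceptions do not exceed the threshold."""
    return n_exceptions <= tolerance_threshold(n_types)

# Token frequency is thrown away: repeated tokens collapse into one type.
tokens = ["walked", "walked", "went", "jumped", "went", "went", "played"]
types = set(tokens)
print(len(tokens), "tokens ->", len(types), "types")

# With 100 verb types, the rule survives 21 exceptions but not 22.
print(tolerance_threshold(100))
print(is_productive(100, 21), is_productive(100, 22))
```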

In the intro, F&al2016 note that their motivation is one of computational cost — they say it’s less work to collect just the word, rather than keep track of both the word and its frequency. I wonder how much of an additional burden that is though. It doesn’t seem like all that much work, and don’t we already track frequencies of so many things anyway?

Also, in the simulation section, F&al2016 say “MOSAIC does not represent duplicate utterances” -- so does this mean MOSAIC already has a type bias built into it? (In this case, at the utterance level.)

(2) The MOSAIC model: I love all the considerations of developmental plausibility this model encodes, which is why it’s so striking that they use orthographically transcribed speech as input. Usually this is verboten for models of early language acquisition (e.g., speech segmentation), because orthographic and phonetic words aren’t the same thing. But here, this comes back to an underlying assumption about the initial knowledge state of the learner they model. In particular, this learner has already learned how to segment speech in an adult-like way. This isn’t a crazy assumption for 12-month-olds, but it’s also a little idealized, given what we know about the persistence of segmentation errors. Still, this assumption is no different from what previous syntactic categorization studies have assumed. What makes it stand out here is the (laudable) focus on developmental plausibility. Future work might explore how robust this learning strategy is to segmentation errors in the input.

(3) Distributed representations: The Redington et al. categorization approach that uses context vectors reminds me strongly of current distributed representations of word meaning (i.e., word embeddings: word2vec, GloVe). Of course, the word embedding approaches aren’t a transparent translation of words into their counts, but the underlying intuition feels similar.
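
To show what I mean by a count-based context vector, here is a toy sketch. It is not the actual setup from Redington et al. or F&al2016: the one-word window on each side, the raw counts, and the three-utterance corpus are all simplifying assumptions of mine. It does keep preceding and following contexts separate, in line with the takeaway above.

```python
from collections import Counter, defaultdict

def context_vectors(utterances):
    """Count, for each word, which words immediately precede and follow it."""
    preceding = defaultdict(Counter)
    following = defaultdict(Counter)
    for utterance in utterances:
        words = utterance.split()
        for i, word in enumerate(words):
            if i > 0:
                preceding[word][words[i - 1]] += 1
            if i < len(words) - 1:
                following[word][words[i + 1]] += 1
    return preceding, following

# Hypothetical toy corpus (already segmented into orthographic words).
toy_corpus = ["the dog runs", "the cat runs", "a dog sleeps"]
prec, foll = context_vectors(toy_corpus)

print(dict(prec["dog"]))   # words seen right before "dog"
print(dict(foll["dog"]))   # words seen right after "dog"
```

Words whose count vectors look alike (here, "dog" and "cat" both follow "the" and precede "runs") are the ones a distributional learner would be tempted to group together.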

(4) Developmental linking: The model basis F&al2016 use for how nouns emerge early as a category is due to the structure of English utterances, coupled with the utterance-final bias of MOSAIC. Does this mean languages with verbs in the final position should have children developing knowledge of the verb category earlier (e.g., Japanese)? If so, I wonder if we see any evidence of this from behavioral or computational work.

(5) Evaluation metrics: I want to make sure I understand the categorization evaluation metric. The model’s classification of a cluster was compared against “the (most common) grammatical class assigned to each word”, but there was also a pairwise metric used that doesn’t actually need to take into account what the cluster’s class is for precision. That is, if you’re using pairwise precision (accuracy) and recall (completeness), you just get all the pairs of words from your cluster and figure out how many are actually truly in the same category -- whatever that category is -- and that’s the number in the numerator. The number in the denominator depends on whether you’re comparing against all the pairs in that cluster (precision) or all the pairs in the true adult category (recall). So, there’s only a need to decide what an individual cluster’s category is (noun or verb or something else entirely) when you’re doing the recall part.
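
Here is a small sketch of the pairwise scores as I understand them, computed over all clusters at once. This is my own reading rather than F&al2016's code, and the words, clusters, and gold labels are invented. In this global variant, neither score requires naming a cluster's category; a per-cluster recall, as described above, would need to pick which adult category to compare against.

```python
from itertools import combinations

def same_group_pairs(groups):
    """All unordered word pairs that share a group, stored in sorted order."""
    pairs = set()
    for group in groups:
        pairs.update(combinations(sorted(group), 2))
    return pairs

def pairwise_scores(clusters, gold):
    """clusters: list of sets of words; gold: dict mapping word -> true category."""
    cluster_pairs = same_group_pairs(clusters)
    gold_groups = {}
    for word, category in gold.items():
        gold_groups.setdefault(category, set()).add(word)
    gold_pairs = same_group_pairs(gold_groups.values())
    hits = cluster_pairs & gold_pairs          # pairs that are right on both counts
    precision = len(hits) / len(cluster_pairs) if cluster_pairs else 0.0
    recall = len(hits) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall

# Hypothetical example: one mixed cluster and one singleton.
clusters = [{"dog", "cat", "run"}, {"jump"}]
gold = {"dog": "noun", "cat": "noun", "run": "verb", "jump": "verb"}
precision, recall = pairwise_scores(clusters, gold)
print(precision, recall)
```

In the toy example, only the (cat, dog) pair is correct, out of three clustered pairs and two gold pairs, so precision is 1/3 and recall is 1/2.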

(6) Model interpretation: In order to understand F&al2016’s concern with the number of links over time (in particular, the problem with there being more links earlier on than later on), it probably would have helped to know more about what those links refer to. I think they’re related to how utterances are generated, with progressively longer versions of an utterance linked word by word. But then, how does that relate to syntactic categorization? A little later, F&al2016 mention these links as something that links nouns together vs. links verbs together, which would then make sense from a syntactic categorization perspective. But this is then different from the original MOSAIC links. Maybe links are what happens when the Redington et al. analysis is done over the progressively longer utterances provided by MOSAIC? So it’s just another way of saying “these words are clustered together based on the clustering threshold defined”.

(7) Free parameters: It’s interesting that they had to change the thresholds for Table 1 vs Table 2. The footnote explains this by saying this allows “a meaningful overall comparison in terms of accuracy and completeness”. But why wouldn’t the original thresholds suffice for that? Maybe this has something to do with the qualitative properties you’re looking for from a threshold? (For instance, the original “frequency” threshold for frequent frames was motivated partly by frames that were salient “enough” to the child. I’m not sure what you’d be looking for in a threshold for this Redington et al. analysis, though. Some sort of similarity saliency?)

Relatedly, where did the Jaccard distance threshold of 0.2 used in Table 3 come from? (Or perhaps, why is a Jaccard threshold of 0.2 equivalent to a rank order threshold of 0.45?)
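
For reference, here is the standard Jaccard distance between two sets, one minus the size of the intersection over the size of the union. Which sets F&al2016 actually compare is an assumption on my part (I use sets of context word types, which fits the type-based approach), and the example words are invented.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; identical sets give 0.0, disjoint sets give 1.0."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)

# Hypothetical sets of words seen immediately before two target words.
contexts_dog = {"the", "a", "my"}
contexts_cat = {"the", "a", "his"}

distance = jaccard_distance(contexts_dog, contexts_cat)
print(distance)
```

Because only set membership matters, this measure ignores token frequency entirely, unlike a rank-order correlation over counts, which may be part of why the two thresholds are hard to compare directly.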

(8) Noun richness analysis: This kind of incremental approach to what words are in a noun category vs. a verb category seems like an interesting hypothesis for what the non-adult noun and verb categories ought to look like. I’d love to test them against child production data from these same corpora using a Yang-style productivity analysis (e.g., Yang 2011).

Yang, C. (2011). A statistical test for grammar. In Proceedings of the 2nd workshop on Cognitive Modeling and Computational Linguistics (pp. 30-38). Association for Computational Linguistics.