Monday, February 29, 2016

Some thoughts on Goldberg & Boyd 2015

I definitely appreciated G&B2015’s clarification of how precisely statistical preemption and categorization are meant to work for learning about a-adjectives (or at least, one concrete implementation of it). In particular, statistical preemption is likened to blocking, which means the learner needs to have an explicit set of alternatives over which to form expectations. For a-adjectives, the relevant alternatives could be something like “the sleeping boy” vs. “the asleep boy”. If both are possible, then “the asleep boy” should appear sometimes (i.e., with some probability). When it doesn’t appear, the learner can infer it’s blocked. Clearly, we could easily implement this with Bayesian inference (or, as G&B2015 point out themselves, with simple error-driven learning), provided we have the right hypothesis space.

For example, H1 = only “the sleeping boy” is allowed, while H2 = “the sleeping boy” and “the asleep boy” are both allowed. H1 will win over H2 in a very short amount of time, as long as children hear lots of non-a-adjective equivalents (like “sleeping”) in this syntactic construction. The real trick is making sure these are the hypotheses under consideration. For example, there seems to be another reasonable way to think about the hypothesis space, based on relative-clause vs. attributive syntactic usage: H1 = “the boy who is asleep”; H2 = “the asleep boy” and “the boy who is asleep”. Here, we really need to see instances of relative-clause usage to drive us towards H1.
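Just to make that concrete for myself, here’s a minimal sketch of the Bayesian version in Python (not G&B2015’s actual model; the counts are made up, and I’m assuming that under H2 each attributive use is equally likely to surface as “sleeping” or “asleep”, which is the simplest possible likelihood):

def posterior_H1(num_sleeping_boy_tokens, prior_H1=0.5):
    # H1 = only "the sleeping boy" is allowed; H2 = both "the sleeping boy" and
    # "the asleep boy" are allowed. Under H2, each observed attributive use could
    # have surfaced as either form, so "the sleeping boy" gets likelihood 0.5 per
    # token; under H1 it gets likelihood 1.0.
    like_H1 = 1.0 ** num_sleeping_boy_tokens
    like_H2 = 0.5 ** num_sleeping_boy_tokens
    unnorm_H1 = prior_H1 * like_H1
    unnorm_H2 = (1 - prior_H1) * like_H2
    return unnorm_H1 / (unnorm_H1 + unnorm_H2)

for n in [0, 1, 5, 10, 20]:
    print(n, round(posterior_H1(n), 5))
# posterior on H1 climbs quickly: 0.5, 0.67, 0.97, 0.999, ~1

The point for me is just how fast this goes once even a handful of relevant tokens come in, which is why the real action is in where the H1/H2 hypothesis space comes from in the first place.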

It makes me think about the more general issue of determining the hypothesis space that statistical preemption (or Bayesian inference, etc.) is supposed to operate over. G&B2015 explicitly note this themselves in the beginning of section 5, and talk more about hypothesis space construction in 5.2. For the a-adjective learning story G&B2015 promote, I would think some sort of recognition of the semantic similarity of words and the syntactic environments is the basis of the hypothesis space generation.

Some other thoughts:
(1) Section 1: I thought it was an interesting point about “afraid” being sucked into the a-adjective class even though it lacks the morphological property (aspectual “a-“ prefix + free morpheme, the way we see with “asleep”, “ablaze”, “alone”, etc.). This is presumably because of the relevant distributional properties categorizing it with the other a-adjectives? (That is, it’s “close enough”, given the other properties it has.)

(2) Section 2: Just as a note about the description of the experimental tasks, I wonder why they didn’t use novel a-adjectives that matched the morphological segmentation properties that the real a-adjectives and alternatives have, i.e., asleep and sleepy, so ablim and blimmy (instead of chammy).

(3) Section 3: G&B2015 note that Yang’s survey of child-directed speech didn’t find a-adjectives being used in relative clauses (i.e., the relevant syntactic distribution cue). So, this is a problem if you think you need to see relative clause usage to learn something about a-adjectives. But, as mentioned above (and also in Yang 2015), I think that’s only one way to learn about them. There are other options, based on semantic equivalents (“sleeping”, “sleepy”, etc. vs. “asleep”) or similarity to other linguistic categories (e.g., the Yang 2015 approach with locative particles).

(4) Section 4: I really appreciate the explicit discussion of how the distributional similarity-based classification would need to work for the locative-particles strategy to pan out (i.e., Table 1). It’s the next logical step once we have Yang’s proposal about using locative particles in the first place.

(5) Section 4: I admit a bit of trepidation about the conclusion that the available distributional evidence for locative particles is insufficient to lump them together with a-adjectives. It’s the sort of thing where we have to remember that children are learning a system of knowledge, and so while the right-type adverb modification may not be a slam dunk for distinguishing a-adjectives from non-a-adjectives, I do wonder if the collection of syntactic distribution properties (e.g., probability of coordination with PPs, etc.) would cause children to lump a-adjectives together with locative particles and prepositional phrases and, importantly, not with non-a-adjectives. Or perhaps, more generally, the distributional information might cause children to just separate out a-adjectives, and note that they have some overlap with locative particles/PPs and also with regular non-a-adjectives. 

Side note: This is the sort of thing ideal learner models are fantastic at telling us: is the information sufficient to draw conclusion x? In this case, the conclusion would be whether a-adjectives pattern with locative particles/PPs or instead with ordinary non-a-adjectives, given the various syntactic distribution cues available. G&B2015 touch on this kind of model at the beginning of section 5.2, mentioning the Perfors et al. 2010 work.


(6) Section 5: I was delighted to see the Hao (2015) study, which gets us the developmental trajectory for a-adjective categorization (or at least, how a-adjectives project onto syntactic distribution). Ten years old is really old for most acquisition stuff. So, this accords with the evidence being pretty scanty (or at least, with children taking a while to recognize that the evidence is there, and then make use of it).

Monday, February 15, 2016

Some thoughts on Yang 2015

Just from a purely organizational standpoint, I really appreciate how explicitly the goals of this paper are laid out (basically, (i) here’s why the other strategy won’t work, and (ii) why this new one does). Also, because of the clarity of the presentation, I’ll be interested to read Goldberg & Boyd's response for next time. Additionally, I greatly enjoyed reading about the application of what I’ve been calling “indirect positive evidence” (Pearl & Mis in press) — that is, things that are present in the input that can be leveraged indirectly to tell you about something else you’re currently trying to learn about (here: leverage distributional cues for locative particles and PPs to learn about a-adjectives). I really do think this is the way to deal with a variety of acquisition problems (and as I’ve mentioned before, it’s the same intuition that underlies both linguistic parameters and Bayesian overhypotheses: Pearl & Lidz 2013). In my opinion, the more we see explicit examples of how indirect positive evidence can work for various language acquisition problems, the better.


Some more specific thoughts:
(1) I found it quite helpful to have the different cues to a-adjectives listed out, in particular that the phonological cue of beginning with a schwa isn’t 100% reliable, while the morphological cue of being split into aspectual “a” (= something like presently occurring?) + root is nearly 100% reliable. It reminds me of the Gagliardi et al. (2012) work on children’s differing sensitivity to available cues when categorizing nouns in Tsez. In particular, Gagliardi et al. found that the model had to be more sensitive to phonological cues than semantic cues in order to match children’s behavior. This possibly has to do with the ability to reliably observe phonological cues as compared to semantic cues. I suspect the fairly diagnostic morphological cue might also be more observable, since it involves recognition of a free morpheme within the a-adjective (e.g., wake in awake).

(2) Related point: the actual trajectory of children’s development with a-adjectives. This is something that seems really relevant for determining which learning strategies children are using (as Yang himself points out, when he notes that all the experiments from Boyd & Goldberg are with adults). Do children make errors and use infrequent non-a-adjectives only predicatively (i.e., they don’t think they can use them attributively)? And on the flip side, do they use some a-adjectives attributively? Knowing about the errors children make (or lack thereof) can help us decide if they’re really learning on a lexical item by lexical item basis, or instead recognizing certain classes of adjectives and therefore able to make generalizations from one class instance to another (or perhaps more likely, at what age they recognize the classes of adjectives). 

Yang quite admirably does a corpus search of naturalistic child productions, which is consistent with children knowing not to use a-adjectives attributively, but it’s not quite the same as behavioral evidence where children definitively show they disallow (or strongly disprefer) the attributive usage.

(3) Indirect negative evidence: One of Yang’s concerns is that this kind of evidence “requires comparing the extensions of the competing hypotheses”. I get the general gist of this, but I think we run into the same problem with all the language hypothesis spaces we set up where the language generated under one parameter value is a subset of the language generated under another. That is, classical approaches like the Subset Principle run into the exact same problem. This is something we always have to deal with, and I think it depends very much on the hypothesis spaces children entertain.

Moreover, on the flip side, how much of a problem is it really? For the concrete example we’re given about the language that includes “the asleep cat” vs. the language that doesn’t, the extensional difference is one utterance (or one category of utterances, if we group them all together under a-adjectives). How computationally hard is this to calculate? Importantly, we really just need to know that the difference is one construction — the rest of the language’s extension doesn’t matter. So it seems like there should be a way to form a hypothesis space exactly like the one described above (P = “the asleep cat” is allowed vs. not-P = “the asleep cat” is not allowed)?

Also, related to the point about how Boyd & Goldberg’s strategy works — does it even matter what other constructions do appear with those adjectives (e.g., “the cat is asleep”)? Isn’t it enough that “the asleep cat” doesn’t? I guess the point is that you want to have appropriate abstract classes like the ones described in section 3.1, i.e., predicative usage = “the cat is asleep”, “the cat is nice”; attributive = *“the asleep cat”, “the nice cat”. This makes the P hypothesis more like “asleep can be used both predicatively and attributively” and the not-P hypothesis more like “asleep can be used only predicatively”. But okay, let’s assume children have enough syntactic knowledge to manage this. Then we go back to the point about how hard it is in practice to deal with hypothesis space extensions. Especially once we add this kind of abstraction in, it doesn’t seem too hard at all, unless I’m missing something (which is always possible).

(4) I personally have a great love for the Tolerance Principle, and I enjoyed seeing its usage here. But, as always, it gets me thinking about the relationship between the Tolerance Principle and Bayesian inference, especially when we have nice hypothesis spaces laid out like we do here. So, here’s my thinking at the moment:

For the Tolerance Principle, we have a setup like this:

Hypotheses:
H1 = the generalization applies to all N items, even though e exceptions exist. 
H2 = there is no generalization, and all N items do their own thing.

Data:
O = items the pattern/rule is observed to apply to
e = exceptional items the pattern/rule should apply to but doesn’t
N - O - e = unobserved items (if any). We can simplify this and just assume all items have been observed to either follow the pattern (and be in O) or not (and be in e), so N - O - e = 0. 

Turning over to Bayesian thinking, let’s assume the priors for H1 and H2 are equal. So, all the work is really done in the likelihood, i.e., P(Hx | data) is proportional to P(Hx) [prior] * P(data | Hx) [likelihood].

Okay, so how do we calculate P(data | H1) vs. P(data | H2)? The data here is O pattern-following items and e exceptions, where N = O + e.

To calculate both likelihoods, we need to know the probability of generating those O pattern-following items and the probability of generating those e exceptions under both H1 and H2. I think this kind of question is where we get into the derivation of the Tolerance Principle, as described by Yang (2005). In particular, there’s an idea that if you have a rule (as in H1), it’s cheaper to store and access the right forms when there are enough items that follow the rule. 

More specifically, it’s some kind of constant cost for those O items (rule application), though the constant cost involves some work because you actually have to do the computation of the rule/pattern over the item. For the e exceptions, there’s some cost of accessing the stored form individually, based on the frequency of the stored items. Importantly, if you have H1 with a rule + exceptions, every time you use the rule, you have to look through the exceptions first and then apply the rule. For H2 where everything is individually stored, you just wander down the list by frequency until you get to the individual item you care about. 

The Tolerance Principle seems to be the result of doing this likelihood calculation, and giving a categorical decision. Instead of spelling out P(data | H1) and P(data | H2) explicitly, Yang (2005) worked out the decision point: if e <= N / ln N, then P(data | H1) is higher (i.e., having the rule is worth it). So, if we wanted to generate the actual likelihood probabilities for H1 and H2, we’d want to plumb the depths of the Tolerance Principle derivation to determine these. And maybe that would be useful for tracking the trajectory of generalization over time, because it’s very possible these probabilities wouldn’t be close to 0 or 1 immediately. (Quick thoughts: P(data | H1) = something like (p_individualaccess)^e * (p_followsrule)^O; P(data | H2) = something like (p_individualaccess)^N).
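Since I went ahead and sketched it, here’s the Tolerance Principle threshold plus my back-of-the-envelope likelihoods in Python (the per-item probabilities p_followsrule and p_individualaccess are placeholders I made up, not anything derived from Yang’s actual storage/access cost model):

import math

def tolerance_threshold(N):
    # Yang's Tolerance Principle: a rule over N items tolerates at most
    # N / ln N exceptions and still counts as productive.
    return N / math.log(N)

def rule_is_productive(N, e):
    return e <= tolerance_threshold(N)

def sketch_likelihoods(O, e, p_followsrule=0.9, p_individualaccess=0.05):
    # The back-of-the-envelope likelihoods from above (placeholder probabilities):
    # H1 = rule plus e stored exceptions; H2 = all N items stored individually.
    p_data_H1 = (p_individualaccess ** e) * (p_followsrule ** O)
    p_data_H2 = p_individualaccess ** (O + e)
    return p_data_H1, p_data_H2

print(tolerance_threshold(100))     # ~21.7, so up to 21 exceptions are tolerated
print(rule_is_productive(100, 15))  # True: keep the rule (H1)
print(sketch_likelihoods(85, 15))   # H1's likelihood dwarfs H2's here

Whether those placeholder probabilities bear any real relationship to the storage and access costs in Yang’s derivation is exactly the part I’d want to plumb further.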


~~~
References:
Gagliardi, A., Feldman, N. H., & Lidz, J. 2012. When suboptimal behavior is optimal and why: Modeling the acquisition of noun classes in Tsez. In Proceedings of the 34th annual conference of the Cognitive Science Society (pp. 360-365).

Pearl, L., & Lidz, J. 2013. Parameters in Language Acquisition. The Cambridge Handbook of Biolinguistics, 129-159.

Pearl, L., & Mis, B. (in press - updated 2/2/15). The role of indirect positive evidence in syntactic acquisition: A look at anaphoric one. Language.


Yang, C. (2005). On productivity. Linguistic variation yearbook, 5(1), 265-302.

Monday, February 1, 2016

Some thoughts on van Schijndel & Elsner 2014

I really like the idea of seeing how far you can get with understanding filler-gap interpretation, given very naive ideas about language structure (i.e., linear w.r.t. verb position, as vS&E2014 do). Even if it’s not this particular shallow representation (and instead maybe a syntactic skeleton like the kind Gutman et al. 2014 talked about), the idea of what a “good enough” representation can do for scaffolding other acquisition processes is something near and dear to my heart.  

One niggling thing — given that vS&E2014 say that this model represents a learner between 15 and 25-30 months, it’s likely the syntactic knowledge is vastly more sophisticated at the end of the learning (i.e., ~25 months). So the assumptions of simplified syntactic input may not be as necessary (or appropriate) later on in development. More generally, this kind of extended modeling timeline makes me want more integration with the kind of acquisition framework of Lidz & Gagliardi (2015), which incorporates developing knowledge into the model’s input & inference.

One other thing I really appreciated in this paper was how much they strove to connect the modeling assumptions and evaluation with developmental trajectory data. We can argue about the implementation of the information those empirical data provide, sure, but at least vS&E2014 are trying to seriously incorporate the known facts so that we can get an informative model.

Other specific thoughts:

(1) At the end of section 3, vS&E2014 say the model “assumes that semantic roles have a one-to-one correspondence with nouns in a sentence”. So…is it surprising that “A and B gorped” is interpreted as “A gorped B” since it’s built into the model to begin with? That is, this misinterpretation is exactly what a one-to-one mapping would predict - A and B don’t get the same role (subject/agent) because only one of them can get the role. Unless I misunderstood what the one-to-one correspondence is doing.

(2) I wasn’t quite sure about this assumption mentioned in section 3: “To handle recursion, this work assumes children treat the final verb in each sentence as the main verb…”. So in the example in Table 1, “Susan said John gave (the) girl (a) book”, “gave” is the “main” verb because…why? Why not just break the sentence up by verbs anyway? (That is, “said” would get positions relative to it and “gave” would get positions relative to it, and they might overlap, but…okay?) Is this assumption maybe doing some other kind of work, like with respect to where gaps tend to be?

(3) If I’m understanding the evaluation in section 5 correctly, it seems that semantic roles commonly associated with subject and object (i.e., agent, patient, etc. depending on the specific verb) are automatically assigned by the model. I think this works for standard transitive and intransitive verbs really well, but I wonder about unaccusatives (fall, melt, freeze, etc.) where the subject is actually the “done-to” thing (i.e., Theme or Patient, so the event is actually affecting that thing). This is something that would be available if you had observable conceptual information (i.e., you could observe the event the utterance refers to and determine the role that participant plays in the event).

Practically speaking, it means the model assigning “theme/patient” to the subject position (preverbal) would be correct for unaccusatives. But I don’t think the current model does this - in fact, if it just uses “subject” and “object” to stand in for thematic/conceptual roles, the “correct” assignment would be the subject NP of unaccusatives as an “object” (Theme/Patient)….which would be counted as incorrect for this model. (Unless the BabySRL corpus that vS&E2014 used labels thematic roles and not just grammatical roles? It was a bit unclear.) I guess the broader issue is the complexity of different predicate types, and the fact that there isn’t a single mapping that works for all of them.  

This came up again for me in section 6 when vS&E2014 compare their results to the competing BabySRL model and they note that when given a NV frame (like with intransitives or unaccusatives), BabySRL labels the lone NP as an “object” 30 or 40% of the time. If the verb is an unaccusative, this would actually be correct (again, assuming “object” maps to “patient” or “theme”).

(4) Section 6: “…these observations suggest that any linear classifier which relies on positioning features will have difficulties modeling filler-gap acquisition” — including the model here? It seemed like the one vS&E2014 used captured the filler-gap interpretations effects they were after, and yet relied on positioning features (relative to the main verb). 


References:
Gutman, A., Dautriche, I., Crabbe, B., & Christophe, A. (2015). Bootstrapping the Syntactic Bootstrapper: Probabilistic Labeling of Prosodic Phrases. Language Acquisition, 22(3), 285-309.

Lidz, J., & Gagliardi, A. (2015). How Nature Meets Nurture: Universal Grammar and Statistical Learning. Annu. Rev. Linguist., 1(1), 333-353.


Monday, January 18, 2016

Some thoughts on Gutman et al. 2014

I’m a big fan of G&al2014’s goal of learning the initial knowledge that gets other acquisition processes started. In this case, it’s about learning the basic elements that allow syntactic bootstrapping to start, which itself allows children to learn more abstract word meanings. In CoLaLab, we’ve been looking at this same idea of useful initial knowledge with respect to speech segmentation and early syntactic categorization.

For G&al2014’s work, I find it interesting that they rely on comparison to adult prosodic categories (specifically VN and NP) — I wonder if there’s a way to determine if the inferred prosodic categories are “good enough” in some sense, beyond matching VN and NP. For example, maybe the inferred categories can be used directly for syntactic bootstrapping, or maybe they can be used to ease language processing in some measurable way. (As a side note, it also took me a moment to realize “syntactic categorization” for G&al2014 referred to prosodic phrase types rather than the typical syntactic categories like “noun” and “verb”. Just goes to show the importance of defining your terms to avoid confusion.)

I’m also a big fan of models that recognize children use a variety of cues very early on, i.e., here, prosody and semantics of a few familiar words, as well as edge sensitivity. Of course, it’s also important to understand the contribution of individual sources of information. But it’s really nice to see a more integrated model like this because it’s likely to be a more accurate simulation of what children are actually doing.

Other thoughts:

(1) I really like how this model shows which property of function words (the fact that they occur at prosodic phrase edges) allows children to learn that function words are really great cues — even before they have an official “function word” category like “determiner”.

(2) It’s interesting that the syntactic skeleton (formed via function words and prosodic boundaries) matches adult structure (NP = an apple) in some cases and not so much in others (he’s eating = VN, which isn’t a VP or an NP - it’s actually a non-constituent). I wonder how the recovery/update process works if you end up with a bunch of VN units - that is, what causes you to switch to VP = V NP and treat “he’s eating” as not-really-a-syntactic-unit in “he’s eating an apple”? 

(3) Section 2, Experiment 1: If units are constructed by looking at the initial word, it’s important that there not be too much variety in that first word (unless we want toddlers to end up with a zillion phrasal units). From the details in 2.2.1, it looks like they use the k most frequent words to define k classes of units, with k ranging from 5 to 70. Presumably, this would be something implicit to the learner, based on the learner's cognitive capacity limitations or some such. I also like that this is relying on the most frequent words, since that seems quite plausible as a way to figure out which phrasal types to notice. Related thought: Is it possible to design a model where k itself is inferred? I’m thinking generative non-parametric Bayesian models, for example.
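As a concrete picture of what I take that clustering step in (3) to be, here’s a Python sketch (just my reading of Experiment 1, with made-up toy phrases, not G&al2014’s actual procedure):

from collections import Counter

def label_phrases_by_initial_word(phrases, k=5):
    # Label each prosodic phrase by its initial word, but only keep the k most
    # frequent initial words as class labels; everything else goes into a
    # catch-all class. 'phrases' is a list of word lists.
    initial_counts = Counter(phrase[0] for phrase in phrases)
    top_k = {word for word, _ in initial_counts.most_common(k)}
    return [phrase[0] if phrase[0] in top_k else "OTHER" for phrase in phrases]

# hypothetical toy input: prosodic phrases as word lists
phrases = [["the", "doggie"], ["he", "is", "eating"], ["an", "apple"],
           ["the", "kitty"], ["you", "see", "it"], ["the", "ball"]]
print(label_phrases_by_initial_word(phrases, k=2))
# e.g., ['the', 'he', 'OTHER', 'the', 'OTHER', 'the']

Inferring k itself would then amount to putting a prior over how many of these initial-word classes to posit, which is where the non-parametric Bayesian machinery would come in.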


(4) I also found it interesting that they used purity as the evaluation measure for a phrasal category, rather than pairwise precision (PWP). I wonder what benefit purity has over PWP, since footnote 7 explicitly notes they’re related. Is purity easier to interpret for some reason? G&al2014 do calculate recall and precision for the best instances of VN and NP, though (and find that the categories are very precise, even with as few as 10 categories).

Wednesday, November 25, 2015

Some thoughts on Morley 2015

I definitely appreciate the detailed thought that went into this paper — Morley uses this deceptively simple case study to highlight how to take complexity in representation and acquisition seriously, and also how to take arguments about Universal Grammar seriously. (Both of these are, of course, near and dear to my heart.) I also loved the appeal to use computational modeling to make linguistic theories explicit. (I am all about that.)

I also liked how she notes the distinction between learning mechanism and hypothesis space constraints in her discussion of how UG might be instantiated — again, something near and dear to my heart. My understanding is that we’ve typically thought about UG as constraints on the hypothesis space (and the particular UG instantiation Morley investigated is this kind of UG constraint). To be fair, I tend to lean this way myself, preferring domain-general mechanisms for navigating the hypothesis space and UG for defining the hypothesis space in some useful way. 

Turning to the particular UG instantiation Morley looks at, I do find it interesting that she contrasts the “UG-delimited H Principle” with the “cycle of language change and language acquisition” (Intro). To me, the latter could definitely have a UG component in either the hypothesis space definition or the learning mechanism. So I guess it goes to show the importance of being particular about the UG claim you’re investigating. If the UG-delimited H Principle isn’t necessary, that just rules out the logical necessity of that type of UG component rather than all UG components. (I feel like this is the same point made to some extent in the Ambridge et al. 2014 and Pearl 2014 discussion about identifying/needing UG.)


Some other thoughts:
(1) Case Study: 

(a)  I love seeing the previous argument for “poverty of the typology implies UG” laid out. Once you see the pieces that lead to the conclusion, it becomes much easier to evaluate each component in its own right.

(b) The hypothetical lexicon items in Table 1 provide a beautiful example of overlapping hypothesis extensions, some of which are in a subset-superset relationship depending on the actual lexical items observed (I’m thinking of the Penultimate grammar vs the other two, given items 1, 3, and 4 or items 1, 2, and 5). Bayesian Size Principle to the rescue (potentially)!

(c) For stress grammars, I definitely agree that some sort of threshold for determining whether a rule should be posited is necessary. I’m fond of Legate & Yang (2013)/Yang (2005)’s Tolerance Principle myself (see Pearl, Ho, & Detrano 2014, 2015 for how we implement it for English stress. Basic idea: this principle provides a concrete threshold for which patterns are the productive ones. Then, the learner can use those to pick the productive grammar from the available hypotheses). I was delighted to see the Tolerance Principle proposal explicitly discussed in section 5.


(2) The Learner

(a) It’s interesting that a distribution over lexical item stress patterns is allowed, which would then imply that a distribution over grammars is allowed (this seems right to me intuitively when you have both productive and non-productive patterns that are predictable). Then, the “core” grammar is simply the one with the highest probability. One sticky thing: Would this predict variability within a single lexical item? (That is, sometimes an item gets the stress contour from grammar 1 and sometimes it gets the one from grammar 2.) If so, that’s a bit weird, except in cases of code-switching between dialects (maybe example: American vs. British pronunciation). But is this what Stochastic OT predicts? It sounds like the other frameworks mentioned could be interpreted this way too. I’m most familiar with Yang’s Variational Learning (VL), but I’m not sure the VL framework has been applied to stress patterns on individual lexical items, and perhaps the sticky issue mentioned above is why?

Following this up with the general learners described, I think that’s sort of what the Variability/Mixture learners would predict, since grammars can just randomly fail to apply to a given lexical item with some probability. This is then a bit funny because these are the only two general learners pursued further. The discarded learners predict different-sized subclasses of lexical items within which a given grammar applies absolutely, and that seems much more plausible to me, given my knowledge of English stress. Except the description of the hypotheses given later on in example (5) makes me think this is effectively how the Mixture model is being applied? But then the text beneath (7) clarifies that, no, this hypothesis really does allow the same lexical item to show up with different stress patterns.

(b) It’s really interesting to see the connection between descriptive and explanatory adequacy and Bayesian likelihood and prior. I immediately got the descriptive-likelihood link, but goggled for a moment at the explanatory-prior link. Isn’t explanatory adequacy about generalization? Ah, but a prior can be thought of as an extension of items -- and so the items included in that extension are ones the hypothesis would generalize to. Nice!

(3)  Likely Input and a Reasonable Learner: The take-home point seems to be that lexicons that support Gujarati* are rare, but not impossible. I wonder how well these match up to the distributions we see in child-directed speech (CDS)? Is CDS more like Degree 4, which seems closest to the Zipfian distribution we tend to see in language at different levels?

(4) Interpretation of Results: I think Morley makes a really striking point about how much we actually (don’t) know about typological diversity, given the sample available to us (basically, we have 0.02% of all the languages). It really makes you (me) rethink making claims based on typology.

References

Ambridge, B., Pine, J. M., & Lieven, E. V. (2014). Child language acquisition: Why universal grammar doesn't help. Language, 90(3), e53-e90.

Pearl, L. (2014). Evaluating learning-strategy components: Being fair (Commentary on Ambridge, Pine, and Lieven). Language, 90(3), e107-e114.

Pearl, L., Ho, T., & Detrano, Z. 2014. More learnable than thou? Testing metrical phonology representations with child-directed speech. Proceedings of the Berkeley Linguistics Society, 398-422.

Monday, November 2, 2015

Some thoughts on Pietroski 2015 in press

One of the things that stood out most to me from this article is the importance of the link between structured sequences and intended meanings (e.g., with the eager/easy to please examples). Pietroski is very clear about this point (which makes sense, as it was one of the main criticisms of the Perfors et al. 2011 work that attempted to investigate poverty of the stimulus for the canonical example of complex yes/no questions). Anyway, the idea that comes through is that it’s not enough to just deal with surface strings alone. Presumably it becomes more acceptable if the strings also include latent structure, though, like traces? (Ex: John is easy to please __(John) vs. John is eager (__John) to please.) At that point, some of the meaning is represented in the string directly.

I’m not sure how many syntactic acquisition models deal with the integration of this kind of meaning information, though. For example, my islands model with Jon Sprouse (Pearl & Sprouse 2013) used latent phrasal structure (IP, VP, CP, etc) to augment the learner’s representation of the input, but was still just trying to assign acceptability (=probability) to structures irrespective of the meanings they had. That is, no meaning component was included. Of course, this is why we focused on islands that were supposed to be solely “syntactic”, unlike, for instance, factive islands that are thought to incorporate semantic components. (Quickie factive island example: *Who do you forget likes this book? vs. Who do you believe likes this book?). Is our approach an exceptional case, though? That is, is it never appropriate to worry only about the “formatives” (i.e., the structures in absence of interpretation)? For instance, what if we think of the learning problem as trying to decide what formative is the appropriate way to express a particular interpretation — isn’t identifying the correct formative alone sufficient in this case? Concrete example: Preferring “Was the hiker who was lost killed in the fire?” over “*Was the hiker who lost was killed in the fire?” with the interpretation of "The hiker who was lost was killed in the fire [ask this]".

Some other thoughts:

(1) My interpretation of the opening quote is that acquisition models (as theories of language learning and/or grammar construction) matter for theories of language representation because they facilitate the clear formulation of deeper representational questions. (Presumably by highlighting more concretely what works and doesn’t work from a learning perspective?) As an acquisition chick who cares about representation, this makes me happy.


(2) For me, the discussion about children’s “vocabulary” that allows them to go from “parochial courses of human experience to particular languages” is another way of talking about the filters children have on how they perceive the input and the inductive biases they have on their hypothesis spaces. This makes perfect sense to me, though I wouldn’t have made the link to the term “vocabulary” before this. Relatedly, the gruesome example walkthrough really highlights for me the importance of inductive biases in the hypothesis space. For example, take the assumption of constancy w.r.t. time for what (most) words mean (so we never get “green before time t, blue after time t” as a possible meaning, even though it’s logically possible given the bits we build meaning out of). So we get that more exotic example, which gets followed up with more familiar linguistic examples that help drive the point home.

References:
Pearl, L., & Sprouse, J. (2013). Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20(1), 23-68.

Perfors, A., Tenenbaum, J. B., & Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118(3), 306-338.



Monday, October 19, 2015

Some thoughts on Braginsky et al. 2015

One of the things I quite like about this paper is that it’s a really nice example of what you can do with observational data (like the CDI data), though of course there are still the standard limitations on the accuracy of caretaker reports, the fact that you’re getting at production (in this case) rather than comprehension so we’re seeing a time delay w.r.t. when the knowledge is acquired by the child, etc. 

Also, how nice to see a study with this many subjects! I think this size subject pool is more standard in medical studies, but we rarely see samples this size in language acquisition studies. This means that when we find trends, we can be more sure it’s not just a fluke of the sample.

The question from the modeler’s perspective then becomes “What can we do with this?” Certainly this provides an empirical checkpoint in multiple languages for specific details about the development trajectory. So, I think this makes it good behavioral data for models of syntactic development (e.g., MOSAIC by Freudenthal & colleagues: Freudenthal et al. 2007, 2009; Variational learning: Yang 2004, Legate & Yang 2007) and models of vocabulary development (e.g., the model of McMurray & colleagues: McMurray 2007, Mitchell & McMurray 2009, McMurray et al. 2012) to try and match their outputs against. Especially good are the differences across languages - these are the kind of nuances that may distinguish models from each other. Perhaps even more interesting would be an attempt to build a joint model that combines promising syntactic development and vocabulary development models together so that you can look for the correlational patterns this large-scale observational study provides.


Some more targeted thoughts:
(1) The methodology advance of wordbank.stanford.edu pleases me no end — I think this kind of aggregation approach is the way forward. Once you can aggregate data sets of this size, you can find things that you can feel more confident about as a scientist. So, the finding that there are age effects on syntax (less so on morphology) and on function words (less so on nouns) is something that people will take notice of.

(2) Analysis 1: I wonder how much of an effect the linguistic properties of these languages have (ex: Spanish, Norwegian, and Dutch are morphologically much richer than English). It would be nice to see some sort of quantitative measure of the morphological richness, and maybe other potentially relevant cross-linguistic factors. A related thought: Are there any useful/explanatory cross-linguistic differences in the actual Complexity (Morphological & Syntactic) items?


(3) Analysis 2,  Figure 4: There’s an interesting difference in early Spanish where predicates lag behind function words until the vocabulary size =~ 0.4. Presumably this is something due to the language itself, and the items in the predicates vs. function words categories? It’s notable that Spanish is also the only language where predicates don’t seem to have an age effect coefficient (see Figure 5) - so predicate development is totally predictable from the child’s vocabulary development. Also, Figure 5 shows Danish with a big age effect for Nouns — does this have to do with the particular nouns, I wonder? Or something about Danish nouns in general?

~~~
References:

Freudenthal, D., Pine, J. M., Aguado‐Orea, J., & Gobet, F. (2007). Modeling the developmental patterning of finiteness marking in English, Dutch, German, and Spanish using MOSAIC. Cognitive Science, 31(2), 311-341.

Freudenthal, D., Pine, J. M., & Gobet, F. (2009). Simulating the referential properties of Dutch, German, and English root infinitives in MOSAIC. Language Learning and Development, 5(1), 1-29.

Legate, J. A., & Yang, C. (2007). Morphosyntactic learning and the development of tense. Language Acquisition, 14(3), 315-344.

McMurray, B. (2007). Defusing the childhood vocabulary explosion. Science, 317(5838), 631-631.

McMurray, B., Horst, J. S., & Samuelson, L. K. (2012). Word learning emerges from the interaction of online referent selection and slow associative learning. Psychological Review, 119(4), 831.

Mitchell, C., & McMurray, B. (2009). On leveraged learning in lexical acquisition and its relationship to acceleration. Cognitive Science, 33(8), 1503-1523.

Yang, C. D. (2004). Universal Grammar, statistics or both? Trends in Cognitive Sciences, 8(10), 451-456.


Monday, October 5, 2015

Tenure-track Assistant Professor, Language Science @ UCI

The Program in Language Science (http://linguistics.uci.edu) at the University of California, Irvine (UCI) is seeking applicants for a tenure-track assistant professor faculty position. We seek candidates who combine a strong background in theoretical linguistics and a research focus in one of its sub-areas with computational, psycholinguistic, neurolinguistic, or logical approaches.
The successful candidate will interact with a dynamic and growing community in language, speech, and hearing sciences within the Program, the Center for Language Science, the Department of Cognitive Sciences, the Department of Logic and the Philosophy of Science, the Center for the Advancement of Logic, its Philosophy, History, and Applications, the Center for Cognitive Neuroscience & Engineering, and the Center for Hearing Research. Individuals whose interests mesh with those of the current faculty and who will contribute to the university's active role in interdisciplinary research and teaching initiatives will be given preference.
Interested candidates should apply online at https://recruit.ap.uci.edu/apply/JPF03107 with a cover letter indicating primary research and teaching interests, CV, three recent publications, three letters of recommendation, and a statement on past and/or potential contributions to diversity, equity and inclusion.
Application review will commence on November 20, 2015, and continue until the position is filled.
The University of California, Irvine is an Equal Opportunity/Affirmative Action Employer advancing inclusive excellence. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, protected veteran status, or other protected categories covered by the UC nondiscrimination policy.

Some thoughts on Meylan & Griffiths 2015

I really enjoyed seeing this extension of a reasonable existing word-learning model (which was focused on concrete nouns) to something that tries to capture more of the complexity of word meaning learning. I admit I was surprised to find out that the extension was on the semantics side (compositional meanings) rather than some sort of syntactic bootstrapping (using surrounding word contexts), especially given their opening example. Given the extensive syntactic bootstrapping experimental literature, I think a really cool extension would be to incorporate the idea that words appearing in similar distributional contexts have similar meanings. Maybe this requires a more sophisticated “meaning” hypothesis space, though? 

I also appreciated seeing the empirical predictions resulting from their model (good modeling practices, check!). More specifically, they talk about why their model does better with a staged input representation, and suggest that learning from one, then two, then three words would lead to the same result as learning from three, then two, then one word (which is not so intuitive, and therefore an interesting prediction). To be honest, however, I didn’t quite follow the nitty-gritty details of why that should be, so that’s worth hashing out together.


More specific thoughts:
(1) The learners here have the assumption that a word refers to a subset of world-states, and that presumably could be quite large (infinite even) if we’re talking about all possible combinations of objects, properties, and actions, etc. So this means the learner needs to have some restrictions on the possible components of the world-states. I think that’s pretty reasonable — we know from experimental studies that children have conceptual biases, and so probably also have equivalent perceptual biases that filter down the set of possible world-states in the hypothesis space.

(2) The “wag” example walk-through: I’m not sure I understand exactly how the likelihood works here. “Wag” refers to side-to-side motion. If the learner thinks “wag” refers to side-to-side motion + filled/black shading, this is described as being “consistent with the observed data”.  But what about the instances of “wag” occurring with non-filled items (du ri wag, pu ri wag) - these aren’t consistent with that hypothesis. So shouldn’t the likelihood of generating those data, given this hypothesis, be 0? M&G2015 also note for this case that “the likelihood is relatively low in that the hypothesis picks out a larger number of world-states”. But isn’t side-to-side+black/filled compatible with fewer world-states than side-to-side alone?
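For what it’s worth, here’s the size-principle likelihood I have in my head when reading that passage, as a Python sketch (this is my reading, not M&G2015’s actual likelihood function, which may well handle inconsistent observations differently, e.g., with a noise parameter):

def size_principle_likelihood(hypothesis, observations):
    # P(data | h) under a bare size principle: each observed world-state is drawn
    # uniformly from the states the hypothesis picks out; an observation outside
    # the hypothesis gets probability 0.
    likelihood = 1.0
    for state in observations:
        if state not in hypothesis:
            return 0.0                       # inconsistent data rule the hypothesis out
        likelihood *= 1.0 / len(hypothesis)  # smaller hypotheses score higher
    return likelihood

# hypothetical world-states as (motion, shading) pairs
h_side_to_side        = {("side-to-side", "filled"), ("side-to-side", "unfilled")}
h_side_to_side_filled = {("side-to-side", "filled")}
wag_data = [("side-to-side", "filled"), ("side-to-side", "unfilled")]
print(size_principle_likelihood(h_side_to_side, wag_data))         # 0.25
print(size_principle_likelihood(h_side_to_side_filled, wag_data))  # 0.0 (my worry above)

On this version, the motion+shading hypothesis doesn’t just get a relatively low likelihood from picking out more world-states; it gets zeroed out by the unfilled instances. So either M&G2015’s likelihood tolerates inconsistency somehow, or I’m misreading which observations count as data for “wag”.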

(3) I like the incorporation of memory noise (which makes this simulation more cognitively plausible). Certainly the unintentional swapping of a word is one way to implement memory noise that doesn’t require messing with the guts of the Bayesian model (it’s basically an update to the input the model gets). I wonder what would happen if we messed with the internal knowledge representation instead (or in addition to this) and let the learned mappings degrade over time. I could imagine implementing that as some kind of fuzzy sampling of the probabilities associated with the mappings between word and world-state.

(4) Figure 3, with the adult artificial learning results from Kersten & Earles 2001: Adults are best at object or path mapping, and are much worse at manner mapping. My guess is that this has to do with the English bias for manner-of-motion encoded in verbs over direction-of-motion (which happens to be the opposite of the Spanish bias). So, these results could come from a transfer effect from the English L1 — in essence, due to their L1 bias, it doesn’t occur to the English subjects to encode the manner as a separate word from the verb-y/action-y type word. Given what we know about the development of these language-specific verb biases, this may not be present in the same way in children learning their initial language (e.g., there’s some evidence that all kids come predisposed for direction-of-motion encoding — Maguire et al. 2010.) At any rate, it seems easy enough to build in a salience bias for one type of world-state - just weight the prior accordingly. At the moment, the model doesn’t show the same manner deficit, and so this could be an empirically-grounded bias to add to the model to account for those behavioral results.

Maguire, M. J., Hirsh-Pasek, K., Golinkoff, R. M., Imai, M., Haryu, E., Vanegas, S., Okada, H., Pulverman, R., & Sanchez-Davis, B. (2010). A developmental shift from similar to language-specific strategies in verb acquisition: A comparison of English, Spanish, and Japanese. Cognition, 114(3), 299-319. 

(5) Also Figure 3: I’m not sure what to make of the model comparison with human behavior. I agree that there’s a qualitative match with respect to improvement for staged exposure over full exposure. Other than that? Maybe the percent correct, if averaged (sort of), matches for eta = 0.25. I guess the real question is how well the model is supposed to match the adult behavior. (That is, maybe I’m being too exacting in my expectations for the output behavior of the model, given what it has built into it.)

(6)  Simulation 3 setup: I didn’t quite follow this. Is the idea that the utterance is paired with four world-states, and the learner assumes the utterance refers to one of them? If so, what does this map to in a realistic acquisition scenario?  Having more conceptual mappings possible? In general, I think the page limit forced the authors to cut the description of this simulation short, which makes it tricky to understand.


Tuesday, September 29, 2015

Next time on 10/7/15 @ 3pm in SBSG 2221 = Meylan & Griffiths 2015

It looks like a good collective time to meet will be Wednesdays at 3pm for this quarter, so that's what we'll plan on.  Our first meeting will be on Oct 7 in SBSG 2221, and our complete schedule is available on the webpage at 


On October 7, we'll be discussing an article that extends an existing state-of-the-art model of concrete word learning to be able to leverage input in a more realistic way.

Meylan, S. C., & Griffiths, T. L. 2015. A Bayesian Framework for Learning Words From Multiword Utterances. In Proceedings of the Cognitive Science Society.


See you October 7!

Friday, June 5, 2015

Some thoughts on Kao et al. 2014

This work strikes me as a nice demonstration of the Rational Speech Act model framework, extended to allow multiple dimensions of communicative goals (in this case, true state of the world vs. affective content vs. both). Beyond the formalization of these components in the RSA model, the key seems to be that the listener must know that both communicative goals are possible. This got me thinking about how to apply the RSA model to child language processing — for example, would a model that had the true state of the world as its sole communicative goal match children’s interpretations better at a certain point in development? It seems possible. And then, we could track development by when this second communicative goal seems to be taken into consideration (i.e., does the RSA model+affect fit the behavioral data better than the basic RSA model), and potentially how much weight it’s given a priori.

A related thought occurred to me as I was reading the implementation details of the RSA model. The basic framework is that you have a listener, and the listener assumes the speaker generated the utterance by keeping in mind how a literal listener would interpret it. This clearly involves some pretty sophisticated theory of mind (ToM). So, similar to the above, could we track children’s development by how well this model fits their behavior vs. a model where the listener assumes a speaker who deviates from the above in some way (e.g., a speaker who has the same knowledge as the listener, rather than a speaker who thinks about how a naive literal listener will interpret the utterance)? To be honest, I really don’t know how to cash this out exactly, but the intuition feels right to me. Kids may have various kinds of ToM abilities early, but the ToM required in this model seems pretty sophisticated. So maybe kids have a limited ToM to begin with, and that plays out in this model in a different way than the model is currently set up. Then, we compare the ToM-limited model vs. the model given here against children’s behavior, and see which fits best at different stages of development.
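For concreteness, here’s the basic RSA recursion as I understand it, in a Python sketch that strips out Kao et al.’s affect dimension and goal projection (toy scalar example with made-up numbers; alpha is the usual speaker rationality parameter):

import numpy as np

# Toy scalar domain: a listener hears "some" or "all" about 3 objects.
utterances = ["some", "all"]
states = [1, 2, 3]                    # how many of the 3 objects have the property
literal = np.array([[1.0, 1.0, 1.0],  # "some" is literally true of 1, 2, 3
                    [0.0, 0.0, 1.0]]) # "all" is literally true only of 3
prior = np.array([1/3, 1/3, 1/3])     # uniform prior over states (assumed)
alpha = 1.0                           # speaker rationality (assumed value)

def normalize_rows(m):
    return m / m.sum(axis=1, keepdims=True)

L0 = normalize_rows(literal * prior)   # literal listener: P(state | utterance)
S1 = normalize_rows(L0.T ** alpha)     # speaker: P(utterance | state), utterance cost omitted
L1 = normalize_rows(S1.T * prior)      # pragmatic listener: P(state | utterance)

print(L1[utterances.index("some")])
# probability mass shifts away from state 3: the "some but not all" inference

The ToM-limited comparison I have in mind would swap out L0 inside the speaker step for something else (say, a speaker who assumes the listener already knows the state), and then check which version tracks kids’ behavior at different ages.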

Some additional comments:
(1) Looking at Figure 2B, it seems like humans (far right panel) still have a bit more of the literal interpretation bias (more of a spike at exactly 1000 for “costs $1,000”) and a bit more of the imprecise goal bias (more of a spike at 1001) than the full model does (next panel to the left). I wonder if this separates out by individuals — I could imagine some people being more literal than others (maybe due to natural variation, or because of an Asperger Syndrome type condition).


(2) Related to the above, the imprecise goal seems to be another communicative dimension, but it’s not talked about that way. Instead, we have “truth” vs. “affect”, and then imprecise goal gets folded into affect. I wonder why — perhaps because “imprecise goal” is a way to signal “this is not the truth”? If so, that would require fairly sophisticated communicative knowledge. On the other hand, Kao et al. (2014) treat it as completely separate in the Materials and Methods section — precision of goal (precise vs. imprecise) is fully crossed with communicative goal (truth vs. affect vs. both). So, it does start to feel like an additional communicative dimension.

Wednesday, May 13, 2015

Some thoughts on Kolodny et al. 2015

There are two main things that I really enjoyed about this paper: (1) the explicit attempt to incorporate known properties of language acquisition into the proposed model (unsupervised learning, incremental learning, generative capacity of the learner), and (2) the breadth of empirical studies they tried to validate the proposed model on. Having said this, each of these things has a slight downside for me, given the way that they were covered in this paper.  

First, there seems to be a common refrain of “biological realism”, with the idea that the proposed model does this far better than any other model to date. I found myself wondering how true this was — pretty much every acquisition model we examine in the reading group has included the core properties of unsupervised learning and generative capacity, and all the algorithmic-level ones include incremental learning of some kind. What seems to separate the proposed model from these is the potential domain-generality of its components. That is, it’s meant to apply to any sequential hierarchically structured system, as opposed just to language. But even for this, isn’t that exactly what Bayesian inference does too? It’s the units and hypothesis spaces that are language-specific, not the inference mechanism.

Second, because K&al covered empirical data from so many studies, I felt like I didn’t really understand any individual study that well or even the specifics of how the model works on a concrete example. This is probably a length consideration issue (breadth of coverage trumped depth of coverage), but I really do wish more space had been devoted to a concrete walk-through of how these graphs get built up incrementally (and how the different link types are decided and what it means for something to be “shifted in time”, etc.). I want to like this model, but I just don’t understand the nitty gritty of how it works.

So, given this, I wasn’t too surprised that the Pearl & Sprouse island effects didn’t work out. The issue to me is that K&al were running their model over units that weren’t abstract enough — the P&S strategy worked because it was using trigrams of phrase structure (not trigrams of POS categories, as K&al described it). And not just any phrase structure —specifically, the phrase structure that would be “activated” because the gap is contained inside that phrase structure. So basically, the units are even more abstract than just phrase structure. They’re a subset of phrase structure nodes.  And that’s what trigrams get made out of. Trying to capture these same effects by using local context over words (or even categories that include clumps of words or phrases) seems like we're using the wrong units. I think K&al’s idea is that the appropriate “functionally similar” abstract units would be built up over time with the slot capacity of the graph inference algorithm (and maybe that’s why they alluded to a data sparseness issue). And that might be true…but it certainly remains to be concretely demonstrated.
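To make the units issue concrete, here’s my rough reconstruction of the container-node trigram scoring from Pearl & Sprouse (2013) in Python (the path and probabilities below are made up; the real ones come from counts over child-directed speech):

def dependency_score(container_path, trigram_probs, smoothing=1e-6):
    # Score a wh-dependency by the smoothed trigram probability of its
    # container-node path: the phrase-structure nodes containing the gap,
    # from the wh-word down to the gap, padded with start/end markers.
    path = ["START"] + container_path + ["END"]
    score = 1.0
    for i in range(len(path) - 2):
        score *= trigram_probs.get(tuple(path[i:i + 3]), smoothing)
    return score

# hypothetical numbers for something like "What did she say she read __?"
trigram_probs = {("START", "IP", "VP"): 0.4, ("IP", "VP", "CP"): 0.2,
                 ("VP", "CP", "IP"): 0.3, ("CP", "IP", "VP"): 0.3,
                 ("IP", "VP", "END"): 0.5}
print(dependency_score(["IP", "VP", "CP", "IP", "VP"], trigram_probs))  # ~0.0036

The point is that the trigrams are over this handful of abstract container nodes, not over words or word-clump categories, which is why I’d want to see the slot-building process actually get the learner to units like these before running the comparison.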

Some other specific thoughts:

(1) 2.1, “…a unit that re-appears within a short time is likely to be significant” — This seems related to the idea of burstiness.

(2) 2.2, “…tokens are either separated by whitespaces…or…a whitespace is inserted between every two adjacent tokens” — Is this a default for a buffer size of two units? And if so, why? Something about adjacency?

(3) 2.3, “…create a new supernode, A + B, if sanctioned by Barlow’s (1990) principle of suspicious coincidence, subject to a prior” — How exactly does this work? Is it like Bayesian inference? What determines the prior?

(4) 2.4, “…when a recurring sequence….is found within the short-term memory by alignment of the sequence to a shifted version of itself” — How exactly is the shifted version created? How big is the buffer? How cognitively intensive is this to do?

(5) 2.6, “…i.e., drawing with a higher probability nodes that contain longer sequences” — Why would this bias be built in explicitly? If anything, I would think shorter sequences would have a higher probability.

(6) 3.1, Figure 2: It seems like U-MILA suddenly does just great on 9- and 10-word sequences, after doing poorly on 6-8 word sequences. Why should this be?

Wednesday, April 29, 2015

Some thoughts on Heinz 2015 book chapters, parts 6-9

Continuing on from last time, where we read up through the discussion about constraints on strings, Heinz’s 2015 book chapter now gets into the constraints on maps between the underlying form and the observable form of a phonological string. As before, I found the more leisurely walk-through of the different ideas (complete with illustrative figures) quite accessible. The only gap in that respect for me as a non-phonologist was what an opaque map was, since Heinz mentions that opaque maps raise potential issues for the computational approach here. A quick googling pulled up some examples, but a brief concrete example would have been helpful.

On a more contentful note, I found the compare and contrast with the optimality approach quite interesting. We have this great setup for some logically possible maps that are derivationally simple (e.g. “Sour Grapes”), and yet we find these maps unattested. Optimality has to add stuff in to take care of it, while the computational ontology Heinz presents neatly separates them out. Boom. Simple.

So then this leads me (as an acquisition person) to wondering what we can do with this learning-wise. Let’s say we have the set of phonological maps that occur in human language captured by a certain type of relationship (input-strictly local [ISL]) — there are some exceptions currently, but let’s say those get sorted out. Then, we also have some computational learnability results about how to learn these types of maps in the limit. Can I, as an acquisition modeler, then do something with those algorithms? Or do I need to develop other algorithms based off of those that do the same thing, only in plausible time limits? 

And let’s make this even more concrete, actually — suppose there are a set of maps capturing English phonology that we think children learn by a certain age. Suppose that we do the kind of analysis Heinz suggests and discover all these maps are ISL. What kind of learning algorithms should I model to see if children could learn the right maps from English child-directed data? Are the existing learnability algorithms the ones? Or do I need to adapt them somehow? Or is it more that they serve to show it’s possible, but they may bear no resemblance to the algorithms kids would actually have to use? Given Heinz’s comment at the end of part 5 about the link between algorithm and representation, I feel like the existing algorithms should be related to the ones kids approximate if that kind of link is there.
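Here’s the mental picture I have of what an ISL map buys you computationally, as a Python sketch (a sketch only; it glosses over the formal details of ISL transducers, and the rule and window size are toy choices of mine):

def isl_like_map(segments, rewrite, k=2):
    # Apply an ISL-style map: the output for each input position is computed from
    # a bounded window of the INPUT string (here, the current segment plus the
    # next k-1 input segments), never from the output produced so far.
    padded = list(segments) + ["#"] * (k - 1)
    return [rewrite(tuple(padded[i:i + k])) for i in range(len(segments))]

# toy rule: word-final obstruent devoicing (b, d, g -> p, t, k before the boundary #)
DEVOICE = {"b": "p", "d": "t", "g": "k"}
def final_devoicing(window):
    segment, next_segment = window
    return DEVOICE.get(segment, segment) if next_segment == "#" else segment

print("".join(isl_like_map("bund", final_devoicing)))  # "bunt"

If the English maps children have to learn really are all of this bounded-window character, then the acquisition question becomes whether the provable learners for this class (or child-plausible approximations of them) can find the right windows and rewrites from child-directed data in a reasonable amount of time.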

A few other thoughts: 

(1) Heinz points out the interesting dichotomy between tone maps and segment maps, where the tone maps allow more complex relationships. He mentions that this has been used to argue for modularity (where tones are in one module and segments are in the other, presumably), and that could very well be. What it also shows is that there isn’t just one restriction on the complexity in general — a more restrictive one occurs for segment maps but a less restrictive one occurs for tone maps. Why? Two thoughts: (1) Maybe the less restrictive one is the general abstract restriction, and something special happens for segments that further restricts it. This fits into the modularity explanation above. But (2) maybe it’s just chance that we haven’t found segment maps that violate the stricter restriction. If so, we wouldn’t need the modularity explanation since the difference between segment maps and tonal maps would just be, in effect, a sampling error (more samples, if we had them, would show segment maps that don’t follow that extra restriction). Caveat: I’m not sure how plausible this second idea is, given how many segment maps we have access to.

(2) I'm still not sure how much faith I have in the artificial language learning experiments that are meant to show that humans can't learn certain types of generalizations/rules/mappings. I definitely believe that the subjects struggled to learn certain ones in the experiment while finding others easy to learn. But how much of that is effectively an L2 transfer effect? That is, the easy-to-learn ones are the kinds of patterns found in your native language, so (abstractly) you already have a bunch of experience with those and no experience with the hard-to-learn kind. To be fair, I'm not sure how you could factor out the L2 transfer effect — no matter what you do with adults (or even kids), if it's a language thing, they've already had exposure to the patterns of their native language.


(3) Something for NLP applications (maybe): Section 6.4, “The simplest maps are Markovian on the input or the output (ISL, LOSL, and ROSL), and very many phonological transformations belong to these classes.” — This makes me think that the simpler representations NLP apps tend to use for speech recognition and production (ex: various forms of Hidden Markov Models, I think) may not be so far off from the truth, if this approach is correct.
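
Just to make the "Markovian" point concrete for myself, here's a tiny sketch of the forward algorithm for a made-up two-state HMM (my own toy setup, nothing from the chapter or any particular NLP system). Strictly speaking, the Markov property here is over hidden states, while ISL/OSL maps are Markovian over input/output symbols, but the spirit of "only local dependencies matter" is the same.

```python
# A toy two-state HMM (hypothetical numbers) plus the forward algorithm, which
# computes the probability of a symbol sequence while only ever conditioning
# on the immediately preceding hidden state.

states = ["V", "C"]                         # hypothetical hidden classes
init = {"V": 0.5, "C": 0.5}                 # P(first state)
trans = {"V": {"V": 0.3, "C": 0.7},         # P(next state | current state)
         "C": {"V": 0.6, "C": 0.4}}
emit = {"V": {"a": 0.9, "t": 0.1},          # P(symbol | state)
        "C": {"a": 0.1, "t": 0.9}}

def forward(observations):
    """Total probability of the observation sequence under the toy HMM."""
    alpha = {s: init[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

print(forward(list("tata")))  # how likely this CVCV-ish string is under the model
```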

Wednesday, April 15, 2015

Some thoughts on Heinz 2015 book chapter, parts 1-5

For me, this was a very accessible introduction to a lot of the formal terminology and distinctions that computational learnability research trades in. (For instance, I think this may be the first time I really understood why we would be excited that generalizations would be strictly local or strictly piecewise.) From an acquisition point of view, I was very into some particular ideas/approaches:

(1) the distinction between an intensional description (i.e., theoretical constructs that compactly capture the data) and an extension (i.e., the actual pattern of data), along with the analogy to the finite means (intensional description) that accounts for the infinite use (extension). If there’s a reasonable way to talk about the extension, we get a true atheoretical description of the empirical data, which seems like an excellent jumping off point for describing the target state of acquisition.
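
Here's a toy illustration of that split (my own example): the intensional description is a finite set of permitted bigrams (a strictly 2-local grammar), and the extension is the unbounded set of strings it licenses, which we can only ever enumerate up to some length bound.

```python
from itertools import product

# A toy illustration of intension vs. extension: the intensional description
# is a finite set of permitted bigrams (a strictly 2-local grammar); the
# extension is the unbounded set of strings whose bigrams, with word
# boundaries added, are all permitted.

PERMITTED_BIGRAMS = {("#", "a"), ("a", "b"), ("b", "a"), ("a", "#")}  # finite means
ALPHABET = ["a", "b"]

def well_formed(string):
    padded = ["#"] + list(string) + ["#"]
    return all(pair in PERMITTED_BIGRAMS for pair in zip(padded, padded[1:]))

def extension_up_to(n):
    """Enumerate the (in principle infinite) extension, truncated at length n."""
    for length in range(1, n + 1):
        for candidate in product(ALPHABET, repeat=length):
            if well_formed(candidate):
                yield "".join(candidate)

print(list(extension_up_to(5)))  # ['a', 'aba', 'ababa']: infinite use, finite means
```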

(2) the approach of defining the implicit hypothesis space, i.e., the fundamental pieces that explicit hypothesis spaces (or generalizations) are built from. This feels very similar to the old-school Principles & Parameters approach to language acquisition (specifically, the Principles part, if we're talking about the things that don't vary). It also jibes well with some recent thoughts in the Bayesian inference sphere (e.g., see Perfors 2012 for implicit vs. explicit hypothesis spaces).

**Perfors, A. 2012. Bayesian Models of Cognition: What's Built in After All? Philosophy Compass, 7(2), 127-138.

(3) that tie-in between the nature of phonological generalizations, the algorithms that can learn those generalizations, and why this might support those generalizations as actual human mental representations. In particular, “Constraints on phonological well-formedness are SL and SP because people learn phonology in the way suggested by these algorithms.” (End of section 5.2.1)

When I first read this, it seemed odd to me — we’re saying something like: “Look! Human language makes only these kinds of generalizations, because there are constraints! And hey, these are the algorithms that can learn those constrained generalizations! Therefore, the reason these constraints exist is because these algorithms are the ones people use!” It felt as if a step were missing at first glance: we use the constrained generalizations as a basis for positing certain learning algorithms, and then we turn that on its head immediately and say that those algorithms *are* the ones humans use and that’s the basis for the constrained generalizations we see. 

But when I looked at it again (and again), I realized that this did actually make sense to me. The way we got to the story may have been a little roundabout, but the basic story of "these constraints on representations exist because human brains learn things in a specific way" is very sensible (and picked up again in 5.4: "…human learners generalize in particular ways—and the ways they generalize yield exactly these classes"). And what this does is provide a concrete example of exactly which constraints and exactly which specific learning procedures we're talking about for phonology.
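
To make that concrete for myself, here's a minimal sketch of the flavor of learning procedure I take this to be describing for strictly local constraints (my own simplified rendering, not necessarily Heinz's exact algorithm): record the k-grams attested in the input, and judge new forms by whether all of their k-grams have been seen.

```python
# A minimal sketch (my simplification, not necessarily the chapter's exact
# algorithm) of learning a strictly 2-local grammar: the learner records
# which bigrams occur in the input, and judges new forms by whether all of
# their bigrams have been attested.

def bigrams(word):
    padded = ["#"] + list(word) + ["#"]        # '#' marks word boundaries
    return set(zip(padded, padded[1:]))

def learn_sl2(corpus):
    grammar = set()
    for word in corpus:
        grammar |= bigrams(word)               # generalization = union of attested bigrams
    return grammar

def grammatical(word, grammar):
    return bigrams(word) <= grammar

# Hypothetical toy corpus (made up for illustration).
corpus = ["map", "ma", "pam"]
g = learn_sl2(corpus)
print(grammatical("mapam", g))   # True: every bigram in it was attested somewhere
print(grammatical("mpam", g))    # False: the bigram ('m', 'p') never occurred
```

What I like about this is that the grammar is just a record of attested local chunks, so the learning procedure and the representation really are two sides of the same object, which I take to be part of the point here.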

(4) There’s a nice little typology connection at the end of section 5.1, based on these formal complexity ontologies: “…the widely-attested constraints are the formally simple ones, where the measure of complexity is determined according to these hierarchies”. Thinking back to links with acquisition, would this be because the human brain is sensitive to the complexity levels (however that might be instantiated)? If so, the prevalence of less complex constraints is due to how easy they are to learn with a human brain. (Or any brain?)



Friday, April 3, 2015

Next time on 4/17/15 @ 12pm in SBSG 2221 = Heinz 2015 book chapter, parts 1-5

It looks like a good collective time to meet will be Fridays at 12pm for this quarter, so that's what we'll plan on.  Our first meeting will be on April 17, and our complete schedule is available on the webpage at http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html


On April 17, we'll be discussing the first part of a book chapter (parts 1-5) on computational phonology that focuses on the kinds of generalizations and computations that seem to occur in this linguistic domain. This can be very useful for us to think about as modelers if we want to understand the hypothesis spaces learners have for making phonological generalizations.

Heinz, J. (2015). The computational nature of phonological generalizations. Manuscript, University of Delaware. Please do not cite without permission from Jeff Heinz.


See you on April 17!

Friday, March 27, 2015

Spring quarter scheduling

I hope everyone's had a good spring break - and now it's time to gear up for the spring quarter of the reading group! :) The schedule of readings is now posted on the CoLa Reading group webpage, including readings on phonology, process models, and pragmatics:

http://www.socsci.uci.edu/~lpearl/colareadinggroup/schedule.html

Now all we need to do is converge on a specific day and time - please let me know by next Thursday (4/2/15) what your availability is during the week. We'll continue our tradition of meeting for approximately one hour (and of course, posting on the discussion board here).

Thanks and see you soon!
-Lisa

Friday, March 13, 2015

See you in the spring!

Thanks so much to everyone who was able to join us for our enlightening discussion today about Viau et al. 2010, and to everyone who's joined us throughout the winter quarter! The CoLa Reading Group will resume again in the spring quarter. As always, feel free to send me suggestions of articles you're interested in reading, especially if you happen across something particularly interesting!

-Lisa

Wednesday, March 11, 2015

Some thoughts on Viau et al. 2010

I really enjoyed the clear delineation of structural vs. semantic vs. pragmatic factors described — it makes it easier to imagine a formal model of interpreting these kinds of utterances. For example, things that matter: 

(i) how recently a given logical structure has been computed (structure)
(ii) what meaning/extension has recently been accessed, irrespective of the structure that generated it (semantics)
(iii) what the likely communicative intention is, given the discourse (pragmatics)

In particular, it’s the dependencies between these three that seem particularly interesting, since this paper provides concrete evidence of how much impact (i) and (ii) can have when (iii) is minimized, and also how (ii) can impact (i).

Also, I kept thinking about how the phenomena described might relate to a Rational Speech Act (RSA) model of language use, which typically gets at the pragmatics (iii) by saying something about the meanings that were intended (ii). So maybe what we’d really want in order to capture what’s going on during interpretation is to use something RSA-like to model the pragmatics, while also having a processing model that deals with how accessible the structures (i) and extensions (ii) are to a child learner (or even an adult, I suppose).
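
For my own benefit, here's a minimal sketch of the vanilla RSA recursion (literal listener, pragmatic speaker, pragmatic listener), with a made-up lexicon loosely inspired by the scope examples below. The utterances, world states, and probabilities are all hypothetical placeholders, not V&al2010's materials.

```python
# A minimal sketch of the vanilla Rational Speech Act recursion, using a
# made-up two-utterance lexicon (my own toy setup). For simplicity, the
# "meanings" are world states, and the ambiguous utterance counts as
# literally true of a world if either scope reading makes it true there.

worlds = ["none-jumped", "some-but-not-all-jumped"]
utterances = ["every-horse-didn't-jump", "some-horses-jumped"]

lexicon = {  # literal truth of each utterance in each world (simplified)
    ("every-horse-didn't-jump", "none-jumped"): 1,             # true on every>not
    ("every-horse-didn't-jump", "some-but-not-all-jumped"): 1, # true on not>every
    ("some-horses-jumped", "none-jumped"): 0,
    ("some-horses-jumped", "some-but-not-all-jumped"): 1,
}
prior = {w: 0.5 for w in worlds}
alpha = 1.0  # speaker rationality parameter

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()} if z else d

def L0(u):   # literal listener: truth * prior
    return normalize({w: lexicon[(u, w)] * prior[w] for w in worlds})

def S1(w):   # pragmatic speaker: prefers utterances a literal listener gets right
    return normalize({u: L0(u)[w] ** alpha for u in utterances})

def L1(u):   # pragmatic listener: reasons about the speaker's choice
    return normalize({w: prior[w] * S1(w)[u] for w in worlds})

print(L1("every-horse-didn't-jump"))
# -> {'none-jumped': 0.75, 'some-but-not-all-jumped': 0.25}: the pragmatic
# listener leans toward the stronger construal, because a speaker who meant
# the weaker one had a better alternative utterance available.
```

The pragmatics piece (iii) lives in the recursion itself, while the accessibility of structures (i) and extensions (ii) would presumably have to come in through the priors or the literal listener, which is the piece RSA doesn't give you for free.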

More specific thoughts:

(1) I was surprised to read that non-isomorphic scope readings only seem to be a problem when negation is involved, based on Goro 2007 (intro, a paragraph before example 5). So kids are fine for something like "Everyone saw a movie", where a >> every = there's a specific movie that every person saw. But as soon as we stick in negation ("Everyone didn't see a movie"), the non-isomorphic reading with "not" at the top becomes hard for kids to get (i.e., not >> every, a = it's not true that every person saw a movie — some did, some didn't). This makes "not" special. And I wonder if the RSA-style models have anything to say about that, since it does seem very pragmatics-based. (Though I suppose "not" is also special syntactically, since it isn't a determiner the way "every" and "a" are, and it's also special semantically, since it inverts the meaning.)

In footnote 6, in section 3.1.5, V&al2010 mention a little about what's going on with respect to the pragmatics, as they discuss a hypothesis that negating positive expectations ("They all were going to…but look! Some didn't.") is easier than negating negative expectations ("They all weren't going to — but look! Some did!"). Let's suppose this is true — is this easy to instantiate in an RSA-like model? Does it maybe fall out from assumptions that are natural in an RSA-like model?

(2) That crazy effect in experiment two, where they first give kids expected success (ES) stories and then get a super-duper non-isomorphic access effect even after they take away that supportive pragmatic environment (i.e., switch them to expected failure (EF) setups): we can see this pretty starkly in Figure 5. I don't think V&al2010 quite know what's going on with that either. I guess it could be structural and semantic priming just taking over, and since the kids don't get any evidence that this is wrong, we maybe get a training effect. But this would be a pretty cool behavior to capture in a model that didn't explicitly build it in.


(3) Footnote 16 right at the end about how long the priming lasted — 3 days to a month sounds like a very long time. My (fairly uninformed and probably out-of-date) recollection about syntactic priming suggested that structure priming effects are usually pretty short-lived. So something lasting this long is a major implicit learning kind of thing. Maybe this happened because the logical structure is connected to specific extensions that mattered in the context of the experiments? So the idea would be that this connection between multiple representations (one of which is more conscious and matters for communication, i.e., the semantic extension) is what caused the evidence accrued during the experiment to have more of an impact on these kids.