I really liked how compact this paper was - there was quite a bit of material included without it feeling like part of the discussion was missing. I appreciated the connections made between the model's implementation and the cognitive learning biases that implementation represented.
As a researcher with a soft spot for empirically-grounded modeling, I was also pleased to see the connections to English and Navajo phonotactic variation. (I admit, I would have liked a bit less abstraction for some of the modeling demonstrations once the basic principle had been illustrated, but that's probably why it was a 20 page paper instead of a 40 page paper.) One of the things that really struck me was how similar the MaxEnt framework discussed seemed to hierarchical Bayesian models (HBMs) - I kept wanting to map the two frameworks onto each other (prior = prefer simpler grammars, likelihood = maximize probability of the input data, etc.). It seemed like the MaxEnt framework included an overhypothesis (dislike geminate consonants in general [structure-blind]) and then some more specific instantiations (dislike them within words, but don't care about them as much across words [structure-sensitive]). This would be the "leaking" the title refers to - the leaking of specific constraints back up to the overhypothesis. This also ties into the idea on p.763, where Martin mentions that structure-blind constraints may be a hold-over from very early learning. (Perfors, Tenenbaum, and colleagues often talk about the "blessing of abstraction" for overhypotheses, where the more abstract thing can be learned earlier because it's instantiated in so many things. And so perhaps the overhypothesis is reinforced more than any individual instantiation of it, making it more resistant to change later on.) But instead of having the constraints arranged in this kind of hierarchy (or maybe it's more like two factors interacting - (1) geminate preference + (2) within vs. across words?), they were specified explicitly by the modeler. This is a great first step to show that all of these constraints are needed, but it does feel like some more general representation is missing.
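To make the overhypothesis-plus-instantiations reading concrete, here is a toy MaxEnt grammar with one structure-blind constraint and two structure-sensitive ones. This is my own sketch: the constraint names, candidate forms, violation profiles, and weights are invented for illustration, not taken from Martin's model.

```python
import math

# Hypothetical violation profiles (invented, not Martin's actual constraint set):
# a structure-blind constraint (*GEM) plus two structure-sensitive ones.
candidates = {
    "no-geminate":     {"*GEM": 0, "*GEM-WITHIN": 0, "*GEM-ACROSS": 0},
    "geminate-within": {"*GEM": 1, "*GEM-WITHIN": 1, "*GEM-ACROSS": 0},
    "geminate-across": {"*GEM": 1, "*GEM-WITHIN": 0, "*GEM-ACROSS": 1},
}

# Hypothetical weights: the general constraint carries weight even though the
# specific within-word constraint does most of the work -- this shared weight
# is one way to picture the "entanglement" between general and specific.
weights = {"*GEM": 1.0, "*GEM-WITHIN": 2.0, "*GEM-ACROSS": 0.2}

def maxent_probs(cands, w):
    """P(x) proportional to exp(-harmony), harmony = sum of weighted violations."""
    scores = {x: math.exp(-sum(w[c] * v for c, v in viols.items()))
              for x, viols in cands.items()}
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}

probs = maxent_probs(candidates, weights)
```

With these weights, across-word geminates are penalized far less than within-word ones, but they are still dispreferred relative to no geminate at all, because the general *GEM weight applies everywhere - which is the leakage at issue.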
I also thought it was a very interesting hypothesis that marked forms (i.e., geminates across word boundaries in compounds) persist because new compounds are formed that are not drawn from the existing phonotactic distribution of geminates. Martin suggests this is because semantic factors play a role in compound formation, and those factors have nothing to do with phonotactics. This seems reasonable, but really, the main empirical finding is simply that something besides the existing phonotactic distribution matters. Something I would have liked to see is how far away the new-compound-formation distribution has to be from the existing distribution for these forms to persist - in the demonstration Martin does, this distribution is simply 0.5 (half the time new compounds contain geminates). But one might easily imagine that new compounds are formed from the existing words in the lexicon, and this rate might be less than 0.5, depending on the actual words in the lexicon. Do these forms persist if the new-compound-formation distribution is 0.25 geminates, for instance?
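The 0.25 question can at least be posed with a toy iterated dynamic. This is my own sketch with made-up parameters, not Martin's simulation: each generation, learners slightly under-reproduce geminates (the learning bias), and then a fraction of the lexicon turns over with new compounds formed at rate p_new. Under this linear dynamic, any p_new > 0 sustains geminates at a nonzero stationary rate, just a lower one for 0.25 than for 0.5 - whether Martin's actual model behaves this way is exactly the open question.

```python
def stationary_geminate_rate(p_new, turnover=0.1, learner_bias=0.9, steps=500):
    """Toy iterated-learning dynamic (parameters are invented for illustration).
    Each generation: learners under-reproduce geminates (multiply the current
    geminate proportion by learner_bias < 1), then a fraction `turnover` of the
    lexicon is replaced by new compounds containing geminates at rate p_new."""
    x = p_new  # starting proportion of geminate compounds (arbitrary)
    for _ in range(steps):
        x = (1 - turnover) * learner_bias * x + turnover * p_new
    return x

# Geminates persist (nonzero stationary rate) for any p_new > 0, but settle
# below the formation rate because of the learner's bias against them.
rate_half = stationary_geminate_rate(0.5)
rate_quarter = stationary_geminate_rate(0.25)
```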
Section 4: I was unsure how to map the learning model to Universal Grammar (UG), especially since Martin makes a point of connecting the model to UG in the first paragraph here. I think he's saying that the "entanglement" of the constraints (which reads to me like overhypothesis + more specific constraints) is not part of UG. This is fine, if we take the general structure of overhypotheses not to be a UG thing. But what does then seem to be a UG thing is what the overhypothesis actually is - in this case, knowing that geminates are a thing to pay attention to, and that word structure may matter for them. (In the same way, if we think of UG parameters as overhypotheses, the UG part is the content of the overhypothesis/parameter, not the fact that there is an overhypothesis at all.) So would Martin be happy to claim that both the "entanglement" structure and the content of the constraints themselves aren't part of UG? If so, where does the focus on geminates and word structure come from? Does attention to geminates and word structure logically arise in some way?
Section 4.2, p.760, discussing the tradeoff between modeling the data as accurately as possible and having as general a grammar as possible: This tradeoff is completely fine, of course, as that's exactly the sort of thing Bayesian models do. But Martin also equates a "general" grammar with a uniform distribution grammar - I was trying to decide whether that's the right connection to draw. In one sense, it may be, if we think about how much data each grammar is compatible with - a grammar with a uniform distribution doesn't really give much importance to any of the constraints (if I'm understanding this correctly), so it would presumably be fine with the entire set of input data. This then makes it more general than grammars that do place priority on some constraints, and so don't allow in some of the data.
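The prior/likelihood reading can be made explicit in the MaxEnt objective itself. Below is a minimal sketch (my own, not Martin's code) of the standard regularized objective: log-likelihood of the observed forms minus a Gaussian penalty pulling weights toward zero. The all-zero-weight grammar is exactly the uniform distribution over candidates, which is one precise sense in which "general grammar" = "uniform distribution".

```python
import math

def regularized_log_likelihood(weights, data, candidates, sigma2=1.0):
    """MaxEnt objective: log-likelihood of observed forms minus a Gaussian
    penalty on the weights. With all weights at zero, every candidate gets
    the same probability -- the uniform, maximally 'general' grammar that
    the prior pulls toward.

    candidates: form -> list of violation counts (one per constraint).
    data: list of observed forms (each a key of `candidates`).
    """
    def prob(x):
        scores = {y: math.exp(-sum(w * v for w, v in zip(weights, viols)))
                  for y, viols in candidates.items()}
        return scores[x] / sum(scores.values())

    log_likelihood = sum(math.log(prob(x)) for x in data)
    penalty = sum(w ** 2 for w in weights) / (2 * sigma2)
    return log_likelihood - penalty
```

Maximizing this objective is then the tradeoff in the text: the first term rewards fitting the data, the second rewards staying close to the uniform grammar.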
Section 4.2, p.760: The learning described, where the constraints are assigned arbitrary weights, and then the constraint weights are updated using the SGA update rule, reminds me a lot of neural net updating. How similar are these? On a more specific note, I was trying to figure out how to interpret C_i(x) and C_i(y) in the rule in (7) - are these simply binary (1 or 0)? (This would make sense, since the constraints themselves are things like "allow geminates".)
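On the neural-net resemblance: it is real, since this is stochastic gradient ascent on the MaxEnt log-likelihood, the same family of update as gradient training of a one-layer log-linear model. Here is a minimal sketch of an SGA-style update of the kind rule (7) describes; the candidate set, constraint inventory, and exact sign convention are my assumptions, not Martin's actual setup. C_i(.) is treated as a violation count, which for a constraint like "no geminates" is indeed just 0 or 1 per form.

```python
import math
import random

def sample_form(candidates, weights):
    """Sample a candidate from the current MaxEnt distribution
    (harmony = negative sum of weighted violation counts)."""
    forms = list(candidates)
    scores = [math.exp(-sum(w * v for w, v in zip(weights, candidates[f])))
              for f in forms]
    z = sum(scores)
    return random.choices(forms, weights=[s / z for s in scores])[0]

def sga_update(weights, candidates, observed, eta=0.1):
    """One SGA step in the style of rule (7): sample an output y from the
    learner's current grammar, compare it to the observed datum x, and move
    each weight by eta * (C_i(y) - C_i(x)). On average this follows the
    gradient of the log-likelihood, since E[C_i(y)] - C_i(x) is exactly
    that gradient for a penalty-style MaxEnt grammar."""
    y = sample_form(candidates, weights)
    return [w + eta * (c_y - c_x)
            for w, c_x, c_y in zip(weights, candidates[observed], candidates[y])]
```

When the sampled form matches the observed one, the violation profiles cancel and nothing changes; weights only move when the grammar's behavior diverges from the data, which is the perceptron-like flavor of the rule.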