One of the things I most enjoyed about this paper was the way Stabler gives the intuitions behind the different approaches - in many cases, these are some of the most lucid descriptions I've seen about these different mathematical techniques. I also really appreciated the discussion about model selection - it certainly seems true to me that model selection is what many theoretical linguists are thinking about when they discuss different knowledge representations. Of course, this isn't to say that parameter setting once you know the model isn't worthy of investigation (I worry a lot about it myself!). But I also think it's easier to use existing mathematical techniques to investigate parameter setting (and model selection, when the models are known), as compared to model generation.
Some more targeted thoughts below:
I really liked the initial discussion of "abstraction from irrelevant factors", which is getting at the idealizations that we (as language science researchers) make. I don't think anyone would dispute that it's necessary to do that to get anything done, but the fights break out when we start talking about the specifics of what's irrelevant. A simple example would be frequency - I think some linguists would assume that frequency's not part of the linguistic knowledge that's relevant for talking about linguistic competence, while others would say that frequency is inherently part of that knowledge since linguistic knowledge includes how often various units are used.
I thought Stabler made very good points about the contributions from both the nativist and the empiricist perspectives (basically, constrained hypothesis spaces for the model types but also impressive rational learning abilities) - and he did it in multiple places, highlighting that both sides have very reasonable claims.
The example in the HMM section with the discovery of implicit syllable structure reminded me very much of UG parameter setting. In particular, while it's true that the learner in this example has to discover the particulars of the unobserved syllable structure, there's still knowledge already (by the nature of the hidden units in the HMM) that there is hidden structure to be discovered (and perhaps even more specifically, hidden syllabic structure). I guess the real question is how much has to be specified in the hidden structure for the learner to succeed at discovering the correct syllable structure - is it enough to know that there's a level above consonants and vowels? Or do the hidden units need to specify that this hidden structure is about syllables, and then it's just a question of figuring out exactly what about syllables is true for this language?
I was struck by Stabler's comment about whether it's methodologically appropriate for linguists to seek grammar formalisms that guarantee that human learners can, from any point in the hypothesis space, always reach the global optimum by using some sort of gradient descent. This reminds me very much of the tension between the complexity of language and the sophistication of language learning. First, if language isn't that complex, then the hypothesis space de facto probably can be traversed by some good domain-general learning algorithms. If, however, language is complex, the hypothesis space may not be so cleanly structured. But, if children have innate learning biases that guide them through this "bumpy" hypothesis space, effectively restructuring the hypothesis space to become smooth, then this works out. So it wouldn't be so much that the hypothesis space must be smoothly structured on its own, but rather that it can be perceived as being smoothly structured, given the right learning biases. (This is the basic linguistic nativist tenet about UG, I think - UG is the set of biases that allow swift traversal of the "bumpy" hypothesis space.)
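To make the "bumpy" hypothesis space concrete, here is a minimal Python sketch of plain gradient descent on a one-dimensional objective with two basins. The function `f`, the learning rate, and the starting points are all made up for illustration (none of them come from Stabler's paper); the only point is that where the learner starts determines which optimum it ends up in.

```python
def f(x):
    """Hypothetical 'bumpy' objective with one global and one local minimum."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent: repeatedly step downhill from the start point x."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


left = descend(-2.0)   # starts in the basin of the global minimum
right = descend(2.0)   # starts in the basin of the shallower local minimum

print(f"start -2.0 -> x = {left:.3f}, f(x) = {f(left):.3f}")
print(f"start +2.0 -> x = {right:.3f}, f(x) = {f(right):.3f}")
```

A learner starting on the right-hand side settles into the shallower basin and never finds the better solution on the left. A learning bias that restricted or reshaped the starting region would play the "smoothing" role described above.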
I also got to thinking about the idea mentioned in the section on perceptrons about how there are many facts about language that don't seem to naturally be Boolean (and so wouldn't lend themselves well to being learned by a perceptron). In a way, anything can be made into a Boolean - this is the basis of binary decomposition in categorization problems. (If you have 10 categories, you first ask if it's category 1 or not, then category 2 or not, etc.) What you do need is a lot of knowledge about the space of possibilities so you know what yes or no questions to ask - and this reminds me of (binary) parameter setting, as it's usually discussed by linguists. The child has a lot of knowledge about the hypothesis space of language, and is making decisions about each parameter (effectively solving a categorization problem for each parameter: is it value a or value b?). So I guess the upshot of my thought stream was that perceptrons could be used to learn language, but at the level of implementing the actual parameter setting.
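The binary decomposition idea can be sketched in a few lines of Python: train one yes/no perceptron per category, then classify by asking the questions in turn. The toy data, the category names, and the helper names (`train_perceptron`, `classify`) are hypothetical and chosen so that each category is linearly separable from the rest; this is an illustration of the general one-vs-rest trick, not anything taken from the paper.

```python
# Toy data (hypothetical): three categories of 2-D points.
DATA = [
    ((0, 4), "A"), ((1, 5), "A"), ((-1, 6), "A"),
    ((-4, -4), "B"), ((-5, -3), "B"), ((-6, -5), "B"),
    ((4, -4), "C"), ((5, -3), "C"), ((6, -5), "C"),
]


def score(w, x):
    """Linear score; w holds two feature weights followed by a bias."""
    return w[0] * x[0] + w[1] * x[1] + w[2]


def train_perceptron(data, target, max_epochs=1000):
    """Learn the Boolean question 'is this point in category `target`?'."""
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in data:
            y = 1 if label == target else -1
            if y * score(w, x) <= 0:      # wrong (or undecided) answer
                w[0] += y * x[0]          # standard perceptron update
                w[1] += y * x[1]
                w[2] += y
                mistakes += 1
        if mistakes == 0:                 # every training point answered correctly
            break
    return w


# One binary classifier per category.
CLASSIFIERS = {c: train_perceptron(DATA, c) for c in ("A", "B", "C")}


def classify(x):
    """Ask 'category A or not?', then 'category B or not?', and so on."""
    for label, w in CLASSIFIERS.items():
        if score(w, x) > 0:
            return label
    return None  # no classifier said yes


for x, label in DATA:
    print(x, "true:", label, "predicted:", classify(x))
```

Note how much is built in before any learning happens: the list of categories and the features are fixed in advance, which is exactly the kind of prior knowledge about the hypothesis space discussed above.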
It was very useful to be reminded that the representation of the problem and the initial values for neural networks are crucial for learning success. This of course implies that the correct structure and values for whatever language learning problem must be known a priori (which is effectively a nativist claim, and if these values are specific to language learning, then a linguistic nativist claim). So, the fight between those who use neural networks to explain language learning behavior and those who hold the classic ideas about what's in UG isn't about whether there are some innate biases, or even if those biases are language-specific - it may just be about whether the biases are about the learning mechanism (values in neural networks, for example) or about the knowledge representation (traditional UG biases, but also potentially about network structure for neural nets).
Alas, the one part where I failed to get the intuition that Stabler offered was in the section on support vector machines. This is probably due to my own inadequate knowledge of SVMs, but given how marvelous the other sections were with their intuitions, I really found myself struggling with this one.
Stabler notes in the section on model selection that model fit cannot be the only criterion for modeling success, since larger models tend to fit the data (and perhaps overfit the data) better than simpler models. MDL seems like one good attempt to deal with this, since it has a simple encoding length metric which it uses to compare models - encoding not just the data, based on the model, but also the model itself. So, while a larger model may have a more compact data encoding, its larger size counts against it. In this way, you get some of that nice balance between model complexity and data fit.
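Here is a rough Python sketch of that trade-off, comparing a one-parameter Bernoulli model with a two-parameter first-order Markov model on binary strings. The strings and function names are invented for illustration, and the model cost of (k/2) log2(n) bits for k parameters is a common approximation that I'm assuming here, not necessarily the encoding scheme Stabler has in mind.

```python
import math


def data_bits(counts):
    """Bits to encode outcomes with these counts under their ML probabilities."""
    total = sum(counts)
    return -sum(c * math.log2(c / total) for c in counts if c > 0)


def mdl_bernoulli(seq):
    """Two-part code length for a 1-parameter model: P(symbol = 1)."""
    n = len(seq)
    ones = seq.count("1")
    model_bits = 0.5 * 1 * math.log2(n)          # assumed cost of 1 parameter
    return model_bits + data_bits([n - ones, ones])


def mdl_markov(seq):
    """Two-part code length for a 2-parameter model: P(1 | previous symbol)."""
    n = len(seq)
    bits = 1.0                                   # first symbol, coded uniformly
    for prev in "01":
        following = [b for a, b in zip(seq, seq[1:]) if a == prev]
        if following:
            bits += data_bits([following.count("0"), following.count("1")])
    model_bits = 0.5 * 2 * math.log2(n)          # assumed cost of 2 parameters
    return model_bits + bits


alternating = "01" * 50    # strong dependence on the previous symbol
paired = "0011" * 25       # previous symbol tells you almost nothing

for name, seq in [("alternating", alternating), ("paired", paired)]:
    print(name, "Bernoulli:", round(mdl_bernoulli(seq), 2),
          "Markov:", round(mdl_markov(seq), 2))
```

On the alternating string the larger model wins easily, because its extra parameter buys a much shorter data encoding. On the paired string the extra parameter buys almost nothing, so its cost tips the balance toward the simpler model.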