I found this paper a very enjoyable read, and I like very much that it's looking at the building blocks of distributional learning. This seems like the next step forward - we want to know not just that statistical/distributional learning works, but also what the underlying cognitive pieces are that make it work. It's a nice demonstration of a particular story of how cognitive pieces could fit together and make distributional learning work for a few different language acquisition tasks, and it definitely aims to be an algorithmic-level ("mechanistic") account of this process. One of the things that was really good is how clear the authors are that this is only one story - that is, it's an existence proof that this account could work. It doesn't preclude other accounts, but it does shore up support for this account by showing that it does, indeed, work.
The model they propose seems very prone to snowballing, where small early errors persist and cause larger errors later on. (This may or may not be a bad thing, if we're concerned with actual human learning.) For example, in the first simulation, they mention how sometimes their bimodal input resulted in a unimodal representation, due to exactly this kind of thing. Also, it did seem like there were a fair number of free parameters involved - of course, the nice thing is that some of those parameters have explanatory power, since we can manipulate them to get different qualitative learning effects. Aiming for qualitative patterns rather than exact behavioral matches seems exactly right, though - there are other factors contributing to the observed output behavior, and the authors are (quite reasonably) only modeling some of them.
Something else notable about this model is that it's geared only towards tasks that involve abstraction. Now, of course, many acquisition tasks are about some kind of abstraction, but some aren't (like word segmentation) - so it's worth remembering that even if this is how (some kinds of) distributional learning are implemented, we still need some explanation for how other non-abstraction tasks are accomplished. I also like how much they grounded the underlying cognitive pieces of their model in existing models of (long-term) memory - this does my empirical heart good.
Some more targeted thoughts:
I like how they pointed out on p.3 that statistical learning doesn't just have to be about transitional probabilities. Sometimes, these really get equated, and it's a little unfair to the enterprise of statistical learning to talk about it as if it's just dealing with conditional relations. (Of course, much of the experimental work looking at children's inference capabilities involves testing conditional relationships, and many computational models assume conditional relationship tracking abilities.)
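To make "conditional relations" concrete: the forward transitional probability between adjacent units is usually computed as TP(x -> y) = count(x followed by y) / count(x). Here is a minimal sketch of that computation. The syllable stream and the function name are made up for illustration and are not taken from the paper.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Forward TP(x -> y) = count(x followed by y) / count(x in non-final position)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}

# Hypothetical syllable stream for "pretty baby pretty doggy".
stream = ["pre", "tty", "ba", "by", "pre", "tty", "do", "ggy"]
tps = transitional_probabilities(stream)

print(tps[("pre", "tty")])  # within-word transition
print(tps[("tty", "ba")])   # across-word transition
```

A statistic like this only tracks what follows what. The kind of distributional learning the paper is concerned with (for example, how exemplars cluster along a phonetic dimension) is a different sort of statistic altogether.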
The discussion of making inferences from exemplars on p.4 seemed a little simplified to me. For example, while I can imagine that it's often the case that exemplars occurring more frequently will be weighted more than exemplars occurring rarely, it's not obvious to me that this is always the case. Instead, it seems like it would depend on the learner's hypothesis space. For instance, in a subset-superset hypothesis space, one counter-example seems like it could be very heavily weighted, even if it occurs rarely. As another example, the authors talk about how exemplar similarity depends "at least in part upon the variability of the exemplars in the input set". I could imagine that this is true, but I think it also depends on the learner's biases about the hypothesis space - in effect, learner-subjective variability rather than objective variability.
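The subset-superset point can be illustrated with a toy Bayesian learner. This is a sketch under my own assumptions (two hypotheses, uniform sampling from each hypothesis's extension, equal priors), not anything the authors propose. All names and numbers are hypothetical.

```python
def posterior(data, hypotheses, prior):
    """Posterior over hypotheses; each hypothesis generates its items uniformly.

    Assumes at least one hypothesis is consistent with all the data.
    """
    scores = {}
    for name, extension in hypotheses.items():
        likelihood = 1.0
        for item in data:
            # An item outside the hypothesis's extension has probability 0.
            likelihood *= (1.0 / len(extension)) if item in extension else 0.0
        scores[name] = prior[name] * likelihood
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

hypotheses = {
    "subset": {"a", "b"},
    "superset": {"a", "b", "c", "d"},
}
prior = {"subset": 0.5, "superset": 0.5}

frequent = ["a", "b"] * 5  # ten exemplars compatible with both hypotheses
before = posterior(frequent, hypotheses, prior)
after = posterior(frequent + ["c"], hypotheses, prior)  # one rare counter-example

print(before)
print(after)
```

Ten frequent exemplars push this learner strongly towards the subset hypothesis, but a single exemplar outside the subset rules it out entirely. So here the weight an exemplar carries depends on the hypothesis space, not on its frequency.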
The authors mention on p.7 that they selected the specific linguistic tasks they did because language is a domain where domain-specific mechanisms have often been argued to be at work. I wonder if domain-specific mechanisms have been proposed for the specific linguistic tasks they chose, though - it seems like the type of mechanism proposed depends very much on the task. So, if the authors want to argue that their results show domain-specific mechanisms aren't needed, they do need to address the specific problems where domain-specific mechanisms have been proposed. It wasn't clear to me that this was done, which makes that argument a little weaker to me.
I thought the ability to explain why variable contexts facilitate the learning of phonetic distinctions (basically, due to having a holistic representation of input exemplars) was really excellent. In effect, the "irrelevant" part of the representation helps keep the "relevant" part distinct. This really argues for not just context-sensitive storage of data from the input, but holistic storage. And this also ties into the idea that minimal pairs are probably helpful to linguists, but not to children.
The basic components of the distributional statistical learning process seem quite reasonable: similarity-based activation of prior memories, strength-based learning of features, abstraction of irrelevant features, and memory decay. The second and third components do implicitly assume that the learner has a reasonable set of features to begin with, though. This is a non-trivial assumption, especially when you start thinking about the hypothesis space of possible features. For example, this shows up in simulation 1, where only certain phonetic features are picked out as even in the hypothesis space to begin with.
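To see how these four components could fit together, and where the feature assumption sneaks in, here is a deliberately tiny sketch. It is not the authors' model: the class and function names, the similarity measure, the update rules, and the parameter values (decay, threshold) are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    features: dict          # feature name -> value (the feature set is given in advance)
    strength: float = 1.0

def similarity(probe, trace):
    """Proportion of the trace's features whose values the probe matches."""
    if not trace.features:
        return 0.0
    shared = sum(1 for k, v in trace.features.items() if probe.get(k) == v)
    return shared / len(trace.features)

class ToyMemory:
    def __init__(self, decay=0.9, threshold=0.5):
        self.traces = []
        self.decay = decay          # hypothetical decay rate per presentation
        self.threshold = threshold  # hypothetical minimum similarity for retrieval

    def present(self, exemplar):
        # Memory decay: every stored trace weakens a little.
        for trace in self.traces:
            trace.strength *= self.decay
        # Similarity-based activation: pick the most active prior trace.
        best, best_activation = None, 0.0
        for trace in self.traces:
            activation = trace.strength * similarity(exemplar, trace)
            if activation > best_activation:
                best, best_activation = trace, activation
        # The exemplar itself is always stored.
        self.traces.append(Trace(dict(exemplar)))
        if best is None or similarity(exemplar, best) < self.threshold:
            return
        # Strength-based learning: the retrieved trace is reinforced.
        best.strength += 1.0
        # Abstraction: keep only the features the two share.
        shared = {k: v for k, v in best.features.items() if exemplar.get(k) == v}
        if shared and shared != best.features:
            existing = next((t for t in self.traces if t.features == shared), None)
            if existing is not None:
                existing.strength += 1.0
            else:
                self.traces.append(Trace(shared))

mem = ToyMemory()
mem.present({"voicing": "voiced", "place": "labial", "speaker": "A"})
mem.present({"voicing": "voiced", "place": "labial", "speaker": "B"})
for trace in mem.traces:
    print(trace)
```

Even in this toy version, the abstraction step can only drop or keep features that were already in the exemplar descriptions. If "voicing" or "place" had not been supplied as features, no amount of comparison across exemplars would discover them.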
The effectiveness of the learner really comes from being able to compare across exemplars, which means particular modeling assumptions - such as assuming the learner is memoryless or that the learner is limited to one exemplar at a time - are not as harmless as they might seem.
I thought it was slightly unfair on p.38 to differentiate the current model from prior models by saying prior models "have been focused on acquiring relatively domain-specific kinds of knowledge...meaning they are not easily applied to other domains". It seemed to me that the current model can only be applied to different domains because the domain-specific knowledge has been built in as part of the feature descriptions. So maybe the point was simply that prior models didn't try to separate out the more-general components from the task-specific components.
I really appreciated the discussion on p.40 about what different parameter values for different learning tasks might actually imply for children. I don't think modelers are always so careful in evaluating what the model parameters & parameter values mean.
A nice aspect of this model discussed in Appendix A is how it can basically recover from spurious examples in the input. Because actual exemplars are kept around (in addition to increasingly abstract interpretations), one spurious example and the interpretation created from it can be outweighed by lots of non-spurious (i.e., good) exemplars.
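The intuition can be shown with a very small sketch. It assumes, purely for illustration, that each presentation adds one unit of strength to the matching interpretation and that all strengths decay by a fixed factor per step. The function name, the labels, and the decay value are hypothetical and are not taken from Appendix A.

```python
def accumulate_strengths(stream, decay=0.95):
    """Track a decaying strength for each interpretation label in the stream."""
    strength = {}
    for label in stream:
        # All existing interpretations decay on every step.
        for key in strength:
            strength[key] *= decay
        # The interpretation supported by the current exemplar is reinforced.
        strength[label] = strength.get(label, 0.0) + 1.0
    return strength

# One spurious exemplar followed by twenty good ones.
strengths = accumulate_strengths(["spurious"] + ["good"] * 20)
print(strengths)
```

Under these assumptions the spurious interpretation is never reinforced again and simply fades, while the good interpretation keeps accumulating support.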