So what I liked most about this article was the way they chose to explore the space of possibilities at the computational level; it's a great example of what I'd like to see more of. As someone also interested in the cross-linguistic viability of our models, I have to commend them for testing not just one foreign language, but three.

So there were a number of aspects of the model which I think could have been specified more clearly. For instance, I don't believe they ever explicitly say that the model presumes knowledge of the number of states to be learned. Actual infants don't have that luxury, so it would be nice to know what happens if you infer the number of states from the data; it turns out there's a well-specified model for doing exactly that, which I'll get to later. Another problem with their description of the model has to do with how the hyperparameters are sampled. They apparently simplify the process by resampling them only once per iteration of the Gibbs sampler. I'm fine with that, although I'm going to assume it's a typo when they say they ran the model for 2,000 iterations (Goldwater seems to prefer 20,000). Gibbs samplers tend to converge more slowly on time-dependent models, so it would be nice to see some evidence that the sampler actually converged. Splitting the data by sentence type seems to widen their confidence intervals by quite a lot, which may be an artifact of having less data per parameter, but could also reflect a lack of convergence.
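To make the sampling scheme concrete, here's a minimal sketch of the kind of Gibbs sampler I have in mind for a Bayesian HMM, with hyperparameter resampling done once per sweep and a crude log-joint trace as convergence evidence. Everything here is invented for illustration (toy corpus, three states, ten word types, a Metropolis step with an Exponential(1) prior on the concentration) and is not the paper's actual model or corpus:

```python
from math import lgamma
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (all sizes invented for illustration): K tags, V word types.
K, V, T = 3, 10, 200
words = rng.integers(0, V, size=T)    # stand-in corpus
z = rng.integers(0, K, size=T)        # current tag assignments
alpha, beta = 1.0, 1.0                # symmetric Dirichlet concentrations

def sample_params(z):
    """Draw transition/emission matrices from their Dirichlet posteriors."""
    trans = np.full((K, K), alpha)
    emit = np.full((K, V), beta)
    np.add.at(trans, (z[:-1], z[1:]), 1)
    np.add.at(emit, (z, words), 1)
    A = np.array([rng.dirichlet(row) for row in trans])
    B = np.array([rng.dirichlet(row) for row in emit])
    return A, B

def sym_dirichlet_logpdf(rows, a):
    """Log density of each row of a stochastic matrix under Dirichlet(a,...,a)."""
    k = rows.shape[1]
    return sum(lgamma(k * a) - k * lgamma(a) + (a - 1) * np.log(r).sum()
               for r in rows)

log_joint_trace = []
for sweep in range(100):              # a real run needs vastly more sweeps
    A, B = sample_params(z)
    # Pointwise Gibbs: p(z_t | rest) ∝ A[z_{t-1},k] * A[k,z_{t+1}] * B[k,w_t]
    for t in range(T):
        p = B[:, words[t]].copy()
        if t > 0:
            p *= A[z[t - 1]]
        if t < T - 1:
            p *= A[:, z[t + 1]]
        z[t] = rng.choice(K, p=p / p.sum())
    # Resample alpha ONCE per sweep: a multiplicative Metropolis step
    # with an Exponential(1) prior on alpha.
    prop = alpha * np.exp(0.2 * rng.standard_normal())
    log_accept = (sym_dirichlet_logpdf(A, prop) - sym_dirichlet_logpdf(A, alpha)
                  - prop + alpha                   # prior ratio
                  + np.log(prop) - np.log(alpha))  # proposal Jacobian
    if np.log(rng.uniform()) < log_accept:
        alpha = prop
    log_joint_trace.append(np.log(B[z, words]).sum()
                           + np.log(A[z[:-1], z[1:]]).sum())

# Crude convergence evidence: the log-joint trace should flatten out.
half = len(log_joint_trace) // 2
print(np.mean(log_joint_trace[:half]), np.mean(log_joint_trace[half:]))
```

A flat-looking trace is necessary but not sufficient, which is exactly why I'd want to see multiple chains or a proper diagnostic from the authors rather than a fixed iteration count.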

Typically I have to chastise modelers who use VI or V-measure, but fortunately they aren't doing anything technically wrong here: they're correct that comparing these scores across corpora is hazardous at best. Both measures are biased, though; VI prefers small numbers of tags and V-measure prefers large numbers (at one point they claim V-measure is "invariant" to different numbers of tags, which is simply not true). It turns out that a third measure, V-beta, is more useful than either of these two in that it is unbiased with respect to the number of categories. So there's my rant about the wonders of V-beta.
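For concreteness, here's how I'd compute these from a contingency table: V-measure is the harmonic mean of homogeneity and completeness, and V-beta reweights that mean. The weighting I sketch below (beta set to the ratio of induced tags to gold tags, so that over-fragmenting no longer inflates the score) is my understanding of the usual choice, not something taken from this paper; the toy tag sequences are invented:

```python
from collections import Counter
from math import log

def entropies(gold, pred):
    """Marginal and conditional entropies from the gold/pred contingency table."""
    n = len(gold)
    gc, pc = Counter(gold), Counter(pred)
    joint = Counter(zip(gold, pred))
    H = lambda c: -sum(v / n * log(v / n) for v in c.values())
    H_g_given_p = -sum(v / n * log(v / pc[p]) for (g, p), v in joint.items())
    H_p_given_g = -sum(v / n * log(v / gc[g]) for (g, p), v in joint.items())
    return H(gc), H(pc), H_g_given_p, H_p_given_g

def v_beta(gold, pred, beta=None):
    Hg, Hp, Hg_p, Hp_g = entropies(gold, pred)
    h = 1.0 if Hg == 0 else 1 - Hg_p / Hg      # homogeneity
    c = 1.0 if Hp == 0 else 1 - Hp_g / Hp      # completeness
    if beta is None:
        # Assumed weighting: beta = (#induced tags) / (#gold tags), so a
        # system that splits everything into tiny pure tags is penalized.
        beta = len(set(pred)) / len(set(gold))
    if h + c == 0:
        return 0.0
    return (1 + beta) * h * c / (beta * h + c)  # beta=1 recovers V-measure

gold = ["N", "N", "V", "V", "D", "D"]
perfect = [0, 0, 1, 1, 2, 2]
fragmented = [0, 1, 2, 3, 4, 5]   # every token gets its own tag
print(v_beta(gold, perfect))      # 1.0: beta doesn't matter when h = c = 1
print(v_beta(gold, fragmented))   # lower than the beta=1 (V-measure) score
```

The fragmented clustering has perfect homogeneity, which is exactly the degenerate case that makes plain V-measure favor large tag sets.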

What I really would have liked to see is an infinite HMM (iHMM) run on this data: a well-specified, very similar model which can infer the number of grammatical categories from the data itself. It has had an efficient sampler since 2008, so there's no reason they couldn't run it over their corpus. It's very useful for us to know what the space of possibilities is, but to what extent would their results change if they gave up the assumption that the number of categories is known from the get-go? I'd be excited to see how well it performed.
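The full iHMM, with its hierarchical Dirichlet process prior and beam sampler, is more than I can sketch here, but the core move that lets the data decide the number of states is just a Chinese-restaurant-process draw over transition counts. A toy sketch of that one ingredient, with invented counts and concentration, and deliberately ignoring the shared base measure that ties the rows together in the real model:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 1.0                  # concentration: willingness to create new states

# CRP-style transition draw: from state j, move to an existing state k with
# probability ∝ count[j][k], or to a brand-new state with probability ∝ gamma.
counts = {0: {0: 3, 1: 1}}   # toy transition counts observed so far
n_states = 2

def draw_next(j):
    global n_states
    row = counts.setdefault(j, {})
    options = list(row) + [n_states]                  # existing states + "new"
    weights = np.array([row[k] for k in row] + [gamma], float)
    k = options[rng.choice(len(options), p=weights / weights.sum())]
    if k == n_states:                                 # a new state was born
        n_states += 1
    row[k] = row.get(k, 0) + 1
    return k

state = 0
for _ in range(50):
    state = draw_next(state)
print(n_states)   # how many states the process ended up inventing
```

The rich-get-richer dynamics mean frequently used categories absorb most transitions while gamma keeps the door open for new ones, which is precisely the behavior you'd want when the true number of grammatical categories is unknown.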

The one problem with both the models shown here and the iHMM is that neither allows information about transition or emission probabilities (depending on the model) to be shared across sentence types; the types are treated as entirely unrelated. They mention this in their conclusion, but I wonder whether there's any way to share that information usefully without hand-coding it somehow.
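One standard way to get sharing without hand-coding anything is a hierarchical prior: give each sentence type's transition rows a Dirichlet prior centered on a distribution pooled across all types, so sparse types fall back on the shared estimate. A minimal sketch with invented counts, tag inventory, and prior strength (this is my suggestion, not anything the authors implement):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4        # number of tags (toy)
lam = 5.0    # strength of the pull toward the shared distribution (invented)

# Toy per-sentence-type transition counts, e.g. declaratives vs. questions.
counts = {
    "declarative": rng.integers(0, 10, size=(K, K)).astype(float),
    "question":    rng.integers(0, 10, size=(K, K)).astype(float),
}

# Shared distribution: add-one-smoothed pooled counts across all types.
pooled = sum(counts.values()) + 1.0
shared = pooled / pooled.sum(axis=1, keepdims=True)

def trans_probs(sent_type):
    """Type-specific transitions smoothed toward the shared distribution.

    This is the posterior mean under a prior Dirichlet(lam * shared[k]) on
    each row: types with lots of data keep their own statistics, while rare
    types borrow strength instead of being treated as entirely unrelated."""
    post = counts[sent_type] + lam * shared
    return post / post.sum(axis=1, keepdims=True)

P = trans_probs("question")
```

Cranking `lam` up collapses everything to a single shared grammar and setting it to zero recovers the paper's fully split model, so the amount of sharing becomes something you can infer rather than stipulate.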

Overall, I'm really happy someone is doing this. I liked the use of some very salient information to help tackle a hard problem, but I would have liked to see the model made a little more realistic by inferring the number of grammatical categories. I also would have liked better evidence of convergence (perhaps a beam sampler instead of Gibbs; at the very least I hope they ran it for more than 2,000 iterations).

