Wednesday, April 10, 2013

Some thoughts on Mohamed et al. (2011)


This brief article focuses on the principles of how deep belief networks (DBNs) achieve good speech recognition performance, while glossing over many of the details. As such, it seems to me that the article can be approached with two levels of rigor. For the novice taking a more leisurely approach, it provides some very clear and concise descriptions of what sets a DBN apart from competing models. For the experimentalist who wants to replicate the actual models used in the paper: good luck. There are, however, more extensive treatments of the technical details elsewhere in the literature, and even the novice will probably want to consult some of them to appreciate the nuances that receive short shrift here.

Three main things make DBNs an attractive modeling choice:
1) They are neural networks. Neural networks are an efficient way to estimate the posterior probabilities of hidden Markov model (HMM) states, compared to Gaussian mixture models.
2) They are deep. More hidden layers allow the model to capture more complicated relationships between the input and the HMM states, so more structure can be extracted from the data.
3) They are generatively pre-trained. This is a neat pre-optimization procedure that places the model at a good starting point from which back-propagation can find a good local optimum (a rough sketch of the pipeline follows this list). Without this pre-training, models with many hidden layers are unlikely to converge on a good solution.
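
To make point 3 concrete, here is a minimal sketch (in Python/NumPy, not the authors' code) of that pipeline: stack restricted Boltzmann machines, train each one greedily with one-step contrastive divergence, and use the learned weights to initialize a feed-forward network that is then fine-tuned with back-propagation. The layer sizes, learning rate, and epoch count are invented for illustration.

```python
# Minimal sketch of DBN pre-training, not the paper's exact recipe.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.01, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def cd1_update(self, v0):
        """One step of contrastive divergence on a batch of visible vectors v0."""
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_sample @ self.W.T + self.b_v)   # reconstruction ("fantasy")
        h1 = self.hidden_probs(v1)
        # Positive-phase minus negative-phase statistics.
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)

def pretrain_dbn(data, layer_sizes, epochs=10):
    """Greedy layer-wise pre-training: each RBM models the hidden
    activities of the layer below it."""
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(layer_input.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(layer_input)
        rbms.append(rbm)
        layer_input = rbm.hidden_probs(layer_input)
    return rbms  # these weights then seed a feed-forward net for back-propagation
```

The key idea is that no labels are used at this stage; each layer simply learns to model the statistics of its input, and back-propagation only enters afterwards to fine-tune the whole stack discriminatively.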

The idea of using a "generative" procedure to pre-optimize a system seems like it may have immediate applicability for psychologists and linguists who also study "generative" phenomena. After all, the training algorithm is even called the "wake-sleep" algorithm, in which the model generates "fantasies" during its pre-training. While the parallels are certainly interesting, without appreciating the details of the algorithm it's difficult to know how deep these similarities actually are. In his IPAM lecture, Hinton notes that while some neuroscientists, such as Friston, do believe the model is directly applicable to the brain, he himself remains skeptical.
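
For concreteness, here is my own rough gloss of what "generating fantasies" means computationally (reusing the RBM class from the sketch above): a trained RBM can dream up input vectors by running alternating Gibbs sampling between its hidden and visible units. The full wake-sleep algorithm in a DBN is more involved, with separate recognition and generative weights, so treat this as a cartoon rather than the paper's procedure.

```python
# Cartoon of a "fantasy": alternating Gibbs sampling in a trained RBM.
def fantasize(rbm, n_steps=100, rng=None):
    rng = rng or np.random.default_rng(1)
    v = (rng.random(rbm.b_v.shape) < 0.5).astype(float)  # random visible start
    for _ in range(n_steps):
        # Sample hidden units given the visible vector, then reconstruct.
        h = (rng.random(rbm.b_h.shape) < rbm.hidden_probs(v)).astype(float)
        v = sigmoid(h @ rbm.W.T + rbm.b_v)   # the model's "fantasy" input
    return v
```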

Ignoring psychological applications for the moment, I'm still left wondering how "good" DBNs actually are. The best-performing model in this paper still only achieves a Phone Error Rate (PER) of about 20%, and the variability attributable to feature types, number of hidden layers, or pre-training appears small, affecting performance by only a few percentage points. Again, the evaluation procedure is not entirely clear to me, so it's difficult to know how these values translate into real-world performance. I would guess that current voice-recognition technology does much better than 80% accuracy, and under far more adverse conditions than those tested here. It was also interesting to note that DBNs appear to have trouble ignoring irrelevant input.
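
As I understand it, PER is computed as the edit distance (substitutions, insertions, and deletions) between the decoded and reference phone sequences, divided by the length of the reference. Here is a small sketch of that calculation; it is my own illustration, not the authors' scoring script.

```python
# Phone Error Rate as normalized edit distance between phone sequences.
def phone_error_rate(reference, hypothesis):
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[n][m] / n

# One substitution in a five-phone reference gives a PER of 0.2 (20%).
print(phone_error_rate(["sh", "iy", "hh", "ae", "d"],
                       ["sh", "iy", "hh", "eh", "d"]))
```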

The dimensionality-reduction visualization (t-SNE) was pretty cool: data points that are near each other in high-dimensional space are plotted close together in two-dimensional space. It would be nice to have some way to quantify the structure this visualization reveals. The distinctions between Figs 3-4 and 7-8 are visually obvious, but I think we just have to take the authors at their word when they describe differences in Figs 5-6. Perhaps another way to visualize the hidden structure in the model, particularly when comparing individual hidden layers as in Figs 7-8, would be to provide dendrograms that cluster inputs based on the hidden vectors they generate (a sketch of this idea follows below).
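
To spell out that dendrogram suggestion, here is a minimal sketch using SciPy's hierarchical clustering. The hidden_vectors array and the phone labels are placeholders; in practice they would be the activation vectors a trained hidden layer produces for labelled input frames.

```python
# Hierarchically cluster hidden-layer activations and plot the tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
labels = ["aa", "ae", "iy", "ih", "s", "sh", "p", "b"]   # placeholder phone labels
hidden_vectors = rng.random((len(labels), 50))           # placeholder activations

Z = linkage(hidden_vectors, method="ward")   # agglomerative clustering
dendrogram(Z, labels=labels)
plt.ylabel("cluster distance")
plt.show()
```

Unlike t-SNE, the tree structure would make it easy to read off (and compare across layers) which phones the hidden representation groups together.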

Overall, DBNs seem capable of doing quite a bit of work for speech recognition systems, and their psychological implications look like promising avenues for research. It would be really nice to see some more elaborate demonstrations of DBNs in action.
