This brief article focuses on the principles of how deep
belief networks (DBNs) achieve good speech recognition performance, while
glossing over many of the details. Therefore, it seems to me that this article
can be approached with two levels of rigor. For the novice taking a more
leisurely approach, the article provides some very clear and concise
descriptions of what sets a DBN model apart from competing approaches. For
the experimentalist who wants to replicate the actual models used
in the paper, good luck. Fortunately, there are more extensive treatments of
the technical details elsewhere in the literature, and even the novice will
probably wish to consult some of these sources to appreciate the nuances of the
method that receive short shrift here.
Three main things make DBNs an attractive modeling choice:
1) They are neural networks. Neural networks are an efficient way to estimate
the probabilities of hidden Markov model (HMM) states, compared to the mixtures
of Gaussians traditionally used for that job (a rough sketch of this hybrid
setup follows the list).
2) They are deep. More hidden layers allow more complicated correlations
between the input and the model states to be captured, so more structure can be
extracted from the data.
3) They are generatively pre-trained. This is a neat pre-optimization algorithm
that places the model at a good starting point from which back-propagation can
find a good local optimum. Without this pre-training, models with many hidden
layers are unlikely to converge on a good solution.
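To make the first point concrete, here is a minimal sketch of the hybrid setup in Python/NumPy. The layer sizes and variable names are illustrative choices of mine, not the paper's exact architecture: a deep net with sigmoid hidden layers and a softmax output estimates the probability of each HMM state given an acoustic frame, and dividing those posteriors by the state priors yields scaled likelihoods that an HMM decoder can use in place of Gaussian-mixture likelihoods.

    # A minimal sketch of the hybrid NN/HMM idea (illustrative sizes only):
    # the net maps acoustic frames to HMM-state posteriors, and dividing by
    # the state priors gives scaled likelihoods for Viterbi decoding.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    # Illustrative sizes: 39-dim acoustic frames, two hidden layers,
    # 183 HMM states (3 states for each of 61 TIMIT phones).
    sizes = [39, 512, 512, 183]
    weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]

    def forward(frames):
        """Propagate frames through the net; return HMM-state posteriors."""
        h = frames
        for W, b in zip(weights[:-1], biases[:-1]):
            h = sigmoid(h @ W + b)          # each extra layer captures more structure
        return softmax(h @ weights[-1] + biases[-1])

    frames = rng.normal(size=(10, 39))      # a fake 10-frame utterance
    posteriors = forward(frames)            # p(state | frame)

    # An HMM decoder wants p(frame | state); dividing the posteriors by the
    # state priors gives likelihoods up to a constant that Viterbi ignores.
    state_priors = np.full(183, 1.0 / 183)  # in practice, estimated from data
    scaled_likelihoods = posteriors / state_priors

In a real system the weights would come from generative pre-training followed by back-propagation rather than random initialization, and decoding would run Viterbi over the scaled likelihoods.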
The idea of using a "generative" procedure to
pre-optimize a system seems like it may have immediate applicability for psychologists
and linguists who also study "generative" phenomena. After all, the
training algorithm is even called the "wake-sleep" algorithm, where
the model generates "fantasies" during its pre-training. While the
parallels are certainly interesting, without appreciating the details of the
algorithm, it's difficult to know how deep these similarities actually are. In
his IPAM lecture, Hinton notes that while some neuroscientists such as Friston
do believe the model is directly applicable to the brain, he remains skeptical.
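For what it's worth, here is a rough sketch of what "generating fantasies" means mechanically. In the standard DBN recipe each layer is pre-trained as a restricted Boltzmann machine with contrastive divergence: the model infers a hidden code from the data, reconstructs ("dreams up") a visible vector from that code, and nudges the weights toward the data statistics and away from the fantasy statistics. The dimensions and learning rate below are made up, biases are omitted, and real-valued acoustic inputs would properly call for a Gaussian-Bernoulli RBM, so treat this as a cartoon rather than the paper's exact procedure.

    # Cartoon of contrastive-divergence (CD-1) pre-training for one RBM layer.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    n_visible, n_hidden, lr = 39, 128, 0.01          # illustrative sizes
    W = rng.normal(0, 0.01, (n_visible, n_hidden))

    def cd1_update(v_data):
        # Positive phase: infer hidden units from the data.
        h_prob = sigmoid(v_data @ W)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: reconstruct a "fantasy" visible vector from the hidden code.
        v_fantasy = sigmoid(h_sample @ W.T)
        h_fantasy = sigmoid(v_fantasy @ W)
        # Move toward data correlations, away from fantasy correlations.
        return lr * (v_data.T @ h_prob - v_fantasy.T @ h_fantasy) / len(v_data)

    batch = rng.normal(size=(32, n_visible))         # a fake batch of frames
    W += cd1_update(batch)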
Ignoring psychological applications for the moment, I'm still left wondering
how well DBNs actually perform. The best-performing model in this paper still
only achieves a Phoneme Error Rate (PER) of about 20%,
and the variability attributable to feature types, number of hidden layers, or
pre-training appears small, affecting performance by only a few percentage
points. Again, the evaluation procedure is not entirely clear to me, so it's
difficult to know how these values translate into real-world performance. I
suspect that current voice-recognition technology does much better than 80%
accuracy, and under far more adverse conditions than those tested here. It was
also interesting to note that DBNs appear to have trouble ignoring irrelevant
input.
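As a point of reference for what the PER measures: it is just the edit distance (substitutions, deletions, and insertions) between the recognized phone sequence and the reference transcription, divided by the length of the reference. A minimal implementation looks like the sketch below; the paper's exact scoring conventions (e.g., which phone labels get collapsed before scoring) are not reproduced here.

    # Minimal PER: Levenshtein distance over reference length.
    def phoneme_error_rate(reference, hypothesis):
        n, m = len(reference), len(hypothesis)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[n][m] / n

    # One substitution plus one deletion over a 5-phone reference -> 0.4 (40%).
    print(phoneme_error_rate("sh iy hh ae d".split(), "s iy ae d".split()))

So a 20% PER means roughly one phone in five comes out wrong, which is where the 80% intuition above comes from.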
The dimensionality-reduction visualization (t-SNE) was pretty cool, plotting
data points that are near each other in high-dimensional space close together
in two-dimensional space. It would be nice to have some way to quantify the
structures revealed by this visualization
technique. The distinctions between Figs 3-4 and 7-8 are visually obvious, but
I think we just have to take the authors at their word when they describe
differences in Figs 5-6. Perhaps another way to visualize the hidden structure
in the model, particularly for comparing individual hidden layers as in
Figs 7-8, would be to provide dendrograms that cluster inputs based on the
hidden vectors they generate.
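A rough sketch of that dendrogram idea, assuming one had the hidden-layer activation vectors for a set of labeled inputs (the arrays below are random placeholders for whatever a trained model would actually emit):

    # Hierarchically cluster hidden-layer activations and plot a dendrogram.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    hidden_vectors = rng.normal(size=(30, 512))   # placeholder: one vector per segment
    labels = [f"seg{i}" for i in range(30)]       # placeholder: e.g., phone labels

    Z = linkage(hidden_vectors, method="ward")    # agglomerative clustering
    dendrogram(Z, labels=labels)
    plt.title("Hierarchical clustering of hidden-layer activations")
    plt.tight_layout()
    plt.show()

Running the same clustering on each layer's activations would give a layer-by-layer picture of how the phonetic categories separate, which could be compared more directly than eyeballing t-SNE plots.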
Overall, DBNs seem like they can do quite a bit of work for
speech recognition systems, and the psychological implications of these models seem
to be promising avenues for research. It would be really nice to see some more
elaborate demonstrations of DBNs in action.