DNN Research Improves Bing Voice Search
Posted by Rob Knies, Inside Microsoft Research
We live in a society obsessed with speed. Whether it’s download times on a mobile phone or Usain Bolt’s time in the 100 meters, the faster the better. We also live during an era when accuracy has become not just preferable but essential. The technological marvels of the 21st century demand it.
Speed=good. Accuracy=good. Put them together, and you’ve got a leap forward, such as recent advancements in Bing Voice Search for Windows Phone that enable customers to get faster, more accurate results than ever before.
Those improvements come, in part, from contributions delivered via Microsoft Research’s work on deep neural networks (DNNs). Such networks are a computational framework for automatic pattern recognition that is inspired by the basic circuits of the human brain. Refinements in mathematical formulas, coupled with greater computational power and large data sets, enable DNNs to learn and edge noticeably closer than traditional speech technologies to humans’ ability to recognize speech and images.
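To make that concrete, here is a minimal Python/NumPy sketch of the kind of feedforward network used for acoustic modeling. The layer sizes, feature dimensions, and state count are illustrative stand-ins, not Bing’s actual configuration:

```python
# A toy feedforward DNN of the kind used for acoustic modeling.
# Sizes are illustrative; production models are far larger.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)

# 440 inputs, e.g., 11 stacked frames of 40-dimensional acoustic features;
# the output layer scores each speech state.
sizes = [440, 512, 512, 512, 2000]
weights = [rng.normal(0, 0.05, (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(features):
    """Map one window of acoustic features to a posterior over states."""
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)  # nonlinear hidden layers
    return softmax(h @ weights[-1] + biases[-1])

posteriors = forward(rng.normal(size=440))
print(posteriors.shape, posteriors.sum())  # (2000,) ~1.0
```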
Over the past few years, Frank Seide, senior researcher at Microsoft Research Asia, and Dong Yu, senior researcher in the Conversational Systems Research Center at Microsoft Research Redmond, have been at the forefront of this advance, working with scientists and engineers from the Bing Speech team to provide vast improvements in the speed and the accuracy of Bing Voice Search.
Their success has been dramatic. With the judicious use of DNNs, that service has seen its speed double in recent weeks, and its word-error rate has improved by 15 percent. In addition, Bing Voice Search now performs significantly better amid noisy conditions.
None of this will surprise anybody lucky enough to have attended last year’s Computing in the 21st Century Conference, held in Tianjin, China. As will be familiar to many, during that event, co-hosted by Microsoft Research Asia, Rick Rashid, Microsoft chief research officer and head of Microsoft Research, ended his keynote address by speaking to the attendees in English. As he did, in real time, Bing Translator, supplemented by the use of DNNs and Microsoft Research Asia’s work in mapping a person’s voice to another language, delivered a jaw-dropping revelation: Rashid speaking expertly translated Mandarin, grammatically correct and accurately intoned, in his own voice. The reception was rapturous.
In fact, that demonstration combined three different technologies: machine translation, text-to-speech conversion, and automatic speech recognition. The last of these is where Yu and Seide’s work on DNNs began to pay off.
The DNN research enabled a new acoustic model and decoder for Bing Voice Search for Windows Phone. The decoding runtime worked like a charm in Tianjin.
Applying DNNs to speech recognition, building on recent advances by Geoffrey Hinton at the University of Toronto, is hardly a simple task. DNN models can contain hundreds of millions of parameters representing patterns of the human voice, and they are trained through a process developed by Microsoft Research scientists. Bing’s back-end infrastructure completes the pipeline that results in an instantaneous user experience.
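The post doesn’t detail Microsoft’s specific recipe, but the standard approach of the era was minibatch gradient descent on a frame-level cross-entropy loss, with each acoustic frame labeled by its aligned speech state. A toy, generic sketch:

```python
# Generic frame-level cross-entropy training, the textbook approach of
# the era; this is not Microsoft's specific recipe.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 40, 128, 10  # toy sizes; real nets are far larger
W1 = rng.normal(0, 0.1, (n_in, n_hid))
W2 = rng.normal(0, 0.1, (n_hid, n_out))

def train_step(X, labels, lr=0.1):
    """One gradient step on a minibatch of labeled feature frames."""
    global W1, W2
    H = np.tanh(X @ W1)                       # hidden activations
    S = H @ W2                                # output scores
    E = np.exp(S - S.max(axis=1, keepdims=True))
    P = E / E.sum(axis=1, keepdims=True)      # softmax posteriors
    # Cross-entropy gradient at the output: posterior minus one-hot target.
    G = P.copy()
    G[np.arange(len(labels)), labels] -= 1.0
    G /= len(labels)
    dW2 = H.T @ G
    dW1 = X.T @ ((G @ W2.T) * (1.0 - H**2))   # backprop through tanh
    W2 -= lr * dW2
    W1 -= lr * dW1
    return -np.log(P[np.arange(len(labels)), labels]).mean()

X = rng.normal(size=(32, n_in))               # 32 random "frames"
y = rng.integers(0, n_out, size=32)           # their "state" labels
for _ in range(5):
    print(train_step(X, y))                   # loss should decrease
```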
During Interspeech 2011, the 12th annual Conference of the International Speech Communication Association, Seide, Yu, and Gang Li, a Microsoft Research Asia research software-development engineer, presented a paper called “Conversational Speech Transcription Using Context-Dependent Deep Neural Networks,” detailing the DNN explorations that produced the most dramatic improvement in speech-recognition accuracy in more than two decades.
Yu, who contributed the “context-dependent” work that enabled DNNs to be applied to large-vocabulary speech recognition, recalls being giddy with excitement when he realized what he and his colleagues had achieved.
“I first realized the effect of the DNN when we successfully achieved significant error-rate reduction on the voice-search data set after implementing the context-dependent deep-neural-network hidden Markov model,” Yu smiles. “It was an exciting moment. I was so excited that I did not sleep that night. I realized that we had made a breakthrough and called Qiang Huo [a Microsoft Research Asia research manager who also has worked on speech recognition] late at night—daytime in China—to describe the ideas and results.”
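The phrase “context-dependent deep-neural-network hidden Markov model” compresses the key trick of the hybrid approach: the network estimates the posterior probability of each context-dependent speech state given the acoustics, and dividing by the state’s prior turns that posterior into a scaled likelihood the HMM decoder can consume. A minimal illustration, with made-up numbers:

```python
# The hybrid DNN/HMM trick in one line: the network outputs posteriors
# p(state | acoustics), but the HMM decoder wants likelihoods
# p(acoustics | state). By Bayes' rule, dividing the posterior by the
# state prior yields a "scaled likelihood" proportional to what the
# decoder needs.
import numpy as np

def scaled_log_likelihood(log_posterior, log_prior):
    """log p(x|s) + constant = log p(s|x) - log p(s)."""
    return log_posterior - log_prior

# Made-up numbers: posteriors from the DNN, priors counted from the
# training alignments.
log_post = np.log(np.array([0.70, 0.20, 0.10]))
log_prior = np.log(np.array([0.50, 0.30, 0.20]))
print(scaled_log_likelihood(log_post, log_prior))
```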
Interestingly, the researchers also have discovered that DNNs can learn across languages. This is of critical importance, because speech recognizers must be trained on huge amounts of example speech data—thousands of hours of it—and the burden of transcribing such voluminous files can be reduced significantly when data from one language can help improve accuracy for another.
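One common way to realize such cross-lingual sharing, offered here as a sketch rather than a description of the researchers’ exact architecture, is to share the hidden layers across languages while giving each language its own output layer:

```python
# A sketch of cross-lingual sharing: hidden layers are trained on data
# from all languages, while each language keeps its own output layer,
# so patterns learned from one language's speech benefit the others.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
shared = [rng.normal(0, 0.05, (440, 512)), rng.normal(0, 0.05, (512, 512))]
heads = {
    "en-US": rng.normal(0, 0.05, (512, 3000)),  # English speech states
    "zh-CN": rng.normal(0, 0.05, (512, 4000)),  # Mandarin speech states
}

def forward(features, language):
    h = features
    for W in shared:              # shared across languages
        h = sigmoid(h @ W)
    scores = h @ heads[language]  # language-specific output layer
    e = np.exp(scores - scores.max())
    return e / e.sum()

print(forward(rng.normal(size=440), "zh-CN").shape)  # (4000,)
```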
That’s just one example of how this research continues to evolve. As the scientists continue to explore the expanding frontiers of the DNN work and to collaborate accordingly, the DNN-fueled speech-recognition improvements can only continue. Broad-scale speech-to-speech translation, once simply a dream, suddenly seems an alluring possibility.
“Our result significantly advanced the state of the art, both in industry and in the academic community,” Yu says. “Now, most industrial automatic-speech-recognition systems are DNN-based. This also helped to popularize deep learning.
“Before our result, deep learning was only tested on small tasks and did not attract wide attention. I believe this is just the first step in advancing the state of the art. Many difficult problems may be attacked under this framework, which might lead to even greater advances.”
The new, improved Bing Voice Search represents yet another enhancement in the user experience customers can enjoy with a combination of Windows Phone 8 and Bing technologies, following November’s update that brought Bing Translator to the Windows Phone 8 platform. Given the popularity of the latter, the enhancements to Bing Voice Search seem certain to attract even more attention in the mobile-phone realm.
Tags: Microsoft Research Asia, Microsoft Research Redmond, Rick Rashid, Bing, Windows Phone, Frank Seide, machine translation, Bing Translator, Qiang Huo, Interspeech 2011, Gang Li, Dong Yu, Computing in the 21st Century Conference, deep neural networks, Conference of the International Speech Communication Association, automatic speech recognition, Conversational Speech Transcription Using Context-Dependent Deep Neural Networks, speed, text-to-speech, Conversational Systems Research Center, Bing Voice Search, Bing Speech, mobile phone, accuracy, DNN