Practical Application of World's First Voice Synthesis System with High-Quality Sound

Building on concepts from voice coding, we developed the world’s first closed-loop learning method, which learns automatically from voice data, and used it to complete a world-leading voice synthesis method.

Research on voice synthesis started with the aim of realizing a human interface technology for interaction with computers, similar to voice recognition. In 1982, we developed a voice word processor that converted voiced syllables into written characters, and applied it to a voice recognition response system for banks. After that, in addition to improving the method itself, we developed dedicated hardware as well as voice synthesis software running on a workstation and, in 1995, commercialized voice synthesis software that ran on a PC.

However, the sound quality and naturalness of the synthesized voice were far from satisfactory, and it was described as a “nasal voice” or “robot voice.” The sound quality could have been improved by expanding the speech segment dictionary used for waveform generation, but this would have enlarged the dictionary to the point where the system would be difficult to implement on small-scale hardware. Development time was also an issue, because creating a speech segment dictionary relied on trial and error by technical specialists. A number of research institutions investigated various approaches to these issues, but no decisive solution appeared.

This situation changed completely in 1994, when a voice coding researcher joined our research efforts. The issues were reinvestigated from scratch, free from the conventional assumptions of voice synthesis. Rather than basing our approach on existing knowledge or know-how, we adopted automatic learning of voice synthesis parameters from voice data as our fundamental policy. Finally, by analyzing the causes of the nasal and robotic-sounding voices, we succeeded in formulating the issue of sound quality as errors relative to the learning data.

Next, based on this formulation, we developed a closed-loop learning method for the speech segment dictionary that minimized errors in the synthesized sound, the first time in the world such a system had been realized. This memory-efficient system resolved the trade-off between sound quality and dictionary size, maximizing sound quality while using a minimum of speech segments and providing high-quality, natural synthesized sound similar to the human voice. Another feature of the system was that, once the learning data had been prepared, a synthesis dictionary could be created automatically in a short period of time to produce synthesized sound close to the human voices used for the learning data. This closed-loop learning method was revolutionary: it overturned the prevailing belief that developing a voice synthesis system inevitably required knowledge and know-how accumulated over many years, as well as a process of trial and error relying on the ears of technical specialists.

To bring these research results into practical use, the researchers themselves visited customers and cultivated the market. In 1998, our voice synthesis middleware was adopted by a leading automobile manufacturer. Other manufacturers followed suit, and by 2006 it held a 94% share of the domestic car navigation market. In 2002, we established research and development bases in the U.K. and China and worked on preparing multilingual versions. Today, Toshiba’s voice synthesis and voice recognition technologies have been adopted in the European and American markets as well as the Chinese market.

We are also pioneering new services, such as the application of voice synthesis to content creation, and are developing speaker adaptation and speaking-style adaptation technologies that can synthesize the voices and ways of speaking of specific speakers, as well as an emotional voice synthesis technology that can synthesize emotional voices. In these ways, we are working to expand the fields of application of our system.

Toshiba Science Museum
2F Lazona Kawasaki Toshiba Bldg., 72-34, Horikawa-Cho, Saiwai-Ku, Kawasaki 212-8585, Japan