Speech Recognition System
Speech Recognition History
Speech recognition systems date back as far as the 1940s, when the U.S. Department of Defense sponsored the first academic pursuits in speech recognition, though the project was a failure.
In the late 1950s IBM started its research in this domain. The objective of this research was to arrive at a correlation between sounds and the words they represent.
At the 1964 World's Fair, IBM demonstrated recognition of spoken digits.
In the late 1960s the Department of Defense again funded a new research initiative.
The research included Automatic Prototyping, which allows computers to search out specific sounds and store them for comparison and analysis, and Dynamic Programming, which recognizes conversational speech despite variations in speaking rate.
During this period, IBM pioneered a statistical approach to speech recognition that allowed the system to improve its performance automatically, using powerful statistical algorithms adapted from information theory.
In 1984, IBM demonstrated the world's first 5,000-word vocabulary speech recognition system with 95 percent accuracy, but it required a mainframe computer.
In 1987, thanks to intensive research efforts, the vocabulary was increased to 20,000 words and the required hardware was reduced to a single auxiliary card.
In 1989 customers began to test the technology, and in 1992 IBM introduced the first dictation system called IBM Speech Server Series.
IBM continued its efforts, and in 1993 it launched the IBM Personal Dictation System running on an IBM PC. The system took dictation at about 80 words per minute with 95 percent accuracy and supported multiple languages.
The main achievement of 1993 was that the processing power needed for speech recognition finally became available on ordinary personal computers.
In 1996 IBM introduced a new release of its dictation system, VoiceType 3.0, which required no special adapter card. It supported discrete, word-at-a-time speech dictation and recognized continuous commands without the need for training.
VoiceType was the world's first consumer dictation product.
In 1996 Charles Schwab became the first major consumer company to implement a speech recognition system for its vital customer interface. The system was called Voice Broker, and its success led to the adoption of speech recognition by the likes of Sears, Roebuck and Co., United Parcel Service, and E*Trade Securities.
In 1997 IBM introduced an avalanche of new products, including VoiceType Connection in Chinese, Japanese, and Arabic.
Besides IBM, several other companies were investing in this field, including Dragon Systems, Lernout & Hauspie, and Philips.
So we can see that early speech recognition products used discrete speech recognition, where you had to pause after each word. Now we have continuous speech recognition, which recognizes naturally flowing speech, with vocabularies of 30,000 words expandable to 150,000 words.
Speech recognition is thus becoming an important means of input to the computer, as well as a basis for other functions.
With advances in speech recognition technology in the late 1990s, web developers became increasingly interested in developing a standard that would allow them to take advantage of the technology commercially. In 1999, the VoiceXML Forum was created by four companies to do just that. In the two years since its inception, the Forum has grown to include over 550 companies.
Most of today's leading voice portals have developed their capabilities using the VoiceXML standard. However, the SALT Forum's announcement on October 15, 2001 has left many web developers in a predicament. Without knowing which standard will ultimately win, developers must choose between using the existing standard (VoiceXML) and postponing development to wait for what may become the standard of the future (SALT).
Hardware and Software Requirements
Dragon Systems, L&H, IBM, and Philips each offer basic packages that cost about $50. More sophisticated versions from Dragon, L&H, and IBM have larger dictionaries and more extensive application support, and cost between $200 and $250.
Speech recognition's complexity pushes the limits of PC power. Although most packages will work with a 200-MHz Pentium, a 300-MHz or faster chip improves performance. Chips such as the Pentium III and the Athlon satisfy the applications' demand for power even better, and many high-end packages can take advantage of the Pentium III's multimedia extensions. And the more RAM, the better: consider 64MB a practical minimum, with 128MB providing substantial improvements.
Most speech packages come with a basic headset microphone, but a better one from a third party can improve recognition. Andrea, Plantronics, and VXI offer a variety of headset microphones ranging in price from $30 to $150.
The quality of the PC's sound card is also crucial. Cheap models won't cut it because they produce distorted, low-quality output. While standard 16-bit sound cards work, a high-quality card that costs $100 to $150 will offer better performance.
Or one could try Dragon Systems' $80 USB headset, which bypasses the sound card entirely (thanks to its built-in digital signal processor) and works well with notebooks.
Speech Recognition Technology
A computer doesn't speak your language, so it must transform your words into something it can understand. A microphone converts your voice into an analog signal and feeds it to your PC's sound card. An analog-to-digital converter takes the signal and converts it to a stream of digital data (ones and zeros). Then the software goes to work.
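As a rough illustration of that digitization step, the following sketch (in Python with NumPy; the sample rate, bit depth, and test tone are typical values chosen for illustration, not taken from any particular product) samples a simulated analog signal and quantizes it to 16-bit integers, the same kind of data stream a sound card hands to the recognition software.

```python
import numpy as np

SAMPLE_RATE = 16000   # samples per second, a common rate for speech
BIT_DEPTH = 16        # 16-bit samples, as produced by a standard sound card

# Simulate one second of an "analog" 440 Hz tone as a continuous-valued signal.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
analog = 0.8 * np.sin(2 * np.pi * 440 * t)

# Analog-to-digital conversion: quantize each sample to a signed 16-bit integer.
max_int = 2 ** (BIT_DEPTH - 1) - 1          # 32767
digital = np.round(analog * max_int).astype(np.int16)

print(digital[:10])   # the stream of numbers the software actually works with
```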
While each of the leading speech recognition companies has its own proprietary methods, the two primary components of speech recognition are common across products. The first piece, called the acoustic model, takes the sounds of your voice and converts them to phonemes, the basic elements of speech. The English language contains approximately 50 phonemes.
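To make the idea of an acoustic model concrete, here is a deliberately simplified sketch with a hypothetical phoneme set and made-up feature vectors, not any vendor's actual model: for each short frame of acoustic features it scores every phoneme and keeps the best match. Real systems use statistical models rather than simple distance comparisons.

```python
import numpy as np

# A tiny, hypothetical phoneme set; real English models use roughly 50 phonemes.
PHONEMES = ["AA", "IY", "S", "T", "silence"]

def best_phoneme(frame_features, phoneme_templates):
    """Pick the phoneme whose stored template is closest to this frame.

    Commercial acoustic models rely on statistical techniques (e.g. hidden
    Markov models); this toy version just measures Euclidean distance.
    """
    distances = {p: np.linalg.norm(frame_features - tmpl)
                 for p, tmpl in phoneme_templates.items()}
    return min(distances, key=distances.get)

# Made-up templates and a made-up frame, purely for illustration.
templates = {p: np.random.rand(12) for p in PHONEMES}
frame = np.random.rand(12)
print(best_phoneme(frame, templates))
```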
Here's how it breaks down your voice: first, the acoustic model removes noise and unneeded information such as changes in volume. Then, using mathematical calculations, it reduces the data to a spectrum of frequencies (the pitches of the sounds), analyzes the data, and converts the words into digital representations.
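The frequency analysis described above can be sketched with a short-time Fourier transform: split the digitized signal into overlapping frames, apply a window, and take the magnitude spectrum of each frame. This is a generic illustration of the idea, not the proprietary processing any of these products actually performs; the frame and hop sizes are conventional values assumed for the example.

```python
import numpy as np

def spectrum_frames(samples, frame_len=400, hop=160):
    """Convert a digitized speech signal into a sequence of frequency spectra.

    frame_len=400 and hop=160 correspond to 25 ms frames every 10 ms at 16 kHz,
    a common convention; actual products may differ.
    """
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        # Magnitude of the FFT gives the energy at each frequency (the "pitches").
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# Example: analyze one second of a synthetic 16 kHz signal.
signal = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)
print(spectrum_frames(signal).shape)   # (number of frames, frequency bins)
```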