[Asterisk-Users] Poll - Would you pay $30-$50 for high quality speech synthesis?

Steve Underwood steveu at coppice.org
Tue Jul 15 21:47:53 MST 2003


Jeff Noxon wrote:

>Many of you are familiar with how lousy Festival sounds.
>
>AT&T has a product, NaturalVoices, that sounds much better.  There are
>male & female voice fonts for US/UK/Indian English, French, Spanish,
>and German.
>
>I am considering offering a linux-based text-to-speech engine based on
>the NaturalVoices runtime.  An asterisk module would also be provided,
>making it easy to add natural sounding synthesis to Asterisk applications.
>You could also use it for other purposes, such as home automation.
>
>After discussing royalties with AT&T, I have concluded that I can probably
>offer such a product at the following prices:
>
>Runtime - $30 intro price with one voice font & one processor
>Extra voices/languages - $15 each
>Extra processors - $15 each
>
>Depending on demand, the price may rise to $50 at some point.  The lower
>the demand, the higher the price, due to AT&T's royalty structure.
>
Maybe you are right, but take great care with this.

You can get packaged versions of Natural Voices cheaply for desktop 
applications. However, when you want to use it for telephony systems it 
usually costs more like $600-$700 per port. There are also big 
differences in the way ports are counted by different vendors. For 
example, the per-port pricing for RealSpeak (which is not related to 
Naturally Speaking) and Speechify (Speechworks' derivative of Naturally 
Speaking) is not too different, but the final bill may be. With 
RealSpeak, if you have 1000 ports and only use TTS a little, you still 
pay for all 1000 ports. With Speechify you pay for the maximum number 
of channels that will be speaking at any one instant. Unless your 
system is very TTS heavy, this makes a huge difference.
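
A rough back-of-the-envelope sketch (Python) of why the counting model
matters. The prices and port counts below are purely illustrative
assumptions, not actual vendor quotes:

  PER_PORT_PRICE = 650      # assumed licence cost per port
  TOTAL_PORTS = 1000        # ports provisioned on the switch
  PEAK_TTS_CHANNELS = 40    # most channels speaking TTS at any one instant

  # RealSpeak-style counting: every provisioned port is licensed.
  cost_all_ports = PER_PORT_PRICE * TOTAL_PORTS        # $650000

  # Speechify-style counting: only peak concurrent TTS channels count.
  cost_peak_usage = PER_PORT_PRICE * PEAK_TTS_CHANNELS # $26000

  print("licence all ports:     $%d" % cost_all_ports)
  print("licence peak TTS only: $%d" % cost_peak_usage)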

I last worked heavily with these TTS engines about two years ago. They 
have improved, but I don't think by that much. Speechify was a lot more 
functional than Naturally Speaking, as its front end language processing 
was more complete. Naturally Speaking read too many things in the wrong 
way (a lot of other TTSs did too). The various Naturally Speaking 
derivatives are not all equal. Naturally Speaking is itself a derivative 
of Festival. Look in the directories, and you still see lots of Festival 
files. Cepstral and Rhetorical Systems both have impressive sounding TTS 
based on Festival. Festival seems to be the root of most things other 
than RealSpeak and Eloquence. Eloquence seems pretty much the only 
mainstream package which does things differently, and actually 
synthesizes voice from basic principles.

Two years ago we deployed systems using RealSpeak, Speechify and 
Eloquence. People hated the robotic quality of Eloquence, but could 
understand it clearly (at least the English one - the Mandarin version 
sounded terrible). People liked the natural sound of RealSpeak, but 
couldn't understand it very well - they could follow paragraphs of text 
OK, but ask them about a specific thing that was said, like a street 
name, and their accuracy was very poor. Speechify was somewhere in 
between, but tending towards RealSpeak. In the end, adverse user 
reaction made us rip out all the TTS and abandon attempts to use it.

Some pointers from working with this stuff:

- First impressions are a bad indicator of true quality, due to the next 
point. You need to play with these things for a while, and see how they 
behave in real world use, before you can really evaluate their usefulness.

- In current TTS systems (all of them), natural sounding tends to equate 
with hard to understand. Most TTS systems basically use a database of 
recorded snippets, and blend them together to form speech. The longer 
the snippets, the smoother and more natural the sound, but the worse 
its accuracy. Short snippets allow more flexibility in sculpting the 
result, giving better intelligibility, but making the sound more 
robotic. There is a toy sketch of this tradeoff after this list.

- If your application is reading long tracts of text, the natural 
sounding TTSs do fairly well. The words you don't hear clearly are 
naturally filled in by your brain from the context. If your application 
is reading out addresses, the more robotic systems do better - I found 
Eloquence does the best for this.

- Don't underestimate the importance of the front end language 
processor. Most offerings deal with this part poorly. They all have 
demos that show how well the system will read things like currency and 
dates. Try feeding those texts to other vendors' TTS engines. The 
results can be quite interesting. The demos only contain examples of 
things the particular engine does well, and they have all focussed on 
getting different things right. The second sketch after this list shows 
the kind of work this front end has to do.

- You put together a system you think is really neat. Users initially 
think it's pretty neat too. Then those same users gradually abandon the 
system as they find its limitations.

- Watch out for resource usage. You might expect these things to hog the 
CPU. They don't. However, they take hundreds of megs of disk (OK), and 
some (like Naturally Speaking and Speechify) needed it all in RAM at once 
to work well (not so OK). So, you had to allow more than 200MB of RAM 
per voice. This may have been improved in newer versions of Speechify, 
but I don't think Naturally Speaking has changed much in that time.
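
As promised above, here is a toy illustration (Python) of the snippet
length tradeoff. Real engines select units from a large recorded
database and do far more than a linear crossfade, so treat this as a
made-up sketch of the idea, not how any particular engine works:

  import numpy as np

  def crossfade_join(units, fade=80):
      """Concatenate recorded units, blending `fade` samples at each join."""
      out = units[0].astype(float)
      for unit in units[1:]:
          unit = unit.astype(float)
          ramp = np.linspace(0.0, 1.0, fade)
          # Blend the tail of the output so far with the head of the next unit.
          out[-fade:] = out[-fade:] * (1.0 - ramp) + unit[:fade] * ramp
          out = np.concatenate([out, unit[fade:]])
      return out

  # Stand-ins for recorded snippets. Long units mean few joins, so the
  # result is smooth but you are stuck with whatever prosody was recorded.
  # Short units mean many joins (each one a chance for an audible glitch),
  # but far more freedom to reshape pitch and timing.
  units = [np.random.randn(400) for _ in range(3)]
  speech = crossfade_join(units)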
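
And here is the kind of work the front end language processor has to do
before a single sample is generated. The two rules below are invented
purely for illustration (a real front end has hundreds of them, and
would also spell the numbers out as words), which is exactly where the
vendors diverge:

  import re

  MONTHS = ["January", "February", "March", "April", "May", "June", "July",
            "August", "September", "October", "November", "December"]

  def normalise(text):
      # "$50" -> "50 dollars"
      text = re.sub(r"\$(\d+)", lambda m: "%s dollars" % m.group(1), text)
      # "7/15/2003" -> "July 15 2003" (assumes US month/day ordering)
      def expand_date(m):
          month, day, year = int(m.group(1)), m.group(2), m.group(3)
          return "%s %s %s" % (MONTHS[month - 1], day, year)
      return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", expand_date, text)

  print(normalise("The price may rise to $50 after 7/15/2003"))
  # -> The price may rise to 50 dollars after July 15 2003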

Regards,
Steve




