[Asterisk-Users] Poll - Would you pay $30-$50 for high quality
speech synthesis?
Steve Underwood
steveu at coppice.org
Tue Jul 15 21:47:53 MST 2003
Jeff Noxon wrote:
>Many of you are familiar with how lousy Festival sounds.
>
>AT&T has a product, NaturalVoices, that sounds much better. There are
>male & female voice fonts for US/UK/Indian English, French, Spanish,
>and German.
>
>I am considering offering a linux-based text-to-speech engine based on
>the NaturalVoices runtime. An asterisk module would also be provided,
>making it easy to add natural sounding synthesis to Asterisk applications.
>You could also use it for other purposes, such as home automation.
>
>After discussing royalties with AT&T, I have concluded that I can probably
>offer such a product at the following prices:
>
>Runtime - $30 intro price with one voice font & one processor
>Extra voices/languages - $15 each
>Extra processors - $15 each
>
>Depending on demand, the price may rise to $50 at some point. The lower
>the demand, the higher the price, due to AT&T's royalty structure.
>
Maybe you are right, but take great care with this.
You can get packaged versions of Natural Voices cheaply for desktop
applications. However, when you want to use it for telephony systems it
usually costs more like $600-$700 per port. There are also big
differences in the way ports are counted by different vendors. For
example, the per port pricing for RealSpeak (which is not related to
Naturally Speaking) and Speechify (SpeechWorks' derivative of Naturally
Speaking) is not too different, but the final bill may be. With
RealSpeak, if you have 1000 ports, and only use TTS a little you still
pay for 1000 ports. With Speechify you pay for the maximum number of
concurrent channels you will have speaking at any instant. Unless your system is
very TTS heavy, this makes a huge difference.
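To put rough numbers on the two counting schemes, here is a minimal sketch in Python. The $650 unit price is only the midpoint of the $600-$700 range mentioned above, and it is assumed to be the same under both schemes. The 1000 installed ports and the peak of 40 simultaneous TTS channels are made-up figures for illustration, and the function names are hypothetical, not part of any vendor's tooling.

```python
# Rough comparison of two TTS licence counting schemes.
# Assumption: the same unit price applies under both schemes (not a vendor figure).

PRICE_PER_PORT = 650  # USD, midpoint of the $600-$700 range quoted above


def cost_per_installed_port(installed_ports: int, price: int = PRICE_PER_PORT) -> int:
    """Licence every port in the system, however rarely it uses TTS."""
    return installed_ports * price


def cost_per_concurrent_channel(peak_tts_channels: int, price: int = PRICE_PER_PORT) -> int:
    """Licence only the maximum number of channels speaking at any instant."""
    return peak_tts_channels * price


if __name__ == "__main__":
    installed = 1000  # hypothetical system size
    peak = 40         # hypothetical peak of simultaneous TTS channels
    print(cost_per_installed_port(installed))   # 650000
    print(cost_per_concurrent_channel(peak))    # 26000
```

With these illustrative inputs the per-installed-port bill is 25 times the per-concurrent-channel bill, and the gap only closes as the peak TTS load approaches the full port count.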
I last worked heavily with these TTS engines about two years ago. They
have improved, but I don't think by that much. Speechify was a lot more
functional than Naturally Speaking, as its front end language processing
was more complete. Naturally Speaking read too many things in the wrong
way (a lot of other TTSs did too). The various Naturally Speaking
derivatives are not all equal. Naturally Speaking is itself a derivative
of Festival. Look in the directories, and you still see lots of Festival
files. Cepstral and Rhetorical Systems both have impressive sounding TTS
based on Festival. Festival seems to be the root of most things other
than RealSpeak and Eloquence. Eloquence seems pretty much the only
mainstream package which does things differently, and actually
synthesizes voice from basic principles.
Two years ago we deployed systems using RealSpeak, Speechify and
Eloquence. People hated the robotic quality of Eloquence, but could
understand it clearly (at least the English one - the Mandarin version
sounded terrible). People liked the natural sound of RealSpeak, but
couldn't understand it very well - they could follow paragraphs of text
OK, but ask them about a specific thing that was said, like a street
name, and their accuracy was very poor. Speechify was somewhere in
between, but tending towards RealSpeak. In the end, adverse user
reaction made us rip out all the TTS and abandon attempts to use it.
Some pointers from working with this stuff:
- First impressions are a bad indicator of true quality, due to the next
point. You need to play with these things for a while, and see how they
behave in real world use, before you can really evaluate their usefulness.
- In current TTS systems (all of them), natural sounding tends to equate
with hard to understand. Most TTS systems basically use a database of
recorded snippets, and blend them to form speech. The longer the
snippets, the smoother and more natural the sound, but the worse its
accuracy. Short snippets allow more flexibility in sculpting the result,
giving better intelligibility, but making the sound more robotic.
- If your application is reading long tracts of text, the natural
sounding TTSs do fairly well. The words you don't hear clearly are
naturally filled in by your brain from the context. If your application
is reading out addresses, the more robotic systems do better - I found
Eloquence does the best for this.
- Don't underestimate the importance of the front end language
processor. Most offerings deal with this part poorly. They all have
demos that show how well the system will read things like currency and
dates. Try feeding those texts to other vendors' TTS engines. The
results can be quite interesting. The demos only contain examples of
things the particular engine does well, and they have all focussed on
getting different things right.
- You put together a system you think is really neat. Users initially
think it is pretty neat too. Then those same users gradually abandon the
system as they find its limitations.
- Watch out for resource usage. You might expect these things to hog the
CPU. They don't. However, they take hundreds of megs of disk (OK), and
some (like Naturally Speaking and Speechify) needed it all in RAM at once
to work well (not so OK). So, you had to allow more than 200MB of RAM
per voice. This may have been improved in newer versions of Speechify,
but I don't think Naturally Speaking has changed much in that time.
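As a back-of-envelope check on the RAM point, here is a small Python sketch. It takes the 200MB per voice figure above as a lower bound. The machine size, the memory reserved for everything else, and the helper names are all assumptions for illustration only.

```python
# Back-of-envelope RAM sizing for TTS voices that must be fully resident.
# Assumption: 200 MB per voice is treated as a minimum, per the figure above.

MB_PER_VOICE = 200


def tts_ram_mb(voices: int, mb_per_voice: int = MB_PER_VOICE) -> int:
    """Minimum RAM needed to keep the given number of voices in memory."""
    return voices * mb_per_voice


def voices_fit(voices: int, total_ram_mb: int, reserved_mb: int) -> bool:
    """True if the voices fit in what is left after reserving RAM for
    the OS and the telephony application (reserved_mb is a guess)."""
    return tts_ram_mb(voices) <= total_ram_mb - reserved_mb


if __name__ == "__main__":
    # Hypothetical box: 1024 MB total, 256 MB kept for everything else.
    print(tts_ram_mb(3))              # 600
    print(voices_fit(3, 1024, 256))   # True
    print(voices_fit(4, 1024, 256))   # False
```

On such a machine a fourth voice would already push the engine out of RAM, which is where the "not so OK" above starts to bite.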
Regards,
Steve