[Asterisk-Users] Text to Speech - Someone needs to do this
Moshe Yudkowsky
speech at pobox.com
Wed Jul 16 06:32:55 MST 2003
At 15:41 2003-07-15 -1000, Matthew John Darnell wrote:
>Why hasn't someone found 50 people who sound alike, put them in sound
>studios and recorded the 10,000 most commonly used words. You would need all
>different forms of the 1,000 most common words, i.e. leading, trailing, question
>etc.
>
>You can synthesize the other 0.05% when you run into them. With hard drives
>so big, processors so fast and EXT3 that can handle 30,000+ files in a
>single directory that seems like the way to do it.
>
>You could sell it for BIG bucks.
Text-to-Speech (TTS) is usually either "formant," created by synthesis of
sounds; or "concatenative," created by concatenating sounds of actual speech
samples.
However, concatenative TTS usually works by using small fragments of
speech, not entire words. The storage requirements are much smaller, and it
gives the system an opportunity to pick units of speech that match the
units of speech that precede and follow them.
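To make the "pick units that match their neighbors" idea concrete, here is a minimal sketch in Python. It is not how any particular product works. The Unit class, the fragment labels, and the pitch-only join cost are all assumptions made up for illustration; real systems generally score many more acoustic features than pitch. The sketch picks one recording per fragment so that the pitch mismatch at each join is as small as possible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded fragment of speech (hypothetical, for illustration)."""
    label: str          # which fragment this is, e.g. "s-eh"
    start_pitch: float  # pitch in Hz at the left edge of the recording
    end_pitch: float    # pitch in Hz at the right edge of the recording


def join_cost(left, right):
    # Assumed cost: how badly the pitch jumps where two fragments meet.
    return abs(left.end_pitch - right.start_pitch)


def select_units(targets, inventory):
    """Pick one Unit per target label, minimizing the total join cost.

    targets:   list of fragment labels, in speaking order
    inventory: dict mapping each label to its list of candidate Units
    Returns (total_cost, chosen_units).
    """
    if not targets:
        return (0.0, [])

    # For each candidate of the current label, keep the cheapest path
    # that ends in that candidate (dynamic programming over the sequence).
    prev = [(0.0, [unit]) for unit in inventory[targets[0]]]
    for label in targets[1:]:
        current = []
        for unit in inventory[label]:
            cost, path = min(
                ((c + join_cost(p[-1], unit), p) for c, p in prev),
                key=lambda item: item[0],
            )
            current.append((cost, path + [unit]))
        prev = current
    return min(prev, key=lambda item: item[0])


if __name__ == "__main__":
    inventory = {
        "s-eh": [Unit("s-eh", 120.0, 130.0), Unit("s-eh", 120.0, 180.0)],
        "eh-d": [Unit("eh-d", 175.0, 150.0), Unit("eh-d", 132.0, 110.0)],
    }
    cost, path = select_units(["s-eh", "eh-d"], inventory)
    print(cost, [(u.label, u.start_pitch, u.end_pitch) for u in path])
```

With whole words as the units, the same search still works, but the inventory would need a recording of every word for every context it might be joined in, which is where the storage argument comes from.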
The real trick is to get the correct prosody. Here are three sentences with
the same words but each with different prosody:
"I said 'yes.'"
"I said yes?"
"_I_ said '_yes_'"???!!
Both formant and concatenative systems add prosody. Adding prosody to
whole-word concatenative systems is difficult.
If you're in a buying mood, there are some excellent TTS systems available.
For example, Rhetorical (http://www.rhetorical.com) has some excellent
voices. The funniest TTS voice currently available is their "Southern
California female" voice; I use it for non-serious demos ("That's so
totally awesome.").
Commercial TTS is actually very intelligible and perfectly adequate for many
tasks.
--
Moshe Yudkowsky
Disaggregate
2952 W Fargo
Chicago, IL 60645 USA
www.Disaggregate.com
speech at pobox.com
+1 773 764 8727