[Asterisk-Users] Text to Speech - Someone needs to do this

Moshe Yudkowsky speech at pobox.com
Wed Jul 16 06:32:55 MST 2003


At 15:41 2003-07-15 -1000, Matthew John Darnell wrote:
>Why hasn't someone found 50 people who sound alike, put them in sound
>studios and record the 10,000 most commonly used words.  You would all
>differnent forms of the 1,000 most words, i.e. leading, trailing, question
>etc.
>
>You can synthesize the other 0.05% when you run into them.  With hard drives
>so big, processors so fast and EXT3 that can handle 30,000+ files in a
>single directory that seems like the way to do it.
>
>You could sell it for BIG bucks.

Text-to-Speech (TTS) is usually either "formant," created by synthesis of
sounds; or "concatenative," created by concatenating samples of actual
speech.

However, concatenative TTS usually works by using small fragments of 
speech, not entire words. The storage requirements are much smaller, and it 
gives the system an opportunity to pick units of speech that match the 
units of speech that precede and follow them.
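
To make the "pick units that match their neighbors" idea concrete, here is
a minimal sketch in Python. Everything in it is an assumption made for
illustration, not how any particular product works: the Unit class, the
pitch-only join cost, and the select_units function are hypothetical names,
and a real system would compare much more than pitch at each join.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded speech fragment (hypothetical structure)."""
    name: str           # label of the sound, e.g. a diphone such as "h-e"
    start_pitch: float  # pitch in Hz where the fragment begins
    end_pitch: float    # pitch in Hz where the fragment ends


def join_cost(left: Unit, right: Unit) -> float:
    # Assumed cost: the pitch jump at the seam between two fragments.
    return abs(left.end_pitch - right.start_pitch)


def select_units(targets, inventory):
    """Choose one candidate per target label so the summed join cost
    along the whole utterance is as small as possible."""
    if not targets:
        return []
    columns = [inventory[t] for t in targets]

    # costs[i][j]: cheapest total cost of a path ending at candidate j
    # of target i. back[i][j]: which candidate of target i-1 it came from.
    costs = [[0.0] * len(columns[0])]
    back = [[-1] * len(columns[0])]
    for i in range(1, len(columns)):
        row_cost, row_back = [], []
        for cand in columns[i]:
            options = [costs[i - 1][k] + join_cost(prev, cand)
                       for k, prev in enumerate(columns[i - 1])]
            k_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[k_best])
            row_back.append(k_best)
        costs.append(row_cost)
        back.append(row_back)

    # Start from the cheapest final candidate and walk the backpointers.
    j = min(range(len(costs[-1])), key=costs[-1].__getitem__)
    chosen = []
    for i in range(len(columns) - 1, -1, -1):
        chosen.append(columns[i][j])
        j = back[i][j]
    chosen.reverse()
    return chosen
```

Given several recordings of each fragment, this picks the sequence whose
seams have the smallest pitch jumps. With whole-word recordings there are
far fewer candidates per slot, so there is much less room to smooth the
joins this way.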

The real trick is to get the correct prosody. Here are three sentences with
the same words, each with different prosody:

"I said 'yes.'"

"I said yes?"

"_I_ said '_yes_'"???!!

Both formant and concatenative systems add prosody. Adding prosody to
whole-word concatenative systems is difficult.

If you're in a buying mood, there are some excellent TTS systems available.
For example, Rhetorical (http://www.rhetorical.com) has some excellent
voices. The funniest TTS voice currently available is their "Southern
California female" voice; I use it for non-serious demos ("That's so
totally awesome.")

Commercial TTS is actually very intelligible and perfectly adequate for many
tasks.




-- 
  Moshe Yudkowsky
  Disaggregate
  2952 W Fargo
  Chicago, IL 60645 USA

  www.Disaggregate.com
  speech at pobox.com
  +1 773 764 8727



