[asterisk-dev] OpenCL for improved performance transcoding and conferencing

Fri Sep 24 05:02:02 CDT 2010

Steve Underwood wrote:
>
> Date: Thu, 23 Sep 2010 14:15:21 +0800
> From: Steve Underwood<steveu at coppice.org>
> Subject: Re: [asterisk-dev] OpenCL for improved performance
> 	transcoding and conferencing
> To: Asterisk Developers Mailing List<asterisk-dev at lists.digium.com>
> Message-ID:<4C9AF079.9060203 at coppice.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>    On 09/23/2010 07:03 AM, Chris Coleman wrote:
>    
>>>>    You probably already know very well about OpenCL -- the open
>>>>    standard Open Computing Language, originally from Apple, that lets you
>>>>    use any supported general-purpose GPU for much faster math computing....
>>>>          
>>>    Sometimes faster. Often slower. It depends how well the problem fits the
>>>    hardware.
>>>        
>> Agreed.  I'm citing the Wikipedia article on GPGPU.  It states that one of the (approx 30) suitable problems for GPGPU to solve is Audio Signal Processing.
>>
>> One type it says it's been used for is Audio Speech Processing.
>>
>> http://en.wikipedia.org/wiki/GPGPU
>>      
> You do realise that page was written by the marketing dept, don't you?
> Try to find real audio speech processing that worked out well.
>    
>>>>    Dual-core Atom D510 mini-ITX motherboards with the OpenCL-compatible GPU
>>>>    Nvidia G210 (ION) are available for slight cost increase over boards
>>>>    without the ION.  For example : the Jetway NC98
>>>>
>>>>    Theoretically the GPU is 10x more efficient in doing transcoding math,
>>>>    than the CPU... so we could get up to 10x as many transcoded and
>>>>    conferenced channels...
>>>>
>>>>    ....resulting in lower energy consumption, and higher number of
>>>>    transcoded + conferenced channels, per PBX server, before hitting the
>>>>    limit of CPU math processing horsepower.
>>>>          
>>>    CUDA is far more mature than OpenCL right now. Have a look around the
>>>    internet at attempts to use CUDA to accelerate audio and speech codecs.
>>>    The results aren't good. I have only seen one attempt - a poorly
>>>    performing MP3 encoder - where the developers didn't get disheartened
>>>    and abandon work before completion.
>>>        
>> I don't doubt CUDA is ahead of OpenCL: OpenCL wraps CUDA (Nvidia's GP-GPU libraries that work ONLY on their GPUs) AND ATI Stream (ATI's GP-GPU libraries that work ONLY on ATI's GPUs).
>>
>> Nvidia has BILLIONS of dollars of incentives to lock as many users into CUDA as possible, and not be compatible with ATI. So they are trying to innovate and not interoperate.
>>
>> We in the Linux world, however, are trying to interop to the MAX.  This is why I suggested OpenCL.
>>
>> OpenCL should currently have all needed functionality to accelerate the relatively simple (compared to 3D graphics) Audio and Speech Processing.
>>      
> All completely irrelevant to what I said. CUDA has the upper hand today,
> and it *still* has no proven record of performing well.
>    
>>>>    Does anyone else think it'd be brilliant of the * developers to update
>>>>    the transcoding and conferencing code for 2010, to use OpenCL (which
>>>>    uses any available, compatible GPU) for absolutely awesome performance ??
>>>>
>>>>
>>>>          
>>>    Awesome? Really? You have supporting evidence?
>>>        
>> The "evidence" I have is from general knowledge and experience.
>>      
> Can you share some of this general knowledge and experience? The general
> knowledge the rest of us have is good results for a few algorithms, and
> dismal failure for most others.
>    
>> High ratio audio compression (such as the G.729, as well as mp3 and mp4-AAC) uses statistics to intelligently drop some of the audio information.
>>      
> Where did you get that drivel from? Compression is based on what the ear
> can't detect, and speech compression is also based on what the ear can't
> produce. There is no statistical processing in any of the practical
> compression algorithms.
>    
>> And to run those stats, the algorithm has to do floating point vector math: something called an FFT, a Fast Fourier Transform, on the incoming raw audio stream.
>>      
> Things like MP3, AAC and G.722.1 use IDCT, which is not a million miles
> from an FFT, but audio compression never uses an FFT. The key speech
> compression codecs today, like G.729, don't do anything remotely like an
> FFT. DCTs are also the basis of most video codecs, and the GPUs are
> designed to run those well. This might lead one to assume that at least
> MP3 will run very well on a GPU. The only attempt I know of produced
> very poor results, but that was with an older GPU. I suspect the Fermi
> will do a lot better.
>
>    
>> With 40, 56, 128, and 192 stream processing cores, a GP-GPU is going to be able to get that done a LOT quicker, on a LOT more simultaneous channels, than the simple onboard floating point unit built into the Atom processor, which can handle only a couple of math calculations per clock cycle per core.
>>      
> They do really well on long vectors, which a lot of high performance
> computing is heavily based on. They do really badly with masses of short
> vectors, and a lot of decision making, like the typical speech codec is
> filled with.
>    
>>>    The nVidia Fermi may change things quite a bit, as it is a more general
>>>    purpose compute engine than the earlier GPUs. I still haven't seen
>>>    anything impressive for any computation in the ballpark of a speech
>>>    codec, though. I bought a GTX460 card a couple of weeks ago to do some
>>>    experiments, but I'm struggling to find the time to work on it.
>>>        
>> While your GTX460 is a totally butt-kicking card by today's standards, general-purpose GPU computing has been available for a few years now; I think it truly started with the DirectX 10.1 cards.
>>      
> The raw speed of the GTX460 is largely irrelevant here. It's the improved
> architecture of the Fermi design, which makes it a lot more general
> purpose, that has made me take a renewed interest.
>    
>>>    The Toshiba SpursEngine card from Leadtek is certainly capable of
>>>    accelerating speech codecs. The now defunct HowlerTech company was using
>>>    them. Not too expensive. Fairly low power. The snag is they seem to be a
>>>    dead end. Leadtek supply a 32-bit Linux SDK, but say a 64-bit SDK will
>>>    not appear. To me, that says abandonware.
>>>    Steve
>>>        
>> The reason I suggested GP-GPU and not an add-on card, is because it's available in all the mini-ITX motherboards with a cheap DirectX 10.1 or higher graphics chip : Nvidia ION, and soon to come: ATI Fusion.
>>
>> No extra PCI slot required (many mini-ITX boards don't even have one), same power consumption.
>>
>> Basically, everyone who buys that simple motherboard or better (and it's available now) can use it.
>>
>> It's the simplest, cheapest, and best bang for the buck.
>>
>> Plus, servers rarely need fancy graphics, so it's not like you're taking the GPU power from something else that needs it.
>>
>> Also, add-on graphics cards are available in PCIe x1 format, this GT210 for example, for about 30.00
>>      
> Free hardware is great if it performs. Do you have any evidence the ION
> ever performs well on anything but the video codecs it has been highly
> tailored for?
>
> Steve
>

Steve, thanks for the input.

You encouraged me to delve deeper.

So, I did, and have some good news.

There is a company in the UK that makes and sells EXACTLY the kind of 
thing I'm talking about.

It is a general-purpose GPU on a PCIe card, with a module for Asterisk,
made to accelerate and offload computation for transcoding and
conferencing!

The processor it uses is the IBM Cell, the same as in the PlayStation 3
(not the Xbox 360, which uses a different PowerPC-based chip).

They talk about power savings, and about allowing something like 460
channels of transcoding, for example from GSM to G.729, without bringing
the CPU to its knees, because the GPU is SO MUCH better suited to the
math work of transcoding.

Here is the source I'm quoting:

http://www.youtube.com/watch?v=0dnFD_vaJ6s

I would like to have the opinion of the group.

Maybe someone feels up to the challenge of implementing some test code....
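
As a possible starting point for that test code, here is a minimal sketch
in Python (my own illustration, not Asterisk code) of the CPU-side baseline
that any GPU version would have to beat: mixing one frame from every
conference channel, with 16-bit saturation. The 160-sample frame (20 ms at
8 kHz), the names mix_frames and time_mixing, and the synthetic sample data
are all assumptions made for the example.

```python
import time

# Assumption: 20 ms frames of 8 kHz, 16-bit linear audio (160 samples).
FRAME_SAMPLES = 160


def mix_frames(frames):
    """Sum one frame from every channel, clamping to the signed 16-bit range."""
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        mixed.append(max(-32768, min(32767, total)))
    return mixed


def time_mixing(channels, rounds=100):
    """Return the average seconds spent mixing one frame period."""
    # Synthetic, deterministic samples in [-1000, 1000]; not real audio.
    frames = [
        [(ch * 37 + i * 11) % 2001 - 1000 for i in range(FRAME_SAMPLES)]
        for ch in range(channels)
    ]
    start = time.perf_counter()
    for _ in range(rounds):
        mix_frames(frames)
    return (time.perf_counter() - start) / rounds


if __name__ == "__main__":
    for channels in (8, 64, 460):
        per_frame = time_mixing(channels)
        print(f"{channels} channels: {per_frame * 1e3:.3f} ms per 20 ms frame")
```

Pure Python is far slower than Asterisk's C, so the numbers only show how
the cost grows with the channel count. The same harness could later wrap a
codec or an OpenCL kernel, to compare against this baseline.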

Chris