[asterisk-dev] OpenCL for improved performance transcoding and conferencing

Chris Coleman chris at espacenetworks.com
Tue Sep 28 04:53:03 CDT 2010


> Date: Sat, 25 Sep 2010 14:40:50 +0800
> From: Steve Underwood <steveu at coppice.org>
> Subject: Re: [asterisk-dev] OpenCL for improved performance,
> 	transcoding and conferencing
> To: Asterisk Developers Mailing List <asterisk-dev at lists.digium.com>
> Message-ID: <4C9D9972.8020809 at coppice.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>    On 09/25/2010 12:34 PM, Chris Coleman wrote:
>    
>>> Message: 1
>>> Date: Fri, 24 Sep 2010 21:27:34 +0800
>>> From: Steve Underwood <steveu at coppice.org>
>>> Subject: Re: [asterisk-dev] OpenCL for improved performance
>>> 	transcoding and conferencing
>>> To: asterisk-dev at lists.digium.com
>>> Message-ID: <4C9CA746.6020309 at coppice.org>
>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>>
>>>      On 09/24/2010 06:02 PM, Chris Coleman wrote:
>>>
>>>        
>>>> Steve, thanks for the input.
>>>>
>>>> You encouraged me to delve deeper.
>>>>
>>>> So, I did, and have some good news.
>>>>
>>>> There is a company in the UK that makes and sells EXACTLY the kind of
>>>> thing I'm talking about.
>>>>
>>>> It is a general purpose GPU, on a PCIe card, with a module for asterisk,
>>>> made to accelerate and offload computation for transcoding and
>>>> conferencing !!
>>>>
>>>> The general-purpose GPU it uses is the IBM Cell processor, the same one
>>>> that's in the PlayStation 3.
>>>>
>>>> They talk about power savings, and allowing something like 460 channels
>>>> of transcoding, from, for example, GSM to G.729, without bringing the CPU
>>>> to its knees transcoding the audio, because the GPU is SO MUCH better
>>>> suited to this math work of transcoding.
>>>>
>>>> Here is the source I'm quoting:
>>>>
>>>> http://www.youtube.com/watch?v=0dnFD_vaJ6s
>>>>
>>>> Would like to have the opinion of the group.
>>>>
>>>> Maybe someone feels up to the challenge of implementing some test code....
>>>>
>>>>          
>>> Howler are out of business, but they didn't make that card. It's
>>> available from Leadtek. The Windows and Linux SDK is free, and you can
>>> download it and experiment with the potential of the Cell processor for
>>> speeding up algorithms. I bought one a few months ago to experiment
>>> with, and it's fairly easy to achieve interesting levels of performance.
>>> Sadly...
>>>
>>> - the Linux SDK is 32 bit only
>>>
>>> - a 64 bit Linux SDK will not be made available
>>>
>>> - the kernel driver module is supplied as object code, so it can only be
>>> run with supported kernels (a couple of RHEL/Centos revisions)
>>>
>>> - source code is not available for most of the SDK, so 64 bit support
>>> can't be developed by the user.
>>>
>>> So, at the end of the day the whole thing looks like a dead end.
>>>
>>> The Cell is *nothing* like an nVidia or ATI GPU. It is a far more
>>> general-purpose compute engine. It's much closer to the currently stalled
>>> Larrabee project at Intel. It is a very good platform for things like
>>> G.729. A quad core Xeon can easily do more G.729 channels than the Cell
>>> based chip (actually a Toshiba Spurs Engine chip) on these cards.
>>> However, the card takes <20 W, and working alongside the main quad core
>>> CPU it is capable of achieving a pretty reasonable balance.
>>>
>>> Steve
>>>
>>>        
>> Steve, again I really appreciate the insight.
>>
>> It sounds like this Leadtek board I discovered is the same one you'd
>> been referring to.  Good stuff...
>>      
> Kinda good stuff. 48GFLOPs peak rate, and a large percentage of that is
> genuinely available for things like audio codecs. It's a real pity it's a
> dead end. I think the Cell has been very badly handled. It looks like
> Intel lack the urge to handle Larrabee any better.
>    
>>
>> Then I had a question: just how much higher math performance do you get on
>> the ION gpu vs. the cpu ??
>>      
> It depends entirely on the problem. Some things will get close to the
> available 50GFLOPs. A very small number can get far more than that,
> because they can make use of the very high throughput interpolation
> hardware. Most applications will have to do a substantial amount of the
> decision oriented work on the main CPU, and the latency of shuttling
> between GPU and CPU will mean you get only a tiny percentage of the
> available FLOPs.
>    
>>
>> Quick search and I'm seeing 2.1 GigaFLOPS on the Atom's inbuilt math
>> unit.  50 GigaFLOPS on the DirectX 10.1/CUDA/OpenCL-enabled GT218
>> graphics chip aka Nvidia ION GP-GPU.
>>
>> Atom's 1.66 GHz clock speed means it can crunch 1-1.5 floating point
>> operations per clock cycle <--- that's why the Atom is so easy to
>> saturate and bog down when transcoding and/or conferencing.
>>      
> I think the Atom can do an SSE operation per cycle in the best case, so
> that would be 4x1.66GFLOPS peak. 2GFLOPs might be something you could
> really obtain in a practical program. The Atom is a strict in order
> processor, so you need to fiddle around quite a lot with instruction
> order to get the best out of it. The Core 2 and i7 are much more
> tolerant, so code written for them might perform poorly on an Atom
> without some rework.
>    
>> The ION can crunch about 30 operations per CPU clock... freeing up the CPU to do
>> other stuff.
>>
>> That 50 GigaFLOPS of the ION (or any compatible computing unit that
>> OpenCL is able to detect) is looking like a pretty darn attractive compute
>> engine to tap into... and it would be a waste of computing resources, as
>> well as energy and pollution/carbon footprint, NOT to.
>>      
> It looks like you need to get about 4% of the ION's available
> performance to match the Atom. Do you think you can get that much out of
> it in a codec application? Look at some of the figures for offloading
> nice regular maths, like an FFT, from the CPU to the GPU. Typically for
> an i7 and a fast nVidia board people seem to get about a 1:1 ratio for a
> 128 or 256 point transform. For an 8192 point transform they might get
> several times speedup on the GPU. For a 32 or 64 point transform the CPU
> is usually much faster. The Fermi lets you keep more activities in
> flight, so it may do better than this. This is what interests me about
> the Fermi right now.
>    
>>
>> I did a search and found that this EXACT issue was brought up 3.5
>> years ago on this list, in March 2007 -- using a GP-GPU for
>> codecs/conferencing with the Nvidia 8800GT.
>>
>> http://lists.digium.com/pipermail/asterisk-dev/2007-March/026431.html
>>
>>
>> I would be curious to see that code -- or a more updated version -- using
>> TODAY's GP-GPU libraries to talk to the ION, a GP-GPU chip two years newer
>> -- and see how it performs.
>>      
> What code? I see no mention there of actual code. The ION is based on
> the older less flexible GPUs. The latest software won't change things
> much. The bottleneck is in the hardware.
>    
>> Only then will we REALLY have the answer to the question -- how much
>> will the Asterisk community benefit from 50 GigaFLOPS of free GP-GPU
>> horsepower offered by the Nvidia ION?
>>
>> It's hard to know, until you try it out and see...
>>      
> So, you don't believe the string of failures were competently implemented?
>
> Steve
>
>    

Steve,

I didn't mean to imply that previous implementations were not done 
competently.

What I meant is, software lags behind the hardware, usually by years.
And I haven't looked at anyone's code for this.

Plus, I have no idea what information others have had access to, or how 
much of it, relative to what I've got today.

That's life!  Keep on moving forward, don't stop...


Thanks for the numbers on the experiments with offloading the FFT to the 
GPU.

Do you know if they implemented a VoIP codec using that FFT?  Which codec?



I looked at the G.729 codec: its computational workload, and the data 
transfer required between GPU and CPU.

I picked G.729 because it has very good voice quality and an excellent 
compression ratio.

These details, some that I found, and some that I derived, look 
promising enough to continue:


The G.729 encoder requires roughly 300,000 clock cycles on a CPU (a Celeron 
500 using MMX instructions... I'd say that's approximately the same as 
today's 1.66 GHz Atom) to convert one 10 ms frame of raw audio to 
compressed G.729.
100 frames of raw audio must be encoded per second per channel.
For low-jitter performance, the encoding time for each frame should be 
less than 2.5 ms, which is less than 4.15 million CPU cycles at 1.66 GHz 
per frame.

The G.729 decoder is similar but less complex: only about 100,000 CPU 
clock cycles are required per frame to decode.
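
Here's a quick back-of-the-envelope check of that budget -- just the 
estimates above restated in one place, not new measurements:

/* Back-of-the-envelope check of the real-time budget quoted above.
   All figures are the rough estimates from this thread, not measurements. */
#include <stdio.h>

int main(void)
{
    const double cpu_hz           = 1.66e9;  /* Atom 1.66 GHz              */
    const double frame_ms         = 10.0;    /* one G.729 frame            */
    const double jitter_budget_ms = 2.5;     /* target encode time / frame */
    const double enc_cycles       = 300e3;   /* estimated encode cost      */
    const double dec_cycles       = 100e3;   /* estimated decode cost      */

    double budget_cycles  = cpu_hz * jitter_budget_ms / 1000.0;
    double frames_per_sec = 1000.0 / frame_ms;

    printf("cycles available in the 2.5 ms budget: %.2f million\n",
           budget_cycles / 1e6);
    printf("encode load per channel: %.1f million cycles/s\n",
           enc_cycles * frames_per_sec / 1e6);
    printf("decode load per channel: %.1f million cycles/s\n",
           dec_cycles * frames_per_sec / 1e6);
    return 0;
}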

Transferring the data for one frame to the GPU takes approximately 640 
Atom CPU cycles, plus some overhead (that's a minimum, assuming the PCIe 
x1 bus the ION is attached by; it might take more cycles, and I need to 
look into this further).

It takes about the same, 640 CPU cycles, to get the compressed frame from 
the GPU back to the CPU.
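
For scale, here's the raw data volume per frame, under my assumptions of 
8 kHz 16-bit linear PCM on the way in, a 10-byte compressed frame on the 
way out, and roughly 250 MB/s one way on a PCIe 1.x x1 link. It ignores 
per-transfer latency and DMA setup (probably the bulk of the real cost), 
but it lands in the same order of magnitude as the 640-cycle estimate:

/* Rough per-frame data volume and raw transfer time over PCIe.
   Assumes 8 kHz, 16-bit linear PCM (160 bytes per 10 ms frame), 10-byte
   G.729 frames, and ~250 MB/s one way on a PCIe 1.x x1 link.  Ignores
   per-transfer latency and DMA setup. */
#include <stdio.h>

int main(void)
{
    const double cpu_hz     = 1.66e9;   /* Atom clock             */
    const int    pcm_bytes  = 80 * 2;   /* 80 samples x 16 bits   */
    const int    g729_bytes = 10;       /* 8 kbit/s x 10 ms       */
    const double pcie_bps   = 250e6;    /* x1 link, one direction */

    double up_us   = pcm_bytes  / pcie_bps * 1e6;  /* raw frame to GPU */
    double down_us = g729_bytes / pcie_bps * 1e6;  /* coded frame back */

    printf("raw frame  : %d bytes, ~%.2f us, ~%.0f CPU cycles\n",
           pcm_bytes, up_us, up_us * 1e-6 * cpu_hz);
    printf("coded frame: %d bytes, ~%.2f us, ~%.0f CPU cycles\n",
           g729_bytes, down_us, down_us * 1e-6 * cpu_hz);
    return 0;
}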

If the decoder and encoder were run entirely on the CPU, the system would 
run out of horsepower at around 10 decodes and 10 encodes running 
simultaneously, and probably more like 8 (allowing overhead for the OS, 
I/O, etc.).
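
A quick sanity check on that ceiling: the answer is very sensitive to 
which CPU the 300,000/100,000-cycle figures actually refer to, so this 
sketch (my arithmetic, not a benchmark) shows both the Celeron 500 they 
were quoted for and the 1.66 GHz Atom, leaving about 20% of the CPU for 
the OS, I/O and Asterisk itself:

/* Simultaneous full-duplex G.729 channels the CPU alone could carry,
   assuming the 300k/100k cycles-per-frame estimates and keeping ~20%
   of the CPU free for the OS, I/O and Asterisk itself. */
#include <stdio.h>

static int channels(double cpu_hz)
{
    const double frames_per_sec = 100.0;
    const double enc_cycles = 300e3, dec_cycles = 100e3;
    double per_channel = (enc_cycles + dec_cycles) * frames_per_sec;
    return (int)(cpu_hz * 0.80 / per_channel);   /* 80% usable */
}

int main(void)
{
    printf("Celeron 500 MHz: ~%d channels\n", channels(500e6));
    printf("Atom 1.66 GHz  : ~%d channels\n", channels(1.66e9));
    return 0;
}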

Running this G.729 codec on the GPU would use nearly zero CPU: only 
enough CPU instructions to transfer the frames back and forth to the GPU 
and to trigger the GPU to encode or decode each frame -- roughly 64,000 
cycles per second.

That is about 0.004% CPU usage when using the GP-GPU for G.729.

In the real world it's surely more like 1%, because I'm only 
approximating the data transfer overhead between the CPU and GPU.
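
To make that shuttle concrete, here is a minimal host-side OpenCL sketch 
of the per-frame round trip (write the raw frame up, launch a kernel, 
read the coded frame back). The kernel body is just a placeholder I made 
up -- there is no real G.729 encoder kernel here, encode_frame is a 
hypothetical name, and a real implementation would batch many channels 
per launch to amortise transfer and launch latency, with proper error 
handling:

/* Minimal sketch of the per-frame CPU<->GPU shuttle, using the plain
   OpenCL C API.  The kernel body is a placeholder: a real G.729 encoder
   kernel does not exist here.  Error handling is mostly omitted. */
#include <stdio.h>
#include <CL/cl.h>

#define PCM_SAMPLES  80   /* 10 ms of 8 kHz, 16-bit audio */
#define G729_BYTES   10   /* one compressed G.729 frame   */

/* Placeholder kernel: copies a few bytes so the round trip is visible.
   A real implementation would run the encoder's fixed-point math here. */
static const char *src =
    "__kernel void encode_frame(__global const short *pcm,\n"
    "                           __global uchar *out)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    if (i < 10) out[i] = (uchar)(pcm[i] >> 8);\n"
    "}\n";

int main(void)
{
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "encode_frame", &err);

    cl_mem d_pcm = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                  PCM_SAMPLES * sizeof(short), NULL, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, G729_BYTES, NULL, &err);
    clSetKernelArg(k, 0, sizeof(d_pcm), &d_pcm);
    clSetKernelArg(k, 1, sizeof(d_out), &d_out);

    short pcm[PCM_SAMPLES] = {0};          /* one raw frame (silence) */
    unsigned char coded[G729_BYTES];

    /* The per-frame shuttle: raw frame up, kernel launch, coded frame back. */
    size_t gsz = PCM_SAMPLES;
    clEnqueueWriteBuffer(q, d_pcm, CL_FALSE, 0, sizeof(pcm), pcm, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, sizeof(coded), coded, 0, NULL, NULL);

    printf("round trip complete, first byte = %u\n", coded[0]);
    clFinish(q);
    return 0;
}

Even this toy version shows that the CPU's part is reduced to queueing 
buffer transfers and kernel launches; whether the GPU can actually run 
the encoder's fixed-point math efficiently is the open question.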

Your input would be welcome...



