[asterisk-dev] OpenCL for improved performance, transcoding and conferencing

Steve Underwood steveu at coppice.org
Sat Sep 25 01:40:50 CDT 2010


  On 09/25/2010 12:34 PM, Chris Coleman wrote:
>> Message: 1
>> Date: Fri, 24 Sep 2010 21:27:34 +0800
>> From: Steve Underwood<steveu at coppice.org>
>> Subject: Re: [asterisk-dev] OpenCL for improved performance
>> 	transcoding and conferencing
>> To: asterisk-dev at lists.digium.com
>> Message-ID:<4C9CA746.6020309 at coppice.org>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>
>>     On 09/24/2010 06:02 PM, Chris Coleman wrote:
>>
>>> Steve, thanks for the input.
>>>
>>> You encouraged me to delve deeper.
>>>
>>> So, I did, and have some good news.
>>>
>>> There is a company in the UK that makes and sells EXACTLY the kind of
>>> thing I'm talking about.
>>>
>>> It is a general purpose GPU, on a PCIe card, with a module for asterisk,
>>> made to accelerate and offload computation for transcoding and
>>> conferencing !!
>>>
>>> The general-purpose GPU it uses is the IBM CELL processor, same as in
>>> the Xbox 360 and Playstation 3.
>>>
>>> They talk about power savings, and allowing something like 460 channels
>>> of transcoding, from for example gsm to g.729, without bringing the CPU
>>> to its knees transcoding the audio, because the GPU is SO MUCH better
>>> suited to this math work of transcoding.
>>>
>>> Here is the source I'm quoting:
>>>
>>> http://www.youtube.com/watch?v=0dnFD_vaJ6s
>>>
>>> Would like to have the opinion of the group.
>>>
>>> Maybe someone feels up to the challenge of implementing some test code....
>>>
>> Howler are out of business, but they didn't make that card. It's
>> available from Leadtek. The Windows and Linux SDK is free, and you can
>> download it and experiment with the potential of the Cell processor for
>> speeding up algorithms. I bought one a few months ago to experiment
>> with, and it's fairly easy to achieve interesting levels of performance.
>> Sadly...
>>
>> - the Linux SDK is 32 bit only
>>
>> - a 64 bit Linux SDK will not be made available
>>
>> - the kernel driver module is supplied as object code, so it can only be
>> run with supported kernels (a couple of RHEL/Centos revisions)
>>
>> - source code is not available for most of the SDK, so 64 bit support
>> can't be developed by the user.
>>
>> So, at the end of the day the whole thing looks like a dead end.
>>
>> The Cell is *nothing* like an nVidia or ATI GPU. It is a far more
>> general-purpose compute engine. It's much closer to the currently stalled
>> Larrabee project at Intel. It is a very good platform for things like
>> G.729. A quad-core Xeon can easily do more G.729 channels than the Cell
>> based chip (actually a Toshiba Spurs Engine chip) on these cards.
>> However, the card takes <20W, and working alongside the main quad-core
>> CPU it is capable of achieving a pretty reasonable balance.
>>
>> Steve
>>
> Steve, again I really appreciate the insight.
>
> It sounds like this Leadtek board I discovered is the same one you'd
> been referring to.  Good stuff...
Kinda good stuff. 48 GFLOPS peak rate, and a large percentage of that is 
genuinely available for things like audio codecs. It's a real pity it's a 
dead end. I think the Cell has been very badly handled. It looks like 
Intel lack the urge to handle Larrabee any better.
>
>
> Then I had a question: just how much higher math performance do you get on
> the ION GPU vs. the CPU??
It depends entirely on the problem. Some things will get close to the 
available 50 GFLOPS. A very small number can get far more than that, 
because they can make use of the very high-throughput interpolation 
hardware. Most applications will have to do a substantial amount of the 
decision-oriented work on the main CPU, and the latency of shuttling 
between GPU and CPU will mean you get only a tiny percentage of the 
available FLOPS.
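To put a number on that shuttling cost, here is a minimal back-of-the-envelope model. The batch size and round-trip latency below are illustrative assumptions, not measured figures from any of the hardware in this thread:

```python
# Toy model: how CPU<->GPU transfer latency eats into a GPU's peak
# FLOP rate when work is sent over in batches.

def effective_gflops(peak_gflops, flops_per_batch, transfer_s):
    """Sustained rate when each batch of work pays a fixed
    CPU<->GPU round-trip cost before it can run on the GPU."""
    compute_s = flops_per_batch / (peak_gflops * 1e9)
    return flops_per_batch / ((compute_s + transfer_s) * 1e9)

# A 20 ms audio frame's worth of codec math is tiny...
small_batch = 1e5   # ~100k FLOPs per frame (illustrative guess)
latency = 50e-6     # ~50 us round trip (illustrative guess)

print(effective_gflops(50.0, small_batch, latency))  # small batches: a few % of peak
print(effective_gflops(50.0, 1e9, latency))          # huge batches: close to peak
```

With these numbers a per-frame codec workload sustains only around 2 GFLOPS out of the nominal 50, which is the "tiny percentage" effect described above.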
>
>
> Quick search and I'm seeing 2.1 GigaFLOPS on the Atom's inbuilt math
> unit.  50 GigaFLOPS on the DirectX 10.1/CUDA/OpenCL-enabled GT218
> graphics chip aka Nvidia ION GP-GPU.
>
> Atom's 1.66 GHz clock speed means it can crunch 1-1.5 floating point
> operations per clock cycle. <--- that's why the Atom is so easy to
> saturate and bog down when transcoding and/or conferencing.
I think the Atom can do an SSE operation per cycle in the best case, so 
that would be 4 x 1.66 GFLOPS peak. 2 GFLOPS might be something you could 
really obtain in a practical program. The Atom is a strictly in-order 
processor, so you need to fiddle around quite a lot with instruction 
order to get the best out of it. The Core 2 and i7 are much more 
tolerant, so code written for them might perform poorly on an Atom 
without some rework.
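A quick sanity check on that arithmetic, assuming one 4-wide single-precision SSE operation per cycle:

```python
# Peak-rate arithmetic for the Atom figure quoted above.
sse_width = 4                  # single-precision floats per SSE register
clock_ghz = 1.66               # Atom clock speed
peak = sse_width * clock_ghz   # theoretical peak, GFLOPS
print(peak)                    # 6.64

sustained = 2.0                # rough practically-achievable figure
print(sustained / peak)        # ~0.30, i.e. about 30% of peak
```

So the 2 GFLOPS "practical" figure corresponds to sustaining roughly 30% of the in-order core's theoretical peak.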
> The ION can crunch about 30 per CPU clock... freeing up the CPU to do
> other stuff.
>
> That 50 GigaFLOPS of the ION (or any compatible computing unit that
> OpenCL is able to detect) is looking like a pretty darn attractive compute
> engine to tap into... and it would be a waste of computing resources, as
> well as energy and pollution/carbon footprint, NOT to.
It looks like you need to get about 4% of the ION's available 
performance to match the Atom. Do you think you can get that much out of 
it in a codec application? Look at some of the figures for offloading 
nice regular maths, like an FFT, from the CPU to the GPU. Typically for 
an i7 and a fast nVidia board people seem to get about a 1:1 ratio for a 
128 or 256 point transform. For an 8192 point transform they might get 
several times speedup on the GPU. For a 32 or 64 point transform the CPU 
is usually much faster. The Fermi lets you keep more activities in 
flight, so it may do better than this. This is what interests me about 
the Fermi right now.
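The break-even behaviour can be sketched with a toy cost model. The per-operation times and launch overhead below are illustrative guesses, chosen only to reproduce the rough ratios described above, not measurements of any real CPU or GPU:

```python
# Toy cost model for offloading an N-point FFT (~N log N work) to a
# GPU that is faster per operation but pays a fixed launch/transfer
# overhead on every call.
import math

def cpu_time(n, per_op=5e-9):
    # CPU: no offload overhead, slower per operation (assumed figure)
    return n * math.log2(n) * per_op

def gpu_time(n, per_op=2e-9, overhead=10e-6):
    # GPU: faster per operation, but a fixed per-call cost (assumed)
    return overhead + n * math.log2(n) * per_op

for n in (64, 256, 8192):
    # ratio > 1 means the GPU call finishes sooner than the CPU
    print(n, cpu_time(n) / gpu_time(n))
```

With these assumed constants the 64-point transform stays much faster on the CPU, 256 points is roughly break-even, and 8192 points comes out a few times faster on the GPU, matching the pattern people report.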
>
>
> I did a search and found that this EXACT issue was brought up 3.5
> years ago on this list, March 2007 -- using a GP-GPU for
> codecs/conferencing with the Nvidia 8800GT.
>
> http://lists.digium.com/pipermail/asterisk-dev/2007-March/026431.html
>
>
> I would be curious to see that code -- or a more updated version -- using
> TODAY's GP-GPU libraries to talk to the ION, a 2-years-newer GP-GPU chip
> -- and see how it performs.
What code? I see no mention there of actual code. The ION is based on 
the older, less flexible GPUs. The latest software won't change things 
much. The bottleneck is in the hardware.
> Only then will we REALLY have the answer, to the question -- how much
> will the asterisk community benefit from 50 GigaFLOPS of free GP-GPU
> horsepower offered by the Nvidia ION ??
>
> It's hard to know, until you try it out and see...
So, you don't believe the string of failures was competently implemented?

Steve





More information about the asterisk-dev mailing list