[asterisk-dev] GPU Audio Codec Transcoding within Asterisk PBX
Joseph Benden
joe at thrallingpenguin.com
Thu Jan 1 20:46:36 CST 2009
GPU Audio Codec Transcoding within Asterisk PBX
===============================================
Abstract
--------
This article describes a failed attempt to use GPU technology to
optimize transcoding within the Asterisk PBX system. GPU technologies,
such as those offered by nVidia, offload algorithmic processing from
the CPU onto specialized processors called GPUs. An algorithm is
compiled and transferred to a GPU, where the GPU performs
floating-point and/or integer math very quickly by utilizing parallel
threads of execution.
The goal of this project was to utilize the high-end GPU offering from
nVidia, the Tesla C1060, which is a PCIe x16 card offering a peak
processing capability of 933 GFLOPS, to perform a large number of
transcoding operations. The Tesla C1060 is supported on Linux and
Windows operating systems.
A Project Failure
-----------------
The project started with transcoding g.711u to signed linear and the
reverse. It was thought that this pairing would be reasonably
representative of other transcoding operations.
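The G.711 mu-law to signed-linear direction is small enough to show
inline. The following is the standard textbook expansion, not
Asterisk's actual codec_ulaw implementation (which uses a precomputed
256-entry lookup table for the same mapping):

```c
#include <stdint.h>

/* Expand one G.711 mu-law byte into a 16-bit signed linear sample.
 * Standard G.711 expansion; shown for illustration only. */
static int16_t ulaw_to_slin(uint8_t u)
{
    int t;

    u = ~u;                            /* mu-law bytes are stored complemented */
    t = ((u & 0x0F) << 3) + 0x84;      /* 4-bit mantissa, plus the 0x84 bias */
    t <<= (u & 0x70) >> 4;             /* scale by the 3-bit exponent */
    return (int16_t)((u & 0x80) ? (0x84 - t) : (t - 0x84));
}
```

Because the input space is only 256 values, a per-sample lookup table
makes the CPU cost of this particular transcode almost negligible,
which is part of why the GPU had so little to win here.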
The nVidia GPUs utilize a C API called CUDA, which gives any program
access to the power of the GPU.
Some important aspects of CUDA are:
- CUDA requires that, in a multi-threaded application, the same thread
operate all aspects of the GPU. Multiple threads may create many
different contexts with CUDA; however, performance will decrease as
the contexts contend with each other.
- CUDA GPU threads are secondary to graphics on nVidia graphics cards,
meaning that if a CUDA-capable nVidia graphics card were used, all
graphics handling would trump any CUDA operations.
- CUDA recommends that GPU threads execute in groups of at least 32 to
64, with the optimal number being 256.
- Memory should be moved in blocks of at least 16 by 16 by 4, or
4,096, bytes. These blocks of memory are referred to as cells/blocks
(the documentation's terminology is unclear), which have four vectors
each.
- Memory should be properly aligned and page locked.
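The last two points can be approximated from plain C on POSIX systems.
A minimal sketch using posix_memalign and mlock follows; in a real
CUDA program, cudaMallocHost() would normally be preferred, since it
also registers the memory with the driver:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Allocate a staging buffer that is 4,096-byte aligned and page
 * locked, per the CUDA guidelines above. Plain-POSIX illustration
 * only; it does not register the memory with the CUDA driver. */
static void *alloc_staging_buffer(size_t len)
{
    void *buf = NULL;

    if (posix_memalign(&buf, 4096, len) != 0)
        return NULL;
    memset(buf, 0, len);
    /* mlock() may fail without privileges (RLIMIT_MEMLOCK); the
     * buffer is still usable, just not guaranteed resident. */
    (void) mlock(buf, len);
    return buf;
}
```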
In attempting this project, the following information was learned:
- Memory copy overhead and latency.
CUDA recommends that memory buffers coming from the application be
transferred to the GPU in a staged approach to maximize the parallel
activity of the GPU. While this is a perfectly reasonable
recommendation, it introduces additional latency in our real-time
processing environment. It is important to remember that these
latencies are a trade-off: if the algorithm has a marked improvement
in execution time which offsets the introduced memory latency, then it
is completely reasonable to go this route. This would be perfectly
acceptable for video transcoding, because of the larger amount of data
to process and the possibility of more complex algorithms. This is not
the case for audio transcoding.
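The trade-off can be made concrete with a back-of-the-envelope model.
The constants in the example below are illustrative, not measurements:
the GPU only wins once the batch is large enough that its per-frame
savings outweigh the fixed transfer overhead.

```c
/* Smallest batch size (in frames) at which the GPU path beats the
 * CPU path, given a fixed per-batch transfer overhead and per-frame
 * costs in microseconds:
 *   CPU cost: n * cpu_us_per_frame
 *   GPU cost: xfer_us + n * gpu_us_per_frame
 * Break-even: n > xfer_us / (cpu_us_per_frame - gpu_us_per_frame).
 * All figures are hypothetical; real values must be benchmarked. */
static long gpu_break_even_frames(double xfer_us,
                                  double cpu_us_per_frame,
                                  double gpu_us_per_frame)
{
    double saved = cpu_us_per_frame - gpu_us_per_frame;

    if (saved <= 0.0)
        return -1; /* GPU never wins per frame: no break-even point */
    return (long)(xfer_us / saved) + 1;
}
```

For small audio frames the per-frame saving is tiny, so the break-even
batch size becomes large enough to add unacceptable latency.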
Architectural Aspects
---------------------
In order to maximize the number of simultaneous transcoding
operations, Asterisk PBX would require a separate thread of execution
to handle all CUDA operations. All transcode requests must be queued
from the channel threads onto a circular queue (with an implementation
specifically chosen to minimize thread contention, e.g. wait-free or
lock-free). The CUDA thread would then be able to coalesce multiple
waiting transcodes into a single processing request to the GPU.
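As a sketch of the queueing side, here is a single-producer,
single-consumer lock-free ring using C11 atomics. This is the simplest
of the lock-free families; a real Asterisk integration would need a
multi-producer variant, since many channel threads enqueue. The
request structure and field names are invented for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256 /* must be a power of two */

/* Hypothetical transcode request; carrying both source and
 * destination codec enables the direct A-to-B path described later. */
struct transcode_req {
    int src_codec;
    int dst_codec;
    void *frame;
};

struct ring {
    struct transcode_req slots[RING_SIZE];
    _Atomic size_t head; /* advanced by the consumer (CUDA thread) */
    _Atomic size_t tail; /* advanced by the producer (channel thread) */
};

static bool ring_push(struct ring *r, struct transcode_req req)
{
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);

    if (t - h == RING_SIZE)
        return false; /* full: caller falls back to a CPU transcode */
    r->slots[t & (RING_SIZE - 1)] = req;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}

static bool ring_pop(struct ring *r, struct transcode_req *out)
{
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (h == t)
        return false; /* empty */
    *out = r->slots[h & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}
```

The CUDA thread would pop until the ring is empty (or a batch-size cap
is reached), then issue one coalesced request to the GPU.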
It would be most wise to implement dynamic transcoding back-engines,
such that when the transcoding thread in Asterisk PBX is ready to
coalesce, it takes the current count of transcode operations and uses
this to properly select which engine to use. In testing with audio
streams, the following progression was observed as the number of
simultaneous operations grows:
Single-threaded typical transcode algorithm <= SIMD transcode
algorithm <= CUDA transcode algorithm
For any given transcode, there is a point at which one of the above
implementations is best suited. By creating a standalone tool, these
values can be measured for each hardware environment and properly
configured for any individual machine.
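Dynamic back-engine selection then reduces to comparing the queued
count against the two measured crossover points. A sketch follows; the
threshold values are placeholders that the proposed standalone tool
would replace with per-machine measurements:

```c
#include <stddef.h>

enum transcode_engine { ENGINE_SCALAR, ENGINE_SIMD, ENGINE_CUDA };

/* Crossover points in simultaneous transcode operations. These
 * numbers are hypothetical; the article proposes measuring them on
 * each individual machine. */
#define SIMD_THRESHOLD 8
#define CUDA_THRESHOLD 64

static enum transcode_engine pick_engine(size_t queued_ops)
{
    if (queued_ops >= CUDA_THRESHOLD)
        return ENGINE_CUDA;
    if (queued_ops >= SIMD_THRESHOLD)
        return ENGINE_SIMD;
    return ENGINE_SCALAR;
}
```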
Finally, because of the architecture, this would allow for the
structure that is placed onto the circular queue to contain the source
codec and destination codec. If two calls are bridged and Asterisk PBX
does not need information from the stream, it would be possible to
directly transcode from call A's codec to call B's codec. This would
also be more efficient because the request pushed to CUDA is a single
request for two separate CUDA "kernels" (functions, in C terminology)
to execute, with only a single memory-transfer overhead. Contrast this
with having to transcode call A's codec to signed linear, then signed
linear to call B's codec - with multiple memory buffer transfers and
multiple trips through the circular queue.
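One way to express the direct A-to-B path on the host side is a pair
of per-codec function pointers, so the signed-linear intermediate
never leaves the transcoder. The names and signatures below are
invented for illustration and are not Asterisk's translator API; the
demo codec pair is a trivial stand-in for real codecs:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-codec hooks: decode to signed linear, and encode
 * from it. */
typedef int16_t (*decode_fn)(uint8_t in);
typedef uint8_t (*encode_fn)(int16_t in);

/* Transcode n samples from codec A straight to codec B in one pass,
 * keeping the signed-linear intermediate in a register rather than in
 * a second queued buffer. The GPU analogue runs the two kernels
 * back-to-back with a single host<->device transfer. */
static void transcode_direct(const uint8_t *src, uint8_t *dst,
                             size_t n, decode_fn dec, encode_fn enc)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = enc(dec(src[i]));
}

/* Trivial demo codec pair (identity), standing in for real codecs. */
static int16_t demo_decode(uint8_t in) { return (int16_t)in; }
static uint8_t demo_encode(int16_t in) { return (uint8_t)in; }
```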
Other Thoughts
--------------
While GPU transcoding is not optimal at this time for audio
transcoding, using SIMD instructions to optimize hot-spots within
codecs is a completely worthwhile investment. SIMD instructions give
immediate benefits in exactly the areas of code where GPU processing
was expected to yield a marked improvement, because SIMD has immediate
and quick access to the memory regions holding the buffers being
transcoded.
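A concrete hot-spot example is mixing two signed-linear buffers.
Written as a plain, branch-light loop over contiguous memory, a modern
compiler can map it onto SIMD instructions directly; this is an
illustrative routine, not code from an Asterisk codec:

```c
#include <stdint.h>
#include <stddef.h>

/* Saturating add of two signed-linear buffers (e.g., mixing two call
 * legs). Each iteration is independent and touches contiguous memory,
 * so the loop is a good auto-vectorization candidate (SSE2's PADDSW
 * performs exactly this saturating 16-bit add). */
static void slin_mix(int16_t *dst, const int16_t *a, const int16_t *b,
                     size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t s = (int32_t)a[i] + (int32_t)b[i];

        if (s > INT16_MAX)
            s = INT16_MAX;
        else if (s < INT16_MIN)
            s = INT16_MIN;
        dst[i] = (int16_t)s;
    }
}
```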
GPU usage will become important if Asterisk PBX deals with video
streams. At that point, the topic should be brought back into the
light.
References
----------
nVidia Tesla C1060
http://www.nvidia.com/object/product_tesla_c1060_us.html
CUDA
http://www.nvidia.com/cuda
SIMD
http://en.wikipedia.org/wiki/SIMD
About the Author
----------------
Joseph Benden is the owner of Thralling Penguin LLC. Thralling Penguin
designs, develops, and extends software technologies for the most
demanding business applications, as well as offering VoIP Consulting
services.