[asterisk-dev] GPU Audio Codec Transcoding within Asterisk PBX
Joseph Benden
joe at thrallingpenguin.com
Thu Jan 1 20:46:36 CST 2009
GPU Audio Codec Transcoding within Asterisk PBX
===============================================
Abstract
--------
This article describes a failed attempt to use GPU technology to
optimize transcoding within the Asterisk PBX system. GPU technologies,
such as those offered by nVidia, offload algorithmic processing from
the CPU onto specialized processors called GPUs. An algorithm is
compiled and transferred to a GPU, where the GPU performs
floating-point and/or integer math very quickly by utilizing parallel
threads of execution.
The goal of this project was to utilize the high-end GPU offering from
nVidia, the Tesla C1060, which is a PCIe x16 card offering a peak
processing capability of 933 GFLOPS, to perform a large number of
transcoding operations. The Tesla C1060 is supported on Linux and
Windows operating systems.
A Project Failure
-----------------
The project started with transcoding g.711u to signed linear and the
reverse. It was thought that this pairing would be reasonably
representative of other transcoding operations.
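The G.711 mu-law to signed-linear direction is small enough to show
inline. The following is the standard textbook expansion, not
Asterisk's actual codec_ulaw implementation (which uses a precomputed
256-entry lookup table for the same mapping):

```c
#include <stdint.h>

/* Expand one G.711 mu-law byte into a 16-bit signed linear sample.
 * Standard G.711 expansion; shown for illustration only. */
static int16_t ulaw_to_slin(uint8_t u)
{
    int t;

    u = ~u;                            /* mu-law bytes are stored complemented */
    t = ((u & 0x0F) << 3) + 0x84;      /* 4-bit mantissa, plus the 0x84 bias */
    t <<= (u & 0x70) >> 4;             /* scale by the 3-bit exponent */
    return (int16_t)((u & 0x80) ? (0x84 - t) : (t - 0x84));
}
```

Because the input space is only 256 values, a per-sample lookup table
makes the CPU cost of this particular transcode almost negligible,
which is part of why the GPU had so little to win here.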
The nVidia GPUs utilize a C API called CUDA, which gives any program
access to the power of the GPU.
Some important aspects of CUDA are:
- CUDA requires that, in a multi-threaded application, the same thread
operate all aspects of the GPU. Multiple threads may create many
different contexts with CUDA; however, performance will decrease as
the contexts contend with each other.
- CUDA GPU threads are secondary to graphics on nVidia graphics cards,
meaning that if a CUDA-capable nVidia graphics card were used, all
graphics handling would trump any CUDA operations.
- CUDA recommends that GPU threads execute in groups of at least 32 to
64, with the optimal number being 256.
- Memory should be moved in blocks of at least 16 by 16 by 4, or
4,096, bytes. These blocks of memory are referred to as cells/blocks
(the documentation's terminology is unclear), which have four vectors
each.
- Memory should be properly aligned and page locked.
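The last two points can be approximated from plain C on POSIX systems.
A minimal sketch using posix_memalign and mlock follows; in a real
CUDA program, cudaMallocHost() would normally be preferred, since it
also registers the memory with the driver:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Allocate a staging buffer that is 4,096-byte aligned and page
 * locked, per the CUDA guidelines above. Plain-POSIX illustration
 * only; it does not register the memory with the CUDA driver. */
static void *alloc_staging_buffer(size_t len)
{
    void *buf = NULL;

    if (posix_memalign(&buf, 4096, len) != 0)
        return NULL;
    memset(buf, 0, len);
    /* mlock() may fail without privileges (RLIMIT_MEMLOCK); the
     * buffer is still usable, just not guaranteed resident. */
    (void) mlock(buf, len);
    return buf;
}
```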
In attempting this project, the following information was learned:
- Memory copy overhead and latency.
CUDA recommends that memory buffers coming from the application be
transferred to the GPU in a staged approach to maximize the parallel
activity of the GPU. While this is a perfectly reasonable
recommendation, it introduces additional latency in our real-time
processing environment. It is important to remember that these
latencies are a trade-off: if the algorithm has a marked improvement
in execution time which offsets the introduced memory latency, then it
is completely reasonable to go this route. This would be perfectly
acceptable for video transcoding, because of the larger amount of data
to process and the possibility of more complex algorithms. This is not
the case for audio transcoding.
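The trade-off can be made concrete with a back-of-the-envelope model.
The constants in the example below are illustrative, not measurements:
the GPU only wins once the batch is large enough that its per-frame
savings outweigh the fixed transfer overhead.

```c
/* Smallest batch size (in frames) at which the GPU path beats the
 * CPU path, given a fixed per-batch transfer overhead and per-frame
 * costs in microseconds:
 *   CPU cost: n * cpu_us_per_frame
 *   GPU cost: xfer_us + n * gpu_us_per_frame
 * Break-even: n > xfer_us / (cpu_us_per_frame - gpu_us_per_frame).
 * All figures are hypothetical; real values must be benchmarked. */
static long gpu_break_even_frames(double xfer_us,
                                  double cpu_us_per_frame,
                                  double gpu_us_per_frame)
{
    double saved = cpu_us_per_frame - gpu_us_per_frame;

    if (saved <= 0.0)
        return -1; /* GPU never wins per frame: no break-even point */
    return (long)(xfer_us / saved) + 1;
}
```

For small audio frames the per-frame saving is tiny, so the break-even
batch size becomes large enough to add unacceptable latency.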
Architectural Aspects
---------------------
In order to maximize the number of simultaneous transcoding
operations, Asterisk PBX would require a separate thread of execution
to handle all CUDA operations. All transcode requests must be queued
from the channel threads onto a circular queue (with an implementation
specifically chosen to minimize thread contention, e.g. wait-free or
lock-free). The CUDA thread would then be able to coalesce multiple
waiting transcodes into a single processing request to the GPU.
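As a sketch of the queueing side, here is a single-producer,
single-consumer lock-free ring using C11 atomics. This is the simplest
of the lock-free families; a real Asterisk integration would need a
multi-producer variant, since many channel threads enqueue. The
request structure and field names are invented for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256 /* must be a power of two */

/* Hypothetical transcode request; carrying both source and
 * destination codec enables the direct A-to-B path described later. */
struct transcode_req {
    int src_codec;
    int dst_codec;
    void *frame;
};

struct ring {
    struct transcode_req slots[RING_SIZE];
    _Atomic size_t head; /* advanced by the consumer (CUDA thread) */
    _Atomic size_t tail; /* advanced by the producer (channel thread) */
};

static bool ring_push(struct ring *r, struct transcode_req req)
{
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);

    if (t - h == RING_SIZE)
        return false; /* full: caller falls back to a CPU transcode */
    r->slots[t & (RING_SIZE - 1)] = req;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}

static bool ring_pop(struct ring *r, struct transcode_req *out)
{
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (h == t)
        return false; /* empty */
    *out = r->slots[h & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}
```

The CUDA thread would pop until the ring is empty (or a batch-size cap
is reached), then issue one coalesced request to the GPU.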
It would be most wise to implement dynamic transcoding back-engines,
such that when the transcoding thread in Asterisk PBX is ready to
coalesce, it takes the current count of transcode operations and uses
this to properly select which engine to use. In testing with audio
streams, the following progression was observed as the number of
simultaneous operations grows:
Single-threaded typical transcode algorithm <= SIMD transcode
algorithm <= CUDA transcode algorithm
For any given transcode, there is a point at which one of the above
implementations is best suited. By creating a standalone tool, these
values can be measured for each hardware environment and properly
configured for any individual machine.
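Dynamic back-engine selection then reduces to comparing the queued
count against the two measured crossover points. A sketch follows; the
threshold values are placeholders that the proposed standalone tool
would replace with per-machine measurements:

```c
#include <stddef.h>

enum transcode_engine { ENGINE_SCALAR, ENGINE_SIMD, ENGINE_CUDA };

/* Crossover points in simultaneous transcode operations. These
 * numbers are hypothetical; the article proposes measuring them on
 * each individual machine. */
#define SIMD_THRESHOLD 8
#define CUDA_THRESHOLD 64

static enum transcode_engine pick_engine(size_t queued_ops)
{
    if (queued_ops >= CUDA_THRESHOLD)
        return ENGINE_CUDA;
    if (queued_ops >= SIMD_THRESHOLD)
        return ENGINE_SIMD;
    return ENGINE_SCALAR;
}
```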
Finally, because of the architecture, this would allow for the
structure that is placed onto the circular queue to contain the source
codec and destination codec. If two calls are bridged and Asterisk PBX
does not need information from the stream, it would be possible to
directly transcode from call A's codec to call B's codec. This would
also be more efficient because the request pushed to CUDA is a single
request for two separate CUDA "kernels" (functions, in C terminology)
to execute, with only a single memory-transfer overhead. Contrast this
with having to transcode call A's codec to signed linear, then signed
linear to call B's codec - with multiple memory buffer transfers and
multiple trips through the circular queue.
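One way to express the direct A-to-B path on the host side is a pair
of per-codec function pointers, so the signed-linear intermediate
never leaves the transcoder. The names and signatures below are
invented for illustration and are not Asterisk's translator API; the
demo codec pair is a trivial stand-in for real codecs:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-codec hooks: decode to signed linear, and encode
 * from it. */
typedef int16_t (*decode_fn)(uint8_t in);
typedef uint8_t (*encode_fn)(int16_t in);

/* Transcode n samples from codec A straight to codec B in one pass,
 * keeping the signed-linear intermediate in a register rather than in
 * a second queued buffer. The GPU analogue runs the two kernels
 * back-to-back with a single host<->device transfer. */
static void transcode_direct(const uint8_t *src, uint8_t *dst,
                             size_t n, decode_fn dec, encode_fn enc)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = enc(dec(src[i]));
}

/* Trivial demo codec pair (identity), standing in for real codecs. */
static int16_t demo_decode(uint8_t in) { return (int16_t)in; }
static uint8_t demo_encode(int16_t in) { return (uint8_t)in; }
```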
Other Thoughts
--------------
While GPU transcoding is not optimal at this time for audio
transcoding, using SIMD instructions to optimize hot-spots within
codecs is a completely worthwhile investment. SIMD instructions give
immediate benefits in exactly the areas of code where GPU processing
was expected to yield a marked improvement, because SIMD has immediate
and quick access to the memory regions holding the buffers being
transcoded.
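A concrete hot-spot example is mixing two signed-linear buffers.
Written as a plain, branch-light loop over contiguous memory, a modern
compiler can map it onto SIMD instructions directly; this is an
illustrative routine, not code from an Asterisk codec:

```c
#include <stdint.h>
#include <stddef.h>

/* Saturating add of two signed-linear buffers (e.g., mixing two call
 * legs). Each iteration is independent and touches contiguous memory,
 * so the loop is a good auto-vectorization candidate (SSE2's PADDSW
 * performs exactly this saturating 16-bit add). */
static void slin_mix(int16_t *dst, const int16_t *a, const int16_t *b,
                     size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t s = (int32_t)a[i] + (int32_t)b[i];

        if (s > INT16_MAX)
            s = INT16_MAX;
        else if (s < INT16_MIN)
            s = INT16_MIN;
        dst[i] = (int16_t)s;
    }
}
```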
GPU usage will become important if Asterisk PBX deals with video
streams. At that point, the topic should be brought back into the
light.
References
----------
nVidia Tesla C1060
http://www.nvidia.com/object/product_tesla_c1060_us.html
CUDA
http://www.nvidia.com/cuda
SIMD
http://en.wikipedia.org/wiki/SIMD
About the Author
----------------
Joseph Benden is the owner of Thralling Penguin LLC. Thralling Penguin
designs, develops, and extends software technologies for the most
demanding business applications, as well as offering VoIP Consulting
services.