[asterisk-users] Problem with new AEX800 card dying because of interrupt problems
Shaun Ruffell
sruffell at digium.com
Wed Sep 8 17:25:49 CDT 2010
First off, Digium technical support should be able to help you troubleshoot.
On 09/08/2010 03:27 PM, Christian Weeks wrote:
> On Wed, 2010-09-08 at 11:06 -0500, Shaun Ruffell wrote:
>> On 09/08/2010 10:38 AM, Christian Weeks wrote:
>>> So I am asking the list, do you have any advice except perhaps to go
>>> back to the broken channel bank? Is it really true that my modern server
>>> class machine (quad core xeon) cannot handle the AEX800, whereas my
>>> seven year old AMD desktop (previous host to the T1) could handle what
>>> seems to have been about 3x the capacity? Isn't this a massive
>>> regression?
>>
>> Does the AEX800 work fine in your old AMD desktop? If the wctdm24xxp
>> driver is having problems servicing the interrupt in a timely fashion in
>> your server I would be surprised if other cards in the same system
>> wouldn't also experience high interrupt latencies, which would probably
>> manifest themselves as pops and noise on the channels.
> OK. The AEX800 can't go in the old server- it's a PCI express card and
> the AMD doesn't have a PCI express slot (it's that old). With respect to your
> comment about the latencies on the other channels, there is none that is
> noticeable. The other card (the older PCI card) has absolutely no
> problems at all- it's getting clear audio. In fact, so is the new card-
> there's not a sign of anything wrong with it at all, except it suddenly
> stops working with these interrupt errors. Which is why I suspect the
> driver (esp. given some of the fixes in the dahdi 2.4 release) rather
> than the card or the computer.
Is there anything else in dmesg or /var/log/messages when the card
suddenly stops working with the interrupt errors? Do you see messages
about the latency increasing at some regular interval (i.e. every hour)?
I've seen systems that have flash and SATA drives where the flash
drives are connected as /dev/hda and periodically flushing them can
cause huge latencies even though you don't see this happening at runtime.
>> a) checking the transfer rate to your hard drive ('hdparm -t
>> /dev/[sda|hda]'). If it's below 4MB/s that's the likely culprit.
>> Sometimes setting the kernel command line parameter to "hda=none" can
>> help depending on the kernel version you're using. I've also seen slow
>> transfer rates fixed by changing BIOS settings.
>
> /dev/sdb:
> Timing buffered disk reads: 190 MB in 3.03 seconds = 62.71 MB/sec
>
> Hmm, don't think that's the culprit, somehow. The server has spent two
> years before being repurposed as a phone server as a disk server for
> mythtv. I'd have noticed disk latency on it a long time ago.
I've also seen cases where the latency increases like you describe
because of poorly implemented X video drivers. Are you running without
X installed on this server? Do you have a serial console connected?
Slow baud rates on a serial console can be correlated with an inability to
service the interrupts on some systems.
>
>>
>> b) Use cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest) and
>> then stress your system to make sure maximum latencies remain low
>> without DAHDI loaded. System Management Interrupts / Baseboard
>> Management Controllers can cause problems here on some servers.
>
> OK. I'm not sure which tests I need to run here.
>
> Here's a run at idle:
> :~# cyclictest -t -p 80 -n -l 10000
> policy: fifo: loadavg: 0.03 0.02 0.00 1/210 16899
>
> T: 0 (16896) P:80 I:1000 C: 10000 Min: 8 Act: 16 Avg: 22 Max: 568
> T: 1 (16897) P:79 I:1500 C: 6673 Min: 8 Act: 12 Avg: 25 Max: 119
> T: 2 (16898) P:78 I:2000 C: 5005 Min: 9 Act: 14 Avg: 24 Max: 150
> T: 3 (16899) P:77 I:2500 C: 4004 Min: 8 Act: 13 Avg: 30 Max: 420
>
> And here's one with some cpu load:
>
> :~# cyclictest -t -p 80 -n -l 10000
> policy: fifo: loadavg: 0.82 0.35 0.12 3/217 17212
>
> T: 0 (17209) P:80 I:1000 C: 10000 Min: 8 Act: 14 Avg: 26 Max: 8047
> T: 1 (17210) P:79 I:1500 C: 6667 Min: 8 Act: 12 Avg: 15 Max: 820
> T: 2 (17211) P:78 I:2000 C: 5001 Min: 7 Act: 17 Avg: 34 Max: 8184
> T: 3 (17212) P:77 I:2500 C: 4001 Min: 9 Act: 40 Avg: 27 Max: 8786
>
> Max is higher (obviously) but there's not really any evidence of a
> significant difference in latency between the two runs, and it looks well
> below your threshold (I think that's usecs for those numbers, so it's
> about 3 orders of magnitude slower).
The cyclictest output looks good. What about when running disk transfer
tests on all the SATA / IDE drives installed? You could also start up
cyclictest while your system is attempting to operate normally and see
if DAHDI and cyclictest agree on what the latency is.
Do you get the same results for cyclictest when you run from the console
(serial, X, whatever) as you do when running via ssh?
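To compare several cyclictest runs (console vs. ssh, idle vs. loaded) without eyeballing them, something like the following Python sketch could help. It assumes the numbers are in microseconds, as you guessed, and that each thread's report sits on a single line. The 25ms and 128ms thresholds are the ones mentioned in this thread; the function and constant names are just illustrative.

```python
import re

# Thresholds mentioned in this thread, in milliseconds.
AUDIO_PROBLEM_MS = 25      # latencies above this are said to affect audio
DRIVER_TOLERANCE_MS = 128  # latencies above this should be fixed first

# Matches one per-thread report line, for example:
# "T: 0 (17209) P:80 I:1000 C: 10000 Min: 8 Act: 14 Avg: 26 Max: 8047"
THREAD_LINE = re.compile(r"T:\s*(\d+)\s*\(\s*\d+\).*?Max:\s*(\d+)")

def worst_case_ms(cyclictest_output):
    """Map thread number -> maximum latency in ms (input assumed to be usec)."""
    return {
        int(thread): int(usec) / 1000.0
        for thread, usec in THREAD_LINE.findall(cyclictest_output)
    }

def classify(latency_ms):
    """Label a latency against the two thresholds from this thread."""
    if latency_ms > DRIVER_TOLERANCE_MS:
        return "above 128 ms"
    if latency_ms > AUDIO_PROBLEM_MS:
        return "above 25 ms"
    return "ok"

# The loaded run quoted above, with each thread on one line.
sample = """\
T: 0 (17209) P:80 I:1000 C: 10000 Min: 8 Act: 14 Avg: 26 Max: 8047
T: 1 (17210) P:79 I:1500 C: 6667 Min: 8 Act: 12 Avg: 15 Max: 820
T: 2 (17211) P:78 I:2000 C: 5001 Min: 7 Act: 17 Avg: 34 Max: 8184
T: 3 (17212) P:77 I:2500 C: 4001 Min: 9 Act: 40 Avg: 27 Max: 8786
"""

for thread, ms in sorted(worst_case_ms(sample).items()):
    print(thread, ms, classify(ms))
```

Running it over the output captured in each environment makes it easier to spot a run where the maximum crosses one of the two thresholds.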
>
>> If cyclictest shows you have some maximum latency above 128ms, I
>> would recommend trying to fix that first, but if for some reason you
>> can't, you could trade some of your system memory for increased
>> tolerance to system conditions by editing the DRING_SIZE in
>> drivers/dahdi/voicebus.h to 256 or 512 depending on what cyclictest
>> reported as your maximum latency. Keep in mind this isn't a "fix"
>> since you'll still have problems in your audio for any latency above 25ms.
>
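To make the DRING_SIZE suggestion above a little more concrete, here is a rough Python sketch of how one might pick between the two values. It rests on an assumption that is not stated anywhere in this thread: that each ring entry covers roughly 1ms, so a ring of N entries tolerates about N ms of latency. The helper name and constants are hypothetical; check the driver source before relying on the mapping.

```python
# Latency the text above treats as the limit before changing DRING_SIZE (ms).
TOLERATED_MS = 128

# The two DRING_SIZE values suggested above.
CANDIDATE_SIZES = (256, 512)

def suggest_dring_size(max_latency_ms):
    """Return a suggested DRING_SIZE, or None if no change looks necessary.

    ASSUMPTION: one ring entry covers about 1 ms, so a ring of N entries
    tolerates roughly N ms. This is an illustration, not driver behaviour
    confirmed by the thread.
    """
    if max_latency_ms <= TOLERATED_MS:
        return None
    for size in CANDIDATE_SIZES:
        if max_latency_ms < size:
            return size
    raise ValueError("latency exceeds the largest ring size mentioned here")

print(suggest_dring_size(8.786))  # worst case from the loaded cyclictest run
print(suggest_dring_size(200))
```

Even with a larger ring, anything above 25ms would still be audible, so this only decides how much memory to trade for tolerance.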
> I'm not sure where to go from here. Every diagnostic seems to be telling
> the same story- the computer is fine. Is it possible I have a hardware
> problem somehow? Maybe there's something wrong with the card?
>
It's possible there is a hardware problem. I've also seen cases where
PCI-E interrupts were not delivered properly (e.g.
http://svn.asterisk.org/view/dahdi?view=revision&revision=9037 ). It's
possible to set the AEX800 to poll the interface and not rely on the
host to route the interrupts by defining CONFIG_VOICEBUS_TIMER in
drivers/dahdi/voicebus/voicebus.h (although that's not part of the
"official" interface; it's in there as a debugging aid).
You're right though in that 2.4.0 does contain a fix (
http://svn.asterisk.org/view/dahdi?view=revision&revision=8982 ) for a
regression introduced in 2.3.0 on *some* systems. That being said,
regressions in the driver causing interrupt problems have been the
exception not the rule. The vast majority of the problems specifically
with latency have been system related.
Hopefully this helps. Again, Digium technical support can help you with
more troubleshooting if needed.
--
Shaun Ruffell
Digium, Inc. | Linux Kernel Developer
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at: www.digium.com & www.asterisk.org