[asterisk-users] How are your PRI interrupts balanced? (+ Soft lockup BUG)
James Lamanna
jlamanna at gmail.com
Tue Mar 30 00:29:53 CDT 2010
On Mon, Mar 29, 2010 at 9:23 PM, James Lamanna <jlamanna at gmail.com> wrote:
> On Mon, Mar 29, 2010 at 8:38 PM, Matt Watson <matt at mattgwatson.ca> wrote:
>> Dell server by any chance?
>> I have a similar problem with a TE220B in a Dell 1950 III server - I've seen
>> several other people having issues with Digium cards in Dell servers as
>> well.
>> I've actually done something similar to what you have done - isolated the
>> TE220B onto its own IRQ and set processor affinity for all the IRQs to
>> particular cores... so far I haven't had kernel panics since doing this, but
>> it's still a little too early to say if it has fixed the issue 100% or not.
>
> Interesting. It is actually a Dell SC1425 with two dual-core Xeon processors.
> I'm hopefully going to be able to stress test this machine to see if I
> can make it panic again with the PRI card IRQ isolated to CPU0. If so,
> I'll see if it does the same thing on the other cores...
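For reference, pinning the card's IRQ to a single core as described above is done
by writing a CPU bitmask to /proc/irq/<N>/smp_affinity (as root). A minimal
sketch, assuming IRQ 209 as in the /proc/interrupts dump quoted below:

```shell
# smp_affinity takes a hex bitmask: bit N set means CPU N may service the IRQ.
cpu_mask() {
    printf '%x\n' "$((1 << $1))"
}

# As root, pin the wct4xxp IRQ (209 in the dump below) to CPU0:
#   echo "$(cpu_mask 0)" > /proc/irq/209/smp_affinity
#   cat /proc/irq/209/smp_affinity    # now reads 1 (CPU0 only)
```

Note that irqbalance, if running, may rewrite the mask later, so it is worth
stopping it (or excluding the IRQ) when testing affinity changes.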
As a data point, I tried stress testing this box this evening. I moved
the interrupt to each core in turn; the results did not change.
The test was as follows:
Originate() a call that goes out to the PSTN and comes back in. Both
sides used Milliwatt() to make sure audio flowed both ways.
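A hypothetical reconstruction of that load test using Asterisk call files (each
file Originates one outbound call whose answered leg runs Milliwatt(); the Zap
group and dialed number are placeholders, not taken from the post):

```shell
# One call file per call, dropped into the outgoing spool directory.
SPOOL=${SPOOL:-$(mktemp -d)}   # real spool: /var/spool/asterisk/outgoing
for i in $(seq 1 30); do
    f=$(mktemp)
    cat > "$f" <<EOF
Channel: Zap/g1/5551234
MaxRetries: 0
Application: Milliwatt
EOF
    # mv is atomic on the same filesystem, so Asterisk never reads a partial file.
    mv "$f" "$SPOOL/loadtest-$i.call"
done
```

The incoming side of the loop would then answer each call in the dialplan and
run Milliwatt() there as well, so audio flows in both directions.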
I generated 30 calls this way (to use 60 PRI channels); however, I was
never able to keep all 60 channels up simultaneously. During every test
there would be a D-channel down/up event, which dropped all calls on
that PRI span.
I do not know if this is a Zaptel issue (1.4.12), PRI card issue
(TE401P first-gen), or something more subtle...
Any help would be appreciated!
Thanks.
-- James
>
> -- James
>
>> --
>> Matt
>>
>> On Mon, Mar 29, 2010 at 8:30 PM, James Lamanna <jlamanna at gmail.com> wrote:
>>>
>>> Hi,
>>> I'm trying to figure out the cause of a soft lockup I experienced:
>>>
>>> Mar 29 09:38:24 pstn1 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [asterisk:32029]
>>> Mar 29 09:38:24 pstn1 kernel: Pid: 32029, comm: asterisk
>>> Mar 29 09:38:24 pstn1 kernel: EIP: 0060:[<c046e7fe>] CPU: 0
>>> Mar 29 09:38:24 pstn1 kernel: EIP is at kfree+0x68/0x6c
>>> Mar 29 09:38:24 pstn1 kernel: EFLAGS: 00000286 Tainted: GF (2.6.18-128.1.10.el5 #1)
>>> Mar 29 09:38:24 pstn1 kernel: EAX: 00000029 EBX: f7ff9380 ECX: f7fff880 EDX: c11ff9a0
>>> Mar 29 09:38:24 pstn1 kernel: ESI: 00000286 EDI: cffcda00 EBP: e5e10c80 DS: 007b ES: 007b
>>> Mar 29 09:38:24 pstn1 kernel: CR0: 80050033 CR2: b7ce39e0 CR3: 0f911000 CR4: 000006d0
>>> Mar 29 09:38:24 pstn1 kernel: [<c05b067c>] kfree_skbmem+0x8/0x61
>>> Mar 29 09:38:24 pstn1 kernel: [<c05e9aaf>] __udp_queue_rcv_skb+0x4a/0x51
>>> Mar 29 09:38:24 pstn1 kernel: [<c05ad993>] release_sock+0x44/0x91
>>> Mar 29 09:38:24 pstn1 kernel: [<c05ea939>] udp_sendmsg+0x44e/0x514
>>> Mar 29 09:38:24 pstn1 kernel: [<c05efdec>] inet_sendmsg+0x35/0x3f
>>> Mar 29 09:38:24 pstn1 kernel: [<c05ab30c>] sock_sendmsg+0xce/0xe8
>>> Mar 29 09:38:24 pstn1 kernel: [<c043464f>] autoremove_wake_function+0x0/0x2d
>>> Mar 29 09:38:24 pstn1 kernel: [<c04ea17b>] copy_from_user+0x17/0x5d
>>> Mar 29 09:38:24 pstn1 kernel: [<c04ea3a1>] copy_to_user+0x31/0x48
>>> Mar 29 09:38:24 pstn1 kernel: [<f89ab141>] zt_chan_read+0x1e0/0x20b [zaptel]
>>> Mar 29 09:38:24 pstn1 kernel: [<c04ea195>] copy_from_user+0x31/0x5d
>>> Mar 29 09:38:24 pstn1 kernel: [<c05ac4c4>] sys_sendto+0x116/0x140
>>> Mar 29 09:38:24 pstn1 kernel: [<c0415d4f>] flush_tlb_page+0x74/0x77
>>> Mar 29 09:38:24 pstn1 kernel: [<c0461331>] do_wp_page+0x3bf/0x40a
>>> Mar 29 09:38:24 pstn1 kernel: [<c04284f1>] current_fs_time+0x4a/0x55
>>> Mar 29 09:38:24 pstn1 kernel: [<c0488f9b>] touch_atime+0x60/0x91
>>> Mar 29 09:38:24 pstn1 kernel: [<c047d9d0>] pipe_readv+0x315/0x321
>>> Mar 29 09:38:24 pstn1 kernel: [<c05acde4>] sys_socketcall+0x106/0x19e
>>> Mar 29 09:38:24 pstn1 kernel: [<c0404f17>] syscall_call+0x7/0xb
>>> Mar 29 09:38:24 pstn1 kernel: =======================
>>>
>>>
>>> This occurred during a "high load" period (52 calls across 3 PRI spans).
>>>
>>> A couple days ago I moved the interrupts for my PRI card to CPU0 from
>>> CPU3, because CPU3 was handling everything else:
>>>            CPU0       CPU1       CPU2       CPU3
>>>   0:        306          0          0 3684057379   IO-APIC-edge   timer
>>>   1:          0          0          0      13468   IO-APIC-edge   i8042
>>>   8:          0          0          0          3   IO-APIC-edge   rtc
>>>   9:          0          0          0          0   IO-APIC-level  acpi
>>>  12:          0          0          0          4   IO-APIC-edge   i8042
>>> 169:          0          0          0          0   IO-APIC-level  uhci_hcd:usb2
>>> 177:          0          0          0   18392593   IO-APIC-level  ata_piix
>>> 185:          0          0          0          1   IO-APIC-level  ehci_hcd:usb1
>>> 193:          0          0          0          0   IO-APIC-level  uhci_hcd:usb3
>>> 201:          0          0          0 2090021759   IO-APIC-level  eth0
>>> 209:  149621223          0          0 3534419461   IO-APIC-level  wct4xxp
>>>
>>>
>>> (The CPU3 number for wct4xxp is not increasing any more).
>>>
>>> What is the interrupt distribution of other people's systems?
>>> Before I made this change I was having a problem with D-channels
>>> dropping occasionally, so I thought it might be an interrupt/load
>>> issue.
>>>
>>> Thank you.
>>>
>>> -- James
>>>
>>> --
>>> _____________________________________________________________________
>>> -- Bandwidth and Colocation Provided by http://www.api-digital.com --
>>> New to Asterisk? Join us for a live introductory webinar every Thurs:
>>> http://www.asterisk.org/hello
>>>
>>> asterisk-users mailing list
>>> To UNSUBSCRIBE or update options visit:
>>> http://lists.digium.com/mailman/listinfo/asterisk-users
>>
>>
>