[asterisk-bugs] [JIRA] (ASTERISK-23719) Asterisk locks, UDP buffer overflow, 1000+ spawns of 'chan_iax2.c find_idle_thread()'

Mon May 12 12:10:44 CDT 2014

    [ https://issues.asterisk.org/jira/browse/ASTERISK-23719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=218050#comment-218050 ] 

SteelPivot commented on ASTERISK-23719:
---------------------------------------

Okay, a bit of backstory: we have two Asterisk servers running in active/active failover between two datacenters. That setup is not terribly important; however, last week I recompiled without DEBUG_THREADS on both systems. Both were running on VMs until last week, when I moved the primary system to a bare-metal server (but left the secondary on a VM). Since that move, the primary has not had any issues.

However, the secondary has again shown the same symptoms (appropriate files attached, minus core-show-locks, since there is no DEBUG_THREADS). The only fix was a "service asterisk restart" or "asterisk -rx 'core restart now'"; otherwise the IAX2 trunks would never come back online.

The VM is running on KVM, using the virtio network driver, with normal type-1 bonding on the host NICs, etc. for KVM.

The only port showing an overflow in this case is port 4569, maxed at the system maximum, 230400. There were no active calls on this system at the time. Again, all IAX2 trunks went UNREACHABLE within the same minute (which is expected, given the default qualify time of 60s).
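
For anyone who wants to check the receive queue on port 4569 without going through netstat, here is a minimal sketch that reads the same counters from the text of /proc/net/udp. The function name and the sample line are illustrative assumptions, not output from the affected system; the sample is constructed so that its rx_queue decodes to the 230400 figure mentioned above.

```python
# Hypothetical helper: extract rx_queue and drops for one UDP port from
# the contents of /proc/net/udp (Linux). Ports and queue sizes are hex.

# Constructed sample, NOT captured from the affected server.
SAMPLE_PROC_NET_UDP = (
    "  sl  local_address rem_address   st tx_queue rx_queue tr tm->when "
    "retrnsmt   uid  timeout inode ref pointer drops\n"
    "  123: 00000000:11D9 00000000:0000 07 00000000:00038400 00:00000000 "
    "00000000     0        0 12345 2 ffff880000000000 0\n"
)


def udp_rx_queue(proc_net_udp_text, port=4569):
    """Return a list of (rx_queue, drops) for sockets bound to `port`."""
    results = []
    for line in proc_net_udp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        # local_address is HEXIP:HEXPORT
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        # fifth column is tx_queue:rx_queue, both in hex
        _tx_hex, rx_hex = fields[4].split(":")
        # drops is the last column on kernels that report it
        drops = int(fields[-1]) if len(fields) >= 13 else None
        results.append((int(rx_hex, 16), drops))
    return results


# On a live Linux host one would pass open("/proc/net/udp").read() instead.
print(udp_rx_queue(SAMPLE_PROC_NET_UDP))
```

A non-zero drops value alongside a queue pinned at the maximum would suggest that packets are arriving but nothing is reading from the socket.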

> Asterisk locks, UDP buffer overflow, 1000+ spawns of 'chan_iax2.c find_idle_thread()'
> -------------------------------------------------------------------------------------
>
>                 Key: ASTERISK-23719
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-23719
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_iax2
>    Affects Versions: 11.6.1
>         Environment: CentOS 6.4min
>            Reporter: SteelPivot
>            Assignee: SteelPivot
>            Severity: Critical
>         Attachments: 1399319401-core-show-locks.txt, 1399324201-backtrace-threads.txt, 1399324201-core-show-threads.txt, 1399324201-netstat.txt, 1399750801-backtrace-threads.txt, 1399750801-core-show-taskprocessors.txt, 1399750801-core-show-threads.txt, 1399750801-netstat.txt
>
>
> We've been experiencing an issue for a few months concerning IAX2 peers, which has recently gotten more severe after upgrading from 11.2 to 11.6cert2.
> The initial symptom was all (100+) IAX2 peers going UNREACHABLE. However, after inspecting further it seems that what will happen is the UDP queues will sharply increase (seen by netstat -antup), the number of asterisk threads increases (to over 1000 threads in some cases), and Asterisk, of course, stops responding to inbound/outbound calls from any channel (SIP or IAX2).
> After recompiling with DEBUG_THREADS and BETTER_BACKTRACES, I discovered that issuing a "gdb -ex "thread apply all bt"...(etc) " to grab a backtrace will free up the UDP queues, and Asterisk will then become responsive again. Currently I have a script running every 5 minutes that pulls the UDP queues for asterisk processes, and upon seeing a queue above 300,000 packets, I issue a "netstat -antup", "core show locks", "core show threads", and "gdb -ex "thread apply all bt" --batch asterisk `pidof asterisk` > $debugdir/$date-backtrace-threads.txt".
> I have previously increased the kernel UDP maximums in sysctl.conf, and added options for iaxthreadcount/iaxmaxthreadcount in iax.conf.
> I cannot repeat this issue at will, but it happens every hour or so (sometimes every few minutes). I have debug logs and backtraces for each occurrence.
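
The queue check described in the issue could be sketched roughly as below. This is not the reporter's actual script: the function name, the sample netstat text, and the exact column handling are assumptions, and only the 300,000 threshold and the "asterisk" process name come from the description. It parses `netstat -antup` style output and returns the asterisk UDP sockets whose Recv-Q exceeds the threshold.

```python
# Hypothetical detection step for a periodic monitoring job.

# Constructed sample output, NOT taken from the attached files.
SAMPLE_NETSTAT = """\
Proto Recv-Q Send-Q Local Address      Foreign Address    State    PID/Program name
tcp        0      0 0.0.0.0:22         0.0.0.0:*          LISTEN   900/sshd
udp   350000      0 0.0.0.0:4569       0.0.0.0:*                   1234/asterisk
udp        0      0 0.0.0.0:5060       0.0.0.0:*                   1234/asterisk
udp   400000      0 0.0.0.0:514        0.0.0.0:*                   800/rsyslogd
"""


def overflowing_sockets(netstat_output, threshold=300000, program="asterisk"):
    """Return [(local_address, recv_q)] for UDP sockets over the threshold."""
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # UDP rows have no State value, so they split into at least 6 fields.
        if len(fields) < 6 or not fields[0].startswith("udp"):
            continue
        if not fields[1].isdigit():
            continue
        # The last column is PID/Program name ("-" when not run as root).
        pid_prog = fields[-1]
        if "/" not in pid_prog or pid_prog.split("/", 1)[1] != program:
            continue
        recv_q = int(fields[1])
        if recv_q > threshold:
            hits.append((fields[3], recv_q))
    return hits


print(overflowing_sockets(SAMPLE_NETSTAT))
```

When the returned list is non-empty, the job would then run the capture commands listed in the description (netstat, core show locks, core show threads, and the gdb backtrace).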
