[asterisk-bugs] [JIRA] (ASTERISK-27616) chan_sip locks up during reload under Asterisk 13 / 15 (but not 11)
Gregory Massel (JIRA)
noreply at issues.asterisk.org
Fri Feb 2 08:43:13 CST 2018
[ https://issues.asterisk.org/jira/browse/ASTERISK-27616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=241981#comment-241981 ]
Gregory Massel commented on ASTERISK-27616:
-------------------------------------------
ctnsip1*CLI> core show sysinfo
System Statistics
-----------------
System Uptime: 516 hours
Total RAM: 4032804 KiB
Free RAM: 212324 KiB
Buffer RAM: 30672 KiB
Total Swap Space: 11717628 KiB
Free Swap Space: 10385896 KiB
Number of Processes: 290
ctnsip1*CLI> module unload chan_sip.so
Unloaded chan_sip.so
ctnsip1*CLI> core show sysinfo
System Statistics
-----------------
System Uptime: 516 hours
Total RAM: 4032804 KiB
Free RAM: 213020 KiB
Buffer RAM: 30704 KiB
Total Swap Space: 11717628 KiB
Free Swap Space: 10386108 KiB
Number of Processes: 288
[Memory is still in a leaked state, despite unload of chan_sip]
At this point I kill -9 the Asterisk process, restart Asterisk, and get:
ctnsip1*CLI> core show sysinfo
System Statistics
-----------------
System Uptime: 516 hours
Total RAM: 4032804 KiB
Free RAM: 3528008 KiB
Buffer RAM: 34972 KiB
Total Swap Space: 11717628 KiB
Free Swap Space: 11643084 KiB
Number of Processes: 294
Memory usage immediately drops from ~4GB to ~0.5GB, with exactly the same configuration, merely by killing and restarting Asterisk.
On further review, I may have been incorrect in saying that memory is leaked during a "sip reload". It may be that memory is leaked continuously (e.g. during call setup and tear-down), and that the reload merely slows down because, when a "sip reload" is executed, it has to go through all of the leaked memory.
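For reference, the three "core show sysinfo" outputs above can be compared with a few lines of arithmetic. This is only a rough sketch: "used" here simply means total minus free, which may also include page cache, and the dictionary keys and function names are made up for illustration.

```python
# Figures copied from the three "core show sysinfo" outputs above (all in KiB).
# The labels and key names are illustrative, not anything Asterisk emits.
SNAPSHOTS = {
    "before unload": {"total_ram": 4032804, "free_ram": 212324,
                      "total_swap": 11717628, "free_swap": 10385896},
    "after unload": {"total_ram": 4032804, "free_ram": 213020,
                     "total_swap": 11717628, "free_swap": 10386108},
    "after restart": {"total_ram": 4032804, "free_ram": 3528008,
                      "total_swap": 11717628, "free_swap": 11643084},
}

KIB_PER_GIB = 1024 * 1024


def used_gib(total_kib, free_kib):
    """Return (total - free) converted from KiB to GiB."""
    return (total_kib - free_kib) / KIB_PER_GIB


def summarize(snapshots):
    """Map each snapshot label to (RAM used, swap used), in GiB, rounded to 2 places."""
    return {
        label: (
            round(used_gib(s["total_ram"], s["free_ram"]), 2),
            round(used_gib(s["total_swap"], s["free_swap"]), 2),
        )
        for label, s in snapshots.items()
    }


if __name__ == "__main__":
    for label, (ram, swap) in summarize(SNAPSHOTS).items():
        print(f"{label}: RAM used {ram} GiB, swap used {swap} GiB")
```

Unloading chan_sip barely changes either figure (about 3.64 GiB RAM and 1.27 GiB swap in use both before and after), whereas after the restart usage is about 0.48 GiB RAM and 0.07 GiB swap.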
> chan_sip locks up during reload under Asterisk 13 / 15 (but not 11)
> -------------------------------------------------------------------
>
> Key: ASTERISK-27616
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-27616
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Channels/chan_sip/General
> Affects Versions: 13.19.0, 15.2.0
> Environment: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-104-generic x86_64)
> Asterisk versions 11.25.3, 13.19.0, 15.2.0
> chan_sip - 14783 sip peers [Monitored: 8181 online, 6602 offline Unmonitored: 0 online, 0 offline]
> 4567 active SIP subscriptions
> Reporter: Gregory Massel
>
> Since upgrading from Asterisk 11 to 13, issuing a "sip reload" (either via the CLI or asterisk -rx "sip reload") causes chan_sip to lock up for anywhere from 3 to 70 seconds (depending on various factors), during which time it is completely unresponsive to any SIP packets received. Aggravating the issue, if qualify=yes is set, SIP peers become LAGGED/UNREACHABLE as a result of chan_sip becoming unresponsive, which triggers a flood of device state updates that further backlogs chan_sip.
> This issue does not exist on Asterisk 11.25.3. Irrespective of the system load, volume of peers (we have >8,000 online) or volume of SIP subscriptions (we have anywhere from 4,000 to 8,000), Asterisk 11 can do a "sip reload" cleanly and instantaneously with zero disruption. It remains responsive throughout. However, Asterisk 13.19.0 and 15.2.0 both exhibit the module slow-down / lock-up.
> On Asterisk 13/15, with qualify=no and allowsubscribe=no, a "sip reload" typically takes 3 to 5 seconds. A debug log shows that, despite the considerable number of peers, reloading the configuration files is almost instantaneous, but things slow down afterwards, e.g.:
> [Jan 13 03:04:01] VERBOSE[6032] chan_sip.c: Reloading SIP
> [Jan 13 03:04:01] DEBUG[6032] chan_sip.c: --------------- Done destroying pruned peers
> [Jan 13 03:04:06] DEBUG[6032] chan_sip.c: do_reload finished. peer poke/prune reg contact time = 5 sec.
> [Jan 13 03:04:06] DEBUG[6032] chan_sip.c: --------------- SIP reload done
> In the above example, the poke/prune took 5 seconds; however, this can vary dramatically depending on the following factors:
> - If a global sip config setting is changed (e.g. qualify=yes -> qualify=no) acting on all peers, the poke/prune can take up to a minute.
> - With no config changes at all, it can take 3-20 seconds, depending on call volume and on whether qualify and allowsubscribe are enabled.
> - Even in the best case, with qualify=no, keepalive=no, allowsubscribe=no, it will take at least 3 seconds. With zero active calls and qualify=yes, it typically takes 5-6 seconds. With zero active calls, qualify=yes and allowsubscribe=yes, it typically takes 8-10 seconds.
> I've verified using VoIPmonitor and packet sniffing that the Asterisk server is completely unresponsive to SIP during the reload. Active calls do, however, remain active, as RTP continues to flow. Other Asterisk threads unrelated to chan_sip continue to function.
> Looking at the chan_sip source code in sip_do_reload(), the code appears similar/identical from Asterisk 11 to 13. We have no outbound registrations nor outbound MWI subscriptions. Although the SIP reload itself is slow, it would appear that some other section of code is what causes chan_sip to become unresponsive... perhaps an issue with locking?
> It's not clear to me why this happens with Asterisk 13/15 but not with 11.
> It's also worth pointing out that we don't see this behavior on other Asterisk 13 servers that have zero hints / device state subscriptions but still have thousands of peers with qualify=yes. It is potentially possible that, at least in part, the issue relates to device state.
> N.B. We have ruled out DNS failure by replacing all references to hostnames with IP addresses. I am also confident that, if there were a DNS issue here, Asterisk 11.25.3 would be affected, yet it is not.
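The reload durations discussed in the quoted description can be recovered from a debug log by pairing the "Reloading SIP" and "SIP reload done" lines. The following is a minimal sketch, assuming the log line layout shown in the quoted excerpt; the regular expression, the function name and the default year (the log lines carry none) are illustrative assumptions, and month-name parsing depends on an English locale.

```python
import re
from datetime import datetime

# Sample taken from the quoted debug log excerpt ("> " quote prefixes removed).
LOG = """\
[Jan 13 03:04:01] VERBOSE[6032] chan_sip.c: Reloading SIP
[Jan 13 03:04:01] DEBUG[6032] chan_sip.c: --------------- Done destroying pruned peers
[Jan 13 03:04:06] DEBUG[6032] chan_sip.c: do_reload finished. peer poke/prune reg contact time = 5 sec.
[Jan 13 03:04:06] DEBUG[6032] chan_sip.c: --------------- SIP reload done
"""

# Assumed layout: [Mon DD HH:MM:SS] LEVEL[thread-id] chan_sip.c: message
LINE_RE = re.compile(
    r"^\[(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d)\] \w+\[\d+\] chan_sip\.c: (?P<msg>.*)$"
)


def reload_durations(log_text, year=2018):
    """Return seconds elapsed between each 'Reloading SIP' and 'SIP reload done' pair.

    The year is an assumption supplied by the caller, since the log omits it.
    """
    durations = []
    started = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S")
        msg = match.group("msg")
        if msg.startswith("Reloading SIP"):
            started = when
        elif "SIP reload done" in msg and started is not None:
            durations.append(int((when - started).total_seconds()))
            started = None
    return durations


if __name__ == "__main__":
    print(reload_durations(LOG))
```

Run against the quoted excerpt, this yields a single 5-second reload, matching the "peer poke/prune reg contact time = 5 sec." line in the log.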
--
This message was sent by Atlassian JIRA
(v6.2#6252)