[asterisk-bugs] [JIRA] (ASTERISK-27616) chan_sip locks up during reload under Asterisk 13 / 15 (but not 11)

Richard Mudgett (JIRA) noreply at issues.asterisk.org
Tue Feb 13 17:46:13 CST 2018


    [ https://issues.asterisk.org/jira/browse/ASTERISK-27616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=242190#comment-242190 ] 

Richard Mudgett commented on ASTERISK-27616:
--------------------------------------------

If you enable BETTER_BACKTRACES and DONT_OPTIMIZE then the FRACK backtraces could be more helpful to you.

There is only one place in chan_sip that calls stasis_subscribe_pool() and that is add_peer_mwi_subs().
The large number of leaked items you are seeing in stasis.c and stasis_cache.c are controlled by code in chan_sip.c.  It is chan_sip that is leaking the controlling references.

> chan_sip locks up during reload under Asterisk 13 / 15 (but not 11)
> -------------------------------------------------------------------
>
>                 Key: ASTERISK-27616
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-27616
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General, Core/Stasis
>    Affects Versions: 13.19.0, 15.2.0
>         Environment: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-104-generic x86_64)
> Asterisk versions 11.25.3, 13.19.0, 15.2.0
> chan_sip - 14783 sip peers [Monitored: 8181 online, 6602 offline Unmonitored: 0 online, 0 offline]
> 4567 active SIP subscriptions
>            Reporter: Gregory Massel
>
> Since upgrading from Asterisk 11 to 13, when issuing a "sip reload" (either via CLI or asterisk -rx "sip reload"), chan_sip locks up for a period of time ranging from 3 to 70 seconds (depending on various factors), during which time it is completely unresponsive to any SIP packets received. Aggravating the issue is that, if qualify=yes is set, sip peers become LAGGED/UNREACHABLE as a result of chan_sip becoming unresponsive which triggers a flood of device state updates that further backlogs chan_sip.
> This issue does not exist on Asterisk 11.25.3. Irrespective of the system load, volume of peers (we have >8,000 online) or volume of SIP subscriptions (we have anywhere from 4,000 to 8,000), Asterisk 11 can do a "sip reload" cleanly and instantaneously with zero disruption. It remains responsive throughout. However, Asterisk 13.19.0 and 15.2.0 both exhibit the module slow-down / lock-up.
> On Asterisk 13/15, with qualify=no and allowsubscribe=no, a "sip reload" can typically take 3 to 5 seconds. A debug log shows that, despite there being a considerable number of peers, the process of reloading the configuration files is almost instantaneous, but things slow afterwards. e.g.:
> [Jan 13 03:04:01] VERBOSE[6032] chan_sip.c: Reloading SIP
> [Jan 13 03:04:01] DEBUG[6032] chan_sip.c: --------------- Done destroying pruned peers
> [Jan 13 03:04:06] DEBUG[6032] chan_sip.c: do_reload finished. peer poke/prune reg contact time = 5 sec.
> [Jan 13 03:04:06] DEBUG[6032] chan_sip.c: --------------- SIP reload done
> In the above example, the poke/prune took 5 seconds, however, this can vary dramatically per the following factors:
> - If a global sip config setting is changed (e.g. qualify=yes -> qualify=no) acting on all peers, the poke/prune can take up to a minute.
> - With no config changes at all, it can take 3-20 seconds, depending on call volume and whether qualify is enabled and allowsubscribe is enabled.
> - Even in best case, with qualify=no, keepalive=no, allowsubscribe=no, it will take at least 3 seconds. With zero active calls and qualify=yes, typically 5-6 seconds. With zero active calls and qualify=yes and allowsubscribe=yes, typically 8-10 seconds.
> I've verified using VoIPmonitor and packet sniffing that the Asterisk server is completely unresponsive to SIP during the reload. Active calls do, however, remain active, as RTP continues to flow. Other Asterisk threads unrelated to chan_sip continue to function.
> Looking at the chan_sip source code in sip_do_reload(), the code appears similar/identical from Asterisk 11 to 13. We have no outbound registrations nor outbound MWI subscriptions. Although the SIP reload is slow and unresponsive, it would appear that another section of code is what is causing it to become unresponsive...perhaps an issue with locking?
> It's not clear to me why this happens with Asterisk 13/15 but not with 11.
> It's also worth pointing out that we don't see this behavior on other Asterisk 13 servers that have zero hints / device state subscriptions but still have thousands of peers with qualify=yes. It is potentially possible that, at least in part, the issue relates to device state.
> N.B. We have ruled out DNS failure by replacing all references to hostnames with IP addresses. I am also confident that, if there was a DNS issue here, Asterisk 11.25.3 would be affected, yet it is not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)



More information about the asterisk-bugs mailing list