[asterisk-bugs] [JIRA] (ASTERISK-27616) chan_sip locks up during reload under Asterisk 13 / 15 (but not 11)

Wed Jun 20 23:57:54 CDT 2018

    [ https://issues.asterisk.org/jira/browse/ASTERISK-27616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=243874#comment-243874 ] 

Gregory Massel commented on ASTERISK-27616:
-------------------------------------------

Memory leaks aside, I've edited chan_sip.c to log how much time elapses between the various calls within sip_do_reload():

unlink_marked_peers_from_tables(); - 0 seconds
sip_poke_all_peers(); - 5 seconds if qualify=on; 0 seconds if qualify=off
sip_keepalive_all_peers(); - 4 seconds
sip_send_all_registers(); - 0 seconds (note: I have no outbound registrations configured)
sip_send_all_mwi_subscriptions(); - 0 seconds (note: I have no outbound MWI subscriptions configured)
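
A minimal sketch of this kind of per-call timing, for anyone wanting to reproduce the measurements. It is not the actual change made to chan_sip.c: time_step(), elapsed_sec() and fake_reload_step() are hypothetical names used only for illustration, and the output goes to stderr rather than through the Asterisk logger.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() under strict -std modes */
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for a reload step such as sip_poke_all_peers(). */
static void fake_reload_step(void)
{
    /* The real function would walk the peer table here. */
}

/* Seconds between two timestamps, including the fractional part. */
static double elapsed_sec(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* Run one step, print how long it took, and return the duration in seconds. */
static double time_step(const char *name, void (*step)(void))
{
    struct timespec t0, t1;
    double secs;

    /* CLOCK_MONOTONIC is unaffected by wall-clock adjustments. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    step();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = elapsed_sec(&t0, &t1);
    fprintf(stderr, "%s - %.3f seconds\n", name, secs);
    return secs;
}
```

Wrapping each call in turn, e.g. time_step("fake_reload_step()", fake_reload_step), gives one line per step in the same shape as the list above.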

The BIG culprit here is sip_keepalive_all_peers(): it takes about 4 seconds, even though I have "keepalive=no" in my sip.conf. The only way I have found to get the reload under control is to comment out the call to sip_keepalive_all_peers() in sip_do_reload().

sip_poke_all_peers() is also terribly slow (about 5 seconds); however, at least it can be controlled by configuring "qualify=no".

Looking at the code for both sip_poke_all_peers() and sip_keepalive_all_peers(), neither appears to have changed from 11.25.3 to 13.21.1, and the two functions are very similar.

Since the functions themselves are unchanged, this makes me think that AST_SCHED_REPLACE_UNREF() is where the delay is coming in.

Whilst here: sip_keepalive_all_peers() makes reference to "poke peer ref" instead of "keepalive peer ref" (harmless, but nonetheless wrong), and it seems to try to schedule keepalives for all peers immediately (0ms into the future), which will result in a flood. Almost all the other related code (e.g. sip_poke_all_peers(), sip_send_all_registers()) staggers the scheduling to avoid flooding. This is all ancillary, though: even if keepalives are disabled, sip_keepalive_all_peers() still iterates through every peer and manages to hang chan_sip for a number of seconds.
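
To illustrate what staggering means here, the following is a simplified sketch, not a proposed patch and not the real chan_sip code. struct demo_peer, stagger_keepalives() and the step_ms parameter are all hypothetical; the real code works on struct sip_peer and the Asterisk scheduler. Each keepalive-enabled peer gets a first-run delay one step later than the previous one, and peers with keepalives disabled are skipped rather than scheduled.

```c
#include <stddef.h>

/* Hypothetical, simplified peer record; not the real struct sip_peer. */
struct demo_peer {
    int keepalive_ms;  /* keepalive interval; 0 means keepalives are disabled */
    int first_run_ms;  /* computed delay before the first keepalive, -1 if none */
};

/*
 * Give every keepalive-enabled peer a distinct start offset, step_ms apart,
 * so the first round of keepalives is spread out instead of all firing at
 * 0ms. Returns the number of peers that would be scheduled.
 */
static size_t stagger_keepalives(struct demo_peer *peers, size_t count, int step_ms)
{
    size_t scheduled = 0;
    int offset_ms = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (peers[i].keepalive_ms <= 0) {
            /* Disabled: record that there is nothing to schedule. */
            peers[i].first_run_ms = -1;
            continue;
        }
        offset_ms += step_ms;
        peers[i].first_run_ms = offset_ms;
        scheduled++;
    }
    return scheduled;
}
```

With a 100ms step, three enabled peers would start at 100ms, 200ms and 300ms. Note that this only addresses the flood; the loop still visits every peer, so it says nothing about why the iteration itself is slow on 13/15.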

> chan_sip locks up during reload under Asterisk 13 / 15 (but not 11)
> -------------------------------------------------------------------
>
>                 Key: ASTERISK-27616
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-27616
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General, Core/Stasis
>    Affects Versions: 13.19.0, 15.2.0
>         Environment: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-104-generic x86_64)
> Asterisk versions 11.25.3, 13.19.0, 15.2.0
> chan_sip - 14783 sip peers [Monitored: 8181 online, 6602 offline Unmonitored: 0 online, 0 offline]
> 4567 active SIP subscriptions
>            Reporter: Gregory Massel
>
> Since upgrading from Asterisk 11 to 13, when issuing a "sip reload" (either via CLI or asterisk -rx "sip reload"), chan_sip locks up for a period of time ranging from 3 to 70 seconds (depending on various factors), during which time it is completely unresponsive to any SIP packets received. Aggravating the issue is that, if qualify=yes is set, sip peers become LAGGED/UNREACHABLE as a result of chan_sip becoming unresponsive which triggers a flood of device state updates that further backlogs chan_sip.
> This issue does not exist on Asterisk 11.25.3. Irrespective of the system load, volume of peers (we have >8,000 online) or volume of SIP subscriptions (we have anywhere from 4,000 to 8,000), Asterisk 11 can do a "sip reload" cleanly and instantaneously with zero disruption. It remains responsive throughout. However, Asterisk 13.19.0 and 15.2.0 both exhibit the module slow-down / lock-up.
> On Asterisk 13/15, with qualify=no and allowsubscribe=no, a "sip reload" typically takes 3 to 5 seconds. A debug log shows that, despite there being a considerable number of peers, the process of reloading the configuration files is almost instantaneous, but things slow down afterwards, e.g.:
> [Jan 13 03:04:01] VERBOSE[6032] chan_sip.c: Reloading SIP
> [Jan 13 03:04:01] DEBUG[6032] chan_sip.c: --------------- Done destroying pruned peers
> [Jan 13 03:04:06] DEBUG[6032] chan_sip.c: do_reload finished. peer poke/prune reg contact time = 5 sec.
> [Jan 13 03:04:06] DEBUG[6032] chan_sip.c: --------------- SIP reload done
> In the above example, the poke/prune took 5 seconds, however, this can vary dramatically per the following factors:
> - If a global sip config setting is changed (e.g. qualify=yes -> qualify=no) acting on all peers, the poke/prune can take up to a minute.
> - With no config changes at all, it can take 3-20 seconds, depending on call volume and whether qualify is enabled and allowsubscribe is enabled.
> - Even in best case, with qualify=no, keepalive=no, allowsubscribe=no, it will take at least 3 seconds. With zero active calls and qualify=yes, typically 5-6 seconds. With zero active calls and qualify=yes and allowsubscribe=yes, typically 8-10 seconds.
> I've verified using VoIPmonitor and packet sniffing that the Asterisk server is completely unresponsive to SIP during the reload. Active calls do, however, remain active, as RTP continues to flow. Other Asterisk threads unrelated to chan_sip continue to function.
> Looking at the chan_sip source code in sip_do_reload(), the code appears similar/identical from Asterisk 11 to 13. We have no outbound registrations nor outbound MWI subscriptions. Although the SIP reload is slow and unresponsive, it would appear that another section of code is what is causing it to become unresponsive...perhaps an issue with locking?
> It's not clear to me why this happens with Asterisk 13/15 but not with 11.
> It's also worth pointing out that we don't see this behavior on other Asterisk 13 servers that have zero hints / device state subscriptions but still have thousands of peers with qualify=yes. It is potentially possible that, at least in part, the issue relates to device state.
> N.B. We have ruled out DNS failure by replacing all references to hostnames with IP addresses. I am also confident that, if there was a DNS issue here, Asterisk 11.25.3 would be affected, yet it is not.
