[asterisk-bugs] [JIRA] (ASTERISK-27616) chan_sip locks up during reload under Asterisk 13 / 15 (but not 11)

Gregory Massel (JIRA) noreply at issues.asterisk.org
Sat Apr 4 10:36:25 CDT 2020


    [ https://issues.asterisk.org/jira/browse/ASTERISK-27616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=250125#comment-250125 ] 

Gregory Massel commented on ASTERISK-27616:
-------------------------------------------

In response to Steve Sether's comment: I was unable to mitigate it adequately and, as more and more security issues with older versions of Asterisk materialised (and were not fixed owing to the associated versions being end-of-life), I was forced to migrate to res_pjsip.

A few suggestions re mitigation:
1. Shut down Asterisk entirely and restart it routinely at a quiet time (e.g. 4am). This will manage the amount of memory leaked over time [and reduce reload times].
2. Disable MWI (mailbox=xxx) if you can afford to lose the functionality.
3. Disable all NAT keepalives (keepalive=no) and qualification (qualify=no). [Note: This can cause profound issues if many of your endpoints are behind NAT and don't generate keepalives themselves].
4. Disable any presence (allowsubscribe=no).
5. If you're making frequent changes to sip.conf, keep reloads to a minimum. E.g. rather than reloading ("sip reload") every time a change is made to sip.conf, trigger reloads via a cron job that runs every 15 / 30 / 60 minutes and issues a "sip reload" only if the modification timestamp of sip.conf has changed.
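
The cron-driven reload in point 5 could be sketched roughly as below. This is a minimal Python sketch under stated assumptions, not a tested production script: the stamp file path, the function names and the script location in the cron line are all hypothetical, and sip.conf is assumed to live in /etc/asterisk. Only the `asterisk -rx "sip reload"` command itself comes from the issue description.

```python
#!/usr/bin/env python3
"""Issue "sip reload" only when sip.conf has changed since the last reload."""
import os
import subprocess

# Assumed locations; adjust for your installation.
SIP_CONF = "/etc/asterisk/sip.conf"
STAMP_FILE = "/var/tmp/sip_reload.stamp"  # hypothetical stamp file


def read_stamp(stamp_path):
    """Return the mtime (in ns) recorded at the last reload, or None."""
    try:
        with open(stamp_path) as fh:
            return int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return None  # first run, or unreadable stamp


def reload_if_changed(conf_path=SIP_CONF, stamp_path=STAMP_FILE,
                      run=subprocess.run):
    """Reload chan_sip if conf_path's mtime differs from the recorded one.

    Returns True if a reload was issued, False otherwise.
    """
    # Read the mtime before reloading, so an edit made during the
    # reload is picked up on the next cron run rather than missed.
    mtime_ns = os.stat(conf_path).st_mtime_ns
    if read_stamp(stamp_path) == mtime_ns:
        return False
    run(["asterisk", "-rx", "sip reload"], check=True)
    with open(stamp_path, "w") as fh:
        fh.write(str(mtime_ns))
    return True


# Do nothing if the config file is absent (e.g. not an Asterisk host).
if __name__ == "__main__" and os.path.exists(SIP_CONF):
    reload_if_changed()
```

A crontab entry along the lines of `*/15 * * * * /usr/local/bin/sip_reload_if_changed.py` (path hypothetical) would then cap reloads at one per 15 minutes, however many edits are made in between. The stamp is only written after the reload command succeeds, so a failed reload is retried on the next run.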

In the long term, however, there are numerous reasons aside from this issue to consider migrating to res_pjsip. One of the biggest problems with chan_sip is that it cannot deal with high volumes of registration requests (max. ~50 per second even on the best of hardware) and this means that even one buggy SIP stack (or malicious user) can overwhelm it. There are tons of Huawei fibre and LTE CPE routers that, if they fail to register (e.g. wrong password or SIP account disabled), retry instantly without any delay, generating hundreds of registrations per second. A single one of these, on its own, can overwhelm chan_sip, let alone a few hundred endpoints or someone with malicious intent using a tool like SIPp. Sure, you can try to mitigate this by using iptables to rate-limit the volume of REGISTER packets per IP; however, even then chan_sip is extremely easy to flood.
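
To make the per-IP rate-limiting idea concrete, here is a minimal Python sketch of a token bucket keyed on source address. It only illustrates the general logic of limiting REGISTERs per source; it is not iptables and not anything inside chan_sip or res_pjsip. The class name, method name and the rate/burst figures are all hypothetical.

```python
import time


class PerSourceLimiter:
    """Token-bucket limiter with one bucket per source IP (illustrative)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)    # tokens (requests) refilled per second
        self.burst = float(burst)  # maximum bucket size
        self.clock = clock         # injectable for testing
        self.buckets = {}          # src_ip -> (tokens, last_update)

    def allow(self, src_ip):
        """Return True if a request from src_ip is within its limit."""
        now = self.clock()
        # A source seen for the first time starts with a full bucket.
        tokens, last = self.buckets.get(src_ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[src_ip] = (tokens - 1.0, now)
            return True
        self.buckets[src_ip] = (tokens, now)
        return False


# Example: allow a burst of 5 REGISTERs, then 2 per second per source.
limiter = PerSourceLimiter(rate=2, burst=5)
accepted = sum(limiter.allow("192.0.2.10") for _ in range(100))
```

With these figures, a misbehaving CPE that fires 100 REGISTERs back to back gets only its initial burst (plus whatever refills while the loop runs) accepted, while other sources are unaffected because each has its own bucket. Note that this sketch never expires idle buckets, so a real deployment would need some form of cleanup.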

By comparison, res_pjsip (on recent versions of Asterisk), on reasonably modest hardware, can deal with well over 2,000 registrations per second and won't try to write the same contact information over and over again if it hasn't changed (the sqlite3 AstDB is the weakest link with both SIP stacks). Coupled with iptables rate limiting (to manage the really malicious endpoints), it's much more difficult to overwhelm res_pjsip.

With the above said, res_pjsip will scale FAR better than chan_sip (think 10,000 active endpoints, all registering, NAT keepalives, presence, etc.), but it has its limitations as well. If you're facing an endpoint population of more than 10,000, it may be worth putting the effort into fronting Asterisk with OpenSIPS or Kamailio. If considering this, keep in mind that res_pjsip works much better with proxies (it can publish presence to them, match on SIP headers, etc.) whereas chan_sip isn't as well suited. Migrating to res_pjsip wasn't trivial but also wasn't anywhere near as complex as I'd thought it would be. I was able to write scripts and use tools like 'sed' to bulk-modify existing configurations. Your bigger issue now is that, having left it so late, if you upgrade from Asterisk 11 to 16, there are so many other things that have changed that you may spend more time planning around deprecated functions, changed functions and channel variables. From a security and stability perspective though, it's probably worth the effort of migrating to pjsip and Asterisk 16, particularly if you're dealing with a few thousand endpoints.

> chan_sip locks up during reload under Asterisk 13 / 15 (but not 11)
> -------------------------------------------------------------------
>
>                 Key: ASTERISK-27616
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-27616
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General, Core/Stasis
>    Affects Versions: 13.19.0, 15.2.0
>         Environment: Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-104-generic x86_64)
> Asterisk versions 11.25.3, 13.19.0, 15.2.0
> chan_sip - 14783 sip peers [Monitored: 8181 online, 6602 offline Unmonitored: 0 online, 0 offline]
> 4567 active SIP subscriptions
>            Reporter: Gregory Massel
>
> Since upgrading from Asterisk 11 to 13, when issuing a "sip reload" (either via CLI or asterisk -rx "sip reload"), chan_sip locks up for a period of time ranging from 3 to 70 seconds (depending on various factors), during which time it is completely unresponsive to any SIP packets received. Aggravating the issue is that, if qualify=yes is set, sip peers become LAGGED/UNREACHABLE as a result of chan_sip becoming unresponsive which triggers a flood of device state updates that further backlogs chan_sip.
> This issue does not exist on Asterisk 11.25.3. Irrespective of the system load, volume of peers (we have >8,000 online) or volume of SIP subscriptions (we have anywhere from 4,000 to 8,000), Asterisk 11 can do a "sip reload" cleanly and instantaneously with zero disruption. It remains responsive throughout. However, Asterisk 13.19.0 and 15.2.0 both exhibit the module slow-down / lock-up.
> On Asterisk 13/15, with qualify=no and allowsubscribe=no, a "sip reload" can typically take 3 to 5 seconds. A debug log shows that, despite there being a considerable number of peers, the process of reloading the configuration files is almost instantaneous, but things slow afterwards. e.g.:
> [Jan 13 03:04:01] VERBOSE[6032] chan_sip.c: Reloading SIP
> [Jan 13 03:04:01] DEBUG[6032] chan_sip.c: --------------- Done destroying pruned peers
> [Jan 13 03:04:06] DEBUG[6032] chan_sip.c: do_reload finished. peer poke/prune reg contact time = 5 sec.
> [Jan 13 03:04:06] DEBUG[6032] chan_sip.c: --------------- SIP reload done
> In the above example, the poke/prune took 5 seconds, however, this can vary dramatically per the following factors:
> - If a global sip config setting is changed (e.g. qualify=yes -> qualify=no) acting on all peers, the poke/prune can take up to a minute.
> - With no config changes at all, it can take 3-20 seconds, depending on call volume and whether qualify is enabled and allowsubscribe is enabled.
> - Even in best case, with qualify=no, keepalive=no, allowsubscribe=no, it will take at least 3 seconds. With zero active calls and qualify=yes, typically 5-6 seconds. With zero active calls and qualify=yes and allowsubscribe=yes, typically 8-10 seconds.
> I've verified using VoIPmonitor and packet sniffing that the Asterisk server is completely unresponsive to SIP during the reload. Active calls do, however, remain active, as RTP continues to flow. Other Asterisk threads unrelated to chan_sip continue to function.
> Looking at the chan_sip source code in sip_do_reload(), the code appears similar/identical from Asterisk 11 to 13. We have no outbound registrations nor outbound MWI subscriptions. Although the SIP reload is slow and unresponsive, it would appear that another section of code is what is causing it to become unresponsive...perhaps an issue with locking?
> It's not clear to me why this happens with Asterisk 13/15 but not with 11.
> It's also worth pointing out that we don't see this behavior on other Asterisk 13 servers that have zero hints / device state subscriptions but still have thousands of peers with qualify=yes. It is potentially possible that, at least in part, the issue relates to device state.
> N.B. We have ruled out DNS failure by replacing all references to hostnames with IP addresses. I am also confident that, if there was a DNS issue here, Asterisk 11.25.3 would be affected, yet it is not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


