[asterisk-bugs] [JIRA] (ASTERISK-21378) chan_sip blocks on DNS lookups - causing severe delays with registrations in certain scenarios

Wed Apr 3 15:59:01 CDT 2013

    [ https://issues.asterisk.org/jira/browse/ASTERISK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=204927#comment-204927 ] 

Jaco Kroon commented on ASTERISK-21378:
---------------------------------------

After discussion in #asterisk on IRC a few things became clear:

* Fixing this issue is invasive to the current chan_sip.
* The new design for sip in asterisk-12 should (will?) not suffer the same problems (also referring to other configuration reload issues).
* Chances of this bug getting fixed pre-12 are slim to none (which is sad).

So here are a few suggestions to reduce the risk of this problem striking you, and to improve performance in general.  I must point out that NONE of them fixes the problem, as I'll explain later.

1.  You should list all your local IPs (as shown by "ifconfig" or "ip ad sh") in /etc/hosts - this is reasonable as most systems do this anyway.  If you have one or two dynamic IPs, however, this becomes trickier.  In my case above I don't have any.

2.  Run a local DNS cache and have /etc/resolv.conf point to it.  I *always* run djb's dnscache on 127.0.0.1 on all my machines anyway; it's fast and reliable (http://cr.yp.to/ - it's old though).  Having the cache local reduces latency on successive DNS lookups; in my *normal* case above this saves around 60-odd DNS queries from leaving the machine (search and domain lines in /etc/resolv.conf often cause more harm than good; fortunately the authoritative servers for my search lines are in the same cabinet and respond in <1ms).
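For suggestion 1, a quick way to check whether /etc/hosts actually covers your local IPs.  This is only a sketch in Python: the function names are made up for illustration, and you have to supply the list of local addresses yourself (for instance copied from "ip ad sh").

```python
import ipaddress


def hosts_addresses(hosts_text):
    """Return the set of IP addresses that have an entry in hosts-file text."""
    found = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue  # blank line, or an address with no name attached
        try:
            found.add(ipaddress.ip_address(fields[0]))
        except ValueError:
            continue  # first field is not an IP address
    return found


def missing_from_hosts(hosts_text, local_ips):
    """Return the local IPs (as strings) that hosts_text does not list."""
    listed = hosts_addresses(hosts_text)
    return [ip for ip in local_ips if ipaddress.ip_address(ip) not in listed]
```

Calling missing_from_hosts(open("/etc/hosts").read(), [...]) with your own addresses returns the ones that still need an entry.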

The two suggestions above will unfortunately only improve chan_sip load times if there are no serious external problems.  In a case like the one above, where sip.iburst.co.za cannot be resolved at all and all the authoritative name servers for iburst.co.za are gone from the face of the earth, you're still stuffed if you don't have a locally cached record.  And to make matters worse, you're waiting for two DNS timeouts: first for SRV _sip._udp.${sipdomain} and then for A ${sipdomain}.  Should the SRV record resolve and list 10 other names to be looked up, all of which fail, then the problem becomes even worse.  Consider for example an SRV record for _sip._udp.me.co.za that lists (sip1.iburst.co.za, sip2.iburst.co.za ... sip10.iburst.co.za): you then wait for all 10 of those lookups to time out.
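To put rough numbers on that, here is a back-of-the-envelope sketch in Python.  The 10-second per-lookup timeout is an assumption for illustration (not something I measured), and so is the idea that the lookups run strictly one after another; the function name is made up.

```python
def worst_case_block_seconds(per_lookup_timeout, srv_resolves, srv_targets=0):
    """Upper-bound estimate of how long one hostname can block.

    per_lookup_timeout: seconds one failing lookup takes to give up (assumed).
    srv_resolves: True if the SRV query itself succeeds.
    srv_targets: number of names in the SRV answer, all assumed to fail.
    """
    if srv_resolves:
        # SRV answers quickly, then each target's A lookup times out in turn.
        return srv_targets * per_lookup_timeout
    # SRV times out first, then the fallback A lookup times out as well.
    return 2 * per_lookup_timeout


if __name__ == "__main__":
    timeout = 10.0  # assumed value, adjust to what your resolver really does
    print(worst_case_block_seconds(timeout, srv_resolves=False))
    print(worst_case_block_seconds(timeout, srv_resolves=True, srv_targets=10))
    # 16 register lines with nothing coalesced and neither record resolving:
    print(16 * worst_case_block_seconds(timeout, srv_resolves=False))
```

The point is only that the delay scales with the number of failing names and the number of register lines, so a single dead zone can add up to minutes.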

There are *risky* ways to mitigate the risk further.  Specifically, you can "replicate" the external zones into /etc/hosts.  For example, let's say sip.iburst.co.za normally (when it works) resolves to 1.2.3.4; then you can add "1.2.3.4 sip.iburst.co.za" to /etc/hosts and disable SRV lookups in sip.conf (srvlookup=no).  This obviously won't work if you need SRV records.

As another mechanism, if you use a config generator, whenever you place a hostname into the config file, look up the desired IP in the config generator and put the IP into the asterisk config instead.  This removes the need for DNS lookups in chan_sip, so chan_sip no longer blocks needlessly.  It carries similar risks to the /etc/hosts solution above.  If desired, store the looked-up RR in a text file on disk so that you can re-use it at a later stage if a newer lookup fails.
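A minimal sketch of that idea in Python, assuming the generator only needs one IPv4 address per hostname and keeps its last good answers in a small JSON file.  The function names and the cache file are hypothetical; nothing here is part of asterisk.

```python
import json
import socket


def resolve_ipv4(hostname):
    """Return the first IPv4 address for hostname, or raise OSError."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_DGRAM)
    return infos[0][4][0]


def resolve_with_fallback(hostname, cache_path, resolver=resolve_ipv4):
    """Resolve hostname, falling back to the last address saved on disk."""
    try:
        with open(cache_path) as fh:
            cache = json.load(fh)
    except (OSError, ValueError):
        cache = {}  # no cache yet, or unreadable: start empty

    try:
        address = resolver(hostname)
    except OSError:
        # Lookup failed: reuse the previously stored address if we have one.
        if hostname in cache:
            return cache[hostname]
        raise

    if cache.get(hostname) != address:
        cache[hostname] = address
        with open(cache_path, "w") as fh:
            json.dump(cache, fh, indent=2)
    return address
```

The generator would call resolve_with_fallback("sip.iburst.co.za", <cache file>) and write the returned address into the "host =" and register lines instead of the name.  Note that the blocking lookup still happens, just in the generator rather than inside chan_sip.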

Another suggestion was to see whether we can localize any changes to dnsmgr.  The changes that were mentioned specifically are as follows:

1. Alter the scheduler to refresh on DNS TTL values.
2. Coalesce lookups for the same host and type (currently multiple register lines as per above will still result in multiple DNS queries being generated).
3. It's unclear whether SRV lookups are being handled by dnsmgr or not at this point.
4. Configurable DNS failure timeout (e.g., normally my lookups succeed in <5ms, so set the failure timeout to 50ms).
5. Re-use stale records in case of DNS failure.
6. Store DNS lookups into astdb to cache over asterisk restarts.

I seriously doubt all of these changes are required; however, from a quick scan we will need at least (4) and (5).  If (3) turns out such that SRV records are handled by dnsmgr then that's sufficient; otherwise, SRV support in chan_sip should be disabled to ensure that this issue won't strike.
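To illustrate what (4) and (5) could look like together, with a bit of (1) and (2) thrown in, here is a rough sketch in Python.  It is not how dnsmgr is implemented (dnsmgr is C, and I have not modelled its scheduler); the class name, the resolver callback and its (value, ttl) return shape are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as LookupTimeout


class StaleTolerantCache:
    """Cache lookups per (host, rtype); serve stale data when a refresh fails."""

    def __init__(self, resolver, timeout=0.05, clock=time.monotonic):
        self._resolver = resolver  # callable(host, rtype) -> (value, ttl_seconds)
        self._timeout = timeout    # (4): give up waiting after this many seconds
        self._clock = clock
        self._records = {}         # (host, rtype) -> (value, expires_at)
        self._pool = ThreadPoolExecutor(max_workers=4)

    def lookup(self, host, rtype="A"):
        key = (host, rtype)
        cached = self._records.get(key)
        if cached and self._clock() < cached[1]:
            # (1) and (2): still within TTL, so repeat lookups for the same
            # host and type share this record and no new query is sent.
            return cached[0]

        future = self._pool.submit(self._resolver, host, rtype)
        try:
            value, ttl = future.result(timeout=self._timeout)
        except (LookupTimeout, OSError):
            # The abandoned lookup keeps running in its worker thread, but
            # the caller is no longer blocked on it.
            if cached:
                return cached[0]   # (5): reuse the stale record
            raise
        self._records[key] = (value, self._clock() + ttl)
        return value
```

With something like this, 16 register lines pointing at the same host cost one query per TTL, and a dead name server costs each caller at most the configured timeout as long as a record was seen before.  The sketch is not thread-safe and does not persist anything, so (6) is not covered.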

> chan_sip blocks on DNS lookups - causing severe delays with registrations in certain scenarios
> ----------------------------------------------------------------------------------------------
>
>                 Key: ASTERISK-21378
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-21378
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General
>    Affects Versions: 11.3.0
>         Environment: Gentoo Linux, asterisk 11.3.0
>            Reporter: Jaco Kroon
>            Severity: Critical
>
> One of the bigger ISPs in South Africa decided to blow up their entire network today.  Our setup has quite a number (16 to be exact) of register lines of the form:
> {noformat}
> register => 2787....:secret@sip.iburst.co.za/087....
> {noformat}
> As soon as they decided to press the big red button to take down their network and we hit a SIP reload ... *boom* - we could for the life of us not get asterisk back up into a working state.  We have a sip peer looking like this:
> {noformat}
> [iburst]
> host = sip.iburst.co.za
> type=friend
> qualify=yes
> disallow=all
> allow=g729
> context=inbound-iburst
> directmedia=no
> dtmfmode=rfc2833
> accountcode=iBurst
> jbforce=no
> {noformat}
> Knowing that iBurst went down, and spotting this log entry brought up the theory:
> {noformat}
> [Apr  3 19:36:12] ERROR[27636] netsock2.c: getaddrinfo("sip.iburst.co.za", "(null)", ...): Name or service not known
> {noformat}
> So I commented out the register lines, and lo and behold, it takes about 20 seconds longer than usual for asterisk to start servicing the :5060 udp socket (normally a watch netstat -nulp won't ever show the Recv-Q being anything other than 0; currently it'll keep climbing for around 20 seconds before dropping back down to zero).
> With the register lines uncommented you can forget about sane operation.  It will not happen.  In fact, the only way for me to recover is to kill -9 asterisk.
> I currently have dnsmgr disabled, even though I can see (from the code) that the handling differs with dnsmgr enabled, and it does make more sense for me to have it enabled anyway.
> I'm not sure what the best way would be to handle this, but I suspect that registrations need to happen in a separate thread, and DNS lookups should probably happen without any locks held in chan_sip.
> For the moment (since none of the peers I need to peer with use SRV records, and their DNS should not change that often) I might be better off performing the DNS lookups outside of asterisk and just hard-coding the IPs into the config.  From a rudimentary test this seems to work quite well (asterisk is back to its normal behaviour of starting up chan_sip in a VERY short time frame).
> A quick test with dnsmgr enabled, but utilizing DNS names again instead of IP addresses results in completely broken behaviour again.
