[asterisk-bugs] [JIRA] (ASTERISK-21378) chan_sip completely blocks on DNS lookups

Matt Jordan (JIRA) noreply at issues.asterisk.org
Mon May 20 07:40:02 CDT 2013


    [ https://issues.asterisk.org/jira/browse/ASTERISK-21378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=206621#comment-206621 ] 

Matt Jordan commented on ASTERISK-21378:
----------------------------------------

I know Olle is looking into this problem and may be doing some work in this area. He may have some insight and/or work that will alleviate this problem. I'm not sure how far down the DNS rabbit hole he's going, however, so it may not fix all of the problems you've alluded to here.

There are really two problems at play here:
# DNS is not asynchronous. Performing a DNS lookup blocks the calling thread.
# {{chan_sip}} is single threaded.

Both of these items would require massive rewrites of {{chan_sip}}.

For DNS to be asynchronous, {{chan_sip}} would have to have a callback for each DNS lookup. Whenever a DNS lookup occurs, {{chan_sip}} would have to return from that point and resume with its state restored when the DNS lookup completes. I can't even begin to scope out the scale of this work. {{chan_sip}} is structured in such a way that once a request/response handling begins, it is expected to run to completion for that request/response. There is no way to defer additional handling for later processing or for another thread.
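
To give a sense of the shape of that change, here is a minimal sketch in Python. It is not {{chan_sip}} code, and every name in it ({{run_event_loop}}, {{on_resolved}}, the {{state}} dictionary) is hypothetical. It assumes the blocking lookup is pushed to a worker thread and the result is queued back to the single processing thread, which then resumes the request through a callback with its saved state. The {{resolver}} argument stands in for any blocking lookup such as {{socket.gethostbyname}}.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor


def run_event_loop(hostnames, resolver, ticks=5, tick_seconds=0.05):
    """Single-threaded loop that never blocks on a lookup.

    `resolver` is any blocking callable taking a hostname. It runs on a
    worker thread; the result is handed back to the loop thread through a
    queue, where the per-request callback resumes with its saved state.
    """
    completed = queue.Queue()   # worker threads -> loop thread
    log = []                    # what the loop thread did, in order

    def on_resolved(state, address, error):
        # Runs on the loop thread only, so request handling stays sequential.
        log.append(("resolved", state["request_id"], address, error))

    def lookup(state):
        # Runs on a worker thread; only this call is allowed to block.
        try:
            result = (state, resolver(state["host"]), None)
        except OSError as exc:
            result = (state, None, str(exc))
        completed.put(result)

    with ThreadPoolExecutor(max_workers=4) as pool:
        for request_id, host in enumerate(hostnames):
            pool.submit(lookup, {"request_id": request_id, "host": host})

        pending = len(hostnames)
        tick = 0
        while pending or tick < ticks:
            log.append(("tick", tick))  # stands in for servicing the SIP socket
            tick += 1
            try:
                while True:
                    on_resolved(*completed.get_nowait())
                    pending -= 1
            except queue.Empty:
                pass
            time.sleep(tick_seconds)
    return log
```

The hard part in {{chan_sip}} is not the worker thread. It is that every code path that currently calls a lookup in the middle of request handling would have to be split into a "before" half and an {{on_resolved}}-style "after" half, with all of the intermediate state captured explicitly.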

Similarly, {{chan_sip}} being made multi-threaded would invalidate the threading model (such as it is) in {{chan_sip}}. Often, we "know" that only a single thread is processing requests/responses, and certain operations take place sequentially because of it. Opening this up to multiple threads would, again, invalidate much of the structure in {{chan_sip}}.

There is no way to address these problems in a release branch without:
* Consuming significant developer resources
* Injecting a huge amount of risk into release branches
* Requiring a substantial testing effort from the entire Asterisk user community

So what about trunk?

This is why we wrote a new SIP channel driver.
# It is multi-threaded. Its entire design assumes multiple threads servicing requests/responses and processing them in a well defined stack.
# It uses asynchronous DNS.

I cannot foresee a general effort attempting to resolve this problem in {{chan_sip}}.

Now, all of that being said, Olle does good work. He may have a solution that you can try and that may also be generally applicable to release branches of Asterisk. It would be a good idea to contact him and see if you can assist with his development and testing efforts.

                
> chan_sip completely blocks on DNS lookups
> -----------------------------------------
>
>                 Key: ASTERISK-21378
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-21378
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General
>    Affects Versions: 11.3.0
>         Environment: Gentoo Linux, asterisk 11.3.0
>            Reporter: Jaco Kroon
>            Assignee: Matt Jordan
>            Severity: Critical
>
> One of the bigger ISPs in South Africa decided to blow up their entire network today.  Our setup has quite a number (16 to be exact) of register lines of the form:
> {noformat}
> register => 2787....:secret@sip.iburst.co.za/087....
> {noformat}
> As soon as they decided to press the big red button to take down their network and we hit a SIP reload ... *boom* - we could for the life of us not get asterisk back up into a working state.  We have a sip peer looking like this:
> {noformat}
> [iburst]
> host = sip.iburst.co.za
> type=friend
> qualify=yes
> disallow=all
> allow=g729
> context=inbound-iburst
> directmedia=no
> dtmfmode=rfc2833
> accountcode=iBurst
> jbforce=no
> {noformat}
> Knowing that iBurst went down, and spotting this log entry brought up the theory:
> {noformat}
> [Apr  3 19:36:12] ERROR[27636] netsock2.c: getaddrinfo("sip.iburst.co.za", "(null)", ...): Name or service not known
> {noformat}
> So I commented out the register lines and, lo and behold, it takes about 20 seconds longer than usual for asterisk to start servicing the :5060 udp socket (normally a watch netstat -nulp won't ever show the Recv-Q being anything other than 0; currently it'll keep climbing for around 20 seconds before dropping back down to zero).
> With the register lines uncommented you can forget about sane operation.  It will not happen.  In fact, the only way for me to recover is to kill -9 asterisk.
> I currently have dnsmgr disabled, even though I can see (from the code) that the handling differs with dnsmgr enabled, and it does make more sense for me to have it enabled anyway.
> I'm not sure what the best way would be to handle this, but I suspect that registrations need to happen in a separate thread, and DNS lookups should probably happen without any locks held in chan_sip.
> For the moment (since none of the peers I need to peer with use SRV records, and their DNS should not change that often) I might be better off performing the DNS lookups outside of asterisk and just hard-coding the IPs into the config.  From a rudimentary test this seems to work quite well (asterisk is back to normal behaviour of starting up chan_sip in a VERY short time frame).
> A quick test with dnsmgr enabled, but utilizing DNS names again instead of IP addresses results in completely broken behaviour again.
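
The reporter's workaround of resolving names outside Asterisk and hard-coding the addresses could be scripted along these lines. This is only a sketch under the reporter's own stated assumptions (no SRV records, addresses that rarely change). The helper names {{resolve_ipv4}} and {{host_line}} are hypothetical, and the script only prints a {{host=}} line; it does not edit any configuration file.

```python
import socket


def resolve_ipv4(hostname):
    """Return the de-duplicated IPv4 addresses for hostname, sorted as strings."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_DGRAM
    )
    return sorted({info[4][0] for info in infos})


def host_line(hostname, resolver=resolve_ipv4):
    """Build a 'host=' line from the first resolved address.

    Returns None when the name does not resolve, so the caller can keep
    whatever address is already in the configuration.
    """
    try:
        addresses = resolver(hostname)
    except socket.gaierror:
        return None
    return f"host={addresses[0]}"


if __name__ == "__main__":
    line = host_line("sip.iburst.co.za")
    print(line if line is not None else "; lookup failed, leaving config unchanged")
```

Run from cron or by hand, a lookup failure here costs nothing more than a skipped update, whereas the same failure inside {{chan_sip}} stalls the thread that services the SIP socket.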
