[asterisk-bugs] [JIRA] (ASTERISK-30381) res_resolver_unbound: Using unbound, queries do not try all available nameservers, and contacts will flap

Thu Dec 29 08:22:06 CST 2022

    [ https://issues.asterisk.org/jira/browse/ASTERISK-30381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=261088#comment-261088 ] 

Mark Murawski commented on ASTERISK-30381:
------------------------------------------

Thanks for your insight; it's never what I expected!

My '1':
1) Don't treat a contact as unreachable if the DNS suddenly fails, but SIP OPTIONS is still working to the last-known IP

Means this:
The contact's reachability is flapping between reachable and unreachable because of DNS failures, but the contact itself never went down or became otherwise unavailable. It looks like PJSIP treats the contact as having failed to be contacted as soon as a DNS lookup that was previously working suddenly fails.

If the DNS lookup fails but we have a last-known address for this contact, the DNS failure shouldn't flap the contact. We still have a good resolution of the contact's address from the last successful lookup. PJSIP should keep using that address to send SIP OPTIONS, and mark the contact as unreachable only if SIP OPTIONS fails to come back with an OK. Or maybe add an option to behave this way. I'm having a hard time thinking of a use case for treating (most likely temporary) failures in DNS resolution as a hard-down for the contact when the contact is alive and well, considering that if the contact had a hard-coded IP, all would be well.

Rationale: My go-to theory of failure handling is that the system should try all reasonable available options to continue operating. That includes keeping last-known-good data and, as long as that data still works, continuing to use it until new data can be obtained. In the meantime, throw alarms and let the user determine how to handle this.
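
To make the proposed behavior concrete, here is a minimal sketch of one qualify cycle that falls back to the last-known-good address. This is not Asterisk or PJSIP code: `Contact`, `qualify`, `resolve` and `send_options` are hypothetical names used only for illustration, and the two callables stand in for the DNS lookup and the SIP OPTIONS round trip.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Contact:
    """Hypothetical contact record; not an Asterisk data structure."""
    hostname: str
    last_good_addr: Optional[str] = None  # address from the last successful lookup
    reachable: bool = False
    dns_alarm: bool = False  # set while DNS is failing, so the user can be alerted


def qualify(contact: Contact,
            resolve: Callable[[str], Optional[str]],
            send_options: Callable[[str], bool]) -> bool:
    """Run one qualify cycle and return the resulting reachability."""
    addr = resolve(contact.hostname)
    if addr is None:
        # DNS failed: raise the alarm, but keep using the last good address.
        contact.dns_alarm = True
        addr = contact.last_good_addr
    else:
        contact.dns_alarm = False
        contact.last_good_addr = addr

    # Only the SIP OPTIONS result decides reachability, never DNS alone.
    contact.reachable = addr is not None and send_options(addr)
    return contact.reachable
```

With this shape, a contact that has never resolved stays unreachable, while a contact that resolved once keeps being qualified against its cached address during a DNS outage and only goes unreachable when OPTIONS itself fails.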

> res_resolver_unbound: Using unbound, queries do not try all available nameservers, and contacts will flap
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: ASTERISK-30381
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-30381
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_resolver_unbound
>    Affects Versions: 18.15.1, 19.7.1, 20.0.1
>            Reporter: Mark Murawski
>            Assignee: Mark Murawski
>
> With what's probably a fairly standard DNS server list containing a local DNS server and some backups, using the unbound DNS resolver results in non-deterministic lookup failures.
> Given resolv.conf:
> {code}
> options attempts:3 timeout:1
> nameserver 192.168.5.2
> nameserver 4.2.2.2
> nameserver 8.8.8.8
> {code}
> Given resolver_unbound.conf:
> {code}
> [general]
> hosts = /etc/hosts
> resolv = /etc/resolv.conf
> {code}
> Given pjsip_wizard.conf:
> {code}
> [wombat]
> type = wizard
> remote_hosts = foo.vpn.lan
> aor/qualify_frequency = 60
> aor/qualify_timeout = 2000
> {code}
> You wind up with contacts flapping in reachability due to DNS but not due to lack of SIP OPTIONS.  (The foo.vpn.lan host was responding to SIP OPTIONS this entire time, but we had intermittent DNS failures):
> {code}
> Contact wombat/sip:foo.vpn.lan is now Reachable.  RTT: 37.946 msec
> Contact wombat/sip:foo.vpn.lan is now Unreachable.  RTT: 0.000 msec
> Contact wombat/sip:foo.vpn.lan is now Reachable.  RTT: 37.946 msec
> Contact wombat/sip:foo.vpn.lan is now Unreachable.  RTT: 0.000 msec
> Contact wombat/sip:foo.vpn.lan is now Reachable.  RTT: 37.946 msec
> Contact wombat/sip:foo.vpn.lan is now Unreachable.  RTT: 0.000 msec
> {code}
> The reason for this is twofold:
> 1) Unbound does not query more than one DNS server to get the result for a given request.
> 2) Unbound does not respect the order of DNS servers in /etc/resolv.conf.
> Unbound debug logging shows the DNS server order:
> {code}
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] info: DelegationPoint<.>: 0 names (0 missing), 3 addrs (0 result, 3 avail) parentNS\n", 116) = 116
> [pid 10346] getpid()                    = 8890
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] debug:    ip4 8.8.8.8 port 53 (len 16)\n", 71) = 71
> [pid 10346] getpid()                    = 8890
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] debug:    ip4 4.2.2.2 port 53 (len 16)\n", 71) = 71
> [pid 10346] getpid()                    = 8890
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] debug:    ip4 192.168.5.2 port 53 (len 16)\n", 75) = 75
> [pid 10346] getpid()                    = 8890
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] debug: attempt to get extra 3 targets\n", 70) = 70
> {code}
> Take this example:
> {code}
> Timestamp 12:00:00: DNS Lookup foo.vpn.lan using 8.8.8.8 .. fails due to vpn.lan only exists on 192.168.5.2... local cached dns for endpoint contact is deleted, host marked unreachable
> Timestamp 12:01:00: DNS Lookup foo.vpn.lan using 4.2.2.2 .. fails due to vpn.lan only exists on 192.168.5.2... local cached dns for endpoint contact is deleted, host marked unreachable
> Timestamp 12:02:00: DNS Lookup foo.vpn.lan using 192.168.5.2 .. success! endpoint dns is stored, host is marked reachable
> Timestamp 12:03:00: DNS Lookup foo.vpn.lan using 4.2.2.2 .. fails due to vpn.lan only exists on 192.168.5.2... local cached dns for endpoint contact is deleted, host marked unreachable
> Timestamp 12:04:00: DNS Lookup foo.vpn.lan using 8.8.8.8 .. fails due to vpn.lan only exists on 192.168.5.2... local cached dns for endpoint contact is deleted, host marked unreachable
> {code}
> If you change resolver_unbound.conf to the following:
> {code}
> [general]
> hosts = /etc/hosts
> nameserver = 192.168.5.2
> {code}
> This does not fix the issue. Unbound does not treat this as the full nameserver list and still uses the three nameservers specified in /etc/resolv.conf.
> The ideal behavior here would be:
> 1) Don't treat a contact as unreachable if the DNS suddenly fails, but SIP OPTIONS is still working to the last-known IP
> 2) Try all DNS servers until we get a successful lookup, or all servers have failed lookups
> The only workaround for this is to noload res_resolver_unbound.so
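
Ideal behavior 2) from the description above could look roughly like the sketch below: ask each nameserver in its configured order and stop at the first successful answer. This is not how res_resolver_unbound or libunbound is implemented. `resolve_with_fallback` and the `query` callable are hypothetical names, and `query` is assumed to return an address string on success, return `None` when the server has no answer, and raise `OSError` on a timeout or network error.

```python
from typing import Callable, Iterable, Optional


def resolve_with_fallback(hostname: str,
                          nameservers: Iterable[str],
                          query: Callable[[str, str], Optional[str]]) -> Optional[str]:
    """Ask each nameserver in listed order; return the first answer, else None."""
    for server in nameservers:
        try:
            answer = query(server, hostname)
        except OSError:
            # Timeout or network error: treat as a miss and try the next server.
            continue
        if answer is not None:
            return answer
    # Every server either failed or had no answer for this name.
    return None
```

With the resolv.conf from the description, passing `["192.168.5.2", "4.2.2.2", "8.8.8.8"]` would consult the local server first, and a lookup for foo.vpn.lan would fail only if all three servers failed.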
