[asterisk-bugs] [JIRA] (ASTERISK-30381) res_resolver_unbound: Using unbound, queries do not try all available nameservers, and contacts will flap
Joshua C. Colp (JIRA)
noreply at issues.asterisk.org
Thu Dec 29 08:48:05 CST 2022
[ https://issues.asterisk.org/jira/browse/ASTERISK-30381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=261089#comment-261089 ]
Joshua C. Colp edited comment on ASTERISK-30381 at 12/29/22 8:48 AM:
---------------------------------------------------------------------
If you want to explore it further then go ahead, but this is NOT an easy thing and is full of traps. For example, what if the underlying DNS record changes and now you're using stale information? Do you obey the TTL? Then you're just buying yourself time until failure, unless you also keep using the old information after a failed DNS lookup - but should that be configurable, and for how long? That information also isn't used when dialling. Dialling does a fresh DNS lookup along with any SRV/NAPTR so that failover and load balancing occur - so does the information cached for OPTIONS now also get used for other things such as sending an INVITE? Do you cache all the results? What about the load balancing I mentioned? For how long, again?
It's a lot of knobs and configuration.
I also think "failure" is overloaded. The DNS server didn't fail, but the lookup process resulted in no records. If a DNS server does fail then it will go to an alternate.
This isn't something the Asterisk team at Sangoma will look into.
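To make those knobs concrete, here is a minimal Python sketch of a last-known-good cache. It is not how Asterisk or res_resolver_unbound behaves today; `StaleCache`, `max_stale`, and the injected `lookup` and `clock` callables are hypothetical names chosen purely for illustration.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical lookup signature: returns (addresses, ttl_seconds), or None
# when the lookup produced no records.
Lookup = Callable[[str], Optional[Tuple[List[str], float]]]


class StaleCache:
    """Last-known-good cache: obey the TTL, then serve stale data for a bounded time."""

    def __init__(self, lookup: Lookup, max_stale: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._max_stale = max_stale  # knob: seconds past TTL that old data may be used
        self._clock = clock
        # name -> (addresses, absolute expiry time)
        self._entries: Dict[str, Tuple[List[str], float]] = {}

    def resolve(self, name: str) -> Optional[List[str]]:
        now = self._clock()
        cached = self._entries.get(name)
        if cached and now < cached[1]:
            return cached[0]  # still within TTL, no lookup needed

        fresh = self._lookup(name)
        if fresh is not None:
            addrs, ttl = fresh
            self._entries[name] = (addrs, now + ttl)
            return addrs

        # Lookup gave nothing: fall back to stale data, but only inside the grace window.
        if cached and now < cached[1] + self._max_stale:
            return cached[0]
        self._entries.pop(name, None)
        return None
```

Even in this small form the trade-offs above show up: the grace window only delays the failure, a changed record goes unnoticed while stale data is served, and nothing here says whether an INVITE should share the cache.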
> res_resolver_unbound: Using unbound, queries do not try all available nameservers, and contacts will flap
> ---------------------------------------------------------------------------------------------------------
>
> Key: ASTERISK-30381
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-30381
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Resources/res_resolver_unbound
> Affects Versions: 18.15.1, 19.7.1, 20.0.1
> Reporter: Mark Murawski
> Assignee: Unassigned
>
> Using what's probably a fairly standard DNS server list containing a local DNS server and some backups, using the unbound DNS resolver will result in non-deterministic lookup failures.
> Given resolv.conf:
> {code}
> options attempts:3 timeout:1
> nameserver 192.168.5.2
> nameserver 4.2.2.2
> nameserver 8.8.8.8
> {code}
> Given resolver_unbound.conf:
> {code}
> [general]
> hosts = /etc/hosts
> resolv = /etc/resolv.conf
> {code}
> Given pjsip_wizard.conf:
> {code}
> [wombat]
> type = wizard
> remote_hosts = foo.vpn.lan
> aor/qualify_frequency = 60
> aor/qualify_timeout = 2000
> {code}
> You wind up with contacts flapping in reachability because of DNS failures, not because of missing SIP OPTIONS responses. (The foo.vpn.lan host was responding to SIP OPTIONS this entire time, but we had intermittent DNS failures):
> {code}
> Contact wombat/sip:foo.vpn.lan is now Reachable. RTT: 37.946 msec
> Contact wombat/sip:foo.vpn.lan is now Unreachable. RTT: 0.000 msec
> Contact wombat/sip:foo.vpn.lan is now Reachable. RTT: 37.946 msec
> Contact wombat/sip:foo.vpn.lan is now Unreachable. RTT: 0.000 msec
> Contact wombat/sip:foo.vpn.lan is now Reachable. RTT: 37.946 msec
> Contact wombat/sip:foo.vpn.lan is now Unreachable. RTT: 0.000 msec
> {code}
> The reason for this is twofold:
> 1) Unbound does not query more than one DNS server to get the result for a given request.
> 2) Unbound does not respect the order of DNS servers in /etc/resolv.conf.
> Unbound debug logging shows the DNS server order:
> {code}
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] info: DelegationPoint<.>: 0 names (0 missing), 3 addrs (0 result, 3 avail) parentNS\n", 116) = 116
> [pid 10346] getpid() = 8890
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] debug: ip4 8.8.8.8 port 53 (len 16)\n", 71) = 71
> [pid 10346] getpid() = 8890
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] debug: ip4 4.2.2.2 port 53 (len 16)\n", 71) = 71
> [pid 10346] getpid() = 8890
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] debug: ip4 192.168.5.2 port 53 (len 16)\n", 75) = 75
> [pid 10346] getpid() = 8890
> [pid 10346] write(2, "[1672280502] libunbound[8890:0] debug: attempt to get extra 3 targets\n", 70) = 70
> {code}
> Take this example:
> {code}
> Timestamp 12:00:00: DNS Lookup foo.vpn.lan using 8.8.8.8 .. fails due to vpn.lan only exists on 192.168.5.2... local cached dns for endpoint contact is deleted, host marked unreachable
> Timestamp 12:01:00: DNS Lookup foo.vpn.lan using 4.2.2.2 .. fails due to vpn.lan only exists on 192.168.5.2... local cached dns for endpoint contact is deleted, host marked unreachable
> Timestamp 12:02:00: DNS Lookup foo.vpn.lan using 192.168.5.2 .. success! endpoint dns is stored, host is marked reachable
> Timestamp 12:03:00: DNS Lookup foo.vpn.lan using 4.2.2.2 .. fails due to vpn.lan only exists on 192.168.5.2... local cached dns for endpoint contact is deleted, host marked unreachable
> Timestamp 12:04:00: DNS Lookup foo.vpn.lan using 8.8.8.8 .. fails due to vpn.lan only exists on 192.168.5.2... local cached dns for endpoint contact is deleted, host marked unreachable
> {code}
> If you change resolver_unbound.conf to the following:
> {code}
> [general]
> hosts = /etc/hosts
> nameserver = 192.168.5.2
> {code}
> This does not fix the issue. Unbound does not treat this as the full nameserver list and still uses the 3 nameservers specified in /etc/resolv.conf.
> The ideal behavior here would be:
> 1) Don't treat a contact as unreachable if the DNS suddenly fails, but SIP OPTIONS is still working to the last-known IP
> 2) Try all DNS servers until we get a successful lookup, or all servers have failed lookups
> The only workaround for this is to noload res_resolver_unbound.so
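
The second requested behavior (try every server until one answers) can be sketched in a few lines of Python. This is only an illustration of the idea, not libunbound's or Asterisk's logic: `resolve_with_fallback`, `fake_query`, the `ZONES` table, and the address 10.8.0.7 are all made up for the example, and the network is replaced by a dictionary so the sketch runs offline.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical per-server query: returns addresses, or None when that server
# has no records for the name or does not answer.
Query = Callable[[str, str], Optional[List[str]]]


def resolve_with_fallback(name: str, servers: Sequence[str],
                          query: Query) -> Optional[List[str]]:
    """Ask each server in the configured order and stop at the first answer."""
    for server in servers:
        addrs = query(server, name)
        if addrs:
            return addrs
    return None  # every server failed or returned no records


# Stand-in for the network: only the local server knows the vpn.lan zone.
ZONES: Dict[str, Dict[str, List[str]]] = {
    "192.168.5.2": {"foo.vpn.lan": ["10.8.0.7"]},  # invented address
}


def fake_query(server: str, name: str) -> Optional[List[str]]:
    return ZONES.get(server, {}).get(name)


# Even in the order seen in the debug log, the lookup succeeds on the third server.
print(resolve_with_fallback("foo.vpn.lan",
                            ["8.8.8.8", "4.2.2.2", "192.168.5.2"], fake_query))
```

Note that this treats "no records" the same as a server failure, which is what the report asks for. A "no such name" answer from a public server is normally a valid final answer, so this would be a deliberate policy choice.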