[asterisk-dev] RFT: Expanded DNS SRV handling in Asterisk 1.4

Thu Oct 25 14:54:01 CDT 2007

>  >>>>> "JT" == John Todd <jtodd at loligo.com> writes:
>
>JT> Here's why I ask: I've had first-hand experience with
>JT> prioritized/weighted SRV records that cause serious problems.
>JT> Someone puts "10 10 _sip._udp.inside-proxy.foo.com" as their first
>JT> SRV record for foo.com, and "20 20
>JT> _sip._udp.outside-proxy.foo.com" as their second preference SRV
>JT> record for foo.com. The host "inside-proxy" isn't reachable from
>JT> the Internet. Therefore, every call attempt that goes to their
>JT> domain goes first to a proxy that times out (wait... wait...
>JT> wait...) and then goes to the second one that completes. This
>JT> leads to unacceptable timeouts, and eventually leads to hard-coded
>JT> SRV record data put into a local resolving nameserver (can you say
>JT> "domain hijacking for operational purposes?") to avoid the delay.
>JT> This is Very Bad, and leads to User Anger.
>
>Isn't that a case of "Doctor, it hurts when I..."?

I'm not sure I understand how another company having a failed or 
sporadically failing infrastructure is something I can control. 
Isn't one of the major points of multiple SRV records to allow for 
redundancy, which is an extension of "improving perceived functional 
behavior"?  If that is the case, then I'm not sure how your response 
holds up to examination.

>I bet most SIP calls are between cooperating companies, so it should
>be possible to fix the problem correctly instead of doing workarounds.

That's a rather short-sighted bet.  And while most SIP calls are 
probably between cooperating companies, the point (again) of SRV 
records is to allow communications between endpoints that have no 
prior hardcoded relationship.  I'm not sure what your argument is 
here, but if it is that the function of SRV records isn't really 
that important, I suppose we could all go back to IP addresses.  If 
your system is just using SRV records for redundancy on pre-defined 
peers, then perhaps having all of your calls take an additional 15 
seconds to complete would be OK while you try to hammer out the 
problems with the other endpoint.  Myself, I prefer to have the 
system automatically route around problems if my system is smart 
enough to detect them without exposing the user base to the bad 
behaviors.  Lastly, SRV records are not just for redundancy between 
peers that know each other; their first goal is resource discovery, 
and their second goal is redundancy/load sharing.  From how I 
interpret your arguments, you seem to be 
entirely ignoring the first goal of automated resource discovery, 
which would make creation of a manual load-spreading routine 
impossible or highly impractical.
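
To make the ordering concrete, here is a minimal Python sketch of how 
a client might order a set of SRV answers, loosely following the 
priority and weight selection described in RFC 2782.  The SrvRecord 
type, the order_srv name, and the port number are my own assumptions 
for illustration; this is not Asterisk code.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    """One SRV answer (hypothetical container, not an Asterisk type)."""
    priority: int
    weight: int
    port: int
    target: str


def order_srv(records: List[SrvRecord],
              rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Return records in the order a client would try them.

    Lower priority values are tried first. Within one priority,
    records are drawn at random in proportion to their weight.
    """
    rng = rng or random.Random()
    ordered: List[SrvRecord] = []
    for prio in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == prio]
        # Zero-weight records go first so they are rarely picked
        # ahead of weighted ones.
        group.sort(key=lambda r: r.weight != 0)
        while group:
            total = sum(r.weight for r in group)
            pick = rng.randint(0, total)
            running = 0
            for i, rec in enumerate(group):
                running += rec.weight
                if running >= pick:
                    ordered.append(group.pop(i))
                    break
    return ordered


# Port 5060 is assumed here; the original example gave no port.
answers = [
    SrvRecord(20, 20, 5060, "outside-proxy.foo.com"),
    SrvRecord(10, 10, 5060, "inside-proxy.foo.com"),
]
ordered = order_srv(answers)
```

Note that nothing in this ordering knows whether inside-proxy is 
reachable; it is always tried first, which is exactly the problem.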

>JT> I guess I'm saying that SRV record lookups should be able to be
>JT> turned off within Dial (which does exist today, despite my
>JT> approval of SRV lookups being "on" by default) and a function that
>JT> performs SRV lookups should be created so that the local
>JT> administrator can start to create a good/bad list of possible SRV
>JT> response entries for future use. This would not change the way *
>JT> behaves today; it would simply provide an alternative for the more
>JT> sophisticated administrator to control their own fate.
>
>I believe that is complication for no good reason.

I think you disagree with me in the above paragraph, but agree with 
me below that a function for SRV lookups would be a good idea.

>JT> I'm all for SRV automation behind-the-scenes as the default
>JT> behavior. However, I am less happy when there are no alternatives
>JT> to letting an administrator do things in a better way.
>
>You can just ignore the SRV record and define your own IP-based peer.

Of course.  No argument to the contrary there if there is a 
pre-existing agreement of endpoints.  However, the real strength of 
SIP is when there is not a pre-existing agreement of endpoints, which 
is one of the major reasons SRV records are useful.  A hardcoded peer 
does not address several of the major uses of SRV records, so I think 
we can dispense with the concept of hardcoded IP address endpoint 
identification in this discussion as it is not relevant.

>JT> I'm sure I'm not alone here when I say that I dislike programs
>JT> that think they're smarter than me and won't let me change the
>JT> settings.
>
>I have never heard of a mail server which allows you to ignore certain
>MX records but obey the others. If you want to override MX for a
>certain domain, you create a policy for sending mail to that domain,
>and that ignores all the MX records.

MX is not real-time.  SRV records are real-time.  Failing through a 
list of MX hosts does not significantly alter the completion of the 
communication, while failing through SRV records does - users will 
hang up.  And in any case, you are incorrect about MX overrides: 
there are absolutely mail clients that "remember" failed MX hosts and 
will not try to send to them for some "cooloff" period.  I have a 
vague recollection of that method being used at AOL when I was 
chatting with one of their mail admins a few years back, and I 
certainly know that method is used by spammers (the content is 
irrelevant to the technology) since I've watched them crawl through 
my bogus MX slow-down traps on one try, and then the next try 
(automated) they jump right to the functional MX without trying the 
dead ends first.  After a few hours of this, they try transmitting to 
the dead servers again for another "test" pass.
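
The "cooloff" behavior described above could be sketched roughly as 
follows.  This is only an illustration in Python under my own 
assumptions: the CooloffList class, its method names, and the default 
one-hour period are hypothetical and do not come from any mail server 
or from Asterisk.

```python
import time
from typing import Callable, Dict, Iterable, List


class CooloffList:
    """Remember hosts that failed and skip them for a while."""

    def __init__(self, cooloff_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cooloff = cooloff_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._failed_at: Dict[str, float] = {}

    def mark_failed(self, host: str) -> None:
        """Record that a connection attempt to host just failed."""
        self._failed_at[host] = self._clock()

    def is_usable(self, host: str) -> bool:
        """True if host never failed or its cooloff has expired."""
        failed = self._failed_at.get(host)
        if failed is None:
            return True
        if self._clock() - failed >= self._cooloff:
            # Cooloff is over: forget the failure so the host gets
            # another "test" pass.
            del self._failed_at[host]
            return True
        return False

    def usable(self, hosts: Iterable[str]) -> List[str]:
        """Filter hosts, keeping their order, dropping cooled-off ones."""
        return [h for h in hosts if self.is_usable(h)]
```

After a failure, the dead host is skipped on the next attempts and 
only retried once the cooloff period has passed.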

>JT> Summary: I'd love to see a function that resolves and returns an
>JT> array-like set of SRV lookup results for a domain. Let the
>JT> administrator write a routine that then runs through the various
>JT> possible destinations.
>
>That on the other hand makes sense. If you keep all SRV priority
>handling out of asterisk itself, you can probably keep the asterisk
>code simple. There is a risk that the dial plan code gets too
>complicated though.

Dialplans are almost always complicated if an administrator wants to 
truly capture error conditions in a meaningful way, or if there are 
dollars on the line for failure.  "Too complicated" is a local 
decision, not one to be forced by the authors of tool components. 
Building in hidden methods of easing complexity for less rigorous 
developers is fine, but don't sacrifice the flexibility of the tool 
for those who truly want to have precise control over the system.
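
As a rough picture of the kind of administrator-written routine I 
mean, here is a short Python sketch that walks a list of SRV results 
with a short per-target timeout and reports what it tried.  It is not 
dialplan code, and the dial_first_reachable name, the attempt 
callback, and the two-second timeout are all assumptions made up for 
the example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Target = Tuple[str, int]  # (host, port)


def dial_first_reachable(
    targets: Iterable[Target],
    attempt: Callable[[str, int, float], bool],
    timeout: float = 2.0,
) -> Tuple[Optional[Target], List[Tuple[Target, bool]]]:
    """Try each target in order and stop at the first success.

    attempt(host, port, timeout) is supplied by the administrator
    and returns True if the call setup succeeded within timeout.
    Returns the winning target (or None) plus a log of every
    attempt, which can feed a local good/bad list.
    """
    tried: List[Tuple[Target, bool]] = []
    for host, port in targets:
        ok = attempt(host, port, timeout)
        tried.append(((host, port), ok))
        if ok:
            return (host, port), tried
    return None, tried
```

The point is that the policy (how long to wait, what to skip, what to 
record) stays in the administrator's hands rather than inside the 
tool.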

JT