[Asterisk-Users] Hardware to build an Enterprise Asterisk Universal Gateway

Sun Jan 4 20:23:21 MST 2004

The comments below are certainly not intended as any form of negativism,
but rather to pursue thought processes for redundant systems.

> > 1. Moving a physical interface (whether a T1, ethernet or 2-wire pstn) is
> > mostly trivial, however what "signal" is needed to detect a system failure
> > and move the physical connection to a second machine/interface? (If there
> > are three systems in a cluster, what signal is needed? If a three-way
> > switch is required, does someone want to design, build, and sell it to
> > users? Any need to discuss a four-way switch? Should there be a single
> > switch that flip-flops all three at the same time (T1, Ethernet, pstn)?)
> 
> Simple idea:  Have a process on each machine pulse a lead-state (something
> as simple as DTR out a serial port or a single data line on a parallel
> port) out to an external box.  This box is strictly discrete hardware and
> built with a timeout that is retriggered by the pulse.  When the pulse fails
> to arrive, the box switches the T1 over to the backup system.

And upon partial restoration of the failed system, should it automatically
fall back to the primary? Or, might there be some element of human 
control that would suggest not falling back until told to do so?
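
To make the fall-back question concrete, here is a minimal sketch of the
box's logic as a software model, assuming the "human control" option: the
timeout switches to the backup on its own, but only an operator switches
back. The class and method names (FailoverBox, manual_restore) and the
3-second timeout are made up for illustration; the real box would be
discrete hardware, not Python.

```python
class FailoverBox:
    """Software model of the external box: a retriggerable timeout plus a
    latch. All names and values here are hypothetical."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_pulse = 0.0
        self.active = "primary"

    def pulse(self, now):
        # A heartbeat pulse from the primary retriggers the timer.
        self.last_pulse = now

    def tick(self, now):
        # Switch to the backup once the pulse has been missing too long.
        # There is deliberately no automatic path back to the primary.
        if self.active == "primary" and now - self.last_pulse > self.timeout_s:
            self.active = "backup"
        return self.active

    def manual_restore(self, now):
        # Assumption: only an operator returns the primary to service.
        # The timer restarts so the primary gets a full timeout window.
        self.active = "primary"
        self.last_pulse = now


box = FailoverBox(timeout_s=3.0)
box.pulse(0.0)
print(box.tick(2.0))   # pulse is recent, primary stays active
print(box.tick(4.0))   # pulse overdue, box switches to backup
box.pulse(5.0)         # primary partially restored and pulsing again
print(box.tick(5.5))   # still on backup until an operator says otherwise
```

With automatic fall-back, the last line would flip to the primary as soon
as the pulse returned, whether or not the primary was really fit for
production.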

> > Since protecting calls in progress (under all circumstances and
> > configurations) is likely the most expensive and most difficult to achieve,
> > we can probably all agree that handling this should be left to some
> > future long-range plan. Is that acceptable to everyone?
> 
> It's going to be almost impossible to preserve calls in progress.  If you
> switch a T1 from one machine to the other, there's going to be a lack of
> sync (ISDN D-channels need to come up, RBS channels need to wink) that's
> going to result in the loss of the call.

What about calls in progress between two sip phones (and cdr records)?

> > 2. In a hot-spare arrangement (single primary, single running secondary),
> > what static and/or dynamic information needs to be shared across the
> > two systems to maintain the best chance of switching to the secondary
> > system in the shortest period of time, and while minimizing the loss of
> > business data? (Should this same data be shared across all systems in
> > a cluster if the cluster consists of two or more machines?)
> >
> > 3. In a clustered environment, is clustering based on IP address or MAC
> > address?
> >    a. If based on an IP address, is a layer-3 box required between * and
> >       sip phones? (If so, how many?)
> 
> Yes.  You'll need something like Linux Virtual Server or an F5 load
> balancing box to make this happen.  You can play silly games with round
> robin DNS, but it doesn't handle failure well.

Agreed, but then one would need two F5 boxes as "it" would become the new
single point of failure.

> >    b. If based on MAC address, what process moves an active * MAC address
> >       to another * machine (to maintain connectivity to sip phones)?
> 
> Something like Ultra Monkey (http://www.ultramonkey.org)
> 
> >    c. Should sessions that rely on a failed machine in a cluster simply
> >       be dropped?
> >    d. Are there any realistic ways to recover RTP sessions in a clustered
> >       environment when a single machine within the cluster fails, and RTP
> >       sessions were flowing through it (canreinvite=no)?
> >    e. Should a sip phone's arp cache timeout be configurable?
> 
> Shouldn't need to worry about that unless the phone is on the same
> physical network segment.

Which in most cases where asterisk is deployed (obviously not all) is 
probably the case.

> >    f. Which system(s) control the physical switch in #1 above?
> 
> A voting system...all systems control it.  It is up to the switch to
> decide who isn't working right.

With probably some manual over-ride since we know that systems can 
appear to be ready for production, but the sys admin says it's not ready
due to any number of valid technical reasons.
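
A rough sketch of how a vote plus manual over-ride might combine, assuming
each system reports a healthy/unhealthy opinion on every candidate and the
switch picks the first candidate that wins a strict majority and is not
held out by the admin. The function name, the data layout, and the
admin_hold set are hypothetical, not anything the switch or asterisk
provides.

```python
def pick_active(health_votes, admin_hold=frozenset()):
    """Return the first candidate with a strict majority of healthy votes
    that the admin has not held out of service, or None if there is none.

    health_votes maps candidate name -> list of bool votes. Dict order is
    the preference order (insertion order is preserved in Python 3.7+).
    """
    for name, votes in health_votes.items():
        if name in admin_hold:
            # Manual over-ride: looks healthy, but the admin says no.
            continue
        if sum(votes) * 2 > len(votes):
            return name
    return None


votes = {
    "ast1": [True, True, False],   # two of three voters say ast1 is fine
    "ast2": [True, True, True],
}
print(pick_active(votes))                       # ast1
print(pick_active(votes, admin_hold={"ast1"}))  # ast2
```

The over-ride is checked before the vote on purpose: a machine can win
every vote and still be one the sys admin does not want taking calls.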

> >    g. Is sharing static/dynamic operational data across some sort of
> >       high-availability hsrp channel acceptable, or, should two or more
> >       database servers be deployed?
> 
> DB Server clustering is a fairly solid technology these days.  Deploy a DB
> cluster if you want.

Which gets to be rather expensive, adds complexity, and additional
points of failure (decreasing the ability to approach five/four-9's).
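
The arithmetic behind that, as a small sketch: when a call needs every box
in the path to be up, the availabilities multiply, so each added box can
only pull the total down. The 99.99% figures below are illustrative
assumptions, not measurements of asterisk, any database, or any load
balancer.

```python
def series_availability(*parts):
    """Availability of a chain where every part must be up at once."""
    total = 1.0
    for a in parts:
        total *= a
    return total


def downtime_minutes_per_year(availability):
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# Assumed figures: each box is "four nines" on its own.
asterisk_only = series_availability(0.9999)
with_db_and_lb = series_availability(0.9999, 0.9999, 0.9999)

print(round(downtime_minutes_per_year(asterisk_only), 1))   # about 52.6
print(round(downtime_minutes_per_year(with_db_and_lb), 1))  # about 157.7
```

Three four-nines boxes in series no longer add up to a four-nines service.
Redundant pairs improve each stage, but every pair also brings its own
switch-over mechanism that has to work.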

> > 4. If a firewall/nat box is involved, what are the requirements to detect
> >    and handle a failed * machine?
> >    a. Are the requirements different for hot-spare vs clustering?
> >    b. What if the firewall is an inexpensive device (eg, Linksys) with
> >       minimal configuration options?
> >    c. Are the nat requirements within * different for clustering?
> >
> > 5. Should sip phones be configurable with a primary and secondary proxy?
> >    a. If the primary proxy fails, what determines when a sip phone fails
> >       over to the secondary proxy?
> 
> Usually a simple timeout works for this..but if your clustering/hot-spare
> switch works right...the client should never need to change.

Sort of depends a lot on exactly how the clustering is really implemented,
the business objectives in terms of the business continuity plan, etc.

> >    b. After fail over to the secondary, what determines when the sip phone
> >       should switch back to the primary proxy? (Is the primary ready to
> >       handle production calls, or is it back ready for a system admin to
> >       diagnose the original problem in a non-production manner?)
> 
> Auto switch-back is never a good thing.  Once a system is taken out of
> service by an automated monitoring system, it should be up to human
> intervention to say that it is ready to go back into service.

Not sure your answer really applies to the question. The question was
about when the sip phone should fall back to the primary proxy "if" a
primary and secondary proxy are in use.
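
To show what the phone itself would have to decide, here is a sketch of
phone-side proxy selection under assumed rules: fail over after a number
of consecutive timeouts, and fall back either automatically when the
primary answers a probe or only when an admin says so. Everything here
(ProxySelector, max_timeouts, auto_failback, the probe) is hypothetical
and not taken from any particular sip phone.

```python
class ProxySelector:
    """Hypothetical phone-side logic for choosing between two proxies."""

    def __init__(self, max_timeouts=3, auto_failback=False):
        self.max_timeouts = max_timeouts
        self.auto_failback = auto_failback
        self.current = "primary"
        self.timeouts = 0

    def on_response(self):
        # Any answer from the current proxy clears the failure count.
        self.timeouts = 0

    def on_timeout(self):
        # Fail over only after several consecutive timeouts.
        self.timeouts += 1
        if self.current == "primary" and self.timeouts >= self.max_timeouts:
            self.current = "secondary"
            self.timeouts = 0

    def on_primary_probe_ok(self):
        # The primary answering a probe does not prove it is ready for
        # production calls, so falling back here is a policy choice.
        if self.current == "secondary" and self.auto_failback:
            self.current = "primary"

    def admin_failback(self):
        # Explicit instruction that the primary is back in production.
        self.current = "primary"
        self.timeouts = 0


phone = ProxySelector(max_timeouts=3, auto_failback=False)
for _ in range(3):
    phone.on_timeout()
print(phone.current)          # secondary
phone.on_primary_probe_ok()
print(phone.current)          # still secondary: waiting for an admin
```

The phone cannot tell from a probe whether the primary is back in
production or only back for diagnosis, which is why this is a separate
decision from taking a server out of service.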

Part of the point of many of the questions is that there really are a
lot of dependencies on devices other than asterisk, and simply going down
a path that says clustering (or whichever approach) can handle something
is probably ignoring several of those dependencies which does not actually
improve the end-to-end availability of asterisk. (Technically, asterisk
is up, you just can't reach it because your phone (or whatever) doesn't
know how to get to it.)

Using another load-balancing box (F5 or whatever) only moves the problem
to that box. Duplicating it moves the problem to another box, until
the costs exponentially grow beyond the initial intended value of the
solution. The weak points become lots of other boxes and infrastructure,
suggesting that asterisk really isn't "the" weakest point (regardless of
what it's built on).

Rich