[Asterisk-Users] Hardware to build an Enterprise Asterisk Universal Gateway

Rich Adamson radamson at routers.com
Sun Jan 4 13:01:58 MST 2004


> >I'd guess part of the five-9's discussion centers around how automated
> >must one be to be able to actually get close?  If one assumes the loss
> >of a SIMM, the answer/effort certainly is different than assuming the
> >loss of a single interface card (when multiples exist), etc.
> >
> >I would doubt that anyone reading this list actually has a justifiable
> >business requirement for five-9's given the exponential cost/effort
> >involved to get there. But, setting some sort of reasonable goal
> >that would focus towards failover within xx number of seconds (and
> >maybe some other conditions) seems very practical. 
> >
> >  
> >
> A failover system does not solve the scalability issue.. which means 
> that you have a full server sitting there doing nothing most of the time, 
> whereas if the load were being balanced across the servers in a "cluster" 
> scenario you would also have the scalability..
> 
> Also a failover system would typically only be 2 servers; if there were 
> a cluster system there could be 10 servers, in which case five 9's should 
> be easy..

Everyone's response to Olle's proposition is of value, including yours.

For those that have been involved with analyzing the requirements to
achieve five-9's (for anything), there are tons of approaches, and each
approach comes with some sort of cost/benefit trade-off. Once the approaches
have been documented and costs associated with them, it's common for
the original requirements to be redefined in terms of something that is
more realistic in business terms. Whether that is clustering, hot standby,
or another approach is largely irrelevant at the beginning of the process.

If you're a sponsor of clustering and you're forced to use canreinvite=no,
lots of people would be unhappy when their RTP "system" died. I'm not
suggesting clustering is a bad choice, only suggesting there are lots
of cost/benefit trade-offs that are made on an individual basis and there
might be more than one answer to the reliability/uptime question.

In an earlier post, you mentioned a single IP address issue. That's really
not an issue in some cases as a virtual IP (within a cluster) may be
perfectly fine (canreinvite=yes), etc. Pure guess is that use of a virtual
IP forces some other design choices like the need for a layer-3 box
(since virtual IP's won't fix layer-2 problems), and probably revisiting
RTP standards. (And, if we only have one layer-3 box, guess we need to get
another for uptime, etc, etc.)
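
Just to make the virtual-IP idea concrete, here's a rough sketch of the
takeover step a standby (or surviving cluster member) might perform. It
assumes a Linux box with the iproute2 "ip" and "arping" utilities and sip
phones on the same layer-2 segment; the interface name, addresses, and
the script itself are purely illustrative, not anything that ships with *.

  #!/usr/bin/env python
  # Hypothetical sketch: bring up the shared "virtual" IP on this box and
  # broadcast a gratuitous ARP so the sip phones update their arp caches.
  # Assumes Linux with iproute2 ("ip") and "arping" installed.
  import subprocess

  VIRTUAL_IP = "192.168.1.10"   # address the phones use as their proxy
  PREFIX_LEN = 24
  INTERFACE = "eth0"

  def take_over_virtual_ip():
      # Add the shared address to our interface (harmless if already there).
      subprocess.call(["ip", "addr", "add",
                       "%s/%d" % (VIRTUAL_IP, PREFIX_LEN), "dev", INTERFACE])
      # Gratuitous ARP: announce that VIRTUAL_IP now lives at this box's MAC.
      # Phones with long arp cache timeouts (question 3e below) may still
      # take a while to notice.
      subprocess.call(["arping", "-U", "-c", "3", "-I", INTERFACE, VIRTUAL_IP])

  if __name__ == "__main__":
      take_over_virtual_ip()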

Since hardware has become increasingly reliable, infrastructure items
less expensive, uptimes longer, and software more dependable (in very
general terms, over the years), a hot-spare approach could be just as
effective as a two-box cluster. In both cases, part of the problem boils
down to assumptions about external interfaces and how to move those
interfaces between two (or "more") boxes, and what design requirements
one states regarding calls in progress.

(Olle, are you watching?)

1. Moving a physical interface (whether a T1, ethernet or 2-wire pstn) is 
mostly trivial; however, what "signal" is needed to detect a system failure 
and move the physical connection to a second machine/interface? (If there 
are three systems in a cluster, what signal is needed? If a three-way 
switch is required, does someone want to design, build, and sell it to 
users? Any need to discuss a four-way switch? Should there be a single
switch that flip-flops all three at the same time (T1, Ethernet, pstn)?)
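
One candidate for that "signal" is nothing fancier than a heartbeat over
a dedicated link: the standby listens for periodic datagrams from the
primary and, after some window of silence, fires whatever actually moves
the interfaces. A rough sketch; the port, the timeout, and the
trigger_interface_switch() hook are all invented.

  #!/usr/bin/env python
  # Hypothetical failure detector for question 1: watch for UDP heartbeats
  # from the primary; after TIMEOUT_SECONDS of silence, trigger the switch.
  import socket
  import time

  HEARTBEAT_PORT = 5999     # dedicated heartbeat link, not sip traffic
  TIMEOUT_SECONDS = 5       # how long we tolerate silence from the primary

  def trigger_interface_switch():
      # Placeholder: in practice this would drive the relay/A-B switch that
      # physically moves the T1, Ethernet and pstn connections.
      print("primary silent for %d seconds -- switching interfaces"
            % TIMEOUT_SECONDS)

  def watch_primary():
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      sock.bind(("0.0.0.0", HEARTBEAT_PORT))
      sock.settimeout(1.0)
      last_heard = time.time()
      while True:
          try:
              sock.recvfrom(64)        # any datagram counts as "alive"
              last_heard = time.time()
          except socket.timeout:
              pass
          if time.time() - last_heard > TIMEOUT_SECONDS:
              trigger_interface_switch()
              return

  if __name__ == "__main__":
      watch_primary()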

Since protecting calls in progress (under all circumstances and 
configurations) is likely the most expensive and most difficult to achieve,
we can probably all agree that handling this should be left to some
future long-range plan. Is that acceptable to everyone?

2. In a hot-spare arrangement (single primary, single running secondary),
what static and/or dynamic information needs to be shared across the
two systems to maintain the best chance of switching to the secondary
system in the shortest period of time, and while minimizing the loss of
business data? (Should this same data be shared across all systems in
a cluster if the cluster consists of two or more machines?)
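
As a (deliberately simplistic) illustration of the sort of dynamic data I
have in mind, the primary could push a snapshot of its current sip
registrations to the secondary every few seconds over a private link. The
data chosen, the transport, the address and the interval below are all
illustrative.

  #!/usr/bin/env python
  # Hypothetical sketch for question 2: replicate dynamic state (here just
  # sip registrations) from the primary to the secondary so the secondary
  # can answer for those phones quickly after a switch.
  import json
  import socket
  import time

  SECONDARY = ("10.0.0.2", 5998)   # private replication link (made up)
  INTERVAL = 5                     # seconds between snapshots

  def current_registrations():
      # Placeholder: a real implementation would pull this out of * (or a
      # shared database); hard-coded here purely for illustration.
      return {"sip/2001": "192.168.1.101", "sip/2002": "192.168.1.102"}

  def replicate_forever():
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      while True:
          snapshot = json.dumps(current_registrations())
          sock.sendto(snapshot.encode("utf-8"), SECONDARY)
          time.sleep(INTERVAL)

  if __name__ == "__main__":
      replicate_forever()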

3. If a clustered environment, is clustering based on IP address or MAC
address?
   a. If based on an IP address, is a layer-3 box required between * and
      sip phones? (If so, how many?)
   b. If based on MAC address, what process moves an active * MAC address
      to another * machine (to maintain connectivity to sip phones)?
      (One brute-force option is sketched after this list.)
   c. Should sessions that rely on a failed machine in a cluster simply
      be dropped?
   d. Are there any realistic ways to recover RTP sessions in a clustered
      environment when a single machine within the cluster fails, and RTP
      sessions were flowing through it (canreinvite=no)?
   e. Should a sip phone's arp cache timeout be configurable?
   f. Which system(s) control the physical switch in #1 above?
   g. Is sharing static/dynamic operational data across some sort of
      high-availability hsrp channel acceptable, or, should two or more
      database servers be deployed?
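
On 3b, the brute-force option is for the surviving machine to adopt the
failed machine's MAC address outright, so the phones' arp caches stay
valid without any re-learning. A rough sketch, assuming Linux/iproute2
and a NIC that accepts a new hardware address; the interface name and MAC
are invented.

  #!/usr/bin/env python
  # Hypothetical sketch for question 3b: rewrite this box's MAC address to
  # the one the sip phones already have cached for the failed machine.
  import subprocess

  INTERFACE = "eth0"
  FAILED_MAC = "00:11:22:33:44:55"   # MAC the phones already know

  def adopt_failed_mac():
      # The link has to come down briefly to change the hardware address,
      # which is itself a small outage -- one of the trade-offs in 3a vs 3b.
      subprocess.call(["ip", "link", "set", "dev", INTERFACE, "down"])
      subprocess.call(["ip", "link", "set", "dev", INTERFACE,
                       "address", FAILED_MAC])
      subprocess.call(["ip", "link", "set", "dev", INTERFACE, "up"])

  if __name__ == "__main__":
      adopt_failed_mac()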

4. If a firewall/nat box is involved, what are the requirements to detect
   and handle a failed * machine? (A liveness probe is sketched after
   this list.)
   a. Are the requirements different for hot-spare vs clustering?
   b. What if the firewall is an inexpensive device (eg, Linksys) with
      minimal configuration options?
   c. Are the nat requirements within * different for clustering?
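
The liveness probe mentioned in item 4 could be as simple as a periodic
SIP OPTIONS request that treats any reply at all as a sign of life. A
rough sketch; the target address, port, timeout and headers are invented,
and a real monitor would want proper Via/branch handling per the SIP spec.

  #!/usr/bin/env python
  # Hypothetical probe for question 4: send a minimal SIP OPTIONS request
  # through the firewall/nat and treat any response (even an error) as
  # proof that the * machine is still up.
  import socket

  TARGET = ("203.0.113.10", 5060)   # the * box as seen through the nat
  TIMEOUT = 2.0                     # seconds to wait for any response

  OPTIONS = (
      "OPTIONS sip:probe@%s SIP/2.0\r\n"
      "Via: SIP/2.0/UDP 0.0.0.0:5060;branch=z9hG4bKprobe1\r\n"
      "From: <sip:monitor@example.com>;tag=probe1\r\n"
      "To: <sip:probe@%s>\r\n"
      "Call-ID: probe1@example.com\r\n"
      "CSeq: 1 OPTIONS\r\n"
      "Max-Forwards: 70\r\n"
      "Content-Length: 0\r\n\r\n" % (TARGET[0], TARGET[0])
  )

  def asterisk_is_alive():
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      sock.settimeout(TIMEOUT)
      sock.sendto(OPTIONS.encode("ascii"), TARGET)
      try:
          sock.recvfrom(2048)        # any SIP response means it's up
          return True
      except socket.timeout:
          return False

  if __name__ == "__main__":
      print("alive" if asterisk_is_alive() else "no response")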

5. Should sip phones be configurable with a primary and secondary proxy?
   (The sort of phone-side logic involved is sketched after this list.)
   a. If the primary proxy fails, what determines when a sip phone fails
      over to the secondary proxy?
   b. After failing over to the secondary, what determines when the sip phone
      should switch back to the primary proxy? (Is the primary ready to
      handle production calls, or is it only back up so a system admin can
      diagnose the original problem in a non-production manner?)
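
Purely to illustrate the decisions in 5a and 5b, here is roughly what the
phone-side logic might boil down to. The probe, the thresholds and the
interval are invented, and on a real phone all of this lives in firmware.

  #!/usr/bin/env python
  # Hypothetical sketch of questions 5a/5b: when a phone should give up on
  # the primary proxy, and when (if ever) it should fail back.
  import time

  PRIMARY = "192.168.1.10"
  SECONDARY = "192.168.1.11"
  MAX_FAILED_PROBES = 3    # 5a: misses tolerated before failing over
  FAILBACK_PROBES = 10     # 5b: sustained health required before failing back
  PROBE_INTERVAL = 10      # seconds between probes

  def proxy_responds(address):
      # Placeholder: a real phone would send REGISTER or OPTIONS and wait
      # for any SIP response; always "down" here so the sketch is harmless.
      return False

  def choose_proxy():
      current, failed, primary_ok = PRIMARY, 0, 0
      while True:
          if current == PRIMARY:
              if proxy_responds(PRIMARY):
                  failed = 0
              else:
                  failed += 1
                  if failed >= MAX_FAILED_PROBES:     # 5a: fail over
                      current, primary_ok = SECONDARY, 0
          else:
              # 5b: only fail back after the primary has looked healthy for
              # a while, so we don't bounce onto a box an admin is still
              # diagnosing in a non-production state.
              primary_ok = primary_ok + 1 if proxy_responds(PRIMARY) else 0
              if primary_ok >= FAILBACK_PROBES:
                  current, failed = PRIMARY, 0
          print("using proxy %s" % current)
          time.sleep(PROBE_INTERVAL)

  if __name__ == "__main__":
      choose_proxy()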

When factual answers are known to all of the above, I'd suspect that
many changes will be required to many different boxes throughout most
organizations, and those changes will require a fair amount of lead time
to implement, particularly if changes are required to the SIP/RTP
standards. Therefore, it's highly probable that some sort of phase-in
plan would be required that might look something like:
  Phase 1: primary & secondary hot spare with data sharing across both
           systems (probably not just an external database server), with
           commonly available interface switching.
  Phase 2: two-system cluster (both active and load sharing) with
           inter-system links (might involve switching external interfaces
           under some conditions as well; eg, single T1 trunk in use).
  Phase 3: three or more system cluster (all active and load sharing)
           that likely depend upon a two/three-way interface switch,
           sip phone design changes, additional infrastructure items,
           nat redesign, and other items.

Thoughts????

Of course, if all sip phones used iax, the problem would be much easier to solve.

Rich




