[Asterisk-Users] Hardware to build an Enterprise Asterisk Universal Gateway

Rich Adamson radamson at routers.com
Mon Jan 5 05:28:09 MST 2004


> > Using another load-balancing box (F5 or whatever) only moves the problem
> > to that box. Duplicating it, moves the problem to another box, until
> > the costs exponentially grow beyond the initial intended value of the
> > solution. The weak points become lots of other boxes and infrastructure, 
> > suggesting that asterisk really isn't "the" weakest point (regardless of 
> > what its built on).
> 
> Rich is hitting the main point in designing anything for high
> reliability. So lets enumerate failures and then what if anything can be
> done to eliminate them.
> 
> 1. Line failures.
<snip>
> 2. Hardware failure. 
<snip>
> 3. Software failure.
> This could be any number of bugs not yet found or that will be
> introduced later.
<snip>
> 4. Phones.

The primary points my questions were attempting to uncover relate more to
basic layer-2 and layer-3 issues (across all of the components in an
end-to-end telephony implementation) than to basic hardware
configurations.

Having spent a fair number of years working with corporations that have
attempted to build high-availability solutions, I've found the typical
engineering approach is almost always oriented towards throwing more
hardware at the problem rather than thinking about the basic layer-2/3/4
issues. (I don't have an answer that I'm sponsoring either; I'm just
looking for comments from those that intimately know the "end-to-end"
impact of doing things like hot-sparing or clustering.) I'm sure it's
fairly clear to most that adding redundant power supplies, UPSs, RAID,
etc., will improve the uptime of the * box. However, once past throwing
hardware at "the" server, where are the pitfalls associated with
hot-sparing or clustering * servers?

Several well-known companies have attempted products that swap MAC
addresses between machines (layer-2), hide servers behind a virtual
IP (layer-3), hide a cluster behind some form of load-balancing hardware
(generally layer-2 and 3), etc. Most of those solutions end up creating yet
another problem that was not considered in the original thought process;
i.e., they were not well thought out. (Even Cisco, with a building full of
engineers, didn't initially consider the impact of flip-flopping between
boxes when HSRP was first implemented. And there are still issues with
that approach that many companies have witnessed first hand.)
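
(For anyone who hasn't been bitten by it, the flip-flop case is easy to
picture from a minimal HSRP sketch like the one below; the interface,
addresses, and priorities are made up. With "preempt" configured, the
higher-priority box reclaims the virtual IP every time it recovers, so a
router that keeps bouncing drags every in-flight session through a
transition each time.)

    ! minimal HSRP sketch, hypothetical addressing
    interface FastEthernet0/0
     ip address 10.1.10.2 255.255.255.0
     standby 10 ip 10.1.10.1        ! virtual IP the rest of the network points at
     standby 10 priority 110
     standby 10 preempt             ! reclaim the virtual IP whenever this box comes back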

Load balancers have some added value, but those that have had to deal
with a cluster member that is up but not actually processing data would
probably question their real value.

So, if one were to attempt either hot-sparing or clustering, are there
issues associated with SIP, RTP, IAX, NAT, and/or other protocols Asterisk
relies on that would impact the high-availability design?

One issue that would _seem_ to be a problem is installations that have to
use canreinvite=no, meaning that even in a clustered environment those RTP
sessions are going to be dropped on a server failure. (Maybe it's okay to
simply note such exceptions in a proposed high-availability design.)
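
(As a concrete example, a peer defined roughly like the sketch below keeps
Asterisk in the media path for the life of the call; the peer name is
hypothetical. Since every RTP stream terminates on the box that set it up,
a failover of that box kills the audio even if the signalling side
recovers cleanly.)

    ; sip.conf sketch, hypothetical peer
    [2001]
    type=friend
    host=dynamic
    nat=yes            ; a NAT'd phone is the usual reason reinvites have to stay off
    canreinvite=no     ; asterisk stays in the RTP path instead of reinviting the endpoints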

If any proposed design actually involved a different MAC address,
obviously all local SIP phones would die, since the ARP cache timeout
within the phones would preclude a transparent failover. (Not cool.)
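
(The usual trick people reach for is to leave the MAC alone and float the
IP address over to the spare instead, then advertise it with gratuitous
ARP; a minimal sketch, assuming a Linux spare, eth0, and a made-up
address, is below. Whether the phones actually honor unsolicited ARP is
another question entirely, which is exactly the kind of end-to-end gap I'm
asking about.)

    # floating-IP takeover on the standby server; address and interface are hypothetical
    ip addr add 192.168.1.10/24 dev eth0
    arping -U -I eth0 -c 3 192.168.1.10   # gratuitous ARP so neighbors refresh their caches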

IBM (with their stacks of AIX machines) and Tandem (with their NonStop
architecture) didn't throw clustered database servers at the problem.
Both had them, but not as a means of increasing the availability of the
base systems.

Technology now supports 100-meg layer-2 pipes throughout a city at a
reasonable cost. If a cluster were split across multiple buildings within
a city, it certainly would be of interest to those responsible for
business continuity planning. Are there limitations?

Someone mentioned that the only data needing to be shared between clustered
systems was phone registration info (and then quickly jumped to engineering
a solution for that). Is that the only data needed, or might someone
need a ton of other stuff? (Are CDR, IAX state, dialplans, AGI, voicemail,
and/or other dynamic data issues that need to be considered in a reasonable
high-availability design?)
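
(CDR, at least, is one piece that can be pushed off the individual box
already: the cdr_addon_mysql module from asterisk-addons logs records to
an external database, roughly as sketched below with made-up host and
credentials, so a dead server doesn't take its call records with it.
Registration state, runtime dialplan data, voicemail spools, etc., don't
have an equally obvious home.)

    ; cdr_mysql.conf sketch (asterisk-addons), hypothetical host and credentials
    [global]
    hostname=db.example.com
    dbname=asteriskcdr
    user=asterisk
    password=secret
    port=3306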

Whether the objective is 2, 3, 4, or 5 nines is somewhat irrelevant. If
one had to stand in front of the President or the Board and represent/sell
availability, they are going to assume end-to-end and not just "the"
server. Later, they are not going to talk kindly about the phone
system when your single F5 box died; or (not all that unusual) when you
say Asterisk was up the entire time, it's your stupid phones that couldn't
find it!! (Or, you lost five hours of CDR data because of why???)

I'd have to guess there are probably hundreds on this list that can
engineer RAID drives, UPSs for Ethernet closet switches, protected
Cat 5 cabling, and switch boxes that can move physical interfaces between
servers. But I'd also guess there are far fewer that can identify many
of the SIP, RTP, IAX, NAT, CDR, etc., issues. What are some of those
issues? (Maybe there aren't any?)

Rich




