[hydra-dev] * Fault Tolerance vs High availability

Wed Jun 9 17:55:52 CDT 2010

On 06/09/2010 11:15 AM, Leif Madsen wrote:

> ----- Original Message -----
>> On Wed, Jun 9, 2010 at 11:06 AM, Leif Madsen <lmadsen at digium.com>
>> wrote:
>>> Wouldn't that front end be Hydra?
>>>
>> It could in the scenario that Hydra was done, I was talking about
>> current.
> 
> touche :)

My take on this is that in most telephony networks (carrier and large
enterprise), there is a distinction between the network elements that
process calls, the elements that route calls and the elements that
handle the media streams. Nobody is going to expect that their T-1 will
never fail, or that the line card that it connects to in the nearest T-1
mux will never fail, or something similar. When a network element that
handles media streams on discrete communication paths has a failure,
then there will be service disruption beyond the 'next call will work'
threshold.

However, as soon as the media stream moves into the portion of the
network where it can be multiplexed and/or transmitted over alternate
communication paths, there is an expectation that disruptions in these
paths will cause no more than a momentary interruption in the media
stream. In the PSTN this might be achieved via ATM, SONET/SDH, or some
other multi-path transmission network, and for VoIP we have IP routing
to provide this level of availability.

The same is generally true of call routing elements; PSTN switches and
large enterprise switches have N+1 redundant call routing, so that it
would be exceedingly rare for a call to not be routed due to a service
disruption (excepting the occasional overloaded network scenarios).
However, much like VoIP, call routing elements don't have to do much
once the call is set up (other than wait for it to be torn down), so it's
relatively straightforward to reassign responsibility for a call to
another routing element when necessary, and it won't even be noticed by
the participants in the call. As Tim says, this can be (and is) used to
administratively relocate calls to allow for routing elements to be
taken out of service for maintenance or upgrades.
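
To make the reassignment idea concrete, here is a rough sketch of draining
one routing element onto its peers. It is only an illustration, not how
any particular switch or Hydra does it: RoutingElement, drain() and the
"fewest calls" placement rule are all names and choices I made up for the
example.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingElement:
    """Hypothetical routing element that owns a set of established calls."""
    name: str
    in_service: bool = True
    calls: set = field(default_factory=set)


def drain(element, peers):
    """Move every established call off `element` onto in-service peers."""
    targets = [p for p in peers if p is not element and p.in_service]
    if not targets:
        raise RuntimeError("no in-service peer to take over the calls")
    element.in_service = False  # stop accepting new calls first
    for call_id in sorted(element.calls):
        # An established call only needs an owner that will see its teardown,
        # so hand it to whichever peer currently owns the fewest calls
        # (an assumed placement rule, chosen only to keep the sketch short).
        target = min(targets, key=lambda p: len(p.calls))
        target.calls.add(call_id)
    element.calls.clear()


a = RoutingElement("route-a", calls={"call-1", "call-2", "call-3"})
b = RoutingElement("route-b", calls={"call-4"})
c = RoutingElement("route-c")
drain(a, [a, b, c])  # route-a can now be taken down for maintenance
```

The point is that nothing in drain() touches the media path, which is why
the participants in the call would not notice the move.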

Finally, elements that actually process calls (conference servers,
voicemail servers, IVRs, etc.) are not usually expected to be able to
provide transparent redundancy; if that element fails, calls it was
processing are lost or severely disrupted, but the next call to that
service should succeed.
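
Putting the classes of elements side by side, the expectations above
could be written down roughly as follows. The enum and function names are
hypothetical, and the strings just restate my description; none of this
is a formal availability spec.

```python
from enum import Enum


class ElementKind(Enum):
    DISCRETE_MEDIA = "media on a discrete path (T-1, line card)"
    MULTIPATH_MEDIA = "media on a multi-path network (SONET/SDH, IP routing)"
    CALL_ROUTING = "call routing (N+1 redundant)"
    CALL_PROCESSING = "call processing (conference, voicemail, IVR)"


# Expected effect of a single element failure on calls already in progress.
IMPACT_ON_ACTIVE_CALLS = {
    ElementKind.DISCRETE_MEDIA: "service disruption until the path is repaired",
    ElementKind.MULTIPATH_MEDIA: "momentary interruption in the media stream",
    ElementKind.CALL_ROUTING: "none noticed; the call is reassigned",
    ElementKind.CALL_PROCESSING: "calls lost or severely disrupted",
}


def next_call_expected_to_work(kind):
    """Only a discrete media path failure goes beyond 'next call will work'."""
    return kind is not ElementKind.DISCRETE_MEDIA


for kind in ElementKind:
    print(f"{kind.value}: {IMPACT_ON_ACTIVE_CALLS[kind]}; "
          f"next call works: {next_call_expected_to_work(kind)}")
```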

-- 
Kevin P. Fleming
Digium, Inc. | Director of Software Technologies
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
skype: kpfleming | jabber: kfleming at digium.com
Check us out at www.digium.com & www.asterisk.org