[Asterisk-Users] Hardware to build an Enterprise AsteriskUniversal Gateway

Martin Bene martin.bene at icomedias.com
Mon Jan 5 09:18:11 MST 2004


Hi Richard,

>Load balancers have some added value, but those that have had to deal
>with a problem where a single system within the cluster is up but not
>processing data would probably argue their actual value.

I've done quite a lot of work with clustered/HA Linux configurations. I
usually try to keep additional boxes/hardware to an absolute minimum;
otherwise the newly introduced points of (hardware) failure tend to make the
whole exercise pointless. A solution I found to work quite well:

A software load balancer (using LVS) run as an HA service (ldirectord) on two
of the servers. This allows quite specific probes against the real servers
being balanced, so a server that isn't correctly processing requests can be
removed from the active list quite reliably. Since ldirectord is a Perl
script, adding probes for protocols not supported in the default install is
fairly straightforward.
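To make that concrete: ldirectord's own checks are Perl, but the idea of an
application-level probe is easy to illustrate. The sketch below (Python, my
own illustration, not ldirectord code) sends a SIP OPTIONS request over UDP
and only considers the server healthy on a 2xx reply - so a box that answers
pings but has a wedged SIP stack still gets pulled from rotation:

```python
import socket

def build_options(host, port, cseq=1):
    """Build a minimal SIP OPTIONS request (probe traffic only)."""
    return (
        f"OPTIONS sip:{host}:{port} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP probe.invalid:5060;branch=z9hG4bK-probe\r\n"
        f"From: <sip:probe@probe.invalid>;tag=probe\r\n"
        f"To: <sip:{host}:{port}>\r\n"
        f"Call-ID: probe@probe.invalid\r\n"
        f"CSeq: {cseq} OPTIONS\r\n"
        f"Max-Forwards: 70\r\n"
        f"Content-Length: 0\r\n\r\n"
    ).encode()

def parse_status(reply: bytes):
    """Return the SIP status code from a reply, or None if unparsable."""
    try:
        line = reply.split(b"\r\n", 1)[0].decode()
        proto, code, _ = line.split(" ", 2)
        return int(code) if proto == "SIP/2.0" else None
    except (ValueError, UnicodeDecodeError):
        return None

def sip_alive(host, port=5060, timeout=2.0):
    """True if the server answers OPTIONS with a 2xx within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(build_options(host, port), (host, port))
            reply, _ = s.recvfrom(4096)
        except OSError:
            return False
    code = parse_status(reply)
    return code is not None and 200 <= code < 300
```

The real ldirectord probe would do the same thing in Perl and return a
pass/fail exit status; the point is that the check exercises the protocol
itself, not just the TCP/IP stack.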

>If any proposed design actually involved a different MAC address,
>obviously all local sip phones would die since the arp cache timeout 
>within the phones would preclude a failover. (Not cool.)

ARP cache timeouts usually don't come into this: when moving a cluster IP
address to a different NIC (probably on a different machine) you can broadcast
gratuitous ARP packets on the affected Ethernet segment; this updates the ARP
caches of all connected devices and allows failover far faster than the ARP
cache timeout. Notable exception: some firewalls can be quite paranoid about
ARP updates and will NOT accept gratuitous ARP packets. I've run into this
with a cluster installation at one of my customers.
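In practice you'd just call a tool like heartbeat's send_arp or arping -U
from the failover script, but the packet itself is simple enough to show.
This sketch (my own illustration; interface name and addresses are made up)
builds a gratuitous ARP request, i.e. one where sender and target IP are both
the cluster address, broadcast to the whole segment:

```python
import socket
import struct

def gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a gratuitous ARP request frame announcing mac as owner of ip."""
    hw = bytes.fromhex(mac.replace(":", ""))   # 6-byte hardware address
    pa = socket.inet_aton(ip)                  # 4-byte protocol address
    bcast = b"\xff" * 6
    eth = bcast + hw + struct.pack("!H", 0x0806)     # dst, src, EtherType=ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)  # Ethernet/IPv4, op=request
    arp += hw + pa + bcast + pa    # sender IP == target IP -> gratuitous
    return eth + arp

# Sending needs a raw packet socket (Linux AF_PACKET, root), e.g.:
# s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
# s.bind(("eth0", 0))
# s.send(gratuitous_arp("00:11:22:33:44:55", "192.168.1.10"))
```

Every host that sees this frame updates its ARP cache entry for the cluster
IP to the new MAC immediately - which is exactly why those paranoid firewalls
that ignore unsolicited ARP break the scheme.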

>Technology now supports 100 meg layer-2 pipes throughout a city at a
>reasonable cost. If a cluster were split across multiple 
>buildings within a city, it certainly would be of interest to those 
>that are responsible for business continuity planning. Are there
>limitations?

I'm wary of split-cluster configurations because the need for multiple,
independent communication paths between cluster nodes often gets overlooked
or ignored in these setups, greatly increasing the risk of "split-brain"
situations, i.e. several nodes in the cluster each thinking they're the only
online server and trying to take over services. This easily leads to a real
mess (data corruption) that can be costly to clean up. When you keep your
nodes in physical proximity it's much easier to have, say, two network links
plus one serial link between cluster nodes, providing a very resilient
fabric for inter-cluster communication.
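The takeover rule that falls out of those redundant paths is worth spelling
out. A sketch (my own illustration, not code from heartbeat or any real
cluster stack) of the decision a standby node should make:

```python
def peer_reachable(probe_results):
    """probe_results: dict of path name -> bool (did the peer answer?)."""
    return any(probe_results.values())

def may_take_over(probe_results):
    """
    Only claim the peer's services when it is unreachable on EVERY
    independent path (two NICs plus a serial link, say). A single dead
    path proves nothing: a cable or switch failing is far more likely
    than the whole node dying, and taking over anyway is exactly how
    you get two nodes running the same service on the same data.
    """
    return not peer_reachable(probe_results)
```

With only one path between buildings, any cut of that path makes both sides
pass this test simultaneously - which is the split-brain scenario above.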

>Someone mentioned the only data needed to be shared between clustered
>systems was phone Registration info (and then quickly jumped 
>to engineering a solution for that). Is that the only data needed or 
>might someone need a ton of other stuff? (Is cdr, iax, dialplans, agi, 
>vm, and/or other dynamic data an issue that needs to be considered in 
>a reasonable high-availability design?)

Depends on what you want/need to fail over when your Asterisk box goes
down. In stages, that'd be:
	1. (cluster) IP address for SIP/H.323 etc. services
	2. voice mail, recordings, activity logs
	3. registrations for connected VoIP clients
	4. active calls (VoIP + PSTN)

For the moment, item 4 definitely isn't feasible; even if we get some
hardware to switch over E1/T1/PRI (whatever) interfaces, card or interface
initialisation will kill active calls. 

Item 2 is plain on-disk file data; for an active/standby cluster,
replicating it should be pretty straightforward using either shared
storage or an appropriate filesystem/block-device replication system. I've
personally had good experience with drbd (block-device replication over the
network; it only supports 2 nodes in an active/standby configuration but
works quite well for that).
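drbd itself lives in the kernel and mirrors at the block level, so there's
nothing to show in userland; but the weaker file-level alternative is easy
to sketch. This toy one-way mirror (my own illustration, nothing to do with
drbd internals) copies new or changed files from the active node's spool to
the standby and deletes strays - roughly what a periodic rsync would do,
minus the efficiency:

```python
import os
import shutil

def mirror(src: str, dst: str) -> None:
    """One-way mirror of src into dst: copy new/changed files, drop extras.
    A crude stand-in for continuous drbd/rsync replication."""
    os.makedirs(dst, exist_ok=True)
    src_names = set(os.listdir(src))
    # remove entries on the standby that no longer exist on the active node
    for name in set(os.listdir(dst)) - src_names:
        path = os.path.join(dst, name)
        shutil.rmtree(path) if os.path.isdir(path) else os.remove(path)
    for name in src_names:
        s, d = os.path.join(src, name), os.path.join(dst, name)
        if os.path.isdir(s):
            mirror(s, d)
        elif not os.path.exists(d) or os.path.getmtime(s) > os.path.getmtime(d):
            shutil.copy2(s, d)  # copy2 keeps mtimes, so unchanged files skip
```

The obvious weakness versus drbd: anything written between mirror runs is
lost on failover, whereas block-level replication is (near) synchronous.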

Item 3 should also be feasible; this information is already persistent
across Asterisk restarts and seems to be just a Berkeley DB file in a
default install. The same method as for item 2 should work.
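The nice property is that "registration state" collapses to "one small DB
file", so replicating it is just replicating a file. As an illustration of
that shape (Python's dbm module, which is a stand-in here - I haven't
verified it reads Asterisk's actual Berkeley DB file):

```python
import dbm

def save_registrations(path, regs):
    """Persist client registrations (peer -> contact URI) in a small
    key/value DB file, much as Asterisk keeps them across restarts."""
    with dbm.open(path, "c") as db:
        for peer, contact in regs.items():
            db[peer] = contact

def load_registrations(path):
    """Read the registrations back, e.g. on the standby after failover."""
    with dbm.open(path, "r") as db:
        return {k.decode(): db[k].decode() for k in db.keys()}
```

If the standby gets a current copy of that file (shared storage or drbd,
same as item 2), clients come up still registered after failover rather
than waiting for their next REGISTER refresh.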

>I'd have to guess there are probably hundreds on this list that can 
>engineer raid drives, ups's for ethernet closet switches, protected
>cat 5 cabling, and switch boxes that can move physical 
>interfaces between servers. But, I'd also guess there are far fewer 
>that can identify many of the sip, rtp, iax, nat, cdr, etc, etc, 
>issues. What are some of those issues? (Maybe there aren't any?)

Since I'm still very much an Asterisk beginner I'll have to pass on this
one; however, I'm definitely going to run some experiments with Asterisk on
my test cluster systems just to see what breaks when failing over Asterisk
services.

Also, things get MUCH more interesting when you start to move from plain
active/standby to active/active configurations: here, on failover, you end
up with the registration and file data from the failed server and need to
integrate it into an already running server, merging the separate sets of
information - preferably without trashing the running server :-)
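A sketch of what "merging without trashing the running server" might mean
for registrations (my own illustration; the tuple layout and timestamp rule
are assumptions, not anything Asterisk does today): take an entry from the
failed node only when it is strictly newer than what the survivor already
has.

```python
def merge_registrations(running, failed_over):
    """
    Merge registrations recovered from a failed node into a running node.
    Each entry: peer -> (contact URI, registration timestamp). On conflict,
    keep whichever registration is newer, so stale data from the dead box
    never clobbers the running server's state.
    """
    merged = dict(running)
    for peer, (contact, ts) in failed_over.items():
        if peer not in merged or ts > merged[peer][1]:
            merged[peer] = (contact, ts)
    return merged
```

File data (voicemail etc.) needs the same kind of rule - and deciding what
"newer" means for two diverged spool directories is where the real mess
starts.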

Bye, Martin
