[asterisk-dev] scalability issue with realtime lastms update on qualify state change

Leandro Dardini ldardini at gmail.com
Fri Nov 15 14:29:34 CST 2013


On 15 Nov 2013 21:11, "Damon Estep" <damon at soho-systems.com> wrote:
>
> I have run into an issue with Asterisk 1.8.15 using MySQL realtime
> peers, qualify=3000, and about 2000 peers.
>
> During a network degradation event (high packet loss, network down),
> the system gets into an unusable state.
>
> Let's say you have 2000 peers with qualify=yes (or 2000), and they
> are all reachable.
>
> The qualify frequency is 60 seconds (user definable), the qualify
> interval for unreachable peers is 10 seconds, and the default
> retransmit timer is 1 second (both hardcoded in sip.h).
>
> 1000 peers go offline due to a network event.
>
> The database has to be updated 1000 times in 60 seconds to set lastms
> to -1 for these 1000 peers.
>
> The same 1000 peers are now on an OPTIONS query schedule of once per
> 10 seconds, with 6 retransmits at 1 second intervals: a total of 7
> packets every 10 seconds for each of 1000 peers, or 700 new packets
> per second.
>
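
For reference, the load figures quoted above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function and variable names are illustrative, and the inputs are the ones given in the message (1000 unreachable peers, 60 s qualify frequency, 10 s unreachable interval, 1 initial OPTIONS plus 6 retransmits).

```python
# Back-of-the-envelope check of the figures quoted above.
# All names are illustrative; the values come from the quoted message.

def qualify_load(offline_peers, qualify_freq_s, unreachable_freq_s, retransmits):
    """Return (db_updates_per_s, options_packets_per_s) for the outage."""
    # Each peer is marked lastms = -1 once, spread over one qualify cycle.
    db_updates_per_s = offline_peers / qualify_freq_s
    # One initial OPTIONS plus the retransmits, per unreachable interval.
    packets_per_attempt = 1 + retransmits
    packets_per_s = offline_peers * packets_per_attempt / unreachable_freq_s
    return db_updates_per_s, packets_per_s

db_rate, pkt_rate = qualify_load(1000, 60, 10, 6)
print(f"{db_rate:.1f} lastms updates/s, {pkt_rate:.0f} OPTIONS packets/s")
```

Under these assumptions the initial burst averages about 17 database writes per second, and the steady OPTIONS rate toward dead peers is 700 packets per second, matching the figure quoted.
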
> So, we experienced a major network issue, and our response is to
> increase the load on the Asterisk server dramatically by updating the
> database 1000 times and starting a new campaign to reach peers that
> are offline.
>
> In practice we see that when this happens, the calculated RTT for
> peers that are not part of the network outage starts to increase
> (a delay somewhere in hardware or code, who knows, but it does
> increase), and many of them start flapping between unreachable and
> reachable.
>
> The database query load goes through the roof, the number of packets
> coming out of the Asterisk box goes through the roof, and Asterisk
> will not recover until it is restarted, even after the network event
> has cleared.
>
> This is a new issue that came into play when lastms was added to the
> realtime database and the qualify code started updating it on every
> state change. 1.2, before lastms, would handle this event gracefully;
> 1.8 won't. I can't comment on 1.2 or 1.6.
>
> My thinking is that the lastms value in the db has little to no value.
> The only time I see it used is when a realtime peer is built from the
> database on registration. If the lastms value is -1, the peer
> registers and is set to unreachable, gets an OPTIONS query, and
> becomes reachable right away.
>
> rtupdate=no will stop the lastms updates, but it also stops the
> registration data from being updated in the database, which has
> unintended consequences.
>
> Also, the 10 second qualify timer when unreachable might make sense
> in small environments, but it is way too aggressive for larger
> environments. It needs to be user configurable.
>
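
To give a feel for what a configurable unreachable interval would buy, here is a small sketch that tabulates the OPTIONS rate toward dead peers for a few candidate intervals. The configurable interval is hypothetical (the message says the value is hardcoded in sip.h), the candidate values are arbitrary, and the 7 packets per attempt is taken from the figures quoted earlier.

```python
# Sketch: OPTIONS packet rate toward unreachable peers as a function of
# a hypothetical, user-configurable unreachable-qualify interval.
# Assumes 1 initial OPTIONS plus 6 retransmits per attempt, as quoted.

PACKETS_PER_ATTEMPT = 1 + 6

def options_rate(unreachable_peers, interval_s):
    """Packets per second sent to peers that never answer."""
    return unreachable_peers * PACKETS_PER_ATTEMPT / interval_s

# Candidate intervals in seconds; 10 is the hardcoded value from the message.
rates = {interval: options_rate(1000, interval) for interval in (10, 30, 60, 120)}
for interval, rate in rates.items():
    print(f"interval {interval:>3} s -> {rate:6.1f} packets/s")
```

With 1000 unreachable peers, stretching the interval from 10 to 60 seconds would cut the rate from 700 to roughly 117 packets per second, at the cost of noticing a recovered peer more slowly.
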
> Before I start patching, can anyone tell me what value lastms has and
> why it needs to be in the database?

I disagree that lastms has no value. How can I know whether a peer is
reachable without the lastms field?

Leandro