[asterisk-dev] scalability issue with realtime lastms update on qualify state change
Damon Estep
damon at soho-systems.com
Fri Nov 15 14:09:30 CST 2013
I have run into an issue with Asterisk 1.8.15 using MySQL realtime peers, qualify=3000, and about 2000 peers.
In the event of a network degradation event (high packet loss, network down) the system gets into an unusable state.
Let's say you have 2000 peers with qualify=yes (or 2000), and they are all reachable.
The qualify frequency is 60 seconds (user definable), the qualify frequency for unreachable peers is 10 seconds, and the default retransmit timer is 1 second (the latter two are hardcoded in sip.h).
1000 peers go offline due to a network event
The database has to be updated 1000 times in 60 seconds to mark the lastms -1 for these 1000 peers
The same 1000 peers are now on an OPTIONS query schedule of once per 10 seconds, with 6 retransmits at 1 second intervals. That is a total of 7 packets every 10 seconds for each of 1000 peers, or 700 new packets per second.
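To make the arithmetic above easy to re-run for other peer counts, here is a rough sketch in Python. The numbers are the ones quoted in this post; the function and variable names are made up for illustration and are not anything in the Asterisk source.

```python
# Back-of-the-envelope load estimate for the scenario described above.
# All figures come from the post; names are illustrative only.

OFFLINE_PEERS = 1000        # peers that became unreachable
QUALIFY_FREQ_S = 60         # normal qualify interval, in seconds
UNREACHABLE_FREQ_S = 10     # qualify interval once a peer is unreachable
RETRANSMITS = 6             # OPTIONS retransmits, as stated in the post


def db_updates_per_second(peers, window_s):
    """Average lastms updates per second while peers are marked unreachable."""
    return peers / window_s


def options_packets_per_second(peers, interval_s, retransmits):
    """OPTIONS packets per second sent to peers that never answer."""
    packets_per_cycle = 1 + retransmits  # initial request plus retransmits
    return peers * packets_per_cycle / interval_s


db_rate = db_updates_per_second(OFFLINE_PEERS, QUALIFY_FREQ_S)
pkt_rate = options_packets_per_second(OFFLINE_PEERS, UNREACHABLE_FREQ_S, RETRANSMITS)

print(f"lastms updates: about {db_rate:.1f} per second over the first {QUALIFY_FREQ_S} s")
print(f"OPTIONS packets: {pkt_rate:.0f} per second while the peers stay offline")
```

This only counts the peers that are actually offline. It does not model the extra updates and packets caused by otherwise healthy peers flapping, so the real load is likely higher.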
So, we experienced a major network issue, and our response is to increase the load on the asterisk server dramatically by updating the database 1000 times and starting a new campaign to reach unreachable peers that are offline.
In practice we see that when this happens, the calculated RTT time for peers that are not part of the network outage starts to increase (delay somewhere in hardware, code, who knows, but it does increase), and many of them start flapping between unreachable and reachable.
The database query load goes through the roof, the number of packets coming out of the asterisk box goes through the roof, and asterisk will not recover until it is restarted, even after the network event is cleared.
This is a new issue that came into play when lastms was added to the realtime database and the qualify code started updating it on every state change. 1.2, before lastms, would handle this event gracefully; 1.8 won't. I can't comment on 1.4 or 1.6.
My thinking is that the lastms value in the db has little to no value. The only time I see it used is when a realtime peer is built from the database on registration. If the lastms value is -1, the peer registers and is set to unreachable, then gets an OPTIONS query and becomes reachable right away.
rtupdate=no will stop the lastms updates, but it also stops the registration data from being updated in the database, which has unintended consequences.
Also, the 10 second qualify timer when unreachable might make sense in small environments, but is way too aggressive for larger environments. It needs to be user configurable.
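As a rough illustration of why a configurable interval would matter, the sketch below recomputes the outbound OPTIONS rate for the same 1000 offline peers at a few intervals. Only the 10 second value exists today; the 30 and 60 second values are hypothetical settings I picked for comparison, and the function name is made up.

```python
# How the OPTIONS packet rate to unreachable peers scales with the
# unreachable-qualify interval. Only the 10 s value is real (hardcoded);
# 30 s and 60 s are hypothetical settings used for comparison.

def unreachable_packet_rate(peers, interval_s, packets_per_cycle=7):
    """Packets per second: 1 request + 6 retransmits per peer per interval."""
    return peers * packets_per_cycle / interval_s


for interval_s in (10, 30, 60):
    rate = unreachable_packet_rate(1000, interval_s)
    print(f"interval {interval_s:>2} s: {rate:6.1f} packets/s")
```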
Before I start patching, can anyone tell me what value lastms has and why it needs to be in the database?