[asterisk-dev] scalability issue with realtime lastms update on qualify state change

Leandro Dardini ldardini at gmail.com
Fri Nov 15 15:25:32 CST 2013


On 15 Nov 2013 22:00, "Damon Estep" <damon at soho-systems.com> wrote:
>
> peer_lastms in memory tells you that, not lastms in the realtime database.
>
> As far as I can see the lastms value in the realtime database is loaded
> only when the peer is created and quickly overwritten as soon as the first
> qualify is sent.

I run a realtime multi-server setup, so I need to rely on the lastms value
in the database. Otherwise, how can I know whether a peer that normally
registers on the other server is reachable?

Not only that: the lastms value also looks really nice on the web interface.
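
To illustrate the multi-server case, here is a minimal sketch of how a
second server or a web interface might read lastms from the shared realtime
database. It is only an assumption-laden example: the table name `sippeers`,
the columns `name` and `lastms`, and the helper names are hypothetical, and
an in-memory SQLite database stands in for the MySQL backend. The mapping
assumes the convention mentioned in this thread, where -1 marks an
unreachable peer.

```python
import sqlite3


def classify(lastms):
    """Map a stored lastms value to a coarse state (assumed convention)."""
    if lastms is None or lastms == 0:
        return "unknown"        # never qualified, or no measurement stored
    if lastms < 0:
        return "unreachable"    # -1 is written when qualify fails
    return "reachable"          # positive value: last measured RTT in ms


def peer_states(conn):
    """Return {peer name: state} for every row in the hypothetical table."""
    rows = conn.execute("SELECT name, lastms FROM sippeers ORDER BY name")
    return {name: classify(lastms) for name, lastms in rows}


# Stand-in for the shared realtime database (assumption: SQLite, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sippeers (name TEXT PRIMARY KEY, lastms INTEGER)")
conn.executemany(
    "INSERT INTO sippeers VALUES (?, ?)",
    [("alice", 42), ("bob", -1), ("carol", 0)],
)

states = peer_states(conn)
print(states)
```

Such a lookup is only as fresh as the last update written by the server
that owns the peer, which is exactly why the update on every state change
matters in this setup.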

>
> On Nov 15, 2013, at 1:29 PM, "Leandro Dardini" <ldardini at gmail.com> wrote:
>
>>
>> On 15 Nov 2013 21:11, "Damon Estep" <damon at soho-systems.com> wrote:
>> >
>> > I have run into an issue with 1.8.15 with mysql realtime peers,
>> > qualify=3000, and about 2000 peers.
>> >
>> >
>> >
>> > When the network degrades (high packet loss, network down), the system
>> > gets into an unusable state.
>> >
>> >
>> >
>> > Let's say you have 2000 peers with qualify=yes (or 2000), and they are
>> > all reachable.
>> >
>> >
>> >
>> > The qualify frequency is 60 seconds (user definable), the qualify
>> > interval for unreachable peers is 10 seconds, and the default retransmit
>> > timer is 1 second (the latter two are hardcoded in sip.h).
>> >
>> >
>> >
>> > 1000 peers go offline due to a network event
>> >
>> >
>> >
>> > The database has to be updated 1000 times in 60 seconds to set lastms
>> > to -1 for these 1000 peers.
>> >
>> >
>> >
>> > The same 1000 peers are now on an OPTIONS query schedule of once per 10
>> > seconds, with 6 retransmits at 1 second intervals: a total of 7 packets
>> > every 10 seconds per peer, or 700 new packets per second for 1000 peers.
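
As a rough check of the figures quoted above, the following sketch
recomputes the extra load from the numbers given in the message (1000
offline peers, a 60 second qualify frequency, a 10 second interval for
unreachable peers, 6 retransmits). The function and parameter names are
made up for illustration and do not correspond to anything in chan_sip.

```python
def qualify_load(offline_peers, ok_interval_s=60, notok_interval_s=10,
                 retransmits=6):
    """Estimate the load caused by peers that just became unreachable."""
    # One initial OPTIONS plus its retransmits per qualify attempt.
    packets_per_attempt = 1 + retransmits
    # Each unreachable peer is retried once per notok interval.
    packets_per_second = offline_peers * packets_per_attempt / notok_interval_s
    # Every offline peer is detected within one normal qualify cycle,
    # and each detection writes lastms = -1 to the database.
    db_updates_per_second = offline_peers / ok_interval_s
    return packets_per_second, db_updates_per_second


pps, ups = qualify_load(1000)
print(f"{pps:.0f} OPTIONS packets/s, {ups:.1f} lastms updates/s")
# prints "700 OPTIONS packets/s, 16.7 lastms updates/s"
```

The update rate only covers the initial transition to unreachable; peers
that flap between states would add further writes on top of it.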
>> >
>> >
>> >
>> > So, we experienced a major network issue, and our response is to
>> > increase the load on the Asterisk server dramatically by updating the
>> > database 1000 times and starting a new campaign to reach peers that
>> > are offline.
>> >
>> >
>> >
>> > In practice we see that when this happens, the calculated RTT for
>> > peers that are not part of the network outage starts to increase (delay
>> > somewhere in hardware, code, who knows, but it does increase), and many of
>> > them start flapping between unreachable and reachable.
>> >
>> >
>> >
>> > The database query load goes through the roof, the number of packets
>> > coming out of the Asterisk box goes through the roof, and Asterisk will not
>> > recover until it is restarted, even after the network event is cleared.
>> >
>> >
>> >
>> > This is a new issue that came into play when lastms was added to the
>> > realtime database and the qualify code started updating it on every state
>> > change. 1.2, before lastms, would handle this event gracefully; 1.8 won't.
>> > I can't comment on 1.2 or 1.6.
>> >
>> >
>> >
>> > My thinking is that the lastms value in the db has little to no value.
>> > The only time I see it used is when a realtime peer is built from the
>> > database on registration. If the lastms value is -1, the peer registers,
>> > is set to unreachable, gets an OPTIONS query, and becomes reachable right
>> > away.
>> >
>> >
>> >
>> > rtupdate=no will stop the lastms updates, but it also stops the
>> > registration data from being updated in the database, which has unintended
>> > consequences.
>> >
>> >
>> >
>> > Also, the 10 second qualify timer when unreachable might make sense in
>> > small environments, but is way too aggressive for larger environments. It
>> > needs to be user configurable.
>> >
>> >
>> >
>> > Before I start patching, can anyone tell me what value lastms has and
>> > why it needs to be in the database?
>> >
>> >
>>
>> I disagree that lastms has no value. How can I know whether a peer is
>> reachable without the lastms field?
>>
>> Leandro
>>
>> --
>> _____________________________________________________________________
>> -- Bandwidth and Colocation Provided by http://www.api-digital.com --
>>
>> asterisk-dev mailing list
>> To UNSUBSCRIBE or update options visit:
>>   http://lists.digium.com/mailman/listinfo/asterisk-dev