[asterisk-dev] scalability issue with realtime lastms update on qualify state change

Sat Nov 16 02:27:22 CST 2013

2013/11/16 Olle E. Johansson <oej at edvina.net>

>
> On 16 Nov 2013, at 00:10, Leandro Dardini <ldardini at gmail.com> wrote:
>
> Please read your answer again, because something does not add up. On one
> hand, you say asterisk falls behind tracking the responses to 7000 packets
> sent every 10 seconds; but if no answers are received because of a network
> issue, what is asterisk busy doing? I don't think asterisk becomes busy
> waiting for something that never arrives. On the other hand, you blame
> writes to the database for the slowdown. Are you seeing high load on the
> mysql database while the network is down, or when the network has just
> come back up?
>
> However the solution is pretty simple. Just patch chan_sip to not update
> lastms and test the system. I am pretty interested in the result.
>
> In the longer run we need to abandon the current realtime system. It's not
> written for this. Realtime peers are in general a bad solution and not what
> you want. You can load static in-memory peers from database too and get a
> better solution. The question is what's lacking from that alternative?
>

I really like the current realtime database. I am sure I could achieve the
same results I have now, with peers stored in the database, by using static
in-memory peers and DUNDi for locating the server where the peer is
registered, but the database remains my preferred solution. Loading and
deleting peers from memory may work just as well, but having them in a
database is a lot more convenient. You just add a new peer to the database
and that is all. With the in-memory approach, you instead need to connect to
every server (how? via AMI?) and add the peer. With the database I can use
transactions, so if I need to change a few peers at once, I am sure that
either all changes take place or they are all rolled back. If I start
changing peers on a server and something bad happens, I can end up in a
situation where half are changed and half are not. Again, with the database
I can use locking if I need to make complex changes...
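
To illustrate the all-or-nothing point, here is a minimal sketch in Python.
It uses an in-memory SQLite database as a stand-in for the real MySQL
backend, and the table name sippeers, the columns name and context, and the
helper names are assumptions made up for the example, not taken from any
particular installation.

```python
import sqlite3


def make_demo_db():
    """Create a throwaway peers table (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sippeers (name TEXT PRIMARY KEY, context TEXT)")
    conn.executemany(
        "INSERT INTO sippeers VALUES (?, ?)",
        [("alice", "default"), ("bob", "default")],
    )
    conn.commit()
    return conn


def update_peers_atomically(conn, changes):
    """Apply every (name, context) change, or none of them."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            for name, context in changes:
                cur = conn.execute(
                    "UPDATE sippeers SET context = ? WHERE name = ?",
                    (context, name),
                )
                if cur.rowcount != 1:
                    raise LookupError(f"unknown peer: {name}")
        return True
    except LookupError:
        return False


def contexts(conn):
    """Return {peer name: context} for every row."""
    return dict(conn.execute("SELECT name, context FROM sippeers ORDER BY name"))
```

If one peer in the batch does not exist, the change already made to the
other peers in the same batch is rolled back, so the table is never left
half updated.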

Leandro

>
> /O
>
> Leandro
>
>
> 2013/11/15 Damon Estep <damon at soho-systems.com>
>
>> It is not 1000 queries in 60 seconds, it starts as 1000 MORE queries in
>> 60 seconds and compounds as it starts thrashing.
>>
>> Thousands of other queries also taking place all the time. WAY over
>> 300/second total.
>>
>> Not DNS related, DNS is local, hostnames are IPs anyways, so no DNS
>> needed.
>>
>> We have done the port 5060 block test, and can duplicate this very easily.
>>
>> Tables are InnoDB, binary logged, with circular replication.
>>
>>
>>
>> Remember that when 1000 peers are offline there are an extra 7000 packets
>> being sent every 10s to try and reach them. Asterisk gets behind in
>> processing the good responses, and the calculated RTT time for GOOD peers
>> starts exceeding the maxms threshold, then the GOOD peers go offline too.
>> Big snowball comes rolling down the hill and all is lost, all peers start
>> flapping reachable/unreachable, and a quick restart of asterisk calms it
>> down.
>>
>>
>>
>> The best confirmation that this is the issue is that it was resolved by
>> patching chan_sip.c so it does not update lastms when state changes or when
>> a qualify is not answered at all. It now updates only on registration or
>> registration expiry.
>>
>>
>>
>> Still doing more testing, but pretty confident this is the answer.
>>
>>
>>
>> I understand that you are using the lastms in the realtime db for call
>> routing, so my solution won't work for you. Probably need a configuration
>> flag like rtupdatequalify=yes|no
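
As a rough model of what such a flag could mean, the sketch below (Python,
not chan_sip code) decides whether a peer event should trigger a realtime
database write. The event names, the function name, and the rtupdatequalify
option itself are hypothetical; only rtupdate is mentioned in this thread as
an existing setting.

```python
from enum import Enum


class PeerEvent(Enum):
    """Hypothetical event names, chosen only for this illustration."""
    REGISTER = "register"
    REGISTRATION_EXPIRE = "registration_expire"
    QUALIFY_STATE_CHANGE = "qualify_state_change"
    QUALIFY_NO_ANSWER = "qualify_no_answer"


REGISTRATION_EVENTS = frozenset(
    {PeerEvent.REGISTER, PeerEvent.REGISTRATION_EXPIRE}
)


def should_update_database(event, rtupdate=True, rtupdatequalify=True):
    """Return True if this event should cause a realtime database write."""
    if not rtupdate:
        # rtupdate=no stops all updates, registration data included.
        return False
    if event in REGISTRATION_EVENTS:
        # Registration and expiry are always written when rtupdate is on.
        return True
    # Qualify-driven lastms writes are gated by the proposed flag.
    return rtupdatequalify
```

With rtupdatequalify=no, a mass outage would no longer produce one database
write per unreachable peer, while registration data would still be kept up
to date.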
>>
>> *From:* asterisk-dev-bounces at lists.digium.com [mailto:
>> asterisk-dev-bounces at lists.digium.com] *On Behalf Of *Leandro Dardini
>> *Sent:* Friday, November 15, 2013 3:47 PM
>>
>> *To:* Asterisk Developers Mailing List
>> *Subject:* Re: [asterisk-dev] scalability issue with realtime lastms
>> update on qualify state change
>>
>>
>>
>> My systems are far less populated, so I think I am far from hitting your
>> problems, but still, even after reading your case again, I can't
>> believe asterisk is thrashing that way. 1000 queries in 60 seconds are
>> nothing compared with the load I have on the database even with far fewer
>> peers. I have had mysql databases (not voip related) running flawlessly
>> with 300 queries every second...
>>
>>
>>
>> Are you sure the thrashing of your asterisk is not due to something else?
>> I experienced problems with asterisk (not realtime) when the DNS servers
>> were not reachable due to a network outage... It would be nice to run a
>> test... firewalling port 5060 inbound, so asterisk will think all peers are
>> unreachable... I suspect it will not thrash...
>>
>>
>>
>> Which table type are you using? Are you still using MyISAM?
>> Having a lot of concurrent writes to a MyISAM table can cause a lot of
>> slowdowns... just move to InnoDB.
>>
>>
>>
>> Leandro
>>
>>
>>
>> 2013/11/15 Damon Estep <damon at soho-systems.com>
>>
>> This is helpful, so I understand how some people are using lastms in the
>> database.
>>
>>
>>
>> I also have a multiserver solution, and I use INVITE (Dial) to see if the
>> peer is registered. If the user is not reachable the Dial will return a
>> status from the other box that can be handled.
>>
>>
>>
>> My experience is that a multi-server solution can't scale past limits in
>> the range listed below the way it works now unless you never experience a
>> network issue that makes 25% or more of the peers go unreachable:
>>
>>
>>
>> 10 servers, 2000 qualified peers per server
>>
>> 120 second qualify interval
>>
>> qualify=3000
>>
>> default retransmit time of 1 second
>>
>> Database load split between two reasonably powerful MySQL servers with
>> circular replication enabled
>>
>>
>>
>> at this scale, as soon as you have an event that takes down a large
>> percentage of peers the database activity will back up, the qualify options
>> packets will not be processed in real time, and the system will become
>> unstable, with all peers flapping between reachable/unreachable, even after
>> the network event is gone. It simply won't recover until you restart
>> asterisk.
>>
>>
>>
>> I am sure the limits vary with environment, but there is a limit with the
>> current strategy, and it is caused by excessive database updates of lastms
>> during periods with high unreachable peer counts.
>>
>>
>>
>> Have you seen this yet? What is the scale of your deployment?
>>
>>
>>
>>
>>
>> *From:* asterisk-dev-bounces at lists.digium.com [mailto:
>> asterisk-dev-bounces at lists.digium.com] *On Behalf Of *Leandro Dardini
>> *Sent:* Friday, November 15, 2013 2:26 PM
>> *To:* Asterisk Developers Mailing List
>> *Subject:* Re: [asterisk-dev] scalability issue with realtime lastms
>> update on qualify state change
>>
>>
>>
>>
>> On 15 Nov 2013 22:00, "Damon Estep" <damon at soho-systems.com> wrote:
>> >
>> > peer_lastms in memory tells you that, not lastms in the realtime
>> > database.
>> >
>> > As far as I can see the lastms value in the realtime database is loaded
>> > only when the peer is created and quickly overwritten as soon as the first
>> > qualify is sent.
>>
>> I have a realtime multiserver solution, so I need to rely on the database
>> lastms. Otherwise, how can I know whether a peer usually registered on the
>> other server is reachable or not?
>>
>> Not only that: on the web interface, the lastms looks really nice...
>>
>> >
>> > On Nov 15, 2013, at 1:29 PM, "Leandro Dardini" <ldardini at gmail.com>
>> > wrote:
>> >
>> >>
>> >> On 15 Nov 2013 21:11, "Damon Estep" <damon at soho-systems.com> wrote:
>> >> >
>> >> > I have run into an issue with 1.8.15 with mysql realtime peers,
>> >> > qualify=3000, and about 2000 peers.
>> >> >
>> >> > In the event of network degradation (high packet loss, network down)
>> >> > the system gets into an unusable state.
>> >> >
>> >> > Let's say you have 2000 peers with qualify=yes (or 2000), and they
>> >> > are all reachable.
>> >> >
>> >> > Qualify frequency is 60 seconds (user definable), qualify
>> >> > unreachable is 10 seconds, and default retransmit timer of 1 second (both
>> >> > hardcoded in sip.h)
>> >> >
>> >> > 1000 peers go offline due to a network event
>> >> >
>> >> > The database has to be updated 1000 times in 60 seconds to mark the
>> >> > lastms -1 for these 1000 peers
>> >> >
>> >> > The same 1000 peers are now on an OPTIONS query schedule of once per
>> >> > 10 seconds, with 6 retransmits at 1 second intervals, total of 7 packets
>> >> > every 10 seconds for 1000 peers, or 700 new packets per second.
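
The arithmetic above can be checked with a small Python sketch. The function
name and parameters are made up for illustration; the figures are the ones
quoted in this message.

```python
def options_packets_per_second(peers, cycle_seconds, retransmits):
    """Average OPTIONS packets per second sent to unanswering peers.

    Each qualify cycle sends one initial OPTIONS plus `retransmits`
    retransmissions to every peer that never answers.
    """
    packets_per_cycle = 1 + retransmits
    return peers * packets_per_cycle / cycle_seconds


if __name__ == "__main__":
    # 1000 unreachable peers, re-qualified every 10 seconds,
    # 6 retransmits each: 7000 packets per 10 seconds.
    print(options_packets_per_second(1000, 10, 6))  # 700.0
```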
>> >> >
>> >> >
>> >> >
>> >> > So, we experienced a major network issue, and our response is to
>> >> > increase the load on the asterisk server dramatically by updating the
>> >> > database 1000 times and starting a new campaign to reach unreachable peers
>> >> > that are offline.
>> >> >
>> >> > In practice we see that when this happens, the calculated RTT time
>> >> > for peers that are not part of the network outage starts to increase (delay
>> >> > somewhere in hardware, code, who knows, but it does increase), and many of
>> >> > them start flapping between unreachable and reachable.
>> >> >
>> >> > The database query load goes through the roof, the number of packets
>> >> > coming out of the asterisk box goes through the roof, and asterisk will not
>> >> > recover until it is restarted, even after the network event is cleared.
>> >> >
>> >> > This is a new issue that came into play when lastms was added to the
>> >> > realtime database and the qualify code started updating it on every state
>> >> > change. 1.2 before lastms would handle this event gracefully, 1.8 won't,
>> >> > can't comment on 1.2 or 1.6.
>> >> >
>> >> > My thinking is that the lastms value in the db has little to no
>> >> > value. The only time I see it used is when a realtime peer is built from
>> >> > the database on registration. If the lastms value is -1 the peer registers
>> >> > and is set to unreachable, gets an OPTIONS query and becomes reachable right
>> >> > away.
>> >> >
>> >> > rtupdate=no will stop the lastms updates, but it also stops the
>> >> > registration data from being updated in the database, which has unintended
>> >> > consequences.
>> >> >
>> >> > Also, the 10 second qualify timer when unreachable might make sense
>> >> > in small environments, but is way too aggressive for larger environments.
>> >> > It needs to be user configurable.
>> >> >
>> >> > Before I start patching, can anyone tell me what value lastms has
>> >> > and why it needs to be in the database?
>> >> >
>> >> >
>> >>
>> >> I disagree that lastms has no value. How can I know if a peer is
>> >> reachable without the lastms field?
>> >>
>> >> Leandro
>> >>
>> >> --
>> >> _____________________________________________________________________
>> >> -- Bandwidth and Colocation Provided by http://www.api-digital.com --
>> >>
>> >> asterisk-dev mailing list
>> >> To UNSUBSCRIBE or update options visit:
>> >>   http://lists.digium.com/mailman/listinfo/asterisk-dev
>> >
>> >
>>
>>
>
>
> ---
> * Olle E Johansson - oej at edvina.net
> * Cell phone +46 70 593 68 51, Office +46 8 96 40 20, Sweden
>