[asterisk-dev] scalability issue with realtime lastms update on qualify state change

Fri Nov 15 16:46:34 CST 2013

My systems are far less populated, so I am probably far from hitting your
problems, but even after rereading your case I can't believe Asterisk is
thrashing that way. 1000 queries in 60 seconds is nothing compared with
the load I have on the database, even with far fewer peers. I have had
MySQL databases (not VoIP related) running flawlessly with 300 queries
every second...
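
To put the figures from this thread side by side, here is a rough sketch
that only redoes the arithmetic already quoted below (1000 lastms updates
per 60 seconds, 7 OPTIONS packets per unreachable peer every 10 seconds,
and the 300 queries per second mentioned above). The helper name
per_second is made up for illustration, and the sketch says nothing about
what Asterisk does internally.

```python
def per_second(events: int, window_seconds: float) -> float:
    """Average event rate over a time window."""
    return events / window_seconds


# 1000 lastms updates spread over one 60-second qualify cycle.
lastms_updates_per_s = per_second(1000, 60)

# 1000 unreachable peers, each sent 7 OPTIONS packets
# (1 original + 6 retransmits) every 10 seconds.
options_packets_per_s = per_second(1000 * 7, 10)

# The 300 queries/s cited above as a load MySQL handled without trouble.
reference_qps = 300

print(f"lastms updates:  {lastms_updates_per_s:.1f} per second")
print(f"OPTIONS packets: {options_packets_per_s:.0f} per second")
print(f"updates as share of reference load: "
      f"{lastms_updates_per_s / reference_qps:.1%}")
```

On these numbers the database write rate is small next to the packet
rate, which is why a test that separates the two would be informative.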

Are you sure the thrashing of your Asterisk is not due to something else? I
have experienced problems with Asterisk (not realtime) when the DNS servers
were not reachable due to a network outage... It would be nice to run a
test: firewall port 5060 inbound, so Asterisk thinks all peers are
unreachable... I suspect it will not thrash...

Which table type are you using? Are you still using MyISAM? Many concurrent
writes to a MyISAM table can cause serious slowdowns, because MyISAM locks
the whole table for each write... just move to InnoDB.
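
As a minimal sketch of that check and conversion, the snippet below only
builds the two SQL statements involved. The table name "sippeers" and the
schema name "asterisk" are assumptions, not taken from this thread, and
the function names are hypothetical; run the statements with whatever
MySQL client you already use.

```python
def engine_check_sql() -> str:
    """Parameterised query that reports a table's storage engine."""
    return ("SELECT ENGINE FROM information_schema.TABLES "
            "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s")


def convert_to_innodb_sql(table: str) -> str:
    """Build the statement that rebuilds a table as InnoDB."""
    # Identifiers cannot be bound as parameters, so accept only
    # plain names made of letters, digits and underscores.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return f"ALTER TABLE `{table}` ENGINE=InnoDB"


# Assumed names, adjust to your realtime configuration.
schema, table = "asterisk", "sippeers"
print(engine_check_sql(), (schema, table))
print(convert_to_innodb_sql(table))
```

Converting rewrites the whole table, so it is best done on a quiet
system.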

Leandro

2013/11/15 Damon Estep <damon at soho-systems.com>

> This is helpful, so I understand how some people are using lastms in the
> database.
>
>
>
> I also have a multiserver solution, and I use INVITE (Dial) to see if the
> peer is registered. If the user is not reachable the Dial will return a
> status from the other box that can be handled.
>
>
>
> My experience is that a multi-server solution can't scale past limits in
> the range listed below the way it works now unless you never experience a
> network issue that makes 25% or more of the peers go unreachable:
>
>
>
> 10 servers, 2000 qualified peers per server
>
> 120 second qualify interval
>
> qualify=3000
>
> default retransmit time of 1 second
>
> Database load split between two reasonably powerful MySQL servers with
> circular replication enabled
>
>
>
> At this scale, as soon as you have an event that takes down a large
> percentage of peers the database activity will back up, the qualify OPTIONS
> packets will not be processed in real time, and the system will become
> unstable, with all peers flapping between reachable/unreachable, even after
> the network event is gone. It simply won't recover until you restart
> asterisk.
>
>
>
> I am sure the limits vary with environment, but there is a limit with the
> current strategy, and it is caused by excessive database updates of lastms
> during periods with high unreachable peer counts.
>
>
>
> Have you seen this yet? What is the scale of your deployment?
>
>
>
>
>
> *From:* asterisk-dev-bounces at lists.digium.com [mailto:
> asterisk-dev-bounces at lists.digium.com] *On Behalf Of *Leandro Dardini
> *Sent:* Friday, November 15, 2013 2:26 PM
> *To:* Asterisk Developers Mailing List
> *Subject:* Re: [asterisk-dev] scalability issue with realtime lastms
> update on qualify state change
>
>
>
>
> On 15 Nov 2013 22:00, "Damon Estep" <damon at soho-systems.com> wrote:
> >
> > peer_lastms in memory tells you that, not lastms in the realtime
> database.
> >
> > As far as I can see the lastms value in the realtime database is loaded
> only when the peer is created and quickly overwritten as soon as the first
> qualify is sent.
>
> I have a realtime multiserver solution, so I need to rely on the database
> lastms; otherwise how can I know whether a peer usually registered on the
> other server is reachable or not?
>
> Not only that... on the web interface, the lastms value looks really nice...
>
> >
> > On Nov 15, 2013, at 1:29 PM, "Leandro Dardini" <ldardini at gmail.com>
> wrote:
> >
> >>
> >> On 15 Nov 2013 21:11, "Damon Estep" <damon at soho-systems.com> wrote:
> >> >
> >> > I have run into an issue with 1.8.15 with mysql realtime peers,
> qualify=3000, and about 2000 peers.
> >> >
> >> >
> >> >
> >> > In the event of a network degradation event (high packet loss,
> network down) the system gets into an unusable state.
> >> >
> >> >
> >> >
> >> > let's say you have 2000 peers with qualify=yes (or 2000), and they
> are all reachable.
> >> >
> >> >
> >> >
> >> > Qualify frequency is 60 seconds (user definable), qualify unreachable
> is 10 seconds, and default retransmit timer of 1 second (both hardcoded in
> sip.h)
> >> >
> >> >
> >> >
> >> > 1000 peers go offline due to a network event
> >> >
> >> >
> >> >
> >> > The database has to be updated 1000 times in 60 seconds to mark the
> lastms -1 for these 1000 peers
> >> >
> >> >
> >> >
> >> > The same 1000 peers are now on an OPTIONS query schedule of once per
> 10 seconds, with 6 retransmits at 1 second intervals, a total of 7 packets
> every 10 seconds for 1000 peers, or 700 new packets per second.
> >> >
> >> >
> >> >
> >> > So, we experienced a major network issue, and our response is to
> increase the load on the asterisk server dramatically by updating the
> database 1000 times and starting a new campaign to reach unreachable peers
> that are offline.
> >> >
> >> >
> >> >
> >> > In practice we see that when this happens, the calculated RTT time
> for peers that are not part of the network outage starts to increase (delay
> somewhere in hardware, code, who knows, but it does increase), and many of
> them start flapping between unreachable and reachable.
> >> >
> >> >
> >> >
> >> > The database query load goes through the roof, the number of packets
> coming out of the asterisk box goes through the roof, and asterisk will not
> recover until it is restarted, even after the network event is cleared.
> >> >
> >> >
> >> >
> >> > This is a new issue that came into play when lastms was added to the
> realtime database and the qualify code started updating it on every state
> change. 1.2 before lastms would handle this event gracefully, 1.8 won't,
> can't comment on 1.2 or 1.6.
> >> >
> >> >
> >> >
> >> > My thinking is that the lastms value in the db has little to no
> value. The only time I see it used is when a realtime peer is built from
> the database on registration. If the lastms value is -1 the peer registers
> and is set to unreachable, gets an option query and becomes reachable right
> away.
> >> >
> >> >
> >> >
> >> > rtupdate=no will stop the lastms updates, but it also stops the
> registration data from being updated in the database, which has unintended
> consequences.
> >> >
> >> >
> >> >
> >> > Also, the 10 second qualify timer when unreachable might make sense
> in small environments, but is way too aggressive for larger environments.
> It needs to be user configurable.
> >> >
> >> >
> >> >
> >> > Before I start patching, can anyone tell me what value lastms has and
> why it needs to be in the database?
> >> >
> >> >
> >>
> >> I disagree that lastms has no value. How can I know if a peer is
> reachable without the lastms field?
> >>
> >> Leandro
> >>
> >> --
> >> _____________________________________________________________________
> >> -- Bandwidth and Colocation Provided by http://www.api-digital.com --
> >>
> >> asterisk-dev mailing list
> >> To UNSUBSCRIBE or update options visit:
> >>   http://lists.digium.com/mailman/listinfo/asterisk-dev