[asterisk-dev] scalability issue with realtime lastms update on qualify state change

Damon Estep damon at soho-systems.com
Sun Nov 17 15:31:04 CST 2013


> >
> >> If you really are averse to the effect that saving lastms in the
> >> database causes, why don't you just remove that column from your
> >> realtime tables?  We've had this ability to remove columns that you
> >> prefer not to save in dynamic realtime since 1.6.2, and it sounds
> >> like this is, in effect, precisely what you'd prefer.
> >>
> >> -Tilghman
> >>
> >
> > Can you help me understand what removing the column would do? I did not
> > see anything in the code that would stop the database update attempt if the
> > column did not exist. Would it not try, or would it just fail gracefully?
> 
> The meta-code would simply drop that column from the UPDATE, if it doesn't
> exist in the target table.

Are we talking about something implemented in realtime_odbc, or also in realtime_mysql?
And what happens if it is the only column in the update, as in the case of a peer state change update?
The generated update query on peer state change is: 'update sipusers set lastms = [value] where name = [peername]'
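
To make the question concrete, here is a minimal sketch of an UPDATE builder that drops fields the table does not have. It is not the realtime_odbc or realtime_mysql code, and nothing in this thread confirms how either driver handles the case; build_update() and column_exists() are hypothetical names, and the choice to emit no query at all when every field is dropped is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if col is one of the table's columns. */
static int column_exists(const char *const *cols, size_t ncols, const char *col)
{
	for (size_t i = 0; i < ncols; i++) {
		if (strcmp(cols[i], col) == 0) {
			return 1;
		}
	}
	return 0;
}

/*
 * Hypothetical sketch, NOT the real realtime driver code.
 * Writes "UPDATE <table> SET f = 'v', ... WHERE <key> = '<val>'" into buf,
 * skipping every field that is not a column of the table.
 * Returns the number of fields kept. A return of 0 means no field survived,
 * buf is left empty, and the caller has nothing to send to the database.
 * Returns -1 if buf is too small. Values are not escaped (illustration only).
 */
static int build_update(char *buf, size_t len, const char *table,
	const char *keyfield, const char *keyval,
	const char *const *fields, const char *const *values, size_t nfields,
	const char *const *cols, size_t ncols)
{
	size_t used;
	int kept = 0;
	int first = 1;
	int n;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	/* First pass: count the fields that exist in the table. */
	for (size_t i = 0; i < nfields; i++) {
		if (column_exists(cols, ncols, fields[i])) {
			kept++;
		}
	}
	if (kept == 0) {
		return 0; /* an empty SET list would not be valid SQL */
	}

	n = snprintf(buf, len, "UPDATE %s SET ", table);
	if (n < 0 || (size_t) n >= len) {
		buf[0] = '\0';
		return -1;
	}
	used = (size_t) n;

	/* Second pass: append only the surviving fields. */
	for (size_t i = 0; i < nfields; i++) {
		if (!column_exists(cols, ncols, fields[i])) {
			continue;
		}
		n = snprintf(buf + used, len - used, "%s%s = '%s'",
			first ? "" : ", ", fields[i], values[i]);
		if (n < 0 || (size_t) n >= len - used) {
			buf[0] = '\0';
			return -1;
		}
		used += (size_t) n;
		first = 0;
	}

	n = snprintf(buf + used, len - used, " WHERE %s = '%s'", keyfield, keyval);
	if (n < 0 || (size_t) n >= len - used) {
		buf[0] = '\0';
		return -1;
	}
	return kept;
}
```

In this sketch a lastms-only update against a table without a lastms column returns 0 and produces no statement. Whether the real drivers behave this way, send a malformed query, or fail in some other manner is exactly what is being asked.
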

> 
> > The problem is not having the data in the database, the problem is the
> > OPTIONS packet processing latency as the result of having a lot of peers
> > changing state and TRYING to update the database for every one of them.
> 
> You're testing code that doesn't update that value in the database, thinking it
> will solve your problem.  If that doesn't solve your problem, then neither will
> this.

Not really. I am testing code that skips the ast_update_realtime call altogether on a peer state change, depending on a new configuration variable. This eliminates a database connection and query on every peer state change and dramatically reduces the processing latency of SIP OPTIONS packets, by as much as 3800ms with 400 peers flapping. It also omits the lastms field from the register and register-expire realtime updates as a matter of housekeeping, but that has no impact here. It is skipping the database call on qualify state change that reduces the processing latency.

It is a vicious cycle: the database updates cause the latency, and then the latency causes more database updates. Once you trigger the event it never heals; you have to reload the module.

Here is a code snippet from my patch; it is in handle_response_peerpoke and sip_poke_noanswer:

		if (sip_cfg.peer_rtupdate && sip_cfg.peer_rtlastms) {
			ast_update_realtime(ast_check_realtime("sipregs") ? "sipregs" : "sippeers", "name", peer->name, "lastms", str_lastms, SENTINEL);
		}
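
The effect of that guard can be shown with a small stand-alone sketch. It does not use the Asterisk sources: struct sip_settings, fake_update_lastms() and on_qualify_state_change() are hypothetical stand-ins, and the stub only counts how many database round trips would have been made.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the two settings tested in the patch (types are assumed). */
struct sip_settings {
	int peer_rtupdate;
	int peer_rtlastms;
};

/* Counts simulated database round trips. */
static int db_updates;

/* Stub standing in for ast_update_realtime(); it only counts calls. */
static void fake_update_lastms(const char *peername, int lastms)
{
	(void) peername;
	(void) lastms;
	db_updates++;
}

/*
 * Hypothetical handler for one qualify state change.
 * The database is touched only when both settings are enabled,
 * mirroring the condition in the snippet above.
 */
static void on_qualify_state_change(const struct sip_settings *cfg,
	const char *peername, int lastms)
{
	assert(cfg != NULL);
	if (cfg->peer_rtupdate && cfg->peer_rtlastms) {
		fake_update_lastms(peername, lastms);
	}
}
```

With peer_rtlastms set to 0, 400 state changes produce zero calls to the stub. With it set to 1, the same 400 state changes produce 400 calls, one per flapping peer.
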

I also changed DEFAULT_FREQ_NOTOK from a hard-coded global value of 10s to a setting called qualifynotok that can be defined at both the general and the peer level, so the user can set the qualify frequency for offline peers. 10s is too aggressive for a heavily loaded system. This can be used to control the flood of OPTIONS packets during a network interruption.
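
As a rough illustration of that setting, the sketch below resolves an effective interval (peer value first, then the general value, then the old 10s default) and estimates the resulting OPTIONS rate. The struct and function names are hypothetical, the "0 means unset" convention is an assumption, and the real patch may store and look up the value differently.

```c
#include <assert.h>
#include <stddef.h>

/* The previous fixed interval for unreachable peers, in seconds. */
#define DEFAULT_FREQ_NOTOK_S 10

/* Hypothetical config holders; 0 means "not set" in this sketch. */
struct general_cfg {
	int qualifynotok_s;
};

struct peer_cfg {
	int qualifynotok_s;
};

/* Peer-level value wins, then the general value, then the old default. */
static int effective_qualifynotok(const struct general_cfg *g,
	const struct peer_cfg *p)
{
	assert(g != NULL && p != NULL);
	if (p->qualifynotok_s > 0) {
		return p->qualifynotok_s;
	}
	if (g->qualifynotok_s > 0) {
		return g->qualifynotok_s;
	}
	return DEFAULT_FREQ_NOTOK_S;
}

/* Average OPTIONS packets per second sent to offline_peers unreachable peers. */
static double options_per_second(int offline_peers, int interval_s)
{
	assert(interval_s > 0);
	return (double) offline_peers / (double) interval_s;
}
```

With 400 unreachable peers, the 10s default works out to 40 OPTIONS packets per second; raising qualifynotok to 60s brings that down to roughly 6.7 per second.
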

I believe it is the combination of the two that triggers the event: too many OPTIONS packets being sent, and response packets being processed too slowly because of the latency of writing to the database.

> 
> > If removing the column will stop the attempt to update the database that
> > would work, but if it just causes it to fail it will not.
> 
> It will not cause the update to fail; the column will be dropped from the query.

Again, what happens when you drop the only column in the update?
> 
> > I also recall something in the code that indicates lastms is a REQUIRED
> > column, but maybe I misinterpreted it. I guess I need to look at the
> > ast_check_realtime code.
> >
> > ast_realtime_require_field(ast_check_realtime("sipregs") ? "sipregs" : "sippeers",
> >                 "name", RQ_CHAR, 10,
> >                 "ipaddr", RQ_CHAR, INET6_ADDRSTRLEN - 1,
> >                 "port", RQ_UINTEGER2, 5,
> >                 "regseconds", RQ_INTEGER4, 11,
> >                 "defaultuser", RQ_CHAR, 10,
> >                 "fullcontact", RQ_CHAR, 35,
> >                 "regserver", RQ_CHAR, 20,
> >                 "useragent", RQ_CHAR, 20,
> >                 "lastms", RQ_INTEGER4, 11,
> >                 SENTINEL);
> 
> That code doesn't state that the column is required; it merely emits a warning on
> reload if the column doesn't exist or is not of sufficient size or type to store the
> range of values required for the field to function properly.  Yes, I suppose if you
> don't like the warning going to your logs, that would be terrible, but other than
> the field automatically getting the default on load, it doesn't otherwise affect
> the function.
> 
> The purpose of the code was to stop failing when we added additional columns
> to realtime in the future and merely to act in some default way, but providing
> warnings to the administrator to add those columns.
>  However, if you're comfortable with a single warning message related to that
> column at reload time, then it's also possible to *drop* a column and not have
> the system immediately stop functioning.

Makes sense, the warning at module load wouldn't bother me.

BTW, thanks for taking the time to discuss this.
> 
Damon


