[asterisk-dev] scalability issue with realtime lastms update on qualify state change

Mon Nov 18 09:51:53 CST 2013

On Mon, Nov 18, 2013 at 9:15 AM, Damon Estep <damon at soho-systems.com> wrote:
>>> >> If you really are averse to the effect that saving lastms in the
>>> >> database causes, why don't you just remove that column from your
>>> >> realtime tables?  We've had this ability to remove columns that you
>>> >> prefer not to save in dynamic realtime since 1.6.2, and it sounds
>>> >> like this is, in effect, precisely what you'd prefer.
>>> >>
>>> >
>>> > Can you help me understand what removing the column would do? I did
>>> > not see anything in the code that would stop the database update
>>> > attempt if the column did not exist. Would it not try, or would it
>>> > just fail gracefully?
>>>
>>> The meta-code would simply drop that column from the UPDATE, if it
>>> doesn't exist in the target table.
>>
>> Are we talking about something implemented in realtime_odbc, or also in
>> realtime_mysql?
>> And if it is the only column in the update, as in the case of a peer state
>> change update?
>> The generated update query on peer state change is 'update sipusers set
>> lastms = [value] where name = [peername]'
>
> It's implemented in both the ODBC and the MySQL realtime drivers.  In
> both cases, however, the first key/value pair should always exist and
> if it doesn't, both will fail:  the MySQL driver will emit an ERROR,
> and the ODBC driver will attempt to execute what will be invalid SQL.
> This is clearly a bug in the ODBC driver.  This situation probably
> could be fixed in the realtime drivers to simply short-circuit and
> cancel out the operation when no columns can be updated.
>
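For illustration, the column-dropping and short-circuit behavior described above could look roughly like the sketch below. It is written in Python for brevity, not in Asterisk's C, and it is not the code of either realtime driver: the function name build_update, its parameters, and the "?" placeholder style are all assumptions made for this example.

```python
def build_update(table, table_columns, key_column, key_value, fields):
    """Return (sql, params) for an UPDATE, or None when nothing can be updated.

    Hypothetical helper: `table_columns` is the set of column names known to
    exist in the target table, `fields` is a list of (column, value) pairs.
    """
    # Keep only the pairs whose column actually exists in the target table.
    usable = [(col, val) for col, val in fields if col in table_columns]

    if not usable:
        # Short-circuit: no column survives, so no query should be issued.
        return None

    assignments = ", ".join(f"{col} = ?" for col, _ in usable)
    sql = f"UPDATE {table} SET {assignments} WHERE {key_column} = ?"
    params = [val for _, val in usable] + [key_value]
    return sql, params
```

With a sipusers table that has no lastms column, a peer state change that only carries lastms would return None here, so the caller would skip the database round trip instead of emitting an error or invalid SQL.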
>
>> BTW, thanks for taking the time to discuss this.
>
> I still think the better option would be to control the rate at which
> probes for peers are sent out, so their responses can be received and
> processed on time.  Either of these present solutions is merely
> coding to the symptom, rather than fixing this underlying
> problem.  However, moving the probes from the individual peer threads
> to a single background thread is not an easy or simple change, which
> is probably why you haven't attempted it.
>
>
>
> I agree with Tilghman. The correct way of handling this would be to either
> offload the database queries to a separate thread, or multi-thread the
> entire system such that a blocking call to the database does not impact
> other request handling. This is exceptionally non-trivial to accomplish in
> chan_sip, as it has no concept of asynchronous callbacks and resuming
> operations.

That's not exactly what I meant, but sure, that's another approach.  I
was talking about limiting the rate at which we probe devices for
reachability, such that their responses don't all come in at once,
creating a huge backlog.
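
As a rough sketch of that idea, the first probe for each peer can be spread evenly across one qualify interval, so that the replies arrive spaced out rather than in a burst. This is Python for brevity and is not how chan_sip schedules its probes; the function name stagger_offsets and both of its parameters are hypothetical.

```python
def stagger_offsets(peer_names, qualify_interval_ms):
    """Map each peer to a start delay (ms) within one qualify interval.

    Hypothetical helper: peers are spaced evenly, so with N peers one probe
    goes out every qualify_interval_ms / N milliseconds.
    """
    count = len(peer_names)
    if count == 0:
        return {}

    step = qualify_interval_ms / count
    # Peer i sends its first probe i * step milliseconds into the interval.
    return {name: int(i * step) for i, name in enumerate(peer_names)}
```

If every peer then repeats its probe once per interval from its own offset, the spacing is preserved, and the number of responses waiting to be processed at any instant stays small.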

> This is a large part of why we went with a new architecture in the PJSIP
> stack, which doesn't suffer from this particular limitation (a thread pool
> is used for request/response handling, so responses that take significant
> time to process do not impact the handling of other responses). I understand
> that waiting for Asterisk 12/13 may not be an option and may require you to
> attempt to solve this in chan_sip; however, I have a feeling that such an
> effort would be an exceptionally difficult project.
>
> Matt
>
>
> While I agree from a long term perspective, I maintain the position that the
> value of having the peer state in the RT database is limited to a minority
> of users. With that being said, a simple switch to turn off RT database
> updates on qualify state change, without affecting default
> behavior, is a good solution for 1.8. An imperfect solution is better than
> no solution and nobody is hurt if the default behavior does not change.
>
> There are a couple factors that keep me from tackling the project of
> threading the updates, one of which is that I don't know the code well
> enough and I have limited programming experience, and the other is that I
> don't rely on the lastms data in any way, so it would be a lot of work for
> zero benefit. A selfish stance I know, but I still do not understand how
> realtime lastms data is used in other users' solutions. I only heard from
> one person that uses it to make routing decisions [Leandro Dardini].

To be fair, we've only heard from one person (you) for whom these
updates are a problem.  Given that we've done our best to confine only
C-level developers to this list and keep users of Asterisk on the
-users list, this isn't the best place to be doing surveys of the
value of particular updates.  Of course, given how high-volume the
-users list has been, even that list may not be the best place to do
such a survey, as busy people tend to unsubscribe.

That said, adding yet another "knob" to tweak only solves the problem
for you, and it ensures that the next time somebody has this problem,
they're going to have to do all the same troubleshooting you've done,
and additionally a little research to find this knob to tweak, which
may not even be in their interests to turn off.  If we address the
problem the "right" way _now_, then there's no knob to tweak, and the
problem simply doesn't occur in the future.  I consider that to be
well worth the extra pain.

-Tilghman